SQL-on-Hadoop without compromise

IBM Software Group Thought Leadership White Paper

How Big SQL 3.0 from IBM represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions

By Scott C. Gray, Fatma Ozcan, Hebert Pereyra, Bert van der Linden and Adriana Zubiri, April 2014

Introduction

When considering SQL-on-Hadoop, the most fundamental question is: What is the right tool for the job? For interactive queries that require a few seconds (or even milliseconds) of response time, MapReduce (MR) is the wrong choice. On the other hand, for queries that require massive scale and runtime fault tolerance, an MR framework works well. MR was built for large-scale processing on big data, viewed mostly as “batch” processing.

As enterprises start using Apache Hadoop as a central data repository for all data (originating from sources as varied as operational systems, sensors, smart devices, metadata and internal applications), SQL processing becomes an optimal choice. A fundamental reason is that most enterprise data management and analytical tools rely on SQL. As a tool for interactive query execution, SQL processing (of relational data) benefits from decades of research, usage experience and optimizations. Clearly, the SQL skills pool far exceeds that of MR developers and data scientists. As a general-purpose processing framework, MR may still be appropriate for ad hoc analytics, but that is as far as it can go with current technology.

The first version of Big SQL from IBM (an SQL interface to IBM InfoSphere® BigInsights™ software, which is a Hadoop-based platform) took an SQL query sent to Hadoop and decomposed it into a series of MR jobs to be processed by the cluster. For smaller, interactive queries, a built-in optimizer rewrote the query as a local job to help minimize latencies. Big SQL benefited from Hadoop’s dynamic scheduling and fault tolerance. Big SQL supported the ANSI 2011 SQL standard and introduced Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) client drivers.

Big SQL 3.0 from IBM represents an important leap forward. It replaces MR with a massively parallel processing (MPP) SQL engine. The MPP engine deploys directly on the physical Hadoop Distributed File System (HDFS) cluster. A fundamental difference from other MPP offerings on Hadoop is that this engine actually pushes processing down to the same nodes that hold the data. Because it natively operates in a shared-nothing environment, it does not suffer from limitations common to shared-disk architectures (e.g., poor scalability and network overhead caused by the need to move “shared” data around).

Big SQL 3.0 introduces a “beyond MR” low-latency parallel execution infrastructure that is able to access Hadoop data natively for reading and writing. It extends SQL:2011 language support with broad relational data type support, including support for stored procedures. Its focus on comprehensive SQL support translates into industry-leading application transparency and portability. It is designed for concurrency with automatic memory management and comes equipped with a rich set of workload management tools. Other features include scale-out parallelism to hundreds of data processing nodes and scale-up parallelism to dozens of cores. With respect to security, it introduces capabilities on par with those of traditional relational data warehouses. In addition, it can access and join with data originating from heterogeneous federated sources.

Right now, users have an excellent opportunity to jump into the world of big data and Hadoop with the introduction of Big SQL 3.0. In terms of ease of adoption and transition for existing analytic workloads, it delivers uniquely powerful capabilities.
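
To make the shared-nothing idea above more concrete, the following is a minimal Python sketch of pushing an aggregation down to the nodes that hold the data. It is an illustration only and does not represent how Big SQL 3.0 is implemented: the node names, the sample rows and the functions local_aggregate, merge_partials and run_query are all hypothetical. It mimics a query of the form SELECT region, SUM(amount) ... GROUP BY region, where each node summarizes only its own partition and a coordinator combines the small partial results.

```python
from collections import defaultdict

# Hypothetical cluster: each node stores only its own partition of a
# (region, amount) table. Names and values are made up for illustration.
NODE_PARTITIONS = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8), ("north", 1)],
}


def local_aggregate(rows):
    """Runs where the rows live: SUM(amount) GROUP BY region on one partition."""
    partial = defaultdict(int)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)


def merge_partials(partials):
    """Runs on the coordinator: combines the per-node partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] += subtotal
    return dict(total)


def run_query(partitions):
    # Only the small partial aggregates are passed to the coordinator;
    # the raw rows never leave the node that stores them.
    partials = [local_aggregate(rows) for rows in partitions.values()]
    return merge_partials(partials)


result = run_query(NODE_PARTITIONS)
print(result)
```

In a shared-disk design, by contrast, the raw rows would typically have to be read over the network before any aggregation could happen, which is the data movement the paragraph above describes.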


© Copyright IBM Corporation 2014
