
Blisteringly fast SQL access to your Hadoop data

July 28, 2014

SQL has been the de facto data query language for the past two decades, and organizations and individuals have made significant investments in this space to enable a wide range of reporting and analytical tools. Extending this capability to Hadoop preserves and extends those investments even as organizations seek to extract value from new sources of semi-structured and unstructured data. The SQL-on-Hadoop space has emerged to address this challenge, through both open source initiatives such as Apache Hive and individual company initiatives such as IBM’s Big SQL.

IBM has a rich history with SQL, and optimizing complex SQL queries has been a focus of IBM Research for the last three decades. That focus has now broadened to SQL-on-Hadoop as well, which led to the development of Big SQL.

Most SQL-on-Hadoop technologies in the marketplace today are limited in their ability to support complex nested decision-support queries and are not optimized to exploit parallel execution, which negatively impacts performance. This is because these technologies provide only basic rewrite transformations, such as selection and projection pushdown, and lack more sophisticated subquery transformations such as magic decorrelation or subquery-to-join conversion.

Big SQL, on the other hand, exploits sophisticated query rewrite transformations targeted at complex nested decision-support queries. These subquery rewrites transform a query into a form more suitable for parallel execution and are hence critical for good performance. Further, Big SQL exploits autonomic features that optimize the allocation of its various buffer pools for best memory utilization.
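To make the subquery-to-join idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema and data are illustrative (not from the Big SQL documentation): a correlated subquery is rewritten as a join against a pre-aggregated derived table, a shape far friendlier to parallel execution because each side can be scanned independently. Any optimizer performing this class of rewrite must preserve the query's results, which the sketch verifies.

```python
import sqlite3

# In-memory demo tables, loosely inspired by a decision-support schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (store_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE stores (store_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10.0), (1, 30.0), (2, 5.0), (2, 7.0), (3, 100.0)])
cur.executemany("INSERT INTO stores VALUES (?, ?)",
                [(1, "east"), (2, "west"), (3, "east")])

# Correlated subquery: stores whose total sales exceed 20.
correlated = cur.execute("""
    SELECT s.store_id FROM stores s
    WHERE (SELECT SUM(amount) FROM sales
           WHERE sales.store_id = s.store_id) > 20
    ORDER BY s.store_id""").fetchall()

# Decorrelated form: the same predicate expressed as a join against a
# pre-aggregated derived table, so each input can be scanned in parallel.
joined = cur.execute("""
    SELECT s.store_id
    FROM stores s
    JOIN (SELECT store_id, SUM(amount) AS total
          FROM sales GROUP BY store_id) t
      ON t.store_id = s.store_id
    WHERE t.total > 20
    ORDER BY s.store_id""").fetchall()

print(correlated == joined, correlated)  # True [(1,), (3,)]
```

Both forms return stores 1 and 3; the rewritten version simply gives the executor a plan it can parallelize and cost more effectively.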


Big SQL 3.0 pushes the envelope further and provides fast SQL processing over HDFS data by leveraging IBM's state-of-the-art relational database technology, without imposing any database structures or restrictions, hence keeping the flexibility of Hadoop.

Figure 1 above provides the overall system architecture. SQL worker nodes are deployed on the same HDFS cluster. A special head node contains the coordinator, which receives the SQL statements from the client applications and compiles them into distributed query execution plans.

Big SQL 3.0 also introduces a new stand-alone scheduler service which helps coordinate the distributed execution over the HDFS cluster. The scheduler assigns splits to SQL workers for processing and tries to maximize data locality while minimizing overall query execution time. Another important aspect is the dynamic nature of this split assignment. As opposed to using the query optimizer to generate a fixed data access plan at query compile-time, Big SQL 3.0 invokes the scheduler at query run-time to identify the best set of nodes to use for the query.

This dynamic behavior also enables the system to adapt to cluster changes without disrupting the query service. The scheduler only considers the nodes that are up and running for any given query. Hence failed nodes will automatically be dropped from consideration. When new nodes join the cluster, they will automatically be eligible to process queries and will be considered by the scheduler.
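The scheduling behavior described above can be sketched as a simple greedy assignment. This is a hypothetical illustration, not the actual Big SQL scheduler (whose internals are not public): each HDFS split is assigned to the least-loaded live worker, preferring workers that hold a local replica of the split's block, and failed nodes simply drop out of the candidate set.

```python
from collections import defaultdict

def assign_splits(splits, live_workers):
    """Greedily assign each HDFS split to the least-loaded live worker,
    preferring workers that hold a local replica of the split's block.
    `splits` maps a split id to the hosts holding its replicas."""
    load = defaultdict(int)   # worker -> number of splits assigned so far
    assignment = {}
    for split_id, replica_hosts in splits.items():
        # Only nodes that are up are ever considered; failed nodes
        # drop out of the candidate set with no replanning needed.
        local = [w for w in replica_hosts if w in live_workers]
        candidates = local if local else sorted(live_workers)
        worker = min(candidates, key=lambda w: load[w])
        load[worker] += 1
        assignment[split_id] = worker
    return assignment

# 3-way replicated blocks; worker "w2" has failed since compile time,
# and blk_3 has no live replica at all, so it falls back to any live node.
splits = {
    "blk_1": ["w1", "w2", "w3"],
    "blk_2": ["w2", "w3", "w4"],
    "blk_3": ["w2", "w5", "w6"],
}
live = {"w1", "w3", "w4"}
print(assign_splits(splits, live))
# {'blk_1': 'w1', 'blk_2': 'w3', 'blk_3': 'w4'}
```

Because the assignment is computed at query run time against the current set of live workers, the same code path handles node failures and newly joined nodes without invalidating the compiled plan.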

Big SQL 3.0 provides special processes, HDFS readers, which are coupled with corresponding SQL worker processes to ingest HDFS data. Big SQL 3.0 shares the same catalog and table definitions with Hive using the Hive Metastore. In addition to the tables created and loaded by Big SQL, it can also work on tables created and loaded by other programs (like Hive and Impala) and can ingest all HDFS file formats, including text, sequence, RCFile and Parquet as well as ORCFile.

In a recent laboratory-controlled benchmark derived from a 20TB TPC-DS workload (herein referred to as the Modern BI Workload), Big SQL 3.0 outperformed Hive 0.12 by a factor of 14 for a single query stream (see figure 2 below). Although Big SQL 3.0 is capable of executing all 99 queries of the Modern BI Workload unmodified, Hive 0.12 can execute only 39 of the queries, so the results are based on that common set of 39 queries.


For the single-stream test, Big SQL 3.0 can actually execute all 99 queries in the workload three times faster than Hive 0.12 can execute its 39 queries. For 10 concurrent query streams, Big SQL 3.0 completed in just over nine hours, whereas Hive 0.12 took over four days to complete (see figure 3 below).


Benchmark numbers often don’t paint the whole picture, though. It took several weeks of effort to tune the Hive environment and optimize the queries for this workload. That level of time and effort does not have to be spent optimizing Big SQL 3.0, thanks to its advanced cost-based optimizer and workload management.

With its extended SQL support and improved performance, Hive 0.13 promises another large leap. According to the paper “SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures,” Hive 0.13 is approximately 2.3 times faster than Hive 0.12. Extrapolating from that result (14 ÷ 2.3 ≈ 6), Big SQL 3.0 would still be approximately 6 times faster than Hive 0.13, while continuing to support a much richer SQL syntax.

SQL-on-Hadoop continues to be a rapidly evolving space, and IBM’s Big SQL takes a comprehensive approach with an eye on performance, federation, security and compatibility with the SQL standard, while providing access to non-traditional data to meet the strategic needs of organizations with existing investments in SQL skills and infrastructure. IBM’s Big SQL is available as part of IBM InfoSphere BigInsights, a Hadoop distribution.

Coauthor of this post is Simon Harris. Simon is a big data performance architect working in the IBM BigInsights development team. He has 20 years of experience working in information management, including MPP RDBMS, federated database technology, tooling and big data. Simon now specializes in SQL-over-Hadoop technologies.

"Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the “20TB Modern BI Workload" in a controlled laboratory environment. The 20TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 20TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 39 out of 99 queries are executed. The test measured sequential query execution of all 39 queries for which Hive syntax was publicly available as of June 2014. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).

Configuration: Cluster of 40 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs, running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of June, 2014"