Apply Existing SQL Expertise to Capitalize on Hadoop

When it comes to SQL-on-Hadoop initiatives, architecture matters for enterprise data warehousing

Product Marketing Manager, IBM

While Apache Hive was the only game in town just a few years ago for running SQL on Apache Hadoop, today there is a range of solutions. Tracking these efforts and sorting out competing performance and functionality claims can be a full-time job. There are different approaches to SQL-on-Hadoop for organizations to consider, and IBM now offers its IBM® Big SQL technology as a feature of its IBM InfoSphere® BigInsights™ large-volume data management platform.

Hadoop is well known for its capabilities in processing unstructured data, but many organizations are prioritizing projects that bring transactional data and log file data into Hadoop. These data types map naturally to tabular formats and are therefore accessible using SQL. While skills in Hadoop scripting languages and MapReduce programming may be in short supply, SQL expertise is relatively abundant in IT departments. Also, many organizations have made substantial commitments to SQL-compatible tool deployments. Naturally, many of these organizations would like to take advantage of the flexibility and cost-efficiencies Hadoop can offer, but they want to do so using the tools with which they are already familiar.
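As a hedged sketch of the idea, Hive-style SQL can expose delimited log files already sitting in HDFS as an ordinary table; the table name, columns, and path below are invented for illustration:

```sql
-- Illustrative Hive-style DDL (table name, columns, and HDFS path are
-- hypothetical): expose tab-delimited web-log files as a queryable table.
CREATE EXTERNAL TABLE web_logs (
  log_time  TIMESTAMP,
  client_ip STRING,
  url       STRING,
  status    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/web';

-- Familiar SQL then works directly against the raw files:
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
```

No data is moved or converted here; the DDL simply registers metadata over files already in HDFS, which is part of what makes existing SQL skills transfer so readily to log and transactional data in Hadoop.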

Understanding the approaches

American National Standards Institute (ANSI) SQL is a standard, but there are at least a dozen distinct SQL-on-Hadoop initiatives underway. Many of these efforts fit into three broad categories: Hadoop as an extension of relational databases, comprehensive relational database management system (RDBMS) implementations on Hadoop, and SQL-over-native-Hadoop data sources.

Hadoop as a relational database extension

RDBMS and data warehouse vendors with mature offerings typically take this first approach. The RDBMS acts as a front end for the Hadoop cluster: end users query the database, and the database in turn sends requests to an associated Hadoop cluster. This approach offers excellent compliance with SQL standards, but performance can vary when querying Hadoop sources because of the multi-hop nature of the architecture. Also, implementations tend to be vendor-specific, with each vendor optimizing solutions for its own database environment. Offerings in this category include the Microsoft Analytics Platform System, which combines SQL Server Parallel Data Warehouse (PDW) and HDInsight; Oracle Big Data SQL; and Teradata SQL-H.

Comprehensive RDBMS implementations on Hadoop

A second approach involves native database implementations designed to run on a Hadoop cluster. Here, vendors provide not only their own query engines but their own storage management layers as well. The benefit of this approach is that the SQL-on-Hadoop implementation can stand alone without requiring an associated RDBMS. Fans of open source software may not like that vendors often introduce proprietary data formats and metastores rather than using the established data formats and metadata standards on the Hadoop Distributed File System (HDFS) shared by tools such as Hive and Apache HBase. Implementations in this category include Actian Hadoop SQL Edition and Pivotal HAWQ.

SQL-over-native-Hadoop data sources

A third approach involves massively parallel processing (MPP) SQL query engines that run natively over existing Hadoop data formats and metadata sources such as the Hive metastore. The advantage of this architecture is that it provides compatibility at the Hadoop file system level and leverages existing open standards. Like the second approach, it does not tie an organization to a particular RDBMS or data warehouse technology. It also has the additional advantage of enabling the SQL-on-Hadoop implementation to work side by side with other Hadoop tools on data already in HDFS. Examples of SQL-on-Hadoop implementations that fit this model include Big SQL in InfoSphere BigInsights, the Impala query engine integrated in the Cloudera enterprise data hub, and the Stinger initiative from Hortonworks.
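To make the side-by-side point concrete (all names here are hypothetical), a table defined once through the shared Hive metastore can be read by any engine in this category without moving the data:

```sql
-- Hypothetical example: clickstream is an existing Hive table whose files
-- live in HDFS. An MPP engine such as Impala or Big SQL reads the same
-- metastore definition, so this query runs with no export or reload,
-- while MapReduce or Pig jobs can still process the identical files.
SELECT url, COUNT(*) AS error_hits
FROM clickstream
WHERE status = 404
GROUP BY url
ORDER BY error_hits DESC
LIMIT 10;
```

Because the table definition and the data both stay in place, the SQL engine is just one more consumer of shared HDFS files rather than the owner of a private copy.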

Focusing on capabilities that matter

The good news for organizations is that there is a variety of technologies available that take advantage of the Hadoop framework. The IBM point of view is that the requirements organizations view as essential in mature relational databases will be increasingly important in SQL-on-Hadoop implementations as well. Among these capabilities are:

  • Compatibility: Organizations generally want their SQL-on-Hadoop implementation to support the same SQL language features as their chosen RDBMS and warehouse platform, including features such as subqueries and analytic functions that help ensure portability and compatibility.
  • Performance: The days of manually optimizing query plans are coming to an end. Hadoop implementations that fail to provide sophisticated query optimization and memory management are unlikely to keep pace in the market.
  • Security: To be practical, security needs to be more than an afterthought. As deployments move into production, support for fine-grained, row- and column-based access controls and features such as data masking becomes essential.
  • Federation: Organizations want to be able to issue queries that access data sets across multiple data sources (both Hadoop and non-Hadoop) using standard, ANSI-compliant SQL, and the query engine needs to generate the appropriate query plans automatically, based on the nature of the queries and the underlying data sources.
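A federated query of the kind the last point describes might look like the following hedged sketch, in which all schema and table names are invented: one ANSI SQL statement joins a fact table stored in Hadoop with a reference table held in a relational warehouse, and the engine plans the cross-source access automatically.

```sql
-- Hypothetical federated query (schema and table names are illustrative).
-- sales lives on HDFS; customers lives in a relational warehouse.
SELECT c.region,
       SUM(s.amount) AS total_sales
FROM   hdfs_schema.sales      AS s
JOIN   warehouse_db.customers AS c
       ON s.customer_id = c.customer_id
GROUP  BY c.region;
```

The value for the user is that nothing in the statement reveals where each table physically resides; placement decisions stay with the optimizer rather than the query author.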

Charting a course

Big SQL is a key technology feature in InfoSphere BigInsights that represents IBM’s commitment to the emerging role of SQL-on-Hadoop solutions. By leveraging its extensive database engineering experience, IBM is charting a course different from other established RDBMS and data warehouse offerings through a standards-based, high-performance SQL-on-Hadoop approach. Big SQL runs natively on Hadoop while preserving existing database and data warehouse environments.

Please share any thoughts or questions in the comments.