Many people want SQL, the query language of the past two decades, to work on Hadoop. Derrick Harris of GigaOM outlined the approaches of 13 different vendors, with their cleverly named projects, in his Feb 2013 article. Recently, InformationWeek's special coverage series on big data included an article on SQL-on-Hadoop with updates on Teradata SQL-H and IBM Big SQL.
So, what’s all the fuss? As Harris wrote:
“SQL support isn’t the end-game for Hadoop, but it’s the feature that will help Hadoop find its way into more places in more companies that understand the importance of next-generation analytics but don’t want to (or can’t yet) re-invent the wheel by becoming MapReduce experts.”
From an analytics perspective, the fuss is mostly about the support for and performance of aggregate functions.
Let’s start with an example to define some terms. Hadoop-based data is usually source or transactional data. To transform this detailed transactional data into metrics and answer business questions, the data is summarized or aggregated. Below are some example business questions, grouped by the type of aggregation:
With these examples in mind, let’s now discuss how SQL and SQL-on-Hadoop compare.
Count and Sum
Count and Sum are the cornerstone functions for analyzing transactional data. I am singling out these two functions from the basic aggregate functions more on principle than anything else. A number of vendors providing counts and totals on transactional data claim that they deliver "analytics". Yes, counting and summing are part of analyzing any data. But do "counts and sums" = "analytics"? Not by a long shot.
Average, Maximum, Minimum, Variance and Standard Deviation are basic aggregate functions, in addition to Count and Sum. These functions are described as basic because each processes a single data element, or column, within a defined set of data and returns a single value, e.g., Average Sales or Maximum Order Amount.
To make these results useful when answering business questions, the calculations are often performed on a subset of data, with the results sorted and broken out. This processing answers questions such as "average sales by region for the period Jan. 1, 2013 to today." In conventional SQL, sorting is done with an ORDER BY clause, calculations broken out by a data item are done with a GROUP BY clause, and limiting the subset of data is done with a WHERE clause.
When discrete data elements are referenced, these clauses are viewed as extensions to basic aggregation. Hive and other SQL-on-Hadoop languages support basic aggregation and these clauses.
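As a minimal sketch of these clauses in action, the "average sales by region" question above can be expressed as a single query. This uses an in-memory SQLite database from Python as a stand-in engine (not Hive), and the table name, columns, and figures are all invented for illustration:

```python
import sqlite3

# Hypothetical sales table; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "2013-02-01", 100.0),
     ("East", "2013-03-15", 300.0),
     ("West", "2013-01-20", 400.0),
     ("West", "2012-12-31", 999.0)],  # excluded by the WHERE clause below
)

# "Average sales by region for the period Jan. 1, 2013 to today":
# WHERE limits the subset, GROUP BY breaks out the result, ORDER BY sorts it.
rows = conn.execute(
    """SELECT region, AVG(amount) AS avg_sales
       FROM sales
       WHERE sale_date >= '2013-01-01'
       GROUP BY region
       ORDER BY avg_sales DESC"""
).fetchall()
print(rows)  # [('West', 400.0), ('East', 200.0)]
```

The same SELECT statement, with dialect-level differences, runs on Hive and the other SQL-on-Hadoop engines because only basic aggregation and these three clauses are involved.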
However, when the aggregation requires understanding one result in the context of another result, more advanced techniques are required. It's at this point that SQL and SQL-on-Hadoop begin to separate.
Mileage Will Vary
Understanding trends is a significant aspect of running a business and, consequently, a core capability for analytic applications. Calculating trends with detailed, transactional data is not a trivial undertaking. For example, calculating year-over-year growth requires summing Revenue for each year and then calculating the percentage change against the prior year. This ability to reference a relative time period, in this case the “prior year,” is what separates window-based aggregation from basic aggregation.
There are two types of window-based aggregation:
- By a unit of time, such as year or month, e.g., Year-over-Year growth
- By a sliding or rolling time window, e.g., 52-week high, number of days to expiry
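To make the first type concrete, here is a hedged sketch of year-over-year growth using the LAG window function to reference the "prior year" row. It runs against an in-memory SQLite database from Python (SQLite 3.25+ is assumed for window-function support); the table and revenue figures are made up, and Hive syntax would differ where window functions are available:

```python
import sqlite3

# Hypothetical yearly revenue; the table and figures are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (year INTEGER, amount REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)",
                 [(2010, 100.0), (2011, 120.0), (2012, 150.0)])

# Year-over-year growth: LAG() fetches the prior year's revenue so the
# percentage change can be computed in a single windowed query.
rows = conn.execute(
    """SELECT year,
              amount,
              ROUND(100.0 * (amount - LAG(amount) OVER (ORDER BY year))
                    / LAG(amount) OVER (ORDER BY year), 1) AS yoy_pct
       FROM revenue
       ORDER BY year"""
).fetchall()
print(rows)
# [(2010, 100.0, None), (2011, 120.0, 20.0), (2012, 150.0, 25.0)]
```

The first year has no prior period, so its growth comes back NULL; this ability to reference a relative row via OVER (ORDER BY ...) is exactly what basic aggregation lacks.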
Due to the business need to analyze trends, business analytics software, such as IBM Cognos®, supports this functionality either by pushing the processing to the data store, when supported, or by processing locally.
Apache Hive does not support window-based aggregation, which means extending the Hive query with a user-defined function or relying on local processing by the analytic application. Other SQL-on-Hadoop implementations, such as BigInsights Big SQL or Cloudera Impala, do support window-based aggregation in varying degrees.
Business questions that assess relative performance or reference cross-functional performance indicators are often the most difficult to answer using transactional data. For example, determining the “top 3 stores, by profitability, within each region” first requires calculating the profitability of all stores within a region in order to select those with the highest profit.
Context functions, which include Rank, Quartile, Percentile, Top, Bottom, Lag, Lead and so forth, are supported by mature relational database technologies using techniques such as cursors, partitioning or multiple passes over the data to process the query.
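A minimal sketch of the "top 3 stores, by profitability, within each region" question can be written with RANK over a partition, again using an in-memory SQLite database from Python as a stand-in engine (store names, regions, and profit figures are all hypothetical):

```python
import sqlite3

# Hypothetical store profitability data; names and numbers are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (region TEXT, store TEXT, profit REAL)")
conn.executemany("INSERT INTO stores VALUES (?, ?, ?)", [
    ("East", "E1", 50.0), ("East", "E2", 80.0),
    ("East", "E3", 30.0), ("East", "E4", 90.0),
    ("West", "W1", 70.0), ("West", "W2", 60.0),
])

# "Top 3 stores, by profitability, within each region": RANK() with
# PARTITION BY computes every store's standing relative to its region
# (the full per-region calculation the text describes), then the outer
# query keeps only the top three.
rows = conn.execute(
    """SELECT region, store, profit
       FROM (SELECT region, store, profit,
                    RANK() OVER (PARTITION BY region
                                 ORDER BY profit DESC) AS rk
             FROM stores) AS ranked
       WHERE rk <= 3
       ORDER BY region, profit DESC"""
).fetchall()
print(rows)
```

Note that every store in a region must be ranked before a single "top 3" row can be returned, which is why these queries are costly on engines without partitioning support.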
Again, because of the business need to assess performance in terms of relative impact, business analytics software supports contextual aggregation by generating SQL which leverages the data store’s capabilities or by utilizing local processing.
Apache Hive supports some contextual aggregation functions, which again implies developing code or relying on local processing by the analytic application when a given function is not supported. Support by other SQL-on-Hadoop implementations, such as Aster SQL-H, is typically broader.
And Then There’s Performance!
When moving an analytics application to Hadoop, one must remember a few key points:
- Hadoop is batch-oriented. See the previous blog post on this point.
- Hadoop doesn’t use indexes… at least not at this time
The impact of these points shows up in performance, particularly for more complex query constructs that involve:
- A HAVING or OVER clause e.g., “All customers whose life-to-date spend is greater than $1 million”
- Nested or multi-level SQL e.g., "lost time injury rate," where the results from HR claims are combined with calculations from the manufacturing schedules
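To illustrate the first construct, here is a hedged sketch of the "life-to-date spend greater than $1 million" question using a HAVING clause, again run through SQLite from Python with an invented orders table:

```python
import sqlite3

# Hypothetical orders table; customers and amounts are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("Acme", 600_000.0), ("Acme", 500_000.0),    # life-to-date: 1.1M
    ("Bravo", 400_000.0), ("Bravo", 300_000.0),  # life-to-date: 0.7M
])

# HAVING filters on the aggregated value itself, so every customer's
# full history must be summed before any row can be kept or dropped --
# costly on a batch-oriented, index-free store.
rows = conn.execute(
    """SELECT customer, SUM(amount) AS life_to_date
       FROM orders
       GROUP BY customer
       HAVING SUM(amount) > 1000000"""
).fetchall()
print(rows)  # [('Acme', 1100000.0)]
```

Unlike a WHERE clause, which can prune rows before aggregation, HAVING can only act after the full scan and grouping are complete, which is where the batch-oriented, index-free nature of Hadoop shows up in query times.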
SQL-on-Hadoop offerings are starting to provide built-in functions for more than the basics but, like the early days of RDBMS, performance and completeness of this SQL will vary. Programming user-defined functions (UDFs), query accelerators or some kind of specialized processing in Hadoop may be required to support all required analytic functions.
When moving your analytics application from an RDBMS to your SQL-on-Hadoop implementation, assess both the breadth of capability and the fit with your data, as local processing can adversely impact performance.
Stay tuned! As vendors broaden their SQL-on-Hadoop implementations and RDBMS-to-Hadoop connectors, I expect these SQL gaps to disappear in the next 12-18 months.
More resources for learning, discovery and experimentation
BD111EN (FREE course): This course teaches you how to use the SQL language to access big data stored in HDFS or HBase. The course presents the different alternatives for SQL access, such as Hive, Impala, and Big SQL, and explains the similarities and differences between these three technologies.
The IBM Big SQL Technology Preview program provides an easy, convenient way to get familiar with SQL-for-Hadoop technology. With nothing more than a web browser, you will be running SQL queries against a real Hadoop cluster.
Additional educational resources