Blogs

SQL-on-Hadoop: Who leads the herd?

Product Marketing Manager, IBM

SQL-on-Hadoop is a hot topic. As organizations increasingly leverage Hadoop to gain new capabilities and lower costs, it only makes sense that they would seek to leverage technologies compatible with what they have. Few technologies are as ubiquitous as the structured query language when it comes to data access and information management.

In an independently audited benchmark of three popular SQL-on-Hadoop implementations overseen by InfoSizing, IBM recently showed that Hadoop is ready to run OLAP and complex query workloads at a fraction of the cost of traditional systems—that is, if you choose the right technology.

herd of elephants.jpgRunning IBM’s Hadoop-DS benchmark, modeled after the highly regarded Transaction Processing Council Decision Support (TPC-DS) benchmark, IBM showed that it was the only vendor able to run all 99 Hadoop-DS queries, thanks to the strict ANSI SQL compliance of IBM’s Big SQL, a component of IBM’s enterprise-grade Hadoop offering, IBM InfoSphere BigInsights.

In addition to being fully SQL compatible, IBM Big SQL was found to be more scalable and deliver better performance as well. Big SQL was the only Hadoop-on-SQL offering tested able to run the Hadoop-DS benchmark at a 30 TB scale. Additionally, when comparing IBM Big SQL to Cloudera Impala 1.4.1 and HortonWorks Hive 0.13 across the common subset of 46 queries both were able to run, IBM performed better in both single-user as well as multi-user tests.

Key findings

  • Compatibility matters: IBM’s Big SQL was able to execute all 99 TPC-DS queries with 12 queries requiring only minor modifications permissible under the TPC-DS benchmark rules. Cloudera’s Impala was able to run 52 queries, and Hive ran 58 queries in a manner that complied with the same requirements. For the remaining queries, all needed to be re-written and some queries could not be made to run at all or would fail with errors or hang at a 10 TB scale test. Getting even a subset of the queries to execute at scale on competing platforms, took several weeks of effort.
  • Throughput matters: When comparing Big SQL across the subset of common queries that ran on all SQL-on-Hadoop implementations test, IBM ran queries on average 3.6 times faster than Impala and 5.4 times faster than Hive on a 10 TB scale test. Not only was Big SQL the only engine able to run the Hadoop-DS workload, it ran the workload significantly faster than the competing implementations as well.
  • Scale matters: While Big SQL could reliably execute all queries at a 30TB scale, and complete the full Hadoop DS workload, the competitive platforms could not. Some queries simply would not run on competing platforms at all and other queries that ran at a small scale would fail running against 10TB or 30 TB datasets, hanging or exhibiting various run-time errors.

Table 1 shows the audited findings of the Hadoop-DS benchmark. A short whitepaper explaining the results in more detail is available here.

 

 

Elapsed time (seconds)

Qph-HDS@10TB

Big SQL Advantage

 

# queries

Single-user

Multi-user

Single-user

Multi-user

Single-user

Multi-user

Big SQL 3.0

46

2,908

6,945

5,694

9,537

 

 

Impala 1.4

46

10,536

14,920

1,571

4,439

3.6

2.1

Hive 0.13

46

15,949

59,550

1,038

1,112

5.4

8.5

Big SQL 3.0

99

32,361

88,764

1,101

2,409

 

 

Impala 1.4

99

Not possible – not all queries could run

Hive 0.13

99

Not possible – not all queries could run

Table 1 – Hadoop DS results – audited by InfoSizing, certified TPC auditors

Why all the fuss?

These findings will be compelling to organizations augmenting data warehouse environments with Hadoop-based technologies. Many organizations simply cannot afford to support different SQL dialects and toolsets between Hadoop and existing data warehouse and RDBMS environments. Being able to use a common SQL across all platforms is critical to speeding deployments, containing costs and avoiding unnecessary business risk.

Not only is IBM’s Big SQL the only Hadoop solution tested able to actually run the complete set of Hadoop-DS queries, but it was found to be the fastest, the most scalable and the most reliable as well. 

IBM invented SQL and brings over 30 years of experience developing query engines, optimizers and high-performance distributed computing environments. IBM Big SQL operates seamlessly over native Hadoop data sources, while providing transparent, federated access to multiple enterprise data management platforms including IBM DB2, Oracle and Teradata.

Participate in the #Hadoop4good challenge and change the world

TPC Benchmark, TPC-DS and QphDS are trademarks of Transaction Processing Performance Council. Cloudera, the Cloudera logo and Cloudera Impala are trademarks of Cloudera. Hortonworks is a trademark of Hortonworks Inc. Hadoop and Hive are trademarks of the Apache Software Foundation. Other company, product and service names may be trademarks or service marks of others.