The big deal about InfoSphere BigInsights v3.0 is Big SQL
Today, at IBM Impact 2014, Bob Picciano described why the upcoming release of IBM InfoSphere BigInsights v3.0 was a big deal; more precisely, why its new SQL engine, Big SQL, was a big deal.
This release is exciting for four reasons:
1. Query performance
2. Comprehensive SQL execution
3. Built-in data security
4. Return on investment
Now, let’s describe each area further.
Performance: Big SQL vs. Hive
Let’s start with the number one reason why this new release of Big SQL sets a new bar: performance. Benchmark tests indicate that Big SQL executes queries 20 times faster, on average, than Apache Hive 12, with performance improvements of up to 70 times.
This performance improvement was achieved by replacing the earlier MapReduce (MR) implementation with a massively parallel processing (MPP) SQL engine. The MPP engine deploys directly on the physical Hadoop Distributed File System (HDFS) cluster. A fundamental difference from other MPP offerings on Hadoop is that this engine actually pushes processing down to the same nodes that hold the data. Because it natively operates in a shared-nothing environment, it does not suffer from limitations common to shared-disk architectures (for example, poor scalability and network overhead caused by the need to move shared data around).
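To make the push-down idea concrete, here is a minimal Python sketch. It is not Big SQL code. It assumes a hypothetical three-node cluster in which each node stores its own partition of a sales table, and the names `NODE_PARTITIONS`, `local_aggregate` and `coordinator_merge` are invented for illustration. Each node aggregates only the rows it holds, and only the small partial results travel to a coordinator.

```python
from collections import Counter

# Hypothetical data: each "node" stores its own partition of (region, amount) rows.
NODE_PARTITIONS = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8), ("north", 2)],
}

def local_aggregate(rows):
    """Runs where the rows live: the equivalent of SUM(amount) GROUP BY region."""
    partial = Counter()
    for region, amount in rows:
        partial[region] += amount
    return partial

def coordinator_merge(partials):
    """Adds up the small per-node partial sums into the final result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return dict(total)

# Only the partial sums (a few values per node) cross the network, not the raw rows.
partials = [local_aggregate(rows) for rows in NODE_PARTITIONS.values()]
result = coordinator_merge(partials)
print(result)
```

In a shared-disk design, by contrast, the raw rows would have to reach whichever process runs the aggregation, which is the data movement the paragraph above refers to.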
Comprehensive SQL: Big SQL vs. Hive
The Transaction Processing Performance Council (TPC) is a non-profit corporation founded to define transaction processing and database benchmarks and to disseminate objective, verifiable TPC performance data to the industry. The two benchmarks pertinent to analytic applications are TPC’s decision support benchmark (TPC-DS), which reflects the type of data volumes and queries found in typical analytic environments, and TPC’s ad-hoc decision support benchmark (TPC-H), which focuses on query performance and throughput in a concurrent user environment.
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. In contrast, Apache Hive 12 executes only 43 of the 99 TPC-DS queries without modification. In a Jan 2013 blog post, Cloudera describes how it completed its benchmark tests by rewriting the TPC-DS queries in SQL-92 syntax and selectively including only 20 of the 99 TPC-DS queries.
How broad is SQL support in this release? Everything you know about Hive 12 plus the following capabilities:
- Nested sub-queries, including sub-queries with the HAVING clause
- Windowing and OLAP aggregate functions
- Grouping sets and the ROLLUP function
- Complex joins
- INSERT statement
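Two of the capabilities listed above, a nested sub-query inside a HAVING clause and a windowing function, can be sketched with standard SQL. The example below is only an illustration: it runs against SQLite through Python's `sqlite3` module rather than Big SQL, the `sales` table and its rows are invented, and the window function requires SQLite 3.25 or later. SQLite does not support grouping sets or ROLLUP, so those are not shown.

```python
import sqlite3

# In-memory SQLite database standing in for a SQL engine (an assumption, not Big SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 100),
        ("east", "bob", 300),
        ("west", "cy", 50),
        ("west", "dee", 70),
    ],
)

# Nested sub-query in HAVING: regions whose total exceeds the average regional total.
above_avg = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > (
        SELECT AVG(t) FROM (SELECT SUM(amount) AS t FROM sales GROUP BY region)
    )
    """
).fetchall()

# Windowing function: rank each rep within their region by amount, highest first.
ranked = conn.execute(
    """
    SELECT region, rep,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
    """
).fetchall()

print(above_avg)
print(ranked)
conn.close()
```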
Built-in data security
In addition to standard application authentication via LDAP or Kerberos, Big SQL enables row and column access control, sometimes described as fine-grained access control, consistent with functionality found in an RDBMS. This functionality supports compliance with regulations and policies related to data privacy, such as those covering patient health records or securities data.
To monitor and validate data access, BigInsights’ built-in auditing can track changes to access privileges or data objects, as well as SQL statement execution and the retrieval of security information.
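The idea behind row and column access control can be sketched in a few lines of Python. This is a conceptual illustration only and does not reflect how Big SQL defines or enforces its policies: the roles, the `PATIENTS` rows, and the `ROW_RULES`, `MASKED_COLUMNS` and `query_as` names are all hypothetical. A row rule decides which records a role may see at all, and a column mask hides sensitive values in the records that remain.

```python
# Hypothetical rows of a patient table.
PATIENTS = [
    {"id": 1, "name": "A. Smith", "ward": "cardiology", "diagnosis": "arrhythmia"},
    {"id": 2, "name": "B. Jones", "ward": "oncology", "diagnosis": "lymphoma"},
]

# Hypothetical row rules: a predicate per role deciding which rows are visible.
ROW_RULES = {
    "cardiologist": lambda row: row["ward"] == "cardiology",
    "billing": lambda row: True,
}

# Hypothetical column masks: columns whose values a role must not see.
MASKED_COLUMNS = {
    "cardiologist": set(),
    "billing": {"diagnosis"},
}

def query_as(role, rows):
    """Filter rows by the role's row rule, then mask the role's restricted columns."""
    allowed = ROW_RULES[role]
    masked = MASKED_COLUMNS[role]
    return [
        {col: ("***" if col in masked else val) for col, val in row.items()}
        for row in rows
        if allowed(row)
    ]

cardio_view = query_as("cardiologist", PATIENTS)
billing_view = query_as("billing", PATIENTS)
print(cardio_view)
print(billing_view)
```

Here the cardiologist role sees only cardiology patients with all columns intact, while the billing role sees every patient but with the diagnosis masked.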
Return on investment: Existing SQL skills and data sources
The comprehensive SQL support in Big SQL 3.0 enables an organization to make full use of its existing SQL skills, reducing the need to augment its analytic applications with Hadoop-specific functions.
Now here’s the real value: Big SQL 3.0 can access data from more than BigInsights. It can query and combine data from many data sources, including (but not limited to) DB2 for Linux, UNIX and Windows database software, IBM PureData System for Analytics, IBM PureData System for Operational Analytics, Teradata and Oracle. Organizations can choose to leave data where it currently exists and use BigInsights to augment it where that makes the most sense.
Note that this approach, minimizing the need to move data, is part of IBM’s overall big data and analytics strategy. SPSS and Cognos Business Intelligence also support querying and joining data across disparate data sources, addressing the need to analyze all data, wherever it is located.
IBM InfoSphere BigInsights v3.0, with the MPP-based performance and SQL support of Big SQL 3.0, provides an enterprise-ready Hadoop distribution that minimizes the impact on users while enabling IT to adopt this new technology into its data architecture strategy.
- Technical White Paper on Big SQL: SQL-on-Hadoop Without Compromise
- Infographic: Big Data, Big Solutions Infographic
- More information about InfoSphere products at Impact 2014