Berni Schiefer on the Big SQL extension to Hortonworks Data Platform
Berni Schiefer, IBM Fellow in the IBM Analytics group, is based in San Francisco, California at the IBM Spark Technology Center. Schiefer is responsible for a global team that focuses on the performance and scalability of products and solutions in the Analytics group. These big data technologies include Apache Spark, IBM Big SQL, dashDB, IBM BigInsights, IBM DB2 with BLU Acceleration and IBM DB2 pureScale.
Schiefer’s passion is in bringing advanced technology to market with an emphasis on exploiting processor, memory, networking, storage and other hardware and software acceleration technologies. Since joining IBM Canada in 1985, he has worked closely with many customers, independent software vendors (ISVs) and business partners worldwide.
Big news about Big SQL
In September 2016, IBM extended Big SQL—formerly exclusive to the IBM Open Platform (IOP)—to the Hortonworks Data Platform (HDP). In my conversation with Schiefer, we discussed the offering and the IBM focus on SQL.
You’ve been with IBM for quite a few years and working on IBM Big SQL since its inception. Tell us how you got started working on Big SQL.
For more than 31 years now, I’ve been involved in the development of many different products, all centered around persistent data and the query processing space. I have a five-year history with Apache Hadoop, and three years ago I helped found the re-architecture of the Big SQL project at IBM. We set out to build the world’s most powerful SQL on Hadoop engine, one capable of meeting the requirements of enterprise-grade SQL applications.
Our starting point was powerful IBM query processing technology that we had already delivered to our customers through relational database engines; we worked to extend, reuse and couple this technology to run on Hadoop using native Hadoop formats. Originally, Hadoop meant the MapReduce execution model running over data stored on the Hadoop file system known as HDFS [Hadoop Distributed File System]. Today, Hadoop is an entire ecosystem for big data. Our goal was to retain the persistence model of Hadoop, including its resiliency, elasticity and flexible data formats—such as delimited ASCII, Apache ORC and Parquet—but replace MapReduce with a sophisticated new runtime engine that was more powerful and efficient. A little more than two years ago, with Big SQL Version 3, we delivered the world’s most powerful SQL over Hadoop engine. This technology brings a very powerful way to query and analyze the content of Hadoop clusters.
Is Big SQL still a leading SQL on Hadoop engine?
Absolutely. Big SQL has more function and much greater performance than any other SQL on Hadoop engine. SQL is a powerful programming language, but one of its most distinguishing features is that you specify what data you want to retrieve, not how the engine should go about retrieving it. Figuring out the how takes a sophisticated query optimizer. One of the key features of Big SQL is that it brings with it decades of SQL research and development, going all the way back to IBM’s invention of the SQL language and to Dr. Patricia Selinger’s late-1970s concept of a query optimizer that performs access path selection. We’ve brought all that technology and made it available in Big SQL for Hadoop.
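The spirit of Selinger-style access path selection can be sketched with a toy cost model. All table names, statistics and cost formulas below are illustrative inventions, not Big SQL internals; the point is only that the optimizer searches alternatives and picks the cheapest:

```python
from itertools import permutations

# Illustrative statistics: table -> row count and candidate access paths
# with a hypothetical per-row cost. None of this reflects a real catalog.
STATS = {
    "orders":    {"rows": 1_000_000, "paths": {"full_scan": 1.0, "index_on_date": 0.1}},
    "customers": {"rows": 50_000,    "paths": {"full_scan": 1.0}},
    "regions":   {"rows": 100,       "paths": {"full_scan": 1.0}},
}

def cheapest_path(table):
    """Pick the lowest-cost way to read one table."""
    paths = STATS[table]["paths"]
    name = min(paths, key=paths.get)
    return name, STATS[table]["rows"] * paths[name]

def plan(tables, selectivity=0.001):
    """Exhaustively cost every join order (fine for a handful of tables)
    and return the cheapest one -- Selinger-style in spirit."""
    best = None
    for order in permutations(tables):
        cost, rows = 0.0, None
        for t in order:
            _, scan_cost = cheapest_path(t)
            cost += scan_cost
            t_rows = STATS[t]["rows"]
            if rows is None:
                rows = t_rows
            else:
                cost += rows * t_rows * selectivity  # toy nested-loop join cost
                rows = rows * t_rows * selectivity   # estimated result size
        if best is None or cost < best[1]:
            best = (order, cost)
    return best

order, cost = plan(["orders", "customers", "regions"])
print(order, round(cost))
```

Even this toy planner prefers to join the small tables first and to read the large table through its hypothetical index, which is the essence of cost-based optimization: the user states the what, and the optimizer searches for the cheapest how.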
The Big SQL announcement IBM made recently must be a milestone you’re very proud of.
It certainly is. Back in February 2015, IBM became a founding member of the ODPi [Open Data Platform initiative], and since then the organization has worked to define specifications and compliance tests. In June 2016, the first five vendors with ODPi-compliant Hadoop distributions were announced. With that foundation in place, we were able to extend the capability of BigInsights to allow clients who purchase BigInsights to deploy Big SQL on their choice of Hadoop platform: both the IBM IOP Hadoop distribution and the HDP distribution. This [extension] is a big step forward for Big SQL, as it addresses something many Hortonworks customers have been asking for over the years.
HDP is aligned with ODPi. Both Hortonworks and IBM are active members of that consortium, and IOP 4.2 and HDP 2.4 are both ODPi-compliant releases. That consortium has allowed us to quickly make Big SQL for Hortonworks available to the market.
Spark is an integrated component within the BigInsights IOP Hadoop distribution. What’s the difference between Spark SQL and Big SQL?
Great question. Both are SQL engines that can run against Hadoop data. Spark SQL is a core, integrated component of Spark, and these components can be used together to build rich, elegant analytics applications. You can do things in Spark that are difficult or impossible to do with a traditional relational database and SQL engines—such as machine learning and graph processing.
Spark SQL is another SQL engine, one that is emerging and to which IBM is contributing. It is still in the early stages of becoming an enterprise-class SQL engine: it doesn’t have a full optimizer, it doesn’t have statistics and it doesn’t have a deep portfolio of query rewrite rules. Nor has it yet been put through the wringer of production enterprise-grade SQL applications. Spark 2.0 made tremendous progress on Spark SQL, and I’ve blogged about that progress on the IBM Spark Technology Center website, but Spark SQL is not fully mature yet. Big SQL, on the other hand, is a powerful, comprehensive and mature SQL engine—not open source, but one that brings decades of SQL expertise to a Hadoop environment. So no matter how complex the SQL, it will run, produce the right answer and complete in a reasonable amount of time.
Invoking Spark technology from a Big SQL environment is a specific capability. How is this capability unique, and why is it important?
What we have done is created a natural extension to our SQL using a polymorphic user-defined table function to be able to invoke Spark runtime processing from within an SQL statement. We can use this enhanced SQL language capability to invoke Spark functions such as Spark Machine Learning or Spark Graph Processing. We pass data from the SQL engine into Spark. It does the unique processing that only Spark knows how to do, produces the result of a computation and passes that back into the Big SQL engine.
For example, it can create a machine learning model using training data, and then perform scoring against first test data and later production data. Now, in the middle of a normal Big SQL application, you can suddenly augment it in a very natural way by calling an SQL statement that then invokes Spark Machine Learning or Spark Graph Processing. As far as I know there is no other SQL engine in the world that can do this.
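As a sketch of what this looks like in practice, the statement below follows Big SQL’s polymorphic table function pattern. The function name echoes the SYSHADOOP.EXECSPARK convention, but the parameters, class and column names here are hypothetical assumptions for illustration, so treat the exact syntax as indicative rather than definitive:

```sql
-- Sketch only: invoke a Spark job from inside a Big SQL statement via a
-- polymorphic table function, then join its output like any other table.
-- The class, parameters and column names below are invented for illustration.
SELECT s.customer_id, s.churn_score
FROM TABLE(SYSHADOOP.EXECSPARK(
       language => 'scala',
       class    => 'com.example.ChurnScorer',  -- hypothetical Spark ML job
       uri      => 'hdfs:///models/churn'
     )) AS s
JOIN customers c
  ON c.id = s.customer_id
WHERE s.churn_score > 0.8;
```

The shape is the important part: the Spark job behaves as a table-valued input to the surrounding SQL, so its result can be filtered, joined and aggregated by the Big SQL engine like any other relation.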
Can Big SQL essentially run on any Hadoop distribution with this latest announcement? Does this capability mean that Big SQL on Hortonworks also brings the Spark integration?
First, while I’d love to say that Big SQL can now run on any Hadoop distribution, I can’t. It isn’t true today, and making it true would require a lot of work, although theoretically it could be done with enough effort. Today, Big SQL is supported on exactly two ODPi-compliant Hadoop runtimes: IOP from IBM and HDP from Hortonworks. The answer to your second question is “yes.” The full capability of Big SQL, including the ability to invoke Spark from Big SQL, is also available on Hortonworks.
For those who have already made their Hadoop platform decision on Hortonworks, this capability is a big benefit.
You’ve got it. Once that Hadoop platform decision is made, and once you’ve loaded hundreds of terabytes of data into a Hadoop cluster, switching is not so easy. Each major Hadoop distribution offers the generic SQL, also known as [Apache] Hive SQL, a basic level of query processing that uses MapReduce. Eventually Spark will replace MapReduce, although that [transition] isn’t going to happen in the short run, say the next one to five years. Either way, a MapReduce-based SQL engine is not the best choice. Moreover, when you go beyond basic querying, Big SQL continues to demonstrate that it has more function, more performance and more enterprise features, including security and other things such as workload management, that other SQL for Hadoop engines don’t have.
What’s next for SQL? And what’s ahead for the major SQL vendors?
Predictions are challenging, but let me gaze into my crystal ball. As SQL becomes ever more powerful and complicated, the job of optimizing and executing those very long complicated SQL statements becomes progressively more challenging as well. People forget that a SQL statement can now be multiple megabytes of text. It can reference hundreds of tables. It can involve thousands of relational operations. A basic SQL processing engine just gets lost. It doesn’t know what to do, how to do it. It has never imagined having to deal with such complexity.
At IBM, we have been working on this kind of SQL for customers for nearly 40 years, and we are still innovating to become even better. At the same time, data volumes are also increasing, and as a result the consequences of a poor access path selection decision can be dire. A query can take many hours or days to run, or even fall into the infamous “never come back” category, in which nobody knows exactly how long the query would take to run.
Clearly, more innovation in building SQL engines lies ahead. What if we look at SQL from a customer perspective? What is the most significant SQL trend you are seeing?
Enterprises that are using SQL don’t do so in isolation in just one application. Very few companies do everything only on Hadoop. Many are using relational databases in many other parts of their enterprise. IBM has built a set of solutions, for both cloud-based and on-premises applications, and for both relational databases and Hadoop, that offer the same SQL language to applications but have very different back-end infrastructures.
If you procure DB2 and install it on your IBM POWER server running AIX, you get a powerful SQL language environment. If, through the IBM Bluemix platform, you deploy a dashDB instance for analytics processing on AWS, you have a very different use case, a different back-end environment, but you get the same interface for applications and the same SQL programming language.
A final example is using a Hadoop cluster if you need to perform analysis of very large volumes of data that you want to keep for a long time, store very effectively and be very resilient to failures. Once again, you access this data through the same APIs using the same SQL language. You can even combine access to multiple sources using this unified SQL. A customer’s own application or a packaged application you can purchase—from IBM or a third party—can now generate and execute SQL against different back ends using the same front end. That feature is a powerful, simplified way of getting work done.
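The “same SQL text, many back ends” idea can be sketched with Python’s DB-API, using two in-memory SQLite databases as stand-ins for, say, an on-premises DB2 server and a cloud dashDB instance. The stand-ins and table contents are invented for illustration; the point is that only the connection targets differ, never the query:

```python
import sqlite3

# One SQL text, reused unchanged against every back end.
QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

def make_backend(rows):
    """Stand-in back end: an in-memory SQLite database loaded with sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def run_everywhere(connections, sql):
    """Execute the same SQL against each back end and collect the results."""
    results = {}
    for name, conn in connections.items():
        results[name] = conn.execute(sql).fetchall()
    return results

# Hypothetical deployment targets; in the article's scenario these would be
# DB2, dashDB and Big SQL, reached through the same SQL interface.
backends = {
    "on_prem_db2":  make_backend([("east", 100.0), ("west", 50.0)]),
    "cloud_dashdb": make_backend([("east", 10.0),  ("west", 5.0)]),
}
print(run_everywhere(backends, QUERY))
```

The application layer never changes: swapping a back end means swapping a connection, while the SQL, and everything built on top of it, stays the same.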
Many customers bring up the risk in their work, and Big SQL seems to minimize risk by giving the big data community more choice.
Exactly. IBM is expanding choice and minimizing risk. The interfaces that Big SQL uses are exactly the same in both IOP and HDP. You can continue to use and retain the Hortonworks clusters that you have already deployed with the hundreds of terabytes of data that you have placed into your data lake. And you can quickly install the world’s most powerful SQL on Hadoop engine into the same environment—that’s Big SQL.
Where can someone go to learn more about Big SQL?
Big SQL is part of the BigInsights product family, which is available for on-premises deployment and in the cloud.