Opening up Big SQL for all: An interview with Paul Yip
Paul Yip runs IBM’s worldwide product strategy for Apache Hadoop and Apache Spark. Yip has spent more than four years in the trenches helping customers deploy and manage big data solutions. Prior to this role, he worked in product management for Hadoop and technical presales roles, specializing in database management, data warehousing and online transaction processing (OLTP) solutions. Yip has authored three books for IBM Press, and he is a prolific contributor to the IBM developerWorks community.
IBM is extending Big SQL, which was formerly exclusive to the IBM Hadoop Platform, to the Hortonworks Data Platform (HDP). I recently asked Yip, one of the early proponents of the Big SQL on Hortonworks project, to give us some insight on what this transition means for the industry and its benefits.
Why has IBM decided to open up Big SQL and allow it to be deployed on HDP?
In the early years of Hadoop, a lot of posturing was in play from different vendors that resulted in a fragmented platform with multiple combinations of components that were not particularly interoperable. As an industry, we realized that this situation was generally unhealthy.
The Open Data Platform initiative (ODPi) we helped originate last year is about Hadoop vendors, system integrators and customers establishing standards for more compatibility and skills reuse for Hadoop platforms. This effort allows vendors to focus on innovation for Hadoop while reducing the cost of porting and testing for compatibility. Customers benefit too because they spend less time retraining staff if they move between ODPi environments. Having the current versions of both IBM Open Platform (IOP) and HDP ODPi certified made it much easier for us to bring Big SQL to HDP.
Aside from that, the number-one issue we hear from customers who have deployed Hortonworks is they want data to be more accessible with SQL. They want more performance, more concurrent access than ever. So we’re making that happen. We’re simply responding to market demand. Hortonworks obviously has a significant presence in the market, and so our response makes sense for us.
A lot of Hadoop ecosystem projects offer some kind of SQL front end for Hadoop. Why is there so much customer demand for Big SQL in particular?
For years, there has been the promise or hope that Apache Hive would be the way to do SQL on Hadoop. And yet, today, I’m aware of at least 23 other SQL engines in the ecosystem, which is a clear indicator that the market is still young and undecided. It is generally recognized that SQL on Hadoop is still a problem in need of a proper solution.
However, none of these engines really tackle the big problem of complex data warehousing queries on Hadoop, and this really key workload is what we’re targeting with Big SQL. With the low cost per gigabyte that characterizes Hadoop, customers are looking for ways to build Hadoop around their more costly traditional data warehouse technologies. When they start to reach capacity, they’d normally buy more, which can be very expensive. What they want to do now is offload some of the workloads into Hadoop; but if they do that, they will need to rewrite many of their SQL queries just to get them to work—let alone perform.
Even though SQL is supposed to follow ANSI and ISO standards, few of the open source SQL engines for Hadoop properly support ANSI SQL, and Oracle and Teradata differ in syntax where no standard exists. So offloading SQL workloads to Hadoop is very challenging.
With Big SQL, we bring something very special to Hadoop that no other technology can. We asked ourselves: if you were trying to offload data warehouse workloads to Hadoop, what would be the one killer feature? Well, what if we could support existing SQL syntax nuances from IBM DB2, IBM PureData System for Analytics powered by Netezza technology and Oracle Database, all at once? In Big SQL, that is exactly what we did. You can use the syntax you already know and write queries for Hadoop in just the same way you would write them for your other systems. That unlocks the full potential of SQL on Hadoop, and this step forward is huge for customers using HDP.
Are there other advantages?
Concurrency. The Big SQL engine can handle and coordinate queries even if hundreds of concurrent users are running the most difficult, complex queries. Again, this capability is great for data warehousing workloads in which you will definitely have many users hitting the platform.
Big SQL also makes it far easier to offload your data warehouse and give mainstream users a queryable archive of older data, which in turn supports additional use cases such as predictive modeling and machine learning that get more powerful with more data. And because the data is all in Hadoop, data scientists can do whatever they like with it, using whatever tools are best for the job.
Why is Big SQL better than other SQL engines for Hadoop?
Hive got more performance by scaling to more nodes using MapReduce, but just because you can scale to hundreds or thousands of machines to reach your performance objectives doesn’t mean you want to. I’d rather manage 100 nodes than 1,000 nodes, any day. Spark changed how we think about scale-out performance by exploiting in-memory processing, and now you’re seeing the same workloads run faster with fewer nodes using Spark.
But here’s the thing: while everyone (now) is trying to get more performance by exploiting memory, we are going in the opposite direction with Big SQL. This approach is different, so I hope readers are paying attention. Big SQL works exceptionally well in environments where data exceeds available memory. Architecturally, this is the only way you can get maximum concurrency, and this is what Big SQL is built for.
In-memory approaches used by technologies such as Spark SQL are very fast but will have challenges scaling to support many concurrent users running complex SQL. That is not to knock Spark SQL; IBM is a huge contributor to Spark and Spark SQL. But people need to recognize that SQL is a language, not a workload. Big SQL is engineered for warehousing on Hadoop. Spark SQL and Big SQL can be used concurrently for the same data sets on Hadoop because they’re open. Use Spark SQL, Big SQL, and other SQL tools together for what they are good for.
Many customers want to use open source because it is free. Big SQL is not open source, and it isn’t free. What’s your response?
Hadoop and open source aren’t really free. The software is free, but you need to run the software somewhere, and you need people to make it work. The time to make it work the way you want also isn’t free. When you want support for a Hadoop distribution, you don’t pay for the software, but you probably pay for support on a subscription basis. So Big SQL is available on a subscription basis too, for those who prefer subscribing to the tool. You pay month to month, and only for as long as you get value from it.
Nevertheless, many users are starting with small- to medium-sized Hadoop clusters. As they get more SQL workloads and users, they find themselves adding nodes very quickly. Big SQL is comfortably five to ten times faster than Hive for complex queries, which means you can get more done with fewer nodes. So an investment in Big SQL could mean a dramatic reduction in your overall TCO for Hadoop. Big SQL isn’t free, but what if it could save you money in the end?