Evolving the enterprise data warehouse beyond SQL with Apache Spark
Traditional enterprise IT infrastructure is architected around a relational data warehouse, with all other applications communicating through it. Line-of-business departments are pressing to use open source analytics and big data technologies such as Python, R and Spark for analytical projects, and to deploy them continuously without having to wait for IT to provision them. Failing to serve these requests leads to a proliferation of analytics silos and a loss of control over data.
IBM dashDB is an enterprise data warehouse solution from IBM that is available as a managed cloud service on the IBM Bluemix platform and as a Docker container for on-premises deployment through an offering called IBM dashDB local. The dashDB solution offers a strong set of built-in, advanced analytics functions and deep integrations with analytics languages such as Python and R.
IBM has now started to roll out the next big stage of analytics inside dashDB: Apache Spark is now integrated in the engine. This capability has been made available initially in dashDB local, and the rollout to dashDB in IBM Bluemix is planned to follow. In addition to introducing you to dashDB with integrated Spark, this discussion seeks to help you understand the possibilities that this approach opens up to your data warehouse solutions.
Why is SQL not enough?
An enterprise data warehouse is not maintained for its own sake. It is meant to drive some form of analytics that is performed to gain business insight and support decision making. But a closer look at the types of analytics shows that multiple levels are in play. Each level builds on the previous one, adding business value up to the point where decision making is almost entirely automated by the analytics process.
The basic form of analytics is to crunch through the data that has been collected in order to understand the past, that is, what has happened. It involves a lot of filtering and grouping of data, followed by various forms of aggregation. This process is what we refer to as descriptive analytics, and it is also often called business intelligence (BI) processing. It is the sweet spot of relational data engines, and SQL in particular is well suited for expressing these kinds of queries.
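The query shape behind descriptive analytics is familiar: filter, group, aggregate. A minimal sketch, run here with Python's built-in sqlite3 instead of a real warehouse engine, with a made-up table and values:

```python
# Descriptive analytics in plain SQL: filter, group, aggregate.
# Illustrative only -- the sales table and its values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("AMER", 200.0)],
)

# "What happened?" -- total revenue per region, the classic BI query shape
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('AMER', 200.0), ('EMEA', 200.0)]
```

A warehouse engine such as dashDB evaluates exactly this kind of query, just massively parallelized over much larger data.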
But when you run only descriptive analytics, you still require a lot of very smart people to interpret the results and understand what they mean for the business going forward. In other words, they are making predictions using their own brains. Automating exactly such predictive analytics is the next level, and it is often referred to as machine learning. Many standard machine learning algorithms are available, but a lot of them are very hard to express in SQL. As a consequence, a growing number of analytics frameworks, libraries and even dedicated languages are much better suited to expressing and running these predictive algorithms.
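To see why SQL struggles here, consider a toy illustration: training even a tiny logistic-regression model is an iterative, stateful loop, which is natural in a general-purpose language but awkward in set-oriented SQL. The data, learning rate and iteration count below are all made up:

```python
# Toy sketch: gradient-descent training carries state (the weights) across
# rows and across many passes over the data -- the opposite of a single
# declarative SQL query. Values are illustrative only.
import math

data = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)]  # (feature, label)
w, b, lr = 0.0, 0.0, 0.1

for _ in range(1000):             # repeated passes over the data
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        w -= lr * (p - y) * x     # gradient step: state updated per row
        b -= lr * (p - y)

def predict(x):
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5 else 0

print([predict(x) for x, _ in data])
```

Frameworks such as Spark exist precisely to run this kind of iterative computation in a distributed, scalable way.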
The ultimate stage is prescriptive analytics, which means that not only predictions but also recommendations can be made or even implemented automatically. This automation does require a very close integration into applications and business processes with flexible programming and application programming interface (API) capabilities, which is something that is very challenging, if not impossible, to achieve with plain SQL mechanisms.
We can see that SQL engines do have their place in the analytics stack, and they are essential to do the descriptive part in a very scalable way. But for a modern analytics processing stack, the SQL processing has to be augmented with advanced analytics functions for predictive and prescriptive analytics.
One option to solve this problem is to layer the data processing engines: you keep your original copy of the data in the data warehouse and run some form of machine learning engine, such as Apache Hadoop, Spark, or Python and R runtime environments, on top of it, each with its respective machine learning libraries. The downside of this setup is inefficiency, because all the data has to be transferred out of the relational system over a network to the analytics engine above. You may be tempted to work around this issue by replicating the data and storing it local to the machine learning engine, which, however, introduces the problems of outdated data and lost data governance.
Another option is to replace the relational engine, that is, the data warehouse, entirely with Hadoop or Spark and rely on Spark SQL, for example, for all SQL processing. However, you then give up all the major functional and scaling advantages of mature data warehouse engines such as dashDB with its in-memory BLU technology, not to mention the enterprise and operational qualities of service that a mature data warehouse provides.
As you will see later in this post, a third option is available that is indeed able to physically combine the relational—dashDB—and machine learning engine—Spark—to preserve the gains from both without the caveats of the other two options.
A data warehouse that natively speaks Spark
IBM dashDB local has a tightly embedded Spark environment. It leverages the entire set of resources of the dashDB system, which also applies to the massively parallel processing (MPP) scale-out layout. Each dashDB node, with each data partition, is overlaid with a local Spark executor process. The existing data partitions of the dashDB cluster are implicitly inherited by the data frames in Spark, and thus by any distributed parallel processing that Spark performs on this data.
The colocation of the executors with the database engine processes minimizes the latency of accessing the data, resulting in a speed-up factor of 3–5 times for typical machine learning algorithms running in Spark. And this result is obtained even when compared with a remote Spark cluster access that is already optimized for dashDB data access.
The benefits of the integrated architecture are not only on the performance side. Read on to understand the functional benefits and new possibilities available to you now.
Things that are suddenly possible
A Spark-enabled data warehouse engine can do a lot of things out of the box that were not possible previously.
Out-of-the-box data exploration and visualization
In addition to the dashDB local container, IBM provides a Jupyter Notebooks container that is readily configured to work with the dashDB container as the Spark kernel to execute any interactive code entered in the notebook. Using this container, a data scientist can immediately start to explore the data in dashDB, leveraging the interactive user experience of Jupyter Notebooks and the richness of visualization libraries available for Python or Scala. The user can play with the data very easily by leveraging Spark’s scalable transformation operators.
Interactive machine learning
Expanding the interactive exploration further, the data scientist can leverage Spark's machine learning library to train and evaluate predictive models interactively, again using the visualization mechanisms to get a graphical representation of the models.
Each user can monitor the Spark workload in dashDB in full detail. On the web console's Monitor / Workloads screen, users can see when an active Spark cluster is running:
By clicking on a Spark cluster entry, the standard Spark monitoring UI displays, providing access to all details of Spark monitoring.
One-click deployment to turn interactive notebooks into deployed Spark applications
When the data scientist has created and verified a successful analytics flow using the notebook, the question of how to operationalize it comes up. The Jupyter Notebooks container for dashDB local provides an elegant, one-click deployment that transforms the code cells of the open notebook into a compiled Spark application deployed inside dashDB. It automatically skips all cells that contain Jupyter-specific cell magic (annotated with %%). In addition, you can flag other code cells with the comment //NOT_FOR_APP to skip them as well when they are not intended for unattended background execution. This approach makes sense for cells that are primarily for interactive purposes, such as generating and displaying plots.
The entire deployment process is automated. Under the hood, a source code file with a main method is created, that class is compiled into byte code, and the compiled result is packaged into a deployable jar file. The jar is then uploaded into dashDB, specifically into the home directory of the user inside dashDB. Any compile errors that may occur display in the browser:
When the deployment has finished successfully, a list of specific options to invoke the deployed application displays. These options are described shortly.
Users can also create Spark applications from scratch in their own development environments and deploy them into dashDB through the provided spark-submit.sh tool or through the dashDB REST API. Exporting the notebook into a zipped development project is also possible as a quick start for further custom development.
dashDB as a hosting environment for Spark applications
Once a Spark application has been deployed to dashDB, it can be invoked in three different ways: from a command line or script, a REST API or an SQL connection. You can use a command line or script remotely anywhere using the spark-submit.sh command-line tool. This tool is modeled after Spark’s standard spark-submit tool, and it supports all sorts of communication with dashDB and its integrated Spark environment. Just specify the deployed class and method name to invoke; the dashDB server name and credentials are provided through environment variables to the tool.
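A script-based invocation could look like the following fragment. Everything here is a placeholder: the class name, jar name and environment-variable names are illustrative assumptions, not the tool's documented interface.

```shell
# Invoking a deployed Spark application from a remote script (sketch).
# The variable names, class and jar below are illustrative placeholders.
export DASHDB_HOST=mydashdb.example.com   # dashDB server name (illustrative)
export DASHDB_USER=bluadmin               # credentials (illustrative)
export DASHDB_PASSWORD=********

# Run the main method of the deployed class inside dashDB's Spark cluster
./spark-submit.sh --class com.example.MyAnalyticsApp myapp.jar
```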
Using the REST API, it is very easy to interact with dashDB and invoke Spark logic from anywhere in the solution stack.
Invoking the application through a stored procedure inside an SQL connection enables you to easily extend any existing SQL application with Spark logic, for example, MicroStrategy, Tableau or Cognos reports.
Out-of-the-box machine learning stored procedures
Another way to enhance your SQL-based applications through dashDB's integrated Spark is to use Spark indirectly, without writing any Spark logic yourself. IBM provides a set of Spark-based machine learning algorithms as prepackaged stored procedures. You can use them to train a machine learning model on an input table's data, to persist the model in dashDB and to invoke the model later to do a batch prediction on data in another table. You can also find a set of routines to manage stored models, such as granting other users access to them.
Highly flexible ELT and ETL
Spark is not only about analytics algorithms; it is also an excellent framework for sophisticated data transformations. This comes in handy when you have column values in your tables that need some form of feature extraction or discretization. You can use the integrated Spark environment to run the types of transformations and extractions that are highly challenging, if not impossible, to express in SQL, and write the results back to the transformed table. This type of in-warehouse transformation is often referred to as extract, load and transform (ELT). Because you run it in an integrated Spark application, the data doesn't have to leave the data warehouse at all during the entire transformation.
Of course, you are also free to use Spark to read the source data not from a table inside dashDB but from any other remote source and data format, such as Parquet data in object storage. After the transformation, the results are written to a relational table inside dashDB. You can thus use Spark effectively as a parallelized extract, transform and load (ETL) mechanism in dashDB.
Processing and landing data in motion
As you can see, you can use Spark to access remote data and formats. Another variation is to use Spark to process data that is not stored at all but instead is data in motion. Using Spark's streaming API, you can deploy and run applications in dashDB that directly subscribe to a message hub and continuously process and insert the relevant messages into dashDB tables. In this way you can very easily turn dashDB local into a landing zone for Internet of Things data, for example.
Not much has been said to this point about languages, but a primary language that Spark supports is Python. You can of course create Python applications using the PySpark interface to work with Spark and deploy and run them in dashDB, or use a Python notebook to work with Spark in dashDB interactively. But any PySpark application is ultimately also a Python application, and you can effectively deploy and run any Python application in dashDB local. Your Python application is not forced to use the PySpark API; you have full freedom to draw on the richness of the Python library ecosystem in your deployed Python applications inside dashDB.
The dashDB local solution can be summarized as follows:
- It helps enterprises modernize their data warehouse solutions with advanced analytics based on Spark.
- It enables deployment of Spark applications and processing of relational data with significant performance and operational quality-of-service benefits.
- It facilitates creating end-to-end analytics solutions, from interactive exploration and machine learning experimentation through verification of analytics flows to easy operationalization as deployed Spark applications. It also enables hosting Spark applications in a multitenant enterprise warehouse system and integrating them with other applications through various invocation APIs.
- It allows users to invoke Spark logic through SQL connections.
- It can land streaming data directly into tables through deployed Spark applications.
- It is capable of using the integrated Spark to run complex data transformations and feature extractions that cannot be expressed in SQL.
Opportunities to learn more
Check out further information on dashDB local. Also, see the tutorial playlist for dashDB with Spark. And be sure to attend IBM Insight at World of Watson 2016, where you can see me presenting and demonstrating dashDB with Apache Spark in action at Session DMT-1479, Sparkified dashDB, Tuesday, 25 October 2016, at Mandalay Bay, Palm B. A live demo of IBM dashDB Local with Apache Spark is also presented at IBM at World of Watson 2016’s Expo, Monday to Wednesday, 24–26 October 2016, at the IBM Analytics booth.