How Spark is tuning up the logical data warehouse

Big Data Evangelist, IBM

It doesn’t look as if Spark is going to dominate the big data and analytics universe like some juggernaut of destruction.

Instead, Spark will add significant value to the increasingly hybridized world of big data tools and platforms. It’s clearly complementary, not competitive, with Hadoop, NoSQL, relational databases and every other functional component within this multi-platform world. Just like established tools and platforms, Spark and its component technologies—as a unified stack or as discrete components—will quickly find their sweet spot of use cases, applications and deployment zones where they sparkle brightest. Take a look at this recent ITKnowledgeExchange column in which I discussed the notion of Spark as a “fit-for-purpose big data component.”

And it’s likely Spark will become a core technology in the logical data warehouse (LDW). For a refresher, check out this quick overview on the enduring relevance of the LDW. In order to gauge Spark’s most valuable usage, you should regard the LDW as a zone architecture in which Spark’s fit-for-purpose niche can be assessed through a process of elimination.

Spark does not fit squarely into the LDW’s data integration zone, although it does have a stream-processing component (Spark Streaming). Instead, it relies on external platforms such as Hadoop (where unstructured sources are concerned) for data integration functions.

Furthermore, Spark doesn’t logically fit into the LDW hub zone, which supports data aggregation, governance, master data management and in-database analytics. However, Spark tools must access data and functions supported on hub zone platforms, including the relational databases that enable enterprise data warehousing. Likewise, Spark modelers may access Hadoop because it supports some hub functions (such as governance) for data types such as unstructured social media sentiment.

Data scientists' workbench of choice

What I’m pointing to is Spark’s optimal role in the LDW: the access, delivery, modeling and interaction zone. In particular, Spark’s primary effectiveness is as the workbench of choice for data scientists who interactively and iteratively explore, build and tune statistical models for machine learning, graph and streaming analytics. It’s the premier development tool for the next generation of machine-learning applications that require low-latency distributed computations. In LDW terms, it’s an in-memory tool for the fast data marts one might build on the Hadoop Distributed File System (HDFS) storage layer. Leveraging HDFS, Spark incorporates a query engine (Spark SQL), a stream-computing engine (Spark Streaming), a graph analytics engine (GraphX) and a machine-learning library (MLlib) that take it far beyond Hadoop in its capabilities as a major low-latency data science modeling tool.

That’s a fairly substantial zone within the evolving LDW. In fact, it’s the key accelerator for developer productivity in the big data, cognitive computing and machine-learning applications for which Spark is best suited. So, as you’re tracking the Apache Spark community’s roadmap for development of the open source code base, it becomes clear why enhancements such as support for R programming, new machine-learning algorithms and application programming interfaces (APIs), new math and stats functions, code generation, managed memory and cache-aware data structures are at the top of the list. All of them make the development work performed by Spark-wielding data scientists within the LDW more agile, efficient and high-performing.

How does it fit?

And as you’re considering Spark tools for your LDW, you need to consider the extent to which they fit seamlessly into a fluid query architecture with your preferred data warehouse and Hadoop platforms. The SQL dialects of the various platforms—relational, Hadoop and, increasingly, Spark—that comprise the LDW need to be accessible through an abstraction layer. This enables all applications (for example, business intelligence tools doing reporting and dashboarding or Spark development tools doing statistical modeling) that query any data within the LDW to speak one simple SQL dialect that spans it all.
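As a rough illustration of that abstraction layer, the Python sketch below routes each SQL statement to whichever platform owns the table it targets, so the calling application sees a single entry point. Everything in it is hypothetical: the `QueryRouter` class, the table names and the use of in-memory SQLite databases as stand-ins for a warehouse and a Hadoop cluster are assumptions made for illustration, not part of any IBM or Spark product.

```python
import sqlite3


class QueryRouter:
    """Hypothetical facade: one SQL entry point over several backends."""

    def __init__(self):
        self._owners = {}  # logical table name -> connection that holds it

    def register(self, table, connection):
        self._owners[table] = connection

    def query(self, table, sql, params=()):
        # The caller names the logical table; the router picks the backend.
        connection = self._owners[table]
        return connection.execute(sql, params).fetchall()


# Two in-memory SQLite databases stand in for separate platforms (assumption).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                      [("east", 100), ("west", 250), ("east", 50)])

hadoop = sqlite3.connect(":memory:")
hadoop.execute("CREATE TABLE tweets (region TEXT, sentiment REAL)")
hadoop.executemany("INSERT INTO tweets VALUES (?, ?)",
                   [("east", 0.5), ("west", -0.25)])

router = QueryRouter()
router.register("sales", warehouse)
router.register("tweets", hadoop)

# The application speaks one dialect and never touches a backend directly.
sales_by_region = router.query(
    "sales",
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
negative_regions = router.query(
    "tweets", "SELECT region FROM tweets WHERE sentiment < ?", (0,))
```

A real abstraction layer would also have to translate between dialects and push work down to each engine. This sketch shows only the routing idea.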

SQL now pervades the Hadoop market thanks to initiatives and interfaces such as IBM Big SQL. Spark is essentially an evolution of Hadoop, adding its own query dialect—Spark SQL—and associated query processing engine as a layer that Spark applications can use instead of Big SQL, HiveQL or other query tools. Or you can, if you wish, deploy Spark SQL in a fluid query environment that can abstract all of these query interfaces to access and manipulate any data that’s stored on the data warehouse, Hadoop clusters and other platforms within the LDW.

Fluid query is already a practical reality in the LDW, thanks to IBM Fluid Query (IFQ) V1.5. IFQ extends IBM PureData System for Analytics support for fast, structured queries, enabling querying of Hadoop and other database sources from the data warehouse. It lets you perform tasks on the platform best suited for the workload. And it supports storage of infrequently accessed data in Hadoop to extend the data warehouse.

For a good overview of the recent IFQ V1.5 release, which supports Spark, I recommend this blog by Rich Hughes: "IBM Fluid Query 1.5: Extending Insights Across More Data Stores." Among the other features highlighted in Hughes’s post, IFQ supports Spark SQL V1.2.1 and 1.3. The benefit of IFQ for Spark developers is that, when using PureData System for Analytics as their data warehousing platform together with any of several Hadoop distributions (IBM BigInsights V2.1, 3.0 and 4.0; Cloudera V4.7 and 5.3; and Hortonworks V2.1 and 2.2), they can now do the following:

  • Execute Hadoop or Spark queries from PureData System for Analytics
  • Merge data from PureData with DB2, dashDB or Oracle
  • Rapidly move data from PureData to Hadoop and vice versa
  • Analyze data stored in different systems from a single query
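
To make the last capability concrete, here is a minimal Python sketch of one query that joins rows held in two separate stores. It does not use IBM Fluid Query or Spark. As an assumption for illustration, SQLite's `ATTACH DATABASE` statement stands in for the federation mechanism, with the main database playing the warehouse and the attached one playing Hadoop. The table and column names are invented.

```python
import sqlite3

# The main database plays the warehouse; the attached one plays Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS hadoop")

conn.execute("CREATE TABLE main.orders (customer TEXT, total INTEGER)")
conn.execute("CREATE TABLE hadoop.clicks (customer TEXT, visits INTEGER)")
conn.executemany("INSERT INTO main.orders VALUES (?, ?)",
                 [("ann", 120), ("bob", 80)])
conn.executemany("INSERT INTO hadoop.clicks VALUES (?, ?)",
                 [("ann", 7), ("bob", 3), ("cy", 9)])

# One statement joins rows held in both stores.
combined = conn.execute(
    """
    SELECT o.customer, o.total, c.visits
    FROM main.orders AS o
    JOIN hadoop.clicks AS c ON c.customer = o.customer
    ORDER BY o.customer
    """
).fetchall()

print(combined)
```

The application issues a single statement and the engine resolves where each table lives, which is the experience the bullet list describes.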

IBM Fluid Query V1.5 is now generally available as a software addition to PureData System for Analytics clients. To learn more about IBM Fluid Query V1.5, register for this July 29 webinar: "IBM Fluid Query - Unifying Data Access Across the Logical Data Warehouse"