Spark: Expanding the possible

Vice President of Product, Platfora

The best way of gauging the impact of any new technology is to examine the capabilities it delivers. Every technological advance represents a forward step in human capability, either by giving us new capabilities or by augmenting capabilities we already have. For example, the airplane gave us the gift of flight—an ability previously out of our reach—and then the jet engine accelerated the speed of our flight, tremendously enhancing a capability we already had.

Apache Spark has generated a great deal of interest and enthusiasm, in part because it not only enhances existing capabilities, but offers new ones. Spark significantly augments and optimizes the analysis that organizations already engage in, including by simplifying and accelerating data preparation and providing instant feedback on how requested changes affect data. At the same time, Spark also brings entirely new analytical capabilities to the table, including advanced statistical analysis and machine learning.

Empowering the Hadoop ecosystem

Spark enables such improvements by altering some basic assumptions about how a Hadoop environment is put together. Hadoop, though often discussed as if it were a single technology, is actually an entire ecosystem of big data technologies. Until now, the joint center of that ecosystem has been the Hadoop Distributed File System (HDFS) and MapReduce, a programming framework that enables users to process massive volumes of data across dozens, hundreds or even thousands of servers in a Hadoop cluster.

MapReduce, though unquestionably powerful, is a challenging framework in which to write programs for big data processing. It has a reputation for overcomplexity—downright clunkiness, some say—even among highly experienced and capable programmers. Spark, by contrast, provides a streamlined, powerful framework for multi-structured data processing that not only incorporates the MapReduce pattern, but is in fact replacing the MapReduce implementation for many organizations.
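For readers who have not worked with it, the MapReduce pattern itself is simple to state, even if production MapReduce programs are not. The following is a minimal single-machine sketch in plain Python, not Hadoop or Spark code; the function names and sample lines are invented for illustration. It counts words by mapping each line to (word, 1) pairs, grouping the pairs by word, and reducing each group to a total.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical input, invented for this sketch.
sample_lines = [
    "spark extends hadoop",
    "hadoop stores data",
    "spark processes data",
]
counts = word_count(sample_lines)
```

On a real cluster the framework distributes each phase across many servers. In Spark's RDD API the same three steps are typically written as a short chain of flatMap, map and reduceByKey calls.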

But Spark is far more than just an improved version of MapReduce. As a general framework for cluster computing, it opens entire new worlds of possibility for Hadoop environments, including SQL processing (via DataFrames and Spark SQL), machine learning (via Spark MLlib), graph processing (via Spark GraphX) and stream processing (via Spark Streaming).

Before the advent of Spark, an organization’s technical staff was required to integrate and maintain multiple technologies—necessitating a variety of advanced skill sets for use—to provide such capabilities. But with Spark, the same capabilities are available right out of the box. If a MapReduce environment is a shop in which craftsmen spend uncounted hours creating fine products by hand, then a Spark environment is a shop whose 3D printer rapidly produces the same products—as well as several that the craftsmen can’t produce by hand—in a fraction of the time.

Setting the standard for in-memory cluster computing

By delivering such capabilities, Spark is establishing itself as the standard in-memory cluster computing framework, shortening the time required for development of analytical programs and amplifying the effects of technical staff’s contribution by extending the reach of analytics to the entire organization.

A key to Spark’s evolution as a standard is embracing other growing standards, as demonstrated by such moves as the inclusion of the R statistical programming language via SparkR. R has experienced phenomenal growth in recent years; the quants who used to work in SAS or MATLAB are flocking to R because of its openness and community of users. SparkR provides an on-ramp for those who use R to work in the world of big data. As noted above, Spark has also expanded its machine learning capabilities, enabling users to create and execute increasingly complex analytics pipelines via the MLlib and GraphX libraries. At the same time, the ability to use popular windowing functions in Spark SQL makes it more useful for business users who want to look at data over time frames, e.g., year-over-year analysis.
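To make the year-over-year case concrete, here is a small sketch of a windowing query. It is an illustration under stated assumptions, not Spark code: it runs the SQL against Python's built-in SQLite module (which needs SQLite 3.25 or later for window functions) so that it works without a cluster, and the table name, column names and revenue figures are all invented. The LAG ... OVER construct it relies on is also available in Spark SQL.

```python
import sqlite3

# Hypothetical yearly revenue figures, invented for illustration only.
rows = [(2013, 100.0), (2014, 125.0), (2015, 150.0)]

# SQLite stands in for Spark SQL here so the sketch runs locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue (year INTEGER, amount REAL)")
conn.executemany("INSERT INTO revenue VALUES (?, ?)", rows)

# LAG looks one row back within the window ordered by year,
# pairing each year's amount with the previous year's amount.
QUERY = """
SELECT year,
       amount,
       LAG(amount) OVER (ORDER BY year) AS prev_amount
FROM revenue
ORDER BY year
"""


def year_over_year(connection):
    """Return (year, amount, growth_percent) tuples; the first year has no growth."""
    result = []
    for year, amount, prev_amount in connection.execute(QUERY):
        if prev_amount is None:
            growth = None
        else:
            growth = (amount - prev_amount) / prev_amount * 100.0
        result.append((year, amount, growth))
    return result


yoy = year_over_year(conn)
```

The window function does the row-to-row comparison inside the query itself, which is what makes this kind of time-frame analysis accessible to business users who already know SQL.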

By simplifying the use of technologies that businesses are already working with (such as R), and by supporting new ways to perform standard analytics functions, Spark is augmenting the big data analysis capabilities of the organizations that use it. Moreover, by bringing machine learning and other predictive analytics techniques to these same organizations, Spark is putting a whole new analytics paradigm within reach. As businesses push against the boundaries of possibility for big data analysis, Spark continues to reset and redefine those boundaries.

To learn more about the possibilities that Spark is opening up, check out Platfora’s recent announcement, which describes a new approach to Spark-enabled data preparation. You can also download a complimentary chapter of Advanced Analytics with Spark, an O’Reilly guide to performing large-scale data analysis using Apache Spark.

Deepen your data science explorations for an open and unified analytics platform with Hadoop and Spark, as well as with IBM resources on Spark thought leadership, IBM Big Data & Analytics Hub thought leadership content on Spark, the IBM Spark Technology Center and an IBM Big Data University Spark Fundamentals course.

And of course, be sure to register for Datapalooza, and stay tuned to see when an event is coming to a city near you.