The power behind Apache Spark

Big Data Evangelist, IBM

Building robust analytics applications requires careful planning, many iterations and fine-tuning. Imagine reusing and dynamically updating a basic set of models more quickly than ever to address a broad set of requirements with a simple tag. For example, when developing a watch list of terms to monitor on social media, how quickly can you change the watch list or expand the data set? Or what if you developed a facial recognition application for a cloud platform to spot your dog in family photos? Could you quickly change the subject of the photos to your spouse or in-law?

Data scientists need advanced solutions to make these analytics artifacts a reality. This need is driven in large part by the explosion of unstructured text that is quite likely to hit 40 zettabytes by 2020. Apache Spark is one of the most important innovations in this regard, and it’s rapidly maturing into a power tool for development of machine learning–driven analytics applications in the world of big data.

Sparking a revolution

One maturation milestone came more than a year ago, when the Apache Software Foundation elevated Spark to top-level project status. Initially developed at the University of California, Berkeley’s AMPLab starting in 2009, Spark has not yet achieved widespread commercial adoption. Nevertheless, a growing number of organizations have put Spark into production in their advanced analytics environments.

Forward-looking organizations see Spark as a platform to complement their investments in advanced analytics, machine-learning platforms and big data frameworks such as Apache Hadoop. Specifically, Spark’s emphasis is on in-memory computing and graph-centric, machine-learning applications. Essentially, Spark is a next-generation cluster-computing solution, runtime processing environment and development framework for in-memory advanced analytics.

Currently, Spark version 1.3.1 offers a layered, distributed-computing framework that can leverage much of the Hadoop storage environment, including the Hadoop Distributed File System (HDFS). And the Spark open source community continues to gain active members, boasting more than 465 contributors at last count.

Spark’s core design feature is the ability to support iterative, distributed, parallelized algorithmic program execution entirely in memory, without the need to write out result sets after each pass through the data. This capability makes Spark well-suited for the growing range of real-time applications—such as Internet of Things applications—in which much or most of the data analysis will be performed on cached, live data, rather than stored, historical data.
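To make that concrete, here is a minimal sketch of an iterative job on cached data, written against the Spark 1.3-era Scala API. The HDFS path and the toy update rule are hypothetical, but the pattern is the one described above: load once, cache, then iterate without writing intermediate results to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch"))

    // Load the data once and pin it in cluster memory.
    // (hypothetical path; any Hadoop-supported source works)
    val data = sc.textFile("hdfs:///data/values.txt").map(_.toDouble).cache()

    // A toy iterative algorithm: gradient steps that converge on the mean.
    // Every pass reuses the cached partitions; nothing is written to disk
    // between iterations.
    var w = 0.0
    for (_ <- 1 to 10) {
      val gradient = data.map(x => x - w).mean()
      w += 0.5 * gradient
    }
    println(s"Converged estimate: $w")

    sc.stop()
  }
}
```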

Spark’s performance advantages come from parallelizing models across distributed in-memory clusters. Spark can combine SQL, streaming and graph analytics within the same application. It can access and process data stored in HDFS, Apache HBase, Apache Cassandra, and any other Hadoop-supported storage system. And it supports programming interfaces for Java, Python, R and Scala.

Packing a punch

Spark includes the following runtime engines that are optimized for in-memory processing, streaming and high-performance parallel graph analysis, as well as associated libraries:

Spark SQL: Spark provides a Spark SQL module for querying and working with structured data. It supports seamless mixing of SQL queries with Spark programs, and enables end users to query structured data as a distributed data set in Spark, with integrated application programming interfaces (APIs) in Java, Python and Scala. This tight integration makes running SQL queries alongside complex analytics algorithms easy. Functions can be applied to results of SQL queries. 
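As a minimal sketch of that mixing, the following Spark 1.3-style Scala program builds an ordinary RDD, exposes it to SQL as a temporary table, and then applies a regular DataFrame function to the query result. The Purchase schema and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; defined at top level so Spark can infer a schema.
case class Purchase(user: String, amount: Double)

object SqlMixSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlMixSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // An ordinary RDD built in code...
    val purchases = sc.parallelize(Seq(
      Purchase("alice", 30.0), Purchase("bob", 12.5), Purchase("alice", 7.5)))

    // ...becomes a table that plain SQL can query.
    purchases.toDF().registerTempTable("purchases")
    val totals = sqlContext.sql(
      "SELECT user, SUM(amount) AS total FROM purchases GROUP BY user")

    // The SQL result is itself distributed data, so regular Spark
    // functions apply directly to it.
    totals.filter(totals("total") > 20.0).collect().foreach(println)

    sc.stop()
  }
}
```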

With the ability to load and query data from various sources, Spark SQL supports unified data access and provides a single interface for efficiently working with structured data, including Apache Hive tables, Parquet files and JavaScript Object Notation (JSON) files. The interface can also be used for querying and joining different data sources. Spark SQL is compatible with Hive and can run unmodified Hive queries on existing data stores: it reuses the Hive front end and metastore, offering full compatibility with existing Hive data, queries, SerDes and user-defined functions (UDFs). It also includes a server mode with industry-standard Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) support, so that existing analytics tools can work with Spark queries. And Spark SQL uses the same engine for both interactive and long-running queries, providing mid-query fault tolerance and scalability to large jobs.
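Continuing with the sqlContext from the previous sketch, unified data access might look like the following; the file paths and column names are hypothetical, and jsonFile and parquetFile are the Spark 1.3-era loaders.

```scala
// Two different storage formats, one query interface.
// (hypothetical paths and schemas)
val people = sqlContext.jsonFile("hdfs:///data/people.json")
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
people.registerTempTable("people")
events.registerTempTable("events")

// Join JSON-backed and Parquet-backed tables in one SQL statement.
val joined = sqlContext.sql(
  """SELECT p.name, e.action
    |FROM people p JOIN events e ON p.id = e.userId""".stripMargin)
joined.show()
```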

Spark Streaming: The Spark Streaming API captures streaming data in mini-batches over a set time interval. Where Hadoop has abstraction layers in MapReduce and YARN, Spark relies on several abstraction layers of its own: resilient distributed datasets (RDDs), discretized streams (D-Streams) and resilient distributed graphs (RDGs).

An RDD is a read-only, partitioned collection of records. Each RDD carries the instructions for its own lineage, transformation and persistence, which is what allows Spark to rebuild it on demand. RDDs make up a distributed set of objects that can be cached in memory across multiple Spark cluster nodes and automatically reconstruct themselves upon failure.
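A short sketch of that lineage in practice, assuming the sc from earlier and a hypothetical log file: each transformation below records how its output derives from its input, and toDebugString prints the lineage graph that Spark would replay after a node failure.

```scala
// Build a small lineage chain; the path is hypothetical.
val raw = sc.textFile("hdfs:///data/app.log")
val errors = raw.filter(_.contains("ERROR")).cache()

// Print the lineage graph Spark keeps for fault recovery.
println(errors.toDebugString)

// The first action materializes and caches the partitions...
println(errors.count())

// ...so this second pass is served from cluster memory.
println(errors.filter(_.contains("timeout")).count())
```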

A D-Stream is a continuous sequence of RDDs representing a data stream; D-Streams are created from live ingested data or generated by transforming other D-Streams. These streams are automatically divided into batches, persisted in distributed memory, replicated for fault tolerance and processed in short time-burst intervals. Results of D-Stream processing are themselves RDDs that can be processed by ordinary Spark code.
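Here is a minimal Spark Streaming sketch in the same vein: it turns a socket feed into five-second mini-batches, counts words per batch, and handles each batch's result as an ordinary RDD. The host, port and batch interval are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")
    // Each five-second interval becomes one RDD in the D-Stream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A D-Stream of text lines from a socket source (hypothetical host/port).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Every processed batch arrives as a plain RDD.
    counts.foreachRDD(rdd => rdd.take(10).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```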

GraphX: Spark includes the GraphX graph-analysis engine. GraphX extends RDDs to introduce RDGs, which associate records with vertices and edges in a graph and provide a collection of expressive computational primitives. Using far less code than other graph abstractions, RDGs distribute graphs as tabular data structures for parallel, distributed, fault-tolerant, in-memory execution in Spark. GraphX also enables end users to interactively load, transform and compute on massive graphs.
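A minimal GraphX sketch with a hypothetical three-person "follows" graph: the vertex and edge sets are plain RDDs of tabular records, and pageRank is one of the built-in primitives mentioned above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphSketch"))

    // Vertices and edges are just RDDs of records (hypothetical data).
    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

    val graph = Graph(vertices, edges)

    // A built-in primitive: rank vertices by link structure.
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(vertices).collect().foreach { case (_, (rank, name)) =>
      println(s"$name: $rank")
    }

    sc.stop()
  }
}
```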

MLlib: Spark also provides MLlib, a scalable machine-learning library that consists of common learning algorithms, utilities and APIs. Chief among these components are classification, collaborative filtering, dimensionality reduction, clustering and regression, as well as underlying optimization primitives.
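As one small example of those components, the following sketch clusters a handful of hypothetical feature vectors with MLlib's k-means implementation; a real job would load its vectors from a distributed store rather than hard-coding them.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object ClusteringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ClusteringSketch"))

    // Toy two-dimensional feature vectors (hypothetical data).
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
      Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 8.9))).cache()

    // Cluster into two groups with up to 20 iterations.
    val model = KMeans.train(data, 2, 20)
    model.clusterCenters.foreach(println)

    // Assign a new point to its nearest cluster.
    println(model.predict(Vectors.dense(0.05, 0.1)))

    sc.stop()
  }
}
```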

Hungry for more information on Spark? Get started learning more about Spark today, and register for Spark Summit in San Francisco, California, June 15–17, 2015. For further details, check out IBM BigInsights 4.0, an enhanced solution with Spark support.


Co-authors

Joel Horwitz: Joel Horwitz is the World Wide Director of Marketing for the IBM Analytics Platform. He graduated from the University of Washington in Seattle with a master's degree in Nanotechnology with a focus in Molecular Electronics, and holds an International MBA in Product Marketing and Financial Management from the University of Pittsburgh. Joel designed, built and launched new products at Intel and Datameer, resulting in breakthrough innovations. He set and executed strategies at AVG Technologies that led to accretive acquisitions, and established a big data science team and the first Hadoop cluster in Europe. Most recently, he spearheaded new branding and positioning for Alpine Data Labs and H2O, producing a new website, sales collateral and product messaging that significantly increased sales.

Kimberly Madia: Kimberly Madia is a World Wide Product Marketing Manager for the IBM Big Data Portfolio, focused on InfoSphere Streams. She has been with IBM since 2001. Kimberly earned an undergraduate degree in Computer Science from Allegheny College and an MBA in Strategy and Information Technology at Carnegie Mellon University. During her career at IBM, she has worked as a technical support representative and a business partner enablement manager. Currently, Kimberly is focused on developing solutions to support big data initiatives around stream computing and security intelligence. She has published numerous articles in IT publications and is a regular speaker at tradeshows, user groups and web conferences.