
What is Spark?

Making the complex simple

Managing Director, Intelligent Business Strategies Limited

All the hype around Apache Spark over the last 18 months gives rise to a simple question: What is Spark, and why use it? Spark is an open source, scalable, massively parallel, in-memory execution environment for running analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster.

Big data processing

Much like MapReduce, Spark distributes data across a cluster and processes that data in parallel. The difference is that MapReduce shuffles intermediate results to and from disk, whereas Spark works on data held in memory, which makes it much faster at processing data than MapReduce.

Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are especially written to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R and Scala.

These applications execute in parallel on partitioned, in-memory data in Spark. And they make use of prebuilt analytics algorithms in Spark to make predictions; identify patterns in data, such as in market basket analysis; and analyze networks—also known as graphs—to identify previously unknown relationships. You can also connect business intelligence (BI) tools to Spark to query in-memory data using SQL and have the query executed in parallel on in-memory data.

Spark can run on Apache Hadoop clusters, on its own cluster or on cloud-based platforms, and it can access diverse data sources such as data in Hadoop Distributed File System (HDFS) files, Apache Cassandra, Apache HBase or Amazon S3 cloud-based storage. Spark consists of a number of components: 

  • Spark Core: The foundation of Spark that provides distributed task dispatching, scheduling and basic I/O
  • Spark Streaming: Analysis of real-time streaming data
  • Spark Machine Learning Library (MLlib): A library of prebuilt analytics algorithms that can run in parallel across a Spark cluster on data loaded into memory
  • Spark SQL + DataFrames: Spark SQL enables querying structured data from inside Java-, Python-, R- and Scala-based Spark analytics applications using either SQL or the DataFrames distributed data collection
  • GraphX: A graph analysis engine and set of graph analytics algorithms running on Spark
  • SparkR: The R programming language on Spark for executing custom analytics 

Scalable analytics applications can be built on Spark to analyze live streaming data or data stored in HDFS, relational databases, cloud-based storage and other NoSQL databases. Data from these sources can be partitioned and distributed across multiple machines and held in memory on each node in a Spark cluster. The distributed, partitioned, in-memory data is referred to as a Resilient Distributed Dataset (RDD).

A key Spark capability is the ability to build in-memory analytics applications that combine different kinds of analytics. For example, you can read log data into memory, apply a schema to the data to describe its structure, access it using SQL, analyze it with predictive analytics algorithms and write the predictive results back to disk. The results can be in a columnar file format for use and visualization by interactive query tools.

Technology integrations

IBM made a strategic commitment to using Spark in 2015. A number of IBM software products now integrate with Spark. Several are listed below along with a description of how each integrates.

  • IBM SPSS Analytic Server and IBM SPSS Modeler: Spark MLlib algorithms can be invoked from IBM SPSS Modeler workflows.
  • IBM BigSQL: Spark analytics applications can access data in HDFS, S3, HBase and other NoSQL data stores using IBM BigSQL, which returns an RDD for processing. IBM BigSQL can also opt to leverage Spark when answering SQL queries.
  • IBM InfoSphere Streams: Spark transformation functions, action functions and Spark MLlib algorithms can be added to existing Streams applications.
  • IBM Cloudant on IBM Bluemix: Data in Cloudant can be accessed and analyzed in Spark analytics applications in the Bluemix cloud.
  • IBM BigInsights on Bluemix: Data in IBM Open Platform with Apache Hadoop can be accessed and analyzed in BigInsights Data Scientist analytics applications using Spark in the Bluemix cloud.
  • Swift Object Storage: Data in Swift Object Storage can be accessed and analyzed in Spark analytics applications.

Spark as a service

IBM has made Spark available as a service on the cloud-based IBM Bluemix platform with a browser-based Data Science notebook. Data scientists can get up and running quickly to start developing scalable, in-memory analytics applications. This service includes support for streaming analytics in Spark, Spark machine learning and graph analysis.