Spark Is the New Black

The rapid rise of the Spark distributed framework has Hadoop developers taking notice

The Apache Spark distributed framework and data processing engine continues to gain traction in the developer community. In August 2014, John Choi, an IBM Big Data & Analytics Hub blogger, described IBM’s work with Cloudera, Databricks, and Intel, saying that it enables organizations to take advantage of Spark on IBM systems running IBM® InfoSphere® BigInsights™ data analytics.[1] Since then, Spark has been garnering a lot of attention. For example, the recent Strata + Hadoop World conference in San Jose included 14 presentations that mention “Spark” in their titles.[2] Why is the developer community taking so much notice of Spark recently?

The Spark alternative

To understand Spark’s impact, consider Apache Hadoop’s role in the big data ecosystem. The most important modules in Hadoop are the Hadoop Distributed File System (HDFS), a file system that can be spread across multiple machines, and MapReduce, a framework that can split up a computing job across multiple processors. This ability to spread data and processing across several computers makes it possible to process very large amounts of data on clusters of commodity hardware. An alternative is to use a provider of cloud-based services such as Amazon Web Services (AWS) to start up a collection of remote virtual computers that need to exist only for the duration of the compute job. Both options, commodity hardware clusters and remote virtual computers, have been crucial in letting small companies work with big data.
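To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample text are assumptions made up for illustration. It only shows the shape of the model: a map step that runs independently on each chunk of data, a grouping step, and a reduce step that combines the grouped values.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted pairs by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce step: combine the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input: each string stands in for a block of a file
# that would live on a different machine.
chunks = ["spark hadoop spark", "hadoop hdfs", "spark"]

# Each call to map_phase depends only on its own chunk, so in a real
# cluster these calls could run on separate machines at the same time.
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because no map call needs to see another chunk, the work can be spread over as many machines as there are chunks, which is the property that lets commodity clusters handle large data sets.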

Machine-learning researchers at the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, found that Hadoop’s MapReduce framework was not efficient enough for the kinds of iterative processing necessary for their work. When distributing the processing and collecting the results, MapReduce does a lot of writing to and reading from disks, and the time this takes makes Hadoop jobs very batch oriented. That is, an operator starts a job, waits a while, and then checks the results, much like mainframe processing of bygone days.

In 2009, the AMPLab researchers developed Spark as an alternative to MapReduce. Spark took better advantage of memory on the distributed set of machines than MapReduce and greatly reduced the need for the disk I/O that was slowing things down. Spark also offers the added bonus of being easier to program than MapReduce because developers don’t need to split up and coordinate their logic across separate Map and Reduce routines. The AMPLab researchers donated Spark to the open source Apache Software Foundation in 2013, and members of the team went on to found Databricks, a company that helps organizations with Spark development.
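The cost of disk I/O in iterative work can be shown with a rough single-machine analogy. The sketch below is plain Python, not Spark or Hadoop code, and every name in it (`load`, `step`, `run_rereading`, `run_cached`) is hypothetical. It runs the same ten-pass refinement twice: once reloading the input from disk on every pass, and once loading it a single time and iterating over the copy held in memory.

```python
import os
import tempfile

def load(path, stats):
    """Read the numbers from disk and record that a disk read happened."""
    stats["disk_reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]

def step(estimate, data, rate=0.5):
    """One refinement pass: move the estimate toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient

def run_rereading(path, iterations):
    """Disk-bound style: every pass loads its input from disk again."""
    stats, estimate = {"disk_reads": 0}, 0.0
    for _ in range(iterations):
        estimate = step(estimate, load(path, stats))
    return estimate, stats["disk_reads"]

def run_cached(path, iterations):
    """In-memory style: load once, then iterate over the cached data."""
    stats, estimate = {"disk_reads": 0}, 0.0
    data = load(path, stats)
    for _ in range(iterations):
        estimate = step(estimate, data)
    return estimate, stats["disk_reads"]

# Hypothetical data set written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(str(x) for x in [2.0, 4.0, 6.0, 8.0]))
    data_path = f.name

try:
    rereading_result = run_rereading(data_path, iterations=10)
    cached_result = run_cached(data_path, iterations=10)
finally:
    os.remove(data_path)

print(rereading_result, cached_result)
```

Both runs arrive at the same estimate, but the first touches the disk on every pass and the second only once. On a real cluster the gap is wider, because the data is larger and intermediate results also travel through storage.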

Development with Spark

Developers use Spark by importing libraries that implement the distributed processing into their own programs so that they can call the functions that make this processing possible. Libraries that implement most of Spark are available for Python and Java programmers. However, to take full advantage of Spark’s libraries for machine learning, stream processing, data graph processing, interactive SQL queries, and other specialized feature categories, developers most often use the increasingly popular Scala programming language[3] to create Spark applications.

Scala is a more functional language than the many widely used programming languages that favor an imperative style. That is, Scala programs pass functions between processing modules to perform a larger percentage of their tasks than programs in those languages do. Scala’s functional style is a good fit for the way Spark lets developers define their own data structures and programming logic for distribution across a computing cluster.
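The idea of passing functions between modules can be sketched in a few lines. The example below uses Python rather than Scala, only to keep it short and runnable anywhere, and it does not use Spark; the helper `run_on_partitions` and the sample data are hypothetical. The helper knows how to walk over partitions and merge results, while the caller supplies the actual logic as function arguments.

```python
from functools import reduce

def run_on_partitions(partitions, transform, combine, initial):
    """Apply a caller-supplied function to each partition, then merge the results.

    `transform` and `combine` are ordinary functions passed in as arguments;
    the helper itself knows nothing about what they compute.
    """
    partial_results = [transform(part) for part in partitions]
    return reduce(combine, partial_results, initial)

# Hypothetical data, pre-split as if it were spread across three machines.
partitions = [[3, 8, 1], [10, 4], [7, 2, 9]]

# Same helper, two different jobs, selected purely by the functions passed in.
total = run_on_partitions(
    partitions, transform=sum, combine=lambda a, b: a + b, initial=0
)
largest_even = run_on_partitions(
    partitions,
    transform=lambda part: max((x for x in part if x % 2 == 0), default=float("-inf")),
    combine=max,
    initial=float("-inf"),
)
print(total, largest_even)
```

Separating the "how to distribute" code from the "what to compute" functions is what makes this style a natural fit for cluster work: the same distribution machinery can carry any logic a developer hands to it.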

The best-known Hadoop distributions have all added support for Spark, but Hadoop isn’t necessary to take advantage of Spark. Developers can use Spark with the Apache Cassandra, Apache Mesos, and AWS platforms, and the ability to run Spark on a single machine with no distributed system behind it makes it easy for novices to get started.

Spark inspiration

A recent look at the Indeed.com job search website showed that in one three-day period Apple, Dow Chemical, Major League Baseball, The Nielsen Company, Verizon, and many lesser-known companies had postings for jobs requesting Spark skills. Spark skills are so desired, in fact, that according to the O’Reilly 2014 Data Science Salary Survey, Spark developers earn the highest median salary among developers using the 10 most widely used Hadoop development tools.[4]

Spark development offers the capability to do machine learning, stream processing, and other tasks that were a bad fit for MapReduce’s batch-oriented approach. It also gives developers the ability to perform many typical Hadoop tasks much more quickly on cost-effective computing clusters. Together, these capabilities are giving organizations that deploy InfoSphere BigInsights and other distributed platforms great new ideas for applications that take better advantage of their data.

Please share any thoughts or questions in the comments.

[1] “IBM Expands Hadoop Commitment with Support for Spark,” by John Choi, IBM Big Data & Analytics Hub blog, July 2014.
[2] “Big Data at Netflix: Faster and Easier” and “How Big Data Transforms the Way Telcos Operate and Do Business,” Strata + Hadoop World conference presentations, Strata + Hadoop World conference website, February 2015 in San Jose, California.
[3] Job trends for Clojure, Erlang, Groovy, Lisp, and Scala between January 2006 and January 2015, Indeed.com website.
[4] “2014 Data Science Salary Survey,” by John King and Roger Magoulas, O’Reilly Media, November 2014.