Spark insights from the recent Cambridge meetup

Content Marketing Manager, Emerging Markets, IBM

Why Spark?

Over the past five years, Apache Spark has grown rapidly and evolved into a mainstream technology. Complemented by other big data technologies, Spark delivers real value when users bring it into their IBM Analytics platforms.

Attendees at the recent Spark meetup in Cambridge, Massachusetts, were eager to learn how Spark differs from Hadoop and MapReduce. “IBM thinks of it as Hadoop and Spark together, not one or the other,” said Brandon MacKenzie, Worldwide Technical Sales, Large-Scale Analytics at IBM.

Gari Singh, chief technical officer at IBM, gave the audience a detailed introduction to Apache Spark. Of the examples presented, the capabilities of the Spark Streaming library were of particular interest. Spark Streaming, an extension of the Spark Core API, enables scalable, high-throughput, fault-tolerant processing of live data streams.

Here is a quick overview of how Spark Streaming works. First, the input stream comes into Spark Streaming, where it is broken up into batches of data. Each batch is then fed into the Spark engine for processing and, finally, emitted as a batch of processed results.
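That flow can be sketched with plain Scala collections, no Spark required. This is only an analogy: `grouped` stands in for the time-based batching Spark Streaming performs, and the per-batch sum stands in for the Spark engine step.

```scala
// Conceptual sketch of the Spark Streaming flow using plain Scala collections.
// A continuous input is sliced into small batches, each batch is handed to a
// processing step, and processed results come out the other side.
val inputStream = (1 to 10).iterator   // stands in for the live input stream
val batches = inputStream.grouped(4)   // Spark slices by time (e.g. Seconds(1)); here we slice by count
val processed = => batch.sum)  // the "Spark engine" step, applied batch by batch
processed.foreach(result => println(s"processed batch result: $result"))
```

The key idea the sketch preserves is that the engine never sees the whole stream at once; it only ever works on one small batch at a time.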

Spark Streaming scenario

The following is a scenario that was used at the meeting:

Count the number of words coming in from the TCP socket.

Import the Spark Streaming classes and some implicit conversions:

  • import org.apache.spark._
  • import org.apache.spark.streaming._
  • import org.apache.spark.streaming.StreamingContext._

Create the StreamingContext object:

  • val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  • val ssc = new StreamingContext(conf, Seconds(1))

Create a DStream:

  • val lines = ssc.socketTextStream("localhost", 9999)

Split the lines into words:

  • val words = lines.flatMap(_.split(" "))

Count the words:

  • val pairs = => (word, 1))
  • val wordCounts = pairs.reduceByKey(_ + _)
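This `map`/`reduceByKey` pair is the classic word-count pattern. The same batch-level logic can be shown on plain Scala collections (a sketch, no Spark needed; `groupBy` plus a per-key sum plays the role of `reduceByKey`):

```scala
// Word count on a single micro-batch, mirroring the DStream pipeline:
// flatMap into words, map each word to (word, 1), then reduce by key.
val batch = Seq("to be or not to be")
val words = batch.flatMap(_.split(" "))
val pairs = => (word, 1))
// reduceByKey(_ + _) on a DStream performs this per-key summation, but distributed
val wordCounts = pairs.groupBy(_._1).map { case (word, ps) => (word, }
println(wordCounts)
```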

Print to the console:

  • wordCounts.print()

Note that executing each of these statements only sets the computation up; no real processing happens yet. You have to explicitly tell Spark Streaming to start, and once started, the application keeps running until the computation terminates:

  • ssc.start() // Start the computation
  • ssc.awaitTermination() // Wait for the computation to terminate
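This lazy behavior has a close analogue in plain Scala iterators: a transformation is only recorded when declared, and no work happens until something forces the result (a rough analogy, not Spark itself):

```scala
// Nothing runs when the transformation is declared...
var touched = 0
val source = (1 to 5).iterator
val doubled = { x => touched += 1; x * 2 }  // declared, but not yet executed
println(touched)  // still 0: like a DStream pipeline before ssc.start()
// ...the work only happens when the result is forced
val result = doubled.toList  // analogous to calling ssc.start()
println(touched)  // now 5
```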

If you want to run the full example, the complete code is available in the NetworkWordCount example that ships with Spark.

We encourage Boston-area Spark and machine learning enthusiasts to dive even deeper with us at the "How Smart Can You Hack with Spark" event at hack/reduce in Cambridge on May 28–30, 2015.

Still hungry for more information on Spark? Then you’re encouraged to register for Spark Summit in San Francisco, California, June 15–17, 2015. For further details on IBM’s current offering in this market, check out IBM BigInsights 4.0, an enhanced solution with Spark support.