Connecting with data at Spark Summit Europe 2016

Developer Advocate, Cloud Data Services, IBM

Last week, I attended Spark Summit Europe in Brussels in my quest to learn all there is to know about Apache Spark. While there, I explored a range of developments in the Spark world and presented a demo introducing the IBM Data Science Experience and PixieDust.

Connecting your data source to Spark

At the summit, I noticed several talks that aimed to simplify connecting data sources to Spark. In one, Ross Lawley of MongoDB explained how to connect Spark to your own data source using the MongoDB connector for Spark.

Similarly, Dvir Volk and Shay Nativ, of Redis Labs, discussed ways of accelerating machine learning using Redis Modules. The Spark–Redis Module, for example, provides access to Redis data structures from Spark as Resilient Distributed Datasets (RDDs). Storing machine learning models in Redis can speed classification and other forms of processing while avoiding the costs of first loading model data into Spark.

Both MongoDB and Redis are offered by IBM on Bluemix, where, not surprisingly, you’ll also find Spark. After hearing these presentations, I’m itching to try these connectors myself; they’ll fit right in with the one I’m already using to connect Spark and Cloudant.

The cutting edge of Spark use

Several other presentations opened a window onto how Spark is being used. Josef Habdank of Infare Solutions talked about how to do ensemble model training in SparkML. He introduced a way of creating an ensemble of models through data clustering, breaking huge amounts of data into chunks. This example of online model training is especially relevant for those using extra-large data sets, who sometimes need hours to train a model as a batch. Josef also demonstrated ways of predicting trends through training based on the observation of 1 billion flight prices daily. His informative talk offered a great deal of practical advice as well as ideas for using big data to train online models.
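To make the clustering idea concrete, here is a minimal sketch in plain Python rather than SparkML. It is not the approach from the talk: the 1-D k-means, the per-chunk straight-line fit, and every function name are my own assumptions, chosen only to show data being split into chunks, one small model trained per chunk, and predictions routed to the model for the nearest chunk.

```python
from statistics import mean

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def kmeans_1d(xs, k, iterations=20):
    """Tiny 1-D k-means with a deterministic start; returns k centroids."""
    ordered = sorted(xs)
    # Start from evenly spaced points of the sorted data (an assumption).
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(centroids, x)].append(x)
        # Keep the old centroid if a group ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fit_line(points):
    """Least-squares fit of y = a * x + b over (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in points) / var
    return a, my - a * mx

def train_ensemble(points, k):
    """Cluster on x, then train one line model per chunk."""
    centroids = kmeans_1d([x for x, _ in points], k)
    chunks = [[] for _ in range(k)]
    for point in points:
        chunks[nearest(centroids, point[0])].append(point)
    models = [fit_line(chunk) if chunk else (0.0, 0.0) for chunk in chunks]
    return centroids, models

def predict(ensemble, x):
    """Route x to the model of its nearest chunk."""
    centroids, models = ensemble
    a, b = models[nearest(centroids, x)]
    return a * x + b

# Made-up data with two regimes that a single line would fit badly.
data = [(x, 2 * x) for x in range(10)] + [(x, 500 - x) for x in range(100, 110)]
ensemble = train_ensemble(data, k=2)
print(predict(ensemble, 3), predict(ensemble, 105))
```

Because each chunk is trained on its own, the chunks could in principle be trained in parallel or refreshed one at a time, which is what makes this kind of split attractive for very large data sets.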

Chris Pool and Jeroen Vlek of Anchormen developed a predictive model that uses sensor data to predict whether a rail switch will break, allowing timely scheduling of maintenance and repairs for the Dutch railway system, the busiest in the world. To help avoid the delays that follow a breakdown, the model uses time series data collected from rail switches to predict failures. To do so, however, it must learn which deviations in the data indicate an upcoming malfunction, a perfect task for Spark.
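As a rough illustration of what "deviations in the data" can mean, the sketch below flags readings that stray far from a trailing window of recent values. This is a generic rolling z-score check, not the Anchormen model; the function name, the window size, the threshold, and the sample readings are all assumptions for illustration. A real system would still have to learn which of these deviations actually precede a failure.

```python
from collections import deque
from statistics import mean, pstdev

def flag_deviations(readings, window=5, threshold=3.0):
    """Return the indices of readings that differ from the mean of the
    previous `window` readings by more than `threshold` standard deviations."""
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # A perfectly flat window has sigma == 0; skip it rather than divide.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                flagged.append(i)
        recent.append(value)
    return flagged

# Hypothetical sensor readings from one switch, with a single spike.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 14.0, 10.0, 9.9]
print(flag_deviations(readings))
```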

Elsevier’s Reza Karimi explained how to use Spark to analyze coauthorship graphs by building a mentorship model using data from publications listed in Scopus. In the graph, each node is an author and each edge is a publication. In such a situation, finding the most likely mentor for each author among coauthors is a ranking and classification problem. The result is an academic family tree that can be used to produce recommendations while identifying conflicts of interest among potential reviewers.
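Here is a small sketch of the graph structure and a naive ranking, to show the shape of the problem. The heuristic is purely my own assumption, not the model from the talk: a candidate mentor is any coauthor who started publishing earlier, and candidates are ranked by the number of shared publications. The records and names are invented.

```python
from collections import defaultdict

# Hypothetical records: (publication id, year, authors).
publications = [
    ("p1", 2001, ["ana", "ben"]),
    ("p2", 2008, ["ben", "cho"]),
    ("p3", 2009, ["ben", "cho", "dev"]),
    ("p4", 2010, ["ana", "cho"]),
]

def first_years(pubs):
    """Year of each author's earliest publication."""
    first = {}
    for _, year, authors in pubs:
        for author in authors:
            first[author] = min(year, first.get(author, year))
    return first

def coauthor_counts(pubs):
    """counts[a][b] = number of publications (edges) shared by a and b."""
    counts = defaultdict(lambda: defaultdict(int))
    for _, _, authors in pubs:
        for a in authors:
            for b in authors:
                if a != b:
                    counts[a][b] += 1
    return counts

def likely_mentor(author, pubs):
    """Pick the more senior coauthor with the most shared publications.
    Ties go to the coauthor who started earlier, then to the name."""
    first = first_years(pubs)
    counts = coauthor_counts(pubs)
    candidates = [c for c in counts[author] if first[c] < first[author]]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (counts[author][c], -first[c], c))

for name in ["ana", "cho", "dev"]:
    print(name, "->", likely_mentor(name, publications))
```

Chaining each author to a likely mentor in this way is what yields the family-tree structure described above.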

I also had the chance to hear Oscar Castañeda-Villagrán, from the Universidad del Valle de Guatemala, discuss the use of SparkSheet to transform Excel spreadsheets into Spark DataFrames. Doing so involves program transformation, which converts a program in one form into an equivalent program in another. In this case, code-to-code transformation is required, which in turn requires a grammar. XLParser, a parser for Excel formulas, produces a parse tree, and a pretty-printer written over that tree can then generate code. To echo Michael Friess, imagine a world in which business analysts could use spreadsheets to prototype a data and machine learning pipeline for Spark.
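The parse-tree-then-pretty-printer pipeline can be sketched in a few lines of Python. This toy does not use XLParser or SparkSheet, and I don't know what code SparkSheet actually emits. The mini grammar, the function names, and the mapping of a cell such as A1 to a column named "A" are all assumptions. The output is a string that would be valid PySpark, given `col` and `lit` from `pyspark.sql.functions`.

```python
import re

# A number, a cell reference (column letters captured, row dropped), or one symbol.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Z]+)\d+|(\S))")

def tokenize(formula):
    tokens = []
    for number, column, symbol in TOKEN.findall(formula.strip().lstrip("=")):
        if number:
            tokens.append(("num", number))
        elif column:
            tokens.append(("cell", column))
        else:
            tokens.append(("op", symbol))
    return tokens

def parse(tokens):
    """Grammar: expr := term (('+'|'-') term)*
                term := factor (('*'|'/') factor)*
                factor := number | cell | '(' expr ')'"""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of formula")
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        kind, value = advance()
        if kind in ("num", "cell"):
            return (kind, value)
        if (kind, value) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise ValueError("expected ')'")
            return node
        raise ValueError(f"unexpected token {value!r}")

    def binary(operand, operators):
        node = operand()
        while peek()[0] == "op" and peek()[1] in operators:
            _, op = advance()
            node = ("bin", op, node, operand())
        return node

    def term():
        return binary(factor, "*/")

    def expr():
        return binary(term, "+-")

    tree = expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree

def to_spark_expr(node):
    """Pretty-print the parse tree as a PySpark column expression string."""
    if node[0] == "num":
        return f"lit({node[1]})"
    if node[0] == "cell":
        return f'col("{node[1]}")'
    _, op, left, right = node
    return f"({to_spark_expr(left)} {op} {to_spark_expr(right)})"

def translate(formula):
    return to_spark_expr(parse(tokenize(formula)))

print(translate("=A1+B1*2"))
```

The grammar is what makes operator precedence come out right: multiplication binds tighter than addition because `term` sits below `expr` in the parse.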

Looking to the cloud for sentiment analysis

After talking to several summit attendees, I noticed that although many people have yet to begin working in the cloud, a great number are already looking at their options for doing so. With that in mind, at the Spark Summit I presented a Twitter streaming demo that used the IBM Watson Tone Analyzer API and Spark to analyze, in real time, the sentiments expressed in tweets. If you want to see the IBM Watson Tone Analyzer in action, you can try it out yourself after you sign up for the IBM Data Science Experience.