Spark Summit 2015, Day 2: A gathering of today’s most dynamic data scientists

Big Data Evangelist, IBM

Data scientists rule this new world of big data analytics. I've said as much many times before, and the past two weeks have only deepened that conviction.

Since Monday of last week, I’ve been in northern California for a series of industry events featuring the innovative work of today’s most dynamic data scientists. I started with Hadoop Summit 2015 in San Jose, which I documented in blogs for day 1, day 2 and day 3, respectively. I then made my way up the peninsula to San Francisco, where I spent the weekend at the campus of IBM partner Galvanize, participating in the very stimulating Spark meetup that IBM sponsored.

That alone was plenty to absorb, but this past Monday amped it to another level of intensity. Monday morning, I was part of the standing room–only crowd (most of them data scientists) that participated in day one of Spark Summit 2015, which took place at the San Francisco Hilton near Union Square. Just before lunch, I walked rapidly from that venue through the busy downtown to the Spark community event that IBM sponsored at Galvanize. There we discussed IBM's Spark announcements, which had hit the wires earlier that day. The community event lasted well into the evening.

Take a look at my blog post covering everything I experienced on Monday: the Spark Summit, which is still going on as I write this, and the community event, which lasted for a single action-packed day.

Tuesday, I plunged right back into Spark Summit—which, if anything, was buzzing more vigorously with interesting content than it had been the day before. Not surprisingly, IBM's Spark announcements were the talk of the show—though, in all fairness, MapR, Cloudera, Hortonworks, Amazon, Databricks and Intel, among many others, had important news to announce as well. I'm looking forward to next week, when, the organizers assure us, the slide presentations will be available for download. And a good thing, too! That's far too much fresh information for any one of us to absorb in its entirety in a single sitting, even were it physically possible to attend all sessions in parallel.

I took a great deal away from the general sessions on the morning of day two of Spark Summit.

First, the Spark community sat up and took notice of the depth, breadth and seriousness of IBM's strategic commitment to Spark. This was obvious from the crowds at our booth in the exhibition hall, as well as from my inability to take two steps in any direction without someone wanting to discuss what IBM is doing. I lost count of how many bottles of water I drank to keep my throat from drying up during all the conversations.

Another takeaway from the event was that the community felt positive about IBM's recognition of Spark's importance—but with a hint of ambivalence. That ambivalence was evident in the slightly apprehensive manner in which any fast-growing niche, populated mostly by startups, greets the entrance of established, diversified solution providers. The tone was implicit in Ben Horowitz's half-joking characterization, from the main stage, of IBM as the "rabbi" that has suddenly anointed Spark. Still, when one of Silicon Valley's premier venture capitalists singles us out as having moved the needle on Spark's maturation into a robust segment, we can't help but be flattered.

But what I'll probably remember most from day two of Spark Summit is a bit geekier than clever quips or expo-hall glad-handing. What I'll take away are the key points discussed by Databricks principal Reynold Xin on the technology roadmap for the evolution of Apache Spark. Considering that Databricks is the single largest contributor to that open-source project (and a key IBM partner in that work), its words carry great weight.

Essentially, Xin discussed two parallel workstreams in Apache Spark's evolution. On one hand, he discussed plans for expanding Spark's flexibility as a programming environment. Specifically, the range of programming language bindings for Spark's new DataFrames API is being expanded beyond Java, Scala and Python to include R (already in the newly released Apache Spark 1.4) and other languages in the not-too-distant future. Because every binding expresses queries against the same DataFrames abstraction, Spark applications written in any of these languages are compiled down to the same optimized execution plans by Spark's Catalyst optimizer.
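To make that concrete, here is a minimal Scala sketch of the DataFrames API as it looks in Spark 1.4 (the table, column names and data are my own invention for illustration). The same logical query could just as well be written through the Java, Python or new R bindings, and Catalyst would produce the same optimized plan:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("DataFramesSketch").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A toy DataFrame of two columns; in practice this would come from
// a JSON, Parquet or JDBC source.
val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")

// A simple relational query expressed through the DataFrames API.
val adults = people.filter($"age" > 21).select($"name")

// Print the optimized execution plan Catalyst generates for the query;
// this plan is independent of the binding language used to express it.
adults.explain()
adults.show()
```

The point of the DataFrames design is exactly this language neutrality: the optimizer works on the logical plan, not on the Scala, Python or R code that produced it.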

Another key workstream in Apache Spark is Project Tungsten, which focuses on speeding Spark code execution on diverse hardware platforms. Plans call for improvements under Project Tungsten to support optimized execution of compiled code (written to DataFrames and other Spark APIs) on FPGAs, GPUs and other hardware substrates. Key focus areas under Project Tungsten include improving Spark’s memory management, binary processing and garbage collection capabilities, as well as exploiting memory hierarchies more efficiently for cache-aware computations and improving code generation to take better advantage of modern compilers and CPUs.
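Spark's actual Tungsten internals are far more involved, but the flavor of its binary processing can be sketched with a toy Scala example (entirely my own, not Spark code): packing a fixed-width "row" directly into off-heap memory via `sun.misc.Unsafe`, which sidesteps JVM object overhead and garbage collection pressure—the kind of memory management Tungsten pursues:

```scala
import sun.misc.Unsafe

// Grab the Unsafe singleton via reflection (it is not publicly constructible).
val unsafe = {
  val f = classOf[Unsafe].getDeclaredField("theUnsafe")
  f.setAccessible(true)
  f.get(null).asInstanceOf[Unsafe]
}

// An invented two-column row layout: one Int (4 bytes) + one Long (8 bytes).
val rowSize = 12L
val addr = unsafe.allocateMemory(rowSize) // off-heap: invisible to the GC

unsafe.putInt(addr, 42)               // write column 0
unsafe.putLong(addr + 4, 123456789L)  // write column 1

println(unsafe.getInt(addr))          // read column 0 back by offset
unsafe.freeMemory(addr)               // manual cleanup—no garbage collector
```

Working directly against binary layouts like this is cache-friendly and avoids allocating one JVM object per field, which is why Tungsten's roadmap emphasizes memory management and binary processing together.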

I've only scratched the surface here, of course. As soon as I can download the presentation decks from Spark Summit, I'll dissect them further. In case you haven't noticed, I've gone very deep on Spark in the virtual pages of the IBM Big Data & Analytics Hub over the past six weeks. We all have—just take a look at the most recent BD&A blogs on the topic.

Stay tuned for continuing, deepening coverage of this important new technology, both at a business level and in considerable technical depth. You will see a steady stream of Spark commentary and guidance both on the BD&A Hub going forward and on our new Spark Technology Center blog.

And please sign up for IBM’s forthcoming Apache Spark as a Service on Bluemix.