Highlights from day three of Strata + Hadoop World 2015

Big Data Evangelist, IBM

This is my fourth or fifth Strata conference (I’ve lost count).

I can’t help but think back to the event’s genesis earlier this decade as the spearhead for a new, Silicon Valley–centered approach to web-scale data analytics. What’s clear from the three days of this year’s New York City–based installment of this now-global event is how it’s grown in scope well beyond Hadoop (as a technology or community) and even beyond Spark. It’s become the premier industry conference for anybody who considers themselves a working data scientist.

I wouldn’t be surprised if O’Reilly takes “Hadoop” out of the name of the event going forward, though there’s little chance they’ll de-emphasize the Hadoop-related content; it remains the bedrock of the data science revolution. In my Day 1 recap blog, I began with a look at one of IBM’s core engagements at this event: the three-day “Practical Data Science” hands-on lab we’re leading here at Strata. When the high-minded words of the keynotes fade from attendees’ memories, and the brochures and swag from the expo floor are forgotten and unused, the new skills they gained, thanks to IBM, will be our biggest contribution to this particular community event.

IBM’s contributions to the data science community take many forms. In his Day 3 keynote, Shivakumar Vaithyanathan, IBM Fellow and Director of Watson Content Services, focused on our open sourcing of the System ML machine learning library for the benefit of the Spark community. He pointed out the significance of this contribution for working data scientists where it matters: productivity. He noted that a typical machine-learning model that takes 1,500 lines in MapReduce to program would require only 10 lines to program in R on System ML. That’s a 150-fold reduction in code, more than two orders of magnitude, and a major productivity boost for data scientists working in Spark, which has become the flagship data science tool for the next wave of developers.

Vaithyanathan also discussed the power of “declarative” machine learning in System ML. As he put it, “You tell the machine what to do, and the machine figures out how to do it.” Clearly, automation is a huge productivity booster for working data scientists who need to build, test and support a growing range of ML models on a growing amount of big data under tight time pressures.

Of course, Vaithyanathan wasn’t the only Day 3 keynoter. There were also stimulating talks from the CIA, the Alfred P. Sloan Foundation, MapR, Platfora, SAS, The Difference Engine, DataKind, and The New Yorker/Mastermind. Stay tuned to the conference website for those video playbacks.

I rounded out my Day 3 morning by attending the session “Launch new financial products with confidence,” presented by Beate Porst, IBM product manager for InfoSphere DataStage, and Anand Ranganathan, director of solutions with IBM Business Partner Unscrambl. They discussed Thomson Reuters’ deployment of IBM Streams for financial insights and context-aware stream computing.

It was a very practical discussion of data governance challenges in a real-time environment. As I noted in the Day 1 blog, IBM has recently stepped up its data governance emphasis in the big data arena with the BigInsights BigQuality offering for Hadoop. Check out information on this new solution today.

To accelerate your career journey into advanced analytics, Streams and Hadoop, explore this informational resource page at IBM Analytics. Also, check out the IBM integration and governance portfolio.

And don’t forget to register for IBM Insight 2015, 25–29 October, in Las Vegas.