Learning and laughter: Open analytics takeaways from Strata + Hadoop World

Big Data Evangelist, IBM

What will I remember best from the latest Strata + Hadoop World?

To be honest, my most interesting memory was only tangentially related to data science, Hadoop or technology in general. My favorite experience was when comedian Paula Poundstone called on me during her keynote on Thursday morning. She did so only because I was some geeky-looking dude in the front row surrounded by a smartphone, laptop, backpack, floor-strewn papers and so on. What I was doing at the time was my job: specifically, tweeting about the event on @ibmbigdata.

Long story short: I told her I was blogging for a big technology company everybody’s heard of. When she asked for the specifics of what I was saying in my blog, I shouted out the exact words I had entered on my personal Facebook about 10 seconds before: “Comedian Paula Poundstone keynotes at #StrataHadoop. Liberally drops f-bombs and insults audience for laffs.”

Needless to say, that got her started. For the next several minutes, Poundstone—who made no effort to disguise her disdain for high-tech—engaged me in an edgy dialog. While everyone else looked on, I responded to her relentless interrogation of what I was doing and why it mattered. I played straight man to her comic jabs, and I even slipped in a few laugh lines of my own. After the show, I met her backstage, assured her that I enjoyed her performance and got someone to take a photo of us together. What a great souvenir!

However, that’s not my only takeaway from Strata + Hadoop World 2016. Getting down to the business that brought me to San Jose for this event, what I really want to share is the core theme I saw everywhere at Strata: the rapid maturation of the open analytics industry. This trend was fully evident in announcements from IBM and many other solution providers on the expo floor. If you want more depth on what IBM announced, I urge you to read the blog post I published from the event on March 31. Here’s a summary of the relevant announcements:

  • A new Hadoop interoperability framework: ODPi, which IBM helped found and to which IBM remains a key contributor, released its first run-time specification, test suite and reference build.
  • A new Apache Spark mainframe-based solution: IBM released the z/OS Platform for Apache Spark, which enables data scientists to use Spark to analyze data in place on the mainframe, eliminating the need to perform extraction, transformation and loading of the data into external systems for analysis.
  • A new open source analytics partnering initiative: IBM inaugurated the Open Analytics Ecosystem program, under which we will provide incentives for partners to engage with us in the development of open source analytics tools and applications, with a key focus on Spark.

I should note that many other solution providers made significant Spark product announcements at Strata. Check out this ADTmag article for more details about IBM’s announcements as well as those from Platfora, Microsoft, Tamr, Tableau and Altiscale.

Beyond the business of vendor announcements, there were plenty of educational takeaways for anyone who wanted to deepen their understanding of data science skills and practices. For me, the best of those sessions at Strata were on Tuesday, under the Hardcore Data Science series that was sponsored by IBM (but presented by experts from many organizations). Here are the specific discussions that stimulated me most:

  • Lessons learned from building real-life machine-learning systems: Xavier Amatriain, vice president of engineering at Quora, presented a compelling discussion on the importance of ensemble modeling in today’s algorithmically rich data science initiatives. Specifically, he made an insightful point about the role of ensembles in feature engineering. He described ensembles as “the way to turn any model into a feature.” When data scientists aren’t sure whether to use one or another of the alternative algorithms that might be up to the job, stated Amatriain, they can split the difference by using individual models that incorporate each of them, and then stack those models into ensembles whose predictive potential is greater than the sum of their parts.
  • Scalable ensemble learning with H2O: Erin LeDell, a statistician and machine-learning scientist at H2O.ai, expanded on the topic of ensemble modeling. She spelled out the need for ensembles very succinctly: “When a single algorithm does not approximate the true prediction function well.” And she discussed the use cases and advantages of the principal ensemble approaches (bagging, boosting and stacking) so that even a non-data scientist like me could appreciate them.
  • BIDMach on Spark: Machine learning at the outer limits: John Canny, computer scientist and the Paul and Stacy Jacobs Distinguished Professor of Engineering at UC Berkeley, talked about ways data scientists can scale and optimize performance of their algorithmic models in diverse parallel-processing environments. The key performance considerations he highlighted included the complexity of the model’s feature space, the intensity of a model’s “compute-per-datum” requirements, and the sheer volume of data the model must ingest and analyze at various points in its lifecycle. As he explained it, data scientists need parallel-processing fabrics to execute these three principal types of models:

      • Very large feature spaces, as when analyzing objects such as URLs, user IDs and cookies
      • Very high compute-per-datum requirements, for example deep learning models for image recognition
      • Extraordinarily large data sets, such as those from sensor-laden autonomous vehicle networks

But any speedups from parallel processing of algorithmic functions will be for nothing, he contended, if data scientists don’t minimize the communications overhead between processing nodes.
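
For readers who, like me, find ensembles easier to grasp with something concrete in hand, here is a minimal sketch of the stacking idea from the Amatriain and LeDell sessions: each base model's prediction is treated as a feature, and a small meta-model learns how to weight those features. Everything in it is my own assumption for illustration, including the function names, the toy data and the choice of base models (a straight-line fit and a nearest-neighbor average). It is not H2O's API and not the code either speaker showed.

```python
import math
import random


def fit_linear(xs, ys):
    """Base model 1: least-squares straight line. Returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=3):
    """Base model 2: average of the k nearest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold(fit, xs, ys, folds=5):
    """Predict each point with a model that never saw that point's fold."""
    preds = [0.0] * len(xs)
    for f in range(folds):
        train = [i for i in range(len(xs)) if i % folds != f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), folds):
            preds[i] = model(xs[i])
    return preds


def fit_meta(p1, p2, ys):
    """Meta-model: least-squares weights for y ~ w1*p1 + w2*p2 (no intercept)."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * y for a, y in zip(p1, ys))
    b2 = sum(b * y for b, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:  # base predictions are (nearly) identical
        return 0.5, 0.5
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def fit_stack(xs, ys):
    """Stack the two base models: their predictions are the meta-model's features."""
    p_lin = out_of_fold(fit_linear, xs, ys)
    p_knn = out_of_fold(fit_knn, xs, ys)
    w_lin, w_knn = fit_meta(p_lin, p_knn, ys)
    lin, knn = fit_linear(xs, ys), fit_knn(xs, ys)  # refit on all the data
    return lambda x: w_lin * lin(x) + w_knn * knn(x)


def mse(preds, ys):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def make_data(n, seed):
    """Toy data (an assumption): a trend plus a wiggle plus noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 4) for _ in range(n)]
    ys = [x + math.sin(3 * x) + rng.gauss(0, 0.2) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_data(100, seed=0)
    model = fit_stack(xs, ys)
    print("stacked prediction at x=2.0:", model(2.0))
```

The out-of-fold step is the part that matters: the meta-model is trained on predictions the base models made for points they had not seen, so it cannot simply reward whichever model memorized the training data best. On those out-of-fold predictions the blend can do no worse than either base model alone, because weighting one model at 1 and the other at 0 is always among the candidates. That is not a guarantee about new data.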

After the Hardcore Data Science sessions, I reached out to these and the other presenters, asking for copies of their presentations. I now have most of those on disk, and I'll eagerly peruse them on the flight home.

Do you want to deepen your own skills and knowledge in data science? I urge you to register for Datapalooza on April 28 in Austin, Texas. An ongoing series of community events for data scientists, Datapalooza gatherings in this and other cities provide all-day immersive experiences that help data science professionals enhance their skills in Spark and other open analytics tools. At the same time, you can learn how to build innovative data products that run on machine learning and other advanced analytics.