Hadoop Summit 2015, Day 2: Managing Hadoop's indistinct market boundaries

Big Data Evangelist, IBM

The second day at Hadoop Summit was a lot like the first. Come to think of it, this year’s Hadoop Summit was a lot like the past few years’ summits. The experience wasn’t déjà vu, just the sense that the same core themes continue to shape this emerging industry segment as it pushes toward broader adoption.

Here, I’ll purposely avoid the standard theme of Apache Hadoop positively transforming the world. That theme was the gist of keynotes on day one from Hortonworks, Microsoft and Forrester Research. You don’t have to be cynical to recognize that this theme is pretty much the lead message the industry uses to sell every hot new information technology, not just big data analytics.

Hitting pay dirt

The dominant theme of the keynotes on day two was the progressive hardening of enterprise-grade Hadoop. That theme was the explicit focus of keynotes from EMC, Hortonworks and SAP. In different ways, each inspired us with discussions of how Hadoop is maturing, gaining acceptance, growing more standardized and being engineered through and through with security, manageability and other essential tools.

On day two, EMC, Hortonworks and others presented this theme the way any industry does when its sales pitch is hitting pay dirt but it still has to prove itself in the eyes of some potential buyers. However, another unspoken theme was at work in this year’s Hadoop Summit, and it’s even more obvious than in previous years. Quite simply, the Hadoop market’s boundaries remain indistinct, overlapping considerably with adjacent NoSQL segments and growing fuzzier as new subprojects compete to be mentioned in the same breath with Hadoop’s undisputed core: Hadoop Distributed File System (HDFS) and MapReduce.

Next week’s Spark Summit, which will feature one of the most promising projects that the Hadoop community prefers to think of as one of its own, shows that this market is growing softer around the edges every day. On the eve of day one, these soft market boundaries were laid bare in my recent blog on the Open Data Platform (ODP) meetup.

At its heart, ODP is an attempt to define a hard but potentially expandable core of subprojects at the center of what everybody agrees is Hadoop. At the meetup, attendees were asked to nominate the projects they want added to the ODP core. Their responses (Apache HBase, Apache Hive, Apache Kafka, Apache Oozie, Apache Spark, Apache Sqoop, Apache Storm, Kerberos and Apache ZooKeeper) barely hinted at the broader list of Hadoop ecosystem projects forming the substance of many sessions at this year’s conference. In my day one blog, I listed some of the other projects (Apache Drill, Apache Flink, Apache Flume, Apache Mesos, Apache Tez and Hadoop YARN) that have an equal or better claim to essential Hadoop-hood than those nominated in the meetup.

Reining in frontiers

I’ve been rolling these contrasting themes around in my head, and they seem to be in direct conflict. You can’t easily stabilize, harden and manage a moving target. If no two Hadoop deployments incorporate the exact same set of software components, defining standardized reference implementations becomes difficult. And if such reference implementations are undefined, then building, deploying and administering the requisite performance, availability, security and other tools become difficult and expensive for vendors and end users.

And all these contrasting themes are in possible conflict with the theme of Hadoop’s potential transformative power. If deploying and administering the myriad combinatorial varieties of Hadoop becomes an expensive ordeal, enterprises will require more resources than ever in IT to keep it all under control. That requirement would result in fewer resources being available to business functional groups for development of the Hadoop-fueled applications that deliver value.

Nobody is suggesting that the industry discard any of these other projects until their devotees have had their chances to gain broad acceptance for them. And future Hadoop Summits would be boring affairs if the sessions focused only on a few core projects and only dished out industry-consensus viewpoints on how to manage them.

Speaking of exciting new frontiers in the Hadoop ecosystem, register for Spark Summit, June 15-17, 2015 in San Francisco, California, to take a deep dive into Spark. In addition, join data scientists and other big data professionals June 15 at Galvanize, San Francisco for a Spark community event. Hear how IBM and Spark are changing data science and propelling the insight economy. Sign up to attend in person (or watch the livestream) and to receive a reminder notification on the day of the event.