Big data is an industry ecosystem in which open-source approaches have great momentum. Open-source platforms—including Hadoop, NoSQL and R—are expanding their footprint in advanced analytics. As the enterprise Hadoop market continues to mature and many companies deploy their clusters for the most demanding analytical challenges, data scientists are migrating toward this open-source-centric platform.
Many people, even yours truly, have occasionally stated that Hadoop is the “Linux of big data.” However, the more I roll this thought around in my head, the less comfortable I am with it. How you draw the historical analogy between Hadoop and Linux depends on the frame of reference you apply:
- Open-source server operating system: This analogy is the weakest. Linux is clearly a server operating system in the old-fashioned sense of the phrase: a kernel that manages hardware resources, execution threads, utilities, applications, file systems, device drivers, I/O, networking and peripherals. Hadoop, however, is none of these things. Instead, it is an open-source distribution that either runs on top of various server operating systems or, through hardware virtualization layers, directly on the bare metal.
- Standard open-source software platform: This analogy is the strongest, though the terms “standard” and “platform” are so vague that you can twist them to your rhetorical ends. Linux, which got started more than 20 years ago, took around a decade to become a mainstream data center staple. Hadoop, which has been in existence for less than a decade, appears to be on an even faster track to ubiquity, accelerated by the fact that most IT professionals no longer regard open-source licensing as a risky proposition. Of course, the list of popular open-source software packages is long, but Linux was the bellwether for the open-source community as a whole in the ’90s, just as Hadoop (not the only popular big-data open-source package) is for its era today. It would be a stretch, however, to refer to either as a “standard.” Linux is still far from the only server operating system in the world, just as Hadoop is—and will likely remain—only one of a growing range of big-data platforms. And neither has ever been standardized through the efforts of an open industry forum.
- Diverse open-source industry ecosystem: This analogy rides directly on the previous one, but must be broadened in scope in order to reflect the heterogeneous reality of the big-data arena. Hadoop’s adoption has clearly spawned an ecosystem of complementary offerings: advanced visualization, statistical modeling, hardware acceleration, data integration, professional services and so on. But Hadoop’s ecosystem overlaps extensively with those of the many other big-data approaches in existence: MPP EDW, NoSQL, in-memory, graph and so on.
Some people claim that “distro wars” are likely to afflict today’s Hadoop arena, along the lines of what beset the Linux arena as soon as it started to commercialize. In that same vein, some argue that Hadoop vendors, such as IBM, are likely to “lock in” their customers to “proprietary” distros. For example, the author of this recent article argues for both of those points.
I disagree with both arguments, for the following reasons:
- Open-source Hadoop licensing is widespread: Most of the leading vendors in today’s Hadoop market (IBM included) have adopted the core Apache Hadoop open-source distro in their products, have continued to refresh their products with the latest versions of all Hadoop subprojects, have not forked any of the Apache code, and have given customers the option of licensing the distro under an open-source license. If you’re interested in learning more about the licensing and other features of our Hadoop offering, IBM InfoSphere BigInsights, please click on this link.
- Vendor extensions do not render the core Hadoop distro “proprietary”: To address features that the core Apache distro has historically lacked, most commercial vendors (IBM included) have built proprietary applications, tools and other extensions. However, vendors have not taken the core distros “proprietary” in the sense of specifically “locking in” customers so that they can’t, say, easily use any other Hadoop solution that incorporates the core open-source distro.
- Core Hadoop programming interfaces are consistent across the industry: Most commercial Hadoop distros support the core programming interfaces (Pig, Hive, MapReduce, the HDFS API) that allow cross-distro portability of applications, and, increasingly, SQL issued by applications can execute against any back-end Hadoop platform. Even when vendors innovate alternatives to various layers of the Hadoop stack, they tend to preserve backward compatibility with these core programming interfaces.
- Principal commercial Hadoop offerings offer increasing customer choice: A growing number of commercial Hadoop distros have been designed to support diverse underlying storage layers (in addition to HDFS), diverse programming/execution frameworks (in addition to MapReduce), diverse query interfaces (in addition to HiveQL), and so on, offering customers and ISVs flexibility and choice.
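The portability argument above rests on applications targeting a small, stable set of interfaces rather than any one vendor's extensions. As a rough illustration of the MapReduce programming model that those interfaces expose—a plain-Python sketch of the map/shuffle/reduce data flow, not Hadoop's actual Java API—consider a word count, the canonical MapReduce example:

```python
from collections import defaultdict
from itertools import chain

# Plain-Python sketch of the MapReduce data flow. Real Hadoop jobs
# implement Mapper/Reducer classes against the org.apache.hadoop
# APIs; this only mirrors the three phases of the model.

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: aggregate the counts for one word."""
    return (key, sum(values))

def run_job(lines):
    """Run the full map -> shuffle -> reduce pipeline over input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_job(["Hadoop is open source", "Hadoop is portable"])
# counts -> {'hadoop': 2, 'is': 2, 'open': 1, 'source': 1, 'portable': 1}
```

Because every major distro honors this same map/shuffle/reduce contract, a job written against the core Apache interfaces runs unchanged regardless of which vendor's distro hosts the cluster—which is precisely why the "lock-in" charge rings hollow.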
Invoking the “proprietary lock-in” and “distro wars” bogeys, as the aforementioned author does, totally ignores the reality of today’s Hadoop market. You never hear Hadoop users complain about being locked in to any commercial provider’s platform, nor has the industry balkanized into non-interoperable Hadoop versions.
The Hadoop industry has avoided the dreaded “distro wars” and will continue to do so. Openness and choice are most solution providers’ guiding principles, even as we strive to out-innovate each other in building the value-added components that enrich your Hadoop deployments.