Apache Spark and the power of openness

Big Data Evangelist, IBM

Openness is where the world is heading. It’s the core principle in truly agile governance of a dynamic, complex and smarter planet. Openness means many things to many people. Fundamentally, openness is a productivity booster in every sphere of human endeavor because it accelerates reuse, sharing, collaboration and innovation within entire industry ecosystems. 

From the perspective of a big data ecosystem, we can break down the key dimensions of an open data environment into several layers including platforms, languages, tools, application programming interfaces (APIs), expertise and data. 

Open platforms and ecosystems

Open source initiatives are transforming all big data platforms, and open source communities are fostering vibrant solution-provider ecosystems to serve diverse requirements, both in well-established markets and in leading-edge frontiers. In software markets, open source has taken root pervasively. 

One of the fastest growing and most popular open source projects is Apache Spark. Initially developed at the AMPLab at the University of California, Berkeley, starting in 2009, Spark is a next-generation, cluster-computing, runtime processing environment and development framework for in-memory advanced analytics. Spark continues the deepening of open source platforms in the big data arena by leveraging and extending Apache Hadoop framework investments into machine learning, streaming data, SQL and graph analytics use cases. The Spark open source community continues to gain active members, and boasts over 465 contributors in a recent count. 

Languages, tools and APIs

Enterprise adoption of Spark, MapReduce, Apache Pig and the R analytics modeling language is expected to continue to grow. Likewise, the industry is evolving toward open SQL specifications for fluid queries that span the full range of established databases and advanced big data platforms. Open languages, tools and APIs for all big data requirements are emerging, and Spark’s own SQL is also expected to gain adoption as support for that open source platform grows. The adoption of open standards, both those that spring from open source communities and those catalyzed through industry standards groups, is likely to accelerate as enterprises demand a common interoperability framework to unify disparate investments.

Open expertise

Just as platforms, languages and tools are opening up, big data’s development ecosystem is opening up as well. Big data can leverage crowdsourcing—highly open cloud approaches such as Kaggle and TopCoder—to pool the world’s expertise, or at least the expertise of the smart people in your company and/or value chain. This expertise can be brought together in wide-ranging development, investigation and exploration of analytics and data-infused business problems from all conceivable angles. 

Open data

In its most radical form, and in some circles its scariest, open data seems to conflict with intellectual property rights in information, and hence with a core monetizable asset of knowledge workers everywhere. But the push, both in the global culture and even among many in the public and private sectors, is to loosen controls on access, use and republishing of data, free of restrictions from copyrights and other legal and technical restraints. Open data is the focus of recent government initiatives such as the EU’s DOPA initiative. DOPA’s primary aim is interesting: to facilitate open data semantic transparency within so-called data-supply chains by spurring commercial development of technologies for automated data set detection, curation, entity linkage and advanced visualization.

One recent milestone was the establishment of the Open Data Platform (ODP) group. This initiative centers on a common Hadoop framework around which its principals—Hortonworks, IBM and Pivotal—have aligned their respective solutions. The political impetus for ODP stems from an agreement among members that the benefits of cross-vendor Hadoop interoperability far outweigh any residual competitive advantages that any solution provider might otherwise accrue from forking core technologies in its respective product portfolio. 

Because rival Hadoop platform vendors have harmonized around ODP, the market’s core technologies have crossed the inflection point to commoditization. None of the Hadoop distribution vendors, not even those who haven’t yet joined ODP, are differentiating themselves based on tweaks or forks to the core technologies, which include the Hadoop Distributed File System (HDFS), MapReduce, Apache YARN and Apache Ambari. Every solution provider has adopted the same versions of these core Apache codebases, while they all continue to differentiate through value-added tools, solution accelerators, cloud services and the like. Now evolving into its second decade since inception, Hadoop’s stability, maturity and widespread adoption have effectively standardized the core code across the industry.

Some ODP principals have discussed the initiative’s importance. At IBM, Joel Horwitz says, “With such a diverse set of members, you can be sure there have been no modifications or tweaks to the codebase protecting your Hadoop investment for years to come. Hadoop was started through community-wide collaboration. Staying true to this spirit means the continuous inclusion of new members in the ODP, not only those who were first. Membership to the ODP requires dues just as any nonprofit organization would. In fact, you don’t have to look further than the Apache Software Foundation to see this [requirement] is the case with its tiered membership as well.”

Shaun Connolly at Hortonworks isn’t afraid to invoke the word standardization in spelling out the value of ODP: “We believe innovation happens not in isolation but in collaboration. Aligning around a common core of Apache Hadoop means tearing down complexity and building interoperability across the Hadoop ecosystem. We are pleased that less than 60 days after its creation, ODP is driving industry standardization among Hortonworks, IBM and Pivotal platforms.” 

And Leo Spiegel at Pivotal highlights the practical value-add that ODP has already delivered: “The quick momentum we have around the ODP is already taking the guesswork out of fragmented and duplicative processes of what works and what doesn’t. This [approach] will enable enterprises and ecosystem vendors to focus on integrating and building business-driven applications and use cases that drive innovation.” 

Essentially, ODP signals that Hadoop is now a mature technology. I knew this day would come. As I stated in 2012 in a two-part IBM Data magazine article not long after I joined IBM, “The Hadoop market won’t fully mature and may face increasing obstacles to growth and adoption if the industry does not begin soon to converge on a truly standardized core stack. A Hadoop standards framework would provide a grand vision for the evolution of this technology in a broader big data industry context. Standards are essential so that vendors and users of Hadoop-based solutions can have assured cross-platform interoperability.” 

Openness and the road ahead

Where do we as an industry go from here? Now that vendors can reference a common ODP core, other industry initiatives that build on these foundation stones can evolve in their respective directions. HDFS and YARN, in particular, are leveraged by Spark and other open source projects. 

To get a head start on a Spark strategy, check out the recently released IBM Open Platform with Apache Hadoop (BigInsights) Version 4.0 software. Available since March 2015, Version 4.0 enhances a leading-edge big data platform with open source Spark. When extended with additional packages, the open platform helps data scientists, business analysts and enterprise big data managers to rapidly find, explore and model complex big data sets. 

To learn more about ODP and the Spark offering at IBM, please join IBM and our partners at Spark Summit, June 15–17, 2015 in San Francisco, California.

And before then, please join the Crowdchat, "What is Spark," on Thursday, May 21, at 11:00 a.m. ET. Use your Twitter account and start chatting with IBMers and other experts in the Spark community. 

This event will be well worth your while—big news is forthcoming.