Analytics and the cloud: The rise of open source

Executive Information Architect, IBM

This is the fourth in a series of blogs on analytics and the cloud. Read our introduction to the series. This blog concerns itself with the rise of open source software and how it is used for a whole host of analytical purposes. However, as will be seen in this blog, there are significant gaps in open source capability in this space, and there may always be.

IBM has a long history of working in open technology and has directly contributed to the open source movement's success. More than 20 years ago, IBM's strong support for Linux, with patent pledges and significant technical resources and investment in the face of the legal uncertainty surrounding its use, brought about a change in posture toward open source for many enterprises. Under the right circumstances, open source can be a compelling and trusted adjunct to proprietary software.

Thoughts on what open source really means

The rise of open source software, and of the open source projects that support it, has meant that such projects can be run by anything from a large global team of contributors to a single person. However, projects run by a single individual (or vendor) can become quite closed in their approach to governance, limiting contributions from others, which of course is far from ideal and not in the spirit of open source. There are also many examples of the consequences of investing in open source projects with a single owner. Facebook declared that it was going to discontinue Parse, leaving thousands of developers stranded. Apple acquired FoundationDB, discontinued providing downloads of the software, and said it would no longer provide support. These are just two reasonably recent examples.

Without an open approach to the governance of open source projects, we find ourselves with a greater risk of vendor lock-in or, in some extreme cases, even project abandonment, leaving customers with no options for continued development of code except using their own in-house teams. Also, those in the open source project want a voice in the decisions, and if they feel that their voice is not being heard, projects have been known to fork. This typically has a detrimental effect on the ecosystem, which increases risk.

The reality is that open technology projects managed under open governance—true open governance as we have with organizations such as Apache, Eclipse, OpenStack, Mozilla, and Linux—are demonstrably more successful (by an order of magnitude), have a longer life, and are less risky than those projects that are controlled by a single vendor, or are more restrictive in their governance.

IBM has an approach to evaluate open source projects by looking closely at five aspects of the project:

  • Responsible licensing—Obviously, IBM look to understand the open source license that is associated with the technology.
  • Accessible commit process—IBM seek to ensure that there is a clearly defined process for making contributions that welcomes outside contributors.
  • Diverse ecosystem—IBM confirm that there are multiple vendors and ISVs that are delivering offerings based on the technology.
  • Participative community—IBM require that there be a process for contributors to grow their technical eminence in the community.
  • Open governance—IBM evaluate the governance model to determine whether it is truly open.

With these considerations in place, open source projects managed in this manner will flourish to the benefit of all.

Let’s investigate some of the open source projects out there that help in the analytics arena. Note that the ODPi (Open Data Platform Initiative) is driving the interoperability across Hadoop projects and IBM’s own BigInsights adheres to these standards.

IBM and open source

Figure 1: IBM Open Platform

IBM have been heavily involved in many open source software components. The figure above shows the breadth of involvement in data and analytics related activities, ranging from data movement and exchange through to tools for the data scientist and developer. It’s an important point that within IBM we strive to build our solutions around open source to add enrichment and differentiation whilst retaining all the benefits of remaining open source.

I’ll describe some of these components below and how they are used. I’ve selected these as the components that are often cited as being part of the core analytics stack in open source solutions.


Hadoop has become the standard platform for storing data in a wide variety of formats. History may look unfavorably on Hadoop, as there are many instances of poor returns or outright failure. This is partly due to the platform being used for inappropriate activities. HDFS and MapReduce have been seen as the ‘analytics panacea’ but of course one size doesn’t fit all! Often inappropriate analytic workloads have been applied to the MapReduce model (which is, after all, a disk-based, batch-oriented model).

IBM’s own Data and Analytics Reference Architecture sees Hadoop as but one platform in a set of platforms that are required to handle all analytic requirements, from traditional BI workloads to large exploratory queries to things like text analytics. This makes much more sense to me: a combination of tools is used to manage a combination of very different problems.

Wrapped around Hadoop is a whole set of tools that enable data to be transformed, managed, secured and so on. Products such as Sqoop, Flume, Falcon, Ranger, Knox and Kafka all help to enable governance and security. Ambari, ZooKeeper and Cloudbreak aid the provisioning, managing and monitoring of all the services.


Apache Atlas is an open source metadata engine that is based on the Titan graph database. It is a standard for metadata management in the Hadoop environment and can also be used to import metadata from a variety of other sources (other metadata tools, and sources of metadata). However, it is still early days for Atlas and it needs wider adoption. IBM are working with the open source community, other vendors and clients to accelerate Atlas’s capabilities. It’s worth adding that IBM’s own Open IGC (InfoSphere Information Governance Catalog), as well as other vendor tools, are also vying for this critical space, and it’s unclear if any tool will gain superiority in the market. The advantage of open source is just that: it is an open solution driven by the open source community to meet the demands of clients. All others who wish to use this software may add it to their own tooling and develop an open ecosystem that allows metadata to be easily shared across such platforms.

Apache HBase

A columnar database that is modeled on Google’s Bigtable. It’s highly scalable, can work across distributed nodes (using the Hadoop file system), and is subject to the CAP theorem (Consistent, Available, Partition tolerant), which says you can only actually satisfy two of these properties at once. HBase runs over HDFS, so of course it is partitionable, and it exhibits strong consistency, sometimes at the expense of availability. In other words, when data is distributed across multiple partitions and copies, it can always be relied upon to be consistent across those nodes, at the expense of availability in some instances. This normally takes the form of data being locked from users whilst consistency is forced across nodes. HBase is often used as an operational database rather than an analytical one. IBM Big Match is based on HBase.


Another columnar database that supports efficient compression and encoding to help with (generally) analytic workloads where bulk data needs to be processed efficiently. It’s a good choice for keeping disk I/O low on read operations: when queries require only specific columns, only those columns need to be moved and no other superfluous data.
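To make the point about reading only the required columns more concrete, here is a minimal Python sketch. It is not the API of any particular columnar store; the sample records and the helper names (to_columns, run_length_encode and the two total_revenue functions) are assumptions made purely for illustration. It pivots a few rows into per-column lists, answers a query by touching a single column, and shows why a column of repeated values lends itself to simple encoding.

```python
from itertools import groupby

# Hypothetical sample data, stored row by row.
rows = [
    {"id": 1, "region": "EMEA", "revenue": 120.0, "notes": "renewal due"},
    {"id": 2, "region": "EMEA", "revenue": 42.0, "notes": "trial account"},
    {"id": 3, "region": "APAC", "revenue": 80.5, "notes": "new client"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_revenue_row_store(records):
    # Row layout: every full record is visited, although only 'revenue' is needed.
    return sum(record["revenue"] for record in records)


def total_revenue_column_store(cols):
    # Column layout: only the 'revenue' column is touched.
    return sum(cols["revenue"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)
print(total_revenue_column_store(columns))
print(run_length_encode(columns["region"]))
```

Both totals agree, but the columnar version never reads the id, region or notes values. Real columnar formats apply the same idea at the level of disk blocks, with far more sophisticated encodings.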


Apache Solr is actually a web application that uses Lucene ‘under the hood’. Lucene is a text search engine library that can be applied to just about any form of text-based data file. It relies on an inverted index, which records each term and where it occurs in a document. This moves from a page-centric structure, where we have pages with words on them, to a keyword-centric structure, where we have words that are linked to where they exist on pages.
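The inverted index idea can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene is implemented or called; the page contents, the simple tokenizer and the function names build_inverted_index and search are assumptions.

```python
import re
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the pages, and the word positions within them, where it occurs."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Very simple tokenizer: lower-case runs of letters and digits.
        for position, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].setdefault(page_id, []).append(position)
    return index


def search(index, term):
    """Return the ids of the pages that contain the term."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical pages: the page-centric view (pages with words on them).
pages = {
    "page1": "Open source analytics on Hadoop",
    "page2": "Spark analytics for the data scientist",
}

# The keyword-centric view: words linked to where they exist on pages.
index = build_inverted_index(pages)
print(search(index, "analytics"))
print(index["hadoop"])
```

A lookup now goes straight from a word to the pages that contain it, without scanning every page, which is what makes text search fast at scale.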

IBM’s Cloudant uses Lucene for full document searches of its database.


Apache Spark is fast becoming the analytics platform for the data scientist and beyond.

Apache Spark is split into 4 capabilities:

Spark SQL – Allows users to use the familiar SQL language to query all forms of data. There are at least three ways to build SQL-like queries, using resilient distributed datasets (RDDs), DataFrames or Datasets, depending on the version of Spark. Please see the Apache Spark guides for more information.

Spark Streaming – Similar to the streaming mentioned above, but it actually processes data in ‘micro-batches’, which enable streaming-like functions until the data becomes very fast flowing.

Machine Learning – A set of machine learning libraries that can also exploit ‘R’, which offers data scientists a very rich set of analytical functions to work with (classification models, regression, clustering, decision trees and many more).

GraphX – Enables graph-style structures to be built in memory for novel approaches to querying data that allow relationships between data to be more easily navigated. Basically, GraphX extends the Spark RDD to create a graph-style structure with properties attached to each vertex and edge. GraphX has basic operators to manipulate the graph models it builds, such as joinVertices, aggregation or subgraph creation. It can be used with all forms of data but often sits over persistent data stores such as Titan, which is also a graph database.
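The property graph idea behind GraphX, with properties attached to each vertex and edge and a handful of operators over them, can be sketched in plain Python. This is not the GraphX API and does not use Spark; the PropertyGraph class, its method names and the sample people are assumptions chosen to loosely mirror the operators named above.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)  # vertex id -> property dict
    edges: list = field(default_factory=list)     # (source id, target id, property dict)

    def subgraph(self, vertex_pred):
        """Keep vertices that pass the predicate and edges whose two ends both survive."""
        kept = {vid: props for vid, props in self.vertices.items()
                if vertex_pred(vid, props)}
        kept_edges = [(src, dst, props) for src, dst, props in self.edges
                      if src in kept and dst in kept]
        return PropertyGraph(kept, kept_edges)

    def join_vertices(self, extra, merge):
        """Merge external values into matching vertices; other vertices are unchanged."""
        joined = {vid: merge(props, extra[vid]) if vid in extra else props
                  for vid, props in self.vertices.items()}
        return PropertyGraph(joined, list(self.edges))

    def in_degrees(self):
        """A simple aggregation: count the incoming edges of each vertex."""
        counts = {}
        for _, dst, _ in self.edges:
            counts[dst] = counts.get(dst, 0) + 1
        return counts


# Hypothetical data: three people and the relationships between them.
graph = PropertyGraph(
    vertices={
        1: {"name": "Ann", "role": "analyst"},
        2: {"name": "Bob", "role": "engineer"},
        3: {"name": "Cy", "role": "analyst"},
    },
    edges=[
        (1, 2, {"rel": "works_with"}),
        (3, 1, {"rel": "reports_to"}),
        (2, 1, {"rel": "mentors"}),
    ],
)

analysts = graph.subgraph(lambda vid, props: props["role"] == "analyst")
located = graph.join_vertices({1: "London"}, lambda props, city: {**props, "city": city})
print(analysts.edges)
print(graph.in_degrees())
```

Each operator returns a new graph rather than changing the original. That choice is made here to echo the immutable style of Spark RDDs; it is a design decision of this sketch rather than a requirement.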

IBM have invested heavily in this community, building a Spark Technology Centre in San Francisco and contributing SystemML (machine learning libraries) to the open source community.

IBM’s Data Science Experience (DSX) relies on Spark as its platform.


These are notebooks that a data scientist uses to build exploratory analytics that can rapidly combine and manipulate data from many sources. Notebooks actually support many programming languages, but for the data scientist the favoured tooling is often Python, Scala or R. A notebook stores code, runs it, and captures any output along with any user notes made around each step within the notebook. In this way all aspects of the work done are captured. This is generally stored to disk as a JSON file.
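As a rough illustration of how a notebook can keep code, output and notes together in JSON, here is a small Python sketch. The field names (cells, type, source, outputs) and the run_notebook helper are hypothetical and do not reproduce the file format of any real notebook tool. For simplicity each code cell holds a single trusted Python expression, which is evaluated with eval.

```python
import json

# A hypothetical notebook: one note cell followed by one code cell.
notebook = {
    "cells": [
        {"type": "note", "source": "Total the sales figures for the quarter."},
        {"type": "code", "language": "python", "source": "sum([120, 80, 42])", "outputs": []},
    ],
}


def run_notebook(nb):
    """Run each code cell and capture its result alongside the code."""
    for cell in nb["cells"]:
        if cell["type"] == "code":
            # eval is acceptable only because the cell content is trusted sample code.
            cell["outputs"] = [repr(eval(cell["source"]))]
    return nb


run_notebook(notebook)

# Serialize the whole notebook (notes, code and captured output) as JSON,
# then read it back as another user opening the shared file would.
serialized = json.dumps(notebook, indent=2)
restored = json.loads(serialized)
print(restored["cells"][1]["outputs"])
```

Writing the serialized string to a file is all that is needed to store the notebook on disk and share it with its notes and results intact.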

Notebooks are an excellent way to capture your ‘workings’ and share those workings with anyone else in your community.

Market perceptions

For several years, IBM has been recognized as a leader in open technology. Projects that are managed under open governance have been found to be more successful, to have a longer life, and to be less risky than proprietary projects. And the time really is now: IBM’s analytics tooling is viewed very favorably by analysts in the marketplace, with Gartner and Forrester giving positive reviews in the last 6 months. The capabilities within the broad range of analytic tooling that IBM provides can satisfy clients’ needs in any of the hybrid cloud scenarios that are developed (public, private, dedicated, and traditional landscapes). IBM’s commitment to open source is enabling it to create a new, sophisticated set of tools that can sit smoothly and simply alongside its existing world-class analytics offerings.

For more on how IBM can help you get value from your data, visit our Analytics Platform page.