Building a cognitive data lake with ODPi-compliant Hadoop
For today’s data scientists and data engineers, the data lake is a concept that is both intriguing and often misunderstood. While there are many good resources about data lakes on ibm.com and other websites, there is also a lot of hype and spin. As a result, it can be difficult to get a clear understanding of the challenges, opportunities and methods that can help companies build data lakes that deliver real business advantage.
We recently listened in to a fascinating conversation between John Mertic, Director of Program Management at the ODPi, and Neil Stokes, Worldwide Analytics Architect Leader at IBM. Putting the data lake into the broader context of today’s IT industry trends, they discussed the importance of open, interoperable data and analytics platforms in solving both traditional analytics and cognitive computing challenges.
Here are the top five things we learned from Neil and John:
1. Data lakes need to be defined by consumption patterns, not data types
There’s a school of thought that defines a data lake as a platform or set of tools for storing and analyzing large volumes of unstructured data. This definition implies that data lakes do a fundamentally different job from systems that manage and analyze other types of information, such as traditional relational database data.
Neil argues that this is a misconception. There is no such thing as “unstructured data” – there is only data whose structure has not yet been parsed. Even if you are analyzing tweets or Facebook posts, you have metadata about when and by whom the text was written, and the text itself will contain semantic patterns from which you can infer meaning. For example, a tweet that includes certain words or hashtags can be understood as referring to specific topics, themes or sentiments. If this data were completely unstructured, trying to analyze it would be a fruitless exercise, because without structure, language has no meaning.
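Neil's point can be made concrete with a few lines of code. The sketch below, using an invented tweet record whose field names are hypothetical, pulls out exactly the kinds of latent structure he describes: metadata about who wrote the text and when, plus semantic cues such as hashtags and mentions.

```python
import re

def parse_tweet(tweet):
    """Extract latent structure from a 'raw' tweet: metadata plus
    semantic cues such as hashtags and mentions."""
    return {
        "author": tweet["user"],           # metadata: who wrote it
        "timestamp": tweet["created_at"],  # metadata: when it was written
        "hashtags": re.findall(r"#(\w+)", tweet["text"]),
        "mentions": re.findall(r"@(\w+)", tweet["text"]),
    }

tweet = {
    "user": "data_fan",
    "created_at": "2017-03-01T12:00:00Z",
    "text": "Loving the new #Hadoop release from @apache! #BigData",
}
print(parse_tweet(tweet)["hashtags"])  # ['Hadoop', 'BigData']
```

Even this trivial parser turns "unstructured" text into fields you can query, group and count, which is the whole argument in miniature: the structure was always there, waiting to be parsed.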
Since the line between “structured data” and “unstructured data” is blurred, there is no reason to think that a data lake should include some types of data and exclude others. In fact, the value of the data lake concept is that it should allow you to store any kind of data, and analyze it for any purpose.
For this reason, it makes much more sense to define data lakes in terms of consumption patterns. What is the organization trying to achieve? What kinds of data will it need to analyze to meet these objectives? And therefore, what kind of analytics infrastructure does it need to build to support that analysis? Every data lake will be different, depending on what data the organization has, and what it decides to do with it.
2. It’s not all about Hadoop
In consequence, a related assumption – that “data lakes are built on Apache Hadoop” – is equally questionable.
Certainly, we should not underestimate the importance of Hadoop to the design of most data lakes. As a general-purpose platform that can handle almost any type of data that you can throw at it, Hadoop is almost certainly going to play an important role.
However, most data lakes are likely to be built using a combination of many different tools. A traditional data warehouse could be just as much of a cornerstone of such architectures as Hadoop is. Each of these tools needs to be able to work in harmony with its peers in order to build flexible data pipelines that can deliver whatever the business needs.
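To illustrate what "working in harmony" means in practice, here is a deliberately simplified sketch of a hybrid pipeline step. The structured customer records stand in for rows from a warehouse, and the semi-structured events stand in for data landed in Hadoop; all table and field names are invented for illustration, and a real pipeline would use a proper integration tool rather than in-memory dictionaries.

```python
# Structured records, as they might come from a data warehouse.
warehouse_customers = [
    {"customer_id": 1, "segment": "enterprise"},
    {"customer_id": 2, "segment": "smb"},
]

# Semi-structured events, as they might land in a Hadoop cluster.
lake_events = [
    {"customer_id": 1, "event": "page_view"},
    {"customer_id": 1, "event": "download"},
    {"customer_id": 2, "event": "page_view"},
]

# Join the two sources on customer_id, the way a pipeline stage would,
# enriching each lake event with warehouse context.
segments = {c["customer_id"]: c["segment"] for c in warehouse_customers}
enriched = [
    {**event, "segment": segments.get(event["customer_id"], "unknown")}
    for event in lake_events
]

print(enriched[0])  # {'customer_id': 1, 'event': 'page_view', 'segment': 'enterprise'}
```

The point is not the join itself but the shape of the architecture: neither source replaces the other, and the business value comes from the pipeline that connects them.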
3. Interoperability is key
The ability to build data pipelines between tools depends on those tools being able to interoperate with each other. Historically, Hadoop has been a difficult platform to integrate reliably, because it consists of a collection of independently developed open source projects that evolve at different speeds. In the past, this greatly increased the risk of compatibility issues and made integration with third-party tools unreliable.
The work of the ODPi, supported by IBM and many other leading providers of Hadoop distributions, has already made great strides towards improving this situation. Open, standards-compliant Hadoop distributions make it possible for vendors to build tools that can harness the power of Hadoop and Spark without worrying too much about integration problems.
As a result, it is becoming easier than ever before to build data lakes based on hybrid analytics architectures that can coordinate on-premises and cloud-based technologies, combine relational and non-relational data, and solve almost any problem.
In particular, Neil raises the point that cognitive analytics techniques such as machine learning and natural language processing stand to benefit from this type of general-purpose analytics landscape. Typically, cognitive techniques depend on training an artificial intelligence (AI) on a corpus of data – and the larger and more varied the corpus, the better the results.
By removing the barriers between types of data, and making it easy to feed the AI with the data it needs to learn and evolve, data lakes lay the foundation for cognitive analytics strategies.
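The idea of "training on a corpus" can be shown at toy scale. In the sketch below, the labelled examples are invented and the model is a bare-bones word-frequency scorer rather than a real NLP technique, but the workflow is the one Neil describes: learn from a body of examples, then apply what was learned to new text.

```python
from collections import Counter

# A tiny, invented training corpus of labelled examples.
corpus = [
    ("great product love it", "positive"),
    ("terrible service never again", "negative"),
    ("love the support great team", "positive"),
    ("awful terrible experience", "negative"),
]

# "Train": count how often each word appears under each label.
counts = {"positive": Counter(), "negative": Counter()}
for text, label in corpus:
    counts[label].update(text.split())

def classify(text):
    """Score new text against the word counts learned from the corpus."""
    scores = {
        label: sum(c[word] for word in text.split())
        for label, c in counts.items()
    }
    return max(scores, key=scores.get)

print(classify("great support"))  # positive
```

With four examples the model is useless in practice, which is precisely the point of the "larger and more varied the corpus, the better" observation: the technique only becomes powerful when a data lake can feed it data at scale.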
4. Standardization enables innovation
The standardization of technologies like Hadoop is also important for users, because it helps them avoid being locked into a specific vendor’s implementation of the technology. This is not only important for commercial reasons – it also reduces the risk of investing in what may turn out to be a dead-end version of the technology.
The same is true for vendors. If enough customers are using an ODPi-compliant version of Hadoop, it becomes worthwhile for vendors to invest in developing innovative tools that integrate with it, because they know there is a market for such tools. This can help to accelerate the pace of innovation, particularly among smaller vendors who could not afford the risk of developing applications for what might turn out to be a niche proprietary platform.
The fact that standardization encourages innovation from smaller vendors might be seen as a good reason for larger vendors to resist it. However, most of the major players, including IBM, have embraced ODPi compliance instead of opposing it.
John believes that the fact that IBM and other vendors are members of ODPi demonstrates their understanding that “a rising tide lifts all boats”. And as Neil points out, IBM has been supporting open source software for many years, from its work on the Eclipse development tools in the early 2000s to its current role as one of the leading contributors to Spark and other open source projects.
5. Democratization requires good data governance
Neil believes that the next key challenge that faces the analytics community is data governance. As the emphasis shifts towards democratizing access to data and empowering business users to analyze it for themselves, the role of data scientists and engineers will shift too. Building pipelines that assure data quality, auditability and lineage will be their main focus – and this is going to require robust and versatile tooling.
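What "pipelines that assure data quality, auditability and lineage" might look like can be sketched in a few lines. The step name and record format below are invented for illustration; a real deployment would record this metadata in a governance framework such as Apache Atlas rather than an in-memory list.

```python
import hashlib
import json
from datetime import datetime, timezone

def run_step(step_name, records, transform, lineage):
    """Apply a transform to a batch and append an auditable lineage entry."""
    output = [transform(r) for r in records]
    lineage.append({
        "step": step_name,
        "input_count": len(records),
        "output_count": len(output),
        # Digest of the output, so downstream consumers can verify
        # they received exactly what this step produced.
        "output_digest": hashlib.sha256(
            json.dumps(output, sort_keys=True).encode()
        ).hexdigest(),
        "run_at": datetime.now(timezone.utc).isoformat(),
    })
    return output

lineage = []
raw = [{"amount": "10"}, {"amount": "25"}]
typed = run_step("cast_amounts", raw,
                 lambda r: {"amount": int(r["amount"])}, lineage)
print([entry["step"] for entry in lineage])  # ['cast_amounts']
```

Every batch that flows through such a pipeline leaves a trail of who transformed what, when, and with what result – which is exactly the auditability that business users need before they can be trusted to serve themselves from the lake.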
Governance and metadata frameworks such as Apache Atlas offer a lot of potential, but standardization will be required to help ensure that Atlas and other tools can interoperate seamlessly and effectively. Otherwise, data lake initiatives are likely to suffer from the same “garbage in, garbage out” problems that have plagued previous generations of analytics technologies.
Neil thinks that the successful example that ODPi has set in the Hadoop ecosystem needs to be followed by vendors and communities working in the data governance space. By learning the lessons of the past and following the good practices espoused by the ODPi and its members, there is a great opportunity to finally realize the benefits that data lakes have promised for so long.