The case for open metadata

Distinguished Engineer, IBM Analytics Group CTO Office, IBM

Have you ever looked at the metadata associated with a photograph? It includes the technical settings of the camera, the date and time the photo was taken and often the location. When the photo is transferred from your camera to a photo application, the metadata goes with it and is understandable by that application. More importantly, the photo application can understand the metadata from almost all cameras irrespective of their manufacturer, and you are free to choose your photo application because any photo application understands your camera’s metadata.

If you contrast this metadata-for-images example with the data that enterprises need, you see a very different situation. Many enterprises are challenged to locate, understand and use their data because it has no metadata associated with it. They may have tools that capture and maintain metadata in a repository, but this metadata is only for the private use of these tools—the format is proprietary. An enterprise using a variety of tools can end up with multiple islands of metadata covering a fraction of its data rather than an expanding landscape of valuable, openly accessible metadata.

Standards landscape

How can we make metadata management of data for enterprises as effective as photographic image metadata? Standards are of course a part of the answer, but literally hundreds of standards for different aspects of metadata exist. The problem is that each standard only covers a fragment of the total picture. We need a way to link these standards together to cover an enterprise’s data landscape and then ensure that this assembly of standards is consistently and universally adopted—a challenging project in such a fast-changing industry.

However, a well-used approach for creating industry stability and consistency does exist: open source. An open source framework that is built on standards, but also has a plug-in architecture to allow for innovation and value-add services, helps lower the barrier to entry for many organizations while allowing commercial competition and incentives to contribute. The Eclipse ecosystem serves as an example.

An open source project

In May 2015, an open source incubator project called Apache Atlas was started to create an open source metadata and governance capability. The philosophy of Apache Atlas is that the metadata repository is embedded in the data environment, and therefore all data activity is captured continuously by default. No expensive and error-prone process is needed to populate the metadata repository after the fact.

Apache Atlas has already demonstrated the benefit of having an embedded metadata capability in the Apache Hadoop platform. Different components are being extended to log their data assets and activity in Apache Atlas, which enables the capture of lineage flows through the multiple processing engines running on the platform.
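
To make the idea of lineage capture concrete, here is a minimal sketch in Python. It does not use the Apache Atlas API; the `MetadataStore` class, its methods, and the engine and asset names are all hypothetical, chosen only to illustrate how activity logged by separate processing engines can be joined into one lineage view.

```python
from collections import defaultdict


class MetadataStore:
    """Hypothetical in-memory stand-in for an embedded metadata repository."""

    def __init__(self):
        # Maps each output asset to the (engine, input asset) pairs that produced it.
        self._upstream = defaultdict(list)

    def log_process(self, engine, inputs, outputs):
        """Called by a processing engine each time it reads inputs and writes outputs."""
        for out in outputs:
            for src in inputs:
                self._upstream[out].append((engine, src))

    def lineage(self, asset):
        """Return every upstream asset of `asset`, following flows across engines."""
        seen = set()
        stack = [asset]
        while stack:
            current = stack.pop()
            for _engine, src in self._upstream.get(current, []):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# Three hypothetical engines log their activity as it happens.
store = MetadataStore()
store.log_process("ingest_engine", ["raw_orders.csv"], ["orders_table"])
store.log_process("sql_engine", ["orders_table", "customers_table"], ["orders_enriched"])
store.log_process("batch_engine", ["orders_enriched"], ["daily_report"])

# The report's lineage spans all three engines without any after-the-fact cataloguing.
upstream = store.lineage("daily_report")
```

Because each engine records its inputs and outputs at the moment it runs, the lineage query needs no separate harvesting step, which is the point of embedding the repository in the data environment.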

Can we repeat this success across the majority of data-processing platforms and transfer this captured metadata with the data as it is sent between data platforms? In particular, can cloud-based platforms embed Apache Atlas as standard, given that it is often challenging to keep track of data in a cloud service?

IBM has been investing in Apache Atlas to broaden its scope, both in the types of metadata it can support and in the ability to run on different platforms—particularly cloud platforms. We believe that data is too important to allow metadata management and governance to be an optional extra for a computing platform. Apache Atlas provides an opportunity to take an important step forward in the usefulness, safety and value associated with data-driven processes and decisions.

Suggested involvement

If the Apache Atlas project seems of interest, the following list offers suggestions for how to get involved:

  • Direct code contribution to the Apache Atlas project: Many features still need to be coded.
  • Research into automation around the identification, capture and maintenance of metadata: Automation keeps the cost of metadata management to a minimum and often improves its accuracy.
  • New standards for exchanging governance and lineage metadata among metadata repositories: This suggestion includes ways to encode metadata into data flows.
  • Encouraging vendors and partners to embrace Apache Atlas and its standards: This encouragement can extend to projects internal to your organization, and it grows the ecosystem of data and processing that is assured by metadata and governance capability.
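
As one illustration of what encoding metadata into a data flow could look like, the following Python sketch wraps each record in a JSON envelope that carries its own description. The envelope layout and the field names (`source`, `schema`, `produced_at`) are assumptions made for this example, not an Apache Atlas format or an existing standard.

```python
import json


def wrap(payload, source, schema, produced_at):
    """Bundle a record with hypothetical descriptive metadata into one JSON message."""
    envelope = {
        "metadata": {"source": source, "schema": schema, "produced_at": produced_at},
        "payload": payload,
    }
    return json.dumps(envelope, sort_keys=True)


def unwrap(message):
    """Recover the payload and its metadata on the receiving platform."""
    envelope = json.loads(message)
    return envelope["payload"], envelope["metadata"]


# The sending platform attaches the metadata before the record leaves.
message = wrap(
    payload={"order_id": 17, "total": 42.5},
    source="orders_table",
    schema={"order_id": "int", "total": "float"},
    produced_at="2016-10-24T09:00:00Z",
)

# The receiving platform gets the data and its description together.
payload, metadata = unwrap(message)
```

A shared, open envelope like this is what would let any receiving tool interpret the record, in the same way that any photo application can read a camera's metadata.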

Metadata and its governance

For more information, take a look at InsightOut: The case for open metadata and governance. Steve Lockwood and I cover the technical details of the Apache Atlas strategy for IBM at IBM Insight at World of Watson 2016, 24–27 October 2016, in Las Vegas, Nevada. We look forward to seeing you there.

And I want to thank my IBM colleagues—Cassio Dos Santos, Dave Kantor, Jay Limburn, Albert Maier, Bhanu Mudhireddy, Ernie Ostic, Tim Vincent and Dan Wolfson—for their support in making open metadata and governance with Apache Atlas a reality.
