InsightOut: The case for open metadata and governance

Distinguished Engineer, IBM Analytics Group CTO Office, IBM

In his blog, “InsightOut: Metadata and governance,” Tim Vincent described how important metadata is to an organization’s effective use of data. In this blog I explore why it is so hard to collect and maintain this metadata despite the wide range of metadata-enabled tools. Today, metadata exists only in silos, held in proprietary repositories and formats. As a result, metadata is not transferable between vendors’ products, and creating an enterprise view requires manual intervention and complex tools. What causes this situation, and what do we need to do to change it?

The metadata dilemma

Metadata describes data in all its forms. This includes where the data is located, how it is stored, how frequently it is changing, what it represents, how it is organized, who owns it and how accurate it is. 

When good metadata is available, the data it describes can be rapidly located and assessed for new applications and analytics. Without metadata, data-oriented projects can be delayed while the team searches for the data it needs, and even when the team finds some potentially useful data, it cannot be sure what the data represents or how it should be interpreted. In many analytics projects, this process can consume over 70 percent of the project’s resources, while the potential insight from the data is delayed and opportunities are lost.

So why do so many organizations find themselves without the metadata they need for their data-driven projects?   

Mainly it is because metadata is just too hard to collect and manage in a way that is useful. Where it exists, metadata is locked into specific tools and platforms in proprietary formats designed to support only the operation of that technology.

For example, a Business Intelligence (BI) tool maintains a rich metadata description of the reports required by an organization and of how to map those reports to queries on a wide range of data. The metadata created through the process of defining reports represents a description of the business requirements for data, along with the terminology used to describe it. However, the report building process is slow and expensive: knowledge about the data feeding the reports has to be extracted from the applications that create and maintain the source data and then encoded in the BI tool.

Similarly, a relational database maintains metadata about the structure (schema) of the data it stores. The schema is used to query the data and provides hints on how to optimize access to information. However, it does not cover what the data means, how it should be managed, who owns it or how long it should be kept. That information is locked into the application code that maintains the data in the database, and the database metadata cannot be extended to capture this additional information that would be useful to other consumers of the data.
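The gap is easy to see with any relational database. The sketch below uses SQLite and a hypothetical customer table (table and column names are illustrative only): the catalog readily yields names and types, but nothing about meaning, ownership or retention.

```python
import sqlite3

# Hypothetical customer table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, cust_nm TEXT, dob TEXT)")

# The database can describe the *structure* of the data it stores...
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type)

# ...but nothing in the schema says that cust_nm is a customer's legal name
# owned by the sales team, or that dob is personal data subject to a
# retention policy. That knowledge lives only in the application code.
```

Every mainstream database exposes some equivalent of this catalog view, and all of them stop at the same technical boundary.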

Effort spent maintaining metadata is rarely transferable between tools. Many business users today want the freedom to use their tool of choice, especially if different types of analytics are needed. When a different part of an organization selects a new tool, the metadata to drive it often has to be recreated, because little of the metadata in the established tools can be harvested.

Other tools that typically support metadata are ETL tools, which use metadata to capture information describing the systems they are integrating. Often this is a technical view, with an ability to tag data with business terms. This metadata improves the speed and quality of the integration process, but again, there is little opportunity to acquire it from the other tools and systems already installed in the environment, so the integration team starts again, building its own metadata knowledge base from scratch.

So now we get to the products that specialize in metadata management. Often these are installed into a mature and complex environment. The lack of standardization means they need specialist bridges and brokers to extract metadata from the wide variety of tools and systems. This is a huge manual effort, and it is hard for the vendors of these tools to keep up with the constant innovation in tools and data platforms today.

Even once the specialist metadata tools have collected the metadata, there is an ongoing cost to maintain it. These tools are outside observers and are not automatically updated as the landscape changes. The result is that they are often out of date and quickly cease to be a trusted source of metadata.

As an industry we have a lot of experience in using metadata to describe and reuse data. However, the organizations that use our metadata-driven tools do not get the value they could from them, because the metadata is closed and locked away in proprietary formats and repositories that sit outside the systems that create the data.

A new metadata approach is needed

As the data used by an enterprise grows in size, variety and importance, it is no longer acceptable that the gathering and maintenance of metadata remains an under-funded and neglected afterthought for data-driven organizations. Metadata management needs to become a key focus of an organization's data and analytics strategy.

Having said that, the practices around metadata management also have to change to meet the challenges of the modern business. It cannot be an offline source of information for the IT teams, rarely consulted and gradually decaying as systems evolve.  

The need for change is urgent. Cloud environments are introducing new technology and represent new silos of information. It is becoming impossible for any vendor to keep up with, or get access to, the increasing variety of data locations and technologies.

Metadata does not need to be passive either. It has a role to play in the active, automated governance of data for quality, privacy and lifecycle management. It can also drive a business-friendly information virtualization layer that simplifies the structure and naming of data assets.
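As a small sketch of what active metadata can look like, the function below masks values whose columns a metadata catalog classifies as personal data. The catalog contents and classification names here are hypothetical, not taken from any real product; the point is that the governance action is driven by the metadata, not hard-coded into the application.

```python
# Hypothetical metadata catalog: column name -> classification tags.
# These names and tags are illustrative only.
catalog = {
    "cust_nm": {"PersonalData"},
    "dob":     {"PersonalData", "DateOfBirth"},
    "region":  set(),
}

def enforce(record, metadata, masked="***"):
    """Return a copy of record with values masked wherever the
    metadata classifies the column as personal data."""
    return {
        col: masked if "PersonalData" in metadata.get(col, set()) else value
        for col, value in record.items()
    }

row = {"cust_nm": "A. Smith", "dob": "1970-01-01", "region": "EMEA"}
print(enforce(row, catalog))
```

Because the rule reads classifications from the catalog, reclassifying a column changes its treatment everywhere, with no application code changes.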

A new metadata manifesto

To enable metadata to become an active part of an organization's operation, providing insight into the organization's data on demand and in the context of the tools and systems that make advanced use of that data, the following is required:

  • The maintenance of metadata must be automated to scale to the sheer volume and variety of data involved in modern business. Similarly, metadata should be used to drive the governance of data and to create a business-friendly logical interface to the data landscape.
  • The availability of metadata management must become ubiquitous in cloud platforms and large data platforms, such as Apache Hadoop, so that the processing engines on these platforms can rely on its availability and build capability around it.
  • Metadata access must become open and remotely accessible so that tools from different vendors can work with metadata located on different platforms. This implies unique identifiers for metadata elements, some level of standardization in the types and formats for metadata, and standard interfaces for manipulating metadata.
  • Wherever possible, discovery and maintenance of metadata has to be an integral part of all tools that access, change and move information.
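To make the third point concrete, the sketch below shows one possible shape for an open metadata element: a globally unique identifier, a declared type, and attributes, serialized to a standard wire format. The type name and attributes are invented for illustration and do not come from any real product or standard.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict

# Illustrative only: a minimal open metadata element with a globally
# unique identifier, a declared type and a set of attributes, loosely
# in the spirit of the manifesto above (not any vendor's actual format).
@dataclass
class MetadataElement:
    type_name: str
    attributes: dict
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

table = MetadataElement(
    type_name="RelationalTable",
    attributes={"name": "customer", "owner": "sales", "retention_days": 2555},
)

# A standard serialization lets tools on different platforms exchange
# the element and refer back to it by its guid.
print(json.dumps(asdict(table)))
```

With stable identifiers and an agreed format, a BI tool, an ETL tool and a governance tool could all reference the same element rather than each keeping a private copy.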

We need to lower the barrier to entry into the metadata management space to broaden the number of platforms that support it by default.

Advocating an open approach to metadata

So what is the most effective way to proceed with this manifesto? There are a great number of metadata standards, but few have been widely implemented and none spans all of the types of metadata necessary to govern and manage data. Most vendors, including IBM, have preferred to use proprietary formats and APIs. These proprietary repositories will not go away, because too much has been invested in the associated tools and ecosystems around them. We need an implementation that is accepted as a basis for metadata management in new platforms and that can establish de facto standards for metadata that vendors with proprietary repositories can integrate with.

Open source is the obvious way to create an implementation with industry consensus around this common metadata capability. Apache Atlas is the lead contender for such an open source project. It is relatively new and still in the incubator stage, which means it can still evolve to become the metadata baseline we need.

So what would an open source metadata management capability look like? In the next blog I will cover how Apache Atlas could provide the core for an open metadata ecosystem.
