
InsightOut: The role of Apache Atlas in the open metadata ecosystem

Frameworks for open metadata and governance

Distinguished Engineer, IBM Analytics Group CTO Office, IBM

In part 1 of this blog series, I described why we need open metadata. In this blog post I will cover how Apache Atlas could become the basis for an open metadata ecosystem. 

Introducing Apache Atlas

Apache Atlas emerged as an Apache incubator project in May 2015. It is scoped to provide an open source implementation for metadata management and governance. The initial focus was the Apache Hadoop environment although Apache Atlas has no dependencies on the Hadoop platform itself. 

At its core, Apache Atlas has a graph database for storing metadata, a search capability based on Apache Lucene and a simple notification service based on Apache Kafka. There is a type definition language for describing the metadata stored in the graph, and standard APIs for populating that metadata, which ranges from business glossary terms and classification tags to data sources and lineage.

The start of an ecosystem

What makes Apache Atlas different from other metadata solutions is that it is designed to ship with the platform where the data is stored. It is, in fact, a core component of the data platform. This means the different processes and engines that run on the platform can rely on its presence and have confidence to use its services in their operation.

In the Hadoop platform, the Apache Hive, Apache Sqoop, Apache Ranger, Apache Falcon and Apache Storm components already have integrations into Apache Atlas to populate metadata about the data they hold and the processing activity they perform. These integrations enable cross-component lineage and tag-based policies. As more components populate the Apache Atlas metadata repository, there is a network effect, creating greater understanding of the activity on the platform and a greater motivation to contribute to and use its contents.

Building on the foundation

Collecting metadata is important, but in a complex data environment it is action that counts. For Apache Atlas to live up to its potential, it needs some additional frameworks to enable automatic metadata capture and the active management of data and related assets on the platform.

The first framework is the open discovery framework. This supports analytics that can investigate, classify and characterize the data stored on the platform and add metadata about it to the metadata repository. These discovery analytics functions are plugged into the open discovery framework. They execute in a pipeline, each building on the results of the discovery analytics functions that ran before it and adding new insights in its turn.

The open discovery framework can be triggered by events in the data platform, such as the arrival of new data, or by a scheduler providing regular scans of the data.
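To make the pipeline idea concrete, here is a minimal sketch in Python. It is not an Apache Atlas API: the function names, the annotation keys and the crude "contains an @" check are all assumptions made for illustration. It shows only the shape of the idea, in which each discovery function receives the data plus the annotations produced so far and returns new annotations to add.

```python
from typing import Callable, Dict, List

# Hypothetical types and names; nothing here is part of Apache Atlas.
Annotations = Dict[str, object]
DiscoveryFunction = Callable[[List[str], Annotations], Annotations]


def profile_values(rows: List[str], annotations: Annotations) -> Annotations:
    """First stage: record basic statistics about the data."""
    return {
        "row_count": len(rows),
        "max_length": max((len(r) for r in rows), default=0),
    }


def classify_sensitive(rows: List[str], annotations: Annotations) -> Annotations:
    """Later stage: reuse the earlier profile and add a classification."""
    has_rows = annotations["row_count"] > 0
    looks_like_email = any("@" in r for r in rows)
    return {"classification": "PII" if has_rows and looks_like_email else "PUBLIC"}


def run_pipeline(rows: List[str], functions: List[DiscoveryFunction]) -> Annotations:
    """Run each function in order, accumulating the annotations it returns."""
    annotations: Annotations = {}
    for fn in functions:
        annotations.update(fn(rows, annotations))
    return annotations


result = run_pipeline(
    ["alice@example.com", "bob@example.com"],
    [profile_values, classify_sensitive],
)
```

In a real deployment the accumulated annotations would be written to the metadata repository rather than returned, and the pipeline would be started by an event or a scheduler as described above.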

Next is the governance action framework. Similar to the open discovery framework, the governance action framework is triggered by individual events, or via a scheduler. It is responsible for executing the prescribed governance actions whenever certain situations are detected. For example, if sensitive data is being copied into an unsecured repository, the governance action framework may automatically mask or filter out the sensitive data.  

Some actions are performed inline. Others are triggered through an asynchronous notification, either because the processing may take some time or because a person needs to review the situation.
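The sensitive-data example can be sketched as follows. This is an illustration under stated assumptions, not how Apache Atlas implements governance actions: the `DataEvent` fields, the `handle` function and the in-memory queue standing in for a notification channel are all hypothetical. The inline action masks the sensitive field immediately, while a notification is queued so that a person can review the copy later.

```python
import queue
from dataclasses import dataclass


@dataclass
class DataEvent:
    """Hypothetical description of an operation on a data asset."""
    operation: str        # for example "copy"
    classification: str   # taken from the asset's metadata
    target_secured: bool  # whether the destination repository is secured
    payload: dict         # the data being moved


# Stands in for an asynchronous notification channel.
review_queue: "queue.Queue[tuple]" = queue.Queue()


def mask(payload: dict, fields: tuple) -> dict:
    """Inline action: replace the values of sensitive fields."""
    return {k: ("****" if k in fields else v) for k, v in payload.items()}


def handle(event: DataEvent, sensitive_fields: tuple = ("email",)) -> dict:
    """Apply the prescribed action when the risky situation is detected."""
    risky = (
        event.operation == "copy"
        and event.classification == "PII"
        and not event.target_secured
    )
    if risky:
        # Asynchronous part: ask a person to review, without blocking the copy.
        review_queue.put(("review-unsecured-copy", event.operation))
        return mask(event.payload, sensitive_fields)
    return event.payload


record = {"name": "Alice", "email": "alice@example.com"}
masked = handle(DataEvent("copy", "PII", False, record))
untouched = handle(DataEvent("copy", "PII", True, record))
```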

Connecting the frameworks together 

For these frameworks to be triggered at the appropriate time, Apache Atlas needs to be called whenever data is accessed, created, changed or deleted. At these times, it needs to be passed details of the data being manipulated, what that data represents (this is described in the metadata) and details of the process and person that is issuing the request. An effective place to add these calls is in a connector framework.

Connectors are components used to interact with a particular type of data stored in a data store. They provide a programming API for manipulating the data that is compatible with the programming language used by the calling component, and they handle the network calls and any formatting of data necessary to exchange data between the data store and the calling component.

An example of a connector is Java Database Connectivity (JDBC). This provides a Java programming language interface to access the data in a relational database using standard SQL calls. It also has an interface for extracting the metadata (schema information) for the data.

Apache Atlas needs a connector framework supporting connectors that provide access to both data and its corresponding metadata, just like JDBC. However, this connector framework has three features that make it special.

  1. Existing standard technology connectors, such as JDBC, can be embedded in the connector framework, making use of existing technology and access methods that the developers are used to.
  2. The Apache Atlas connectors support the data sets and related assets in appropriate ways for people and tools to consume them—not in the way they happen to be stored. This means the connectors reflect the assets that are meaningful to govern from the organization's point of view and provide the perfect place to trigger the governance action framework and open discovery framework.
  3. The metadata returned by a connector directly corresponds to the asset that the connector is accessing, but it is not limited to technical metadata. It is possible to retrieve any type of metadata that is linked to this asset from classifications, business glossary terms, governance requirements, lineage and the usage history of the asset. This means, when tools use the connectors, they have easy access to all of the metadata about that asset to guide their user in its use.
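The three features above can be sketched together in a few lines of Python. This is an illustration only: `AssetConnector`, its parameters and the metadata keys are hypothetical, and Python's standard `sqlite3` module stands in for an embedded standard connector such as JDBC. The wrapper delegates data access to the embedded connector, calls a hook on every access (the place where the governance and discovery frameworks could be triggered), and returns technical metadata merged with linked business metadata.

```python
import sqlite3
from typing import Callable, Dict, List


class AssetConnector:
    """Hypothetical connector that pairs an asset's data with its metadata."""

    def __init__(
        self,
        connection: sqlite3.Connection,  # the embedded standard connector
        table: str,                      # assumed to come from trusted configuration
        linked_metadata: Dict[str, str],
        on_access: Callable[[str, str], None],
    ) -> None:
        self._conn = connection
        self._table = table
        self._linked = linked_metadata
        self._on_access = on_access

    def read(self, user: str) -> List[tuple]:
        # Hook point: a governance or discovery framework could be called here.
        self._on_access(user, self._table)
        return self._conn.execute(f"SELECT * FROM {self._table}").fetchall()

    def metadata(self) -> Dict[str, object]:
        # Technical metadata from the store, merged with linked business metadata.
        info = self._conn.execute(f"PRAGMA table_info({self._table})")
        columns = [row[1] for row in info]
        return {"columns": columns, **self._linked}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'alice@example.com')")

audit: List[tuple] = []
connector = AssetConnector(
    conn,
    "customers",
    {"glossary_term": "Customer", "classification": "PII"},
    lambda user, asset: audit.append((user, asset)),
)
rows = connector.read(user="analyst1")
meta = connector.metadata()
```

A tool using such a connector gets the rows and the surrounding context through one object, which is the point of the third feature.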

Connectors of different types can be embedded dynamically inside each other to support access to a hierarchy of increasingly sophisticated assets. Whenever a data user or programmer wants to access an asset, they use the connector framework to locate the appropriate connector. Through this connector, they can manipulate the contents of the asset as if it were a single object stored in one place, irrespective of where the pieces are located.

For example, consider a situation where an organization is receiving and storing a feed of social media data. The process that captures this social media data may store each minute's worth of social media data in a separate file. During the course of a day, there are many files created. Now consider a data scientist wanting to work with a day's worth of data. They could use an Apache Atlas connector that supports a "daily social media collection." When they use the connector, it is as if all of this data is stored in one file. The connector selects the right set of files and serves up the data as requested. Governance rules relating to the use of the daily collection of social media can be enforced through the connector since it has the context of the whole interaction with the asset.
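A rough sketch of such a "daily social media collection" connector is shown below. Everything in it is assumed for illustration: the class name, the one-file-per-minute naming scheme (`2016-05-01T0000.txt`) and the access hook are not taken from Apache Atlas. The connector selects the files for the requested day and streams their lines as if they were one file, recording who accessed the collection.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Optional


class DailyCollectionConnector:
    """Hypothetical connector presenting a day's per-minute files as one asset."""

    def __init__(
        self,
        directory: Path,
        day: str,  # assumed format: YYYY-MM-DD, used as the file name prefix
        on_access: Optional[Callable[[str, str], None]] = None,
    ) -> None:
        self._directory = directory
        self._day = day
        self._on_access = on_access

    def files(self) -> List[Path]:
        # Select only the files that belong to the requested day, in time order.
        return sorted(self._directory.glob(f"{self._day}T*.txt"))

    def read_lines(self, user: str) -> Iterator[str]:
        # Governance rules for the whole collection can be enforced here.
        if self._on_access is not None:
            self._on_access(user, self._day)
        for path in self.files():
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2016-05-01T0000.txt").write_text("post a\n", encoding="utf-8")
    (root / "2016-05-01T0001.txt").write_text("post b\npost c\n", encoding="utf-8")
    (root / "2016-05-02T0000.txt").write_text("post d\n", encoding="utf-8")

    accesses: List[tuple] = []
    connector = DailyCollectionConnector(
        root, "2016-05-01", lambda user, day: accesses.append((user, day))
    )
    lines = list(connector.read_lines(user="data_scientist"))
```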

Recognizing that metadata has fuzzy edges

In information management, we often talk about metadata, business data, reference data, master data, models, schemas and ontologies as if they were clear, distinct types of data that can be managed independently. In reality, the difference is largely contextual, and we need to blend this data together in different ways for different scenarios.

This means that Apache Atlas has to be open in many ways. Being open source and supporting appropriate open standards is one dimension. Being open to all types of metadata is another, which is achieved through Apache Atlas's extensible type system. The third dimension is being open to integration with different types of repositories and data.

For example:

  • One of the central tenets of governance is that each asset must have an owner. The metadata repository needs to record the owner of each data asset. Owners are people, and data about people is typically stored in a user registry or master data management system. Apache Atlas needs to support the linking of its metadata elements to appropriate repositories of data about people.
  • Many industries publish glossaries of terms, ontologies and standard schemas and data models. They are managed in data modeling and glossary tools. These definitions provide useful mechanisms to categorize, classify and structure the data assets—making models in general another type of artifact that Apache Atlas metadata must link with.
  • Quality checks often need to know the valid values of a field. Often these definitions are managed in reference data hubs. The Apache Atlas business glossary terms describing data fields would be enhanced if they could reference these sets of values. Similarly, the reference data stores would be enriched if the reference data could link to the Apache Atlas business glossary terms that their data represents.
  • Governance actions need to execute governance rules. There are many types of rule engines available and Apache Atlas should be open to link governance rule metadata with rules defined in an external rules engine.
  • Policies, particularly those that relate to regulations, are complex to model with many interlinking concepts and clauses in related legislation. Modelling of policies in regulated industries is often managed in specialist tools to create an inventory of obligations. In these situations, the policies in Apache Atlas need to be little more than markers that link to the appropriate policies in the inventory of obligations.
  • Management and use of data assets needs to be measured to show the effectiveness of the governance program. Apache Atlas needs to become the source of this measurement data, populating historical data marts to support reporting and analytics for governance.

All of this suggests that any metadata stored in Apache Atlas, that is, each node (entity) of metadata and each connecting edge (relationship), must be uniquely addressable from external tools with an identity that endures throughout its lifetime. This way, tools and related assets can maintain permanent links to these elements through standard mechanisms such as Open Services for Lifecycle Collaboration (OSLC), enabling all types of big data processing to get the maximum value from the metadata.
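As a small illustration of what an enduring identity means in practice, the sketch below assigns a generated identifier to every entity and relationship when it is created and never changes it afterwards, even when attributes are updated. The `MetadataStore` class and its methods are hypothetical and do not represent the Apache Atlas repository or an OSLC interface.

```python
import uuid
from typing import Dict


class MetadataStore:
    """Hypothetical in-memory store with a stable identifier per element."""

    def __init__(self) -> None:
        self._elements: Dict[str, dict] = {}

    def add_entity(self, type_name: str, attributes: dict) -> str:
        guid = str(uuid.uuid4())  # assigned once, never reused or changed
        self._elements[guid] = {
            "kind": "entity",
            "type": type_name,
            "attributes": dict(attributes),
        }
        return guid

    def add_relationship(self, type_name: str, from_guid: str, to_guid: str) -> str:
        guid = str(uuid.uuid4())  # relationships are addressable too
        self._elements[guid] = {
            "kind": "relationship",
            "type": type_name,
            "ends": (from_guid, to_guid),
        }
        return guid

    def update(self, guid: str, **changes: object) -> None:
        # Attributes may change; the identifier used to reach them does not.
        self._elements[guid]["attributes"].update(changes)

    def get(self, guid: str) -> dict:
        return self._elements[guid]


store = MetadataStore()
table_id = store.add_entity("Table", {"name": "customers"})
term_id = store.add_entity("GlossaryTerm", {"name": "Customer"})
link_id = store.add_relationship("SemanticAssignment", term_id, table_id)

store.update(table_id, name="customers_v2")
renamed = store.get(table_id)
```

An external tool that stored `table_id` before the rename can still resolve it afterwards, which is what makes permanent links possible.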

Conclusion 

Because Apache Atlas is open source and free to use, it effectively lowers the barrier to entry into metadata management. Any data platform or application managing significant amounts of data could have an embedded Apache Atlas metadata and governance capability that would simplify its data management responsibilities and make the stored data more accessible and useful. In this blog post we have covered a number of frameworks and capabilities that would enhance Apache Atlas and make it very attractive to data platform providers.

Stay tuned for part 3 of this blog series, in which I will describe how Apache Atlas and these new frameworks could operate to integrate data distributed across multiple systems located both on premises and in one or more cloud platforms.
