InsightOut: Metadata and governance

It may sound boring, but it holds the key to self-service

IBM Fellow and VP, CTO for Information & Analytics Group, IBM

Traditionally metadata has been the province of data architects in the IT organization—not something likely to provoke excitement in the organization at large. Not surprisingly, then, effectively articulating the value of metadata to the business has always been difficult. Indeed, governance has suffered similar issues: Although teams concern themselves with classifying data and the policies and rules governing data’s use, most such efforts are seen as merely providing documentation for regulatory and audit purposes. Typically, the metadata conversation doesn’t even make it to the boardroom table.

Recently, however, as businesses have begun demanding ever greater access to data, metadata, once thought boring, has helped IT create data-driven enterprises. Indeed, a data-driven organization needs easy access to usable data, supplied through a strong self-service paradigm. Such a paradigm allows users to easily locate appropriate information, presented in understandable ways that aid the production of derivatives, while ensuring that people understand where and how that information came into existence. Yet all this is possible only when a strong metadata foundation maintains reliable descriptions of the data.

Accordingly, to satisfy regulatory demands and to bolster trust in data use, ever more agile models must define governance rules, linked through metadata to mechanisms that automatically enforce these rules. To understand this, let’s drill deeper still into what we mean when we talk of metadata, exploring metadata’s relationship with governance and examining how the creation of a shopping experience based on metadata has become a central tenet of the self-service model.

Putting metadata at the center of the information lifecycle

Much has been made of how the five personas interact with the information lifecycle, turning the attention of many to the collaboration model that connects these personas—and the importance of trust to the whole process. The metadata layer helps support these capabilities by maintaining a shared understanding of data and related assets. By doing so, it acts as the foundational layer enabling both governance and the shopping experience.

What metadata, then, should we keep? Broadly, the metadata layer stores three types of information: (1) descriptive information about assets, (2) governance policies, classifications, metrics and rules for those assets and (3) information tagging for both. Descriptive information on assets can include the following:

  • Location and connection properties for the asset.
  • A technical schema description of an asset, when available. Consider, for example, a database schema describing the schema, tables and table columns that make up a database. The schema for some assets is initially unknown but can be discovered as the asset is explored. Other assets might be wholly without a schema—that is, unstructured.
  • Detailed information about the asset, automatically detected and derived by discovery analytics.
  • Business term and model definitions that act as formal asset information tags, having been verified by subject matter experts. Business terms are organized into business objects and attributes, providing a basis for data-driven security and business-friendly interfaces to data. This method of organization enables self-service from the business side, whose personnel are not trained to parse and consume technical descriptions.
  • Linkage between business terms and a technical schema or asset.
  • Linkage between governance classifications and either business terms or a technical schema or asset. Governance classifications help group assets that need similar management approaches, covering aspects such as confidentiality, integrity, retention and other governance activities. Linking these classifications to governance policies and rules dictates the management of assets that are classified together.
  • Design and runtime lineage, including the lineage flow that could be executed (design time) and the actual lineage associated with specific values in a data set (runtime).

Governance policies and rules describe appropriate asset use, valid field values, masking rules, permission rights and retention periods for each governance classification. Information tagging uses a comparatively free-form model that supplies additional descriptive information about assets from the perspective of the people who use those assets.
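As a rough illustration of the three kinds of information described above, a metadata record for a single asset might be modeled as follows. This is a minimal sketch, not an Apache Atlas or IBM schema: every class and field name (AssetRecord, Column, business_term and so on), the connection string and the sample values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Column:
    """One column in a technical schema, optionally linked to a business term."""
    name: str
    data_type: str
    business_term: Optional[str] = None  # linkage verified by a subject matter expert


@dataclass
class AssetRecord:
    """Hypothetical metadata record for a single asset."""
    name: str
    location: str  # location and connection properties
    columns: List[Column] = field(default_factory=list)  # empty if schema is unknown
    classifications: Set[str] = field(default_factory=set)  # governance classifications
    tags: Set[str] = field(default_factory=set)  # free-form tags from users

    def terms(self) -> Set[str]:
        """Business terms linked to this asset through its columns."""
        return {c.business_term for c in self.columns if c.business_term}


# Sample record; all values are invented for illustration.
customers = AssetRecord(
    name="CUSTOMERS",
    location="jdbc:db2://example-host:50000/SALES",
    columns=[
        Column("CUST_ID", "INTEGER", business_term="Customer Identifier"),
        Column("EMAIL", "VARCHAR(128)", business_term="Email Address"),
        Column("LOAD_TS", "TIMESTAMP"),  # no business term linked yet
    ],
    classifications={"confidential"},
    tags={"sales", "needs-cleanup"},
)
```

Keeping the business-term linkage on the column, and the classification on the asset, mirrors the two kinds of linkage in the list above; a real repository could attach classifications at either level.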

Adopting an open model for metadata

Clearly the industry is pushing technology toward an increasingly open model, driven by open-source technologies and open APIs. The days when industry standards were regulated by a governing body are effectively gone, and metadata is no exception. An open model for metadata, covering APIs, metadata types and frameworks for driving metadata management and governance, can bring together an end-to-end view of a heterogeneous, hybrid cloud landscape of systems delivered through a mixture of different vendor technologies and open-source technology. In its absence, a comprehensive self-service bi-modal IT model will remain only a dream.

IBM is beginning a move to open metadata capabilities based on Apache Atlas in what is called the Open Metadata Service (OMS), using open metadata capabilities to foster an analytics platform ecosystem in which vendors can use an open metadata access API to store and retrieve information. But why should this matter? What benefits accrue to users and business? Here are just a few examples:

  • The ability to find information, as well as assets associated with that information, depends on how completely the information is described in the metadata layer—regardless of which vendor’s technology hosts it.
  • As self-service users provision data into sandboxes, shaping it into new formats, they can automatically register new sources as well as their originating lineage.
  • Automatic generation of lineage information across a heterogeneous collection of systems can radically cut the costs of regulatory compliance reporting, such as the costs incurred pursuant to BCBS 239.
  • Governance rules can be enforced when data is accessed and moved, helping provide a strong trust model.
  • During transition of analytics from discovery to production, the captured metadata associated with that information can help shorten the time before resulting insights become actionable.

Or take an even more specific example: Different business lines have adopted several different self-service data preparation technologies from open-source technology as well as vendor capabilities. If use of such tools is facilitated by an open metadata and governance model, they can show the same set of data assets defined in the open metadata repository. As users prepare and shape the data, the lineage associated with the operation can be registered and used as guidance for production—or even help create a collaborative environment in which data scientists can pick up what has already been accomplished for further evolution into a pipeline associated with a specific model. Such an open approach can also allow use of metadata from proprietary metadata repositories, whether from IBM or other vendors' products.
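The idea of registering lineage as users prepare and shape data can be sketched in a few lines. The registry below is a hypothetical in-memory stand-in for an open metadata repository; it does not use the Apache Atlas API, and the class, method and asset names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class LineageEdge:
    """One recorded derivation step: source asset -> target asset."""
    source: str
    target: str
    operation: str


class LineageRegistry:
    """Hypothetical in-memory stand-in for an open metadata repository."""

    def __init__(self) -> None:
        self._edges: List[LineageEdge] = []

    def register(self, sources: List[str], target: str, operation: str) -> None:
        """Record that `target` was produced from `sources` by `operation`."""
        for source in sources:
            self._edges.append(LineageEdge(source, target, operation))

    def upstream(self, asset: str) -> Set[str]:
        """All assets the given asset was derived from, directly or indirectly."""
        found: Set[str] = set()
        frontier = [asset]
        while frontier:
            current = frontier.pop()
            for edge in self._edges:
                if edge.target == current and edge.source not in found:
                    found.add(edge.source)
                    frontier.append(edge.source)
        return found


# A self-service user joins two sources in a sandbox, then aggregates the result.
registry = LineageRegistry()
registry.register(["CUSTOMERS", "ORDERS"], "sandbox.customer_orders", "join")
registry.register(["sandbox.customer_orders"], "sandbox.sales_by_country", "aggregate")
```

Asking the registry for `registry.upstream("sandbox.sales_by_country")` walks back through both steps to the originating sources, which is the kind of trace a data scientist or a compliance report would need.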

Intersecting governance and metadata

Rather than merely defining governance, let’s focus on the conjunction of governance and metadata, which raises three particularly important considerations:

  • Definition of a governance program, including policies, rules and classifications
  • Sharing of APIs, structures and formats between tools
  • Automatic enforcement of governance requirements within the various platforms hosting data and related assets

The metadata layer stores governance rules, executing them during enforcement. These rules are linked to classifications that are in turn linked to metadata descriptions of assets, and the chief data officer (CDO) defines governance rules for enforcement when data is accessed or moved. Thus, when assets are accessed, automatic governance enforcement results, with asset classification allowing the metadata layer to select appropriate rules for execution.

Governance rules can also be displayed during pipeline design, informing designers about restrictions on data access or actions that will be taken when data is moved. Enforcement of the rules results when the data is accessed in other layers, such as the information virtualization component, or as data is moved by offerings such as IBM DataWorks.
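The chain from asset to classification to rule can be illustrated with a small sketch. It assumes rules are plain functions applied to each row when an asset is read; the rule names, classification names and lookup tables are all hypothetical, and a real platform would enforce rules inside its access or movement layer rather than in application code.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Rule = Callable[[Row], Row]


def mask_email(row: Row) -> Row:
    """Hypothetical masking rule: hide the local part of an email address."""
    masked = dict(row)
    if "email" in masked:
        _, _, domain = masked["email"].partition("@")
        masked["email"] = "***@" + domain
    return masked


def drop_ssn(row: Row) -> Row:
    """Hypothetical rule: remove a sensitive field entirely."""
    return {k: v for k, v in row.items() if k != "ssn"}


# Classification -> rules (in the article's terms, defined by the CDO).
RULES_BY_CLASSIFICATION: Dict[str, List[Rule]] = {
    "confidential": [mask_email],
    "restricted": [mask_email, drop_ssn],
}

# Asset -> classifications, as recorded in the metadata layer.
ASSET_CLASSIFICATIONS: Dict[str, List[str]] = {
    "CUSTOMERS": ["restricted"],
    "PRODUCTS": [],
}


def read_asset(asset: str, rows: List[Row]) -> List[Row]:
    """Apply every rule selected by the asset's classifications on access."""
    rules = [
        rule
        for classification in ASSET_CLASSIFICATIONS.get(asset, [])
        for rule in RULES_BY_CLASSIFICATION.get(classification, [])
    ]
    result = []
    for row in rows:
        for rule in rules:
            row = rule(row)
        result.append(row)
    return result
```

Because the caller names only the asset, changing a classification or its rules in the lookup tables changes what every reader sees without touching the code that reads the data.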

Creating a persona-based shopping experience

The shopping or search experience will be used by each of the five personas as the primary access model allowing location of assets described in the metadata repository, whether data, reports, data models, analytics models, transforms or APIs. Accordingly, the shopping experience must ultimately integrate with the API economy in three separate shopping paradigms:

  • A simple navigation model for either business or technical metadata, depending on the persona driving the experience, will be the first implemented and will be used for relatively small numbers of data attributes.
  • A simple search model for business or technical metadata will retrieve assets according to specific search criteria—for example, an end user could request all customer and product information associated with sales by country. Such a query could show all assets associated with a customer, a product or so forth. Like the simple navigation model, the simple search model will be most useful for relatively small numbers of data attributes.
  • A complex contextual search will search through the metadata layer, joined with a knowledge graph built up through the collaborative experience of the people using the analytics platform. This graph will store information about the person driving the inquiry—for example, that person’s role in the organization—and will also include information with which the person interacts, queries made by the user and the like. This approach aims to return information that will be directly relevant to each individual based on his or her responsibilities and position in the organization.
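The difference between the simple search and the contextual search can be sketched as follows. This is a deliberately small approximation: the catalog, the role names and the usage counts are invented, and a per-role usage table stands in for the knowledge graph described above.

```python
from typing import Dict, List, Set

# Hypothetical catalog: asset name -> business terms attached to it.
CATALOG: Dict[str, Set[str]] = {
    "CUSTOMERS": {"customer"},
    "PRODUCTS": {"product"},
    "SALES_BY_COUNTRY": {"customer", "product", "sales", "country"},
    "HR_PAYROLL": {"employee", "salary"},
}

# Hypothetical usage history: role -> how often people in that role used each asset.
USAGE_BY_ROLE: Dict[str, Dict[str, int]] = {
    "business_analyst": {"SALES_BY_COUNTRY": 12, "CUSTOMERS": 3},
    "data_engineer": {"CUSTOMERS": 9, "PRODUCTS": 7},
}


def simple_search(terms: Set[str]) -> List[str]:
    """Return assets tagged with at least one requested term, sorted by name."""
    return sorted(name for name, tagged in CATALOG.items() if tagged & terms)


def contextual_search(terms: Set[str], role: str) -> List[str]:
    """Re-rank the simple results by how often the requester's role used each asset."""
    usage = USAGE_BY_ROLE.get(role, {})
    return sorted(simple_search(terms), key=lambda name: -usage.get(name, 0))
```

Both functions return the same set of assets for a given query; only the ordering changes with the requester's role, which is the essence of returning information relevant to each individual.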

The actual shopping experience will be embedded within each of the persona interaction experiences:

  • Knowledge workers and business analysts will interact with the shopping experience via IBM Watson Analytics.
  • Data scientists will interact directly with the shopping experience or use DataWorks as part of the pipeline building, looking for both data and transforms.
  • Data engineers will use the shopping experience as part of DataWorks, primarily in navigation mode, for pipeline design.
  • Application developers on Bluemix will use the shopping experience to find data as well as assets, such as models, reports, APIs and transforms, for use in the application.
  • CDOs will examine the overall data landscape, deciding where and how to extend governance concepts or set or collaborate on business term definitions.

Providing a strong metadata layer

Metadata is becoming an integral part of self-service bi-modal IT models. Indeed, it holds the key to opening access to all data in the enterprise.

The self-service shopping metaphor describes most people’s first point of interaction with metadata. As people begin to use tools to work with assets, metadata continues to help them use those assets effectively. Thus collection of metadata describing changes to, exchanges of and comments on assets accompanies both development of assets and collaboration around assets. Accordingly, governance must tightly integrate with the metadata layer, providing a model for both defining and enforcing rules.

These topics are highly interrelated, emphasizing the need for a strong, common and open metadata layer to cover the wide range of assets used by organizations. Accordingly, IBM is rallying behind Apache Atlas as the core of its open metadata approach, aligning both products and business partners behind it. To learn more, discover how IBM data analytics technologies can help you heighten collaboration among roles, moving data fluidly through its lifecycle.