What has the catalog ever done for us?

Conforming the data lake

Senior Technical Staff Member, IBM Analytics

The ancient Romans knew a thing or two about the value of conformance. Take their approach to religion. In Latin, the word “religio” roughly means “something that binds or connects.” Strict conformance to the rituals and traditions of their religion was an integral part of society, especially as the Roman world incorporated a diverse range of people.

Those of us who learned our trade in the era of data warehousing have been forced to change our fundamental belief system, a set of principles that was held up as gospel for decades. We even saw our very own mini-schisms as people debated the respective merits of the "Kimball" and "Inmon" approaches. One of those fundamental principles was the quest for conformance.

Conformance of data from potentially multiple source systems into the cleansed environment of the data warehouse was one of the characteristics that marked a well-defined warehouse. It was among the warehouse's most important traits, presenting business users with data in a form that isolated them from the various source systems.

So what happens now, as we go beyond the frontiers of the data warehouse and into the world of the data lake? This is the world of Hadoop, NoSQL, schema on read and discovering the data as-is. For many organizations, the Holy Grail is to reap the benefits of the data lake while retaining a degree of control and governance. While it is desirable to have the broadest possible range of diverse data, the data on which the business runs must have the appropriate levels of provenance and lineage.

Physical conformance

Schema on write, the traditional means of achieving the necessary degree of conformance, is effective but comes at a high cost in the creation and maintenance of extract, transform, load (ETL) processes. This "physical conformance," the conversion of potentially different sets of source data into a single physical structure in the data warehouse, is anathema to the data lake world of Hadoop and NoSQL.
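To make that cost concrete, here is a minimal sketch in Python of what schema on write demands: every source system needs its own hand-built mapping into the conformed target structure. The source layouts and the target schema below are invented for illustration, not drawn from any real system.

```python
# Minimal sketch of physical conformance ("schema on write"):
# two hypothetical source systems describe the same customer
# differently, and an ETL step forces both into one target layout.

def conform_crm_record(rec: dict) -> dict:
    """Map a record from a hypothetical CRM source to the target schema."""
    return {
        "customer_id": rec["CustID"],
        "full_name": f'{rec["FirstName"]} {rec["LastName"]}',
        "country": rec["CountryCode"].upper(),
    }

def conform_billing_record(rec: dict) -> dict:
    """Map a record from a hypothetical billing source to the same schema."""
    return {
        "customer_id": rec["account"],
        "full_name": rec["name"],
        "country": rec["country"].upper(),
    }

# Every new source system means another mapping function to write,
# test and maintain: the cost that makes schema on write expensive.
crm = {"CustID": "C-001", "FirstName": "Ada", "LastName": "Lovelace", "CountryCode": "gb"}
billing = {"account": "C-001", "name": "Ada Lovelace", "country": "GB"}

warehouse_rows = [conform_crm_record(crm), conform_billing_record(billing)]
print(warehouse_rows)
```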

There will always be cases where physical conformance is needed, especially when populating the data warehouse. But what of the growing range of data that remains in the "schema on read" areas of the data lake, where no attempt is made to enforce such physical conformance?

Catalog conformance

Because physical conformance goes against the ethos and best practice of loading and using data as-is, other means of achieving conformance must be found. The traditional approach to data warehouse conformance, the use of a single consistent schema, is simply not available across potentially large areas of a data lake. If it is not possible to conform the data or the data structures, one must consider how to conform the layer of business metadata that describes the data lake.

An accurate and comprehensive business-oriented data catalog is critical if business and technical users are to navigate the data lake effectively. The business terms in that catalog must then be organized in a coherent way that supports the needs of the enterprise, with terms grouped to address different business issues and to support common, cross-departmental collections of data, as in the sketch below.
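As a rough illustration of such an organization, the following sketch shows a tiny hierarchy of business terms grouped by category. All category names, terms and definitions are invented for the example, not taken from any actual IBM Industry Model.

```python
# Minimal sketch of a business-term hierarchy: categories group
# terms so they can be found by business issue rather than by
# source system. All category and term names are illustrative.

catalog_terms = {
    "Party": {
        "Customer Identifier": "Unique identifier assigned to a customer.",
        "Customer Name": "Full legal name of the customer.",
    },
    "Location": {
        "Country of Residence": "Country in which the customer resides.",
    },
}

# Navigation by business issue: list every term under its category.
for category, terms in catalog_terms.items():
    for term, definition in terms.items():
        print(f"{category} / {term}: {definition}")
```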

Equally critical to the catalog's role are the increasing levels of automation and machine learning. Previously, one of the biggest obstacles to such a separate layer of metadata was keeping it in sync with the underlying data repositories. With machine learning now driving the automatic profiling and classification of data, and with classified data elements being mapped automatically to the relevant business terms, the catalog can be adaptive rather than merely reactive to changes in the underlying data lake repositories.
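The following sketch gestures at that idea with a deliberately simplistic rule-based classifier standing in for the machine-learning-driven profiling and classification described above; the patterns and business terms are invented, and real catalog products learn such mappings rather than hard-coding them.

```python
import re

# Deliberately simplistic stand-in for ML-based auto-classification:
# profile a column's name and sample values, then map the column to
# a business term. Real systems learn these rules; here they are
# hand-written and purely illustrative.

CLASSIFIERS = [
    # (business term, column-name pattern, value pattern)
    ("Customer Identifier", re.compile(r"(cust|account).*(id|no)", re.I), re.compile(r"^C-\d+$")),
    ("Country of Residence", re.compile(r"country", re.I), re.compile(r"^[A-Z]{2}$")),
]

def classify_column(name: str, sample_values: list[str]) -> str | None:
    """Return the business term a column most plausibly maps to, if any."""
    for term, name_pat, value_pat in CLASSIFIERS:
        name_hit = bool(name_pat.search(name))
        value_hits = sum(bool(value_pat.match(v)) for v in sample_values)
        if name_hit and value_hits >= len(sample_values) / 2:
            return term
    return None

print(classify_column("cust_id", ["C-001", "C-002"]))   # Customer Identifier
print(classify_column("cntry_code", ["GB", "US"]))      # None: values match, name pattern misses
```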

This comprehensive and well-structured catalog of business terms provides a spine of business metadata against which the different data repositories and technical assets in the data lake can be mapped, creating a conformed layer of business metadata to support navigation and discovery of the underlying data lake assets. In essence, the catalog becomes the thing that "binds and connects" the many different aspects of the data lake.
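Here is a minimal sketch of that spine, again with invented repository, table and field names: each conformed business term points at the concrete assets that realize it across the data lake.

```python
# Minimal sketch of the catalog as a "spine": one conformed business
# term maps to the physical assets spread across different
# repositories. All repository, table and field names are invented.

catalog_spine = {
    "Customer Identifier": [
        ("Hadoop",  "landing.customers_raw", "cust_id"),
        ("MongoDB", "crm.accounts",          "accountNo"),
        ("Db2",     "DWH.CUSTOMER",          "CUSTOMER_ID"),
    ],
    "Country of Residence": [
        ("Hadoop",  "landing.customers_raw", "cntry"),
        ("Db2",     "DWH.CUSTOMER",          "COUNTRY_CODE"),
    ],
}

def assets_for_term(term: str) -> list[tuple[str, str, str]]:
    """Discover every physical asset mapped to a conformed business term."""
    return catalog_spine.get(term, [])

# A user navigates by business meaning, not by repository layout.
for repo, asset, field in assets_for_term("Customer Identifier"):
    print(f"{repo}: {asset}.{field}")
```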

Virtual conformance

Such catalog conformance gives users a more coherent understanding of the constantly changing content of the data lake, but it does not isolate them from the different schemas that now inhabit it: anyone building a query must still take those schema differences into account. However, the growing SQL-on-Hadoop space offers increasingly sophisticated capabilities. With such tools, users can be shielded from many of the complexities and differences across the data lake, achieving a form of "virtual conformance." Given the increasing focus on the data virtualization space, one can expect these capabilities to grow further. In the meantime, users building such queries can look to a comprehensive, accurate and conformed catalog to guide their work.
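To illustrate the idea, rather than the API of any actual SQL-on-Hadoop or data virtualization product, the following sketch shows how a catalog-driven layer could rewrite one logical, business-term request into the schema each repository actually uses; all mappings and names are invented.

```python
# Minimal sketch of "virtual conformance": a catalog-driven layer
# rewrites one logical request, phrased in business terms, into the
# schema each repository actually uses. The mappings are invented,
# and real data virtualization engines are far more capable.

TERM_MAPPINGS = {
    "Customer Identifier": {"Hadoop": ("landing.customers_raw", "cust_id"),
                            "Db2": ("DWH.CUSTOMER", "CUSTOMER_ID")},
    "Country of Residence": {"Hadoop": ("landing.customers_raw", "cntry"),
                             "Db2": ("DWH.CUSTOMER", "COUNTRY_CODE")},
}

def virtual_query(terms: list[str], repository: str) -> str:
    """Build the repository-specific SQL for a business-term query."""
    columns, tables = [], set()
    for term in terms:
        table, column = TERM_MAPPINGS[term][repository]
        columns.append(column)
        tables.add(table)
    return f'SELECT {", ".join(columns)} FROM {", ".join(sorted(tables))}'

# The same logical question, answered against two different schemas.
print(virtual_query(["Customer Identifier", "Country of Residence"], "Hadoop"))
print(virtual_query(["Customer Identifier", "Country of Residence"], "Db2"))
```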

Watch this webcast to learn how you can govern your growing data lake with IBM Industry Models, including a live demo of bringing together data stored across different data lake repositories such as Hadoop, Cassandra, MongoDB or DB2.

IBM Industry Models include predefined, industry-specific hierarchies of business terms that provide the basis for business-meaningful conformance of the catalog. The broader IBM Unified Governance and Integration portfolio provides the governance fabric that keeps the catalog maintained and in sync with the underlying data lake.