Charting the data lake: Rethinking data models for data lakes

Senior Technical Staff Member, IBM Analytics

For decades, various types of data models have been a mainstay in data warehouse development activities. This “charting the data lake” blog series examines how these models have evolved and how they need to continue to evolve to take an active role in defining and managing data lake environments. This role can include several provisions: 

  • A basis for the data lake’s metadata layer
  • The standardized basis for schema design across the data lake
  • Valuable input to the governance of the data lake 

The traditional data warehouse usually has a reasonably well-defined scope, structured formats and a set of well-defined usage patterns implemented through a series of mainly predefined reports. In many such data warehouse designs, an overall active metadata repository was considered a good idea; however, all too often it was never really implemented. 

Defining a standardized structure

In many cases, when people spoke about a data model for data warehouses, they were almost always referring to the set of entity-relationship models that defined the structure and schema. Why? The data model was required to define what was most important—the definition of a standardized structure for common use by different parts of the enterprise. Given the relatively limited breadth and variability, justifying the cost for a separate definition of the business meaning behind data warehouse elements—typically represented as a set of business terms in a metadata repository—was often challenging.

As organizations start to move in increasing numbers beyond the data warehouse toward data lakes, a number of significant differences are forcing them to rethink what set of models might be needed to support this new and different landscape. The typical or intended scope of the data lake tends to be much wider than that of the traditional warehouse because data from many more sources and in many more different physical formats is in play. Typically, data lakes employ a wider range of technologies, from Hadoop Distributed File System (HDFS) clusters for enhancing a traditional warehouse, to incorporating streaming data, data virtualization and data stored across a combination of on-premises and various cloud platforms.

Perhaps most significant in terms of model usage, the extended set of users that needs to access the data lake has increased expectations about when and how it can extract information, which means the data lake needs to support self-service access across this wide and complex landscape.

One early reaction to the question of the potential role of data models in this new landscape was that there wasn’t actually a role. The dynamic, fluid nature of the capabilities such as schema at read seemed to indicate that predefined schemas as encapsulated in models of various kinds were no longer needed. However, now that such collections of technologies are entering the mainstream, many organizations are concluding that for ongoing management and efficient use of this new and extended environment, a valid if somewhat changed role still exists for data models. 

Meeting the need for common business meaning

One of the initial areas of focus in relation to data lakes is establishing the common business meaning for the many different artifacts being stored in the data lake. The data lake's many different types of artifacts, its more dynamic approach to loading those artifacts and its range of different users all call for a single, commonly understood definition of each artifact's meaning. And that meaning is a critical prerequisite for any sensible management of the data lake, especially where the data lake is intended for enterprise-wide purposes. 

The ability of an organization to maintain an accurate, business-meaningful glossary or taxonomy of the terms that describe all the artifacts in a data lake is critical for a wide range of users in different areas of the enterprise. It enables users to search, discover, understand and use the appropriate elements of the data lake with little or no need for interaction with the IT organization maintaining the data lake. 

Such a glossary, as it is created and maintained, needs to reflect the organic, fluid nature of the underlying data lake. The elements of the glossary need to be meaningful to the business. They need to reflect the local business language—or different language dialects—and be a genuine aid to different users when they are searching the data lake to determine which combinations of data elements are most appropriate for specific needs. A predefined and sometimes industry-specific glossary or taxonomy can provide an important resource for kick-starting a data lake vocabulary. 

Addressing the ongoing need for standardized schemas 

A lot of the initial focus on data lakes was on collections of HDFS clusters that simply landed the incoming data as is, with the schema applied only at the time of reading the data. In such cases, a predefined data model for creating schema-at-write structures wasn't necessary. This situation is still the case for many data lakes in which the intention is to quickly land the data as it comes off the upstream systems of record.
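The schema-at-read pattern described above can be sketched in a few lines of Python; the record formats, field names and types here are purely hypothetical and stand in for whatever an upstream system of record actually emits:

```python
import json

# Raw records landed "as is" from an upstream system; no schema is enforced at write time.
raw_landing_zone = [
    '{"cust_id": "1001", "amt": "250.00", "ts": "2021-06-01"}',
    '{"cust_id": "1002", "amt": "99.50"}',  # a missing field is acceptable at write time
]

# The schema exists only on the read side, defined by the consumer that needs structure.
read_schema = {"cust_id": str, "amt": float, "ts": str}

def read_with_schema(lines, schema):
    """Parse landed records, casting to the reader's schema; absent fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({field: (cast(record[field]) if field in record else None)
                     for field, cast in schema.items()})
    return rows

rows = read_with_schema(raw_landing_zone, read_schema)
```

The point of the sketch is that the landing zone accepts whatever arrives, and each reader decides how much structure to impose and how to handle gaps.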

However, where a data lake evolves to also include more structured elements—such as preexisting data warehouses, or simply a move toward a more ordered environment that supports user activities—a data model is needed to assist with establishing standardized structures. Here is an example of the potential roles played by these different models when creating and then managing and using a data lake: 

The business vocabulary can manage and define the common business meaning of all the elements in the data lake as well as in potentially associated systems. This model type is usually either a simple glossary or a hierarchical taxonomy and provides a useful business-meaningful reference point to which all the different artifacts can be mapped. In addition to the elements in the different data lake repositories, elements in other associated systems that may be of use to users of the business or technical data lake can also be mapped to this business vocabulary, if necessary.
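As a minimal sketch of such a business vocabulary, the snippet below maps business terms to the physical artifacts that implement them across the data lake's repositories. The term names, definitions and artifact paths are invented for illustration and do not come from any particular glossary product:

```python
# Hypothetical business vocabulary: a simple glossary mapping business terms
# to the physical artifacts (across different repositories) mapped to them.
glossary = {
    "Customer": {
        "definition": "A person or organization that purchases goods or services.",
        "mapped_artifacts": ["hdfs:/landing/crm/customers.json",
                             "rdbms:warehouse.dim_customer"],
    },
    "Order": {
        "definition": "A request by a customer to purchase goods or services.",
        "mapped_artifacts": ["hdfs:/landing/oms/orders.avro"],
    },
}

def find_artifacts(term):
    """Let a business user discover physical data by business term, not by path."""
    entry = glossary.get(term)
    return entry["mapped_artifacts"] if entry else []
```

Even in this toy form, the mapping shows how a user can start from a business-meaningful term and be led to the relevant data lake artifacts without knowing the underlying storage layout.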

The data models are used in the generation of the physical schemas either for Apache Hadoop and HDFS or for traditional relational database management system (RDBMS) structures. If a business need exists for a standard schema for a subset of the HDFS elements in the data lake, then potentially having these different physical models derived from a common logical data model is often beneficial.
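One way to picture deriving different physical schemas from a common logical data model is the sketch below. The logical model, the type mappings and the generated DDL strings are simplified assumptions for illustration, not the output of any particular modeling tool:

```python
# Hypothetical logical model: an entity plus its logical attribute types.
logical_model = {
    "entity": "customer",
    "attributes": [("customer_id", "integer"), ("name", "string"), ("balance", "decimal")],
}

# Per-platform type mappings derived from the shared logical types.
HIVE_TYPES = {"integer": "INT", "string": "STRING", "decimal": "DECIMAL(18,2)"}
RDBMS_TYPES = {"integer": "INTEGER", "string": "VARCHAR(255)", "decimal": "NUMERIC(18,2)"}

def generate_ddl(model, type_map, table_prefix=""):
    """Derive a physical CREATE TABLE statement from the common logical model."""
    cols = ", ".join(f"{name} {type_map[ltype]}" for name, ltype in model["attributes"])
    return f"CREATE TABLE {table_prefix}{model['entity']} ({cols})"

hive_ddl = generate_ddl(logical_model, HIVE_TYPES)
rdbms_ddl = generate_ddl(logical_model, RDBMS_TYPES, table_prefix="warehouse.")
```

Because both physical schemas are generated from the same logical definition, a change to an entity or attribute propagates consistently to the HDFS-based and RDBMS-based repositories.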

Governing the data lake

Some suitable level of governance is critical as the data lake grows for use by various users and departments across the enterprise. The role of a standard business vocabulary is a critical building block for such a governance process. It enables the different business and IT participants to have a shared understanding of the various elements in the data lake.

And this understanding is especially vital considering that such elements are likely to have come from a wide range of source systems with different characteristics and structures of their own. Additionally, a data model is critical to enable a standardized and efficient governance process for the subset of data lake repositories in which a need for the enforcement of schema at write exists. 
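For the subset of repositories where schema at write is enforced, the governance check can be imagined roughly as follows; the schema, the validation rules and the store are all hypothetical placeholders:

```python
# Hypothetical schema-at-write check: records are validated against the model
# before they are allowed into a governed repository.
write_schema = {"cust_id": str, "amt": float}

def validate_at_write(record, schema):
    """Reject records that are missing fields or carry the wrong types."""
    for field, expected in schema.items():
        if field not in record:
            return False, f"missing field: {field}"
        if not isinstance(record[field], expected):
            return False, f"bad type for {field}"
    return True, "ok"

governed_store = []

def write(record):
    ok, reason = validate_at_write(record, write_schema)
    if ok:
        governed_store.append(record)
    return ok, reason
```

The contrast with the earlier schema-at-read sketch is the point: here the model acts as a gatekeeper at load time, which is what makes standardized, enforceable governance of those repositories possible.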

Learning more about the role of data models

While this installment introduces the potential role of models in data lake environments, additional details about the overall role of models in the evolution of data lakes can be found in IBM Industry Model Support for a Data Lake Architecture. The next installment in this series covers the specific considerations for building a business language for the data lake.

Look for subsequent installments in this series to describe a range of different considerations for using data models in conjunction with data lakes. They address such areas as when and when not to use models for defining data lake repositories, the different data model development lifecycles associated with data lakes and the different normalization approaches across the data lake. In the meantime, explore how the IBM Watson Data Platform can form the foundation for your enterprise data lake.