Development lifecycles for defining the meaning and structure of the data lake

Senior Technical Staff Member, IBM Analytics
 “They could discuss without quarrelling and cooperate without getting in each other’s way,” Philip Pullman, The Amber Spyglass

In the “His Dark Materials” trilogy, the author Philip Pullman describes the adventures of two children who have discovered the supernatural tools necessary to travel through the many different parallel worlds. While these worlds are completely separate, they share and are influenced by a common afterlife. This idea of totally separate parallel worlds can be a useful concept when trying to understand the different but still somehow linked universes that exist across the various model and vocabulary development activities around the data lake.

In the past, the relationship between the different models used to define a data warehouse was very linear. The team responsible for developing the data warehouse may have used different model artifacts as it progressed through a usually waterfall-type set of activities, starting with high-level business definitions, progressing to logical models, and eventually ending up with a set of physically deployed schemas intended to address the initial business needs. The arrival of the data lake has significantly changed the approaches required for the creation of models. Previous editions of this blog described the elevated role that the business vocabulary should have when building a data lake, as well as the different role for data models. This edition introduces the various development lifecycles relating to these different types of model. As with the parallel worlds in the literary reference at the beginning, these are completely separate lifecycles, involving quite different roles and personnel with quite different objectives. However, these lifecycles must also be interconnected, ideally via an overarching governance and communications process.

Different but parallel lifecycles

So it is necessary to both build out the infrastructure to ensure the appropriate level of management and control over the data lake, and to enable the individual model development processes to progress independently.  Essentially this means that there are likely to be three different, parallel, and quite distinct lifecycles emerging:

  1. The lifecycle regarding the evolution of the business language.
  2. The lifecycle regarding the creation of any data models needed to specify the schema in the subset of the data lake for which a pre-defined schema is considered necessary.
  3. The lifecycle relating to the ongoing maintenance of the physical artifacts in the data lake.

The diagram below outlines how these three different lifecycles interact with the data lake. 

Defining the business meaning

Lifecycle 1 is concerned with the definition and maintenance of the business meaning of the various terms that make up the overall language associated with the data lake. As the data lake is likely to be used by a range of different users (business and technical), it is critical to establish a commonly understood meaning for all of the terms used to describe the data lake. Without this, potentially valuable data resources cannot be shared effectively across the enterprise, and the lack of a workable business language is a major obstacle to key objectives of most data lakes, such as promoting self-service and guided search. The typical objectives of this first lifecycle include:

  • Ensuring that the relevant business aspects are covered by the data lake catalog, including any local or departmental terms if necessary.
  • Ensuring consistency of language and meaning across the different groups of data lake users.
  • Determining the most appropriate level of detail to be contained in the business vocabulary, and agreeing on the supporting artifacts (for example: data models, conceptual models).
  • Determining the structure of the business vocabulary and how to ensure it can address the potentially quite varying needs of the different users.

This lifecycle is typically owned by, or at least heavily influenced by, the CDO organization, and the main focus is on identifying the different business terms used across the enterprise and determining those suitable for use with the data lake. The people involved in this lifecycle tend to be more concerned with the business terminology than with the technical implementation details; for example, business analysts, representatives from the different businesses supported by the data lake, and the business stakeholders. The involvement of IT/technical staff is therefore usually in a peripheral or supporting role.

In terms of IBM Industry Models, this is the lifecycle that is used to select, customise and manage the various business vocabulary elements provided, specifically the Business Terms, Analytical Requirements and Supportive Content (referenced as BT, AR and SC respectively in Lifecycle 1 of the above diagram).

Defining the various structures needed in the data lake

Lifecycle 2 is concerned with the definition of the structure, when required, of the different data repositories across the data lake. In some cases, there will be areas of the data lake where there is no need for the definition of schemas (for example, the deployment of certain schema-on-read structures in the landing zone of the data lake). However, in most data lakes, there are also likely to be areas where the definition of a schema is required, for example in the area traditionally occupied by the data warehouse, or where there is simply a need to converge the many different incoming representations of a particular type of data into a single schema to assist with subsequent use by business users. In such cases, especially where there are potentially many such repositories, enforcing consistency of schema structure and terminology becomes important.

The fundamental objective of this lifecycle is the transformation of the initial business requirements into a set of logical data models and ultimately into the necessary physical data models for deployment into the data lake.  The initial business requirements can be represented by specific subsets of the business vocabulary perhaps supplemented by a high level conceptual model.

This lifecycle is typically more technical in nature than lifecycle 1 and the key people involved would include data modellers, data architects, database administrators and other related development personnel.  Typically, the role of business user representatives is in a review/oversight capacity to ensure that the data model development reflects the overarching business needs.

The IBM Industry Models components usually involved in this lifecycle are the Business Data Model (BDM), Atomic Warehouse Model (AWM) and Dimensional Warehouse Model (DWM). These models would be scoped and customized to align with the business areas to be addressed, and the physical data models would be the main output, intended for subsequent deployment as part of the third lifecycle.
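To make the logical-to-physical step more concrete, the following is a minimal sketch in Python of how a logical entity might be rendered as a physical table definition. It is an illustration only and does not represent how IBM Industry Models or any IBM tooling generates physical models; the names Entity, Attribute, TYPE_MAP, physical_name and to_ddl, as well as the sample entity and the type mapping, are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical logical-to-physical type mapping; a real project would
# choose types to suit its target repository.
TYPE_MAP = {"string": "VARCHAR(255)", "integer": "INTEGER", "date": "DATE"}


@dataclass
class Attribute:
    name: str              # business-facing attribute name
    logical_type: str      # key into TYPE_MAP
    required: bool = False


@dataclass
class Entity:
    name: str
    attributes: list


def physical_name(logical_name: str) -> str:
    """Turn a logical name such as 'Customer Account' into 'CUSTOMER_ACCOUNT'."""
    return "_".join(logical_name.upper().split())


def to_ddl(entity: Entity) -> str:
    """Render one logical entity as a CREATE TABLE statement."""
    columns = []
    for attr in entity.attributes:
        null_clause = " NOT NULL" if attr.required else ""
        columns.append(
            f"  {physical_name(attr.name)} {TYPE_MAP[attr.logical_type]}{null_clause}"
        )
    return (
        f"CREATE TABLE {physical_name(entity.name)} (\n"
        + ",\n".join(columns)
        + "\n);"
    )


# Illustrative logical entity (invented for this sketch).
customer = Entity(
    "Customer Account",
    [
        Attribute("Account Id", "integer", required=True),
        Attribute("Customer Name", "string"),
        Attribute("Open Date", "date"),
    ],
)

ddl = to_ddl(customer)
print(ddl)
```

Keeping the naming rule and the type mapping in one place is one simple way to get the consistency of schema structure and terminology that this lifecycle is meant to enforce across many repositories.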

The deployment and usage of the models-driven data lake artifacts

In this third lifecycle, the main focus is on the deployment of the different models-driven artifacts into the physical data lake production environment and the subsequent usage of these artifacts by the technical and business users. Typically this lifecycle would be concerned with: 

  • The deployment of the published business vocabulary for use by the different data lake users.
  • The deployment of the generated data model artifacts into the data lake repositories.
  • The use of the business language by the business users to assist their search and discovery activities across the data lake.
  • The use of the business language by the data lake operations team, specifically the ongoing mapping/tagging of any data lake components to the data lake catalog.
  • The management of feedback from the various data lake users to the business language and data modelling lifecycles.
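As a rough illustration of the mapping/tagging and search activities listed above, the sketch below tags data lake assets with published business terms and then looks assets up by term. It is a simplified stand-in written in plain Python, not the API of any catalog product; the vocabulary entries, the asset names and the functions tag_asset and find_assets are hypothetical.

```python
from collections import defaultdict

# Hypothetical published vocabulary: business term -> definition.
vocabulary = {
    "Customer": "A party that holds one or more accounts.",
    "Account Balance": "The amount held in an account at a point in time.",
}

# Business term -> set of data lake assets tagged with that term.
tags = defaultdict(set)


def tag_asset(asset: str, term: str) -> None:
    """Tag an asset with a term, rejecting terms that are not published."""
    if term not in vocabulary:
        raise KeyError(f"'{term}' is not a published business term")
    tags[term].add(asset)


def find_assets(term: str) -> list:
    """Return the assets tagged with a term, in sorted order."""
    return sorted(tags.get(term, ()))


# Illustrative asset names (invented for this sketch).
tag_asset("landing/crm/customers.json", "Customer")
tag_asset("warehouse.CUSTOMER_ACCOUNT", "Customer")
tag_asset("warehouse.CUSTOMER_ACCOUNT", "Account Balance")

print(find_assets("Customer"))
```

Rejecting unpublished terms at tagging time is one possible design choice: it keeps the operations team working from the same vocabulary that business users search with, and a rejected term becomes a natural item of feedback to the business language lifecycle.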

The typical users involved in this lifecycle range from the operations/development staff concerned with the deployment of the physical data models, to the business users who are using the business language to underpin their searching and navigation/discovery of the data lake, to the governance personnel and data stewards who are focussed on the integrity of the data lake operations.

In terms of the IBM Industry Models, this lifecycle is concerned with the deployment of the data lake repositories using the physical model variants of the Atomic Warehouse Model (AWM) and Dimensional Warehouse Model (DWM) as defined by lifecycle 2. This lifecycle is also concerned with the use by the data lake users of the Business Terms (BT) as defined by lifecycle 1.

It may help to look at a more precise example of such a physical environment deployment with some relevant IBM tooling, specifically IBM InfoSphere Information Governance Catalog (IGC) and IBM InfoSphere Data Architect (IDA).

As shown in the diagram above, the business language environment (Lifecycle 1) is concerned with the evolution of the common business language for the data lake. Organizations often use the separate development and published instances that are available within IGC to keep the development of the business language in lifecycle 1 apart from the day-to-day use of the business language by the users in the runtime data lake environment (lifecycle 3).

Similarly, the physical data models can be generated in IDA for deployment to the appropriate data lake repositories. There is usually a separate development or design-time repository to manage the evolution of the data models as they are being developed during lifecycle 2, as opposed to the subsequent deployment of the resulting physical data models into the data lake.

There is also a need for the periodic integration or alignment of the artifacts being evolved in these different development lifecycles as they approach the point of deployment to the runtime data lake environment. In the case of the IBM tooling in this example, such integration can be carried out by IBM InfoSphere Metadata Asset Manager (IMAM) to enable import of logical and physical data models into the IGC run-time repository for reference purposes.

In addition, there is typically an overarching governance process that oversees the parallel development of the business language and data models to ensure alignment in terms of coverage of business issues and the coherent integration of the derived artifacts as they are deployed into the runtime data lake environment. As well as overseeing the extension of the data lake to ensure consistency in subsequent deployments, this governance process should also ensure that any experience and lessons from the day-to-day usage of the data lake artifacts are fed back into the business language and model development lifecycles.
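One simple form such an alignment check could take is sketched below in Python: it compares the published business terms with the terms referenced by physical model columns and reports gaps in both directions. This is an illustrative assumption about what a governance check might look like, not a description of any IBM product feature; the data and the function alignment_report are invented for the example.

```python
# Hypothetical set of published business terms (lifecycle 1 output).
published_terms = {"Customer", "Account Balance", "Branch"}

# Hypothetical mapping of physical columns (lifecycle 2 output) to the
# business term each column was derived from.
column_terms = {
    "CUSTOMER_ACCOUNT.CUSTOMER_NAME": "Customer",
    "CUSTOMER_ACCOUNT.BALANCE_AMT": "Account Balance",
    "CUSTOMER_ACCOUNT.RISK_SCORE": "Risk Score",
}


def alignment_report(terms: set, mappings: dict) -> dict:
    """List published terms with no columns, and columns whose term is unpublished."""
    used_terms = set(mappings.values())
    return {
        "terms_without_columns": sorted(terms - used_terms),
        "columns_with_unpublished_terms": sorted(
            column for column, term in mappings.items() if term not in terms
        ),
    }


report = alignment_report(published_terms, column_terms)
for issue, items in report.items():
    print(issue, items)
```

In this sample data, "Branch" is published but not yet covered by any column, while the RISK_SCORE column refers to a term that has not been published. Either finding would be a candidate for feedback into the corresponding lifecycle.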

There is more information on the implementation of these different lifecycles and the associated considerations in the document “IBM Industry Models support for a data lake architecture”.

The next blog in this series will consider the role of the different types of data models as a basis for the deployment of various types of data lake repositories.