Charting the data lake: Model normalization patterns for data lakes

Senior Technical Staff Member, IBM Analytics

“A place for everything, and everything in its place” Benjamin Franklin

The data lake can be considered the consolidation point for all of the data which is of value for use across different aspects of the enterprise. There is a significant range of the different types of potential data repositories that are likely to be part of a typical data lake. These data repositories are likely to be required to address a number of different roles to address the needs of the different users, for example:

  • Storing of raw unprocessed data v’s storage of highly processed data
  • Managing normalized data v’s managing aggregated data
  • Data for general use across the enterprise v’s data for use for a specific business purpose or for a specific set of business users
  • Data structures with no predefined schema v’s data structures with a predefined schema.

The other key question in building out the data lake repositories is to what level is some standardization or consistency of schema desirable or even necessary. It is a valid choice for an organization to decide that there is no need to enforce any degree of standardization of schema across the data lake. In this case, the expectation is that whatever virtualization layer is in place is capable of guiding the different users through the array of different structures, the duplication, the different terminology. In other cases, the decision is taken that at least some parts of the data lake need to comply with some degree of standardization in the data base schemas, even in cases where such data bases are still doing a range of different jobs and so may need to be structured differently.  

The diagram below shows such a typical collection of different data structures combined together within a data lake.

Diagram_1_Pat.png (612×254)


In the landing zone, the focus is on initially ingesting the data as it comes into the data lake in a raw format. The landing zone may also be extended to enable some initial processing of the data enabling it to be more generally useful across the data lake. The data scientist sandboxes are typically ad-hoc issue-specific, very flat structures containing many repeating groups of data. The Data Warehouse typically has a need for storing data in a reasonably flexible format for various different downstream uses, whereas the data marts are often made up of a lot of aggregated data focused on a specific business issue or a specific group of users.  

So a key potential role of a data model is to enable a degree of standardization across such a disparate set of repositories. So that where possible, the same or similar structures and terminology are used to assist with the subsequent understanding and navigation of the data lake.  The need to address potentially different characteristics across these data repositories in terms of how the data is stored and accessed, is where the different normalization patterns come in.  


The typical data models traditionally used in the construction of data warehouse structures would often start with a model structure that has a reasonably high degree of normalization, typically a degree of normalization that provides the necessary level of flexibility to allow the effective representation of the various business needs. When these models are then transformed from such a logical platform-independent format into a more platform-specific format, varying degrees of denormalization takes place in order to ensure physical models that are performant in the specific physical environment

The focus on denormalization becomes critical in the context of the data lake and specifically in terms of any associated Hadoop/HDFS data structures.  The traditional trade-off to make when considering the appropriate level of normalization/denormalization to define in physical platform-specific structure is between

  • the flexibility of storing data elements in their most granular and most atomic form, which enables the ability to store data to address many different business issues, in many case even to address as-yet-unanticipated business questions.
  • The storage of data in a format that is close to that required by the immediate and/or known business requirements. Adopting this denormalized approach enables simpler ETL, simple access SQL, but at the cost of flexibility. 

In general the tolerance and performance of Hadoop/HDFS structures in handling normalized data structures is not as good as that which has been possible with relational DBMSs.  So this typically means that the same logical data model can result in quite radically different denormalization approaches when being deployed to RDBMS and HDFS physical structures. The diagram below shows an example of such different denormalization patterns from the same logical data model (in this case taken from the IBM Banking and Financial Markets Data Warehouse).

Diagram_2_Pat.png (699×374)


In this example the main entities of Arrangement, Involved Party and Campaign are connected via the relevant associative entities (IP AR Relationship and AR EV Relationship). At the platform independent level it is possible to start to virtually group these entities into different diagrams to reflect the coverage of the two different business areas.

However when these are then transformed into the platform specific physical models, their treatment can be quite different depending on the target data repository. When deploying to the various RDBMS environments it is deemed acceptable to leave the associative entities in place and represent them as individual tables. However for the deployment to Hadoop/HDFS, the decision is taken to converge the various entities into a single combined file structure.

Different normalization patterns

Overall the different options to be considered when denormalizing data models such as those supplied as part of the IBM Industry Models are:

  • Vertical reduction of the supertype/subtype hierarchy – this is effectively the “roll up” or “roll down” of the super-type hierarchy in the logical data model (for example the “Party” hierarchy consisting of Individuals, Organizations, Organization units, etc).
  • Reduce or remove the associative entities – these entities are also sometimes known as “link entities” (for example the Patient/Claim entity in Healthcare). These provide significant flexibility when implemented, but can also result in significantly more complicated SQL logic to traverse the data structures. The alternative is to collapse the information into a single entity (so resulting in a patient entity which incorporates potentially repeating groups of all the claims associated with that patient).  A variant on this is the converging of entities involved in 1-to-many or many-to-1 relationships (for example where an Asset entity has a relationship to a separate Asset Model entity).
  • Reduction or removal of the Classification entities – in certain models there is an extensive set of classification entities designed to call out the different type mechanisms across the models (for example the Credit Card Type in a banking model). Even in traditional RDBMS deployments, these classification entities are often removed, something that is even more likely to happen with Hadoop/HDFS deployments.
  • Denormalize the Accounting Unit structure into a single profile entity – the set of logical entities in the “Accounting Unit” of the IBM Industry Models typically provide a highly flexible mechanism for the storage of numeric attributes such as balances, totals, counts, averages, etc.. The denormalization option here means the possibility to converge these entities into a single profile table for the area to which the figures relate (for example converging the “Agreement Credit Risk Accounting Unit” entities into a single “Agreement Credit Risk Profile” table).

There is more information on these different denormalization patterns and other logical to physical deployment considerations for Hadoop in the document Guidelines for deploying an IBM Industry Model to Hadoop.

The next and final blog in this series will focus on the possible role for the different models to play when assisting data scientists and their activities.