Charting the data lake: Using the data models with schema-on-read and schema-on-write

Senior Technical Staff Member, IBM Analytics

‘So what do you mean when you say “data lake?”’ It is always fun to throw this question into a group of data management professionals (I accept that “fun” is a relative term here). Despite the term having been in general and wide usage for over five years, the variety and elasticity in what different people actually mean by that phrase is pretty impressive. However, over time the general trend for data lakes appears to have been a move away from what was initially seen as primarily a collection of Hadoop/HDFS clusters towards a broader set of enterprise data assets requiring a common set of services for access and a common governance and integration fabric. In many cases the data lake can be defined as a superset of data repositories that includes the traditional data warehouse, complete with traditional relational technology.

One significant example of the different components in this broader data lake is the range of approaches taken to the data stores within it. There is a set of repositories that is primarily a landing place for data as it comes from the upstream systems of record. This data is largely unchanged, both in terms of the instances of data and in terms of any schema that may be implied. There is no attempt to enforce any sort of schema as this data is loaded into the data lake; a schema only comes into play when users try to read these stores, hence they are called schema-on-read data stores. This is in contrast to the types of stores more familiar to people coming from the traditional data warehouse world, where typically a lot of effort is expended in setting up a set of data stores that have a consistent and standard schema enforced from the moment they are created, hence the phrase schema-on-write.
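The contrast between the two kinds of store can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `conform` helper and the customer schema are all hypothetical, and real stores would sit on HDFS files or database tables rather than in-memory lists.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
CUSTOMER_SCHEMA = {"id": int, "name": str, "balance": float}


def conform(record, schema):
    """Coerce a record to the schema; raise ValueError if a field is missing."""
    missing = [field for field in schema if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {field: cast(record[field]) for field, cast in schema.items()}


class SchemaOnWriteStore:
    """Validates every record at load time, so stored rows are always consistent."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, raw_line):
        # Loading fails here if the incoming record does not fit the schema.
        self.rows.append(conform(json.loads(raw_line), self.schema))

    def read(self):
        return list(self.rows)


class SchemaOnReadStore:
    """Lands raw lines untouched; a schema is applied only when a reader asks."""

    def __init__(self):
        self.raw_lines = []

    def write(self, raw_line):
        self.raw_lines.append(raw_line)  # no parsing, no validation

    def read(self, schema):
        rows = []
        for line in self.raw_lines:
            try:
                rows.append(conform(json.loads(line), schema))
            except (ValueError, TypeError):
                continue  # records that do not fit this reader's schema are skipped
        return rows


# Two records from an imaginary upstream system; the second lacks a balance.
incoming = [
    '{"id": 1, "name": "Ada", "balance": "12.50"}',
    '{"id": 2, "name": "Bob"}',
]
```

With this sketch, a `SchemaOnReadStore` accepts both lines in `incoming` as they are, and different readers can apply different schemas to the same raw data. A `SchemaOnWriteStore` built with `CUSTOMER_SCHEMA` accepts the first line and rejects the second at load time, which is where the extra effort and the extra consistency both come from.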

There are many possible combinations of these different approaches to the schema for the data repositories across a data lake. One possible usage of these different storage approaches is outlined here.

The landing zone, the area used to initially store the data coming into the data lake, relies on schema-on-read structures to enable the rapid ingestion of incoming data in its native form. Similarly, in the data scientist sandbox the focus is on the rapid collection of data for discovery and analysis, hence the adoption of schema-on-read here as well. By contrast, the central analytics area, intended for use by the broader business community with an emphasis on ensuring standardization and consistency, will likely use schema-on-write. In such cases the need for consistency and standardization outweighs the benefits of being able to load the data rapidly.

Schema-on-read versus schema-on-write 

A lot has been written elsewhere about the different uses and the pros and cons of these two approaches. In summary, schema-on-read allows for the rapid landing of large amounts of data into the data lake but requires extensive tagging of such data to ensure that it is generally usable across the enterprise. In essence, such schema-on-read stores accurately reflect the data as it exists in the upstream systems, complete with any inconsistencies that often exist in the treatment of data across these different source systems. The schema-on-write data stores require a lot more up-front preparation and ongoing transformation of the incoming data, so they are more costly to set up and maintain, but they have the advantage of storing the data in a more standardized and consistent fashion. The beauty of the data lake is that in combining both types of data stores, it is possible to support a greater range of users and activities: data scientists can discover critical hidden insights in the unprocessed data, whereas the more regular business users can benefit from the standard and refined data for more predefined and repeatable purposes.

The role of data models

The first assumption is that there is a need to worry about the definition of schema at all. If the data lake is primarily a simple collection of Hadoop/HDFS files with little if any focus on enforcing a schema, then it may be that the use of data models is not necessary. However, there is also a growing view that data lakes are indeed hybrid collections of repositories, some with predefined schemas and some without. In such cases, there is a potential continuing role for data models.

So this concept of the data lake being a combination of different technologies brings up the question of how to use data models as a means of enforcing a degree of consistency and standardization across a set of data stores with quite radically different philosophies in terms of storage. As discussed in the previous blog, the focus on ensuring a common business meaning across the different components can be addressed via the use of a business vocabulary, whereas the traditional entity-relationship (E-R) data models are concerned with enforcing standardization of the schema. The question is what role such E-R models play when in some cases there is no schema.

The diagram below outlines the potential role of E-R models when it comes to a hybrid environment containing Hadoop HDFS structures (both schema-on-read and schema-on-write) and a traditional relational database.

In looking at this set of different model components, it is necessary to consider three different activities:

  1. The use of the data models to generate the repositories in the data lake for which there is a need for a predefined schema. This is the classic flow in which the overall platform-independent outline is defined in the logical data model, followed by the generation of different downstream physical data models for the RDBMS and the schema-on-write HDFS repositories. The different physical models for HDFS and RDBMS are needed to accommodate the significant differences between these two environments, for example the lower level of tolerance for normalized structures in HDFS as compared to RDBMS.
  2. The reverse engineering of new physical models that reflect the structure of the schema-on-read repositories. This gives the data models visibility into the structures of these repositories, which becomes important when there is a need to design new schema-on-write structures to hold more refined versions of data initially stored in the schema-on-read HDFS files. In some cases, depending on the data model tooling used, it may also be possible to create mappings between these reverse-engineered physical models and the canonical logical models.
  3. Finally, there is the mapping of the different data model elements (along with the associated physical artifacts) to the business terms in the business vocabulary.
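The three activities above can be sketched in Python as follows. This is a minimal illustration, not how any particular modeling tool works: the logical model, the type mappings, the generic SQL-like `CREATE TABLE` text and the business vocabulary are all hypothetical, and the denormalization shown (flattening every entity into one wide table) is just one of several possible options.

```python
import json

# Activity 1: a platform-independent logical model (hypothetical entities).
LOGICAL_MODEL = {
    "Customer": {"customer_id": "integer", "customer_name": "string"},
    "Account": {"account_id": "integer", "customer_id": "integer", "balance": "decimal"},
}

# Illustrative type mappings only; real targets have richer type systems.
RDBMS_TYPES = {"integer": "INTEGER", "string": "VARCHAR(255)", "decimal": "DECIMAL(18,2)"}
HDFS_TYPES = {"integer": "BIGINT", "string": "STRING", "decimal": "DOUBLE"}


def create_table(name, columns, type_map):
    """Render one generic CREATE TABLE statement as text."""
    cols = ", ".join(f"{col} {type_map[kind]}" for col, kind in columns.items())
    return f"CREATE TABLE {name} ({cols})"


def rdbms_physical_model(logical):
    """Normalized physical model: one table per logical entity."""
    return [create_table(name.lower(), cols, RDBMS_TYPES) for name, cols in logical.items()]


def hdfs_physical_model(logical, table_name):
    """Denormalized physical model: all entities flattened into one wide table."""
    merged = {}
    for cols in logical.values():
        for col, kind in cols.items():
            merged.setdefault(col, kind)  # shared keys such as customer_id appear once
    return [create_table(table_name, merged, HDFS_TYPES)]


# Activity 2: reverse engineer a physical model from schema-on-read records.
def reverse_engineer(raw_lines):
    """Return {field: sorted type names observed} for JSON-lines input."""
    observed = {}
    for line in raw_lines:
        for field, value in json.loads(line).items():
            observed.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in observed.items()}


# Activity 3: map physical elements to business terms (hypothetical vocabulary).
BUSINESS_TERMS = {
    "customer_id": "Customer Identifier",
    "cust_no": "Customer Identifier",
    "balance": "Account Balance",
}


def map_to_vocabulary(fields):
    """Attach a business term to each field, flagging anything not yet mapped."""
    return {field: BUSINESS_TERMS.get(field, "UNMAPPED") for field in fields}


# Raw records as they might land from an upstream system, inconsistencies included.
landed = ['{"cust_no": 7, "bal": "10.5"}', '{"cust_no": "8", "bal": 3}']
```

Calling `reverse_engineer(landed)` shows that both fields arrive with mixed types, which is exactly the kind of upstream inconsistency a schema-on-read store preserves. Passing the result to `map_to_vocabulary` ties `cust_no` to the same business term as the modeled `customer_id` and flags `bal` as unmapped, pointing at the work still needed before that data can be refined into a schema-on-write structure.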

There is a range of further, more detailed considerations that need to be taken into account in this area, for example the selection of the appropriate data models to use, the different denormalization options, and the handling of keys. For more information on this, refer to the document Guidelines for deploying an IBM Industry Model to Hadoop.

The next edition of the “Charting the data lake” blog series will look at the different development lifecycles for defining the various types of model artifacts in the data lake, specifically the lifecycle for defining business meaning and the separate lifecycle for defining common standard structures. In the meantime, explore how the IBM Watson Data Platform can form the foundation for your enterprise data lake.