Charting the data lake: Building a business language for data lake

Senior Technical Staff Member, IBM Analytics

In Douglas Adams’ The Hitchhiker’s Guide to the Galaxy book series, the Babel fish was a rather strange entity that enabled effortless communication between different life forms, overcoming the challenge of their many different languages. While it may be a bit of a stretch to equate the data lake to a galaxy, let alone to refer to the various users of the data lake as “life forms,” it can be a somewhat useful metaphor. When the data lake is deployed as an infrastructure to be exploited by different users in various departments, each with their own needs, requirements and often their own dialects of business language, a Babel fish-like universal translator can become very useful. This is especially true given the additional complexity caused by the range of different data structures that may occupy the data lake, and the fact that in many cases such structures have been populated with the emphasis on ease and speed of ingestion, often at the cost of any definition of their meaning and purpose. Added to all of that are the growing demands within many organizations that big data initiatives like data lakes must exist first and foremost to address, and be aligned with, the business objectives.

One of the solutions that some organizations are reaching for to address this range of issues is to create some overarching business vocabulary that:

  • Enables a common language among the different business users across the organization.
  • Provides a critical communication device between business and their IT counterparts tasked with maintaining the data lake infrastructure.
  • Establishes a reference point when aligning new or acquired businesses into the overall organization.
  • Provides a basis for the identification of gaps and overlaps between different projects or activities that the enterprise may be engaged in.
  • Underpins the overall governance processes associated with the data lake.
  • Provides the basis for critical data lake capabilities such as enabling the business users to adopt a self-service approach to finding and using the required data lake artifacts.

One way of looking at this business vocabulary is to consider it as a network of business terms, categories and technical asset definitions that both describes the range of artifacts in the data lake and connects these artifacts to the set of business terms in use by the different business users. In many cases, this business vocabulary may also be extended to describe relevant artifacts beyond the data lake, for example in the Systems of Record or Systems of Engagement.
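To make the idea of a network of terms, categories and technical assets a little more concrete, here is a minimal Python sketch. It is purely illustrative: the class names, the "Customer" term and the HDFS path are hypothetical, and a real data lake catalog would hold far richer metadata.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TechnicalAsset:
    """A physical artifact in the data lake, such as a table or a file (hypothetical)."""
    name: str
    location: str


@dataclass
class BusinessTerm:
    """A business term, its category, and the assets that carry its data."""
    name: str
    definition: str
    category: str
    assets: list[TechnicalAsset] = field(default_factory=list)


class Vocabulary:
    """Holds terms by case-insensitive name and links them to assets."""

    def __init__(self) -> None:
        self._terms: dict[str, BusinessTerm] = {}

    def add_term(self, term: BusinessTerm) -> None:
        self._terms[term.name.lower()] = term

    def link(self, term_name: str, asset: TechnicalAsset) -> None:
        # Raises KeyError if the term has not been defined yet.
        self._terms[term_name.lower()].assets.append(asset)

    def assets_for(self, term_name: str) -> list[str]:
        """Business view: which locations hold data for this term?"""
        term = self._terms.get(term_name.lower())
        return [a.location for a in term.assets] if term else []

    def terms_for(self, location: str) -> list[str]:
        """Technical view: which business terms describe this location?"""
        return sorted(
            t.name
            for t in self._terms.values()
            if any(a.location == location for a in t.assets)
        )


vocab = Vocabulary()
vocab.add_term(BusinessTerm("Customer", "A party that buys products.", "Party"))
vocab.link("Customer", TechnicalAsset("crm_customers", "hdfs:///lake/raw/crm/customers"))
print(vocab.assets_for("customer"))
```

The two lookup methods reflect the two directions of the network: business users start from a term and find data, while IT users start from an asset and find its business meaning.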

Unlike the Babel fish, which to operate simply required the insertion of the fish into one’s ear, a common communication device for the data lake requires both initial setup and ongoing maintenance. Indeed, the growing realization is that such a business vocabulary should be managed and governed via a formal development lifecycle. That lifecycle enables the ongoing evolution of the business language so that it reflects both the underlying data lake artifacts and the ever-changing needs and requirements of the business users. A critical success factor for the ongoing acceptance and successful use of such a vocabulary is significant involvement by the different business stakeholders.

As it grows with an increasing number of different artifacts and is used to support various users from different lines of business, the typical data lake can rapidly become a significant and potentially overwhelming universe for any single user to navigate. So there is a demand for a common business vocabulary to assist such self-service and navigation activities by the various users.   

For many organizations looking to build data lakes that support the enterprise, there is a growing recognition of the need for some means of enabling efficient, consistent communication and understanding of the business meaning of the contents of the data lake. For many of them, an extensive business vocabulary is the means to achieve that basis for common communication.

Considerations for creating a business vocabulary

In determining how to create a business vocabulary, it may be useful to consider some or all of the following:

  • Who owns the business vocabulary – this is critical both in terms of shaping the business vocabulary, contributing to the budget to ensure it is maintained and taking an active role in the ongoing governance. While the most appropriate owner is likely to vary from organization to organization, in general the chief data officer or some other business-oriented owner is preferable.
  • What are the components of the business vocabulary – in order to address the needs of the different users and enable a working governance process, it may be necessary to define a number of components within the vocabulary. Examples include components focused on the needs of particular groups of users, components intended to provide IT users with a very precise taxonomy for mapping multiple systems, and components intended to underpin the effective flow of a term through the appropriate governance phases.
  • What are the needs of the business users – this potentially covers a range of considerations, including what parts of the enterprise vocabulary are appropriate for different users, whether there is a need for synonyms to incorporate “local” terminology, what level of complexity of business vocabulary structure is required, and whether there is a need to include more technical artifacts (for example data model diagrams) to provide further context to the business users.
  • What are the likely patterns of use – are the users likely to be predominantly searching for information, or do they need to do more wide-ranging discovery or navigation across the data?
  • What are the required types of vocabulary to be used – the business vocabulary itself could range from a simple flat glossary of terms to an extensive taxonomy with a deep hierarchy of terms. Is there a need to provide pre-defined categorizations within the glossary that align with recognized business functions?
  • How to integrate with the data lake – the basic assumption is that the business vocabulary is a run-time artifact that is part of the data lake catalog. The terms in this business vocabulary are linked to the relevant structures and repositories across the data lake.
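Several of these considerations, namely “local” synonyms, flat glossary versus taxonomy, and search versus navigation, can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions rather than a design: the terms, the synonym table and the function names are all hypothetical.

```python
from typing import Optional

# Hypothetical taxonomy: each term maps to its parent (None marks a root).
# A flat glossary would simply have None as the parent of every term.
PARENT = {
    "Party": None,
    "Customer": "Party",
    "Retail Customer": "Customer",
    "Supplier": "Party",
}

# Hypothetical "local" synonyms mapped to the preferred enterprise term.
SYNONYMS = {
    "client": "Customer",
    "vendor": "Supplier",
}


def resolve(name: str) -> Optional[str]:
    """Search pattern: map a user's word to the preferred term, if any."""
    key = name.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    for term in PARENT:
        if term.lower() == key:
            return term
    return None


def path_to_root(term: str) -> list:
    """Navigation pattern: walk up from a term to its broadest ancestor."""
    path = []
    current = term
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def narrower_terms(term: str) -> list:
    """Navigation pattern: list the direct children of a term."""
    return sorted(child for child, parent in PARENT.items() if parent == term)


print(resolve("Client"))
print(path_to_root("Retail Customer"))
print(narrower_terms("Party"))
```

A department that says “client” still lands on the enterprise term “Customer”, and a user who wants to browse rather than search can move up or down the hierarchy from there.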

Managing and governing the business vocabulary

There is a range of different aspects to the governance of the business vocabulary, such as the roles and flow involved in the governance process. Considerations may include what levels of governance are appropriate for particular parts of the vocabulary and what level of governance, if any, is to be applied to the management of any local or departmental terms.

Another key consideration is to determine the most appropriate governance process to enable a tight integration with the run-time data lake environment. This includes how to accommodate feedback from the technical or business users, how to incorporate potential input to the business vocabulary from a range of external and internal sources (e.g. external standards, existing internal glossaries), and how to ensure that changes to the business vocabulary are efficiently deployed for use with the run-time data lake environments.
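One way to picture the flow of a term through governance phases is as a small state machine. The Python sketch below is an assumption for illustration only: the four states and the allowed transitions are hypothetical and are not taken from any particular governance tool or from the IBM tooling mentioned in this post.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    PUBLISHED = "published"


# Hypothetical workflow: which states a term may move to from each state.
ALLOWED = {
    State.DRAFT: {State.IN_REVIEW},
    State.IN_REVIEW: {State.DRAFT, State.APPROVED},  # reviewers may send it back
    State.APPROVED: {State.PUBLISHED},
    State.PUBLISHED: {State.DRAFT},  # reopen to start a new revision
}


class GovernedTerm:
    """A business term that records every governance state it passes through."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = State.DRAFT
        self.history = [State.DRAFT]

    def move_to(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)


term = GovernedTerm("Customer")
term.move_to(State.IN_REVIEW)
term.move_to(State.APPROVED)
term.move_to(State.PUBLISHED)
print([s.value for s in term.history])
```

Only terms that reach the published state would be deployed to the run-time catalog, and a lighter set of transitions could be defined for local or departmental terms if less governance is appropriate there.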

Using the IBM Industry Models to build the business vocabulary

A number of organizations are successfully using the IBM Industry Models as the basis for their data lake business vocabulary.  The typical motivations underpinning the use of these models include:

  • The availability of a comprehensive set of fully defined and industry-specific business terms.
  • The use of pre-defined industry-specific structures to provide the framework on which to base the business vocabulary.
  • The existing linkage of the terms in the IBM models between the business vocabulary and more technically oriented data models.
  • The pre-integration of the IBM models with IBM metadata tooling such as InfoSphere Information Governance Catalog.

Find more information on the use of IBM Industry Models in building business vocabularies.  

What about ontologies?

Those who know our industry models will realize that today the basis for the business vocabulary is, depending on the industry, either a flat glossary of terms or a taxonomy with a somewhat deeper hierarchy of terms. Ontologies have been around for quite some time, but have mainly been confined to the academic domain. In recent times, however, more practical uses are being identified for the more structured set of semantic representations encapsulated in ontologies. Examples include the emergence of semantic web technologies underpinned by ontologies, the advent of some industry-standard ontologies and, of course, the heavy use of ontologies in IBM Watson. There is a lot of discussion about the role that such extended semantic structures might play as part of the business vocabulary for the data lake. That is an area that may see further focus and expansion in the future, perhaps for discussion in a future blog!

The next edition of the “Charting the data lake” blog series will look at another main aspect of the use of models in a data lake: the role of the models in the creation of schema-at-read and schema-at-write Hadoop/HDFS and other structures in the data lake, including when to use the models in such deployments and when not to. In the meantime, explore how the IBM Watson Data Platform and a DataFirst approach can provide data that’s easy, accessible and working for you.