
Four perspectives on data lakes

Distinguished Engineer, IBM Analytics Group CTO Office, IBM

Recently I was involved in creating a series of short videos about data lakes with a number of other IBM colleagues. These videos introduce four perspectives: architecture, value, innovation and governance. Data lakes are a very popular concept in the industry at the moment, but definitions of a data lake vary widely.

My view is that a data lake is a reference architecture that balances the desire for easy access to data with information governance and security. The data lake reference architecture describes the technical capabilities necessary for a system of insight, while being independent of specific technologies. Being technology independent is important because most organizations already have investments in data platforms that they want to incorporate in their solution. In addition, technology is continually improving, and the choice of technology is often dictated by the volume, variety, and velocity of the data being managed.

A system of insight needs more than technology to succeed. The data lake reference architecture includes descriptions of governance and management processes, along with definitions, to ensure that the human and business systems around the technology support a collaborative, self-service, and safe environment for data use.

Governance is a practice that you apply to “something.” Just like James Watt’s fly-ball governor for the steam engine, a governance program seeks to keep an engine in balance so it works effectively. This engine may be a process, an organization, or a flow of information. The important point is that the target of what you are governing is clearly defined.

Approaches to governance, particularly around a data lake, vary widely because organizations make different choices in defining the engine being managed. For example, the IT department may see the data lake engine as a collection of technology working together. The business may see the data lake as part of an innovation engine that helps them create new value from data. So which is the right engine to govern? It depends on the objective for the data lake. A good starting point in defining the governance program is to consider the perspective of each of the principal groups of users of the data lake and define the engine that each group sees. Then think about what mechanisms it would take to balance effort and value within each of these perspectives.

For example, the owner of a system that is supplying data to the data lake is required to maintain the catalog entry for the data coming from their system, and in return, they could get analysis on the quality or consistency of this data that helps them provide a better service to their users. A data scientist may be restricted in how they work with sensitive data, but in return they get a rich catalog of data to choose from and easy processes to get permission to use the data sets they need. They may also be given the ability to contribute data and content for the catalog. The more they contribute, the easier the discovery process becomes. By balancing the needs of the suppliers with the needs of the consumers, the balance of effort and value is achieved, creating a sustainable ecosystem.

In addition to designing the governance program to the perspective of the users, it is also necessary to decide who is in control of the data lake. Whether it is IT or the business will affect how the data lake is governed. When IT is in control, normal IT governance can manage many aspects of the data lake. However, when the business is in control, the mechanisms that operate the data lake, and the classifications that identify the different types of data, need to be abstracted through services and metadata to create a view of the data lake that makes sense to the business and can be modified by them as needed. This view is then mapped to the actual data and technology through the metadata in the catalog, and the metadata settings are used by the data lake services to drive the behavior of the data lake.
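To make the idea of metadata driving behavior a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the classification names, the `CLASSIFICATION_RULES` table, the `CatalogEntry` fields and the `access_decision` function are hypothetical and do not come from any IBM product or from the reference architecture itself. The point is only that the business edits the rules table while the service code stays unchanged.

```python
from dataclasses import dataclass

# Hypothetical business-facing classifications mapped to handling rules.
# In this sketch the business owns this table; the service below only reads it.
CLASSIFICATION_RULES = {
    "public": {"mask": False, "approval_required": False},
    "internal": {"mask": False, "approval_required": True},
    "sensitive": {"mask": True, "approval_required": True},
}


@dataclass
class CatalogEntry:
    """Hypothetical catalog record linking the business view to the physical data."""
    name: str            # business-friendly name shown in the catalog
    classification: str  # key into CLASSIFICATION_RULES
    location: str        # physical location, hidden from the business view


def access_decision(entry: CatalogEntry, has_approval: bool) -> str:
    """Decide how a data lake service should respond, using only metadata."""
    rules = CLASSIFICATION_RULES[entry.classification]
    if rules["approval_required"] and not has_approval:
        return "denied"
    return "masked" if rules["mask"] else "granted"
```

In this sketch, changing how sensitive data is handled means editing one row of the table rather than changing the service, which is the kind of separation the paragraph above describes.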

Once the engine has been defined, the governance program is designed in the normal way:

  • Setting standards for the metadata, formats and best practices for the data lake.
  • Measuring and monitoring the adherence to these standards.
  • Taking action as appropriate, such as managing exceptions, answering compliance questions and modifying the program based on feedback.
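The three steps above can be sketched as a small check over catalog entries. This is only an illustration under stated assumptions: the required fields in `REQUIRED_METADATA`, the dictionary-based catalog and the function names are hypothetical, and a real program would measure far more than missing fields.

```python
# Step 1: the standard. These required fields are a hypothetical example.
REQUIRED_METADATA = ("owner", "classification", "description")


def missing_fields(entry: dict) -> list:
    """Return the required metadata fields that are absent or empty."""
    return [field for field in REQUIRED_METADATA if not entry.get(field)]


def adherence_report(catalog: dict) -> tuple:
    """Step 2: measure adherence across the catalog.

    Returns the fraction of compliant entries and a dict of exceptions
    (entry name -> missing fields) for step 3, taking action.
    """
    exceptions = {}
    for name, entry in catalog.items():
        missing = missing_fields(entry)
        if missing:
            exceptions[name] = missing
    compliant = len(catalog) - len(exceptions)
    rate = compliant / len(catalog) if catalog else 1.0
    return rate, exceptions
```

The exceptions list is what feeds the third step: each entry on it is either fixed by its owner or recorded as an approved exception, and a persistently low adherence rate is a signal that the standard itself may need to change.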

I would like to end by emphasizing the importance of feedback in achieving balance and value. Governance programs must be dynamic and must demonstrate the value that they deliver. The feedback mechanisms should not be forgotten, as they enable the governance program to stay relevant to the changing needs of the business, which in turn change the nature of the engines we need to govern.

Please visit the IBM Data Lake on ibm.com to view the video series and much more.

About Mandy Chessell