Building a data reservoir to use big data with confidence

Distinguished Engineer, IBM Analytics Group CTO Office, IBM

Data reservoirs promote continuous innovation by leveraging data and analytics to drive an organization more effectively, which enables new products and services, improves customer service and efficiency, and reduces waste and fraud. A data reservoir is both a production service for running the analytics that support the business and a discovery and exploration service for the ad hoc analysis of data and the generation of new analytic models.

Building a data reservoir

Bringing the development of analytics and the production execution of analytics closer together creates greater agility for analytics development. However, this approach brings its own challenges:

  • How is valuable and sensitive data protected?
  • How is the integrity of the production environment assured?
  • How are the complexity and inconsistencies of production data made consumable to data scientists and analysts?
  • How are new innovations in data platforms and analytics engines adopted?
  • How do people find the data they need?

The data reservoir reference architecture explains how these challenges are addressed through the capabilities shown in the following illustration.

Considerations for a well-managed and governed data lake (also known as a data reservoir).

At the heart of the data reservoir are a variety of data platforms, including relational database servers, Apache Hadoop and other NoSQL data platforms. There is so much ongoing innovation in data platforms today—aimed at enabling the efficient processing of new types of analytics through novel data structures—that it is necessary to allow for the continual introduction of new technology. 
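To make that concrete, here is a minimal sketch, assuming a hypothetical platform-neutral interface (the class and method names are illustrative, not any IBM product API), of how a new data platform technology could be introduced behind a stable contract:

    from abc import ABC, abstractmethod
    from typing import Iterable, Mapping

    class DataPlatform(ABC):
        """Hypothetical, platform-neutral contract a new repository technology can implement."""

        @abstractmethod
        def read(self, collection: str, query: Mapping) -> Iterable[Mapping]:
            """Return records from a named collection that match a simple equality query."""

        @abstractmethod
        def write(self, collection: str, records: Iterable[Mapping]) -> int:
            """Store records and return the number written."""

    class InMemoryPlatform(DataPlatform):
        """Toy stand-in for a relational, Hadoop, or NoSQL platform."""

        def __init__(self):
            self._store = {}

        def read(self, collection, query):
            return [r for r in self._store.get(collection, [])
                    if all(r.get(k) == v for k, v in query.items())]

        def write(self, collection, records):
            rows = list(records)
            self._store.setdefault(collection, []).extend(rows)
            return len(rows)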

The data hosted on each data platform is organized into collections of related data—the data reservoir repositories.

All data in the data reservoir is described in the data reservoir’s catalog. This catalog defines what data is present, where it came from, who owns it and its characteristics. The catalog is the key to locating and understanding the data available in the data reservoir.
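As a sketch of what such a catalog entry might hold, assuming a simple in-memory catalog (the field names are illustrative, not a specific catalog schema):

    from dataclasses import dataclass, field

    @dataclass
    class CatalogEntry:
        """Describes one collection of data held in the reservoir."""
        name: str                       # what data is present
        source_system: str              # where it came from
        owner: str                      # who owns it
        classification: str             # e.g. "public" or "sensitive-personal"
        description: str = ""           # business meaning, added by a curator
        tags: list = field(default_factory=list)

    class Catalog:
        """Simple keyword lookup so people can find the data they need."""

        def __init__(self):
            self._entries = {}

        def register(self, entry: CatalogEntry):
            self._entries[entry.name] = entry

        def find(self, keyword: str):
            keyword = keyword.lower()
            return [e for e in self._entries.values()
                    if keyword in (e.name + " " + e.description).lower()]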

In the data reservoir, no person or external system is given direct access to the data reservoir repositories. This data is considered production data and access is restricted to privileged processes that make up the data reservoir services that surround the data reservoir repositories.
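A minimal sketch of that rule, reusing the platform interface sketched above and assuming a hypothetical registry of privileged reservoir services:

    class RepositoryGateway:
        """Mediates every repository access; only registered reservoir services may call through."""

        def __init__(self, platform, privileged_services):
            self._platform = platform                  # a DataPlatform-style object
            self._privileged = set(privileged_services)

        def read(self, caller: str, collection: str, query: dict):
            if caller not in self._privileged:
                raise PermissionError(
                    f"{caller} is not a reservoir service and may not access repositories directly")
            return self._platform.read(collection, query)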

The data reservoir needs a constant flow of new data, either from existing known sources or new sources. It has a sophisticated set of services for interchanging data with a wide variety of systems.

As new data sources are connected to the data reservoir, a curator adds descriptions for them and the data they bring to the data reservoir’s catalog. The curator understands the source system and augments the knowledge of the data source that can be automatically discovered with information about the business use, meaning and significance of the data. This information is invaluable to data scientists who want to use the data, and to the information governance team concerned with managing and protecting this data appropriately. The catalog entry for the data source acts as a contract between the data source owner and the data reservoir team for how data will be governed and used in the data reservoir. Through this transparency, trust grows in the ability to share data and to consume it from many sources.
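The sketch below illustrates the idea of the catalog entry as a contract; the SourceContract fields (permitted uses, retention period, and so on) are hypothetical examples of the terms a curator and source owner might record:

    from dataclasses import dataclass

    @dataclass
    class SourceContract:
        """Hypothetical catalog record agreed between a data source owner and the reservoir team."""
        source_name: str
        source_owner: str
        discovered_schema: dict    # what automated discovery found
        business_meaning: str      # significance of the data, added by the curator
        permitted_uses: list       # e.g. ["fraud-analytics", "reporting"]
        retention_days: int        # how long the reservoir may keep the data

    def register_source(catalog: dict, contract: SourceContract) -> SourceContract:
        """Record the contract so data scientists and the governance team can see it."""
        catalog[contract.source_name] = contract
        return contract

    catalog = {}
    register_source(catalog, SourceContract(
        source_name="claims-feed",
        source_owner="claims-operations",
        discovered_schema={"claim_id": "string", "amount": "decimal"},
        business_meaning="Daily insurance claims used for fraud analytics",
        permitted_uses=["fraud-analytics"],
        retention_days=730,
    ))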

Information governance policies and rules are located in the catalog. These define how data of a specific classification should be governed. The business owner of each subject area (or domain) of data maintains these governance definitions.
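For example, a minimal sketch of classification-driven rules (the classifications and rule values below are invented for illustration):

    # Hypothetical mapping from data classification to governance rules.
    GOVERNANCE_POLICIES = {
        "public":             {"mask": False, "retention_days": 3650},
        "sensitive-personal": {"mask": True,  "retention_days": 365},
        "financial":          {"mask": True,  "retention_days": 2555},
    }

    def rules_for(classification: str) -> dict:
        """Return the governance rules that apply to data of the given classification."""
        # Unclassified data falls back to the most restrictive treatment.
        return GOVERNANCE_POLICIES.get(
            classification, {"mask": True, "retention_days": 365})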

Inside the reservoir, processes actively monitor usage and management to ensure conformance to the governance program specified in the catalog.

The data reservoir provides services for the agile development of analytics, backed with quality assurance processes that enable the deployment of analytics into the data platforms. Production data is protected through data-centric security, which is invoked by the self-service functions that provision sandboxes for the development of new analytics. From these sandboxes, data scientists can take advantage of any analytics tools that can connect to the data.
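A hedged sketch of that provisioning step, assuming masking as the data-centric protection (the field names and masking approach are illustrative only):

    import copy

    def mask_record(record: dict, sensitive_fields: set) -> dict:
        """Apply data-centric protection to a single record before it leaves the repositories."""
        masked = copy.deepcopy(record)
        for field_name in sensitive_fields:
            if field_name in masked:
                masked[field_name] = "***MASKED***"
        return masked

    def provision_sandbox(records, sensitive_fields):
        """Self-service provisioning: copy data into a sandbox with protection already applied."""
        return [mask_record(r, sensitive_fields) for r in records]

    # Example: a data scientist requests claims data; customer names are masked on the way in.
    sandbox = provision_sandbox(
        [{"claim_id": 1, "customer_name": "A. Example", "amount": 120.0}],
        sensitive_fields={"customer_name"},
    )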

Business value comes from the use of data and the resulting analytical insight. The people who work within an organization need access to the data and analytics. This access is provided through simplified views of the data and is protected by the same data-centric security that applies to data scientists.

Conclusion

Building and operating a successful big data and analytics environment involves specialist technology, but this is not sufficient. An organization can become truly data-driven only by balancing the active participation of the business and technical teams in analytics with the need to protect data and control costs. Getting the maximum value from big data and analytics requires a new type of environment where data can be assembled for analysis and the generation of insight—in other words, a data reservoir.

For more information

Join me in my session at IBM Insight 2015, 25–29 October, in Las Vegas, where we will explore these capabilities and recommended implementation approaches.

DII-1333: Creating a Data Reservoir to Use Big Data With Confidence
Session Type: Breakout Session – Technical
Date/Time: Wednesday, 28 October, 10:30 AM – 11:30 AM
Venue: Mandalay Bay South Convention Center Level 2
Room: Lagoon K

My publications about the data reservoir provide more detail: you can use the data reservoir reference materials and related IBM tools to design your big data and analytics environment for self-service access and governance.

Be sure to register to hear me and others speak at IBM Insight. And to accelerate your career journey into advanced analytics and data science, you can explore this informational resource page at IBM Analytics.