Incorporating machine learning in the data lake for robust business results

Director for Watson and AI applications, IBM

After my last blog on data monetization, I got several queries from customers, partners and colleagues. One common question was: if we start building a data lake, how can we make sure it is ready for advanced data monetization use cases? I write from first-hand experience with my customers at different stages of their journey in building a data lake.

First and foremost, the journey starts with proper governance strategies, which make the data lake a “trusted data lake.” Next, moving from a “trusted data lake” to data monetization involves well-defined machine learning models and data science capabilities incorporated in the data lake. This can be enabled by strong frameworks like Apache Spark. Spark goes beyond batch computation, providing a unified platform that supports interactive analytics and sophisticated data processing for machine learning and graph algorithms. The data lake is envisioned to deliver business outcomes ranging from real-time customer targeting and micro-segmentation to real-time risk and fraud alert monitoring and regulatory reporting.

Building a data lake is one of the stepping stones towards data monetization use cases and many other advanced revenue-generating and competitive edge use cases. What are the building blocks of a “cognitive trusted data lake” enabled by machine learning and data science?

1. Data lake based on open source and open standards

The foundation is a platform and technology framework that can ingest data in its full fidelity, in as close to its original and raw form as possible. Apache Hadoop provides an open source ecosystem in which you can bring together and link multiple data sets.

A platform based on open standards and open source offers a flexible framework. With schema on read, no up-front data modeling is required, so there are no disruptions or barriers to entry.
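To make schema on read concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The event fields, the `PURCHASE_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names chosen for illustration. The raw records are kept exactly as they arrived, and each consumer applies its own schema only when it reads them.

```python
import json

# Raw events are stored exactly as they arrived; nothing is enforced on write.
RAW_EVENTS = [
    '{"customer_id": "c1", "channel": "web", "amount": "19.99"}',
    '{"customer_id": "c2", "channel": "store", "amount": 5, "coupon": "SPRING"}',
    '{"customer_id": "c3", "channel": "app"}',
]

# Hypothetical schema chosen by one consumer at read time: field -> (type, default).
PURCHASE_SCHEMA = {
    "customer_id": (str, None),
    "channel": (str, "unknown"),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA))
for row in rows:
    print(row)
```

A different team could read the same raw events with a schema that includes `coupon`, without anyone having to reload or remodel the stored data.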

A case in point is a telco that is building a trusted data lake enabled by Apache Hadoop. The key is using an iterative process to build a full view of customers and their activities over time and across channels. As data scientists and power users explore the data and find customer insights, they can enrich the customer profile with their models and outcomes. Open source and open standards make this process agile, with a fail-fast approach. This is the first step towards a data monetization business use case and will lead to additional revenues for the organization.

2. Provide a data discovery and exploration facility for business users, analysts and data scientists 

While it is good practice to start any big data project with the use case, it is excellent practice to make space for experimentation and for future, unforeseen use cases. The data lake should serve data efficiently to business users and applications, ultimately helping IT meet its SLAs. A Hadoop-based data lake should have a strong, integrated toolset that supports self-service across the data discovery steps: data access, exploration, prep, visualization and analysis. Self-service exploration isn't for everyone; it succeeds only when provided to certain classes of users who are governed carefully. This brings us to the next step in the trusted data lake.
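As a small illustration of the exploration step, the sketch below profiles a handful of records in plain Python. The `profile` helper and the sample customer fields are hypothetical, and a real self-service toolset would do far more, but the idea is the same: before modeling, an analyst wants to know how complete and how varied each column is.

```python
from collections import defaultdict


def profile(rows):
    """Return, per column, the non-null count, missing count and distinct count."""
    non_null = defaultdict(int)
    values = defaultdict(set)
    columns = sorted({column for row in rows for column in row})
    for row in rows:
        for column, value in row.items():
            if value is not None:
                non_null[column] += 1
                values[column].add(value)
    total = len(rows)
    return {
        column: {
            "non_null": non_null[column],
            "missing": total - non_null[column],  # null or absent from the row
            "distinct": len(values[column]),
        }
        for column in columns
    }


# Hypothetical sample of customer records with gaps.
customers = [
    {"customer_id": "c1", "segment": "prepaid", "monthly_spend": 20.0},
    {"customer_id": "c2", "segment": "postpaid", "monthly_spend": None},
    {"customer_id": "c3", "segment": "prepaid"},
]

summary = profile(customers)
for column, stats in summary.items():
    print(column, stats)
```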

3. Governance strategies 

Lack of proper data governance strategies can lead to failures and data lakes turning out to be data swamps. Gartner famously predicted that “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” Comprehensive governance strategies with audit, lineage, metadata management and policy enforcement will ensure that a data lake becomes a “trusted data lake”. 

A robust governance framework for Hadoop should enforce compliance with data security, privacy and retention policies and processes, in order to maintain consumer trust and meet regulatory and legal requirements. Apache Ranger and Apache Atlas are emerging as leading ecosystem technologies to make data governance work for a Hadoop-based data lake.
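The sketch below shows, in plain Python, what policy enforcement with an audit trail can look like in principle. It does not use the Apache Ranger or Apache Atlas APIs; the `POLICY` table, the role names, and the `mask` and `read_record` helpers are assumptions made up for this example.

```python
from datetime import datetime, timezone

# Hypothetical policy: roles allowed to see each sensitive column unmasked.
POLICY = {
    "msisdn": {"fraud_analyst"},
    "email": {"fraud_analyst", "marketing"},
}

AUDIT_LOG = []


def mask(value):
    """Hide everything except the last two characters."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def read_record(record, user, role):
    """Return a policy-filtered copy of the record and log the access."""
    result = {}
    masked = []
    for column, value in record.items():
        allowed_roles = POLICY.get(column)
        if allowed_roles is not None and role not in allowed_roles:
            result[column] = mask(value)
            masked.append(column)
        else:
            result[column] = value
    AUDIT_LOG.append({
        "user": user,
        "role": role,
        "masked_columns": masked,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return result


record = {"customer_id": "c1", "msisdn": "447700900123", "email": "a@example.com"}
view = read_record(record, user="dana", role="marketing")
print(view)
print(AUDIT_LOG[-1])
```

Keeping the policy as data, separate from the code that reads records, is what lets a governance team change who can see what without touching every application.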

4. Robust machine learning framework 

Probably the least appreciated, yet rarest and most important, characteristic of the data lake is the incorporation of a machine learning framework like Apache Spark. This feature helps ensure that you get data monetization, customer experience and various other revenue-impacting use cases out of your “trusted data lake” project.

Apache Spark is a cluster computing engine well suited to large-scale machine learning tasks, and it comes with extensive libraries and language APIs such as MLlib (Spark's machine learning library), PySpark and SparkR. Apache SystemML, which originated at IBM and runs on Spark, is designed to make it easier to express machine learning algorithms.

Spark provides capabilities to implement distributed algorithms for fundamental statistical models (linear regression, logistic regression, principal component analysis) while tackling key problems from domains such as online advertising and cognitive neuroscience.
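To show why a model like linear regression distributes well, here is a minimal sketch in plain Python of the map-and-reduce pattern behind it. It is not Spark code and not how MLlib is implemented; the `partition_stats`, `combine` and `fit_line` helpers and the sample data are assumptions for illustration. Each partition is summarized independently as a few sums, the sums are added together, and the single-feature least-squares line is solved from the combined totals.

```python
from functools import reduce


def partition_stats(partition):
    """Map step: summarize one partition of (x, y) pairs as sufficient statistics."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in partition:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: statistics from two partitions simply add."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n, sx, sy, sxx, sxy = reduce(combine, map(partition_stats, partitions))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data split across three "workers"; the points lie on y = 2x + 1.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

slope, intercept = fit_line(partitions)
print(slope, intercept)
```

Because only five numbers per partition travel between workers, the cost of combining results stays constant no matter how many rows each partition holds.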

The Spark service, when added to the “trusted data lake,” provides an easy way to load data, examine it, and pass the results back to the application use case, creating a feedback loop, all without the work of setting up the supporting infrastructure.

Using IBM Data Science Experience with Watson Data Platform links the business with data science. Data scientists and business analysts can collaborate and elaborate on what is possible and on what can be digested by machine learning models written in R, Python or Scala. This provides a mechanism to close the gap between data scientists and business insight, so that business users on the ground can go out and drive business impact.