Data Lakes, Analyst Observations, and Reality

Fit-for-purpose architectures can bring business outcomes down to earth

Solution CTO, IBM

A recent warning from Gartner that the hype around certain data lakes is running a bit ahead of actual product releases has created quite a lot of buzz. What makes this news especially interesting is that the research report that triggered all the hubbub was actually pretty mild. In essence, it said that data lakes are useful, but that organizations need to keep their eyes open to gaps in security, governance and the many other attributes that can be taken for granted in well-defined, mature enterprise data warehouse (EDW) environments. Who would've thought such a sentiment would stir up so much controversy in the big data arena? Actually, I did.

For a while now, some of us in the big data space have been warning that the laws of gravity still apply to big data. In other words, no one-size-fits-all solution works in practice, and no new or existing technology is perfect. I cautioned clients back in 2011 that if organizations need governance and data quality competencies to keep an EDW environment healthy and appropriately managed, they are going to need the same competencies for anything they use to replace or supplant that environment, including a big data environment. To put it mildly, governance and data quality have not been the focus of the niche data lake vendors, and having Gartner call them on it was bound to provoke a response.

Shifting the focus

Not surprisingly, the big data vendors that focus only on data lakes are up in arms about this warning. Their reaction can be likened to a case of "when you have a hammer, everything looks like a nail." Having Gartner and other analysts shed some skeptical light on those claims, pointing out that a nail isn't the only fastener in the toolbox and that sometimes a screw or a bolt is called for instead, was received as a threat rather than as a necessary bit of real-world caution. More specifically, Apache Hadoop is a fantastic technology when it is used in the right place, at the right time and with the right skills. Those caveats often mean the work is being done in a data exploration zone, where the data is understood to be variable and isn't treated as master data. Data exploration zones are rapid-analytics environments for working with data, in whatever volume and variety it arrives, to gain business insight that hasn't been derived before. A data exploration zone is an example of a fit-for-purpose architecture: one designed to sprint toward understanding the business outcome, in the context of the operating environment, first.
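To make the idea of a data exploration zone a bit more concrete, here is a minimal, illustrative sketch of the kind of rapid analytics such a zone supports, using PySpark over raw data landed in Hadoop. The path, field names and business question are hypothetical, and any real exploration zone would be shaped by the data and outcomes in play.

    # Illustrative sketch only: a quick exploratory query over raw, variable data
    # landed in a Hadoop-backed exploration zone. The path and fields
    # ("clickstream/raw", "channel", "revenue") are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("exploration-zone-sketch").getOrCreate()

    # Read the semi-structured data as it landed: no upfront modeling or cleansing,
    # with the schema inferred at read time rather than enforced on write.
    raw = spark.read.json("hdfs:///exploration/clickstream/raw/")

    # Ask a quick business question: which channels drive the most revenue?
    (raw.filter(F.col("revenue").isNotNull())
        .groupBy("channel")
        .agg(F.sum("revenue").alias("total_revenue"))
        .orderBy(F.desc("total_revenue"))
        .show(10))

The point is the working style, not the particular query: the data is read as it landed, the schema is inferred rather than enforced, and the question gets answered quickly, with governance and data quality deferred until the insight proves worth taking to production.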

Driving business outcomes

In such situations, outcomes are being driven very rapidly, without trying to stand up a tightly managed EDW equivalent. The EDW or production Hadoop environment, if appropriate, comes after the business case and the business outcomes have made it worthwhile to build one. At that point, all the technologies need to be back on the table under a fit-for-purpose architecture, with a specific emphasis on governance and data quality. In other words, as you move from exploration to production, Gartner's concerns become critical requirements. Please share any thoughts or questions in the comments.