In-Hadoop analytics

Program Director, IBM

I've been a proponent of the data warehouse for a long time: I cut my teeth on decision support and OLAP systems way back when, travelled and presented with Ralph Kimball and invented technologies that brought analytics closer to the data warehouse. So you won't be surprised to hear me say that your big data systems should be working hand in hand with the data warehouse, and analytics should be applied to both data sets.

Data questions

Your data warehouse typically contains master data, operational data and financial data, all of which is used to help understand the business. There are feedback flows to operational systems to help with operational decision making, but the majority of questions the data warehouse answers are about the business in aggregate, and they're in the form “what happened?” These are the questions that help us understand where the business is, so we can plot a course and steer into the future. Think of the data warehouse as the navigational system for the business. At any time you can ask “where are we?” and thanks to the data warehouse, you will get a very accurate answer.

Once you know where you are, the obvious next question is “where are we going?” That's a more difficult question to answer, particularly from the data warehouse where the data has been cleansed, transformed, conformed, aggregated and otherwise managed to make it consumable. The processes that help us understand this data also mask our ability to apply analytics to the deep details that help us understand where we're going.

Beyond the data that's landing in Hadoop on its way to the data warehouse, there's an awful lot of data that's not going anywhere. Some of it may find its way into an archive, some not, and yet there's valuable information to be found in this data if it can be stored and analyzed. What can your machine logs tell you about utilization and waste? What can your web logs tell you about customer behavior? What can your call center logs tell you about your products and problems? What can the details tell you about risk, fraud, efficiency and what it's like to be a customer of yours? What does the business look like if you blend this data together?

Then extend your horizon and look outside the business. What are your customers saying about your products? Are they happy? Who is buying? Who wants to buy? Who is buying a competitive product, and why? How can you reach more potential customers and what can you do to help them decide to buy?

Making it consumable

Traditional extract, transform and load (ETL) processes make operational data consumable from the data warehouse. As we move further away from operational data and start to make use of deeper, broader and richer big data, analytics play a key role in making it consumable. Much of this data is text, so we need powerful text extraction capabilities that can structure interesting entities from this complex free text. Also, much of this data is extremely detailed and needs to be subject to machine learning algorithms that can present an aggregate view based on deep analysis.
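To make the idea of structuring entities from free text concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the call center notes, the product code format and the function names are all hypothetical, and a real text analytics engine would go far beyond simple pattern matching.

```python
import re
from collections import Counter

# Hypothetical call center notes; the layout and product codes are invented.
NOTES = [
    "2014-08-01 Customer reports router RX-200 drops wifi every hour",
    "2014-08-01 Caller asked about billing for modem MD-10",
    "2014-08-02 RX-200 overheating, customer wants refund",
]

# Assumed conventions: a note starts with an ISO date, and a product code
# is two uppercase letters, a hyphen, then digits.
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}")
PRODUCT = re.compile(r"\b[A-Z]{2}-\d+\b")


def extract(note):
    """Turn one free-text note into a structured record."""
    date = DATE.match(note)
    return {
        "date": date.group(0) if date else None,
        "products": PRODUCT.findall(note),
    }


def product_mentions(notes):
    """Aggregate view: how many times each product code is mentioned."""
    counts = Counter()
    for note in notes:
        counts.update(extract(note)["products"])
    return counts
```

Calling `product_mentions(NOTES)` on the sample notes counts two mentions of the hypothetical RX-200 and one of MD-10, which is the kind of aggregate view that becomes consumable once the text has been structured.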

Statistical and data mining techniques should be applied to this raw and detailed data. Geospatial analysis, video, audio and temporal analysis all have potential applications if we have a platform that can store the atomic data and efficiently apply compute-intensive algorithms to large data volumes. This is where our Hadoop-based big data systems shine, because they allow us to bring the algorithms to the data.

As a data warehouse advocate, I want to see this new information find its way into the data warehouse. For that to happen, we need to land the new data in Hadoop, analyze it, then cleanse, transform, conform and aggregate it, and finally flow the results into the data warehouse. There they add new value, allowing business users to understand both where we are and where we're going.
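The flow from raw data to warehouse-ready results can be sketched in a few lines of Python. This is a toy, in-memory illustration rather than a Hadoop job: the web log records, field names and rules for what counts as clean or conformed are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw web log events, as they might land before any processing.
RAW_EVENTS = [
    {"customer": " C001 ", "page": "/Products/Widget", "ms": "120"},
    {"customer": "c001", "page": "/products/widget", "ms": "80"},
    {"customer": "C002", "page": "/support", "ms": "n/a"},  # bad duration
    {"customer": "", "page": "/support", "ms": "50"},       # missing customer
]


def cleanse(event):
    """Drop records with no customer or a non-numeric duration."""
    customer = event["customer"].strip()
    if not customer or not event["ms"].isdigit():
        return None
    return {"customer": customer, "page": event["page"], "ms": int(event["ms"])}


def conform(event):
    """Standardize keys to the warehouse's (assumed) conventions."""
    return {**event,
            "customer": event["customer"].upper(),
            "page": event["page"].lower()}


def aggregate(events):
    """Summarize to one row per customer and page."""
    totals = defaultdict(lambda: {"views": 0, "total_ms": 0})
    for e in events:
        row = totals[(e["customer"], e["page"])]
        row["views"] += 1
        row["total_ms"] += e["ms"]
    return [{"customer": c, "page": p, **vals}
            for (c, p), vals in sorted(totals.items())]


def build_warehouse_rows(raw):
    """Cleanse, conform and aggregate raw events into loadable rows."""
    cleaned = (cleanse(e) for e in raw)
    conformed = [conform(e) for e in cleaned if e is not None]
    return aggregate(conformed)
```

With the sample events, the two usable records for customer C001 collapse into a single summary row, while the two malformed records are dropped. The detail stays behind in the raw store, and only the consumable result moves on to the warehouse.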
