Healthy Data for Smarter Analytics

Distinguished Engineer, IBM

We all want to implement smarter analytics - to turn information into insights that we can use to improve both our personal and professional lives. Yet we are often faced with a fundamental, nagging problem: getting the right data for whatever analytics we are trying to implement. It’s not that we don’t have data - although sometimes that is indeed the problem. It’s more that we don’t have the right data.

What is the right data? There are a lot of characteristics that, when combined, make us believe that data is right.  We’ll explore many of these characteristics in future posts, but to simplify dramatically, we need data that is correct, clean and current. 

In discussions with clients from diverse organizations, I’ve come to recognize that there are many challenges to finding, preparing and providing such data, particularly for analytics. Why particularly for analytics? Well, if we want to make important decisions based on the insight gleaned from analytics, then we want to believe that we have meaningful results. And following the well-known Garbage-In-Garbage-Out principle, if the wrong data goes in, we will likely get the wrong answer out. So if we are trying to make important decisions based on these answers, it’s pretty important that we are feeding our analytics with good data.

When we peer into organizations’ data, the quantity, diversity, quality, and distribution can be overwhelming.  In fact, we might view it as unhealthy - at least for the purposes of analytics.

Of course, the information requirements for different analytics depend on what you are trying to do - and how.

If you are trying to understand customer buying behavior over the last six months, you probably don’t need years’ worth of data. However, if you are trying to create a reusable data set for many kinds of analytics, perhaps you do need many years of history. If you are trying to find your most popular products by geographic region, then you need to make sure that the data you feed in has a set of consistent and accurate product and region identifiers.
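To make the two examples above a little more concrete, here is a minimal sketch in Python of what "the right amount" and "consistent identifiers" might look like in practice. Everything in it is an assumption made for illustration: the field names (product_id, region, sold_on), the list of known regions, the sample records, and the 183-day window standing in for "six months". It is not a recommended implementation, just one way to picture the idea.

```python
from datetime import date, timedelta

# Hypothetical sample records; the field names are assumptions for illustration.
SALES = [
    {"product_id": "P-100", "region": "EMEA", "sold_on": date(2024, 5, 10)},
    {"product_id": "p-100 ", "region": "emea", "sold_on": date(2024, 4, 2)},
    {"product_id": "P-200", "region": "APAC", "sold_on": date(2021, 1, 15)},
    {"product_id": "P-300", "region": "Mars", "sold_on": date(2024, 3, 1)},
]

# Hypothetical reference list of valid region identifiers.
KNOWN_REGIONS = {"EMEA", "APAC", "AMER"}


def normalize(record):
    """Return a copy of the record with identifiers trimmed and upper-cased."""
    return {
        **record,
        "product_id": record["product_id"].strip().upper(),
        "region": record["region"].strip().upper(),
    }


def healthy_subset(records, as_of, window_days=183):
    """Split records into those fit for the analysis and those rejected.

    A record is kept if it falls inside the time window ending at `as_of`
    and its normalized region is in KNOWN_REGIONS.
    """
    cutoff = as_of - timedelta(days=window_days)
    kept, rejected = [], []
    for record in map(normalize, records):
        if record["sold_on"] < cutoff:
            rejected.append((record, "outside time window"))
        elif record["region"] not in KNOWN_REGIONS:
            rejected.append((record, "unknown region"))
        else:
            kept.append(record)
    return kept, rejected


kept, rejected = healthy_subset(SALES, as_of=date(2024, 6, 1))
```

Keeping the rejected records together with a reason, rather than silently dropping them, is a deliberate choice in this sketch: it lets you see how much of the data was unfit for the question being asked, and why.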

What we want are bodies of healthy data: sets of data with the right amounts of the right data that is structured, reconciled and cleansed for the different kinds of analytics that we want to perform.

With this notion of healthy data, we can now start to look around and ask if our data is healthy - and if not, how do we get to healthy data? This blog series will focus on discussing just this question. Not that I’ll have all of the answers. I’m sure that many of you will have your own views - and often they’ll be just as valid as my own (except of course when I’m wrong).  In the next installment, we’ll talk about the overall process for getting to healthy data.