Healthy Data: Setting Achievable Goals

Distinguished Engineer, IBM

So far, we’ve discussed the definition of “healthy data” and provided an overview of the process for getting there. In the next few installments we’ll focus on each phase. We’ll start with a deeper discussion on setting achievable goals.

Great things are possible with discipline and hard work. While I can only speak for myself, it is certainly clear to me that no matter what regimen of diet and exercise I undertake, I’ll never be a world-class marathon runner. In fact, given my age and physique, I’d be lucky to complete a marathon at all.

Fortunately for my self-esteem, running a marathon has never been a personal goal of mine – let alone becoming a world-class runner. It is nice how expectations and reality can sometimes coincide.

What’s important to recognize is that everyone is different – in terms of both their current status and their potential.

What people – and information systems – can transform into depends in part on what they are today, as well as the environment and available resources. As we look at what we can achieve, we need to recognize the current health of our data – and the information systems that produce, transform and consume that data. Like dieting and exercise, we need to define realistic targets and timelines for what we want to achieve.

But what do we want to achieve? A simplistic answer is that we want to know, with reasonable certainty, that the results of processing are not only believable, but also actionable. Let’s take a few simple examples:

· If you want to understand the growth of new customers by region, you need to aggregate and cleanse both the region and customer information. Absolute precision may not be required, but you want the information to be good enough so that you can normalize the regions and do rough customer counts. You may not even need all the data: sampling can sometimes be nearly as accurate and is much more efficient.

· If you want to provide premium service for your best customers, you may want to compute a score such as Total Customer Potential for each customer. Here you probably need a reasonable confidence that the information fed to the scoring algorithm is correct and up to date.

· If you want to analyze customer interactions for inappropriate or fraudulent activity, you probably need a much higher degree of confidence in your data. At the very least, you need to know how confident you should be.
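To make the first example concrete, here is a minimal sketch of normalizing regions and estimating customer counts from a sample rather than from the full data set. Everything in it is an assumption made for illustration: the alias table, the record layout (a dictionary with a "region" key) and the function names are hypothetical, and simple random sampling is just one of several sampling approaches you might choose.

```python
import random
from collections import Counter

# Hypothetical alias table: maps messy region spellings to canonical names.
REGION_ALIASES = {
    "emea": "EMEA",
    "europe": "EMEA",
    "na": "NA",
    "north america": "NA",
    "apac": "APAC",
    "asia pacific": "APAC",
}


def normalize_region(raw):
    """Cleanse one region value; anything unrecognized becomes UNKNOWN."""
    return REGION_ALIASES.get(raw.strip().lower(), "UNKNOWN")


def count_by_region(records):
    """Exact customer counts per normalized region."""
    return Counter(normalize_region(r["region"]) for r in records)


def estimate_by_region(records, sample_size, seed=0):
    """Rough counts per region, scaled up from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, sample_size)
    scale = len(records) / sample_size
    return {
        region: round(n * scale)
        for region, n in count_by_region(sample).items()
    }


if __name__ == "__main__":
    # Synthetic records standing in for real customer data.
    spellings = [" Europe ", "EMEA", "na", "North America", "Asia Pacific", "??"]
    rng = random.Random(42)
    records = [{"region": rng.choice(spellings)} for _ in range(100_000)]

    print("exact:   ", dict(count_by_region(records)))
    print("estimate:", estimate_by_region(records, sample_size=1_000))
```

The size of the UNKNOWN bucket is itself a rough health indicator: if it is large, the region data probably needs more cleansing before the counts can be trusted, sampled or not.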

The way we need to use the data drives the level (and kinds) of quality and completeness we require. Better quality data can be used with confidence for more things – but lesser quantities and qualities of data can sometimes be acceptable.

What are reasonable targets for you? How clean does your data need to be? How much data do you need for your analytics? How long do you need to keep the data? Are you a data packrat? Do you need to be?

In terms of our health analogy, we need to set achievable goals for both our calorie intake (data ingestion) and our exercise plan (data cleansing and preparation). In the next installment we’ll take a deeper look at planning for thoughtful consumption and judicious exercise.