Healthy Data: A Diet of Thoughtful Consumption

Distinguished Engineer, IBM

There is a lot of tempting data out there! Both within the enterprise and in public spaces, all sorts of interesting-looking information is increasingly available. Sometimes it's tempting to gather everything that looks interesting, whether or not you have an immediate need for it. Maybe you’ll need it later? Maybe.

There are many similarities between data and food. Food fuels our bodies; data fuels our IT systems – especially our analytical systems. Eating right is a key enabler of our health. Consuming the right amounts of the right data can be a key enabler of insightful analytics. So just as we try to be thoughtful in our eating habits (with, of course, the occasional exception), we should also be thoughtful in our data consumption.

Here are three rules to think about in consuming data:

  • Consider a balanced diet
  • Understand before you consume
  • Don’t over-eat

We’ll briefly explore each of these rules below… but is something missing? Yes. We don’t always want to ingest raw data. Often we need to prepare the data for healthier consumption – and better taste. We’ll leave a further discussion of data preparation to the next posting.

1) Consider a balanced diet: To stay healthy we need a mixture of the right kinds of foods. Fortunately, foods are easily categorized – we have fruits, vegetables, meats, breads, etc. – and usually it’s easy to distinguish between them. Categorizing information for use in analytics is not as easy. We may need to look for information based on topic, ownership, accessibility, trust, location, and so on. Often we need to consume sets of data together for the combination to be meaningful. For instance, to make sense of a set of sales transactions, we probably need to combine it with the right kinds of reference data to decode the fact that a sale of product TH789 to customer BG8973 is really the sale of a hammer to Sue the carpenter. A balanced diet of data considers what it will take for the combination to have all the right ingredients for healthy and tasty analytical results.
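To make the hammer example concrete, here is a minimal sketch in Python of combining transactions with reference data. The field names (`product_id`, `customer_id`, `quantity`), the second product code `ZZ000`, and the `decode_sale` helper are hypothetical, invented for illustration; real reference data would usually live in a database or master data system rather than in dictionaries.

```python
# Hypothetical reference data: codes mapped to human-readable descriptions.
PRODUCTS = {"TH789": "hammer"}
CUSTOMERS = {"BG8973": "Sue the carpenter"}


def decode_sale(sale, products, customers):
    """Return a readable copy of a sale, marking codes with no reference entry."""
    return {
        "product": products.get(sale["product_id"], "UNKNOWN"),
        "customer": customers.get(sale["customer_id"], "UNKNOWN"),
        "quantity": sale["quantity"],
    }


# Illustrative transactions; the second uses a code missing from PRODUCTS.
sales = [
    {"product_id": "TH789", "customer_id": "BG8973", "quantity": 1},
    {"product_id": "ZZ000", "customer_id": "BG8973", "quantity": 2},
]

decoded = [decode_sale(s, PRODUCTS, CUSTOMERS) for s in sales]
for row in decoded:
    print(row)
```

The second transaction comes back with an UNKNOWN product, which is exactly the kind of gap a balanced diet is meant to catch: the transactions alone are not enough without the matching reference data.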

2) Understand before you consume: We want to make sure that our food is fresh and of good quality. Sometimes we even want to know where it has come from – or trust others to track that for us. Of course we want to make sure it’s ours to consume, that we aren’t allergic to it, and that it fits into our overall balanced diet. Data is very similar. Before we consume information, it’s a good idea to make sure that it is of acceptable quality, currency, provenance and ownership. Sometimes we will want to understand the information even more deeply – to profile the data using tools to understand it more completely – including its effective structure, consistency and value ranges.
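Profiling is normally done with dedicated tools, but the idea can be sketched in a few lines of Python. The `profile_column` function and the sample `prices` column below are hypothetical; the sketch assumes missing values are represented as `None` and simply reports counts, the types actually present, and the numeric value range.

```python
def profile_column(values):
    """Summarize one column: completeness, types found, and numeric range."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it from the numeric range.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }


# Illustrative column with a missing value and an inconsistent "n/a" string.
prices = [19.99, 24.50, None, 19.99, "n/a", 7.25]
report = profile_column(prices)
print(report)
```

Even this small profile shows a consistency problem (a string mixed in with numbers) and a missing value, both worth knowing about before the data is consumed.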

3) Don’t over-eat: It is all too easy to take more food than we can eat. The eyes are often bigger than the stomach. It’s just as easy to collect more information than you need for the analytics you want to perform. Sometimes we can even use statistical sampling of data rather than complete data sets and achieve the same insight. We might find that analysis on data collected twice a day provides no more insight than if we collected it only once a day. We generate, and often have access to, much more information than we can use. If we are thoughtful in selecting what and how much data we need, we can be more efficient in processing what we do consume and perhaps establish a higher degree of confidence in the quality of the results. When we collect more data than we can consume immediately, we may want to store it for use later. Sometimes this means preparing the information for storage and later consumption. But remember, most data, like most food, doesn’t have an infinite shelf life.
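As a rough illustration of sampling, the Python sketch below compares the average of a full data set with the average of a 1% simple random sample. The data is synthetic (normally distributed values generated with a fixed seed), and the sizes are assumptions chosen for the example; whether a sample is good enough in practice depends on the data and on the question being asked.

```python
import random
import statistics

# Fixed seed so the illustration is repeatable.
rng = random.Random(42)

# Synthetic "full" data set: 100,000 readings centered around 50.
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Simple random sample without replacement: 1% of the data.
sample = rng.sample(population, 1_000)

full_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"full data mean: {full_mean:.2f}")
print(f"sample mean:    {sample_mean:.2f}")
print(f"difference:     {abs(full_mean - sample_mean):.2f}")
```

For a summary statistic like an average, the sample typically lands very close to the full result while touching a fraction of the data. Rare events and outliers are a different matter, and may call for the complete set.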

In summary, thoughtful selection of information is key to healthy analytics. We should evaluate information as we collect it and decide what we can use now and what we should keep for later. In both cases, we may need to prepare the information.