4 Kinds of Exercise for Healthy Data

Distinguished Engineer, IBM

We are often told of the importance of exercise to overall health. Stretching, cardio and strengthening exercises each contribute to maintaining a fit physique.

In the context of healthy data, we also need a range of activities to provide the right data to the right place at the right time – both for analytical consumption and any other kind of processing. We aim to thoughtfully balance the accumulation of new data with the reduction in data through archiving or deletion. Decomposing this a little further, we can think about four kinds of exercise:

 1)  Collect – The first step is for us to find and access information of potential interest and then assess whether we think it might be suitable for our purpose. We might, for example, profile a sample of the information to determine some key characteristics including quality and structure. After assessment, we may determine that we need to adapt the information in some way for further processing.

 2)  Process – The next step is to gain some deeper insight into the information we have collected and an understanding of the information that we need to deliver. We work through how information from multiple sources should be combined and transformed to satisfy the needs of the consuming applications – and then execute the appropriate transformations. Transformations include reshaping the information, performing computations and addressing quality and consistency issues.

 3)  Distribute – The timely delivery of the transformed and cleansed information takes place in this step. Here we use the appropriate technologies to ensure that the consuming applications can proceed with their work as required. Consuming applications may have requirements ranging from periodic to near real-time delivery and from a single record at a time to large data sets.

 4)  Manage – Governance over each of these steps is key to providing the confidence that the right information is truly being delivered to the right consumers. Information governance encompasses life cycle, protection and quality of the data. Life cycle management covers many facets from creation to archiving and secure deletion. Protection includes handling of privacy, access control and visibility of information. Information quality covers many aspects including standardization, de-duplication and currency. Information policies and procedures to manage information life cycles are implemented and monitored through a combination of automated and manual efforts. An essential part of management is monitoring and reporting on key indicators to provide an ongoing understanding of how well the management systems are working.
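To make a few of these exercises concrete, here is a minimal Python sketch of profiling a sample (Collect), standardizing records (Process) and de-duplicating them (Manage). It is only an illustration under stated assumptions: the sample records, the field names and the functions profile, standardize and deduplicate are all hypothetical and do not come from any particular product or tool.

```python
# Hypothetical sample records; field names and values are invented for illustration.
sample = [
    {"id": "1", "name": "Ann Lee ", "country": "us"},
    {"id": "2", "name": "ann lee", "country": "US"},
    {"id": "3", "name": "Raj Patel", "country": ""},
]


def profile(records):
    """Collect: report the share of non-empty values and the distinct count per field."""
    fields = sorted({key for record in records for key in record})
    report = {}
    for field in fields:
        values = [record.get(field, "") for record in records]
        filled = [value for value in values if value.strip()]
        report[field] = {
            "fill_rate": len(filled) / len(records),
            "distinct": len(set(filled)),
        }
    return report


def standardize(record):
    """Process: collapse stray whitespace and normalize name and country casing."""
    return {
        "id": record["id"].strip(),
        "name": " ".join(record["name"].split()).title(),
        "country": record["country"].strip().upper(),
    }


def deduplicate(records, key_fields=("name", "country")):
    """Manage: keep the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


report = profile(sample)
cleaned = deduplicate([standardize(record) for record in sample])
```

In this sketch the profile shows that the country field is only partly filled and inconsistently cased, which is the kind of finding that tells us the information needs adapting before further processing. Standardizing before de-duplicating matters here: the two "Ann Lee" records only match once their casing and whitespace agree. Real names, codes and matching rules would need far more careful handling than this.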

Learning how and when to exercise and perceiving the benefits takes time, effort, experience and knowledge. It takes thoughtful practice following a routine that helps us build our strength, stamina and flexibility. The exercises above are no different. We need to incrementally build our capabilities, expanding both the completeness of our exercises and our coverage over more and more kinds of information. We need to set appropriate expectations, track our results, recognize our success – and then set our sights on even higher goals. The next post will focus on this critical process of establishing and tracking your progress towards achieving and maintaining healthy data.
