The Monkey Wrench That Can Slow Down Analytics

Apply these tips and tricks to resolve data quality problems for efficient analysis of big data

Many industries have data quality problems to varying degrees. I have experienced problems with data quality in five industries I have worked in: banking, healthcare, high tech, property-casualty insurance, and telecommunications. As a predictive analytics scientist, I have seen firsthand how poor data quality slows down model building and also affects the accuracy of those models.

The benefits of experience

Following are some tips and tricks accumulated over the years while working in these industries. They can help data scientists and other big data analytics professionals to recognize and resolve some common problems with data quality.

Balanced data granularity

Data granularity refers to the level at which the data field is subdivided. Either too much data granularity or not enough data granularity can impact the quality of the predictive model. For example, too much granularity can lead to slow and overly detailed models. However, this problem can be avoided by grouping similar codes into a new derived variable.

Note that similar codes are those with similar propensity percentages, that is, similar rates of the outcome being predicted. This technique is particularly effective when alphanumeric values are grouped into new numerical categories, because several types of predictive models do not accept alphanumeric fields. On the other hand, not enough granularity in the independent variables can hinder the model’s ability to create actionable segments.
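As a rough illustration of grouping codes by propensity, the following Python sketch computes the outcome rate for each code and assigns codes to numbered groups. The sample codes, the cutoff values, and the function names are assumptions made for the example, not part of any particular tool.

```python
from collections import defaultdict

def propensity_by_code(records):
    """Return {code: share of records with outcome == 1} for each code."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for code, outcome in records:
        totals[code] += 1
        positives[code] += outcome
    return {code: positives[code] / totals[code] for code in totals}

def group_codes(propensity, cutoffs):
    """Map each code to a numeric group: the count of cutoffs its rate meets or exceeds."""
    return {
        code: sum(rate >= c for c in cutoffs)
        for code, rate in propensity.items()
    }

# Hypothetical history file: (alphanumeric code, binary outcome)
history = [
    ("A1", 1), ("A1", 0), ("A1", 0), ("A1", 0),
    ("B7", 1), ("B7", 0), ("B7", 0), ("B7", 0),
    ("C3", 1), ("C3", 1), ("C3", 1), ("C3", 0),
    ("D9", 0), ("D9", 0), ("D9", 0), ("D9", 0),
]

rates = propensity_by_code(history)
# Assumed cutoffs: below 10% -> group 0, 10% up to 50% -> group 1, 50% and up -> group 2
code_group = group_codes(rates, cutoffs=[0.10, 0.50])
```

Here codes A1 and B7 have the same outcome rate and land in the same numeric group, so the derived variable carries fewer distinct values than the original code field.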

Excessive poor-quality data

Data is often not clean, and model builders expect to clean it up before using it. However, excessive poor-quality data can decrease the effectiveness of the predictive model. Common examples are numerical data in alphanumeric fields, such as state codes, and alphanumeric data in numerical fields, such as zip codes. Data missing from particular fields can also be a problem.
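One way to gauge how widespread this kind of problem is would be to measure the share of values in a field that do not match the expected format. The sketch below assumes two-letter uppercase state codes and US-style five-digit or ZIP+4 zip codes; both patterns and the sample values are illustrative assumptions.

```python
import re

STATE_CODE = re.compile(r"[A-Z]{2}")       # assumed format: two uppercase letters
ZIP_CODE = re.compile(r"\d{5}(-\d{4})?")   # assumed format: 5 digits, optional ZIP+4

def invalid_share(values, pattern):
    """Fraction of values that do not fully match the expected pattern."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if not pattern.fullmatch(str(v).strip()))
    return bad / len(values)

# Hypothetical field contents
states = ["NY", "CA", "12", "TX"]                  # "12" is numeric
zips = ["10001", "9021O", "60614-1234", "ABCDE"]   # "9021O" ends in the letter O

state_bad = invalid_share(states, STATE_CODE)
zip_bad = invalid_share(zips, ZIP_CODE)
```

A field with a high invalid share is a candidate for cleanup before modeling, or for a conversation with whoever maintains the source system.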

Predictive models can overcome some missing variables in the records. However, variables that have excessive numbers of missing entries should be omitted from the model because of low record counts. A new derived field can be created for fields with only a small percentage of missing data. For example, if the field has only positive numerical values, then code the null, blank, empty, and not-available data to be equal to −1. This way, the entire record is not bypassed, which helps avoid losing the valuable other fields.
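The two rules above can be sketched in Python as follows. The list of placeholder strings, the 30 percent threshold for dropping a variable, and the column names are assumptions chosen for the example. The −1 code only works as described when every valid value in the field is positive.

```python
MISSING_TOKENS = {"", "null", "none", "na", "n/a", "not available"}  # assumed placeholders

def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS

def missing_share(values):
    """Fraction of entries in a field that are missing."""
    return sum(is_missing(v) for v in values) / len(values)

def derive_with_sentinel(values, sentinel=-1.0):
    """Derived field: missing entries become the sentinel, the rest become floats."""
    return [sentinel if is_missing(v) else float(v) for v in values]

MAX_MISSING = 0.30  # assumed threshold for "excessive" missing entries

# Hypothetical fields with only positive valid values
columns = {
    "balance": ["120.5", None, "87", "N/A", "310", "42", "15", "9", "77", "230"],
    "tenure": [None, "", "null", "N/A", "3", None, "", "NA", "7", None],
}

derived = {}
dropped = []
for name, values in columns.items():
    if missing_share(values) > MAX_MISSING:
        dropped.append(name)  # too sparse to be useful in the model
    else:
        derived[name + "_derived"] = derive_with_sentinel(values)
```

In this sample, the balance field keeps all ten records with two of them coded as −1, while the tenure field is dropped because most of its entries are missing.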

Small history file

A small history file can adversely affect models such as decision trees. The data should be divided into two partitions, one for developing the model—called training data—and another for testing the model—called testing data. Within each partition, the model creates multiple separate tree nodes for each model segment. If the history file is too small, the nodes will have small populations, which could potentially result in spurious findings.
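A quick check along these lines is to partition the history file and count how many records fall into each segment. In the sketch below, the 70/30 split, the minimum count of 30, and the segment field standing in for a tree node are all assumptions for illustration.

```python
import random
from collections import Counter

def split_history(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def thin_segments(records, segment_of, min_count):
    """Return {segment: population} for segments with fewer than min_count records."""
    counts = Counter(segment_of(r) for r in records)
    return {seg: n for seg, n in counts.items() if n < min_count}

# Hypothetical small history file: 100 records spread over 8 segments
history = [{"id": i, "segment": i % 8} for i in range(100)]

train, test = split_history(history)
thin_train = thin_segments(train, lambda r: r["segment"], min_count=30)
```

With only 100 records, every segment in the training partition falls below the assumed minimum, which is the kind of warning sign to look for before trusting a tree's findings.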

Unbalanced dependent variable

For binary dependent variables, the predictive model learns the patterns in the historical file more accurately if the data is equally balanced between the two options. However, many organizations do not have equally balanced history files. Usually, the scales are tilted toward one side or the other.

In these situations, the model builder should oversample from the smaller side so that the training history file is equally balanced between the two options. If the training history file is kept unbalanced, the model will learn mostly from the larger side and create a biased model. The testing data set, by contrast, is kept unbalanced so that the model results reflect the reality of the historical file.
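A minimal sketch of oversampling follows, assuming records are simple (id, label) pairs and that random duplication of the smaller side is acceptable. The function name and sample data are hypothetical. It would be applied to the training partition only, leaving the testing partition as it is.

```python
import random

def oversample_minority(records, label_of, seed=0):
    """Duplicate randomly chosen minority records until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in records if label_of(r) == 1]
    negatives = [r for r in records if label_of(r) == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("one class has no records to sample from")
    # Draw with replacement from the smaller side to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical training partition: 10 positive and 90 negative records
train = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 100)]
balanced_train = oversample_minority(train, label_of=lambda r: r[1])
```

The balanced training file here holds 90 records of each type, with the positive records repeated as needed.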

Included post-event variable

If the predictive model is run after a certain event, then only variables that are created before that event should be included. Adding variables from the history file that are created after the event date is tempting because they can be strongly correlated with the dependent variable. These variables create a strong but false model. When the model is run with actual data, the post-event variables will not yet be populated and cannot contribute to the predictive power of the model.
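One simple safeguard is to screen candidate variables by the date on which they become populated. The sketch below assumes that such a date is available for each variable; the variable names and dates are hypothetical, and the rule of keeping only variables populated strictly before the event date is one reasonable reading of the guideline.

```python
from datetime import date

def pre_event_variables(variable_dates, event_date):
    """Return the names of variables populated strictly before the event date."""
    return sorted(
        name for name, populated in variable_dates.items()
        if populated < event_date
    )

# Hypothetical metadata: the date each variable is populated for a record
variable_dates = {
    "credit_score": date(2023, 1, 10),
    "account_age": date(2023, 2, 1),
    "collections_flag": date(2023, 4, 20),  # populated after the event
}
event_date = date(2023, 3, 15)

usable = pre_event_variables(variable_dates, event_date)
```

Only the two variables populated before the event remain as candidates for the model.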

Unreliable dependent variable

The dependent variable may be unreliable because of coding problems. In the healthcare industry, for example, an unreliable dependent variable is sometimes a problem in historical payments on the hospital’s closed accounts report. The codes do not reveal whether the payer or the patient made the payment. This problem is a deal-breaker for developing a predictive model. A model should not be developed without a reliable dependent variable.

Vital communication

Data warehouse managers can be proactive in creating derived independent variables for data scientists and other users. For a long-term solution to data quality problems, edit checks can be built into the initial data input to block alphanumeric entries in strictly numeric fields and vice versa. If the data quality problems can be traced to certain individuals, perhaps additional training is warranted. Communication between data scientists and data warehouse managers is important so that they can work together to improve the quality of data.

What do you think about resolving data quality problems and avoiding this monkey wrench in the data warehouse? Please share your thoughts or questions in the comments.
