When Dependent and Independent Variables Are Not Enough

Why data scientists need other data components to build, measure, and monitor predictive models

The capability to predict or forecast specific outcomes for informed decision making is vital to organizations across a wide range of industries, including insurance, finance, customer relationship management, and others. As predictive models are increasingly implemented to capitalize on big data to influence business decisions or enhance operations, data scientists need to look beyond the models’ dependent and independent variables for the data necessary to accurately assess the models’ effectiveness.


Building the model

The data component for a predictive model generally consists of a dependent variable and multiple candidate independent variables (predictors) gathered from historical data sources. Collecting, cleaning, and manipulating the data usually consumes about 80 percent of the total timeline for building the model. Inexperienced data scientists sometimes try to shortcut this step, only to find that data errors and spurious results force them back to it.

Besides the dependent and independent variables, experienced data scientists gather three additional data components for predictive models: data to calculate the model’s return on investment (ROI), data for monitoring its short- and long-term quality, and data for enhancing the usefulness of its output.


Calculating return from using the model

All new projects, including building predictive models, should be measured by the revenue generated or costs reduced so they can be compared to other projects competing for the same limited investment capital. Even nonprofit organizations need to carefully monitor cash flow so they can keep their doors open. As a result, data scientists need additional data to calculate ROI for the model, typically using one of two popular methods: net present value (NPV) or internal rate of return (IRR).

Some fixed costs will be required to build the model, and some ongoing costs will be necessary to run it. Ideally, these costs are more than offset by incremental revenue, incremental cost reductions, or both. Many data scientists benchmark current cost and revenue levels so they can compare them to costs and revenues after the model is deployed. Ongoing monitoring helps determine the usefulness of the model over time.
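As a rough illustration, the NPV arithmetic can be sketched in a few lines. The discount rate, build cost, and yearly benefit figures below are hypothetical, not benchmarks from any real project.

```python
def npv(rate, cash_flows):
    """Net present value of a series of yearly cash flows.

    cash_flows[0] is the up-front (year-0) build cost, entered as a
    negative number; later entries are the incremental net benefits
    (revenue gained plus costs avoided, minus ongoing run costs).
    """
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

# Hypothetical project: $120k to build the model, then four years of
# estimated incremental net benefit.
flows = [-120_000, 45_000, 50_000, 50_000, 40_000]
project_npv = npv(0.10, flows)  # assumes a 10% discount rate
print(round(project_npv, 2))
```

A positive NPV suggests the model clears the chosen discount rate; comparing NPVs across competing projects is what lets the model bid for the same limited capital as everything else.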


Monitoring short- and long-term quality

Data is also needed for monitoring the model’s quality over time. A model’s predictive accuracy tends to decline as it ages. One rather extreme example is a model for forecasting telephone usage that was built 50 years ago. This model would not be accurately predictive in today’s economy and environment. When it was built back in 1964, there was generally one telephone per household and long-distance calls were expensive. Today almost everyone has a cell phone, and many long-distance calls are free.

Many models are rebuilt or recalibrated every four or five years, depending on how quickly they deteriorate. Collecting error rates and trending them over time is necessary to monitor the deterioration rate.
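One way to trend error rates is sketched below with made-up monthly figures; the baseline window and the 25 percent tolerance are assumptions chosen for illustration, not rules from the article.

```python
def drift_check(error_rates, baseline_n=6, recent_n=3, tolerance=0.25):
    """Compare recent average error to the baseline set at deployment.

    error_rates: chronological list of periodic (e.g. monthly) error rates.
    Flags deterioration when the recent average exceeds the baseline by
    more than `tolerance` (a 25% relative increase in this sketch).
    """
    baseline = sum(error_rates[:baseline_n]) / baseline_n
    recent = sum(error_rates[-recent_n:]) / recent_n
    return recent > baseline * (1 + tolerance), baseline, recent

# Hypothetical monthly error rates: stable at first, creeping up later.
rates = [0.08, 0.09, 0.08, 0.09, 0.08, 0.08, 0.10, 0.12, 0.13]
deteriorated, base, recent = drift_check(rates)
print(deteriorated, round(base, 3), round(recent, 3))
```

When the check trips repeatedly, that is the signal to schedule a rebuild or recalibration rather than wait for a fixed four- or five-year cycle.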

For short-term quality, outlier ranges and corresponding alerts can identify real-time outages. For example, daily patient admissions for a large hospital chain are usually within 10 percent of the moving average. If today’s admission count is about 50 percent below normal, an alert text should be sent to the monitoring party because this level is outside the outlier range.
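The hospital-admissions check might look like the following sketch; the daily counts and the alerting function are hypothetical, and in practice the alert would go to a paging or texting service rather than standard output.

```python
def check_admissions(today, history, band=0.10, window=30):
    """Flag today's count if it falls outside a band around the moving average.

    `band` is the allowed fractional deviation (10% in this sketch).
    Returns an alert message, or None if the count is within range.
    """
    recent = history[-window:]
    moving_avg = sum(recent) / len(recent)
    deviation = (today - moving_avg) / moving_avg
    if abs(deviation) > band:
        return (f"ALERT: admissions {today} deviate {deviation:+.0%} "
                f"from the {len(recent)}-day average of {moving_avg:.0f}")
    return None

history = [900, 950, 920, 940, 910, 930, 925]  # recent daily counts
print(check_admissions(460, history))  # roughly 50% below normal
print(check_admissions(915, history))  # within the 10% band
```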


Enhancing the usefulness of the output

Prior to building a predictive model, the output should be discussed with those who will use the model. One question posed to the model’s users might be, “Which data fields would you like to see besides the model scores?” Experienced modelers should also suggest fields they believe will be useful.

In a prospecting model built for a property and casualty insurance company, for example, several useful data fields were included in the output list. One field showed historical profitability within the prospect’s industry, while another held a five-year growth forecast. While the model provided a likelihood score for buying an insurance policy, the extra data helped the sales representatives make informed decisions about calling on potential customers. In another model, adding a dollar-value dial was an enhancement that users liked even though the client organization didn’t request it.
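A minimal sketch of attaching such extra fields to model scores is shown below; the prospect names, field names, and values are invented for illustration.

```python
# Hypothetical scored prospects from the model.
prospects = [
    {"name": "Acme Mfg", "score": 0.82},
    {"name": "Borealis LLC", "score": 0.41},
]

# Hypothetical enrichment data the users asked for: industry
# profitability and a five-year growth forecast.
extra = {
    "Acme Mfg": {"industry_profitability": 0.12, "growth_5yr": 0.30},
    "Borealis LLC": {"industry_profitability": 0.04, "growth_5yr": 0.08},
}

# Merge the extra fields into each scored record.
enriched = [{**p, **extra.get(p["name"], {})} for p in prospects]
for row in sorted(enriched, key=lambda r: r["score"], reverse=True):
    print(row)
```

The point of the design is that the score alone rarely drives the call decision; delivering the score alongside the context fields lets the sales representative weigh likelihood against opportunity size.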


Providing the requisite data

To assess the effectiveness of the predictive models they build and utilize, data scientists require much more data than just the models’ dependent and independent variables. Data warehouse managers can be proactive by discussing with data scientists their future needs for these additional data types. Please share any thoughts or questions in the comments.