Data scientists need to nip model overfit in the bud

Big Data Evangelist, IBM

One hallmark of pseudo-science is the theoretical model that explains the past too well. Another hallmark is when that same model repeatedly fails to foretell the future but is always revised ad hoc to explain away its inadequacy. For example, think of all the sophistic reasoning that creationists employ to explain away inconvenient evidence (like dinosaur bones) that contradicts the Book of Genesis.

In data science, the equivalent phenomenon is the statistical model that fits historical data like a hand in a glove, but fails miserably at predictive analysis. One key difference from the creationism analogy is that this data science phenomenon, known as “overfitting,” is not the handiwork of charlatans. Rather, it’s an unfortunate, but all too common, consequence of top-notch data scientists attempting to refine their statistical models.

Overfitting is another type of data science bias. Per my discussion in this 2013 post, overfitting stems from the tendency to skew data science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics of fitness.

Though it may be inadvertently committed by data scientists of high integrity, overfitting carries the same risk as pseudo-science: encouraging people to place too much trust in the explanatory or predictive power of an interpretive model of some empirical domain. That fact is underlined in this great 2005 article by John Langford. In it he defines two styles of overfitting: “over-representing performance on particular datasets and (implicitly) over-representing performance of a method on future datasets.”

Clearly, the greatest risk from overfitting stems from this latter sense: implying the robust predictability of some statistical model against any future data in the relevant domain. That overconfidence goes against the grain of a cardinal imperative in data science: the need to regularly score your statistical models, once they have been trained on a training data set, against fresh observations in order to test their continued predictive power. Without the ability to score models against new empirical data, we can have little confidence in their continued predictive fit to our domain of interest.
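The scoring discipline described above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitoring routine: the function names (`mean_squared_error`, `has_degraded`), the stand-in `trained_model`, the data values and the `max_increase` threshold are all hypothetical, chosen only to show the idea of comparing a model's error on its training data with its error on fresh observations.

```python
from typing import Callable, Sequence


def mean_squared_error(
    model: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    """Average squared gap between the model's predictions and observed values."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def has_degraded(baseline_mse: float, fresh_mse: float, max_increase: float) -> bool:
    """Flag the model when its error on fresh data exceeds the baseline by too much."""
    return fresh_mse - baseline_mse > max_increase


def trained_model(x: float) -> float:
    """Hypothetical stand-in for a model that was fitted earlier (y = 2x + 1)."""
    return 2.0 * x + 1.0


# Error on the (made-up) data the model was trained on.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.1, 2.9, 5.1, 6.9]
baseline_mse = mean_squared_error(trained_model, train_x, train_y)

# A fresh batch that still follows the learned relationship.
fresh_x = [4.0, 5.0, 6.0, 7.0]
stable_y = [9.1, 10.9, 13.1, 14.9]
stable_mse = mean_squared_error(trained_model, fresh_x, stable_y)

# A fresh batch where the underlying relationship has shifted (y = 2.5x + 1).
drifted_y = [11.0, 13.5, 16.0, 18.5]
drifted_mse = mean_squared_error(trained_model, fresh_x, drifted_y)

MAX_INCREASE = 0.5  # hypothetical tolerance; a real project would justify this value
print("stable batch degraded: ", has_degraded(baseline_mse, stable_mse, MAX_INCREASE))
print("drifted batch degraded:", has_degraded(baseline_mse, drifted_mse, MAX_INCREASE))
```

The point of the sketch is the comparison, not the particular metric: any error measure suited to the problem could take the place of mean squared error, provided it is recomputed whenever new observations arrive.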

If we continue to place faith in those models, regardless of whether they’ve been freshly scored, we risk slipping into “pseudo-data-science” territory. Or, just as bad, we risk giving causative credence to spurious, perhaps ridiculous, correlations derived from an overfitted model. The cited article shows one such spurious overfit correlation: between the age of Miss America and the incidence of murders by steam, hot vapors and hot objects.

Professional data scientists are well aware of the dangers of overfitting. For them, the core value of Langford’s discussion is in his breakdown of the principal causes of overfitting and his remedies for steering clear of them.

Here’s how I summarize the chief overfitting causes (all of which point to the appropriate remedies):

  • Model predictors that are too complex
  • Training examples that are too few, too skewed or too old
  • Models that have been accepted based on performance metrics that are inadvertently skewed in their favor or that are misconstrued in such a way as to encourage unwarranted confidence in those models
  • Problem definitions that are revised in order to improve the apparent performance of existing models that were developed under prior problem definitions
  • Acceptance of only those model-based findings that show the best results, rather than those grounded in the most valid data sets, parameters and methodologies
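The first cause in the list above, an overly complex predictor, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the synthetic data (a straight line with a fixed alternating disturbance instead of random noise, so the numbers are reproducible), the helper names, and the choice of a nearest-neighbor "memorizer" as the overly flexible model. The memorizer reproduces the training data exactly, yet does worse than a plain straight-line fit on held-out points.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Ordinary least-squares fit of a straight line (a deliberately simple model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """1-nearest-neighbor lookup: flexible enough to reproduce every training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(model: Model, xs: Sequence[float], ys: Sequence[float]) -> float:
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic training data: y = 2x plus a fixed +1 / -1 disturbance standing in for noise.
train_x: List[float] = [float(i) for i in range(10)]
train_y: List[float] = [
    2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)
]

# Held-out points between the training points, following the undisturbed relationship.
holdout_x: List[float] = [x + 0.25 for x in train_x]
holdout_y: List[float] = [2.0 * x for x in holdout_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in (("line", line), ("memorizer", memorizer)):
    train_err = mean_squared_error(model, train_x, train_y)
    holdout_err = mean_squared_error(model, holdout_x, holdout_y)
    print(f"{name:10s} train MSE = {train_err:.3f}   holdout MSE = {holdout_err:.3f}")
```

Judged on training error alone, the memorizer looks like the better model; only the held-out comparison reveals that it has fitted the disturbance rather than the underlying trend.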

In addition to diluting the predictive performance of data science models, overfitting also tends to build up false downstream expectations among those of us who rely on these models’ insights to guide our decisions. We may inadvertently base a critical decision on some future scenario predicted by an overfitted model, thereby exposing ourselves to all the concomitant risks.

Consequently, weeding out overfitting whenever and wherever it occurs is the duty of every working data scientist. Without that vigilance and the procedural safeguards to minimize the incidence of overfitting, society as a whole will be less likely to trust that data scientists have its best interests at heart.