Big data's bogus correlations

Big Data Evangelist, IBM

Detangling the truth from the spurious, the curious and the injurious

Coincidences are correlations, of course. But they're the weakest possible variety: so rare as to be essentially unprecedented, and so non-causal as to be essentially unpredictable. You couldn't produce a valid statistical model to account for their occurrence or recurrence.


Considering that big data's spurious correlations far outnumber its valid causal relationships, it's best to approach every correlation with healthy skepticism. Some people, data scientists included, may have a personal or professional stake in believing the spurious. Those stakes, plus the natural human tendency toward confirmation bias, tend to short-circuit the skepticism and critical thinking we expect from the best analysts. And the more intelligent and authoritative a data scientist or analyst is, the more easily they can convince unwitting others that their wishful correlations are valid.

Sounding a cautionary note on all this, Gary Marcus and Ernest Davis recently published an excellent piece called “Nine large problems with big data.” The issue they discuss is not so much with big data itself as it is with the human tendency to unthinkingly and uncritically accept the spurious correlations that big data reveals. Detangling the truth from the spurious demands critical thinking and relentless pushback.

What follows are the pithy names that I (not Marcus and Davis) give to the problems they describe. Exercise caution when evaluating statistical correlations that may suffer from any or all of the following problems:

  • Fluke correlations: One of the bedrock truths of statistics is that, given enough trials, almost any possible occurrence can happen, though you might need to wait a long, long time. The more possible events that might take place, the more potential, albeit unlikely, "fluke" events there are. As Marcus and Davis state, "If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant, even though there is no actual meaningful connection between the variables." And given a big data set of enormous volume, velocity and variety, you're likely to see many fluke correlations every time you look into the data.
  • Ephemeral correlations: Some extreme correlations may jump right out at us and scream “Significant!” only to fade upon repeated observations. Though they may not be statistical flukes, such correlations may vanish under the influence of the statistical rule known as “regression toward the mean.” These are non-robust correlations of the sort that may be borne out by historical data sets but, when encoded in predictive models, fail to be replicated in future observations.
  • Uncorroborated correlations: Statistical correlations based purely on number crunching are weaker than those that have also been tested and corroborated in the laboratory or through some other real-world scientific methodology. In other words, data science alone isn't enough to establish the bedrock validity of a correlation. You need what might be called “domain science”: the correlation should also be investigated with traditional methodologies by actual physicists, economists and other experts in the domain being explored statistically.
  • Artifactual correlations: Some correlations may be produced by “artifacts” that are entirely separate from the “natural” factors being studied. Artifacts might dilute, distort or reverse any “natural” correlations being studied statistically. Marcus and Davis call attention to big data as one such artifact, producing “vicious cycles” of distorted correlations in cases where “the source of information for a big data analysis is itself a product of big data.” One example is any attempt to study how texts are translated between natural languages, when the source of texts is Wikipedia entries that may have been auto-generated from each other via a common program such as Google Translate. “With some of the less common languages,” say Marcus and Davis, “many of the Wikipedia articles themselves may have been written using Google Translate. In those cases, any initial errors in Google Translate infect Wikipedia, which is fed back into Google Translate, reinforcing the error.”
  • Wrongheaded correlations: Some correlations proceed from metrics and data that should never have been developed in the first place. This happens a lot in pseudo-science, when someone attaches numbers to fundamentally qualitative phenomena, or attaches the wrong numbers to a quantitative phenomenon. Marcus and Davis call attention to recent statistical rankings of people's “historical importance” or “cultural contributions” based on data drawn from Wikipedia. One such study ranked Francis Scott Key as the 19th most important poet in history, and another named Nostradamus the 20th most important writer in history. No one who has seriously studied literature would make those claims. These are wrongheaded analyses foisted on the gullible under the guise of big-data validity.
  • Hyped correlations: Hype can infect the narrative that frames any correlation, no matter how well that correlation is grounded in the subject domain, the data and the statistical models. An overzealous advocate for a correlation will tend to exaggerate its importance, wrench it out of context, deflect criticism and distract listeners from the constraints and caveats that should frame every correlation throughout its natural life.
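The "about five bogus correlations per 100 looks" figure that Marcus and Davis cite is the classic multiple-comparisons problem, and it's easy to reproduce for yourself. The sketch below (plain Python, standard library only; the function names and the hard-coded critical value are my own, not from the article) correlates many pairs of series of pure random noise and counts how many clear the conventional p < 0.05 bar anyway:

```python
import random
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def count_fluke_correlations(trials=1000, n=30, seed=42):
    """Generate `trials` pairs of *independent* Gaussian noise series of
    length `n`, and count how many look "statistically significant"."""
    rng = random.Random(seed)
    # Two-tailed critical |r| for p < 0.05 with n = 30 (df = 28) is ~0.361
    r_crit = 0.361
    flukes = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if abs(pearson_r(xs, ys)) > r_crit:
            flukes += 1  # "significant" despite zero real connection
    return flukes

print(count_fluke_correlations())  # roughly 5% of 1000 trials
```

By construction there is no real relationship anywhere in this data, yet roughly one look in twenty yields a "significant" correlation, which is exactly why a big data set probed thousands of times will always cough up impressive-looking flukes.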

Beware, beware, beware! The sexier the correlation, the more likely any of us is to accept it uncritically. And the more uncritical we are about big data's potential for statistical abuse, the more damage can be done when we live our lives under the assumption that bogus correlations are the gospel truth.