Big data overkill can stunt scientific rigor

Big Data Evangelist, IBM

"Nothing is as practical as a really good theory." That's what one of my college professors once told us, and I've always believed it.

Newton, Einstein and the other giants of science didn't have big data. But they had big thoughts—and just enough samples of the right kinds of data—to show that their informed speculations were exactly right. They had the practical vision to let their imaginations run ahead of the available evidence, while having the confidence that the data and analyses to confirm their hunches would soon emerge from scientific investigations.

As science pushes computational approaches to their limits and mines ever-growing pools of observational data, it is important to recall how the smartest people of prior eras addressed many of the same issues we still confront. Science has always progressed when brilliant theoreticians propose concise new explanations that they, or others, can test empirically. The emphasis is on concise: hypotheses that other scientists in their field can wrap their heads around and test in a reasonable amount of time, at a reasonable cost and with available tools and techniques that enforce scientific rigor. If nothing else, overcomplicated theories tend to run afoul of the conceptual "Ockham's Razor" while also tending to consume an inordinate amount of resources over the course of scientific investigations. Simple theories that drive simple experiments are often the most persuasive.

As a tool of science, big data's sheer bigness can be its own worst enemy. For starters, there's a practical limit to the use of brute-force computation to execute some statistical models. As one researcher notes in this recent article, “Many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data. Faced with this situation, gatherers of large-scale data are often forced to turn to ad hoc procedures that...may have poor or even disastrous statistical properties.”

Another serious issue, according to article author Tom Siegfried, is that big data's all-consuming storage clusters encourage researchers to mix data sources of varying degrees of quality, thereby diluting the validity of subsequent findings. "Big Data often is acquired by combining information from many sources, at different times, using a variety of technologies or methodologies....[which] 'creates issues of heterogeneity, experimental variations, and statistical biases.'" This sort of big data meta-analysis is becoming more prevalent in science and business alike, and it's a slippery slope toward utter disregard for source data quality.

An even more fundamental issue with big data's use in science stems from its awesome power in crunching through entire populations of empirical evidence. This can lull researchers into accepting invalid models as the gospel truth. According to article author Siegfried: "Not only do Big Data samples take more time to analyze, they also typically contain lots of different information about every individual that gets sampled, which means, in statistics-speak, they are high dimensional. More dimensions raises the risk of finding spurious correlations—apparently important links that are actually just flukes."
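To make the spurious-correlation risk concrete, here is a minimal Python sketch. It is my own illustration rather than anything from Siegfried's article, and the sample size (20 rows), the 0.5 cutoff and the helper names are all assumptions chosen for the demo. Every column is independent random noise, so any pair that looks "correlated" is a fluke by construction.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def noise_columns(n_columns, n_rows, seed=0):
    """Independent Gaussian noise: no column is truly related to any other."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_rows)] for _ in range(n_columns)]


def strong_pairs(columns, threshold=0.5):
    """Index pairs whose sample correlation exceeds the threshold in magnitude."""
    return [
        (i, j)
        for (i, a), (j, b) in itertools.combinations(enumerate(columns), 2)
        if abs(pearson(a, b)) > threshold
    ]


# Hypothetical setup: 20 observations, up to 200 unrelated dimensions.
columns = noise_columns(n_columns=200, n_rows=20)
for k in (5, 20, 50, 200):
    found = len(strong_pairs(columns[:k]))
    total = k * (k - 1) // 2
    print(f"{k:>3} dimensions: {found} of {total} pairs look 'correlated'")
```

Because each smaller set of columns is a subset of the larger one, the count of fluke pairs can only grow as dimensions are added. Nothing about the data got more meaningful; there were simply more places for chance to hide.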

As the number of factors (aka dimensions) under investigation grows, so do the potential interrelations among them and, of course, so does the volume of observational data that accumulates when we try to measure them all. High-dimensional modeling is the biggest resource hog in the known universe, and, left unchecked, can quickly consume the full storage, processing, memory and bandwidth capacity of even the most massive big-data cluster.
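A bit of back-of-the-envelope arithmetic shows how fast those interrelations pile up. The sketch below is my own illustration, and it rests on a simplifying assumption: that an "interrelation" means either a pair of factors or any group of two or more factors. The function names are made up for the example.

```python
import math


def pairwise_interactions(k):
    """Number of distinct pairs among k factors: k choose 2."""
    return math.comb(k, 2)


def all_interactions(k):
    """Number of factor groups of size two or more: 2^k - k - 1."""
    return 2 ** k - k - 1


for k in (3, 10, 20, 50):
    print(f"{k:>2} factors: {pairwise_interactions(k):>5} pairs, "
          f"{all_interactions(k)} possible groupings")
```

Pairs grow quadratically, which a big cluster can absorb for a while. Groupings of every size grow exponentially, and no amount of added hardware keeps pace with that for long.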

Just as Heisenberg's Uncertainty Principle defines the outer limits of deterministic modeling in the sciences, there might be an Uncomputability Principle describing the practical limits of big data analytics in high-dimensional modeling. As Yaneer Bar-Yam of the New England Complex Systems Institute states in a recent study: “For any system that has more than a few possible conditions, the amount of information needed to describe it is not communicable in any reasonable time or writable on any reasonable medium.”

However, a powerfully concise theory can encapsulate, crunch and clarify the entire universe with stunning efficiency. Where are the Einsteins of the future going to come from? Will they be the quantitatively oriented scientists? Or will they be the theoretical scientists? Who is more likely to start with cogent thought experiments? And who's more likely to start by cranking up their Hadoop cluster full blast before they've assembled their thoughts?

The best data science demands that you hone your conceptual skills at the same time you're collecting and sifting through your data. Lacking theoretical bearings at the outset of your data-science project, you're likely to fall into a vicious cycle of brute-force cluelessness.