Data science's limitations in addressing global warming

Big Data Evangelist, IBM

Data science is not a magic bag of tricks that can somehow find valid patterns under all circumstances. Sometimes the data itself is far too messy to analyze comprehensively in any straightforward way. Sometimes, it's so massive, heterogeneous and internally inconsistent that no established data-scientific approach can do it justice.

When the data in question is big, the best-laid statistical models can only grasp pieces of its sprawling mosaic. As this recent article notes, that's often the case with climate change data, which is at the heart of the global warming debate. Authors James H. Faghmous and Vipin Kumar state their thesis bluntly: "Despite the urgency, data science has had little impact on furthering our understanding of our planet in spite of the abundance of climate data….This...stems from the complex nature of climate data as well as the scientific questions climate science brings forth."

What's most instructive about their discussion is how they peel the methodological onion behind statistical methods in climate-data analysis. The chief issues, they argue, are as follows:

  • Historically shallow data: Modern climate science is a relatively new interdisciplinary field that integrates the work of scientists in meteorology, oceanography, hydrology, geology, biology and other established fields. Consequently, unified climatological data sets that focus on long-term trends are few and far between. Also, some current research priorities (such as global warming) have only come onto climatology's radar over the past decade or so. As the authors note, "some datasets span only a decade or less. Although the data might be large—for example, high spatial resolution but short temporal duration—the spatiotemporal questions that we can ask from such data are limited."
  • Spatiotemporal scale-mixing: As a closed system, the planet and all of its dynamic components interact across all spatial scales, from the global to the microscopic, and on all temporal scales, from the geological long-term to the split-second. As the authors note, "Some interactions might last hours or days—such as the influence of sea surface temperatures on the formation of a hurricane—while other interactions might occur over several years (e.g., ice sheets melting)." As anybody who has studied fractal science would point out, all these overlapping interactions introduce nonlinear dynamics that are fearsomely difficult to model statistically.
  • Heterogeneous data provenance: Given the global scope of climate data, it's no surprise that no single source, method or instrumentation can possibly generate all of it, either at any single point in time or over the long timeframes necessary to identify trends. The authors note that climate data comes from four principal methodologies, each of them quite diverse in provenance: in situ (example: local meteorological stations), remote sensed (example: satellite imaging), model output (example: simulations of climatic conditions in the distant past) and paleoclimatic (examples: core samples, tree rings, lake sediments). These sources cover myriad variables that may be complementary or redundant with each other, further complicating efforts to combine them into a unified data pool for further analysis. In addition, measurement instrumentation and data post-processing approaches change over the years, making longitudinal comparisons difficult. The heterogeneous provenance of this massive data set frustrates any attempt to ascertain its biases and to vet the extent to which it meets consistent quality standards. Consequently, any statistical models derived from such a pool inherit the same intrinsic issues.
  • Auto-correlated measurements: Even when we consider a very constrained spatiotemporal domain, the statistical modeling can prove tricky. That's because adjacent climate-data measurements often aren't statistically independent of each other. Unlike the canonical example of rolling a die, where each roll's outcome is independent of the others, climate-data measurements are often highly correlated with one another, especially when they're near each other in space and time. Statisticians refer to this problem as "auto-correlation," and it wreaks havoc with standard statistical modeling techniques, making it difficult to isolate the impacts of different independent variables on the outcomes of interest.
  • Machine learning difficulties: In climatological data analysis, supervised learning is complicated by the conceptual difficulty of defining what specific data pattern describes "global warming," "ice age," "drought" and other trends. One key issue is where to put the observational baseline. Does the training data you're employing simply describe one climatological oscillation in a long-term cycle? Or does it describe a longer-term trend? How can you know? If you instead use unsupervised learning, your model may fit historical data patterns well, yet suffer from a statistical problem known as "overfitting": becoming so complex and opaque that domain scientists can't map its variables clearly to well-understood climatological mechanisms. This can render the model useless for predictive and prescriptive analyses.
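To make the auto-correlation point concrete, here's a small hypothetical sketch in Python (the AR(1) process, its coefficient and the effective-sample-size formula are illustrative assumptions, not drawn from the article): a serially correlated series of 10,000 readings carries far fewer truly independent observations than its raw count suggests.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for correlated climate readings: an AR(1) process,
# where each measurement depends strongly on its temporal neighbor.
n, phi = 10_000, 0.9
noise = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

# Estimate the lag-1 autocorrelation from the series itself.
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]

# A standard rule of thumb for AR(1) data: the effective number of
# independent samples is roughly n * (1 - r1) / (1 + r1).
n_eff = n * (1 - r1) / (1 + r1)
print(f"lag-1 autocorrelation ≈ {r1:.2f}, effective samples ≈ {n_eff:.0f}")
```

Because the effective sample size here is a small fraction of the raw count, any standard error computed as if all 10,000 readings were independent would be badly overconfident — which is exactly the havoc the authors describe.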
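The overfitting hazard in the last bullet can also be sketched with a toy example (the synthetic "temperature anomaly" series, the polynomial degrees and the even/odd holdout split are all illustrative assumptions): a flexible model hugs a short training record more tightly than a simple one, yet typically predicts held-out measurements worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical short, noisy "temperature anomaly" record with a weak
# linear trend (all values synthetic).
t = np.linspace(0.0, 1.0, 30)
y = 0.5 * t + rng.normal(scale=0.1, size=t.size)

train = np.arange(0, 30, 2)  # even indices used for fitting
test = np.arange(1, 30, 2)   # odd indices held out

def errors(degree):
    """Mean squared error on the training and held-out points."""
    coeffs = np.polyfit(t[train], y[train], degree)
    mse = lambda idx: np.mean((np.polyval(coeffs, t[idx]) - y[idx]) ** 2)
    return mse(train), mse(test)

train_lo, test_lo = errors(1)   # simple linear model
train_hi, test_hi = errors(10)  # flexible, overfitting-prone model

# The flexible model fits the training record at least as tightly...
print(f"train MSE: degree 1 = {train_lo:.4f}, degree 10 = {train_hi:.4f}")
# ...but generalizes worse to the held-out measurements.
print(f"test  MSE: degree 1 = {test_lo:.4f}, degree 10 = {test_hi:.4f}")
```

The same trap scales up in climate work: with only a decade or two of observations, a model complex enough to "explain" every wiggle in the record may be capturing noise or short-term oscillation rather than the long-term mechanism of interest.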

In spite of all those issues, the authors don't deny the value of data-scientific methods in climatological research. Instead, they call for a more harmonious balance between theory-driven domain science and data-driven statistical analysis. "What is needed," they say, "is an approach that leverages the advances in data-driven research yet constrains both the methods and the interpretation of the results through tried-and-true scientific theory. Thus, to make significant contributions to climate science, new data science methods must encapsulate domain knowledge to produce theoretically-consistent results."

These issues aren't limited to climate data; they apply to other heterogeneous data domains as well. For example, social-network graph analysis is a young field that has historically shallow data and must reconcile disparate sources, both global and local. How can data scientists effectively untangle intertwined sociological and psychological factors, when auto-correlations in human behavior, sentiment and influence run rampant always and everywhere?

If data science can't get its arms around global warming, how can it make valid predictions of swings in the climate of world opinion?