Data Scientist: Bias, Backlash and Brutal Self-Criticism
Data scientists such as Nate Silver have recently begun to receive rockstar status in the big-data universe. That’s a tricky status to sustain for long, because it inevitably inspires popular backlash. You can already see that backlash gaining force, as evidenced by the growing volume of popular discussions of “data-scientist bias,” such as this article and this one.
Nobody’s infallible, of course, and data scientists are only human. That’s not news, considering that every student of science, statistics and business management is sensitized to potential biases in their own analyses as well as in those of the people they work with.
Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar and discipline rigorous, there’s no denying that bias may creep inadvertently into their work. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.
Here are some of the key sources of bias that may crop up in a data scientist’s work:
- Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors—such as a misunderstanding of probabilities—rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
- Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
- Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist runs regression after regression until some correlation turns up in the sample, even though that correlation may be a chance artifact that does not hold in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
- Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms and the wrong metrics of fitness. In addition, overfitting models to past data without regard for predictive lift is another common source of bias. Likewise, failing to rescore and iterate models in a timely fashion with fresh observational data lets the models decay, which introduces bias of its own.
- Funding bias: This may be the quietest but most pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data and applications to favor the interests of the party (employer, customer, sponsor, etc.) that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.
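To make the “data dredging” point concrete, here is a minimal sketch in Python of how chance correlations show up when you test enough variables. Everything in it is an illustrative assumption rather than anything from a real study: the data is pure random noise, the names `pearson_r` and `dredge` are hypothetical, and the threshold of 0.361 is only an approximation of the usual two-sided 5% significance cutoff for a correlation computed on 30 rows.

```python
import random
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_rows=30, n_features=500, threshold=0.361, seed=0):
    """Correlate many pure-noise features with a pure-noise target.

    Returns (hits, replicated): the indices of features whose sample
    correlation exceeds the threshold, and the subset of those that
    exceed it again when every value is redrawn from scratch.
    The threshold is an assumed, approximate 5% cutoff for 30 rows.
    """
    rng = random.Random(seed)

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(n_rows)]

    # Original sample: no feature has any real relationship to the target.
    target = noise()
    features = [noise() for _ in range(n_features)]
    hits = [i for i, col in enumerate(features)
            if abs(pearson_r(col, target)) > threshold]

    # "Wider population": redraw the target and each hit feature.
    fresh_target = noise()
    replicated = [i for i in hits
                  if abs(pearson_r(noise(), fresh_target)) > threshold]
    return hits, replicated


if __name__ == "__main__":
    hits, replicated = dredge()
    print(f"{len(hits)} of 500 noise features look 'significant' in the sample")
    print(f"{len(replicated)} of those still do on fresh data")
```

Because none of these features carries any signal, every “hit” is a false discovery, and you should expect very few of them to survive the second draw. Holding out fresh data, or correcting for the number of comparisons you ran, is one common way to keep this kind of bias out of your findings.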
Clearly, bias is a factor in every aspect of human life. Often, the biases are on the side of the angels: ensuring that all parties focus on creative and practical solutions to common problems that they all face. But biases may also reinforce received wisdoms—including those that we want science to question—long after those wisdoms are out of touch with reality.
Brutal self-criticism is the hallmark of the best data scientists. One of their core responsibilities is to explain, defend and critique their work on any and all levels. In order to earn their rockstar status, they must strive constantly to eliminate any biases that could skew their models away from a valid, data-driven portrait of their subject domain.