Data confidence: The proof's in the process

Big Data Evangelist, IBM

When people say about some found object, "don't touch it, you don't know where it's been," they might as well be speaking of data. You can't use any data with confidence until you ascertain where it came from, who handled it and what they did with it. Until you know its full provenance, you have to take an Olympic-record leap of faith on any data that seemingly dropped in out of the clear blue.

Data governance is a process, but not all processes are designed to foster tip-top trust in their end results. It's not always the fault of data-governance professionals or of the processes they administer to cleanse and prepare data for downstream consumption. Often, the fault resides in the processes under which data comes into being, which can compromise its trustworthiness from the start. Or it might even reside in processes outside the data's creation that constrain its utility from the point of view of downstream uses.

In practice, it's hard to characterize any data domain as inherently trustworthy (or otherwise). That's because sound governance processes might be applied (or not) to any data domain. This is why I took issue with Jerry Thomas' recent ranking of nine specific marketing-relevant data domains "from most trustworthy to least." But when you take a second look at his reasons for ranking them that way, you see that governance processes are implicit in much of his discussion.

Here's how I read process variables into Thomas' discussion of data-domain trustworthiness. Confidence in data-driven insights depends on whether the data and its interpretations have been subjected to the following process controls:

  • Experimental controls: In his discussion, Thomas implicitly ranks data that was gathered in controlled experimental settings as the most trustworthy. He lauds experimental processes as being "carefully designed and carefully controlled [,] conducted by objective third parties who are experts in such experiments [in which] before-after and side-by-side controls are employed, along with sophisticated statistical analyses." He makes essentially the same argument in favor of survey research data, stating that it benefits from "research design [involving] stimulus controls, statistical controls, [and] quality-assurance standards [that] tend to make the data very precise."
  • Statistical modeling controls: Where experimental design is not a factor, Thomas implies that statistical modeling controls are a must for data to be considered trustworthy. As regards marketing-mix and media-mix modeling data, he says they can be trusted to the extent that the "creation of an analytical database [involves] the cleansing and normalizing of that data, and the use of multivariate statistics and modeling to isolate and neutralize some of the noise." He says these processes "tend to make [such] data better than actual sales data [from which they were derived]."
  • Data-interpretation controls: Many of the data domains that Thomas considers low in trustworthiness suffer from what one might consider difficulties of interpretation. As regards sales data, he essentially argues that its validity as a true barometer of customer demand is contaminated by such extrinsic processes as "the economy, competitive activity, the weather, inflation, the vacation cycle, news events, political events, aberrations in inventories and distribution, pricing disturbances." As regards eye-tracking, biometric, and physiological data on customer behavior, he says those can't be trusted as indicators of customer sentiments and propensities unless the interpretation process involves correlation with extrinsic data from survey or qualitative research.
  • Data-provenance controls: The "you don't know where it's been" factor contaminates social media data trustworthiness, says Thomas (this data's decision-support utility is also impacted by the same extrinsic processes that affect the interpretability of sales data). "As social media comments are identified and collected via Web scraping, we almost never know the exact source, the context, the stimulus, or the history that underlie a comment. These unknowns make interpretation risky, indeed."

Clearly, trustworthiness is not intrinsic to data. You can't trust data unless you have visibility into the entire process under which it was created, handled, interpreted, and applied.