Small, Spotty and Zero Data: The Insights Present in Information’s Absence

Big Data Evangelist, IBM

Insights are funny things. They don’t necessarily depend on data, but, without corroboration in the objectively measurable, insights must be taken on faith. And that’s a slippery slope to superstition.

Sleuthing is everything. Insights are something you usually have to work for, and, more often than not, that involves vetting the data from every conceivable angle (what’s there, what’s not there, what it implies, what it leaves to the imagination, what you still have to dig for, etc.).

Even the absence of data can deliver powerful insights, because it forces your mind to consider alternate explanatory scenarios that might never suggest themselves if you had all necessary information handed to you in advance. Not just that, but the presence of data that’s inaccurate, corrupted, or falsified (to the extent that you’re hip to those problems) can sensitize you to the need to hedge whatever insights you might gain from the available high-quality data. It’s too easy to let undetected data-quality issues skew your insights.

Some cynics regard big data as a “the more the merrier” data fetish. Their concern is valid to the extent that some people seem to regard data volume as a necessary and/or sufficient condition for high-quality insights. The “small data” backlash against big data seems to grow out of a popular concern that we’re being encouraged to do data-overkill analysis in every possible scenario. I share those concerns and have written on the topic many times.

In that vein, I took great interest in a recent article called “The #BigData, No Data Paradox.” What I found most interesting about it was its discussion, focusing on healthcare management applications, of scenarios involving three anomalous data categories: missing data, no-value data, and/or valid data that “blows the whistle” on possible data quality issues stemming from fraud, incompetence or negligence. This discussion is not about big data or small data, but, rather, about data that, by its presence or absence, flags business-process issues surrounding its collection, stewardship or management.

Author Carl Ascenzo memorably refers to these data categories as “not” data, “null” data and “naughty” data, respectively. Ascenzo, formerly CIO of Blue Cross Blue Shield of Massachusetts, presents very interesting examples drawn from real-world healthcare management scenarios:

  • “Not” data: “One example is pre-authorizations and the claim that never arrives. Although approved as medically necessary, the patient did not obtain the required diagnostic tests, medical procedures or services, durable medical equipment, or prescription drugs. The ability to detect this missing data in a timely fashion could trigger appropriate actions that could prevent medical issues or even save lives.”
  • “Null” data: “Electronic Medical Record systems can record events such as appointment schedules, test results and prescription orders but until the time of an occurrence contains null values. The null value is not an error per se, but represents the absence of an event or result. However, detecting the null value can play a critical role, particularly with someone that is high risk, such as the canceled appointment that is not rescheduled, test results not reported, and authorization for prescription refills not requested.”
  • “Naughty” data: “Take, for instance, a health plan member who, at the request of their surgeon, calls the member service center to see if bariatric surgery is covered for reimbursement, and if so, wants to find out the qualifying conditions. In this case, natural language processing technology could detect “bariatric surgery,” setting a trigger to watch this member’s future claims for an appendectomy submission -- potentially a fraudulent coding for bariatric surgery.”
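
To make the first two categories concrete, here is a minimal Python sketch of how “not” data and “null” data might be flagged. It is an illustration only, not anything described in Ascenzo’s article: the record shapes, the field names, the 30-day follow-up window and the sample records are all hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical record shapes; a real claims or EMR system will differ.
@dataclass
class PreAuth:
    member_id: str
    service: str
    approved_on: date

@dataclass
class Claim:
    member_id: str
    service: str
    received_on: date

@dataclass
class Appointment:
    member_id: str
    scheduled_for: date
    result: Optional[str] = None  # None stands in for a null value
    high_risk: bool = False

def missing_claims(preauths: List[PreAuth], claims: List[Claim],
                   today: date, window_days: int = 30) -> List[PreAuth]:
    """'Not' data: pre-authorizations older than the window with no matching claim.

    The 30-day default is an assumed threshold, not a clinical standard.
    """
    claimed = {(c.member_id, c.service) for c in claims}
    cutoff = today - timedelta(days=window_days)
    return [p for p in preauths
            if p.approved_on <= cutoff
            and (p.member_id, p.service) not in claimed]

def overdue_nulls(appointments: List[Appointment], today: date) -> List[Appointment]:
    """'Null' data: high-risk appointments whose date has passed with no result."""
    return [a for a in appointments
            if a.high_risk and a.scheduled_for < today and a.result is None]

# Made-up sample records, for illustration only.
today = date(2024, 3, 1)
preauths = [
    PreAuth("m1", "MRI", date(2024, 1, 10)),  # approved, claim never arrived
    PreAuth("m2", "MRI", date(2024, 1, 12)),  # approved, claim received
    PreAuth("m3", "CT", date(2024, 2, 25)),   # too recent to flag
]
claims = [Claim("m2", "MRI", date(2024, 1, 20))]
appointments = [
    Appointment("m1", date(2024, 2, 1), result=None, high_risk=True),
    Appointment("m2", date(2024, 2, 1), result="normal", high_risk=True),
    Appointment("m3", date(2024, 2, 1), result=None, high_risk=False),
]

not_data = missing_claims(preauths, claims, today)
null_data = overdue_nulls(appointments, today)
```

In both functions the signal is an absence, so the check has to be driven by what was expected (an approval, a scheduled event) rather than by what arrived. The “naughty” case is left out because it depends on natural language processing of call-center conversations, which a short sketch cannot represent faithfully.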

Clearly, data issues often stem from process breakdowns, both in the IT organization and in business units. Fixing the data quality issues on a one-time basis won’t necessarily repair the underlying process issues. The more pervasive the process issues and the longer they take to detect, the more thoroughly your data sets will be riddled with “not,” “null” and “naughty” bits.

If you don’t correct these process issues at their root, it won’t matter whether you’re using big data or small data. It will all be bad data, and it will lead your insights astray.