Big data and the power of positive curation

James Kobielus, Big Data Evangelist, IBM

Curation is the million-dollar word in today’s big-data-drenched online culture.

If you’re like me, you’re both skeptical of and tantalized by new words and phrases that come into popular usage, or by existing verbiage that seems to have acquired new meanings virtually overnight. You’re skeptical because you sense that these terms may be superfluous. They may be treading on the semantic territory associated with long-established terms that everybody understands. That’s certainly the case with “curation” in a data context. It feels suspiciously close to “stewardship” in its focus on human assessment of the value of information. One senses that people rarely stop to parse fine definitional distinctions, if any, between curation and this older term.

But you may also be attracted to this new term, sensing that it serves a purpose that “stewardship” has never addressed adequately. For me, the best way to disambiguate near-synonyms is to match up what they connote on various levels. In the case of “curation” vs. “stewardship,” we should compare how each addresses the critical function of helping users trust that information is actionable, judged against various criteria of its quality.

Where stewardship is concerned, the primary quality criterion is data’s trustworthiness. This is manifest in the larger notion of a “single version of truth” in various application domains. This, in turn, refers to the need for a repository where officially sanctioned systems of record are consolidated after they’ve undergone a process of profiling, matching, merging, correction and enhancement. The repository is often called a “data warehouse.” The data in question is usually from structured sources only, or, if from multistructured sources as well, is usually transformed to a structured, relational format before being loaded into the data warehouse. Stewardship is the process of overseeing and enforcing this end-to-end process in accordance with policies defined by data’s owners. The stewardship function is often an administrative workflow that may be automated to a great degree, with human judgment reserved mainly for adjudicating how policies apply to new, exceptional or anomalous data records.

By contrast, curation addresses the data quality criterion of relevance. Consequently, curators might be regarded as being responsible for a “single version of what’s worthy of your consideration.” Curated information may be persisted in any downstream, shared data platform in which users expect to find all (or perhaps a convenient subset) of what’s relevant to a particular application domain or usage scenario. The repositories in which curated information is kept are not usually official systems of record, but, rather, systems of insight, in the broader sense of that term.

With systems of insight, the data that drives insights may incorporate sources with varying degrees of trustworthiness and relevance. This means that for some data types and sources, there may be no firm assurance that all or any of the curated information is up-to-date, consistent or correct. This governance-light curated information is often from multistructured sources and is often kept in its original formats when it’s persisted downstream. Because much of the curated data is from unstructured sources, it may often reside and be consumed in big data analytics platforms.

Fundamentally, curation refers to participation in this process as a subject matter expert who discovers, reviews, refines, analyzes, organizes, selects, tags, contextualizes and recommends relevant information. Curators typically aren’t the data’s owners, and are often distinct from the stewards who administer the trustworthiness policies and procedures regarding official system-of-record data. Also, curators tend to exercise human judgment to a high degree and share out their content recommendations in collaborative and social contexts, albeit with varying degrees of automated guidance. This is in contrast to the structured administrative workflows associated with data stewardship.

All of this discussion should help you understand how IBM is defining “curation” as implemented in the new offering Watson Curator, recently announced at IBM Insight 2014. Watson Curator is “a software-as-a-service offering that increases confidence in the delivery of quality content collections and governance for IBM Watson Solutions. For example, an individual insurance risk analyst can quickly review and add context to documents so that many underwriters across the enterprise can get higher quality responses on risk assessments from Watson Engagement Advisor. IBM Watson Curator actively guides subject matter experts (in this case the risk expert) through the entire curation process in order to minimize the time and effort required. This capability improves the relevance and quality of the information used for analytics.”

Clearly, the curator in this hypothetical example (an insurance risk expert) is not usually the same individual who handles stewardship over policyholder data. Nor should they be. Instead, the curator is someone who is closer operationally, and in professional background, to the business outcomes that can be realized more effectively through expert-contextualized policyholder data. The curator relies on the steward to ensure that the data to be contextualized is current, correct and consistent. And they both rely on data archivists to make sure that all information that’s been officially curated and cleansed is retained for as long as necessary in accessible, queryable databases.

Curation, like stewardship and archiving, needs to be instituted as a collaborative function in most data analytic environments. You can’t curate data effectively if the data is messy beyond all belief. And you can’t clean it up or assess its full relevance if the data simply isn’t there anymore.