Big data powers the practical know-it-all in us all

Big Data Evangelist, IBM

A living subject domain is conceptual territory that must be scouted continually. Even if you're long familiar with the domain's heartland (from study, reading, work history, the School of Hard Knocks or other firsthand experience), a subject's frontiers may have shifted while you weren't paying attention. Or the domain may be so vast, dynamic and multifaceted that no human can reasonably be expected to keep up with it all.

Sometimes your job depends on sustaining your reputation as an expert on a subject domain. At the very least, you may be the go-to person who knows where all relevant expertise on a subject resides and can retrieve all of it on demand. Your career may hinge on creating an image of subject-specific omniscience. What that means, in practice, is that you have the full library of knowledge on that subject near at hand, and that the core of it is cached in your working memory.

How can you rapidly identify the entire corpus of relevant knowledge on a specific subject? I refer to this as corpus omniscience (not to be confused with cosmic omniscience, which is only possessed by deities, Sherlock Holmes and other transcendental beings). Corpus omniscience is not some blue-sky fantasy of brainiac geeks (which I admit I am); rather, it is the daily imperative of various sorts of knowledge workers that Johannes Scholtes calls out in his recent article on "Text analysis: The next step for eDiscovery, Legacy Information Clean-up and Enterprise Information Archiving."

According to Scholtes, "Text analysis is particularly interesting in areas where users must discover new information, such as, in criminal investigations, legal discovery and when performing due diligence investigations. Such investigations require 100 percent recall; for example, users cannot afford to miss any relevant information. In contrast, a user who uses a standard search engine to search the internet for background information simply requires any information as long as it is reliable. During due diligence, a lawyer certainly wants to find all possible liabilities and is not interested in finding only the obvious ones."

In his detailed discussion of text analytics, Scholtes lists several roles that depend on subject-specific corpus omniscience and that increasingly leverage big-data search tools to achieve it. For such people, it's not a matter of finding the most relevant search hits—it's more a matter of finding, sorting, organizing and analyzing all relevant hits responsive to any question they might conceivably ask in the subject domain at hand. Per Scholtes' discussion, some of the roles that need corpus omniscience include:

  • "Investigators want to comb documents to find key facts or associations (the smoking gun)"
  • "Lawyers need to find privileged or responsive documents"
  • "Patent lawyers need to search for related patents or prior art"
  • "Historians need to find and analyze precedents and peer-reviewed data."

Beyond professional pride, the key reason why any of them needs fast know-it-all powers is risk mitigation. Failure to get their hands, and heads, around literally everything of relevance means likely failure to achieve their core objectives. The investigator who fails to discover all material facts risks fingering the wrong perpetrator or no one at all. The litigator who fails to find all relevant documentation risks losing the case. The patent firm that overlooks a key piece of prior art risks seeing its client's application rejected by the patent office. And the historian who fails to consider the entire body of research and commentary on their topic risks having their work repudiated in peer review.

You could soften the concept by introducing the notion of quasi-omniscience: a stated level of confidence that you've e-discovered 99+ percent of the domain-relevant material. Regardless of whether you're strict or soft with this concept, multi-layered metadata is key to ensuring that a search engine can deliver on this vision.
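To make the strict-versus-soft distinction concrete, here is a minimal Python sketch of recall, the measure Scholtes invokes when he says investigations require 100 percent recall. It assumes you already have a trusted set of relevant document IDs, for instance from a hand-reviewed sample; in real e-discovery that set is only ever estimated. The function names and the 0.99 threshold are illustrative assumptions, not anything taken from Scholtes' article.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return len(relevant & set(retrieved)) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the returned documents that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def is_quasi_omniscient(retrieved, relevant, threshold=0.99):
    """Hypothetical 'soft' test: recall meets an assumed 99 percent bar."""
    return recall(retrieved, relevant) >= threshold


if __name__ == "__main__":
    relevant_ids = set(range(200))                # assumed ground truth
    hits = set(range(199)) | {1000, 1001}         # misses one, adds two irrelevant
    print(recall(hits, relevant_ids))             # share of relevant docs found
    print(precision(hits, relevant_ids))          # share of hits that are relevant
    print(is_quasi_omniscient(hits, relevant_ids))
```

A standard web search optimizes for precision in the first few hits; the roles listed above care far more about recall, which is why the two are computed separately here.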

Scholtes provides a staggering inventory of the layers of metadata, both raw and derived, that would be needed for search engine omniscience (strict or soft). Devoting his remarks mostly to corpora consisting of unstructured and semi-structured text, he includes metadata associated with file systems, documents, emails and collaboration portals. He also includes metadata associated with hash calculation, duplicate detection, language detection, concept extraction, entity extraction, fact extraction, attribute extraction, event extraction, sentiment detection and extended natural language processing. He brings machine learning models into the discussion as well, owing to their importance in detecting heretofore hidden patterns in the data that might indicate the presence of domain-relevant content. And so on.
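Two of the simplest derived-metadata layers in that inventory, hash calculation and duplicate detection, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it catches exact duplicates after whitespace and case normalization, not near-duplicates, and the function names and normalization rule are my own rather than anything Scholtes specifies.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_hash(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: Dict[str, str]) -> List[List[str]]:
    """Group document IDs whose normalized content hashes are identical."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        groups[content_hash(text)].append(doc_id)
    # Only groups with more than one member are duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    corpus = {
        "memo-1": "Quarterly results attached.",
        "memo-2": "quarterly   results attached.",
        "memo-3": "See the patent filing.",
    }
    print(find_duplicates(corpus))
```

The hash becomes one more metadata field on each document, so a reviewer chasing full recall can skip material already seen without losing track of where each copy lives.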

Beyond a super-powerful search engine, such as IBM Watson, one other thing is essential for ensuring corpus omniscience: a powerful social network in which you can find the next best expert on any topic at hand. No matter how powerful they might be, automated search engines enable a type of “simplex” knowledge transfer, that is, delivery of intelligence from a machine or cloud back to only the person doing the querying. However, search engines rarely enable the predominant person-to-person flow of mission-critical expertise.

To have confidence that you have surveyed the entire relevant corpus, you need a network of domain experts at your disposal. Quite often, the most important insights are those that issue from other people’s heads, not from any specific search engine or big data store. Many real-world intelligence flows are full-duplex, many-to-many and person-to-person in orientation. This fundamental truth continues to drive the spread of social-centric knowledge-sharing architectures in business intelligence and advanced analytics solutions.

Call it social search. It thrives on all the metadata, tags, recommendations and descriptors that help people to size up each other's relevance to the corpus.