Transform text from its native format to simplify its analysis for decision making
As the world continues to embrace big data and discover its real value, attention turns to textual data. Organizations can find value in textual data such as emails, call center exchanges, corporate contracts, warranty claims, insurance claims, and so on. Indeed, many online estimates place the unstructured portion of organizational data at roughly 80 percent,1 and a significant share of that unstructured data is text.2
The problem is that using textual data for making decisions is not easy. Textual data does not fit easily into a conventional database management system (DBMS), and traditional business intelligence (BI) software requires that data be in a standard DBMS format. So the textual data found in large supply in a big data repository is hard to analyze, and its potential value for growing the business goes untapped.
This situation is where the technology known as textual disambiguation comes into play. Textual disambiguation reads textual data in its native narrative format and transforms it into a standard database format, where it can be analyzed by standard BI tools. In essence, the text is normalized, even though normalization was never designed with text in mind (with apologies to Ted Codd).3
The process of textual disambiguation is quite complex, which should be no surprise given that language is inherently intricate. Textual disambiguation has many facets, but perhaps the most interesting is establishing the context of the text itself. Indeed, context-free text can be considered dangerous. For text to be used for making decisions, context must be established. The problem is that context is a multifaceted, almost amorphous aspect of text.
Depending on the data, there are many different ways to establish the context of text. Trying to pin down how text is to have its context established is similar to playing the carnival game Whac-A-Mole: you smash one mole, and another immediately pops up out of a different hole. You never seem to get them all.
Perhaps the most common way to establish the context of text is to apply external taxonomies and ontologies. When these are applied to raw text, many instances of context can be established. From a mechanical standpoint, applying an external taxonomy or ontology to the raw text found in a document is a fairly simple task.
Having multiple taxonomies and/or ontologies to apply is normal. Suppose the document is from General Motors. The external taxonomies that are applied might include car parts, car models, bills of material, and accounting taxonomies. The key challenge in applying taxonomies and ontologies to raw text is the establishment of the external taxonomy in the first place. There are, in fact, commercial suppliers of external taxonomies.
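As a toy illustration, the mechanics of applying taxonomies to raw text might be sketched as follows. The taxonomy contents here are hypothetical and tiny; real taxonomies, often licensed from commercial suppliers, run to many thousands of terms:

```python
# A minimal sketch of applying external taxonomies to raw text.
# Each taxonomy maps specific terms to a broader classification;
# the entries below are hypothetical examples only.
taxonomies = {
    "car_parts": {"alternator", "camshaft", "brake pad"},
    "car_models": {"corvette", "silverado", "malibu"},
}

def classify_terms(text):
    """Tag each known term found in the text with its taxonomy name."""
    hits = []
    lowered = text.lower()
    for taxonomy_name, terms in taxonomies.items():
        for term in terms:
            if term in lowered:
                hits.append((term, taxonomy_name))
    return sorted(hits)

doc = "The Corvette was returned with a faulty alternator."
print(classify_terms(doc))
# [('alternator', 'car_parts'), ('corvette', 'car_models')]
```

Each hit turns an otherwise context-free word into a classified fact ("corvette" is a car model), which is exactly the kind of context an analytical database can exploit.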
Another technique for applying taxonomies to raw text is through proximity analysis. Often, text strings take on different meanings when found in proximity to each other. For example, the words in “Dallas Cowboys” probably bring to mind a professional football team. But if “Dallas” is in one paragraph and “cowboys” appears three pages later, most likely something other than a football team comes to mind. Proximity analysis is another useful form of contextualization. There are many different forms of contextualization. The examples discussed here are merely the tip of the iceberg.
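A rough sketch of the proximity idea: two terms contribute a combined context only when they occur close together. The window size of three words used here is an arbitrary illustrative choice:

```python
# A minimal sketch of proximity analysis: two terms only form a
# combined context when they fall within a small window of words.
def in_proximity(text, term_a, term_b, window=3):
    """Return True if the two terms occur within `window` words of each other."""
    words = text.lower().split()
    positions_a = [i for i, w in enumerate(words) if w == term_a]
    positions_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in positions_a for b in positions_b)

# Adjacent: reads as the football team.
print(in_proximity("the dallas cowboys won on sunday", "dallas", "cowboys"))  # True
# Far apart: no combined context is inferred.
print(in_proximity("dallas is hot and some cowboys like rodeo shows", "dallas", "cowboys"))
```

In practice, "proximity" might be measured in words, sentences, paragraphs, or pages, and the threshold would be tuned per taxonomy rather than fixed.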
Another classical function of textual disambiguation is standardization. Some data must be standardized before it can be placed into an analytical database, and date data is a good example. Suppose three documents contain the values June 12, 2014; July 20th of 1945; and 2014/03/17. A human reading these values has no problem understanding that they refer to dates. But to a computer, they are just more text.
For these values to serve usefully in an analytical database, they must be standardized.
And to standardize dates, computers must use textual disambiguation to carry out the following tasks:
- Recognize the dates
- Determine which date each value refers to
- Convert the dates to a common value
Only after textual disambiguation has performed these activities can an analytical program successfully use the data.
Another function of textual disambiguation is simple editing of data. Acronyms are common in text, but they can be confusing. To build the analytical database correctly, acronyms must be expanded into their full meanings.
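A sketch of that expansion step, with a hypothetical lookup table; in practice the table is industry-specific, and the same acronym can expand differently depending on context:

```python
# A minimal sketch of acronym expansion. The table below is
# hypothetical; real tables are industry-specific and must cope
# with acronyms whose meaning depends on context.
import re

ACRONYMS = {
    "ABS": "anti-lock braking system",
    "ETA": "estimated time of arrival",
}

def expand_acronyms(text):
    """Replace each known acronym (as a whole word) with its expansion."""
    for short, full in ACRONYMS.items():
        text = re.sub(rf"\b{short}\b", full, text)
    return text

print(expand_acronyms("Customer reported an ABS fault; ETA for the part is Tuesday."))
```

The word-boundary match (`\b`) keeps the substitution from firing inside longer words, a small detail that matters when the table grows.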
And the list of functionality for textual disambiguation goes on. Because language is so multifaceted, a very complex, very elaborate set of logic is needed to take language and convert it into a normalized form of data. Not only must textual disambiguation provide very different and diverse functions, but the order in which those functions are applied to raw text matters as well.
In addition, textual disambiguation produces many forms of output. Accordingly, different normalized forms of data are introduced into different tables. The tables that are produced are designed so that a standard SQL join can be used to consolidate the data into a single file.
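As a small illustration of that consolidation step, consider two hypothetical output tables joined on a shared document key. The table and column names here are invented for the example:

```python
# A minimal sketch of consolidating disambiguation output with a
# standard SQL join. Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE doc_terms (doc_id INTEGER, term TEXT)")
cur.execute("CREATE TABLE doc_dates (doc_id INTEGER, doc_date TEXT)")
cur.executemany("INSERT INTO doc_terms VALUES (?, ?)",
                [(1, "alternator"), (2, "camshaft")])
cur.executemany("INSERT INTO doc_dates VALUES (?, ?)",
                [(1, "2014-06-12"), (2, "1945-07-20")])

# Join the separate normalized output tables on the shared document key.
rows = cur.execute("""
    SELECT t.doc_id, t.term, d.doc_date
    FROM doc_terms t
    JOIN doc_dates d ON t.doc_id = d.doc_id
    ORDER BY t.doc_id
""").fetchall()
print(rows)  # [(1, 'alternator', '2014-06-12'), (2, 'camshaft', '1945-07-20')]
```

Because each output table carries the same document key, any standard BI tool that speaks SQL can perform this consolidation without knowing the rows originated as free-form text.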
Generating an analytical database
Textual disambiguation’s final output is an analytical database that can be analyzed with standard BI tools. The newly minted analytical database looks like any other database the end user has been analyzing, except that its source of information is text. End users doing analysis are probably not even aware that the textual analytical database is any different from what they have been using for years.
The analytical database has nothing in it that distinguishes it from any other database used for analytics. However, now organizations can easily and naturally start to include text in their decision-making processes.
Please share any thoughts or questions in the comments.
1 “Apply New Analytics Tools to Reveal New Opportunities,” IBM Smarter Planet website.
2 See also, “Textual Data – A Brief Sojourn,” by Bill Inmon, BeyeNETWORK blog, February 2012.
3 SELECT * FROM SQL History, Edgar “Ted” Codd biography, FairCom website.