
Understanding the context of unstructured data

Owner, Forest Rim Technology

The computer is optimized for structured data by virtue of the database management systems (DBMSs) that operate within it. DBMSs work well managing highly repetitive data such as airline reservations, banking transactions, retail sales and so on. The same type of data is captured and stored in a redundant fashion.
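
To make "highly repetitive" concrete, here is a minimal sketch in Python with SQLite; the table and values are invented for illustration. Every record has the same shape, which is exactly the uniformity a DBMS exploits.

    import sqlite3

    # Every row has the same shape: this uniformity is what a DBMS exploits.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
        ("store_1", "widget", 9.99),
        ("store_2", "widget", 9.99),
        ("store_1", "gadget", 4.50),
    ])

    # Repetitive structure makes aggregation trivial.
    print(conn.execute("SELECT SUM(amount) FROM sales").fetchone())  # (24.48,)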

In contrast, forcing textual data into a standard DBMS has always been a kluge, and storing textual information has been a big challenge for a long time. Textual data is simply not repetitive, and text fits in a standard DBMS in much the same way that a human foot fits inside a test tube: if such an unlikely fit can be accommodated at all, it will be quite awkward.

Several approaches to this challenge

Over time there have been many efforts to force text into a standard DBMS because the benefits of doing so are real. An obvious advantage is that by storing text in a standard DBMS, organizations can handle huge amounts of text when performing analytical processing. Without access to text inside a DBMS, organizations are stuck reading text manually. A human being can read only so much, and the brain can hold only a limited amount of text before some of it is forgotten; when a person reads huge volumes of text, the words go in one side of the brain and leak out the other. A computer, in contrast, can store and process a practically unlimited amount of data. So there are distinct advantages to stuffing text inside a standard DBMS rather than processing it manually.

The first attempt at storing text in DBMSs was the binary large object (BLOB). BLOBs enabled text to be stored electronically, and in that sense BLOBs worked. The problem was that even though BLOBs allowed information to be stored electronically, nobody could do anything with the information. BLOBs solved the storage problem and not much else. Time has shown that BLOBs were never a real solution.
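
As a rough illustration of that limitation, here is a minimal sketch in Python with SQLite, using an invented document. The text survives storage intact, but the database treats it as an opaque byte string: retrieval works, analysis does not.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body BLOB)")

    # The document is stored whole, as opaque bytes (hypothetical example text).
    note = "Patient reports chest pain radiating to the left arm."
    conn.execute("INSERT INTO documents (body) VALUES (?)", (note.encode("utf-8"),))

    # Retrieval works fine ...
    (body,) = conn.execute("SELECT body FROM documents WHERE id = 1").fetchone()
    print(body.decode("utf-8"))

    # ... but a question like "which documents mention pain?" requires reading
    # every BLOB back out and scanning it by hand -- the storage problem is
    # solved, the analysis problem is not.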

The process of tagging text marked a second approach. Text could be read, and certain words could be tagged. Once tagged, the words could be referenced relatively quickly. Tagging was certainly a step in the right direction, but tagging was at best only a stopgap solution. Tagging was like trying to hold water in a sieve. A sieve is better at holding water than merely pouring the water down the drain, but not by much. Far too much important information is lost when employing a sieve-like tagging approach.
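
The sketch below shows the general idea of tagging, with a hypothetical vocabulary and document. Words on the list are marked for quick reference later; everything not on the list drains away, which is the sieve effect described above.

    # Hypothetical vocabulary of words worth tagging.
    VOCABULARY = {"pain", "fever", "fracture"}

    def tag(text: str) -> list[tuple[int, str]]:
        """Return (position, word) pairs for each vocabulary word in the text."""
        tags = []
        for position, word in enumerate(text.lower().split()):
            word = word.strip(".,;")
            if word in VOCABULARY:
                tags.append((position, word))
        return tags

    # Tagged words can now be referenced quickly ...
    print(tag("Patient reports pain and a mild fever."))  # [(2, 'pain'), (6, 'fever')]
    # ... but every untagged word, and all the meaning around it, is lost.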

Natural language processing (NLP) is a third approach that attempts to derive understanding of language by understanding its rules. The NLP approach definitely represents a step up from tagging, but it is very complex because the rules of language are highly multifaceted. NLP does not account for the importance of the context of text, and in 95 percent of cases the context of text resides outside of the text itself.
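
The toy example below, with invented rules, hints at both problems: each new phrasing needs another rule, and no rule can see context that lives outside the sentence itself.

    import re

    # Hypothetical rules mapping phrasings to meanings. Rules like these
    # multiply quickly, and none of them can see context outside the text.
    RULES = [
        (re.compile(r"\bcaught a cold\b"), "illness"),
        (re.compile(r"\bcold (weather|front|morning)\b"), "temperature"),
    ]

    def interpret(sentence: str) -> str:
        for pattern, meaning in RULES:
            if pattern.search(sentence):
                return meaning
        return "unresolved: meaning depends on context outside the sentence"

    print(interpret("The patient caught a cold last week."))  # illness
    print(interpret("It was a cold morning on the ward."))    # temperature
    print(interpret("The sample was kept cold."))             # unresolved: ...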

Textual disambiguation is a fourth approach, now available, that retains the advantages of NLP while recognizing and accounting for the context that resides outside the text. It uses a wide variety of techniques to derive that context and to edit and prepare text for entry into a standard DBMS. Once disambiguation is complete, the text can be placed directly inside a standard DBMS, where organizations can apply all sorts of analytical processing to extract meaning and insight.
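
The sketch below is not Forest Rim's actual method, only a minimal illustration of the idea under invented assumptions: context from outside the text (here, the hospital department a note came from) resolves an ambiguous acronym, and the resolved terms land as ordinary rows that standard SQL can query.

    import sqlite3

    # Hypothetical external context: the same acronym means different things
    # depending on which department wrote the note.
    ACRONYMS = {
        ("cardiology", "ha"): "heart attack",
        ("neurology", "ha"): "headache",
    }

    def disambiguate(department: str, text: str) -> list[str]:
        """Resolve acronyms using context that lives outside the text."""
        return [ACRONYMS.get((department, w.strip(".,").lower()), w)
                for w in text.split()]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (doc_id INTEGER, term TEXT)")

    notes = [(1, "cardiology", "Patient presented with HA."),
             (2, "neurology", "Recurring HA reported.")]
    for doc_id, dept, text in notes:
        for term in disambiguate(dept, text):
            conn.execute("INSERT INTO terms VALUES (?, ?)", (doc_id, term))

    # Once disambiguated text sits in relational form, ordinary analytics apply.
    for row in conn.execute("SELECT doc_id FROM terms WHERE term = 'heart attack'"):
        print(row)  # (1,)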

Practical application of textual analysis

Where is analytical processing of text important? Frankly, textual analysis is important everywhere. It’s necessary in analyzing things such as call center conversations, customer and employee surveys, email, medical records, restaurant and hotel feedback, warranty information and the like. Analyzing textual data may even be more important than analyzing classical structured data. The advent of textual disambiguation means that a longtime tough nut is finally cracked.

Turning this capability into value takes NLP-skilled data scientists who can transform valuable algorithmic insights from text into successful business outcomes. Maybe you’d like to be the hero in your business who makes this transformation happen? Attendees at this year’s IBM Vision 2015 conference can delve deeper into these and other big data analytics technologies. Registration for IBM Vision 2015 is still open. If you cannot attend, be sure to register for IBM VisionGO, the interactive digital platform for the conference.