Taking the next step toward text analytics
When I first started speaking English, becoming fluent within the context of my work didn’t take me too long. I could easily converse about computer problems and so on, but in social situations, English was much more difficult. I did not have the proper vocabulary within the context of social engagements. The same is true for text analytics.
Linguistic researchers build corpora to represent different domains. A corpus is a collection of analyzed text used for linguistic research. Information such as word frequency and word usage helps in building language understanding. A corpus can be divided into categories. For example, the Corpus of Contemporary American English is divided into categories such as academic, fiction, magazine, newspaper and spoken. Note that the name itself indicates that it is specific to contemporary American English. The corpus may not be as useful if applied to British English.
Different corpora apply to different domains. A corpus built for medicine may have multiple categories for different specialties of medicine. But such a corpus would likely be almost useless in a computer science context. Then consider emails, instant messaging, mobile phone texting, Twitter and, believe it or not, so-called cell phone novels that began surfacing in Japan in 2003. Imagine how different a corpus would be for these channels of communication.
If we want to analyze emails or text messages, we need to informally create a corpus by looking through a set of messages to identify the abbreviations that are used and the way the messages are written. Suppose we identify abbreviations such as IMHO, LOL, P2P, TBD, U and UR. The first task is to identify their meanings and determine whether any have multiple meanings. For example, does P2P represent peer to peer, parent to parent or something else?
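This first task amounts to building a small lookup table of expansions and flagging the entries that have more than one. Here is a minimal sketch in Python; the expansions and the table itself are illustrative assumptions, not part of any standard corpus:

```python
# Hypothetical abbreviation table built by scanning a set of messages.
# Entries with more than one expansion need context to resolve.
ABBREVIATIONS = {
    "IMHO": ["in my humble opinion"],
    "LOL": ["laughing out loud"],
    "P2P": ["peer to peer", "parent to parent"],  # ambiguous
    "TBD": ["to be determined"],
    "U": ["you"],
    "UR": ["your", "you are"],  # ambiguous
}

def is_ambiguous(abbrev):
    """True if the abbreviation has more than one known expansion."""
    return len(ABBREVIATIONS.get(abbrev.upper(), [])) > 1
```

Running `is_ambiguous("P2P")` returns True, signaling that this entry cannot be expanded mechanically and needs a context-based decision.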
Next, we can ask ourselves, “Can we ignore the abbreviations? Are they important for our analysis, or do they interfere with our analysis?” The answer depends on the analytical context. In one extreme example, an entire message could be written in abbreviations:
OK IDK FWIW IMHO ROTFL JK ILY TTFN
In this case, the message is totally useless if we use a standard English approach. Does that mean we have to throw everything away and create a new language and grammar? Not necessarily. We could use a translation step instead. Annotation Query Language (AQL) allows for extracting tokens and, through the use of a table construct, mapping abbreviations to expressions. A table definition that addresses this problem could look like this:
create table convert (short Text, long Text) as
. . .
('TTFN', 'Ta ta for now.') ;
The content of the table could also be provided at load time instead of being part of the source code. In addition, we need to decide how to map the abbreviations. Should OK stay as OK? Should TTFN simply be replaced with "goodbye"? Even if the conversion could be done entirely in AQL, it seems more natural to catch these cases as the text is ingested. The conversion is simple and straightforward and does not require advanced analytics. A simple mapping in an IBM InfoSphere Streams job can easily take care of the conversion and pass the processed text to AQL for in-depth processing.
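The ingest-time conversion described above is just a token-by-token substitution. The following is a minimal sketch in Python standing in for the Streams operator; the mapping entries are illustrative, and in a real deployment this logic would live inside an InfoSphere Streams job upstream of AQL:

```python
# Illustrative conversion table, analogous to the AQL table above.
# In production this mapping would be loaded at ingest time.
CONVERT = {
    "OK": "OK",  # some tokens are kept as-is
    "IDK": "I don't know",
    "FWIW": "for what it's worth",
    "IMHO": "in my humble opinion",
    "ROTFL": "rolling on the floor laughing",
    "JK": "just kidding",
    "ILY": "I love you",
    "TTFN": "ta ta for now",
}

def expand(message):
    """Replace known abbreviations token by token, leaving unknown
    tokens untouched, before handing the text to deeper analysis."""
    return " ".join(CONVERT.get(tok.upper(), tok) for tok in message.split())
```

Calling `expand("OK TTFN")` yields "OK ta ta for now", which standard English tooling can then process.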
At this point, you can clearly see that text analytics is domain dependent. In many cases, a domain provides a small vocabulary set and recurring sentence constructs. We can take advantage of a specific domain to convert acronyms and abbreviations, if necessary, to make the text easy to analyze through standard tooling.
This approach is one step closer toward taking advantage of text analytics in our environment. Look for further discussion in upcoming posts.