Using the right technology for classification and text analytics

The vast majority of new data is generated in unstructured formats, and much of it is text. How can we turn that data into actionable information? The answer is twofold but intertwined: document classification and text analytics.

Indicating meaning through part of speech

What ties these two seemingly separate activities together? The text analytics community provides an answer: the corpus. A corpus is a collection of texts covering a specific subject, author or language. Well-known examples include the Brown corpus and the Corpus of Contemporary American English (COCA). Although a corpus covers a specific language, a single language can have multiple corpora: American English, British English and so on.

A corpus is also divided into categories. The Brown corpus, for example, comprises 15 categories, among them three for press and five for fiction. Annotations to the documents in a corpus identify each word’s type by what is called its “part of speech.”

A word’s part of speech identifies the word as a noun, an adjective, a verb, an adverb or the like. Describing a word by its part of speech helps indicate that word’s intended meaning. For example, consider the following two sentences:

The problem was solved within a minute of the call.
The problem was solved by a few minute environmental adjustments.

The meaning of “minute,” and even its pronunciation, differs between the two sentences; it is each word’s part of speech that makes the intended meaning clear. Even words that sound identical can differ in meaning by context: consider the difference between finding a bug in a program and finding a bug in a salad.
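As a quick illustration, a part-of-speech tagger can be run over the two sentences; a contextual tagger should mark “minute” as a noun (NN) in the first sentence and as an adjective (JJ) in the second. The following is a minimal sketch in Python, assuming the NLTK library and its tokenizer and tagger models are installed:

    import nltk

    # Download the tokenizer and tagger models once, if not already present.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentences = [
        "The problem was solved within a minute of the call.",
        "The problem was solved by a few minute environmental adjustments.",
    ]

    for sentence in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        # Show the tag assigned to "minute" in each sentence.
        print([pair for pair in tagged if pair[0] == "minute"])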

Classifying documents using a corpus

Vocabularies differ among categories, which opens the door to classifying documents through machine learning and to using text analytics that take a document’s category into consideration during processing. In brief, this means training an algorithm to classify documents according to the frequency with which they use certain words. Starting with a corpus of already classified documents, we can create a model and apply it to new documents with a high level of confidence that the model will classify them appropriately.

For example, Spark and SparkML can be used to develop a model, based on an appropriate corpus, that can then be applied in real time to incoming documents using IBM InfoSphere Streams. After being classified, the documents can be routed appropriately for text analytics and required action. Even the decision whether to store a document can depend on the result of the text analytics done after the classification.
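As a minimal sketch of that first step, a SparkML pipeline along the following lines trains a classifier on word frequencies; the tiny inline corpus and the 1,024-bucket feature size are placeholders for a real classified corpus and tuned parameters:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF
    from pyspark.ml.classification import NaiveBayes

    spark = SparkSession.builder.appName("doc-classification").getOrCreate()

    # Stand-in for a corpus of already classified documents: (text, category).
    train = spark.createDataFrame(
        [
            ("the program crashed because of a bug in the parser", 0.0),
            ("the salad was returned because of a bug on the lettuce", 1.0),
        ],
        ["text", "label"],
    )

    pipeline = Pipeline(stages=[
        Tokenizer(inputCol="text", outputCol="words"),             # split into words
        StopWordsRemover(inputCol="words", outputCol="filtered"),  # drop stop words
        HashingTF(inputCol="filtered", outputCol="features",
                  numFeatures=1024),                               # hash words into buckets
        NaiveBayes(featuresCol="features", labelCol="label"),      # frequency-based classifier
    ])

    model = pipeline.fit(train)  # train on the classified corpus
    model.transform(train).select("text", "prediction").show(truncate=False)

In a deployment like the one described here, the fitted model would then be applied by InfoSphere Streams to documents as they arrive.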

Classification can be accomplished by dividing a text into words and counting their frequencies. Because English has a great many distinct words, each word can be hashed into one of a fixed number of “buckets,” producing a compact representation of a document’s characteristics. Better still, common words such as “the,” “of” and “a” (known as “stop words”) can be removed from the document beforehand, as can infrequently used words.
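The following bare-bones Python sketch shows the idea. The bucket count and stop-word list are illustrative, and a production system would use a stable hash function such as MurmurHash rather than Python’s built-in hash, which varies between runs:

    NUM_BUCKETS = 16                        # illustrative; real systems use far more
    STOP_WORDS = {"the", "of", "a", "was"}  # tiny placeholder list

    def featurize(text):
        """Map a document to word counts over a fixed number of buckets."""
        counts = [0] * NUM_BUCKETS
        for word in text.lower().split():
            if word not in STOP_WORDS:
                counts[hash(word) % NUM_BUCKETS] += 1
        return counts

    print(featurize("The problem was solved within a minute of the call"))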

Extracting information through text analytics

After classifying a text, we can use increasingly specific text analytics to extract the pertinent information. Text analytics can be as simple as extracting keywords, but ideally it goes beyond that, relying on context for higher precision and recall. It can be performed through a procedural API, but a declarative language eases the writing of extraction code.

IBM InfoSphere Streams—among other IBM products—uses a text analytics language called AQL, whose close relation to SQL can aid its adoption. AQL can consider parts of speech and can match words by their root, avoiding false distinctions between singular and plural forms or among verb tenses.
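AQL itself is declarative, but the effect of root matching can be pictured with a stemmer. This rough Python analogue uses NLTK’s PorterStemmer to reduce inflected forms to a shared root:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["adjustment", "adjustments", "adjusted", "adjusting"]:
        print(word, "->", stemmer.stem(word))

    # All four forms reduce to the root "adjust", so a rule written
    # against the root matches every one of them.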

AQL even includes constructs for relative positions between words, allowing rules to be defined that require the presence of a particular word within, for example, two to five words of another specified word. Indeed, many of AQL’s features are specifically designed to aid the extraction of information from unstructured text.
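AQL expresses such positional constraints declaratively; a rough procedural analogue in Python is a regular expression requiring “minute” to appear within two to five words after “problem” (the words chosen are purely illustrative):

    import re

    # "problem", then two to five intervening words, then "minute"
    pattern = re.compile(r"\bproblem\b(?:\W+\w+){2,5}\W+minute\b", re.IGNORECASE)

    text = "The problem was solved within a minute of the call."
    match = pattern.search(text)
    print(match.group(0) if match else "no match")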

The conclusions reached through text analytics can determine the value of a document, as well as whether, and where, to store it. Text analytics can also trigger additional actions, such as raising alerts, and can do so dynamically as documents are received.

This discussion has described using Spark to create a classification model and then using InfoSphere Streams for continuous analysis, applying the model to classify incoming documents and to identify the text analytics needed to process them further. Using the right technology for a task is a must: a quick, efficient solution can create business advantage. To learn more, explore the power of text analytics, streaming analytics and Spark.