
Extracting Information By the Numbers

Apply text analytics that uses word proximity to extract specific information

Executive IT Specialist, Competitive and Product Strategy, IBM Analytics, IBM

Analyzing text doesn’t mean looking only for keywords. Numbers can also be the information analysts need to glean from data. These numbers can be integers, floating-point numbers, currency, and many other numerical units.

The recent IBM Data magazine article, “Want to Glean Sentiment from Structured Social Media Data?”* discussed extracting meaning from text by looking for specific words. This simple approach enables extracting valuable information from structured social media data. Here, some additional effort can be applied to go a few steps further and embellish the derived textual data with numerical information.

Gleaning specific information

A good text analytics tool needs to define what a number is, often by using what is called a regular expression. Using the Annotation Query Language (AQL), analysts can define a number, the first piece of a dollar amount, as follows:

create view Number as
extract regex
/\d+(\.\d+)?/
on D.text as match
from Document D;

This bit of code basically looks for one or more digits, followed optionally by a period and at least one digit. If necessary, the definition can be made even more restrictive by requiring two digits after the decimal point. The Number definition serves as a new component in the arsenal for building increasingly complex definitions—a building block.
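For readers without an AQL environment, the same idea can be sketched in Python. This is only an analogue of the Number view, not AQL itself; the names NUMBER, NUMBER_TWO_DECIMALS, and find_numbers are hypothetical, and the stricter pattern is one possible reading of "requiring two digits after the decimal point."

```python
import re

# Same pattern as the AQL Number view: one or more digits,
# optionally followed by a period and at least one digit.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Stricter variant (an assumption for illustration): the number must have
# a decimal point followed by exactly two digits, as in 19.99.
NUMBER_TWO_DECIMALS = re.compile(r"\b\d+\.\d{2}\b")


def find_numbers(text, pattern=NUMBER):
    """Return every non-overlapping match of the pattern in the text."""
    return [m.group(0) for m in pattern.finditer(text)]


print(find_numbers("Reported revenue is $22.4 billion."))
# ['22.4']
print(find_numbers("Reported revenue is $22.4 billion.", NUMBER_TWO_DECIMALS))
# []
```

The stricter pattern rejects 22.4 because it has only one digit after the decimal point, which shows how tightening a building block changes what later definitions can match.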

A currency symbol next to a number defines it as currency, which leads to another important concept in text analytics: using building blocks to achieve the end goal of identifying a currency amount. Another definition, or building block, is created to identify a currency symbol. A currency amount is then defined as a currency symbol immediately followed by a number. An additional building block is likely required, depending on the kind of information to be extracted. Consider the following sentence: “Reported revenue is $22.4 billion.”

In this case, using an additional building block adds the billion qualification to the amount, which could also be “thousand,” “million,” “trillion,” and so on.
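The way these building blocks combine can be sketched in Python. This is a rough analogue rather than AQL; the symbol set, the list of magnitude words, and the names CURRENCY_SYMBOL, MAGNITUDE, AMOUNT, and find_amounts are assumptions made for illustration.

```python
import re

# Building block 1: a number (same pattern as the Number view).
NUMBER = r"\d+(?:\.\d+)?"
# Building block 2: a currency symbol (hypothetical symbol set).
CURRENCY_SYMBOL = r"[$€£]"
# Building block 3: a magnitude word that qualifies the number.
MAGNITUDE = r"(?:thousand|million|billion|trillion)"

# A currency amount is a symbol immediately followed by a number,
# optionally followed by whitespace and a magnitude word.
AMOUNT = re.compile(
    rf"{CURRENCY_SYMBOL}{NUMBER}(?:\s+{MAGNITUDE}\b)?",
    re.IGNORECASE,
)


def find_amounts(text):
    """Return each currency amount found in the text as one string."""
    return [m.group(0) for m in AMOUNT.finditer(text)]


print(find_amounts("Reported revenue is $22.4 billion."))
# ['$22.4 billion']
```

A bare number with no currency symbol in front of it, such as the 7 in “7 apples,” is not reported, because the amount exists only where the blocks sit next to each other.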

Using multiple building blocks together in this manner provides enhanced value because their proximity indicates something more specific than any single block, such as a dollar amount. However, analysts may not necessarily want to extract the result, $22.4 billion, as a single unit. To insert the result into a database, the string would have to be parsed again and converted to the appropriate format, because a relational database likely expects a numeric value for the revenue column. Analysts can save that conversion step by identifying the entire result and returning it as separate parts.
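Returning the parts separately might look like the following Python sketch, which uses named groups and converts the result to a numeric value suitable for a database column. It is an illustration under assumptions, not the article's AQL; the multiplier table and the names MULTIPLIERS, AMOUNT, and extract_amount_parts are hypothetical.

```python
import re
from decimal import Decimal

# Numeric multiplier for each magnitude word (assumed list).
MULTIPLIERS = {
    "thousand": Decimal(10) ** 3,
    "million": Decimal(10) ** 6,
    "billion": Decimal(10) ** 9,
    "trillion": Decimal(10) ** 12,
}

# Each building block is captured in its own named group.
AMOUNT = re.compile(
    r"(?P<symbol>[$€£])"
    r"(?P<number>\d+(?:\.\d+)?)"
    r"(?:\s+(?P<unit>thousand|million|billion|trillion)\b)?",
    re.IGNORECASE,
)


def extract_amount_parts(text):
    """Return one dict per amount, with its parts and a numeric value."""
    results = []
    for m in AMOUNT.finditer(text):
        unit = (m.group("unit") or "").lower()
        value = Decimal(m.group("number")) * MULTIPLIERS.get(unit, Decimal(1))
        results.append({
            "symbol": m.group("symbol"),
            "number": m.group("number"),
            "unit": unit or None,
            "value": value,  # ready for a numeric column
        })
    return results


parts = extract_amount_parts("Reported revenue is $22.4 billion.")
print(parts[0]["symbol"], parts[0]["number"], parts[0]["unit"])
# $ 22.4 billion
```

Decimal is used here instead of float so that values such as 22.4 are not subject to binary rounding before they reach the database.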

Getting back to the example, what does this dollar amount represent? Obviously, it is a measure of revenue. An analyst can again take advantage of the building blocks’ proximity and implement a specific match for “revenue is.” However, this approach may be shortsighted. What if other source documents contain sentences such as the following?

“Reported revenue increased to $22.4 billion.”
“Reported revenue improvement is now at $22.4 billion.”

The important word here is “revenue.” A few words may sit between it and the amount the analyst is looking for, and what those words are does not matter. A good text analytics tool should be able to handle this kind of matching, which can be done in AQL using the following expression:

create view revenue as
extract pattern <RI.match> <Token> {0, 4} <A.match> as match
from revenueIndicator RI, Amount A
consolidate on match using 'ContainedWithin';

The key to this expression is the use of <Token> {0, 4}. It indicates that zero to four tokens (roughly, words) can appear between the two building blocks, the revenue indicator and the amount. This capability allows for building general extractions that work on a set of documents instead of being specific to a single document.
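The proximity idea can be approximated in Python as well. This is a loose analogue, assuming whitespace-delimited tokens (AQL's tokenizer splits text differently, for example around punctuation); the names AMOUNT, REVENUE, and find_revenue are hypothetical.

```python
import re

# Amount building block: symbol, number, optional magnitude word.
AMOUNT = r"[$€£]\d+(?:\.\d+)?(?:\s+(?:thousand|million|billion|trillion)\b)?"

# The word "revenue", then zero to four whitespace-delimited tokens,
# then an amount. The gap is lazy so the nearest amount is taken.
REVENUE = re.compile(
    rf"\brevenue\b(?:\s+\S+){{0,4}}?\s+(?P<amount>{AMOUNT})",
    re.IGNORECASE,
)


def find_revenue(text):
    """Return amounts that appear within four tokens after 'revenue'."""
    return [m.group("amount") for m in REVENUE.finditer(text)]


for sentence in (
    "Reported revenue is $22.4 billion.",
    "Reported revenue increased to $22.4 billion.",
    "Reported revenue improvement is now at $22.4 billion.",
):
    print(find_revenue(sentence))
# ['$22.4 billion'] for each of the three sentences
```

A sentence with more than four tokens between “revenue” and the amount produces no match, which is the trade-off of a fixed gap: a wider gap catches more phrasings but also raises the risk of pairing the indicator with an unrelated amount.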

Building efficiency into text analytics

These techniques make a text analytics approach highly efficient. Using building blocks and general proximity definitions to match the information to look for helps increase the effectiveness of extractors. There is a lot more to know about text analytics. Analysts need to be aware of other challenges that exist before embarking on projects that employ text analytics approaches, and goals must be clearly defined. Look for upcoming articles that discuss these challenges and goals.

Please share any thoughts or questions in the comments.

*“Want to Glean Sentiment from Structured Social Media Data?” by Jacques Roy, IBM Data magazine, February 2015.