Why is analyzing text so hard?
I selected text analytics for my presentation at Strata this year because of the confusion I hear from customers who are trying to understand the myriad of overlapping technologies in this space. Once one gets past the simple case of identifying specific words within the text, one needs lexical and linguistic skills to gain deeper insights. The usual response leans towards “Why is analyzing text so hard!?”
Here’s the challenge: How can we take documents, email or, say, social media, and show what they represent? For example, how can one readily pull out of the IBM annual report the asset values changing over time, or what topics are covered and how they are related?
Text analysis is a multi-step process
From a technical perspective, the end goal is to transform text into rows and columns that can be related to data from other sources. The process generally requires two distinct steps, each of which may be iterative.
The first step when analyzing text is to recognize what’s useful and what’s not. The typical tasks are:
- Detect terms based on data types, specific words and linguistic context
- Classify and organize terms, often leveraging dictionaries or ontologies
- Describe these terms, ranging from simple rating systems to complex statistical methods
These tasks are typically referred to as information extraction. Technologies such as natural language processing (NLP), named entity recognition and sentiment analysis are used to extract the quantitative and qualitative elements from a text source.
For example, consider this sentence: “Software revenue of $25,932 million increased 1.9 percent as reported and 3 percent adjusted for currency in 2013 compared to 2012.”
When the sentence is analyzed, specific words such as “million” are identified (here, as the units of the number preceding it). Contextual clues such as “increased” determine whether the value “1.9 percent” is positive or negative. Finally, a rating of “Modest” is assigned to Growth Rating because the value falls between one percent and three percent.
The fields or columns extracted from this one sentence might look like this:
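One way to picture those fields is as the output of a small extraction routine. The sketch below uses a single regular expression in Python and is only an illustration, not the method any particular product uses. The field names (`metric`, `amount_usd`, `change_pct` and so on), the unit and direction lookup tables, and the "Unrated" fallback are assumptions; only the "Modest" band of one to three percent comes from the example above.

```python
import re

SENTENCE = (
    "Software revenue of $25,932 million increased 1.9 percent as reported "
    "and 3 percent adjusted for currency in 2013 compared to 2012."
)

# Assumed lookup tables for units and for the sign implied by the verb.
UNIT_MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
DIRECTION_SIGNS = {"increased": 1, "decreased": -1}

# Hypothetical pattern tailored to sentences shaped like the example.
PATTERN = re.compile(
    r"(?P<metric>[A-Z][\w ]*?) of \$(?P<amount>[\d,]+) "
    r"(?P<unit>thousand|million|billion) "
    r"(?P<direction>increased|decreased) (?P<change>\d+(?:\.\d+)?) percent"
    r".*? in (?P<year>\d{4}) compared to (?P<base_year>\d{4})"
)


def growth_rating(change_pct):
    """Only the 'Modest' band is from the article; 'Unrated' is a placeholder."""
    if 1.0 <= change_pct <= 3.0:
        return "Modest"
    return "Unrated"


def extract_fields(sentence):
    """Return one row of fields for a matching sentence, or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    amount = int(match["amount"].replace(",", "")) * UNIT_MULTIPLIERS[match["unit"]]
    change = DIRECTION_SIGNS[match["direction"]] * float(match["change"])
    return {
        "metric": match["metric"],
        "amount_usd": amount,
        "change_pct": change,
        "year": int(match["year"]),
        "base_year": int(match["base_year"]),
        "growth_rating": growth_rating(change),
    }


fields = extract_fields(SENTENCE)
```

A single pattern like this breaks as soon as the wording changes, which is why practical extraction relies on dictionaries and linguistic context rather than one hand-written rule.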
If you’re thinking “Hey, that information extraction stuff was really useful! Isn’t that text analytics?” then you’re not alone.
Information extraction is only the first step, though. It is applied to the text contained in a single field, file or document. When the analysis requires scoring or ranking, establishing trends or other processing across multiple sources, the outputs from information extraction become the inputs to a second step of further processing.
Statistical or predictive models, search engines and entity analytics are a few examples of the technologies leveraging information extraction outputs in order to infer patterns and relationships. Applications include detecting identity theft, correlating symptoms to patient readmission or assessing customer satisfaction during a call to customer support.
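To make the hand-off between the two steps concrete, here is a minimal Python sketch that takes rows produced by information extraction from several documents and derives a simple year-over-year trend. The row layout, the source names and every value are invented for illustration, and a real second step would more likely be a statistical model or search index than a loop.

```python
# Hypothetical extraction output, one row per source document.
# All names and amounts below are invented placeholders.
extracted_rows = [
    {"source": "report_a", "metric": "revenue", "year": 2011, "amount": 100.0},
    {"source": "report_b", "metric": "revenue", "year": 2012, "amount": 110.0},
    {"source": "report_c", "metric": "revenue", "year": 2013, "amount": 121.0},
]


def yearly_trend(rows, metric):
    """Return (year, amount, percent change vs. previous year) for one metric."""
    by_year = {row["year"]: row["amount"] for row in rows if row["metric"] == metric}
    trend = []
    previous = None
    for year in sorted(by_year):
        amount = by_year[year]
        if previous is None:
            change = None  # no earlier year to compare against
        else:
            change = round((amount - previous) / previous * 100, 1)
        trend.append((year, amount, change))
        previous = amount
    return trend


trend = yearly_trend(extracted_rows, "revenue")
```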
Slang, sarcasm and culture
Language adapts over time, between geographies and with popular culture. Here are two common challenges in this category:
- Same word, different meanings: My teens might describe a good concert as “sick,” whereas I tend to use “sick” in a more conventional manner.
- Same meaning, different words: Australians, Britons and Americans use the terms “daks,” “trousers” and “pants” respectively to refer to the same piece of clothing. A note for my fellow North Americans: Do not use “pants” in the U.K. You will be treated to some sniggering if you do.
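Both challenges come down to the fact that a word alone is not enough; the analysis also needs to know who is speaking. A minimal Python sketch of that idea follows. The lookup tables, the region codes and the "slang" and "standard" register labels are assumptions made for illustration; a real system would draw on a curated lexicon or ontology.

```python
# Hypothetical tables keyed by context as well as by word.
REGIONAL_TERMS = {
    ("AU", "daks"): "trousers",
    ("UK", "trousers"): "trousers",
    ("US", "pants"): "trousers",
}

WORD_SENSES = {
    ("sick", "slang"): "excellent",
    ("sick", "standard"): "ill",
}


def normalize_term(word, region):
    """Map a regional word to a shared term; unknown words pass through."""
    key = word.lower()
    return REGIONAL_TERMS.get((region, key), key)


def resolve_sense(word, register):
    """Pick a meaning for an ambiguous word given the speaker's register."""
    key = word.lower()
    return WORD_SENSES.get((key, register), key)
```

For example, `normalize_term("Daks", "AU")` and `normalize_term("pants", "US")` both return `"trousers"`, while `resolve_sense("sick", "slang")` returns `"excellent"`.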
Analyzing text in one language is hard enough; analyzing text in many languages multiplies the challenges. For example, the name “John” in English is “Jean” in French and “Juan” in Spanish. A person may be known by all three names, depending on where he happens to be and who he is with.
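A naive way to handle name variants is to group mentions under one canonical form, as in the Python sketch below. The variant table is a hypothetical stand-in for a real multilingual name resource, and the approach is deliberately simplistic: it ignores the fact that the same spelling can be a different name in another language, so shared variants alone do not prove two mentions refer to the same person.

```python
from collections import defaultdict

# Hypothetical variant table: each spelling maps to one canonical key.
NAME_VARIANTS = {"john": "john", "jean": "john", "juan": "john"}


def group_mentions(mentions):
    """Group name mentions by canonical form, keeping the original spellings."""
    groups = defaultdict(list)
    for mention in mentions:
        key = mention.casefold()
        groups[NAME_VARIANTS.get(key, key)].append(mention)
    return dict(groups)


grouped = group_mentions(["John", "Jean", "Juan", "Maria"])
```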
Want to learn more?
The upcoming Strata + Hadoop World conference in New York City on October 15 through 17 is a great place to learn more about how to address the challenges of analyzing text. I hope to see you at my own session Extending "Variety" of Data to "Variety" of Users on Friday, October 17 at 11:50 a.m. ET.
Read more about IBM client successes with text analytics:
- Hamilton County Department of Education uses text analytics to translate the subtleties embedded in the art of teaching teachers into measurable behavioral data that it can tie to concrete academic results. Advanced statistical models are uncovering which teaching approaches—in areas such as classroom management and planning—correlate most closely with improvements in student performance.
- Security First Insurance experiences a massive amount of communication after a natural disaster as customers use company Facebook pages, Twitter accounts, email and call centers in hopes of reaching an employee. Text analytics sifts through the communication to detect significant property damage and stress as input to determining how best to address the customer’s needs.