What is text analytics?

Making the complex simple

Managing Director, Intelligent Business Strategies Limited

Text analytics, also known as text mining, is about deriving high-quality structured data from unstructured text. A good reason for using text analytics might be to extract additional data about customers from unstructured data sources to enrich customer master data, to produce new customer insight or to determine sentiment about products and services. Several text analytics use cases exist:

  • Case management—for example, insurance claims assessment, healthcare patient records and crime-related interviews and reports
  • Competitor analysis
  • Fault management and field-service optimization
  • Legal ediscovery in litigation cases
  • Media coverage analysis
  • Pharmaceutical drug trial improvement
  • Sentiment analytics
  • Voice of the customer

Entity extraction, the parsing and extracting of entities from raw text, is a key part of text analytics. Commonly extracted entity types include the following:

  • Company names
  • Dates and times
  • Domain-specific names such as names of diseases in pharmaceutical data
  • Monetary amounts
  • People’s names and social network handles
  • Phrases, negative or positive
  • Product names 

In many cases, entity extraction can be turned into automated entity recognition, in which text is parsed and well-understood entities are automatically selected from the text by the software. This requirement is common. In the previous list of entity types, recognizing dates, times, monetary amounts and so on may be something that text analytics software can do out of the box, without you having to help it figure out what these entities are. The benefits of this approach are threefold: the time necessary to do entity extraction is drastically reduced, the extraction can be done at scale across large volumes of text, and the result is structured data that can be merged with enterprise data for further analysis.
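As a rough illustration of what out-of-the-box recognition of well-understood entities involves, the following minimal sketch pulls dates, monetary amounts and social network handles out of raw text with regular expressions. The patterns, the `PATTERNS` table and the `extract_entities` function are assumptions made for this example; commercial text analytics software uses far more robust recognizers.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not a product's rules).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),             # ISO dates such as 2015-03-14
    "money": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d{2})?"),  # amounts such as $1,250.00
    "handle": re.compile(r"(?<!\w)@\w+"),                     # handles such as @jane_doe
}


def extract_entities(text):
    """Return (entity_type, matched_text) tuples in order of appearance."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), entity_type, match.group()))
    found.sort()  # order by position in the text
    return [(entity_type, value) for _, entity_type, value in found]


sample = "On 2015-03-14 @jane_doe paid $1,250.00 for the repair."
for entity_type, value in extract_entities(sample):
    print(entity_type, value)
```

The output is a list of typed values, which is the kind of structured data that can then be joined to enterprise records.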

Proving its mettle

Widely used data sources for text analytics include social networks—Facebook, LinkedIn and Twitter—along with internal email, inbound customer email, news articles, online discussion forums and customer relationship management (CRM) customer service notes. Other widely used sources include review websites such as TripAdvisor, documents such as PDF files and online forms such as applications containing text or forms containing structured data stored as text. 

Text analytics has been around for many years, and with as much as 80 percent of data in enterprises now in unstructured form, it is proving its worth very quickly. A well-understood process for text analytics includes the following steps:

  1. Extracting raw text
  2. Tokenizing the text—that is, breaking it down into words and phrases
  3. Detecting term boundaries
  4. Detecting sentence boundaries
  5. Tagging parts of speech—words such as nouns and verbs
  6. Tagging named entities so that they are identified—for example, a person, a company, a place, a gene, a disease, a product and so on
  7. Parsing—for example, extracting facts and entities from the tagged text
  8. Extracting knowledge to understand concepts such as a personal injury within an accident claim 
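The early steps of this process can be sketched with nothing more than the Python standard library. The example below covers only sentence boundary detection and tokenization, and it applies them in the order most convenient for the code. The function names and the splitting rules are assumptions for illustration. Part-of-speech tagging, named entity tagging and parsing normally rely on trained models in a dedicated library and are not shown.

```python
import re


def split_sentences(text):
    """Naive sentence boundary detection: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Break a sentence into lowercase word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", sentence.lower())


def analyze(text):
    """Return one list of tokens per detected sentence."""
    return [tokenize(sentence) for sentence in split_sentences(text)]


print(analyze("The driver wasn't hurt. The car was damaged!"))
```

A rule this simple will mis-split on abbreviations such as "Dr." or "Inc.", which is one reason production pipelines use trained boundary detectors.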

Text analytics can also be combined with other advanced analytics. For example, it can be combined with other data and with machine learning to predict sentiment trends, such as whether negative sentiment is likely to increase or which people are likely to tweet negative sentiment based on their profiles. Text analytics can also be combined with graph analysis, whereby people, places, activities and things are extracted from text using entity extraction and loaded into a graph. Graph analysis can then uncover relationships you weren’t previously aware of.
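One simple way to feed extracted entities into a graph is to link entities that appear in the same document and count how often each link occurs. The sketch below assumes the entities have already been extracted. The function names and sample data are hypothetical, and a real deployment would more likely load the edges into a graph database or graph library.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """documents: one list of extracted entities per document.

    Returns {entity: {neighbor: number of documents mentioning both}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for entities in documents:
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def neighbors_by_weight(graph, entity):
    """List an entity's neighbors, strongest link first; ties broken alphabetically."""
    links = graph.get(entity, {})
    return sorted(links.items(), key=lambda item: (-item[1], item[0]))


docs = [["Alice", "Acme", "London"], ["Alice", "Acme"], ["Bob", "London"]]
graph = build_cooccurrence_graph(docs)
print(neighbors_by_weight(graph, "Alice"))
```

Here "Alice" and "Bob" never appear together, yet both are linked to "London", which is the kind of indirect relationship graph analysis can surface.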

Probably one of the most widely used forms of text analytics is sentiment analysis. Inbound customer email and social network data such as tweets may help determine positive, negative or neutral sentiment about your products and brands. Sentiment analysis is the process of determining a sentiment score from text. Companies want to determine these scores to be able to respond quickly to negative sentiment to minimize its impact and to make sure customer satisfaction and loyalty are constantly being improved. They also want to protect brands. Product managers want to use sentiment analysis to understand any problems with newly released products and services, so they can fix them quickly. In addition, finance departments want to take sentiment into account when doing financial planning—something that may be surprising to many.
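To make the idea of a sentiment score concrete, here is a minimal lexicon-based sketch. The word weights in `LEXICON`, the `NEGATIONS` set and the one-word negation rule are all assumptions chosen for illustration. Real sentiment engines use much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicon: positive words score above zero, negative below.
LEXICON = {"love": 2, "great": 1, "good": 1, "slow": -1, "bad": -1, "broken": -2, "hate": -2}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon weights, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score


def label(score):
    """Map a numeric score to positive, negative or neutral."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for message in ["I love this phone, great battery",
                "The screen is broken and support is slow",
                "It arrived on Tuesday"]:
    print(label(sentiment_score(message)), "-", message)
```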

Negotiating hurdles

All kinds of challenges are evident when using text analytics to analyze sentiment. For example, people tweet in multiple languages. Data quality can be poor. Many people also use emoticons, and different generations often use their own vocabulary even within a single language. Slang is an obvious challenge, and it creates ambiguity. For example, a young person may tweet, “This new phone is sick!” and mean it as praise rather than as a complaint.
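One common way to cope with emoticons and slang is a normalization pass that rewrites them into plain vocabulary before any scoring takes place. The lookup tables and the `normalize` function below are hypothetical and deliberately tiny. Note that mapping "sick" to "excellent" is itself an assumption that suits consumer product chatter and would be wrong for healthcare text.

```python
# Hypothetical lookup tables for this sketch; real ones are much larger and domain specific.
EMOTICONS = {
    ":)": "positive_emoticon",
    ":-)": "positive_emoticon",
    ":(": "negative_emoticon",
    ":-(": "negative_emoticon",
}
SLANG = {"sick": "excellent", "gr8": "great", "meh": "mediocre"}


def normalize(text):
    """Replace emoticons and slang with plain tokens; lowercase and trim punctuation."""
    tokens = []
    for raw in text.split():
        if raw in EMOTICONS:  # check emoticons before punctuation is stripped
            tokens.append(EMOTICONS[raw])
            continue
        word = raw.lower().strip(".,!?\"'")
        if word:
            tokens.append(SLANG.get(word, word))
    return tokens


print(normalize("This new phone is sick! :)"))
```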

The good news is that text analytics is now mature enough to handle many of these characteristics. Ultimately, sentiment analysis is about opinions that may be transient and therefore have short-lived impact. An opinion consists of a number of elements, including the name of a product or brand, perhaps a part of a product, the sentiment itself and even intent or desire, as in “I really want to get that new Star Wars DVD.” Other elements include the opinion holder and the time at which the opinion was expressed, such as the time of a tweet.
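The elements of an opinion described above map naturally onto a small record type. The following sketch is one possible shape, not a standard schema. The field names, the `is_stale` helper and the 48-hour default are assumptions for illustration, with the staleness check reflecting the point that opinions can be transient.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Opinion:
    holder: str                    # who expressed the opinion
    target: str                    # product or brand name
    sentiment: str                 # "positive", "negative" or "neutral"
    expressed_at: datetime         # for example, the time of the tweet
    aspect: Optional[str] = None   # part of the product, if one is mentioned
    intent: Optional[str] = None   # for example, "purchase"

    def is_stale(self, now, max_age_hours=48):
        """Treat the opinion as transient: stale once older than max_age_hours."""
        age_seconds = (now - self.expressed_at).total_seconds()
        return age_seconds > max_age_hours * 3600


opinion = Opinion(
    holder="@fan",  # hypothetical handle
    target="Star Wars DVD",
    sentiment="positive",
    expressed_at=datetime(2015, 12, 1, 9, 0),
    intent="purchase",
)
print(opinion.is_stale(now=datetime(2015, 12, 5, 9, 0)))
```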

New visualizations make it easier to understand sentiment in a huge amount of text. Word clouds, for example, let you see how frequently words appear in a given corpus of text. Word trees and phrase nets are other examples. The former allow you to choose a word or phrase and show you all the different contexts in which it appears; the latter show the relationships between different words used in a text.
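The data behind these visualizations is straightforward to compute. The sketch below counts word frequencies, which is what a word cloud sizes its words by, and collects the words that follow a chosen keyword, which is the raw material for a word tree. The stopword list, function names and window size are assumptions for illustration, and this naive version ignores sentence boundaries.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are longer).
STOPWORDS = {"the", "a", "is", "and", "it", "to", "of", "but"}


def word_frequencies(corpus):
    """Count non-stopword tokens: the numbers a word cloud would be drawn from."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return Counter(word for word in words if word not in STOPWORDS)


def contexts(corpus, keyword, window=2):
    """Collect the `window` words that follow each occurrence of `keyword`."""
    words = re.findall(r"[a-z']+", corpus.lower())
    hits = []
    for i, word in enumerate(words):
        if word == keyword:
            hits.append(" ".join(words[i + 1:i + 1 + window]))
    return hits


corpus = "The battery is great. The battery drains fast. Great screen."
print(word_frequencies(corpus).most_common(3))
print(contexts(corpus, "battery"))
```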

Text analysis can be a computationally intensive process involving complex character-level operations such as pattern matching. Scalability therefore matters for large volumes of text, and it is a major reason why people use platforms such as Apache Hadoop to do batch text analysis. However, you can also do the analysis in real time using streaming text analytics, for example to react to sentiment as it is expressed or to detect fraud as people are filling out online application forms.
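As a small illustration of the streaming style of analysis, the sketch below consumes sentiment scores one at a time and raises an alert whenever the average of the most recent scores drops to a threshold. The function name, window size and threshold are assumptions for this example. It runs in a single process and stands in for what a streaming platform would do across many machines.

```python
from collections import deque


def rolling_alerts(scores, window=3, threshold=-1.0):
    """Yield (index, mean) whenever the mean of the last `window` scores is <= threshold.

    `scores` can be any iterable, including a generator fed by a live stream.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` scores
    for index, score in enumerate(scores):
        recent.append(score)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean <= threshold:
                yield index, mean


incoming = [1, 0, -2, -2, -1, 2]  # hypothetical per-message sentiment scores
for index, mean in rolling_alerts(incoming):
    print(f"alert at message {index}: rolling mean {mean:.2f}")
```

Because only the last few scores are held in memory, the same logic works on an unbounded stream.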

Capitalizing on many flavors

Today, you can find text analytics available on the cloud, on-premises in stand-alone text analytics offerings and at scale on Hadoop. IBM offers text analytics as part of the IBM BigInsights Data Science module. This solution includes web-based information extraction tooling that generates code in Annotation Query Language (AQL) to do the analysis. IBM also includes application templates in BigInsights to get you up and running quickly with sentiment analysis.

Text analysis can also be done using search, which allows for indexing raw text and launching exploratory queries on that data to find content of interest. Connecting self-service business intelligence (BI) tools to search engines lets you use search-based text analytics from within BI tools and visualize the results in dashboards. Text analytics represents an exciting area that can generate significant business value.
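The search-based approach mentioned above rests on an inverted index, which maps each term to the documents that contain it. The following minimal sketch builds such an index in memory and answers queries in which every term must match. The function names and sample documents are hypothetical, and a real search engine adds ranking, stemming and much more.

```python
import re
from collections import defaultdict


def build_index(documents):
    """documents: {doc_id: raw text}. Returns {term: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


notes = {
    1: "Claim for water damage",
    2: "Water pump fault reported",
    3: "Damage to pump housing",
}
index = build_index(notes)
print(sorted(search(index, "pump")))
```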