Want to Glean Sentiment from Unstructured Social Media Data?

Apply text analytics to streaming, real-time data and data at rest to derive valuable insight

Executive IT Specialist, Competitive and Product Strategy, IBM Analytics, IBM

In this burgeoning era of big data, a substantial majority of all the data is unstructured. Much of this unstructured data is textual, such as the data in reports, articles, emails, tweets, and even conversations or support calls recorded in textual transcripts. Because some of this information is perishable, the capability to process it quickly—in many cases, in real time or near-real time—is becoming quite important to enterprises. This processing requires text analytics capabilities.

What does the capability to perform text analytics mean? One simple example is processing a social media feed, such as Twitter, to extract any tweet that mentions a specific term such as a company name or a product. This simple keyword-matching approach can provide a quick glimpse into how present a product name is in the public’s mind. Further, the metadata that accompanies each tweet makes it possible to break this public perception down by region.

By matching this information with the release of a marketing campaign, for example, an organization can gain insight into the campaign’s estimated effectiveness. This simple, keyword-matching approach combines unstructured data—a tweet—with structured data—the Twitter feed’s metadata—to generate structured data that can be analyzed for deriving insights an organization can use to grow a business or enhance business strategy.
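The keyword-matching idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the sample tweets, the `region` field standing in for feed metadata, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample tweets: "text" is the unstructured part,
# "region" stands in for structured metadata from the feed.
TWEETS = [
    {"text": "Loving my new Acme phone!", "region": "US"},
    {"text": "Traffic is terrible today", "region": "US"},
    {"text": "Is the ACME phone worth it?", "region": "UK"},
    {"text": "acme support was slow", "region": "UK"},
]

def mentions(text, keyword):
    """Return True when keyword appears in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def mentions_by_region(tweets, keyword):
    """Count tweets that mention keyword, grouped by the region metadata."""
    return Counter(t["region"] for t in tweets if mentions(t["text"], keyword))

counts = mentions_by_region(TWEETS, "acme")
print(dict(counts))
```

The result is a small structured table (region, mention count) derived from unstructured text plus metadata, which is the kind of output that can then be compared against campaign dates.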

Analyzing streaming real-time data or data at rest

In this tweet keyword-matching example, the capability to process tweets in real time makes perfect sense because the example offers a typical use case for data exhaust processing. Data exhaust is the by-product data generated by online activities such as clicking links and selecting options. Out of the millions of tweets being generated, only a small fraction fits the requirements necessary for analysis. Therefore, eliminating a large percentage of tweets as they are received can greatly reduce processing and storage requirements further down the pipeline.
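Filtering at the point of arrival can be illustrated with a Python generator, which handles one tweet at a time and never stores the ones it rejects. This is only a sketch: `simulated_feed` is a hypothetical stand-in for a live feed, and the plain substring test is a simplification that would also match a keyword embedded inside a longer word.

```python
def filter_stream(tweets, keywords):
    """Yield only tweets whose text contains one of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for tweet in tweets:
        text = tweet["text"].lower()
        if any(k in text for k in lowered):
            yield tweet  # everything else is dropped immediately

def simulated_feed():
    """Hypothetical stand-in for a live feed; a real source would be unbounded."""
    yield {"id": 1, "text": "Acme phone battery lasts all day"}
    yield {"id": 2, "text": "Lunch was great"}
    yield {"id": 3, "text": "Watching the game tonight"}
    yield {"id": 4, "text": "Thinking about an ACME tablet"}
    yield {"id": 5, "text": "Monday again"}

kept = list(filter_stream(simulated_feed(), ["acme"]))
kept_ids = [t["id"] for t in kept]
print(kept_ids)
```

Only the tweets that pass the filter reach later stages, so downstream storage and processing scale with the matching fraction rather than with the full feed.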

The keyword-matching example presented here also demonstrates the need for a hybrid approach: analyzing both streaming, real-time data and stored data at rest for further historical analysis. Analysis of streaming, real-time data can go further and provide quick feedback for actions to be taken. Additional analysis, correlation, and aggregation can be applied to stored data at rest. For example, an organization can find out how many people talked about its products, evaluate its reach based on the number of followers and those who retweet, and derive other insights.
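As a rough illustration of the at-rest side, the following Python sketch aggregates tweets that were already matched and stored. The record fields and the reach formula (summing the followers of each distinct author once) are assumptions chosen for the example, not a standard definition of reach.

```python
# Hypothetical stored records for tweets that already matched the keyword filter.
STORED = [
    {"user": "ann", "followers": 120, "retweets": 3},
    {"user": "bob", "followers": 45, "retweets": 0},
    {"user": "ann", "followers": 120, "retweets": 1},
    {"user": "cho", "followers": 900, "retweets": 10},
]

def summarize(tweets):
    """Aggregate mention count, distinct authors, follower reach, and retweets."""
    followers_by_user = {}
    retweets = 0
    for t in tweets:
        # Keep one follower count per author so repeat posters are not double counted.
        followers_by_user[t["user"]] = t["followers"]
        retweets += t["retweets"]
    return {
        "mentions": len(tweets),
        "people": len(followers_by_user),
        "follower_reach": sum(followers_by_user.values()),
        "retweets": retweets,
    }

summary = summarize(STORED)
print(summary)
```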

Performing a complex extraction of modifiers

The widely cited quotation, “I don’t care what the newspapers say about me as long as they spell my name right,” may be a true sentiment for some celebrities or public figures. After all, there is no such thing as bad publicity when it comes to reinforcing name recognition. But in the tweet keyword-matching example, knowing whether people feel positive or negative toward a product is probably a good idea for many enterprises. Again, part of the analysis can be done in real time to provide instant feedback for enhanced brand management, but a more in-depth analysis can be made on historical data for time periods that could be as short or as long as required.

And the analysis can be taken one step further by creating two separate lists: one for positive adjectives and one for negative adjectives. These lists enable adding a sentiment indicator that can be, for example, positive, negative, or neutral. Lists of adjectives can easily be found through a web-based search and serve as a starting point.
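A sentiment indicator built from two adjective lists might look like the following Python sketch. The word lists are tiny, hypothetical placeholders, and the rule of comparing match counts is one simple choice among many; it does not handle negation such as "not great".

```python
import re

# Hypothetical starter lists; real lists would be much longer.
POSITIVE = {"great", "excellent", "fast", "reliable"}
NEGATIVE = {"terrible", "slow", "broken", "awful"}

def sentiment(text):
    """Label text positive, negative, or neutral by counting adjective matches."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no matches, or an equal number of each

for sample in [
    "The new phone is GREAT and fast",
    "Support was slow and the screen is broken",
    "Bought a phone today",
]:
    print(sample, "->", sentiment(sample))
```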

Searching for a match from a list of adjectives or other words may sound trivial, but the task can require quite a bit of work when starting from scratch. Having the proper tool is vital to be able to quickly put a solution in place. For example, the Annotation Query Language (AQL) available in IBM® InfoSphere® Streams streaming analytics and IBM InfoSphere BigInsights® analytics software provides constructs for dictionaries. Matching keywords with such a construct can be performed with the following code:

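-- Load the list of positive adjectives from a dictionary file
create dictionary PositiveAdjs
from file 'dicts/PositiveAdj.dict'
with language as 'en';

-- Match dictionary entries against the document text, ignoring case
create view positive as
extract dictionary 'PositiveAdjs'
with flags 'IgnoreCase'
on R.text as match
from Document R;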

The create dictionary statement identifies a list of words that exists in a file. The create view statement performs case-insensitive matching of those words against the document text. With this construct, changing, adding, or removing adjectives requires only a minor amount of work: editing the dictionary file. As a result, an enterprise can be highly agile by responding quickly to changing needs.
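For readers without access to AQL, the same idea of a word list kept separately from the matching logic can be approximated in Python. This is a loose analogue under stated assumptions, not a reimplementation of AQL: the one-entry-per-line format and the function names are hypothetical.

```python
import re

def load_dictionary(lines):
    """Build a set of lowercase entries from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}

def extract_matches(text, entries):
    """Return (start, end, matched_text) for each case-insensitive whole-word match."""
    if not entries:
        return []
    # Try longer entries first so multiword entries are not cut short.
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

# Stand-in for the lines of a dictionary file (one entry per line is an assumption).
dict_lines = ["great\n", "excellent\n", "\n", "easy to use\n"]
entries = load_dictionary(dict_lines)
matches = extract_matches("Great screen, and it is Easy To Use.", entries)
print(matches)
```

In practice the lines would come from a file, for example by passing an open file object to `load_dictionary`, so that changing the word list never requires touching the matching code.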

Moving toward advanced text analytics

While the keyword-matching example demonstrated here offers a compelling look at how combining structured and unstructured data from a social media source can yield insight, it barely scratches the surface of text analytics. Nevertheless, enterprises can apply simple text analysis to generate structured data from which valuable insight can be derived to enhance business strategies. And a balance of real-time and at-rest data analysis is an important component when implementing text analytics in many enterprises. Other characteristics to explore for text analytics include the domain-specific nature of text analytics, proximity matching of keywords, lemmatization and stemming, parts of speech, and precision of results. Look for upcoming articles that take a deep dive into these aspects of text analytics.
