Enhancing Survey Data with Related Social Media

Mine valuable missing data from social media channels to augment survey data

In recent years, using Twitter to perform reputation analysis of an organization’s brand and products has become commonplace in many industries. This method utilizes elements such as Twitter hash tags and manually provided key words and phrases to extract mentions of brands or products that can inform decisions within organizations. However, still largely unexplored is a Twitter analysis that offers a more sophisticated analytical approach than an entity reputation analysis.

Those who are familiar with Twitter know that understanding what someone means in a tweet is often far more complex than matching text against a simple list of key phrases. Cognitive computing that can understand sentence structure, embedded phrase relations, synonymy, and sarcasm is essential to uncovering this meaning automatically from large quantities of tweets. Twitter has the potential to be used for automatic and ad hoc surveys by organizations to extract public opinion or demographic-based opinion that is of internal strategic interest.

This basic idea is far from new. Today, there are examples everywhere of organizations using social media, particularly Twitter, to engage with and live-poll their customers. Look no further than every major news station, or even SportsNation on ESPN, an entire show dedicated to this topic. However, casual surveys of the public are perhaps more intriguing to many organizations than the idea of engaging with their customer base for every market research data point. Achieving surveys of this type can be tricky because tracking general ideas and topics cannot be accomplished with a simple list of terms in the way that mentions of products, people, organizations, and locations can be tracked.


Turning to open data

To verify the technology necessary to complete this task on publicly available data, the IBM Big Data Stampede team turned to the New York City open data (NYC OpenData) collection.1 A large data set available in this repository is nonemergency municipal services data reported directly to the New York City 311 (NYC 311) agency.2 The team had noticed that NYC 311 had actually taken steps to engage with social media users to augment its reporting service with, for example, its @nyc311 Twitter handle. As such, the team thought that the idea of augmenting NYC 311 data with complaints found on Twitter could serve as a representative automatic survey technology proof of concept.

Over the course of two weeks, two million tweets from New York City residents were collected and ingested into an IBM® InfoSphere® BigInsights™ platform implementation for analysis. For simplicity, the study focused solely on heating complaints—the most prevalent category in the NYC 311 agency’s data over the two-week period.3


Mining tweets

The IBM team’s first instinct was to analyze these collected tweets the way many organizations typically start. The team searched the ingested Twitter data to find Twitter handle mentions and hash tags related to NYC 311 and heating services. Surprisingly, and quite discouragingly, nothing was found at first. Absolutely nothing had been reported to @nyc311, there were no #nyc311 or similar hash tags, and there was not even a hash tag remotely related to heating services.
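
The handle and hash tag search described above can be sketched in a few lines of Python. This is only an illustration, not the team's BigInsights query: the `#noheat` tag, the function names, and the sample tweets are all invented for the example. It also shows why this kind of search misses complaints that carry no explicit marker.

```python
import re

# Markers to track. "@nyc311" and "#nyc311" come from the article;
# "#noheat" is a hypothetical tag added only for illustration.
HANDLES = {"@nyc311"}
HASHTAGS = {"#nyc311", "#noheat"}

# Matches Twitter-style handles and hash tags such as @nyc311 or #nyc311.
TOKEN = re.compile(r"[@#]\w+")

def explicit_markers(text):
    """Return the tracked handles and hash tags that appear in a tweet."""
    tokens = {t.lower() for t in TOKEN.findall(text)}
    return tokens & (HANDLES | HASHTAGS)

def filter_by_markers(tweets):
    """Keep only tweets that contain at least one tracked marker."""
    return [t for t in tweets if explicit_markers(t)]

# Invented sample tweets, not taken from the collected data.
sample = [
    "No heat in my apartment again, third night in a row",
    "Great show at the park tonight #nyc",
    "@NYC311 the radiator in our building has been cold all week",
]
matches = filter_by_markers(sample)
```

In this toy sample only the third tweet is returned. The first tweet is clearly a heating complaint, yet it is invisible to a marker-based search because its author never addressed the agency.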

On some level, the team had little trouble justifying to itself why no tweets had been found. People simply don’t tweet in meaningful numbers about every topic. After all, wouldn’t it be out of place? People likely don’t have the same desire to stay engaged with NYC 311 services the way they do with their favorite television shows or brands.

As a result, the IBM team was ready to move beyond thinking about NYC 311 and Twitter in New York City. It could see people were tweeting in significant numbers about a lot of topics relevant to the city that were worth exploring.4 There was a wide assortment of discussions about topics ranging from the schools, to the parks, to the mayor, to concerts. But then the classic entrepreneur’s argument just kept creeping into the back of the collective mind of the IBM team. Sometimes there is only a handful of NYC 311 heating complaints in a day, and if just one in a million tweets is related to a heating complaint, Twitter can add some value here.


Digging deeper

The team determined it simply couldn’t rely on the usual suspects such as hash tags and Twitter handles to find heating complaints in tweets. Therefore, the team developed a machine-learning solution that would allow it to search through and really understand the tweets the team had collected. This system was put together simply by leveraging powerful tools provided in InfoSphere BigInsights and open source tools from a massive community of developers and researchers interested in analyzing social media and text.5

The team then made a second attempt to search the collection of tweets for heating complaints, this time with the aid of two very powerful tools. A conceptual fuzzy search application allowed the team to filter tweets to reveal only those that were most relevant to the idea of a heating complaint. Moreover, a Twitter natural language processing (NLP) annotation pipeline application allowed for further examination of the grammar, word relationships, and sentiment in the now-filtered tweets.
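
The article does not describe how the conceptual fuzzy search application works internally. As a rough stand-in, the sketch below scores each tweet by how many of its words approximately match a small concept vocabulary, using the standard library's `difflib.SequenceMatcher` to tolerate misspellings. The vocabulary, the 0.8 threshold, and the sample tweets are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical vocabulary for the idea of a "heating complaint".
CONCEPT_TERMS = ["heat", "heating", "heater", "radiator", "boiler", "freezing"]

def best_similarity(word, terms=CONCEPT_TERMS):
    """Highest similarity ratio (0.0 to 1.0) between a word and any concept term."""
    return max(SequenceMatcher(None, word, term).ratio() for term in terms)

def concept_score(text, threshold=0.8):
    """Count the words in a tweet that approximately match a concept term."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if best_similarity(w) >= threshold)

def fuzzy_filter(tweets, min_score=1):
    """Keep tweets whose concept score reaches min_score."""
    return [t for t in tweets if concept_score(t) >= min_score]

# Invented sample tweets; note the misspelled "radiater".
sample = [
    "radiater is stone cold and the landlord wont fix it",
    "Long line for the concert in the park tonight",
    "no heat again tonight",
]
relevant = fuzzy_filter(sample)
```

A character-level ratio like this produces false positives on near neighbors such as "heart" or "cheat", which is one reason a filter of this kind is better treated as a first pass that feeds a downstream classifier than as a final answer.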

Twitter is actually very difficult to analyze using machine learning. The brevity of tweets, the use of slang, the introduction of emoticons, odd punctuation and capitalization, misspellings, and even frequent sarcasm, among other things, result in some errors, even when utilizing state-of-the-art analysis techniques.

The key, very often, is not relying on a single point of failure. The hope here is that this error can be mitigated by mixing nonoverlapping sources of error as input in a final classifier application. The team trained a few different classifiers on the aforementioned features (grammar, word relationships, and sentiment) in addition to words and phrases, using a training corpus compiled from a large historical list of heating complaints embedded in a sea of unrelated tweets.
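
One way to picture mixing several imperfect signals is a single feature map per tweet that a final classifier can weigh jointly. The sketch below is an assumption about how such inputs might be combined, not the team's pipeline: simple word lists stand in for the real concept search and sentiment annotators, and every feature name is invented.

```python
import re

# Hypothetical word lists standing in for the real annotators.
CONCEPT_WORDS = {"heat", "heating", "radiator", "boiler"}
NEGATIVE_WORDS = {"cold", "freezing", "broken", "no", "never"}

def extract_features(text):
    """Combine word, phrase, concept, and sentiment signals into one feature map."""
    words = re.findall(r"[a-z']+", text.lower())
    features = {}
    # Individual words.
    for w in words:
        features["word=" + w] = 1.0
    # Two-word phrases (bigrams) capture a little word order.
    for first, second in zip(words, words[1:]):
        features["bigram=" + first + "_" + second] = 1.0
    # Coarse signals from independent sources, each with its own kind of error.
    features["concept_hits"] = float(sum(w in CONCEPT_WORDS for w in words))
    features["negative_hits"] = float(sum(w in NEGATIVE_WORDS for w in words))
    return features

example = extract_features("No heat again")
```

Because each signal fails in different ways, a classifier trained on the combined map can learn to discount any one of them when the others disagree.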

The team ultimately utilized the Stanford classifier, a Java implementation of a maximum entropy classifier, a model known for committing only to what the training data gives it reason to believe rather than fitting a preset formula.6 This classifier can be highly effective for this use case, and it was deployed on the set of two million tweets. As a result, the IBM team successfully found heating complaints in the set of tweets after all. In fact, combining machine learning and big data did indeed fulfill the prophecy; the team found the figurative needle in the haystack.
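
For readers unfamiliar with maximum entropy classification, the sketch below trains a tiny two-class model in pure Python. With two classes, a maximum entropy model is equivalent to logistic regression, which is what the code implements with stochastic gradient ascent. It does not use the Stanford classifier or its API; the training tweets, hyperparameters, and function names are all invented for illustration.

```python
import math
import re
from collections import defaultdict

def features(text):
    """Binary bag-of-words features; a real system would add richer signals."""
    return {"word=" + w: 1.0 for w in re.findall(r"[a-z']+", text.lower())}

def predict_proba(weights, feats):
    """Probability that a tweet is a heating complaint under the given weights."""
    z = weights.get("__bias__", 0.0)
    z += sum(weights.get(name, 0.0) * value for name, value in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp so math.exp stays well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, learning_rate=0.5, l2=0.01):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            error = label - predict_proba(weights, feats)
            weights["__bias__"] += learning_rate * error
            for name, value in feats.items():
                # L2 penalty is applied only to features active in this example.
                weights[name] += learning_rate * (error * value - l2 * weights[name])
    return weights

# Invented training tweets: 1 = heating complaint, 0 = unrelated.
training = [
    ("no heat in my apartment", 1),
    ("radiator is cold again", 1),
    ("landlord turned off the heat", 1),
    ("no heat no hot water", 1),
    ("great concert in the park", 0),
    ("the mayor spoke at the school", 0),
    ("train is late again", 0),
    ("hot coffee in my hand", 0),
]
model = train(training)
p_complaint = predict_proba(model, features("still no heat tonight"))
p_unrelated = predict_proba(model, features("concert at the park tonight"))
```

On this toy data the first probability lands above 0.5 and the second below it. A production system would need a far larger training corpus and a held-out evaluation set before the scores could be trusted.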


Expanding survey approaches

From a technology perspective, this work serves as a proof of concept for applying automatic surveying techniques to social media. Heating complaints were found completely automatically by an intelligent program sifting through a sea of data. For some days, the team actually found more than half as many novel complaints on Twitter as were reported the same day to the NYC 311 agency.

The word novel makes an important distinction because the team found that half of all heating complaints in tweets did have a corresponding complaint reported to NYC 311. This case demonstrates how using Twitter in this way can add even more value for an organization. By merging data across domains, InfoSphere BigInsights can be used to enhance the profile of a complaint with all kinds of information about the person who reported it.
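
The split between novel and already-reported complaints can be sketched as a simple join between the two data sets. The article does not say which fields the team matched on, so the `date` and `address` keys below are assumptions, and all records are invented placeholders.

```python
from collections import Counter

def split_novel(tweet_complaints, reported_311):
    """Separate tweet complaints into novel ones and ones already reported to 311.

    Matching on (date, address) is an assumption made for this sketch.
    """
    reported_keys = {(r["date"], r["address"]) for r in reported_311}
    novel, corroborated = [], []
    for complaint in tweet_complaints:
        key = (complaint["date"], complaint["address"])
        if key in reported_keys:
            corroborated.append(complaint)
        else:
            novel.append(complaint)
    return novel, corroborated

def novel_ratio_by_day(tweet_complaints, reported_311):
    """Novel tweet complaints per day, as a fraction of that day's 311 reports."""
    novel, _ = split_novel(tweet_complaints, reported_311)
    novel_per_day = Counter(c["date"] for c in novel)
    reported_per_day = Counter(r["date"] for r in reported_311)
    return {day: novel_per_day[day] / count
            for day, count in reported_per_day.items()}

# Invented placeholder records.
reported = [
    {"date": "2014-01-06", "address": "10 Example St"},
    {"date": "2014-01-06", "address": "22 Sample Ave"},
    {"date": "2014-01-07", "address": "10 Example St"},
]
tweets = [
    {"date": "2014-01-06", "address": "10 Example St", "text": "no heat again"},
    {"date": "2014-01-06", "address": "5 Placeholder Rd", "text": "radiator dead"},
]
novel, corroborated = split_novel(tweets, reported)
ratios = novel_ratio_by_day(tweets, reported)
```

Corroborated tweets are the ones that can enrich an existing 311 record, while novel tweets represent complaints the agency would otherwise never see.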

So, what conclusions can the team draw from this work? Is direct surveying a tool of the past? Should organizations get rid of their survey infrastructure and completely replace it with automatic social media analysis?

Without giving in to a “big data is everything for everybody” premise, the team acknowledges that, in reality, most questions cannot be surveyed through social media. People generally just don’t talk about these topics on Twitter in a meaningful way. However, the IBM team at least showcased the concept of extracting implicit, very rare information from Twitter, and the end users made no attempt at all to participate in this survey.


Exploring social media possibilities

While organizations will not necessarily find the answers to all their questions in social media today, there are many unexplored questions that can be answered. But there is one catch: powerful software and domain expertise are required to make it happen. Organizations interested in getting involved with big data or social media that just need some guidance to get started are not alone. Integrated offerings such as IBM Big Data Stampede™ expertise7 and IBM Watson Foundations provide organizations an accelerated path toward a big data initiative such as the one discussed here.


1 NYC OpenData, official lists of NYC data sets, City of New York website.
2 NYC 311 agency official website.
3 “Turning Up the Heat on Big Data,” by Sourav Mazumder, IBM Data magazine, June 2014.
4 “Keeping the Trains Running On Time,” by Sourav Mazumder, Matthew Riemer, and Boris Vishnevsky, IBM Data magazine, May 2014.
5 For a technical overview of the system, see “Applying Machine Learning to Text Analytics,” by Matthew Riemer, IBM Data magazine, June 2014.
6 The Stanford Natural Language Processing Group, Stanford Classifier information website.
7 For more information on guidance to get started with a machine-learning project, visit the IBM Big Data Stampede site.

