Applying Machine Learning to Text Analytics

See how InfoSphere BigInsights was used to derive meaning from social media to augment NYC 311 data

As the value of big data becomes increasingly apparent, many organizations are expected to build custom applications with machine-learning capabilities. The business value of these applications arises both from the discovery of insights and from the deployment of application logic. A big data platform therefore needs to offer flexibility in both phases, so that organizations can customize their capabilities, and comprehensive integration support, so that favorable outcomes are easier to reach.

The IBM® InfoSphere® BigInsights™ platform provides simple integration with powerful open source projects and an easy-to-use framework for rapid development and deployment of advanced machine-learning models. This article highlights several integration capabilities in InfoSphere BigInsights that can help jump-start leading-edge text analysis. The platform offers a wide variety of tools for analyzing text. In particular, the Social Data Accelerator (SDA), Machine Data Accelerator (MDA), Annotation Query Language (AQL), BigSheets, Lucene, Jaql, and Big R tools provide a broad range of advanced text analytics capabilities.

However, there are still many scenarios in which organizations can find increased value by extending the capabilities of InfoSphere BigInsights to create customized machine-learning applications. For example, organizations can augment their text analytics capabilities for fine-grained analysis of tweet semantics, which in turn can substantially increase the accuracy of the organization’s machine-learning models that analyze tweets.

These models can be built for various applications that analyze Twitter feeds, such as the work by the IBM Big Data Stampede¹ team to augment NYC 311 data with tweets complaining about residential building heating systems.² This scenario can also serve as a general framework for augmenting InfoSphere BigInsights analytics capabilities through a simple integration process.

Relevancy analysis for Twitter

Often, the first step in analyzing social media is determining which posts are relevant to a topic. This process can be simplified using the InfoSphere BigInsights integration with Lucene, a powerful open source search tool that is conveniently accessible through Jaql. The integration of Jaql and Lucene hides much of the Java code otherwise needed to set up Lucene. Jaql also handles the entire MapReduce aspect of the code internally, so developers do not need to think through these considerations, which makes it easy to set up a custom search application in a big data environment.

The Jaql interface is designed to be easily extended with search engine capabilities, and Jaql itself can be extended with Java User-Defined Functions (UDFs). In the tweet analysis example, Lucene fuzzy search, which accounts for misspellings, was extended with WordNet, an open source synonym engine, and with word stemming, which reduces words to their root forms. It was also extended with language normalization from the TwitIE pipeline within the General Architecture for Text Engineering (GATE) open source project, which translates Twitter slang and hashtags into standard English. Through three simple UDFs, the built-in search engine was transformed into a comprehensive conceptual relevancy filtering application for Twitter.
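The combination of normalization, stemming, synonym expansion, and fuzzy matching can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Jaql, Lucene, WordNet, or TwitIE code used in the project: the `SYNONYMS` table, the crude suffix stemmer, and the function names are all assumptions made for this example, and the fuzzy match is a simple edit-distance check.

```python
import re

# Hypothetical synonym table standing in for a WordNet lookup.
SYNONYMS = {
    "heat": {"heat", "heating", "warmth"},
}

def normalize(text):
    """Split camel-case hashtags ("#NoHeat" -> "No Heat") and return lowercase words."""
    def split_tag(match):
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    text = re.sub(r"#(\w+)", split_tag, text)
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Crude suffix stripper; a real system would use a proper stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_relevant(tweet, query_terms, max_edits=1):
    """True if any tweet token is within max_edits of a stemmed query term or synonym."""
    targets = set()
    for term in query_terms:
        for synonym in SYNONYMS.get(term, {term}):
            targets.add(stem(synonym))
    tokens = [stem(tok) for tok in normalize(tweet)]
    return any(edit_distance(tok, tgt) <= max_edits
               for tok in tokens for tgt in targets)
```

With this sketch, `is_relevant("no heet in my apartment", ["heat"])` matches despite the misspelling, and a tweet tagged `#NoHeat` matches once the hashtag is split into words. A small edit-distance budget also admits short look-alike words, so a production filter would need tighter rules than this.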

A text analytics pipeline

Once sought-after posts are found, an organization can set its sights on the capability to perform deep analytics on these posts. The InfoSphere BigInsights SDA tool offers capabilities to analyze sentiment, buzz, and intent centered around specified entities in social media posts. Although SDA is generally the only tool an organization needs for analyzing social media, any organization interested in creating a custom social media classifier can also deploy open source software on its system.

For example, GATE Developer is very useful for creating full analytics pipelines for complex analysis of text in a plug-and-play fashion. By leveraging the TwitIE pipeline, analysts can access domain-specific components for performing tokenization, tagging parts of speech, normalizing tweets, executing hashtag tokenization, and extracting emoticons. These components were built by developers in the GATE community specifically for analyzing Twitter, and they can be quite accurate.
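The plug-and-play idea behind such a pipeline can be illustrated with a minimal Python sketch in which each stage reads a shared document record and adds or rewrites annotations. This is not the GATE or TwitIE API: the stage functions, the regular expressions, and the `run_pipeline` helper are assumptions chosen for illustration, and they cover only tokenization, hashtag tokenization, and emoticon extraction.

```python
import re

# Simplified patterns; real Twitter tokenizers handle many more cases.
EMOTICON = re.compile(r"[:;=]-?[()DPp]")
TOKEN = re.compile(r"#\w+|@\w+|\w+(?:'\w+)?")

def extract_emoticons(doc):
    """Record emoticons found in the raw text."""
    doc["emoticons"] = EMOTICON.findall(doc["text"])
    return doc

def tokenize(doc):
    """Split the text into words, hashtags, and @mentions."""
    doc["tokens"] = TOKEN.findall(doc["text"])
    return doc

def split_hashtags(doc):
    """Break camel-case hashtags into their component words."""
    tokens = []
    for tok in doc["tokens"]:
        if tok.startswith("#"):
            tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", tok[1:]))
        else:
            tokens.append(tok)
    doc["tokens"] = tokens
    return doc

def lowercase(doc):
    """Normalize token case."""
    doc["tokens"] = [tok.lower() for tok in doc["tokens"]]
    return doc

def run_pipeline(text, stages):
    """Pass one document through each stage in order."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc

# Stages can be reordered, removed, or extended without touching the others.
TWEET_PIPELINE = [extract_emoticons, tokenize, split_hashtags, lowercase]
```

Calling `run_pipeline("Still #NoHeat in 4B :( @nyc311", TWEET_PIPELINE)` yields a record with the split, lowercased tokens and the extracted emoticon. Adding a further stage, such as a part-of-speech tagger, only requires appending another function to the list.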

Another benefit of GATE is that general-purpose components can be layered on top of the domain-specific ones to provide additional functionality, though with a little less accuracy. In the tweet analysis example, dependency structure analysis, coreference resolution, and extraction of predicate argument structure were added on top of the base system to enable extremely fine-grained textual analysis.

The states of GATE pipeline applications can be saved and deployed through the GATE Embedded Java interface. The pipeline created for tweet analysis was easily integrated as a Java UDF for Jaql, which eliminated the need to navigate Apache Hadoop and MapReduce. The Java code was written once, and Jaql handled the integration with the Hadoop Distributed File System (HDFS) every time and in every way the pipeline was deployed in InfoSphere BigInsights.

Machine-learning models

Ultimately, the goal of the Big Data Stampede team’s effort in implementing NYC 311 tweet analysis was to fuel machine-learning models. The InfoSphere BigInsights Big R tool allows for rapid training, comparison, and deployment of a wide range of machine-learning models, and it hides all of the complexities of MapReduce from data analysts (see figure).

Figure: InfoSphere BigInsights Big R tool in the web-based RStudio console

The conceptual fuzzy search application and the text analytics pipeline application gave the machine-learning models a fine-grained semantic understanding of each tweet to draw on, which supports high accuracy in tweet analytics. The infrastructure that was built can raise accuracy across analytics of social media posts, with all of the processing handled efficiently in InfoSphere BigInsights.
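To make the hand-off from text pipeline to model concrete, the sketch below trains a small multinomial naive Bayes classifier on token lists of the kind a tweet pipeline might produce. It is plain Python rather than Big R. The choice of naive Bayes is an assumption made for illustration, since the article does not name the models used, and the four training examples are invented for this example and are not project data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def fit(self, examples):
        for tokens, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy examples: token lists paired with a hypothetical label.
TRAIN = [
    (["no", "heat", "apartment", "landlord"], "complaint"),
    (["radiator", "broken", "freezing", "heat"], "complaint"),
    (["heat", "wave", "beach", "sunny"], "other"),
    (["miami", "heat", "game", "tonight"], "other"),
]

model = NaiveBayes().fit(TRAIN)
```

Here the word "heat" alone does not separate the classes, so the model relies on the surrounding tokens: `model.predict(["landlord", "radiator", "freezing"])` favors the complaint label, while `model.predict(["beach", "game"])` favors the other label. Richer features from the pipeline, such as normalized hashtags or dependency relations, would be added as further tokens in the same way.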

Insight through text analytics

Text-based social media can offer tremendous opportunities for organizations to enhance the value of their operations by deriving insight from these data sources. By combining open source technologies with InfoSphere BigInsights, analyst teams can build machine-learning models capable of extracting deep meaning through analysis of textual social media posts. These models can be applied to a wide range of text analytics scenarios for insight discovery and application logic deployment that can add business value in many industries.

1 For guidance on getting started with a machine-learning project, visit the IBM Big Data Stampede site.
2 “Enhancing Survey Data with Related Social Media,” by Matthew Riemer, IBM Data magazine, June 2014.
