Introduction to text mining

Senior Data Scientist/Researcher, JForce Information Technologies Inc.

The amount of textual information available on the World Wide Web and in institutional document repositories has grown exponentially. But as ever more text becomes available, organizations can find themselves hard pressed to obtain meaningful information from huge masses of textual data. Accordingly, text mining methods have been created that allow vast amounts of textual information to be accessed and analyzed. Text mining generally refers to the work of any system that analyzes large quantities of natural language text to identify lexical or linguistic usage patterns of interest and extract potentially useful information.

Get a handle on information exchange

Unlike data stored in databases, text that is to be mined for insights is unstructured and amorphous, making it difficult to process algorithmically. But processed it must be, for in modern culture, text is the most common vehicle for formal information exchange. Thus text mining usually processes texts that communicate factual information or opinions.

The motivation for text mining is compelling, even when it achieves only partial success. Because text is the most natural method for storing information, text mining is commonly thought to embody greater commercial potential than other forms of data mining. And common wisdom seems right in this case: One study has indicated that 80% of company information is contained in text documents. However, text mining is a much more complex task than data mining is, for text mining requires the processing of inherently unstructured, fuzzy data.

Make sense of your data resources

Text mining is a multidisciplinary field that involves information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning and data mining. Despite its complexity, text mining is required before organizations and individuals can make sense of their vast information and data resources and leverage their inherent value. Such resources must first be processed: accessed, analyzed and annotated, then related to existing information and understanding. Data so processed can then be mined to identify patterns of interest and extract valuable information, including new insights.

But how information and data resources are analyzed depends on their format. Structured data can be relatively easily mined, for its structure aids processing. However, using a computer to automatically analyze information contained in documents is much more difficult. Most digital documents comprise unstructured text containing flat data—rather than structured and meaningful information—which cannot be directly and automatically processed usefully by a computer.
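One common way to give unstructured text a structure that a computer can process is to convert each document into a vector of word counts, often called a bag of words. The following is a minimal sketch of that idea, not a method prescribed by this article: the function names (`tokenize`, `term_frequencies`, `to_matrix`) and the two sample sentences are assumptions made for illustration, and the tokenizer handles only plain English letters.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(documents):
    """Return one Counter of token counts per document."""
    return [Counter(tokenize(doc)) for doc in documents]


def to_matrix(counts):
    """Turn per-document counts into a vocabulary and a document-term matrix.

    Each row corresponds to a document and each column to a vocabulary term.
    """
    vocabulary = sorted(set().union(*counts))
    rows = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, rows


# Hypothetical sample documents, used only to illustrate the idea.
documents = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in tables.",
]

vocabulary, rows = to_matrix(term_frequencies(documents))
print(vocabulary)
for row in rows:
    print(row)
```

Once documents are represented as rows of numbers, techniques built for structured data, such as clustering or classification, can be applied to them. Word order and context are lost in this representation, which is one reason text mining remains harder than mining data that was structured from the start.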

Text mining therefore involves more complicated processes than structured data mining does, and some of these processes can conflict with copyright law. Yet the sheer volume of text generated by business, academic and social activities (for example, competitor reports, research publications and customer opinions on social networking sites) makes text mining in some form a necessity.

Unearth business insights

Technologies that address problems such as topic detection, tracking and trending, in which a machine automatically identifies the topics discussed in a text, hold great promise. Text mining tasks include text categorization, text clustering, concept and entity extraction, production of granular taxonomies, sentiment analysis, document summarization and entity relation modeling. Not surprisingly, text mining has numerous applications, among them aiding cutting-edge research through analysis and classification of news reports, filtering email and spam, hierarchically extracting topics from web pages, automating ontology extraction and management, and collecting competitive intelligence.
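To make one of these tasks concrete, here is a minimal sketch of text categorization applied to spam filtering. It uses a multinomial naive Bayes classifier with add-one smoothing, which is one common choice among many and not necessarily the approach this series will take. The class name `NaiveBayesTextClassifier`, the tiny training set and the labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents seen per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Smoothed log likelihood; unseen words get a count of zero.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, far too small for real use.
train_docs = [
    "win a free prize now",
    "free money offer click now",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

classifier = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(classifier.predict("claim your free prize"))
print(classifier.predict("project meeting on monday"))
```

The classifier scores each class by combining how often the class occurs with how often each word of the message appeared in that class during training, then picks the highest-scoring class. A real filter would need far more training data and a more careful tokenizer.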

In this blog series, you’ll discover how to use kernel methods to identify patterns in text. You’ll also dig into the text mining process, including machine learning, by exploring statistical learning methods and learning how to apply classification and clustering to text. To learn more about how text mining and other advanced analytics can help your organization gain insights, check out the IBM Analytics resource page.

