Deriving insight from text mining and machine learning

Senior Data Scientist/Researcher, JForce Information Technologies Inc.

Text mining is the analysis of data contained in natural language text. It can help an organization derive potentially valuable business insights from text-based content such as Word documents, emails and posts on social media platforms like Facebook, Twitter and LinkedIn. It has become an important research area for applying machine learning techniques to information retrieval and natural language processing. Put simply, text mining is the discovery of knowledge in the abundant text data that is readily accessible over the Internet.

Text mining is a process comprising several steps as shown in Figure 1.

  • Step 1: The documents that suit the application are identified within large volumes of textual data. Document clustering methods are used to solve this problem. These are unsupervised learning methods; the most popular are k-means clustering and agglomerative hierarchical clustering.
  • Step 2: The text is cleaned: ads are stripped from web pages; text is normalized and converted from binary formats; tables, figures and formulas are dealt with; and so on. Next, each word in the text is marked up with its corresponding part of speech (POS). There are two approaches to this tagging: a rule-based approach that depends on grammatical rules, and a statistical approach that relies on the probabilities of different word orders and needs a manually tagged corpus for machine learning. Then, for each word that has several distinct senses, the sense in which it is used in a given sentence is determined. Finally, semantic structures are defined. There are two ways to determine semantic structures: full parsing, which produces a parse tree for a sentence, and chunking with partial parsing, which produces syntactic constructs such as noun phrases and verb groups for a sentence. Producing a full parse tree often fails because of grammatical inaccuracies, unusual words, bad tokenization, incorrect sentence splits, errors in POS tagging and so on. Hence, chunking and partial parsing are more commonly used.
  • Step 3: The words (features) are determined for text representation. The primary methods for document representation are the bag-of-words model and the vector space model. These approaches aim to determine which features best characterize a document.
  • Step 4: The dimensionality of the feature space is reduced by removing irrelevant attributes.
  • Step 5: The text mining process merges with the traditional data mining process. Classic data mining techniques such as clustering, classification, decision trees, regression analysis, neural networks and nearest-neighbor methods are applied to the structured database that resulted from the previous stages. This is a purely application-dependent stage.
  • Step 6: If the results are not satisfactory, they are fed back as part of the input to one or more earlier stages.
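
To make the middle of this process more concrete, here is a minimal sketch of the representation, reduction and mining steps in plain Python. It is only an illustration under stated assumptions: the toy corpus, its topic labels, the function names and the document-frequency thresholds (`min_df`, `max_df_ratio`) are all hypothetical, and the nearest-neighbor step is just one of the classic techniques listed above, not a prescribed choice. The tokenizer is a crude stand-in for the much richer cleaning and tagging work of step 2.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus with made-up topic labels, for illustration only.
TRAIN = [
    ("the battery life of my phone is great", "phone"),
    ("the phone screen cracked after a week", "phone"),
    ("the pasta at that restaurant was delicious", "food"),
    ("the restaurant served cold soup", "food"),
]

def tokenize(text):
    """Crude stand-in for text cleaning: lowercase, keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Feature selection and reduction: keep words that occur in at least
    min_df documents but in no more than max_df_ratio of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    limit = max_df_ratio * len(docs)
    return sorted(w for w, n in doc_freq.items() if min_df <= n <= limit)

def vectorize(text, vocab):
    """Bag-of-words representation: one word count per vocabulary entry."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(text, examples=TRAIN):
    """Data mining step: label a text by its nearest neighbor in the corpus."""
    vocab = build_vocabulary([doc for doc, _ in examples])
    query = vectorize(text, vocab)
    _, label = max(examples, key=lambda ex: cosine(query, vectorize(ex[0], vocab)))
    return label

print(build_vocabulary([doc for doc, _ in TRAIN]))
print(classify("my new phone has a great camera"))
```

With this corpus, the word "the" is dropped because it appears in every document, and words seen in only one document are dropped as too rare, which leaves a two-word vocabulary. A real application would rely on far larger corpora and more careful feature weighting.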

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It explores the study and construction of algorithms that can learn from and make predictions about data. Such algorithms operate by building a model from example inputs to make data-driven predictions or decisions, rather than following strictly static program instructions.

Machine learning is closely related to and often overlaps with computational statistics, a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. It is employed in a range of computing tasks where designing and programming explicit algorithms is infeasible. Example applications include spam filtering, optical character recognition (OCR), search engines and computer vision. Text mining takes advantage of machine learning specifically in determining features, reducing dimensionality and removing irrelevant attributes. For example, text mining uses machine learning for sentiment analysis, which is widely applied to reviews and social media for a variety of applications ranging from marketing to customer service. Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document. The attitude may be the author's judgment or evaluation, affective state or intended emotional communication. Machine learning algorithms used in text mining include decision tree learning, association rule learning, artificial neural networks, inductive logic programming, support vector machines, Bayesian networks, genetic algorithms and sparse dictionary learning.
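
As a rough illustration of sentiment analysis, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on a handful of labeled reviews. Naive Bayes is used here only because it is one of the simplest Bayesian models to write down; it is an assumption of this example, not a method singled out above. The reviews, labels and function names are hypothetical and invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labeled reviews, invented for illustration only.
REVIEWS = [
    ("great product and excellent service", "positive"),
    ("I love this great phone", "positive"),
    ("terrible support and awful quality", "negative"),
    ("I hate this awful phone", "negative"),
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Build the model: word counts per class, document counts per class,
    and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen during training
            # Log likelihood with add-one (Laplace) smoothing.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(REVIEWS)
print(predict("excellent phone, great service", model))
```

Words such as "phone" that occur equally often in both classes contribute nothing to the decision, while words such as "great" or "awful" pull the score toward one polarity. Real sentiment models also need to handle negation, sarcasm and far larger vocabularies.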


