Kernel-based pattern recognition in machine learning

Senior Data Scientist/Researcher, JForce Information Technologies Inc.

The amount of textual information available, whether online or in institutional document repositories, has seen tremendous and unabated growth. Because of the importance of obtaining meaningful information from massive amounts of text data, pattern recognition is an important issue in text mining. In particular, kernel-based methods of pattern recognition are an effective alternative to explicit feature extraction.

Kernel-based learning methods

The kernel function—a function returning the inner product between mapped data points in a higher dimensional space—is a foundational building block for kernel-based learning methods. Such learning takes place in the feature space so long as the learning algorithm can be entirely rewritten so that the data points appear only inside dot products with other data points. Several linear algorithms can be so formulated, whether for clustering, classification or regression.
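To make the rewriting concrete, here is a minimal sketch of a dual-form perceptron (toy data; all function names are illustrative, not from any particular library) in which the training points appear only inside kernel evaluations, never as explicit coordinates:

```python
# Dual (kernelized) perceptron: data points appear only inside kernel
# evaluations, never as explicit feature vectors in the mapped space.

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Learn dual coefficients alpha; the decision value for a point x
    is sign(sum_i alpha_i * y_i * k(x_i, x))."""
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(zip(X, y)):
            s = sum(a * yi * kernel(xi, xj) for a, yi, xi in zip(alpha, y, X))
            if yj * s <= 0:          # misclassified: increase this point's weight
                alpha[j] += 1.0
    return alpha

def kernel_perceptron_predict(X_train, y_train, alpha, kernel, x):
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return 1 if s >= 0 else -1

# Toy linearly separable data
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [1, 1, -1, -1]
alpha = kernel_perceptron_train(X, y, linear_kernel)
pred = kernel_perceptron_predict(X, y, alpha, linear_kernel, (1.5, 1.0))
```

Because the algorithm touches the data only through `kernel(...)`, swapping `linear_kernel` for any other valid kernel changes the feature space without changing a line of the learning code.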

The best-known example of a kernel-based system is the support vector machine (SVM), but the perceptron, principal component analysis and nearest-neighbor algorithms, among many others, also have this property. Because of their lack of dependence on the dimensionality of the feature space and their flexibility in being able to use any kernel function, kernel methods are suitable for a variety of classification tasks, particularly the classification of text.

A kernel approach to pattern analysis

The kernel approach offers a very general framework for performing pattern analysis on many types of data and can be used in a wide variety of tasks for a range of applications. It also enables the use of feature spaces whose dimensionality is more than polynomial in the relevant parameters of the systems, even though the computational cost of the pattern analysis algorithm remains polynomial.

Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space. Rather, using what is known as the kernel trick, they simply compute the inner products between the images of all pairs of data in the feature space, an operation that is often computationally cheaper than explicitly computing the coordinates would be.
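A small sketch makes the kernel trick tangible. For the degree-2 polynomial kernel, computing (x·y)² gives exactly the inner product of the explicit degree-2 feature maps, without ever constructing those feature vectors (names here are illustrative):

```python
import itertools

def phi(x):
    """Explicit degree-2 feature map: all ordered products x_i * x_j.
    For d input dimensions this vector has d**2 entries."""
    return [a * b for a, b in itertools.product(x, repeat=2)]

def poly2_kernel(x, y):
    """Kernel trick: k(x, y) = (x . y)**2, computed directly in the
    input space without mapping to the feature space."""
    return sum(a * b for a, b in zip(x, y)) ** 2

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in feature space
implicit = poly2_kernel(x, y)                          # same value, O(d) instead of O(d**2)
```

The explicit route costs quadratic work in the input dimension; the kernel route stays linear, which is why feature spaces of more than polynomial dimensionality remain affordable.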

Kernel functions have been introduced for sequence data, graphs, text, images and vectors. Indeed, kernel methods and pattern analysis can be considered two of the most important topics in machine learning in the last few years. Their adaptability and modularity have given rise to a variety of kernels and algorithms in a number of topic areas. In particular, well-known algorithms have been modified into kernel versions such as the Fisher kernel, graph kernel, radial basis function (RBF) kernel and string kernel.

The Fisher kernel

Using data to train a generative hidden Markov model, a Fisher kernel may be derived for a discriminative SVM. The Fisher kernel supplies a natural similarity measure that takes into account the underlying probability distribution. If each data item is a sequence (possibly of varying length), then each may be used to train a hidden Markov model.

A global hidden Markov model may then be constructed by taking the average of the models in the training set, allowing calculation of the extent to which a new data item would stretch the parameters of the existing model—its Fisher score. Such an evaluation requires calculation and comparison of the gradient of the log-likelihood for two data items with respect to the model, using a given set of parameters. If two data items’ Fisher scores are similar, then the items would similarly affect the model, requiring similar adaptations to the parameters from the point of view of the given parametric model at the current parameter setting.
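The Fisher score itself is simply the gradient of the log-likelihood with respect to the model parameters. A full hidden Markov model example would be lengthy, so the sketch below substitutes a one-parameter Bernoulli model for the HMM to show the idea; the model choice and all names are illustrative assumptions, not the construction above:

```python
def fisher_score_bernoulli(sequence, p):
    """Gradient of the log-likelihood of a 0/1 sequence under Bernoulli(p):
    d/dp sum_t [x_t*log(p) + (1 - x_t)*log(1 - p)] = k/p - (n - k)/(1 - p),
    where k is the number of ones and n the sequence length."""
    k = sum(sequence)
    n = len(sequence)
    return k / p - (n - k) / (1 - p)

def fisher_similarity(seq_a, seq_b, p):
    """A simplified, unnormalized Fisher kernel: the product of the two
    sequences' Fisher scores under the shared generative model."""
    return fisher_score_bernoulli(seq_a, p) * fisher_score_bernoulli(seq_b, p)

p = 0.5                # parameter of the shared generative model
a = [1, 1, 1, 0]       # sequences may differ in length
b = [1, 1, 0]
c = [0, 0, 0, 0, 1]

sim_ab = fisher_similarity(a, b, p)   # both would pull p upward -> positive
sim_ac = fisher_similarity(a, c, p)   # opposite directions -> negative
```

Sequences whose scores point the same way would adapt the model's parameters similarly, so the kernel reports them as similar, exactly the intuition described above.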

Graph kernels

Graph kernels are a relatively new, theoretically sound approach to the problem of graph comparison, and one that appears to hold much promise. Indeed, once a graph kernel is defined, an entire family of data mining and machine learning algorithms becomes applicable to graphs.


Interestingly, graph kernels employ concepts from all three traditional branches of graph comparison: They measure similarity in terms of isomorphic substructures of graphs; they allow for inexact matching of nodes, edges and labels; and they treat graphs as vectors in a Hilbert space of graph features.
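As a minimal sketch of the first of those ideas, the vertex-label histogram kernel compares graphs through their smallest isomorphic substructures, single labeled vertices. The graphs and labels below are hypothetical:

```python
from collections import Counter

def vertex_label_kernel(labels_g, labels_h):
    """Vertex-label histogram kernel: the inner product of the two graphs'
    node-label count vectors. This compares graphs via their simplest
    shared substructures: individual labeled vertices."""
    cg, ch = Counter(labels_g), Counter(labels_h)
    return sum(cg[label] * ch[label] for label in cg)

# Node labels of two hypothetical molecular graphs
g = ["C", "C", "O", "H"]
h = ["C", "O", "O"]

k_gh = vertex_label_kernel(g, h)   # shared-label overlap between g and h
k_gg = vertex_label_kernel(g, g)   # self-similarity of g
```

Richer graph kernels (random-walk, subtree, shortest-path) follow the same pattern but count larger shared substructures; this histogram version is only the base case.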

The radial basis function kernel and string kernel

In machine learning, the RBF kernel is a popular kernel function used in various kernelized learning algorithms. It is commonly used in SVM classification.
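The RBF kernel has a closed form, k(x, y) = exp(-γ‖x − y‖²), so a short sketch suffices (the points and γ value below are arbitrary):

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||**2).
    Equals 1 when x == y and decays toward 0 as the points move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])    # identical points -> 1.0
near = rbf_kernel([1.0, 2.0], [1.0, 2.5])    # close points -> close to 1
far  = rbf_kernel([1.0, 2.0], [5.0, 9.0])    # distant points -> near 0
```

The parameter gamma controls how quickly similarity falls off with distance, which is why it is a key hyperparameter to tune in SVM classification.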

Similarly, the string kernel, which operates on strings—finite sequences of symbols not necessarily of the same length—is also popular for use in learning algorithms. String kernels can be intuitively understood as functions measuring the similarity of string pairs.
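One simple member of this family is the k-spectrum kernel, which scores two strings by their shared length-k substrings. A minimal sketch (the strings and the choice k = 2 are arbitrary):

```python
from collections import Counter

def spectrum_kernel(s, t, k=2):
    """k-spectrum string kernel: the inner product of the two strings'
    length-k substring count vectors. The strings need not have the
    same length; similarity is the number of shared k-mers, weighted
    by how often each occurs."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

k_related = spectrum_kernel("banana", "bandana")   # many shared 2-mers
k_unrelated = spectrum_kernel("banana", "xyz")     # no shared 2-mers
```

More elaborate string kernels, such as the gappy subsequence kernel used for text classification, weight non-contiguous matches as well, but the counting-substrings intuition is the same.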

To learn more about text mining and other advanced analytics methods, visit this informational IBM Analytics resource page.

