Spark-based machine learning for capturing word meanings

Lead Data Scientist, IBM

When someone can take a very challenging present-day problem and translate it into a problem that has been studied for centuries, the result can be amazing. Such is the case with Word2Vec, a method for transforming words into vectors. Text is unstructured data and has been explored mathematically far less than vectors, both historically and today.

Historical mathematics and text data

Physicist and mathematician Sir Isaac Newton may have been the first person to study vectors in the context of forces in physics. The concept of vectors has almost three centuries of scientific maturity. Mathematical exploration of text data is a concept with only a few decades of maturity. Similarly, I have worked with vectors for more than half of my life, but only explored text data for less than a year. 

The application of mathematical thinking to text data is especially important now. Today, the value of data is understood but not actualized. The majority of business-relevant information originates in unstructured form, which is primarily text. This data can be invisible to and unusable by business, education, government and healthcare until it can be read. Mathematical exploration of text data can yield insights that translate into better decisions made by doctors, entrepreneurs, marketers and teachers. 

In my ongoing endeavor to make text data readable, I applied Word2Vec to generate vectors that capture word meaning and enable arithmetic operations associated with words. For example: 

vector(king) + vector(woman) – vector(man) = a vector close to vector(queen) 

Word computation

Thomas Mikolos et al. at Google proposed the Word2Vec method in 2013. This algorithm is based on networks and maps a corpus of text to a matrix in which each row is associated to a word in the input text data—for example, tweets, product reviews, playlists and so on. The resultant vector space can be utilized in a variety of ways, such as measuring distance between words. Therefore, given a word of interest, the aforementioned vector space can be used to compute the top n closest words.

For example, a model I built using 30 days of Twitter data yields the 5 closest hashtags (words) to #deeplearning: 

  • #machinelearning
  • #ml
  • #smartdata
  • #predictiveanalytics
  • #datascience 

The Word2Vec implementation used in this case is from Apache Spark ML, a machine learning package that’s part of Apache Spark. If you’re interested in building your own Word2Vec model, take a look at this notebook that runs on the IBM Data Science Experience (DSX). And be sure to attend Spark Summit Europe 2016, 25–27 October 2016, in Brussels.

Attend Spark Summit Europe 2016