Merging datasets using graph analytics

Software Engineer, Lynx Analytics

Imagine it: Just when you’ve perfectly decorated your living room, you receive a beautiful gift from your boss. It’s a life-size golden statue of a camel. How would you feel about the gift?

I had just such a feeling just a few weeks ago. We had our call data perfectly organized. We could answer any query. But then came the golden camel: We received a large batch of social network data that did not fit in with anything we had before.

Most profiles in the call data lacked personal and demographic details, so the social network profiles were very valuable. But there was no way to connect the two datasets. Some call profiles had names, so we tried joining by names. But we found that names were much less individual than we expected; worse still, the names that were unique were so because of misspellings. Optional middle names, anglicized names, nicknames and even pop culture references all blocked our attempts to make sense of the data. And although some social profiles were associated with phone numbers, we found that such numbers too often belonged to a relative or friend.

How could we connect the social network data with our existing data? similarity fingerprinting

The profiles had no identifying attributes other than names, phone numbers and internal IDs (which were, of course, independent in the two datasets). But that was not to say that nothing was left that could identify the profiles. Being a graph analytics firm, we naturally turned to the graph structure to find the solution.

Both datasets included graph edges: calls in one and social relationships in the other. If two profiles represented the same person, we expected that they would be connected to other profiles that also represented the same people.

We used the ambiguous, low-quality connections generated from the names and phone numbers as the seeds for the similarity calculation. After we had calculated the similarity scores for the potential profile mappings, we had to choose the best matching. Doing so was tricky, because we couldn’t just pick the highest-similarity social profile for every call profile—that social profile might well already be taken by another call profile.

Such a situation, in which we need to find the best matching in a weighted bipartite graph, poses what is known as the stable marriage problem. It is a classical problem that has a well-known solution, the Gale–Shapley algorithm.

Implementation with Apache Spark

A popular model of distributed computation on graphs known as Pregel was published by Google researchers in 2010. Pregel is based on passing messages along the graph edges in a series of iterations. Accordingly, it is a good fit for the Gale–Shapley algorithm, which starts with each “gentleman” (a vertex on one side of the bipartite graph) sending a marriage proposal to its most preferred single “lady” (a vertex on the other side of the bipartite graph). The “ladies” then marry their most preferred suitors, after which the process is repeated until there are no more proposals to be made.

The Apache Spark distributed computation engine includes GraphX, a library specifically made for executing distributed computation on graph data. GraphX provides an elegant Pregel interface but also permits more general computation that is not restricted to the message-passing pattern. Thus GraphX is a great option for implementing our plan to connect the call and social networks.

The capabilities of the Apache Spark graph analytics application on which I am working are similar to those offered by GraphX. While GraphX is a pure programming API, we also added a graphical user interface that allowed us to visually and statistically explore the datasets easily, evaluating different approaches for mapping the social profiles to the call profiles.

Using Apache Spark, we built an end-to-end fingerprinting tool to identify matching candidates among two independent data sets, calculating a similarity score and solving the stable marriage problem. Integration with a graphical environment not only saved us time, but also allowed us to easily experiment by tweaking the parameters of the tool.

The answer

Thanks to Apache Spark, we were able to figure out who was who with surprising accuracy as we merged the social network dataset with the call data. The enriched profiles and the added personal connections later contributed to a number of insightful results—and the camel is now in harmony with the rest of the living room.

For more information on IBM Big Data & Analytics, please visit To learn more about Apache Spark, please visit Also, check out the following resources to discover more about IBM’s deep commitment to Spark and engagement with Lynx Analytics and other IBM partners in the Spark market:

Additionally, please sign up for IBM’s forthcoming Apache Spark as a Service on Bluemix.