Master Data Management (MDM) vs. Sensemaking

IBM Fellow, Chief Scientist for Entity Analytics, IBM

I get asked from time to time how Master Data Management (MDM) relates to my work in Sensemaking Systems.  There is sufficient confusion to warrant a blog post on this subject, so here it is.

Different missions, different tools.  Some organizations will use one or the other; most organizations will want both.

MDM in a nutshell

MDM is about helping companies gain control over business information by enabling them to manage and maintain a complete and accurate view of their master data.  MDM deals with a known problem.  Master data is “intentional” business data that is structured and flows from systems under one’s control.  Common master data domains include customers, products, and accounts.  The MDM outcome: organizations have higher quality, reliable and consistent master data records.

Sensemaking Systems in a nutshell

Sensemaking is about helping companies make sense of their diverse observational space, ranging from data they own and control (e.g., structured master data) to data they do not or cannot control (e.g., externally-generated and less structured social media).  Sensemaking deals with uncertainty around an unknown and ever-changing domain – as if operating over an arbitrary number of puzzles, each unclear, incomplete and riddled with inconsistencies.  Common Sensemaking domains include customers, watch lists, social circles and investigations.  The Sensemaking outcome: organizations make better decisions, faster.

Both MDM and Sensemaking Systems make organizations more competitive.  Both can serve real-time or batch missions.  And both play together quite nicely.

Now to highlight some stark differences between these two missions – and hence why the algorithms used in each domain are so fundamentally different.

Rich Data vs. Poor Data

Because master file data is owned and controlled by the organization, this data tends to be more complete (feature rich).  For example, customer records often contain a name, address, and some form of further identification like a date of birth or tax identification number.

While high fidelity data is preferable, Sensemaking Systems must routinely attempt to associate low fidelity observations, e.g., connecting the dots between a third-party watch list containing only name, nationality and date of birth fields and an internal cyber investigation database containing only email and IP addresses.

Dealing with low fidelity data requires union/set-based entity resolution algorithms, which are fundamentally different from the record linking algorithms used when dealing with rich data.
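To make the distinction concrete, here is a minimal sketch of set-based entity resolution: each entity is a growing set of features, and a new low-fidelity observation joins an entity when any feature overlaps – possibly gluing several previously separate entities together via set union.  The matching rule and feature names here are illustrative assumptions, not any product's actual algorithm.

```python
# Set-based entity resolution sketch: entities are feature sets, and a
# new observation merges (unions) every entity it overlaps with.

def resolve(entities, observation):
    """entities: list of feature sets; observation: a feature set."""
    matched = [e for e in entities if e & observation]
    if not matched:
        entities.append(set(observation))  # no overlap: a new entity
        return entities
    # union: one observation may glue several entities into one
    merged = set(observation)
    for e in matched:
        merged |= e
        entities.remove(e)
    entities.append(merged)
    return entities

entities = []
resolve(entities, {("name", "J. Smith"), ("dob", "1970-01-01")})
resolve(entities, {("email", "js@example.com"), ("ip", "10.0.0.1")})
# A later low-fidelity observation links the two prior records:
resolve(entities, {("dob", "1970-01-01"), ("email", "js@example.com")})
assert len(entities) == 1  # three observations resolved to one entity
```

Note that the third observation shares no field with a "name + address" style master record comparison; it is the accumulated set of features, not a pairwise record score, that does the linking.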

Bad Data Bad vs. Bad Data Good

Organizations that deploy MDM seek to improve data quality at collection, e.g., detection and elimination of duplicate accounts during data entry – the purpose being clean records.  This kind of control over internally generated master data means less bad data gets in the front door.  And in this pursuit of “golden” master data records … bad data is bad.
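A minimal sketch of this collection-time control, assuming a simple normalized-key duplicate check at data entry (field names and normalization rules are illustrative, not any MDM product's matching logic):

```python
# Duplicate detection at collection time: normalize the candidate
# record and reuse the existing "golden" record if the key is known.

master = {}  # normalized key -> golden record

def normalize(name, tax_id):
    # Illustrative normalization: collapse whitespace/case, strip dashes
    return (" ".join(name.lower().split()), tax_id.replace("-", ""))

def on_board(name, tax_id):
    key = normalize(name, tax_id)
    if key in master:
        return master[key]  # duplicate detected: no second record created
    master[key] = {"name": name, "tax_id": tax_id}
    return master[key]

a = on_board("Jane  Doe", "123-45-6789")
b = on_board("jane doe", "123456789")
assert a is b            # one golden record, not two
assert len(master) == 1
```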

Contrast this with Sensemaking Systems.  In this class of system, errors such as misspellings and numeric transpositions are valuable regardless of whether these errors have been generated by accident or are professionally fabricated lies created by sophisticated criminals.  When a criminal uses different alias and date of birth combinations, Sensemaking Systems meticulously record this untrusted information, permit this ambiguity to fester, and monitor numerous versions of truth.  This “bad data” is sometimes the only source of an important clue. 
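One way to picture "permitting ambiguity to fester": instead of overwriting values toward a single golden record, a Sensemaking store keeps every asserted value with its source.  A hypothetical sketch (the attribute names and sources are made up for illustration):

```python
# Multiple versions of truth: every asserted value is recorded with its
# source; conflicting aliases and dates of birth are kept, not purged.
from collections import defaultdict

entity = defaultdict(list)  # attribute -> [(value, source), ...]

def observe(attribute, value, source):
    entity[attribute].append((value, source))

observe("name", "Robert Kane", "arrest report")
observe("name", "Bob Cain", "hotel registry")    # alias: kept
observe("dob", "1980-03-01", "arrest report")
observe("dob", "1981-03-01", "hotel registry")   # transposition: kept
assert len(entity["name"]) == 2  # both versions of truth survive
```

The alias and the transposed date of birth – noise to an MDM system – are exactly the clues that may later tie two identities together.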

The algorithms used to harness an out-of-control observation space are quite different from those used to manage internally controlled, higher quality records.

Manageable Data vs. Unimaginable Big Data

For most organizations, master data will not exceed 500M records.  At these scales, the automated triage of changes to master data results in a small to modest number of records that must be arbitrated by human review.  Fortunately, as an organization’s master data is a closed domain … there is a real chance humans can keep up, because the “uncertainty” is minimal and manageable.

Sensemaking Systems must be able to deal with extremely large data sets – potentially involving tens to hundreds of billions of records – being generated from an ever more diverse range of data sources (e.g., from Twitter to OpenStreetMap).  At this volume and diversity the “uncertainty” is beyond the capacity of virtually any human review.

The algorithms used to deal with this enormous ambiguity tend to be very different from algorithms that routinely face smaller, more controlled data sets and routinely benefit from human participation.  Sensemaking algorithms have also been seen to get smarter and faster over ever growing datasets – an exciting feature for customers staring information overload in the face.

Firm Facts vs. Predictions

MDM is about being in control of the master data that an organization generates and about the truth of this data.  Master data becomes so reliable that its assertions are treated as fact – and these facts are so firm they are rarely reversed or invalidated.

Because Sensemaking Systems are attempting to harness an out of control observation space, assertions about context are constantly postulated, reevaluated and retroactively readjusted.  This accumulating context is viewed more like a prediction that is expected to evolve as more observations are considered.

The algorithms required to constantly reassess and context-correct the historical observations with every new observation, in real time, are fundamentally different from algorithms that do not have this requirement.
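The retroactive-correction idea can be sketched as follows: keep every raw observation and re-derive context on each new arrival, so an earlier conclusion can be reversed.  The "same person iff all known dates of birth agree" rule below is a deliberately trivial stand-in for real context computation:

```python
# Context as a prediction: each new observation triggers re-derivation
# over the full history, so earlier conclusions can be reversed.

observations = []

def derive_same_person(obs):
    """Trivial illustrative rule: same person iff all known DOBs agree."""
    dobs = {o["dob"] for o in obs if "dob" in o}
    return len(dobs) <= 1

def ingest(item):
    observations.append(item)
    return derive_same_person(observations)  # context re-derived each time

assert ingest({"name": "J. Smith"}) is True
assert ingest({"name": "John Smith", "dob": "1970-01-01"}) is True
# A later observation retroactively reverses the earlier conclusion:
assert ingest({"name": "J. Smith", "dob": "1972-05-09"}) is False
```

A system without this requirement can simply score each record once and move on; a Sensemaking System must treat every past assertion as revisable.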

Event Triggered vs. Everything a Trigger

MDM processes are triggered during master data events.  MDM is used to guide the collection and management of master data.  For example, real-time duplicate detection during:

  • Customer on-boarding at a financial services company
  • Account set-up at a telecommunications company
  • New product creation at an insurance company
  • Patient admission at an emergency room or doctor’s office

Sensemaking Systems are more akin to “intuition support systems” – treating every piece of data that falls within the organization’s observational space as a chance to learn something new, detect something relevant, or trigger an alarm.  As each new observation arrives in the enterprise, the enterprise needs to know if it just learned something new, confirmed something it already knew, or has something wrong.  And with each new observation it must answer the hardest question of all (for a computer): “How does this relate to what I already know?  Does this matter and, if so, to whom?”  For example:

  • As fast as the consumer joins the insurance company’s Facebook group, the insurance company recognizes the consumer’s significant social influence and is now instantly prepared to respond accordingly during the next business interaction
  • The customer’s on-line retail purchase this morning is used to render a more relevant “next best action” recommendation later that morning when they visit the physical store front, e.g., the coupon on the back of the receipt no longer includes the item they already purchased
  • While processing new deceased persons’ data from the Social Security Administration, notification is sent to the marketing department suggesting they “inactivate” those persons in the bank’s prospect database
  • During the creation of a casino arrest report, an alert is generated when the subject provides the same home address as that provided by the dealer on his employment application
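The last example above can be sketched as a tiny "data finds data" loop: every new observation is checked against accumulated context at ingestion, and a subscriber is alerted the moment it connects to something already known.  Names, roles and the shared-address rule are illustrative assumptions:

```python
# Publish insight at ingestion: a new observation is compared against
# accumulated context, and connections raise alerts immediately.

by_address = {}  # address -> list of (role, name) already observed
alerts = []

def ingest(role, name, address):
    for prior_role, prior_name in by_address.get(address, []):
        alerts.append(
            f"{role} {name} shares address with {prior_role} {prior_name}"
        )
    by_address.setdefault(address, []).append((role, name))

ingest("dealer", "A. Jones", "12 Elm St")          # employment application
ingest("arrest subject", "B. Smith", "12 Elm St")  # arrest report
assert len(alerts) == 1  # the new observation found the old data
```

Notice there is no standing query here: the arriving record itself is what probes the accumulated context.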

The algorithms necessary to publish insight at ingestion, based on incremental context accumulation and complex event processing, are fundamentally different from algorithms that do not have this requirement.

MDM + Sensemaking

If an organization has a well-constructed MDM system, it has a good handle on its own data and processes.  This MDM data makes a great starting point for Sensemaking Systems.  Nice, but not required.

Ironically, when Sensemaking Systems are used in weak-signal domains, e.g., the detection of criminal activity, irregularities and errors in the data presented by adversaries happen to be extremely useful.  So while MDM systems (by design) clean up data entry, Sensemaking Systems should be provided the good, bad, and ugly data – pre- and post-cleaning.  Point being: if you polish every piece of data to perfection, you may never find the weak signal.

Different technologies.  Different missions.   And a simple integration story.

