The Context Conundrum
Big data has no reason to have context, and therein lies the challenge of nonrepetitive data
Context often tends to be taken for granted. Consider a world without context. For example, what does 7 mean? Does it refer to the days of the week? Does it represent the seas of the world? Does it reference the number of dwarfs in Snow White? By itself, the number 7 means nothing, and for that matter any number by itself means nothing. For any number to have meaning it must have context attached to or associated with it.
People generally assume that the number 7, or any number, always has context. After all, a standard database management system (DBMS) has always provided context. To begin, there is the context of data being inside a record. And the data always belongs to a named attribute, which is itself context. When we see 555-9087, for example, we know that it is a phone number. Or when we see $5.95, we know that it is the price of something. So what’s the big deal with context? When a DBMS provides context, the meaning of the data is known, or at least there is a clue as to what the data means.
Stated differently, data can participate in a standard DBMS only if its context can be identified. There is no such thing as data without context in a DBMS.
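To make the point concrete, here is a minimal sketch of the difference between a bare value and the same value held in a record. The record layout and attribute names (`CustomerRecord`, `days_since_last_order`, and so on) are hypothetical, chosen only for illustration; they do not come from any particular DBMS.

```python
from dataclasses import dataclass, fields

# A bare value carries no context: 7 could be days, seas, or dwarfs.
bare_value = 7


@dataclass
class CustomerRecord:
    # Hypothetical record layout; the attribute names supply the context.
    phone_number: str
    item_price_usd: float
    days_since_last_order: int


record = CustomerRecord(
    phone_number="555-9087",
    item_price_usd=5.95,
    days_since_last_order=7,
)


def describe(rec):
    """Pair each value with the attribute name that gives it meaning."""
    return {f.name: getattr(rec, f.name) for f in fields(rec)}


context = describe(record)
```

The value 7 is identical in both cases; only the attribute name attached to it in the record says what it measures.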
Nonrepetitive data is challenging
In the world of big data, however, much of the data is unstructured, and one common type of unstructured data is text. Unlike data in a standard DBMS, text in big data has no built-in reason to carry any context, which presents a profound problem.
One way to look at big data is that it can be made up of repetitive data and nonrepetitive data.* In this sense, the issue of context is narrowed down a little. With repetitive data such as call detail records, click-stream data, metering data, and so on, context is fairly easy to establish. Each repetitive record has the same structure as every other repetitive record, so the metadata that describes the context of one record is the key to understanding the data in all the others. Finding context for repetitive big data is therefore not a problem, or at least not a big one.
However, finding and establishing context for nonrepetitive data in big data is another matter entirely. Nonrepetitive big data includes emails, call center data, corporate contracts, warranty claims, and many other types of data, and several problems arise when attempting to find and establish its context. The following examples indicate some of these problems:
- The context for one nonrepetitive record, such as a textual description of a call with customer service, is completely different from the context for any other record of nonrepetitive data. For example, one person may have had a good experience, the next person may have had a bad experience, and another person may have just wanted to pass the time of day. The messages the customers write as nonrepetitive records of data are based on entirely different sets of experiences.
- One nonrepetitive record may contain the basis for establishing its context, while the next nonrepetitive record may contain no basis whatsoever. For example, in one email the writer mentions being a student, taking a class in geometry, and trying to solve a problem. But the next email says nothing about who is writing it.
- The data in half of the nonrepetitive records in a big data repository may have an entirely different context from the data found in the other half. For example, one set of messages generated on July 31, 2011 includes references to how hot the summer weather is; the other set of messages written on August 1, 2011 includes references to stock markets plummeting worldwide.
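As a rough illustration of why repetitive data is the easy case, the sketch below applies one metadata description to every record in a small batch of call detail records. The field names in `CDR_FIELDS` and the sample rows are invented for this example and do not reflect any real carrier format.

```python
import csv
import io

# Hypothetical metadata for a call detail record.
# It is described once and applies to every row.
CDR_FIELDS = ["caller", "callee", "start_time", "duration_seconds"]

# Invented sample data, one record per line.
RAW_CDRS = """\
555-9087,555-1234,2011-07-31T09:15:00,62
555-4410,555-9087,2011-07-31T09:17:30,305
555-1234,555-4410,2011-08-01T14:02:10,18
"""


def parse_cdrs(raw):
    """Attach the shared field names to each raw record."""
    reader = csv.DictReader(io.StringIO(raw), fieldnames=CDR_FIELDS)
    records = []
    for row in reader:
        row["duration_seconds"] = int(row["duration_seconds"])
        records.append(row)
    return records


records = parse_cdrs(RAW_CDRS)
total_seconds = sum(r["duration_seconds"] for r in records)
```

Because every row shares the same layout, a single list of field names is enough to interpret all of them, whether there are three records or three billion.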
The list of problems with context for data in nonrepetitive records goes on, and these problems demonstrate the many reasons why the context for nonrepetitive data in big data repositories can be difficult to determine and establish.
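The second problem in the list above can be sketched with a deliberately naive keyword lookup. The cue words in `CONTEXT_CUES` and the sample emails are assumptions made up for this example; real textual disambiguation requires far more than word matching, and this is not presented as a workable method.

```python
import re

# Hypothetical cue words per topic; invented for illustration only.
CONTEXT_CUES = {
    "education": {"student", "class", "geometry"},
    "weather": {"hot", "summer", "weather"},
    "finance": {"stock", "stocks", "markets"},
}


def guess_context(text):
    """Return the topics whose cue words appear in the text, if any."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(topic for topic, cues in CONTEXT_CUES.items() if words & cues)


emails = [
    "I am a student taking a class in geometry and I am stuck on a problem.",
    "Can you help me with this?",
]

guesses = [guess_context(e) for e in emails]
```

The first email contains enough cues to suggest a context; the second yields an empty result because nothing in the text says who is writing or why. No shared schema exists to fall back on, which is exactly what separates this case from repetitive data.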
Context is key
Is establishing the context of nonrepetitive big data important? The answer to that question is an emphatic “yes.” If there is to be a good outcome for whatever nonrepetitive data is found in big data, the most important question is how to understand and establish context. Without context there is the very real likelihood that information derived from nonrepetitive data may be completely misunderstood.
* “Untangling the Definition of Unstructured Data,” by W.H. Inmon, IBM Data magazine, July 2014.