Context is key to deriving analytic value with Hadoop

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

When considering a re-architecture, upgrade or even a switch to a new technology such as IBM PureData System for Analytics, powered by Netezza technology, practically every discussion with enterprise data management leaders returns to big data and unstructured data. The recent rollout of Fluid Query (bravo, Netezza) opens up a whole new canopy of opportunities—as well as frustration for those who were already confused about big data in the first place.

So let’s try to agree on the definition of the various terms being tossed around in this domain, because a lot of confusion stems from misnomers or misplaced understanding. This glossary is by no means definitive, but it does at least reflect reality.

Big data: Since the early 2000s, this meant large-scale structured data. Only a handful of people ever talked about it. One day, big data meant unstructured data and it became a buzzword overnight. Confusion ensued.

Big data context: When we draw data from the vast ocean, we remove it from native context. If we attempt to store the ocean itself, we likewise remove it from context. And context is where data derives its strongest meaning. Let’s say a large corporation acts on what it gleans from its data but its competitors respond with a counter play—so when this data is returned to the ocean, these new activities have changed it. We have to draw again and retry, because acting on the context actually changes the context (as if someone sees the future and acts on it, thereby changing the future before it happens). Big data context is always in flux, hence the need for a constant flow of it.

Analytics: A forensic exercise that tries to attach or derive additional context to or from information, especially correlating it from disparate sources. Time is the greatest enemy of forensics, because it so rapidly erodes context. After all, which is richer in useful context: a 30-minute-old crime scene or one that’s 30 years old? The moment we draw data from a source, the forensic time-erosion clock starts ticking and the data becomes more stale with each passing moment.

Data warehouse: This is an attempt to impose larger-scale context on data from disparate sources, accepting a reasonable “staleness” factor.

Latency: One form of analytics dredges the vast ocean for correlated information. It’s prone to extremely high latency, and analysts may feel minutes passing like hours. Another form manipulates pre-calculated or structured data and is near-zero-latency by design. These analysts want an immersion where results arrive so rapidly that hours pass like minutes. They are far more productive in such “speed-of-thought” immersions.

Hadoop: This consists of storage and retrieval technology with higher latency but a broader range of storage options and contextual capture. Its retrieval paradigm is largely a “noise filter” that tries to strip trash from information, the core mission of MapReduce. Many tech demos and even television shows depict people applying filters to reduce noise and derive greater meaning or more concise context.

Data: These are streams of bytes flowing through systems or stored on media. These bytes have no meaning or value without their context (at a minimum, various forms of metadata).
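As a small illustration of that last definition, the sketch below reads the same four bytes under four different pieces of "metadata". The byte string and the chosen interpretations are invented for this example; the point is only that the bytes say nothing until some context tells us how to read them.

```python
import struct

# Four bytes with no context attached (an arbitrary, made-up value).
raw = b"ABCD"

# Context 1: metadata says "ASCII text".
as_text = raw.decode("ascii")

# Context 2: metadata says "unsigned 32-bit integer, big-endian".
(as_uint_be,) = struct.unpack(">I", raw)

# Context 3: metadata says "unsigned 32-bit integer, little-endian".
(as_uint_le,) = struct.unpack("<I", raw)

# Context 4: metadata says "32-bit IEEE 754 float, big-endian".
(as_float_be,) = struct.unpack(">f", raw)

print("as text:          ", as_text)
print("as big-endian int:", as_uint_be)
print("as little-endian: ", as_uint_le)
print("as float:         ", as_float_be)
```

Nothing about the bytes changed between the four readings. Only the context did, and each context produces a different "fact".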

What do these definitions mean? Whether you accept them or not, realize that the focus isn’t on the data, but its context.

For example, in pharmacological analytics, researchers will compare patients taking certain drug combinations with the outcomes of patients with similar diagnoses or drugs, or even the counteractions of other drugs and so on. The patients’ names are irrelevant. In another example, looking at phone records can uncover an incoming call from an oncologist, then outbound calls to a mother, father and close relatives, followed by calls to various people who arrange funerals, and finally a call to a suicide hotline. Doesn’t all this context construct a narrative? In genealogical searches, connections to relatives are found in documents such as marriage licenses, family Bibles, birth or death certificates and so on. Analysts both derive context from and impose context on these widely disparate artifacts, and the results tell the story.

In short, context is king, not the data itself. If there’s a strong handle on context, the data is already under control. This is why patterns are so powerful, because they are applied to steer or manage the context, indirectly driving the data outcome.

Hadoop’s role and strength in the warehouse is its extraordinary storage capacity for both data and context. Sadly, many folks skimp on the context because it adds bulk to the storage requirements. But isn’t capacity what we wanted Hadoop for in the first place? Not just capacity for storing data, though. We need to set aside even more storage to maintain the data’s context. Still more people ignore context because they don’t see its value, or they don’t understand that richer context, not higher volume, increases the value of data. Analysts will confess that they get the same statistical result with 1,000 values as they do with 100,000, so size truly doesn’t matter.
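The 1,000-versus-100,000 observation can be sketched in a few lines. This is a hedged illustration with synthetic numbers, not anyone's real data: the distribution, its parameters and the seed are all assumptions chosen for the example. It also covers only a simple aggregate (the mean) drawn from a random sample; rare events or small subgroups may still need more rows.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "large" data set: 100,000 synthetic measurements.
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]

# A simple random sample of 1,000 of those values.
sample = random.sample(population, 1_000)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Rough expected spread of a 1,000-value sample mean: stdev / sqrt(n).
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"mean of 100,000 values: {full_mean:.2f}")
print(f"mean of 1,000 values:   {sample_mean:.2f}")
print(f"approx. standard error: {standard_error:.2f}")
```

The two means differ by an amount on the order of the standard error, which is small next to the spread of the data itself. The extra 99,000 values add little to this particular answer, while the context that says what the values measure is what makes the answer usable.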

Analysts scan for data in context, perform algorithms on it in context and then summarize these results to another context entirely. The context drives everything. Without it, data has no meaning whatsoever. Many leaders speak of Hadoop in reverential terms as though simply installing it will solve the world’s problems. But analysts don’t need a lot of data; they need high-quality data that is richly immersed in context.
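The scan, compute and summarize pattern, and the earlier description of MapReduce as a "noise filter", can be sketched in plain Python. This is a toy imitation of the map, shuffle and reduce phases rather than Hadoop's actual API; the records, the `ERROR` prefix rule and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records: (source, text). Invented for illustration.
records = [
    ("web", "ERROR disk full"),
    ("web", "heartbeat ok"),
    ("app", "ERROR timeout"),
    ("app", "heartbeat ok"),
    ("web", "ERROR timeout"),
]

def map_phase(record):
    """Emit (key, 1) only for records that carry signal; drop the noise."""
    source, text = record
    if text.startswith("ERROR"):
        yield (source, 1)

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Summarize each group into a single result."""
    return key, sum(values)

mapped = [pair for record in records for pair in map_phase(record)]
summary = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(summary)  # {'web': 2, 'app': 1}
```

Note that the output lives in a different context from the input: individual log lines went in, and per-source error counts came out. Which lines count as noise is a decision the analyst imposes, not something the framework knows.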

In the book Red Storm Rising, Tom Clancy wrote that the Russian Politburo called upon an analytics team that would always arrive with three versions of its report in hand: the best case, the worst case and the in-between. Before presenting, the analysts would read the sentiment of the leaders present and then choose the report that seemed to best fit their mood. As a result, Politburo members could act only on the information that fit their attitude in the moment, not on the information that might have been most useful to them (because the analysts were too afraid to report it).

Analysts make their living on the relevance and value of their results. Can decision-makers become their own worst enemies by allowing moods or sentiments to affect the analytic results? Of course they can. At one site, a leader said his underlings constantly lied to him and he couldn’t trust any of them. I thought this hard to believe until I attended my first status meeting. He exploded with rage that the schedule was so far behind, and ranted at them, red-faced, for nearly 20 minutes. Were they lying to him, or walking on broken glass and worrying about what might trigger his next tantrum? In such cases, the employees could not trust what their boss would do with their analysis, so they sugarcoated it to avoid a reaction.

How should decision-makers regard their analysts or analytic environments? If they only express contempt for the people and their products, who’s to say that the analysts won’t tell a tale to avoid an angry response? In these instances, Hadoop cannot make a person a better decision-maker or mitigate the emotions or sentiments of the humans involved in the analysis-delivery process. Analytics is a very human process. Hadoop is supposed to make it faster, simpler and so on—but people impose intelligence upon Hadoop, so Hadoop can deliver only at a level compatible with its strongest architect.

Circling back, how does analytics derive value from Hadoop? That is, what is the value of big data in analytics? If analytics is about gaining statistical insight and knowledge from information, smaller samples are just as good as larger ones, so the sheer volume of big, unstructured data adds little value. On the other hand, using Hadoop’s scale to capture deeper context dramatically increases the value of the smaller data samples.

If analytics is about building reports that need to be penny-level accurate, then Hadoop and big data have no real role because MapReduce is by nature a fuzzy process. Accurate report generation is the realm of structured data, which ultimately requires non-trivial, Netezza-scaled power.
