Response: Top 5 Myths about Big Data

Big Data Evangelist, IBM

I've just posted a comment on Brian Gentile's excellent Mashable blog post, "Top 5 Myths about Big Data." I recommend that you read that post. Here's the verbatim text of my response.

"All of these are good points, Brian. You’ve keyed into the 5 top requirements/trends/approaches powering the big data revolution. It’s important to point out, as you have, that big data is much broader in scope. Let me add a few additional thoughts on each myth:

Myth #1: Big Data is Only About Massive Volume. Above and beyond the 3 Vs, it’s important to note that big data is about addressing the leading-edge business challenges that require analyzing massive volumes, real-time velocities, and/or multi-structured varieties of data. I’m talking about the notion of “whole-population analytics” against the entire population of data, rather than just the traditional capacity-constrained samples/subsets. Being able to drill into the entire aggregated population of, say, customer data, including rich real-time behavioral data, enables you to do more powerful micro-segmentation, fine-grained target marketing, nuanced customer experience optimization, and next best action.

Myth #2: Big Data Means Hadoop. As you’ve noted, big data depends on massively parallel processing (MPP), plus in-database analytics, on analytic databases, file stores, document stores, and persistence infrastructures of various sorts: HDFS, HBase, Cassandra, RDBMS-based EDWs, columnar, key-value, graph databases, etc. We’re seeing far more hybrid (Hadoop + EDW + NoSQL + whatever) big data deployments these days than ever before, and the range of hybrid models will continue to grow.

Myth #3: Big Data Means Unstructured Data. I agree with your focus on “multi-structured data,” but disagree with your statement that “the consistent trait of these varied data types is that the data schema isn’t known or defined when the data is captured and stored.” Considering that RDBMS-based EDWs are the established heart of big data, this statement is wrong. Better to say that “these data types vary in sources, in formats, and in when the data schema is known or defined: at capture, at storage, or when the data is used.”

Myth #4: Big Data is for Social Media Feeds and Sentiment Analysis. You should insert the word “only” between “is” and “for” to make this myth crystal-clear, because, clearly, one of the killer apps for big data is social media monitoring and sentiment analysis. But point well-taken: big data is the powerhouse refinery that is enabling businesses everywhere to continuously harvest intelligence not only from social media, but from log, event, sensor, geospatial, and other sources that aren’t strictly “social.”

Myth #5: NoSQL means No SQL. What’s happened is that “NoSQL” has greatly expanded the range of “QLs” in the big data arena, including HiveQL, Cassandra’s CQL, the Semantic Web’s SPARQL, etc. The confusing variety of big data “QLs,” including good ol’ SQL, absolutely demands some sort of query virtualization/abstraction/semantic layer, especially as the back-end data platforms (MPP EDW, Hadoop, NoSQL, graph, etc.) proliferate all over the cloud."
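
To make the "query abstraction layer" idea in Myth #5 a bit more concrete, here is a minimal Python sketch of what such a layer might look like at its very simplest: one backend-neutral query description rendered into more than one "QL." Everything in it is an assumption for illustration. `LogicalQuery`, `translate`, and the `customers` example are hypothetical names, not part of any product or library, and the literal quoting is deliberately naive. A real layer would also have to handle joins, type mapping, pushdown, and each backend's own escaping and data-modeling rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class LogicalQuery:
    """Backend-neutral description of a filtered projection (hypothetical type)."""
    entity: str                                   # table / column family name
    columns: Tuple[str, ...]                      # columns to return
    filters: Tuple[Tuple[str, str], ...] = ()     # (column, value) equality filters


def _literal(value: str) -> str:
    # Naive string quoting for illustration only. Escaping rules differ per
    # backend, and real code should use bound parameters instead.
    return "'" + value.replace("'", "''") + "'"


def to_sql(query: LogicalQuery) -> str:
    """Render the query as a plain SQL SELECT statement."""
    statement = f"SELECT {', '.join(query.columns)} FROM {query.entity}"
    if query.filters:
        predicates = " AND ".join(
            f"{col} = {_literal(val)}" for col, val in query.filters
        )
        statement += f" WHERE {predicates}"
    return statement


def to_cql(query: LogicalQuery) -> str:
    """Render the query as CQL.

    Simplifying assumption: the filter columns are not part of the primary key,
    so the statement is given CQL's ALLOW FILTERING clause.
    """
    statement = to_sql(query)
    if query.filters:
        statement += " ALLOW FILTERING"
    return statement


# Registry of backends. For a query this simple, HiveQL accepts the same text
# as SQL, so it reuses the SQL renderer.
TRANSLATORS: Dict[str, Callable[[LogicalQuery], str]] = {
    "sql": to_sql,
    "hiveql": to_sql,
    "cql": to_cql,
}


def translate(query: LogicalQuery, backend: str) -> str:
    """Look up the renderer for a backend and apply it to the logical query."""
    try:
        translator = TRANSLATORS[backend.lower()]
    except KeyError:
        raise ValueError(f"no translator registered for backend {backend!r}") from None
    return translator(query)


if __name__ == "__main__":
    example = LogicalQuery("customers", ("name", "email"), (("region", "EMEA"),))
    for backend_name in TRANSLATORS:
        print(f"{backend_name}: {translate(example, backend_name)}")
```

The point of the sketch is the shape rather than the details: callers describe what they want once, and the layer decides how to phrase it for whichever back end holds the data. Adding another "QL" means registering one more renderer, not rewriting every application.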