Big Data In Danger of Definitional Overkill
Big data evangelization is not for the faint of heart. At a recent industry panel that I was on, the moderator joked that there must be a petabyte-scale big-data cluster somewhere brimming with definitions of big data.
Imagine trying to cram all of that into one’s puny human cranium. At the very least, as a big-data evangelist you have to be crystal clear on how you yourself define and scope the topic. You must be prepared to defend and elaborate on it constantly. You must listen to and consider other points of view. And you must continually evolve your thinking, because this stuff doesn’t hold still for a moment.
But I can’t help myself: I tend to chuckle when encountering yet another blog, article, podcast or live forum where the focus is on defining “big data.” For my purposes, a simple phrase—“advanced analytics at extreme scale”—summarizes the core scope of this paradigm.
I’m trying not to grow jaded in my old age, though, so I regularly glance at other people’s definitions to see if anything rises above the pointless rehash. For example, there’s this recent article that aggregates 25 (very wordy) definitions of big data sourced from knowledgeable industry observers.
Is there anything noteworthy in any of this? As far as I can see, not much. But I do notice several recurring patterns of emphasis in these definitions, not all of which I agree with.
First, many of these people equate big data with “lots and lots of data.” I take issue. The chief focus of the big-data paradigm is on the cool things you can do with lots and lots of data. IMHO, business and consumer applications that leverage advanced analytics at extreme scales (petabyte, real-time, multi-structured) are the coolest.
Second, many of them equate big data with some vague notion of what “traditional databases” can’t do. Almost nobody makes an effort to define “traditional database” in any concrete way, though they’re usually alluding to relational technology. They tend to ignore the inconvenient fact that many analytics-optimized RDBMSs can and do scale into the petabytes, handle real-time processing, and ingest from multi-structured sources. In other words, relational technology is at the heart of the big-data revolution.
Third, many of them focus their discussion on unstructured data, as if these were the only data types that need massive “3 Vs” scalability. Those who narrow big data in this way tend to narrow their platform focus correspondingly to just Hadoop and NoSQL.
Fourth, none of them incorporates stream computing, complex event processing or in-memory technologies into their definitions. That’s truly odd, as if the three-legged stool of volume, velocity and variety were lacking one of its legs. The extreme scale of real-time, continuous, low-latency data/analytics is fundamental to many big-data applications.
I try not to be argumentative, but when I come across these misconceptions in my daily rounds I must set the record straight. That’s my job.
However you define the term, I’m also a bit jaded about claims that big data has been around as a modern concern since well before I was born. It’s not that my birth signaled the dawn of all human civilization; it’s simply that I’ve been around the block often enough to realize that memory is a slippery thing. There’s a natural human tendency to reverse-engineer the past to conform with the shape of things as they are now or are likely to become.
With that in mind, I chuckled, while learning a thing or two, when reading a recent article called “A Very Short History of Big Data.” It’s a decent historical chronology of information glut, explosion, overload, tsunami...or whatever term you prefer to describe the perennial too-much-ness of it all. And the author crisply laid out their scope right up front—“the story of how data became big”—so that you can quickly grasp the connection to the “big data” of which I, for one, am an evangelist. They start the story 70 or so years ago, which I suppose coincides with World War II.
After so many years, this take on the perennial topic has become such a cultural cliché that we can safely ignore yet another vague rehash of the “data, data everywhere” meme. Trying to follow every popular twist and take on this topic is a recipe for chronic vertigo.
Bringing advanced analytics into the discussion nails big data directly to diverse applications—digital marketing, micro-segmentation analysis, next-best offer and advertising optimization—where there is clear business value. If we scope the advanced analytics discussion even more tightly to those use cases where extreme scale delivers even more business value—as in this blog I wrote—we can articulate a more compelling value story.
Taking this value discussion one step deeper into platforms, I would argue that the best platforms and tools for advanced analytics are those with the headroom to grow from “small data” to extreme scales as your needs evolve.
So, getting back to the “Very Short History” article, I was hoping it would discuss the evolution of data platforms to scale out along any or all of the “3 Vs.” But it doesn’t. So be forewarned.
One last quibble. The article quotes a February 2010 article in The Economist: “Scientists and computer engineers have coined a new term for the phenomenon [of information explosion]: ‘big data.’”
No, no, no! That is absolutely wrong. Marketing professionals coined the term “big data,” and 2010 was the year the data/analytics profession began to march behind it.
I should know. Back in my industry-analyst days, I remember noticing the trend as it gained steam.