Two Weeks' Worth of Koby Quick Hits: July 2-13, 2012

Big Data Evangelist, IBM

Here are the five quick-hit ponderings that I posted on the IBM Netezza Facebook page over the past two weeks, during which I took a much-needed vacation. I went deeper on the themes of all-in-memory, big data's optimal deployment model, frictionless sandboxes, and sexy statistics. And I opened up a fresh topic: advanced visualization.


July 2: All in memory? Don't make RAM a data dumping ground!

Anytime anybody says that any new platform is the place where you should put literally ALL of your data, I get nervous.

Believe it or not, I've never called for any organization to dump all of their data in their enterprise data warehouse (EDW). I've never argued for Hadoop becoming that single uber-bucket either. There are plenty of good reasons why an enterprise should keep various subsets of their data on the EDW vs. Hadoop vs. transactional databases vs. file systems vs. whatever. Performance, security, high availability, compression, and so on – the disparate objectives you're trying to balance, as well as the myriad users and applications you're trying to satisfy, call for nothing less than heterogeneous data platforms.

So it's with skepticism aforethought that I throw cold water on the dream of "all in memory." Especially as we push into high-volume big data, does anybody see any real value in jamming petabytes into a virtualized cloud of RAM, even if it were possible now (which it isn't) and even if it were cost-effective (dream on, dudes & dudettes)?

All-in-memory is a technology-push vision of the highest order. Those who wax poetic on "speed of thought" tend to gloss over the inconvenient fact that most real-world analytic applications – especially operational business intelligence – do just fine in low-velocity (i.e., batch ingest, nightly processing, production reporting) scenarios.

Besides, even futuristic yadda-yadda-yadda-byte in-memory systems will have their capacities. At some point, you'll realize you can't store literally ALL the data in the world in there. You'll need to purge unneeded data and not treat your in-memory platform as a dumping ground.

As you adopt in-memory platforms, you'll need to be as selective regarding what data gets loaded into them as you are with what you put in your EDW, Hadoop cluster, or any other database platform.
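To make that selectivity concrete, here's a toy sketch of one way an admission policy for a capacity-limited in-memory tier might look. The function name, the dataset fields, and the greedy hotness-first heuristic are all my own illustration, not any vendor's actual placement algorithm:

```python
# Hypothetical data-tiering sketch: decide which datasets earn a slot
# in a capacity-limited in-memory tier. All names and numbers below
# are illustrative assumptions, not a real product's policy engine.

def pick_in_memory(datasets, ram_budget_gb):
    """Greedily admit the hottest datasets until the RAM budget is spent."""
    chosen = []
    remaining = ram_budget_gb
    # Hotter (more reads per day) first; skip anything that doesn't fit.
    for d in sorted(datasets, key=lambda d: d["reads_per_day"], reverse=True):
        if d["size_gb"] <= remaining:
            chosen.append(d["name"])
            remaining -= d["size_gb"]
    return chosen

datasets = [
    {"name": "clickstream", "size_gb": 800, "reads_per_day": 50},
    {"name": "hot_lookup", "size_gb": 40, "reads_per_day": 100000},
    {"name": "cold_archive", "size_gb": 5000, "reads_per_day": 1},
]
print(pick_in_memory(datasets, ram_budget_gb=1024))
# ['hot_lookup', 'clickstream'] -- the cold archive stays on disk
```

The point of the sketch is the shape of the decision, not the heuristic itself: some explicit policy gates what gets into RAM, just as one should gate what lands in the EDW or the Hadoop cluster.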




July 3: Big data's optimal deployment model? Whatever it is, virtualize it to a fare-ye-well!

You think big data architectures will become monocultures of "Hadoop uber-alles" or whatever-uber-alles at some point in the near future? Fat chance.

As I've said elsewhere, hybrid big data deployment models will come to predominate in the most complex business and cloud architectures. Most established data analytic platforms will live on, supplemented by the newer approaches in an extraordinarily heterogeneous menagerie of "fit-for-purpose" platforms.

We are in the golden age of database innovation. The "1000 flowers" are relational, columnar, OLAP star schema, stream computing, Hadoop/HDFS, Hadoop/HBase, Cassandra, NoSQL, key-value, RDF triple-store, graph, file, document, in-memory, dimensionless, and on and on. The cross-pollination and recombinant gene-splicing among all these adjacent species will not stop. At IBM, we stay busy as a bee evolving our own portfolio of best-of-breed data platforms to stay ahead of the curve. Even if we wanted to slack off (which we don't), our customers would insist on no less.

Whatever complex big data architecture you implement, and however many tiers and database species it comprises, your primary concern should be simplicity where it counts. Net-net, "where it counts" is simplicity of deployment, access, administration, monitoring, optimization, troubleshooting, and governance. And what that demands is "virtualization," in the broadest sense of an open big data framework that abstracts away and decouples the external interfaces from the sprawling complexities underneath.

In this context, the "virtualization" framework would require that we evolve big data's modeling, metadata, and "QL" layer (SQL, HiveQL, Cassandra QL, etc.) in the context of a broader infrastructure abstraction that also encompasses Service-Oriented Architecture, Semantic Web, and Representational State Transfer.
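A minimal sketch of what that abstraction layer implies: callers query one logical catalog, and per-platform adapters translate the request into each backend's native "QL." The class and method names here are my own illustration, not any shipping product's API:

```python
# Toy sketch of the "virtualization" idea: one logical query interface
# dispatching to heterogeneous backends. FederatedCatalog, SqlBackend,
# and their methods are hypothetical names for illustration only.

class FederatedCatalog:
    def __init__(self):
        self._backends = {}  # dataset name -> backend adapter

    def register(self, dataset, backend):
        self._backends[dataset] = backend

    def query(self, dataset, predicate):
        # Callers see one interface; each adapter would translate the
        # logical predicate into its native dialect (SQL, HiveQL, etc.).
        return self._backends[dataset].run(predicate)

class SqlBackend:
    """Stand-in for an adapter over a relational store."""
    def __init__(self, rows):
        self.rows = rows

    def run(self, predicate):
        return [r for r in self.rows if predicate(r)]

catalog = FederatedCatalog()
catalog.register("orders", SqlBackend([{"id": 1, "total": 99},
                                       {"id": 2, "total": 12}]))
print(catalog.query("orders", lambda r: r["total"] > 50))
# [{'id': 1, 'total': 99}]
```

The hard part, of course, is everything the toy elides: shared metadata and modeling across backends, pushing predicates down efficiently, and governing it all – which is exactly why the question below is worth asking.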

Who's doing this – yet?



July 11: Frictionless sandboxes? Real-world experimentation thrives in an elastic medium.

Real-world experiments are essentially a continuous campaign of business optimization. And that involves gearing up to develop and refine analytic models of arbitrary scale and scope. Your data scientists cannot truly predict whether the next business challenge will require petabytes of fresh, on-demand, real-time data from 100s of sources across the wild wild web to stoke their models. They may not need much beyond the low terabytes of relational data batched down from your CRM system.

Brilliant insights can and often do emerge from small data. If you create a culture that insists on scaling your data-science sandboxes to the peak of Mount Petabyte on every project, no matter how mundane, you're going to plunge down the crevasse called overkill.

An elastic sandbox lets you put down a solid foundation in the small-data base camp, which is probably where your data scientists will spend most of their time. It's always good to have the option of spinning up a Sahara-scale sandbox in the cloud, with all the storage, processing, and memory your data scientists might ever need. But chances are that you can optimize most of your churn, upsell, experience, and other analytic models with far fewer grains of data under your fingernails.




July 12: Sexy statistics? Big data "scores" seem oddly misguided.

Can you attach a "score" to your big data requirements or environment? Sure. Aren't the "3 Vs" intrinsically quantitative? Volume, velocity, and variety are all measurable with yardsticks we generally agree on: bytes of storage, seconds of latency, and number of data sources and formats.

So why does the "big data score" that Mike Gualtieri of Forrester Research has developed seem oddly beside the point? Mike computes the score in a matrix that consists of two axes: the big data Vs on the vertical, and big data activities (store, process, and query) on the horizontal. In each cell, he asks users to rate their organization's ability to "handle" the corresponding scale and activity dimension. He sums up the scores across all cells and voila!
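For concreteness, here's a rough sketch of how such a scorecard tally might work. The 1-5 rating scale and the sample ratings are my own illustrative assumptions, not Forrester's actual methodology:

```python
# Hypothetical sketch of a Gualtieri-style "big data score" tally.
# The 1-5 rating scale and sample ratings below are illustrative
# assumptions, not Forrester's published methodology.

V_DIMENSIONS = ["volume", "velocity", "variety"]   # the "3 Vs" (rows)
ACTIVITIES = ["store", "process", "query"]         # activities (columns)

def big_data_score(ratings):
    """Sum self-assessed 1-5 ratings across the 3x3 matrix."""
    total = 0
    for v in V_DIMENSIONS:
        for activity in ACTIVITIES:
            rating = ratings[(v, activity)]
            if not 1 <= rating <= 5:
                raise ValueError(f"rating out of range for {(v, activity)}")
            total += rating
    return total

# Example: an organization that rates itself mid-scale everywhere.
sample = {(v, a): 3 for v in V_DIMENSIONS for a in ACTIVITIES}
print(big_data_score(sample))  # 9 cells * 3 = 27 (possible range: 9 to 45)
```

Mechanically, nothing is wrong with the arithmetic; my objection below is to what the single number leaves out.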

"Voila" what, exactly? He arranges the self-scores on a spectrum from "perfect" to "poor" handling of big data. But there's no business context to this self-evaluation. It doesn't even begin to address the modern business drivers, such as the need for dynamic cross-channel marketing optimization, that are spurring investments in Hadoop, stream computing, and other big data approaches. He also doesn't address whether an organization can easily evolve its existing data analytics investments to address any or all of these scale and activity dimensions. Inability to "handle" those big data requirements in your current operations doesn't mean you lack the platforms and other resources in-house to make up for lost time.

Hey Mike: interesting statistic, but still not practical enough for big data professionals to use as a planning and architectural guide.




July 13: Advanced visualization?

Advanced visualization tools are everywhere in the data analytics arena. IBM offers them, as do many other vendors: for heavy-hitting analytics by data scientists, for self-service business intelligence for the rest of us, and for an infinite variety of specialized applications.

Once a technology goes mainstream, at what point does it stop being "advanced" and simply become mainstream (aka "basic" and "traditional")? The same thought applies to the "advanced" in "advanced analytics." Is this like the "big" in big data? Does the threshold of "advanced" continue to push out over time as new techniques are disseminated and adopted widely?

For me personally, "advanced" visualization is whatever new, cool, immersive, interactive graphical interface I don't have and totally want right now. 3-D? Geospatial? Multisensory? Dynamic? Context-sensitive? Yes, yes, yes. Bring on all the eye candy you can throw my way!

What's "advanced" visualization mean to you?



At the end of this fortnight, I'm both relaxed and stimulated. We're preparing for a great second half here at IBM. We have lots of cool things in the works. Stay tuned.