Week's Worth of Koby Quick-Hits: June 4-8, 2012

Big Data Evangelist, IBM

Here are the quick-hit ponderings that I posted on the IBM Netezza Facebook page this past week. I went deeper on machine learning, continued my meditation on all-in-memory, put out some more Hadoop thoughts in advance of next week's Hadoop Summit (where IBM's Anjul Bhambhri will speak on convergence of Hadoop and data warehousing), and tried to anchor social sentiment in the nitty-gritty of behavioral propensity. I opened up a new thread of meditation: the value of proofs of concept (POC) in the data warehousing (DW) appliance procurement process.

Here are this past week's quick-hits:

June 4: Sexy statistics? Machine learning is even sexier!

Back in the day, when pocket calculators were brand new, your daily math chores suddenly became far less painful. You just punched in the numbers and got the results lickety-split, without the error-prone pen-and-paper exercise you learned as a schoolchild. When spreadsheets emerged from the primordial goo of personal computing, you marveled at how little you needed to know or care about the nuances of net present value calculation.

Let us now jump to the modern age of rapid statistical analysis performed by powerful servers under the control of your company's esteemed data scientists. Do you really need to care what makes this practical magic come to life? Probably not, if all performs as expected. Your data scientists are probably blasé about it all by now. If you ask them what's sexy now, they'll probably wax eloquent about new frontiers in machine learning. And if they're doing Hadoop, they'll probably be playing with the Apache Mahout machine-learning library, which we're seeing in more big data implementations.

One of the chief reasons your data scientists get into machine learning is that it makes their lives easier and more productive. It allows them to train a model on an example data set, and then leverage algorithms that automatically generalize and learn both from that example and from fresh feeds of data.
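The train-then-generalize workflow described above can be sketched in a few lines. This is a minimal, hypothetical illustration (a nearest-centroid classifier with invented feature values), not any particular library's API: the model is fit on a labeled example set, then applied to fresh observations it has never seen.

```python
# Minimal sketch of the workflow above: train a model on an example
# data set, then let it generalize to fresh feeds of data.
# Features and labels here are hypothetical.

def train_centroids(examples):
    """Compute a per-class mean (centroid) from labeled example rows."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def classify(centroids, features):
    """Assign a fresh observation to the nearest class centroid."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Train on a labeled example set: (monthly spend, visits per month).
training = [
    ([120.0, 9], "loyal"),
    ([95.0, 7], "loyal"),
    ([10.0, 1], "churn-risk"),
    ([15.0, 2], "churn-risk"),
]
model = train_centroids(training)

# Score a fresh observation with the trained model.
print(classify(model, [100.0, 8]))
```

The real appeal for data scientists, of course, is that production libraries automate exactly this fit-and-score loop at far greater scale and sophistication.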

Learning from new data? Fresh learning is always sexy. But of course, I'm a geek. See my prior quick-hit on this topic here.


June 5: All in memory? Trades off volume and variety for the sake of velocity

I was pondering this concept of "all in memory." Yeah, we all know it refers to very fast random-access memory (RAM) in computers. But keeping all of your data in "institutional memory" is what big data, in the larger context, is all about.

Institutional memory thrives on all the "Vs" of big data. Can we store years' worth of historical data? Can we consolidate into that repository the data from all sources? Can we combine structured and unstructured data, and support unified search, query, retrieval, analysis, and visualization of it all? Can we discover, process, and deliver it all instantaneously to all users and applications? Can we archive and purge older, unneeded data while retaining a persistent index of what we disposed of?

Today's trendy focus on "all in memory" isn't really about volume: the economics and technology haven't progressed to the point that anybody can persist petabytes cost-effectively in RAM. And it isn't really about variety: the server-centric all-in-memory platforms are almost entirely for structured data, as are most of the client-side in-memory tools.

What they do focus on is providing a platform for speed-of-thought data exploration and visualization, which is certainly important, but not central to big data's promise. What is central to today's big data initiatives is massively parallel server-side in-database analytics on data stored on rotating disk.

Clearly, though, the all-in-memory peta-scale in-database platforms and applications will come in a few years' time.


See my previous quick-hit on this topic here

June 6: Hadoop uber-alles? No, it's actually MapReduce uber-alles.

There's this persistent misconception that Hadoop is, fundamentally, the Hadoop Distributed File System (HDFS)--that they're one and the same thing. Not true.

The core of Hadoop is MapReduce. You can find Hadoop environments that don't store data in HDFS, but, rather, in HBase, Cassandra, or even proprietary relational databases. Likewise, you can do without every other non-MapReduce subproject and still have an environment that is, at its heart, Hadoop.

As I've stated in other contexts, what's revolutionary about MapReduce is that it is the industry's first open, vendor-neutral framework for building data analytic models. It is also a runtime environment for executing these models, broken into the primitive "map" and "reduce" functions, in massively parallel processing (MPP) distributed platforms. Check out the MapReduce 2.0 functionality, aka YARN, which is still in alpha and introduces enhancements for resource management and job scheduling.
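To make the "map" and "reduce" primitives concrete, here is the classic word-count example in miniature. This is a single-process toy, not Hadoop itself: a real job shards the records across an MPP cluster and the framework performs the shuffle between the two phases.

```python
# Toy illustration of the "map" and "reduce" primitives at the heart
# of the MapReduce model. A real Hadoop job distributes these phases
# across a cluster; here they run in one process for clarity.

from collections import defaultdict

def map_phase(record):
    """Mapper: emit (word, 1) pairs for each word in an input record."""
    for word in record.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: aggregate all values for a key into a single result."""
    return (key, sum(values))

records = ["big data big analytics", "big data warehousing"]
intermediate = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["big"])  # → 3
```

Any computation you can decompose into these two primitives parallelizes the same way, which is exactly why vendors far beyond the database world are adopting the model.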

Interestingly, we're seeing almost as much uptake of MapReduce among data integration vendors as among analytic modeling and database companies. Mapping and reducing are becoming the core MPP primitives of this new analytics ecosystem.


See my previous quick-hit on this topic here

June 7: Social sentiment as valuable market intelligence? Only if sentiment correlates with buying behavior

People's actions tend to belie their words. Often, we're entirely unconscious of our core motivations, or, at best, unreliable narrators of our own life stories.

All of which explains why I haven't totally bought into the notion that the words we post on social media, or even the "likes" we click with merry abandon, give us the straight story on people's buying intentions. Think of the old English proverb: "If wishes were horses, beggars would ride."

A marketing intelligence dashboard that only shows social sentiment is skewed to what's "trending" at any point in time. Far more useful are analytics that show the extent to which this awareness and sentiment correspond to people's actual propensity to buy whatever you're selling. For that, you'll need to correlate social sentiment with actual purchase data.
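As a hypothetical sketch of that correlation step, consider pairing weekly net-positive sentiment scores with the same weeks' unit sales and computing a Pearson correlation coefficient. The figures below are invented for illustration; the point is that sentiment only earns a place on the dashboard if a relationship like this actually holds in your data.

```python
# Hypothetical sketch: does social sentiment track actual purchases?
# Both series below are invented sample data for illustration only.

import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sentiment = [0.2, 0.5, 0.4, 0.8, 0.9]  # weekly net-positive sentiment score
purchases = [110, 150, 140, 210, 230]  # units sold in the same weeks

r = pearson(sentiment, purchases)
print(round(r, 2))
```

A coefficient near 1.0 would suggest sentiment is a usable leading indicator for this product; a coefficient near zero would tell you the chatter is just chatter.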

As always, what you need is cash, not the fickle love of the chattering masses. Last time I checked, tweets still aren't legal tender.


See my previous quick-hit on this topic here

June 8: Proofs of concept as core appliance acquisition approach?

Netezza built its business not just on data warehousing appliances, but on the proof of concept (POC) as the best approach for demonstrating the value of DW appliances during the procurement cycle. In this, Netezza was very much a pioneer, and the rest of the DW industry followed its lead.

Why are POCs the best approach for buyers to evaluate DW appliances? It's because nothing demonstrates the value you'll receive better than having a DW appliance on your premises loading your data and executing your queries. Then, and only then, can you see the extent to which the appliance offers a price-performance advantage compared with the "roll-your-own" approach, with your existing DW, and/or the competition. In a world where everybody promises "10x" improvements, this is the only meaningful way to vet all the claims on your own terms.

A POC is also a proof of commitment by the vendor. Are they committed to configuring the DW appliance to your specific requirements? Are they committed to sending a sales-engineering team to your facility for as long as the evaluation requires? Are they committed to tuning the appliance on site to achieve peak performance? Are they committed to working through any technical issues that may arise? Are they willing to let you kick the tires on your own for a limited period?

If they're not committed on all these fronts, they're not truly committed to your business. IBM Netezza is, and we've always been known for superior customer satisfaction.


At the end of the week, I'm pondering what I'll hit on next week. Often, I've only got a glimmer of a clue what I'll quick-hit. That helps me keep it fresh. I like to surprise myself.
