Unstructured Data & Structured Data - Complementary, not Divisive
2011 finished with the European leg of the 2011 Global Netezza Roadshow. We shared product updates and customer case studies, and got together with our community in Paris, London, Milan and Frankfurt. A year ago I wrote about a great restaurant in Milan that I visited when I was there for the Netezza roadshow in 2010 and, true to form, I ate well on this trip too - especially at the place that has a chocolate board as well as a cheese board!
The European VP of our customer Catalina Marketing presented at the Milan Roadshow; she told us about their biggest table - 1.2 trillion rows. That is staggering. I often tell the Catalina story – how they do real-time scoring of your ‘basket’ at the supermarket checkout, and print out a personalized money-off coupon on your till roll. It’s a great example of big data analytics in action. But it is a story of relational big data (very big - Catalina have over 2 petabytes of data under management on IBM Netezza appliances).
In Milan, as at our London roadshow, I was presenting the IBM vision for managing big data that isn’t only relational. Many of the case studies from the niche players for unstructured big data (usually internet-sourced data) are isolated from the corporate data warehouse; sometimes they appear to be in competition with the corporate (relational) data warehouses. It has struck me that there’s an element of the OODBMS zealotry amongst some of this crowd. And it really isn’t helpful for corporate data architects trying to get a grip on the implications of this new world and forge a big data vision for their employers. Those architects want all their data assets to be exploited for maximum value, but the old spectre of data silos re-emerges with the prospect of a sprawl of unconnected data marts, some structured, some unstructured. This prospect probably isn’t even helpful to the NoSQL proponents, who aren’t going to destroy SQL for analytics and so will have to co-exist with it.
But one of the plus points of being in IBM is that I have the widest palette of data management tools available to me in-house to paint my picture of big data management. I can blend real-time streaming data capability, Hadoop-based MapReduce capability and relational analytic appliances. Which is just as well, because wherever I look in the IBM customer base at big data stories, if there is unstructured data being used it is almost always being re-used, to deliver additional insight, in conjunction with corporate relational data assets. Whether it is PNNL processing streaming smart grid data and then re-analysing it at rest to build predictive models of grid behavior, or sentiment analysis of a retailer’s web presence being subsequently combined with customer purchasing history to predict buyer intentions, the pattern that emerges is of multiple technologies rendering multiple insights, often using internet and corporate data sources together.
So, to return to the beleaguered corporate big data architect, I think that there is no single technological silver bullet for all your analytic workloads. There’s not even one exclusive technology for each of your Internet data sources. To deliver all the analytic workloads on all the data sources we’ll be moving data around, all within one logical data warehouse, combining it to support the different use cases that each deliver their own value to the organization. But that doesn’t mean I’d support an emergent adhocracy of polymorphous data marts. A sprawl of data marts has already proved problematic for many organizations. What I think it means is that the successful corporate big data architectures will combine use case-driven delivery with an infrastructure that can put the right data in the right place, in the right combinations, to support those use cases.
And yes, I am talking Smart Consolidation again, and yes, I would say fair enough if you raise the Mandy Rice-Davies objection: what else would an IBMer say when his employer is the only player in the market who has all the products to build such an architecture? All I’d say in my defense is that if a friend were considering joining some big data niche vendor, the first thing I would recommend they ask, as a measure of their prospective employer’s credibility and prospect of longevity, would be ‘what’s your strategy for co-existence with the relational world and other big data niches?’
Credit to Gartner Group for the concept of the logical data warehouse.
On ‘an emergent adhocracy of polymorphous data marts’: yup, that is possibly one of the top ten bollox sentences you’ve ever read, unless you’ve read much post-structural analysis, in which case it’s not even on the radar. But I couldn’t resist it, and anyway it sounds better than ‘lots of data marts springing up around the enterprise on different technologies, each addressing one use case without reference to any overall architecture’.