Week's Worth of Koby Quick Hits: June 25-29, 2012

Big Data Evangelist, IBM

Here are the quick-hit ponderings that I posted on the IBM Netezza Facebook page this past week. I went deeper on the themes of sexy statistics, Hadoop uber-alles, smartphones as big data analytics platforms, and big data's optimal deployment model. And I opened up a fresh topic: frictionless sandboxes.

June 25: Sexy statistics? Influence scores are dangerously addictive.

Influence is the new black. Everybody, it seems, boasts of their influence on social networks. Klout has built its business on helping people track their influence continuously with a metric of its own concoction. If you subscribe to its service, it measures your influence based on the incidence of Twitter retweets and mentions; Facebook comments, wall posts, and likes; LinkedIn comments and likes; Google+ comments, reshares, and plus-ones; and so on.

Influence scores are addictive because they appeal to our vanity and very human desire for social status. However, their validity is questionable. True influence is your ability to sway others' minds and actions. If all you're doing is prompting others to exclaim "amen" or its equivalent to your next statement, that's not influence. It's simply an opportunity for others to cite third-party corroboration for something they already believe.

True influence is your ability to transform your social presence into a sort of recommendation engine. How many people buy something, or decide not to buy it, based on your recommendation, or, at the very least, on your example? How many decide to vote for one candidate over the other based on your endorsement?

Few people truly influence others' behavior entirely, or even largely, through their social media activities. Your offline presence – as a family member, real-life friend, neighbor, or colleague – probably has a more substantial influence on others' words and deeds.

Klout and similar scores are the equivalent of "page impressions." These statistics primarily measure the number of social eyeballs exposed to your words. You're generating impressions, but not necessarily impressing anybody.


See my prior quick hit on this topic


June 26: Hadoop uber-alles? Disappointed there's no grand vision for Hadoop evolution

The recent Hadoop Summit was exciting. The big data industry continues to innovate at a feverish pace. One of the key summit themes was the maturation of Hadoop into the nucleus of the next-generation data warehouse (DW). At IBM, we see that as an inevitable trend, and I presented our vision of the "Hadoop DW."

Beyond the various discussions at the summit that corroborated this trend, a dominant theme was the core Hadoop codebase's ongoing evolution. One of the key Apache Hadoop committers, Hortonworks, discussed the key components of Hadoop 2.0, the alpha for which was released in May. Key new Hadoop features include high availability and federation in HDFS and support for alternate programming frameworks in MapReduce.

Those are much-needed features, but, truth be told, I was a bit disappointed.

For starters, these new Apache Hadoop capabilities were presented without any unifying rationale. Nobody stepped forward to present a coherent vision for Hadoop's ongoing development. When will development of Hadoop's various components – MapReduce, HDFS, Pig, Hive, etc. – be substantially complete? What is the reference architecture within which these other Hadoop services are being developed under Apache's auspices?

Also, nobody discussed where Hadoop fits within the growing menagerie of big data technologies. Where does Hadoop leave off and various NoSQL technologies begin? What requirements and features, best addressed elsewhere, should be left off the Apache Hadoop community's development agenda?

And nobody called for a move toward more formal standardization of the various Hadoop technologies within a core reference architecture. The Hadoop industry badly needs standardization to support certification of cross-platform interoperability. This, plus the continued convergence of all solution providers on the core Apache Hadoop codebase, is the only way to ensure that siloed proprietary implementations don't stall the industry's progress toward widespread adoption.

Big data standards must go well beyond Hadoop to encompass the full sprawling tableau of technologies. The larger picture is that the DW is evolving into a virtualized cloud ecosystem in which relational, columnar, and other database architectures will coexist in a pluggable big data storage layer alongside HDFS, HBase, Cassandra, graph databases, and other NoSQL platforms. These specifications will form part of a broader, but still largely undefined, service-oriented virtualization architecture for inline analytics. Under this paradigm, developers will create inline analytic models that deploy to a dizzying range of clouds, event streams, file systems, databases, complex event processing platforms, and next-best-action platforms.

In my opinion, Hadoop's pivotal specification in this larger evolution is MapReduce. Within the big data cosmos, MapReduce will be a key unifying development framework supported by many database and integration platforms. Already, IBM supports MapReduce models in both our Hadoop offering, InfoSphere BigInsights, and in our stream-computing platform, InfoSphere Streams.

The ability to plug alternate programming frameworks into MapReduce, under Hadoop 2.0's "YARN" specification, will make this vision possible.
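For readers less familiar with the programming model at the heart of this argument, here is a toy, single-process sketch of the map, shuffle, and reduce steps in plain Python. It is only an illustration of the pattern, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_mapreduce) and the word-count task are my own assumptions, and a real framework would distribute each phase across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit one (word, 1) pair for every word in a text record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(records):
    """Run the three phases in sequence on an in-memory list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())


# Hypothetical input: two tiny "documents".
counts = run_mapreduce(["big data", "big clusters"])
```

The point of the pattern is that only map_phase and reduce_phase carry the application logic; everything else is plumbing that a platform can supply, which is why the same model can, in principle, be hosted by very different engines.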


See my previous quick hit on this topic


June 27: Smartphones as big data analytics platforms? Gadgets are becoming valuable data sources

OK, maybe you and I won't use our personal smartphones as data warehouses anytime soon. Even if storage economics improves to the point that dirt-cheap palm petabytes are mainstream, your employer may be loath to put all that sensitive data into devices that are lost and stolen with nauseating regularity.

But it's quite clear that smartphones will become extremely important sources of the data pouring into Hadoop, NoSQL, and other big data platforms. As everybody knows, the trend is toward organizations moving most of their transactional, productivity, and e-commerce applications to mobile clients such as smartphones and tablets.

Your ability to optimize service delivery increasingly depends on your ability to capture, correlate, and analyze massive streams of gadget-sourced data at the device, application, and user levels. Every transaction, interaction, event, signal, ambient, behavioral, geospatial, and other datum that you can acquire from employee and customer gadgets will be crunched by Hadoop and other ravenous big data platforms. Your ability to personalize, optimize, geolocate, and secure your mobile applications will depend critically on big data analytics.

The gadgets themselves may evolve into big data platforms. But, even if they don't, big data platforms will starve if not fed continuously by the gadgets of the world.


Read my previous quick hit on this topic


June 28: Big data's optimal deployment model? Revisiting the 3-tier topology.

If the "Hadoop DW" is the future of enterprise big data, what's its optimal topology? Does the time-proven 3-tier topology of enterprise DWs apply in this brave new world? Does it make sense to partition your Hadoop clusters into separate specialized tiers of access, hub, and staging nodes?

I'm not at all sure.

As with traditional DWs, centralization of all Hadoop DW functions onto single clusters – the "one-tier topology" – has its advantages, in terms of simplicity, governance, control, and workload management.

Hub-and-spoke Hadoop DW architectures become important when you need to scale back-end transformations and front-end queries independently of each other, and perhaps also provide data scientists with their own analytic sandboxes for exploration and modeling.

However, the huge range of access points, applications, workloads, and data sources for any future Hadoop DW demands an architectural flexibility that traditional DWs, with their operational BI focus, have rarely needed.

In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured.

In the hub tier, you may need disparate clusters configured with different underlying data platforms – RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, etc. – and corresponding metadata, governance, and in-database execution components.

And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.

Yes, more tiers can easily mean more tears: the complexities, costs, and headaches of these multi-tier hybridized architectures will drive you toward greater consolidation, where it's feasible. But it may not be as feasible as you wish.

The Hadoop DW will continue the long-term trend in DW evolution: movement away from centralized and hub-and-spoke topologies toward the new worlds of cloud-oriented and federated architectures. The Hadoop DW itself is evolving away from a single master "schema" and more toward database virtualization behind a semantic abstraction layer. Under this new paradigm, the Hadoop DW will require virtualized access to the disparate schemas of the relational, dimensional, and other constituent DBMSs and repositories that together form a logically unified cloud-oriented resource.

Our best hope is that the abstraction/virtualization layer of the Hadoop DW architecture will reduce tears, even as tiers proliferate. If it can provide your big data professionals with logically unified access, modeling, deployment, optimization, and management of this heterogeneous resource, wouldn't you go for it?


See my previous quick hit on this topic


June 29: Frictionless sandboxes?

The other day, a colleague discussed the growing need for data scientists to quickly provision – and then just as quickly de-provision – very large big data analytic sandboxes. Marketing campaigns are an obvious use case where this frictionless sandbox provisioning would come in quite handy. What if you could spin up petabyte sandboxes in the cloud to support fast development of MapReduce churn and upsell models to drive your next campaign? What if you could rent out a piece of a big data cloud – storage, processing, and other resources – for the duration of that campaign, and either purge that data when the campaign is over or archive subsets to cheap cloud storage when no longer needed for regression analysis and scoring?
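To make the provision-then-de-provision lifecycle concrete, here is a minimal Python sketch that models a sandbox as a context manager. Everything in it is hypothetical: the Sandbox class, the campaign_sandbox function, and the in-memory ARCHIVE dictionary are stand-ins I made up for illustration, not any vendor's cloud API. The idea it shows is simply that teardown (archive the subsets you want, purge the rest) is guaranteed to run when the campaign's work ends.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Sandbox:
    """Hypothetical stand-in for a rented slice of a big data cloud."""
    name: str
    capacity_tb: int
    datasets: dict = field(default_factory=dict)
    active: bool = True


# Hypothetical stand-in for cheap cloud archive storage.
ARCHIVE = {}


@contextmanager
def campaign_sandbox(name, capacity_tb, archive_keys=()):
    """Provision a sandbox, then archive selected datasets and purge the rest."""
    sandbox = Sandbox(name, capacity_tb)
    try:
        yield sandbox
    finally:
        # Archive only the subsets that were asked for...
        for key in archive_keys:
            if key in sandbox.datasets:
                ARCHIVE[(name, key)] = sandbox.datasets[key]
        # ...then purge everything and release the sandbox.
        sandbox.datasets.clear()
        sandbox.active = False


# Usage: a sandbox that lives only for the duration of one campaign.
with campaign_sandbox("summer-upsell", capacity_tb=1000,
                      archive_keys=["scores"]) as sb:
    sb.datasets["raw_clicks"] = [1, 2, 3]
    sb.datasets["scores"] = [0.9, 0.1]
```

After the with block exits, the sandbox is inactive and empty, and only the "scores" subset survives in the archive.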

What do you think? Is this degree of on-demand sandboxing a key requirement in your big data initiatives?



I'm gearing up for a short vacation. You will see quick-hits from me on the 5 days I'm working over the next 2 weeks: Mon-Tues July 2-3 and Wed-Fri July 11-13.
