Week's Worth of Koby Quick Hits: July 16-20, 2012

Big Data Evangelist, IBM

Here are the five quick-hit ponderings that I posted on the LinkedIn Big Data Integration group this past week as discussion starters. I went deeper on the themes of information glut, advanced visualization, social sentiment, and Hadoop. And I opened up a fresh topic: complex event processing.


July 16: Information glut? Don't kid yourself. You'd never tolerate the other extreme.

People love to gorge themselves on the hip addictions of the modern day. The street scene of the '90s was everybody walking with Starbucks cup in hand, blithely self-caffeinating on Seattle's venti-est. The equivalent scene of the '10s is everybody clutching a smartphone, slurping up every fresh tweet or text that springs forth.

If anything, we pride ourselves now on how much information we can cram down our mental maws without choking. What's hip these days is having massive bandwidth on the personal level, or, at least, appearing to. We gripe about the too-muchness of modern life while plunging ourselves ever deeper, confident that we will always be able to surf our merry way over it all. The people who complain about "too much information" are generally the ones who complain about how damn busy they are. These are boasts veiled as complaints.

But take away somebody's access to the too-muchness and you'll never hear the end of it. The notion of "glut" only makes sense if you have a clear sense of how much information your greedy head truly needs to be content. Never quite satisfied with the digital cornucopia, we connect to every vaguely interesting source, build every potentially useful report, construct every reasonably enlightening visualization, subscribe to every new stream, adopt every available app or gadget to access it all, and so on.

For each of us, the day will come when we rid ourselves of the unnecessary information, sources, channels, gadgets, and the like. And it won't be because we are full up, but because we have played with everything and found our own specific personal blend. We'll dump the rest because it's empty of whatever information we now regard as most meaningful in our lives.

Whatever bloat we once felt was ephemeral, a product of our personal experimentation with the overwhelming richness of the modern infocopia.


See previous quick hit on this topic   


July 17: Advanced visualization? What's advanced is every possible visualization, or just one juicy one

Maybe I've become a bit jaded. I've come to the opinion that visualizations are a dime a dozen. Some seem more "advanced" than others simply because they're more complex and unfamiliar than the ones--such as line charts, histograms, and scatterplots--that we take for granted. Nevertheless, we marvel at creative new ways to visualize quantitative information or bring deep scientific patterns to light.

When it comes to business intelligence and analytic tools, most of the leading products, including IBM Cognos, offer an impressive range of visualizations. In fact, BI users now take for granted that they can render data in a staggering range of visualizations. They also have come to expect that the range of visualizations will continue to grow, far outstripping their ability to determine which rendering is best for which type of data or analysis.

Today's situation with advanced visualization is analogous to the introduction of user-configurable fonts into word-processing programs a quarter-century ago. At first, the unfamiliar ones seemed advanced in their funky way. But after futzing for a while with Bookman Old Style vs. Lucida Bright vs. Trebuchet MS, most of us have just defaulted to Times New Roman or Verdana in almost every context. We're happy to have cool options that we almost never use.

I've come to think that the most advanced visualizations are the ones so plain in their utility that they don't call attention to their own cleverness. I live in the Washington DC area and can point to a perfect example: our Metrorail map. This is a complex rendering containing lots of information, but it's so stunningly well-composed that you never feel for a moment you can't find where you are or need to go, or by what route.

I don't want alternate visualizations of the Metrorail system. That would confuse the heck out of me.


See previous quick hit on this topic


July 18: Social sentiment as valuable market intelligence? Less valuable when your competitor has access equivalent to yours

Monitoring the socials is "table stakes" to compete in the world of Smarter Commerce. It's not, at its most basic level, differentiating when you and all your competitors are doing it. Why? Because if everybody's tracking the same feeds with the same tools and the same models in roughly the same way, the best you have is a common understanding of customer sentiment, awareness, and propensity. Nobody has a deeper dive into the customer's mind.

From a competitive standpoint, your chief differentiation will be in combining social sentiment with rich feeds of data to which only you have privileged access--by which I mean the deep profile, transaction, and interaction data you have on your own customers. It's for this reason that your investments in Hadoop and stream computing as big data platforms for sentiment monitoring must be accompanied by deep integration with the customer data in your enterprise data warehouse, data marts, and other databases. That's why we stress that enterprises must architect their big data platforms to include Hadoop (e.g., IBM InfoSphere BigInsights), stream computing (e.g., IBM InfoSphere Streams), and DW (e.g., IBM Netezza, IBM Smart Analytics System) as core components.

If you can continue to deepen your understanding of your own customers' sentiment, you can improve loyalty, upsell, cross-sell, campaign effectiveness, and other initiatives that have a clear bottom-line impact. Just as important, you can strengthen the bond with your customers so that rivals are far less likely to win them over.
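To make the differentiation point concrete, here is a minimal sketch of the idea of enriching a commonly available sentiment feed with privileged internal customer data. All field names, values, and the `enrich` helper are invented for illustration; they do not reflect any IBM product's data model.

```python
# Hypothetical sketch: join public sentiment scores with internal
# customer profiles to get a view competitors cannot replicate.
# All field names and data are illustrative.

sentiment_feed = [  # what everyone monitoring the socials can see
    {"handle": "@alice", "sentiment": 0.8},
    {"handle": "@bob", "sentiment": -0.4},
]

crm_records = [  # privileged internal profile/transaction data
    {"handle": "@alice", "customer_id": 101, "lifetime_value": 5400.0},
    {"handle": "@bob", "customer_id": 102, "lifetime_value": 120.0},
]

def enrich(feed, crm):
    """Attach internal profile data to each sentiment observation."""
    by_handle = {r["handle"]: r for r in crm}
    return [
        {**obs, **by_handle[obs["handle"]]}
        for obs in feed
        if obs["handle"] in by_handle  # keep only known customers
    ]

enriched = enrich(sentiment_feed, crm_records)

# Prioritize retention outreach: unhappy customers, highest value first.
at_risk = sorted(
    (r for r in enriched if r["sentiment"] < 0),
    key=lambda r: -r["lifetime_value"],
)
```

The join itself is trivial; the competitive advantage lies entirely in the right-hand side of it, the CRM data only you hold.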


See previous quick hit on this topic   


July 19: Hadoop über alles? Then there should be a standard industry performance benchmark

The Hadoop market has matured to the point where users now have plenty of high-performance options, including IBM InfoSphere BigInsights. The core open-source Hadoop stack is common across most commercial solutions, including BigInsights. The core mapping and reducing functions are well-defined and capable of considerable performance enhancement, leveraging proven approaches such as Adaptive MapReduce, which is at the heart of BigInsights. Customers are increasingly using performance as a key criterion to compare different vendors' Hadoop offerings, often using various sort benchmarks to guide their evaluations.

So why are there still no standard industry-wide benchmarks for comparing the performance of all core operations on Hadoop clusters? Perhaps it's for the same reason that we still have no formal standards in the core Hadoop stack: it's an open-source community that culturally resists handing governance to a de jure body. Regardless, customers are demanding that the industry adopt a clear, consensus approach to performance claims on the core operations: NameNode operations, HDFS reads/writes, MapReduce jobs (maps, reduces, sorts, shuffles, and merges), compression/decompression, and so on.
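As a purely illustrative sketch of what any such benchmark ultimately reduces to, here is a single-machine micro-benchmark of one core operation, a sort, reporting throughput. A real industry benchmark would run distributed, TeraSort-style workloads across a cluster and cover HDFS I/O, shuffles, and compression as well; this only shows the measurement shape.

```python
import random
import time

def benchmark_sort(n_records: int, seed: int = 42) -> float:
    """Illustrative micro-benchmark: sort throughput in records/sec.

    Not a proposed industry benchmark -- just the skeleton of one:
    generate a workload, time the core operation, report a rate.
    """
    rng = random.Random(seed)  # fixed seed for repeatable runs
    records = [rng.random() for _ in range(n_records)]
    start = time.perf_counter()
    records.sort()
    elapsed = time.perf_counter() - start
    return n_records / elapsed

throughput = benchmark_sort(100_000)
```

The hard part of standardization isn't the timing loop; it's agreeing on workloads, cluster configurations, and reporting rules so that vendors' numbers are actually comparable.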


See previous quick hit on this topic


July 20: Complex event processing?

Complex event processing (CEP) is a vague catch-all of a real-time paradigm that now fits snugly into big data, thanks to the "V=velocity" dimension.

Traditional CEP enables analysis of discrete business events; executes rules-based correlations across event types; supports only structured data types; and is optimized for modest data rates.

At IBM, we've long provided CEP technologies, but we have extensively expanded their big data scale and scope. Like any CEP platform, IBM InfoSphere Streams rapidly ingests, analyzes, and correlates information as it arrives from real-time sources. But we take it to the next level of big data scale and sophistication:

  • Handle simple and extremely complex analytics with agility
  • Scale for computational intensity
  • Support a wide range of relational and nonrelational data types
  • Analyze continuous, massive volumes of data at rates up to petabytes per day
  • Perform complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data, satellite data, sensors, and any other type of digital information that is relevant to your business
  • Leverage sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes 
  • Adapt to rapidly changing data forms and types
  • Seamlessly deploy applications on any size computer cluster
  • Meet current reaction time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules
  • Develop new applications rapidly that can be mapped to a variety of hardware configurations and adapted to shifting priorities
  • Provide security and information confidentiality for shared information
  • Offer a wide range of accelerators for diverse real-time big data applications in many industries.
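To contrast with the discrete-event rule engine of traditional CEP, here is a minimal sketch of the continuous-analytics idea behind stream computing: an unbounded feed analyzed incrementally as it arrives. This is invented illustration on a single numeric feed, not the Streams API; real stream computing platforms distribute such operators across a cluster and handle heterogeneous data types.

```python
from collections import deque

def sliding_average(stream, window=3):
    """Continuously emit the mean of the last `window` values.

    Each arriving value produces an updated result immediately,
    rather than waiting for a batch to complete.
    """
    buf = deque(maxlen=window)
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

readings = [10.0, 20.0, 30.0, 40.0]  # stand-in for an unbounded sensor feed
averages = list(sliding_average(readings))
```

The key design difference from the batch world: results are produced per event with bounded state, which is what makes sub-millisecond reaction times possible in principle.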

And, oh yes, Streams integrates out of the box with BigInsights, the Hadoop product in the IBM portfolio. Whereas traditional CEP is a stovepipe silo separate from users' big data platforms, IBM has made it integral.

That's why, in lieu of "CEP," we've chosen a broader term, "stream computing," to describe where technologies like Streams fit into the big data picture.

Is that the right term? Tell us what you think.



At the end of the week, I'm looking forward to further daily postings to LinkedIn. I want to get your feedback, start discussions, and stimulate forward thinking in the industry on all things big data.