Big Data Evangelist, IBM

Here are the quick-hit ponderings that I posted on various LinkedIn big data discussion groups this past week. I went deeper on information glut, Hadoop uber-alles, experience optimization, sexy statistics, and complex event processing:

August 20: Information glut? You don't need lots of data to do big data

You don't always need big amounts of data to do big data right. In fact, data glut and overcapacity may be counterproductive at many stages in the big data lifecycle.

No, I haven't lost my mind. The point of big data is to extract deeper intelligence from data while eliminating the scale constraints that have frustrated traditional advanced analytics initiatives. To do big data right, enterprises must take the following life-cycle approach:

  • Big data on-ramp: As I've emphasized many times, the core big-data architectural principle is to give yourself headroom to scale out massively along all the Vs as your analytical needs grow. You may not need that capacity immediately, but you will almost certainly need it in the next several years. The popular focus on the sheer scale of some big data initiatives distracts from the fact that even they had to start somewhere. That "somewhere"--the big data on-ramp--is usually in "small data" territory. More often than not, the big data on-ramp will be your existing data analytic infrastructure: data warehouse, data mining, business intelligence, and so forth. As your needs evolve in all of these areas, you will scale these investments into big data territory. Even now, your data scientists may be playing with BigInsights Hadoop clusters in the low terabytes, confident that there are no impediments to elastic scaling when their MapReduce, machine learning, content analytics, and other models must be put into full production.
  • Big data throttle: As you expand the amount and diversity of data stored and processed in your big data platform, you'll need to explore approaches to throttle the volumes to a cost-effective, manageable level. You'll need to explore profiling, sampling, purging, archiving, deduplicating, and other approaches to keep the core platform(s)--be they Hadoop, MPP EDW, NoSQL, or whatever--from filling up with junk data that has minimal downstream value. Likewise, you'll need tools to throttle and allocate workloads and resources across these platforms to maximize performance and ensure service levels.
  • Big data off-ramp: Though it's tempting to think that your core big data platform is your "one size fits all" option forevermore, that's dead-end thinking. Given the sheer diversity of specialized analytic databases and the pace of innovation in this area, you may find down the road that some data types and analytic jobs are best handled on smaller-scale workload-optimized platforms, either of the same technology (e.g., do BI-type jobs on a Hadoop cluster running HBase rather than the core HDFS-based cluster) or different technology (e.g., do machine-learning semantic analysis on a DBMS-based RDF triple store rather than a Hadoop platform).
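The throttle step above can be sketched concretely. The snippet below is a minimal, illustrative pipeline stage--not any vendor's implementation--that deduplicates records by content hash and then down-samples the survivors before they reach the core platform; the record format and sample rate are assumptions for the example.

```python
import hashlib
import random

def throttle(records, sample_rate=0.1, seed=42):
    """Toy 'big data throttle': drop exact duplicates by content hash,
    then keep a representative random sample of the survivors."""
    rng = random.Random(seed)  # fixed seed for a repeatable sample
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(rec.encode("utf-8")).hexdigest()
        if digest in seen:               # deduplicate: skip repeats
            continue
        seen.add(digest)
        if rng.random() < sample_rate:   # sample: keep ~sample_rate of uniques
            kept.append(rec)
    return kept

# Hypothetical event stream: heavy duplication plus a long tail
events = ["click"] * 5 + ["view-%d" % i for i in range(100)]
sample = throttle(events, sample_rate=0.2)
```

In practice the same idea shows up as profiling, archiving, and purge policies, but the principle is identical: shed low-value volume before it fills the core platform.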

In a nutshell, big data is all about matching the data analytics platform scale to the business challenges you're trying to address, not about treating one massive scale as the panacea for all projects.

See previous quick-hit on this topic.


August 21: Hadoop uber-alles? Cluster management tooling must keep up

If you've implemented Hadoop in a production environment, your clusters will continue to scale along all dimensions—volume, velocity, and variety—and you will be running more mission-critical applications on the platform. The production-ready maturity of your Hadoop deployment depends on your ability to harden it for robust round-the-clock availability, performance, and manageability.

Hadoop cluster management needs to be central to your big data initiative, just as it has been in your enterprise data warehousing environment. Cluster management demands strong tooling that is either baked into your existing distribution or sourced from other vendors and tightly integrated into whatever distribution, including open-source Apache Hadoop, you have deployed.

The core open-source Apache Hadoop stack is still deficient in many of these areas, but it's evolving in this direction. Be aware that purely open-source Hadoop lacks many critical cluster management features that span all or most core subprojects and components. In particular, the core Apache codebase is still weak on functionality to ensure robust availability, reliability, capacity management, low-latency optimization, workload management, job scheduling, security, and governance. Also, not all Hadoop solution vendors provide a full range of internally developed cluster management capabilities for robust deployments. IBM does, though. More on this in an article I'll be publishing soon.
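That said, the open-source stack does expose raw signals you can build watchdogs on. For instance, `hdfs dfsadmin -report` is a real HDFS admin command that summarizes datanode health; the exact report format varies by version, so the parsing sketch below is illustrative, with the sample text standing in for real command output.

```python
import re

def parse_dfs_report(report_text):
    """Pull basic health signals (live/dead datanode counts) out of
    text shaped like `hdfs dfsadmin -report` output. The line format
    matched here is assumed for illustration and varies by version."""
    live = re.search(r"Live datanodes \((\d+)\)", report_text)
    dead = re.search(r"Dead datanodes \((\d+)\)", report_text)
    return {
        "live": int(live.group(1)) if live else 0,
        "dead": int(dead.group(1)) if dead else 0,
    }

# Stand-in for the output of: hdfs dfsadmin -report
sample_report = """Configured Capacity: 1209462790144 (1.10 TB)
Live datanodes (3):
Dead datanodes (1):
"""
health = parse_dfs_report(sample_report)
```

A script like this is no substitute for integrated cluster management tooling, which is exactly the point: with bare Apache Hadoop, you end up writing the monitoring glue yourself.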

See previous quick-hit on this topic.


August 22: Experience optimization? Down deep, it's not a big fancy formula.

When we speak of customer experience, my head spins with thoughts of James Joyce's landmark novel "Ulysses." This forbiddingly strange book is written in a "stream of consciousness" style that attempts to present the subjective experiences of the main characters through run-on interior monologues. Much of this work is in the form of sometimes-naughty puns, allusions, asides, and remarks that mirror the protagonists' deepest subconscious desires.

Given that none of us truly has conscious control over the majority of what flows through our own heads and hearts--or other people's, for that matter--how exactly does one optimize an experience? On any given day, the best one might hope for is that a given experience "made your day," though it may have been entirely unexpected, unplanned, unprecedented, and unlikely to recur. If you're a commercial organization and hope to continually make the days of all of your customers without fail, you must know people better than they know themselves and somehow have the resources to grant them their fondest desires (while also turning a profit).

Absent divine powers, the best you can do is engage your customers continually through human contacts across all or most channels. The formula for experience optimization isn't a formula. It's simply standing ready to listen to customers and respond to them in every way within your power to make every moment a little lighter and brighter for them.

See previous quick-hit on this topic.


August 23: Sexy statistics? Data warehousing simplicity is a surprisingly complex metric

Simplicity is sexy, as any familiarity with design best practices will show. The "less is more" imperative makes great sense on lots of levels, and it's the heart of IBM Netezza's longtime value proposition in the data warehouse market.

However, simplicity--unlike speeds and feeds, another DW appliance differentiator--has no clear metric. How do you measure simplicity? Does it even make sense? If it makes sense, is it beside the point? Isn't simplicity in the eye of the beholder?

The other day, it occurred to me that we can measure simplicity of a DW appliance along three main dimensions. If you're a user evaluating a commercial DW appliance against your own current practice, or against a competing product, ask yourself the following:

  • The "fewers": Does the appliance allow me to master its architecture, features, and tools in fewer days or hours? Does it require fewer setup, configuration, and migration steps? Does it take fewer hours to put into full production? Does it enable diagnosis and correction of problems in fewer minutes or seconds? Does it require fewer manual choices by architects, installers, and administrators? Does it have fewer requirements for manual tech support?
  • The "ones": Does it come from one vendor? Is there one sales point of contact? Is it one SKU with one price and one maintenance plan? Can I acquire it through just one purchase order? Is it shipped as one turnkey appliance with all requisite hardware and software components preconfigured and optimized? Can I set it up and configure it in just one session? Is there one technical support "throat to choke" for all issues? Is it manageable from the bare metal up through the database and other software through one well-designed console interface?
  • The "nones": Is there no software installation? No indexes? No physical tuning? No hardware upgrade requirements? No failed drives that are not automatically regenerated? No bad sectors that are not automatically rewritten or relocated? No manual storage administration? And so forth.

In a sense, simplicity has many moving parts that must be integrated seamlessly for it all to hang together into a powerful platform. All of the quick-deploy, quick-value benefits of appliances derive from the simplicity features that these metrics--call them checklist items, if you will--embody.
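Because these are checklist items, the three dimensions lend themselves to a simple scorecard. The sketch below is a toy evaluation aid, not an official metric; the checklist keys and the candidate appliance are hypothetical, and a real evaluation would use the full question lists above.

```python
def simplicity_score(appliance):
    """Toy scorecard: count how many 'fewers', 'ones', and 'nones'
    checklist items an appliance satisfies. Keys are illustrative,
    abbreviated from the longer checklists in the text."""
    checklist = {
        "fewers": ["fewer_setup_steps", "fewer_manual_choices"],
        "ones":   ["one_vendor", "one_sku", "one_console"],
        "nones":  ["no_indexes", "no_physical_tuning"],
    }
    return {
        dim: sum(appliance.get(item, False) for item in items)
        for dim, items in checklist.items()
    }

# Hypothetical candidate appliance under evaluation
candidate = {"one_vendor": True, "one_sku": True,
             "no_indexes": True, "fewer_setup_steps": True}
score = simplicity_score(candidate)
```

Scoring two products side by side this way turns "simplicity is in the eye of the beholder" into something you can at least argue about with numbers.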

See previous quick-hit on this topic.


August 24: Complex event processing? Conversation-driven human decisions are the most complex events of all

Nuanced conversation management in all its manifestations is the key to business success in the age of multichannel customer relationship management. In a fully conversation-driven multichannel environment, enterprises are deploying social media analytics and other infrastructure to automatically monitor, analyze, correlate, and participate in various internal and external conversations. This points to the development of event-driven customer-facing environments.

Conversations are a sort of complex event that includes lower-level events such as tweets, posts, status updates, instant messages, and the like. In a business world adopting social networking internally as well as externally, conversations are live, dynamic, evolving events that have a direct impact on decisions and responses. In a social-centric conversation management environment, every new contribution to a never-ending conversation should be published instantaneously to all engaged parties, both internally and externally. This scenario would enable customers to receive immediate feedback, guidance, and resolution from the best minds in your internal and external value chains.

Within your organization and across your value chain, the conversations are often extremely complex and open-ended. How effectively you respond to customer issues depends on how well you can close the action loop as you manage the internal events known as "decisions." A key enabler for event-driven decision making is the concept of collaborative and workflow applications that are real-time, rule-triggered, and continuously refreshed from event data you might be feeding in from, say, IBM InfoSphere Streams.
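The core correlation idea--rolling low-level events up into "conversation" complex events and flagging the ones still demanding a decision--can be sketched in a few lines. This is a minimal illustration, not InfoSphere Streams code; the event fields (`thread`, `ts`, `text`) and the activity window are assumptions for the example.

```python
from collections import defaultdict

def active_conversations(events, window=60):
    """Toy complex-event correlation: group low-level events (tweets,
    posts, updates) into conversations by thread id, then keep only
    the threads with activity inside the recency window."""
    threads = defaultdict(list)
    for ev in events:
        threads[ev["thread"]].append(ev)
    latest = max(ev["ts"] for ev in events)  # "now" for this batch
    return {
        tid: evs for tid, evs in threads.items()
        if any(latest - ev["ts"] <= window for ev in evs)
    }

# Hypothetical event stream with timestamps in seconds
events = [
    {"thread": "t1", "ts": 10, "text": "outage?"},
    {"thread": "t1", "ts": 95, "text": "fixed now"},
    {"thread": "t2", "ts": 20, "text": "pricing question"},
]
active = active_conversations(events, window=60)
```

A rule engine fed by a streaming platform would apply the same pattern continuously rather than in batches, triggering the collaborative decision workflows discussed above.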

Clearly, collaboration adds another level of complexity to an organization's ability to respond to events. Where real-time event-triggered decisions are concerned, collaboration could be a showstopper, if we don't watch out. Collaboration adds complexity, hence the potential for more decision latency. But collaboration also builds buy-in, and, as such, can be worth the muss and fuss of versions, reviews, revisions, and provisional decisions.

See previous quick-hit on this topic.



At the end of the week I'm drafting next week's quick-hits: four established themes and one new one. I've already got so many themes in play that I can easily spend the next several weekly cycles just extending them all, and possibly cross-fertilizing. We'll see.