A Week of Koby's Quick Hits - Aug. 6-10

Big Data Evangelist, IBM

Here are the quick-hit ponderings that I posted on various LinkedIn big data discussion groups this past week. I opened up one new theme–Big Media (which I'd introduced a few weeks back at this IBM big-data-relevant site)–and extended my existing discussions of peta-governance (going beyond what Tom Deutsch and I discussed recently in the IBM Data Management magazine), prediction markets (ideas stimulated by LinkedIn discussions), decision scientists (a shortened version of a blog post that will appear imminently at the new IBM big data hub), and "all in memory" (wherein I split hairs on the definition of "all"):

August 6 – Peta-governance? Controlling the explosion of big data models will prove difficult

Big data rides on a never-ending stream of new statistical, predictive, segmentation, behavioral and other advanced analytic models. As you ramp up your data scientist teams and give them more powerful modeling tools, you will soon be swamped with models. How can you possibly govern the exponentially growing pool of models, many of which are key intellectual property, with tight controls that you can scale without strangling the creativity of your data scientists by making them jump through frustrating bureaucratic hoops?

Big data analytics demands governance–let's face it, some level of repeatable bureaucracy–if it's designed to produce artifacts that are deployed into production applications. For example, how many in-production churn models are you managing across your various channels? What's the latest approved version of each? How well does it fit the latest customer data pulled in from the data warehouse and other customer data stores? Was it promoted to production in a controlled fashion after being scored against the latest customer data and compared with various challenger models? Which of your churn-knowledgeable data scientists and subject matter experts, if any, takes responsibility for approval of the latest and greatest model at any point in time?
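
To make those questions concrete, here is a minimal Python sketch of a champion/challenger registry. It is an illustration under stated assumptions, not how any particular product works: the `ModelRegistry` and `ModelVersion` classes, the "churn-web" model name, the scores, and the approver names are all hypothetical, and the promotion rule (a challenger must strictly beat the current champion's validation score) is just one simple policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    name: str                         # hypothetical model name, e.g. "churn-web"
    version: int                      # increases by one per check-in
    score: float                      # validation metric on the latest data; higher is better
    approved_by: Optional[str] = None # set only when the version is promoted


@dataclass
class ModelRegistry:
    # Every checked-in version, per model name.
    history: Dict[str, List[ModelVersion]] = field(default_factory=dict)
    # The version currently in production (the "champion"), per model name.
    champions: Dict[str, ModelVersion] = field(default_factory=dict)

    def check_in(self, name: str, score: float) -> ModelVersion:
        """Record a new version of a model without promoting it."""
        versions = self.history.setdefault(name, [])
        candidate = ModelVersion(name, len(versions) + 1, score)
        versions.append(candidate)
        return candidate

    def promote(self, challenger: ModelVersion, approver: str) -> bool:
        """Promote a challenger only if it beats the current champion."""
        champion = self.champions.get(challenger.name)
        if champion is not None and challenger.score <= champion.score:
            return False  # champion stays; challenger remains in history
        challenger.approved_by = approver
        self.champions[challenger.name] = challenger
        return True


registry = ModelRegistry()
v1 = registry.check_in("churn-web", score=0.71)
registry.promote(v1, approver="alice")
v2 = registry.check_in("churn-web", score=0.69)
v3 = registry.check_in("churn-web", score=0.78)
rejected = registry.promote(v2, approver="bob")   # loses to v1
accepted = registry.promote(v3, approver="bob")   # beats v1
current = registry.champions["churn-web"]
print(current.version, current.approved_by, len(registry.history["churn-web"]))
```

With a record like this, "what's the latest approved version?" and "who approved it?" become lookups rather than archaeology, and rejected challengers stay in the history for audit.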

Too many organizations implement haphazard model governance within and among their disparate big data initiatives. The sheer scale of your big data science will make the governance imperative more urgent. Your data science teams will produce:

  • more and more models (segmentation, churn, upsell, cross sell, experience optimization, anti-fraud, social graph analysis, etc.)
  • developed in more and more tools and languages (R, SPSS, MapReduce, Java, SAS, etc.)
  • with more and more variables (interactions, behaviors, transactions, clicks, locations, demographics, psychographics, socioeconomics, semantics, etc.)
  • fed with data from more and more sources (social media, CRM, OLTP, data warehouses, billing, sales and marketing, etc.)
  • scored and validated according to more and more schedules (as-needed, batch, near-real-time, real-time)
  • driving optimizations in more and more business processes (customer-facing, such as sales, marketing, call centers, self-service portal, etc., and back-office, such as order fulfillment, manufacturing, and finance)

Peta-scale big data governance demands tools such as IBM SPSS Collaboration and Deployment Services to automate standard processes for life-cycle management of models created in and imported from diverse tools. To avoid fostering an unmanageable glut of statistical models, your big data sandboxing environment should support strong life-cycle governance of models and other artifacts developed by your data scientists, regardless of what tools they use. Key governance features include check-in/check-out, change tracking, version control, and collaborative development and validation. Your sandboxing platforms and modeling tools should ensure consistent governance automation and managed collaboration across multidisciplinary teams working on your most challenging big data analytics initiatives.

See previous quick-hit on this topic and respond here.

August 7 – Big Media?

Social media are powering the Big Data revolution. Much of the impetus for this new paradigm has come from the onrush of user-generated social chatter–tweets, status updates, and the like–which provide a rich vein of market intelligence for marketing, sales, brand, and other consumer-facing professionals. Most of this new data is unstructured text that, via the magic of natural language processing, can reveal trends in customer awareness, sentiment, and propensities.

This very same trend is carrying the seeds of the next revolution, which we might think of as “Big Media.” In this new order, streaming media will power entertainment, advertising, marketing, education, music, community, and practically every other aspect of online culture. In fact, the very phrase “social media” alludes to the inflection point we may already have crossed. Following the pioneering path of YouTube, Twitter and Facebook have already evolved to support user-posted streaming media as an integral element of user experience.

This is a bellwether of things to come. As the younger generation abandons traditional media delivery channels–such as cable TV, over-the-air radio, and theatrically presented motion pictures–the Big Media revolution will snowball.

Rest assured that the social-driven big data revolution will not wane. Instead, big data will be the topsoil for the new plateau of Big Media. Analytic-driven personalization of content delivery–which is at the heart of social-driven big data–will underpin this new order. Big data’s social interaction model will permeate Big Media. And more streaming content will be generated directly by users or shared by them within media-centric channels, such as “social TV.”

How will the coming era of Big Media take big data to the next level? One way of looking at it is that the “3 Vs” are evolving into the “3 Zs”:

  • Zettabytes (eventually) of rich streaming media objects outstrip mere petabytes of structured, semi-structured, and unstructured text
  • Zero-latency (almost) multi-streaming media feeds gain priority over mere low-latency event streams of social, sensor, geospatial, log, and other data types
  • Zillions (metaphorically) of on-demand media streaming options become the centerpiece of the new world online culture

Another way of looking at this trend is that “data in motion” (i.e., media such as full-motion video, on-demand entertainment, and streaming Web radio) will dominate “data at rest” in the digital media arena. Clearly, the resource requirements for Big Media–storage, processing, bandwidth, etc.–will be an order of magnitude greater than those of big data (which itself is an order of magnitude more resource-ravenous than traditional data management infrastructures).

Hadoop’s elephant, the recognized symbol of big data, will appear Lilliputian beside whatever monstrous organism we choose to represent the earthshaking advent of Big Media. Some may favor the blue whale, others the tyrannosaurus, still others the less threatening but undeniably mighty redwood. Personally, I’m leaning toward Clifford the Big Red Dog. Scratch that–copyrighted character.

How about Paul Bunyan?

Respond here

August 8 – Prediction markets? Markets won't succeed if they offer no better predictions than in-house

Companies everywhere are strengthening their internal predictive muscles. However, organizations won't necessarily source these capabilities from external parties–such as online cloud/SaaS-based prediction markets–if they have highly trained and well-paid cadres of analysts on the payroll.

Companies that wish to spin off their internal predictive analytics capabilities into prediction-market services must take the following steps:

  • Establish predictive analytics and data science centers of excellence
  • Staff those centers of excellence with data scientists and subject-matter experts (SMEs)
  • Equip them with powerful analytics tools that support collaborative, self-service, visual exploration of big data
  • Bake into those tools the ability to publish, aggregate, syndicate, pool, mashup, and advertise predictions from diverse data scientists and SMEs
  • Use your core group of data scientists and SMEs as the nucleus for a big data scientist expertise marketplace along the lines of

This sort of spin-off only makes sense if you are already primarily a consulting firm that has deep predictive expertise and insight into at least one industry, and preferably more. If you were, say, in one specific vertical such as energy & utilities, you wouldn't want to spin off your internal competency into a service that your competitors might access.

It might also make sense if you perform a trusted third-party service for a particular industry, such as hosting a B2B supply chain or providing a transaction-brokering service. High-quality predictions on supply, demand, logistics, prices and the like may be something that only a firm with an industry-wide view–and the big data to back it up–can produce.

But the shared service must always compete against customers' own internal capabilities. Prediction markets are only valuable if they give you good stuff you can't scrounge up on your own.

See previous quick-hit on this topic and respond here.

August 9 – Decision scientists? Game theory perspectives valuable in modeling next best actions

Customer engagement is a bit of a game, because, deep down, it's a form of haggling and bargaining. Game theory is a modeling discipline that focuses on strategic decision-making scenarios. It leverages a substantial body of applied mathematics and has been used successfully in many disciplines, including economics, politics, management and biology. There has even been some recent discussion of its possible application in modeling customer-engagement scenarios to improve loyalty, upsell and the like.

Customer engagement modeling is a largely unexplored frontier for game theory. The literature on this is relatively sparse right now, compared to other domains where game theory's principles have been applied. Nevertheless, the core game-theoretic concepts translate to customer engagement quite well. For example, the concept of a "many-player game" involves an arbitrary, but finite, number of customer participants who consider each other's actions when deciding how they themselves should act. This applies to any B2C engagement that involves making differentiated offers contingent on what the consumer's "friends and family" have bought. It is also relevant in any online engagement where the recommendations or activities of "influencers" (friends, family, experts, Oprah, etc.) have some bearing on the choices we make. And it describes, to some degree, the business model of social-shopping communities such as Groupon.
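
One way to picture such a many-player game is a simple threshold model, sketched below in Python. This is an illustrative assumption, not a method from the literature cited here: each customer buys once the share of their friends who have already bought reaches a personal threshold. The `settle_purchases` function, the five customers, their friend circles, and the threshold values are all made up.

```python
from typing import Dict, Set


def settle_purchases(friends: Dict[str, Set[str]],
                     thresholds: Dict[str, float],
                     seeds: Set[str]) -> Set[str]:
    """Repeat best responses until nobody else wants to buy.

    A customer buys once the share of their friends who have bought
    reaches their personal threshold. Buyers never un-buy, so each pass
    either adds a buyer or ends the loop.
    """
    buyers = set(seeds)
    changed = True
    while changed:
        changed = False
        for customer, circle in friends.items():
            if customer in buyers or not circle:
                continue
            share = len(circle & buyers) / len(circle)
            if share >= thresholds[customer]:
                buyers.add(customer)
                changed = True
    return buyers


# Hypothetical friend circles and thresholds.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob", "eve"},
    "eve": {"dan"},
}
thresholds = {"ann": 0.5, "bob": 0.3, "cat": 0.5, "dan": 0.75, "eve": 1.0}

# Seed the game with one targeted offer that "ann" accepts.
buyers = settle_purchases(friends, thresholds, seeds={"ann"})
print(sorted(buyers))
```

In this toy run the purchase spreads from ann to bob and cat, then stalls: dan needs three-quarters of his circle to buy first, and eve follows only dan. That is the kind of discrete, who-influences-whom outcome a differentiated "friends and family" offer would need to anticipate.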

At first glance, data scientists may consider themselves fish out of water when it comes to applying game-theoretic approaches to customer engagement. Methodologically, game theory looks at discrete variables–actions, events and outcomes–rather than the continuous variables that are the heart and soul of data science's core discipline of regression modeling. In addition, game theory assumes that we should model engagements as interactions among rational decision makers–individuals and businesses–that can have deterministic outcomes, rather than the probabilistic outcomes associated with mainstream data science.
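
The contrast with regression shows up even in a toy example. The Python sketch below assumes a two-step haggle in which the seller makes an offer and a rational customer then replies; the sequential framing, the action names, and every payoff number are assumptions chosen for illustration. Working backward from the customer's best reply yields a single deterministic "next best action" rather than a probability.

```python
from typing import Dict, Tuple

# PAYOFFS[(seller_offer, customer_reply)] = (seller_payoff, customer_payoff)
# All numbers are made up for illustration.
PAYOFFS: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("full_price", "buy"): (4, 1),
    ("full_price", "walk"): (0, 2),
    ("discount", "buy"): (2, 3),
    ("discount", "walk"): (0, 2),
}
OFFERS = ("full_price", "discount")
REPLIES = ("buy", "walk")


def best_reply(offer: str) -> str:
    """The customer's payoff-maximizing reply to a given offer."""
    return max(REPLIES, key=lambda reply: PAYOFFS[(offer, reply)][1])


def next_best_action() -> Tuple[str, str]:
    """Backward induction: pick the offer whose anticipated reply pays the seller most."""
    offer = max(OFFERS, key=lambda o: PAYOFFS[(o, best_reply(o))][0])
    return offer, best_reply(offer)


offer, reply = next_best_action()
print(offer, reply)
```

Here the customer would walk away from the full price, so the seller's best move is the discount, which the customer accepts. Changing a single payoff can flip the outcome entirely, which is exactly the discrete behavior that continuous regression models do not capture.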

Down deep, game theory is the realm of what some have called "decision science," rather than data science in its traditional sense. Nevertheless, it provides a valuable set of approaches for behavioral analytics. Game theory can deliver rich insights, especially when data scientists use it to enrich and extend the propensity, experience, and other behavioral models at the heart of customer engagement.

See previous quick-hit on this topic and respond here.

August 10 – All in memory? Depends on what you mean by "all"

All-in-memory analytics is an industry mania right now, and in any mania, we tend to lose perspective on what truly matters.

One area where I think everybody's getting hung up is in the perception that "all" refers to persisting "big data" in memory. In other words, most people have their minds on some futuristic scenario wherein we might be able to hold entire Hadoop clusters in distributed RAM and do lightning-quick MapReduce model runs and interactive exploration against ballooning petabytes.

But, apart from the most elite data scientists, how many people truly need to analyze petabytes at the speed of thought? In the context of "all in memory," we should interpret "all" as referring to the entire working set of data accessed by any application. What truly matters is the ability to rapidly extract all relevant insights from the core data set of interest.

What is "all" the core data that you need for your analysis? Most business analytics run against data warehouses, marts, cubes, and other databases that store less than 10 terabytes (TB). This, conveniently, is well within the capacities of most in-memory analytics platforms on the market today. Likewise, most in-memory analytics clients hold far less than 10TB in RAM, and their power users are not noticeably complaining.

If you're a data scientist, you may build a starter sandbox with 10-25TB of priority core data, in which case your definition of "all" may grow over time as your investigations call for more sources. Those are the sorts of in-memory use cases where the frontiers of all-in-memory will push closer to big-data territory soonest. The technology and economics of RAM will continue to improve. The price of solid-state persistence will continue to drop by 30 percent every 18 months, making petabyte memory clouds more feasible and budget-friendly over the coming decade. And as next-generation CPUs, with thousands of cores, expand the addressable memory, we're likely to see dozens of TBs of RAM per server as a mainstream technology in that same time frame.
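
For a sense of what that price trend implies, the short Python sketch below compounds a 30 percent drop every 18 months. It assumes the decline is smooth and continues unchanged, which is a simplification; the function names are hypothetical.

```python
import math


def relative_price(years: float, drop: float = 0.30, period_years: float = 1.5) -> float:
    """Fraction of today's price remaining after `years` of compounding declines."""
    return (1.0 - drop) ** (years / period_years)


def years_until(target_fraction: float, drop: float = 0.30, period_years: float = 1.5) -> float:
    """Years needed for the price to fall to `target_fraction` of today's price."""
    return period_years * math.log(target_fraction) / math.log(1.0 - drop)


decade_fraction = relative_price(10)
tenfold_years = years_until(0.10)
print(f"price after 10 years: {decade_fraction:.1%} of today's")
print(f"years until a tenfold drop: {tenfold_years:.1f}")
```

Under these assumptions the price falls to a bit over 9 percent of today's level after a decade, so a tenfold drop takes slightly less than ten years.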

See previous quick-hit on this topic and respond here.

At the end of the week, I have drafted four blog posts (this one included) that are queued up waiting to be posted here on the IBM big data hub. Yes, I stay busy.