Here are the quick-hit ponderings that I posted on various LinkedIn big data discussion groups this past week. I opened up three new themes – peta-governance, prediction markets, and NoSQL = no-disk – while further developing established themes of advanced visualization and frictionless sandboxes:
July 23 Peta-governance?
Contrary to what many might think, you can indeed govern petabytes of data in a coherent manner. There is no inherent trade-off between the volume of the data set and the quality of the data maintained within.
Some believe that you can't scale out into the petabyte range without filling your Hadoop cluster, massively parallel data warehouse, and other nodes with junk data that is inconsistent, inaccurate, redundant, out of date, or nonconformed. That's simply not true.
Data quality problems in most organizations usually originate in the source transactional systems – whether those are your customer relationship management (CRM) system, general ledger application, or whatever. These systems are usually in the terabytes range.
Any IT administrator who fails to keep the system of record cleansed, current, and consistent has lost half the battle. Sure, you can fix the issue downstream (to some degree) by aggregating, matching, merging, and cleansing data in intermediary staging databases. But the quality problem has everything to do with inadequate controls at the data's transactional source, and very little to do with the sheer volume of it.
Downstream from the source of the problem, you can scale your data cleansing operations with a massively parallel deployment of IBM InfoSphere QualityStage – or of IBM BigInsights tricked out for this function – but don't blame the cure for an illness that it didn't cause.
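To make the downstream "matching, merging, and cleansing" step concrete, here is a minimal Python sketch. It is only an illustration under assumed conditions: the record layout (`email`, `name`, `updated`), the rule of matching on a normalized email address, and the "keep the most recent record" survivorship policy are all hypothetical, and the sketch is not how InfoSphere QualityStage or BigInsights work internally.

```python
from collections import defaultdict

def normalize(record):
    """Cleanse one record: trim and lowercase the email, tidy the name."""
    return {
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()).title(),
        # Assumed to be an ISO date string, so string order equals date order.
        "updated": record["updated"],
    }

def merge_duplicates(records):
    """Match records on normalized email and keep the most recently updated one."""
    groups = defaultdict(list)
    for rec in map(normalize, records):
        groups[rec["email"]].append(rec)
    # Hypothetical survivorship rule: the newest record in each group wins.
    return [max(group, key=lambda r: r["updated"]) for group in groups.values()]

# Made-up sample rows standing in for a staging table.
raw = [
    {"email": " Ann@Example.com ", "name": "ann  lee", "updated": "2012-07-01"},
    {"email": "ann@example.com", "name": "Ann Lee", "updated": "2012-07-20"},
    {"email": "bob@example.com", "name": "BOB RAY", "updated": "2012-06-15"},
]
clean = merge_duplicates(raw)
```

Nothing in this logic depends on how many rows there are: each group is resolved independently of the others, which is why this kind of cleansing can be spread across a parallel cluster.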
July 24 Advanced visualization? Visual overload kills understanding
We may describe visualization's core value through the old adage, "seeing is believing." Raw data, numbers, statistics, and algorithms don't move the human soul. We all need regular assurance through charts, graphs, trendlines, histograms, boxplots, heatmaps, and the like that the patterns in the data indeed do exist.
The visual display of quantitative information is the heart and soul of business intelligence (BI). These days, the BI industry is keen on "advanced visualization," as if these varied approaches, like the "advanced analytics" of Big Data, are always the best way of delivering actionable intelligence. One of my favorite sources of insight on advanced visualization is Forrester Research BI guru Boris Evelson, who wrote this great blog recently on the topic.
Advanced visualization is built on deep multidimensional correlations among diverse variables. Unfortunately, the technology can produce visuals of mind-numbing complexity and dynamism. Not only do we, in practice, increasingly see a wide range of linked graphs crammed into many analytics dashboards, but often each element – report, scorecard, plot, etc. – includes far more detail than the average human can absorb. For example, the larger the underlying data tables, the more likely that even a structured report has far more rows and columns than can be displayed on a single screen. However, that doesn't stop BI developers from doing this time and again, making their reports next to unreadable.
Advanced visualization can be counterproductive, if what you're trying to do is advance human understanding. The more crowded the visual, the less likely it is to tell a coherent story. Just as you wouldn't want to throw every possible statistical analysis at a problem, you should resist the temptation to gussy up the data with every graphic blandishment in your toolkit.
July 25 Prediction markets?
Prediction markets exist so that people can speculate on the probability of future events. Financial futures markets are the best-known example, but similar online services, leveraging predictive analytics, big data, and cloud computing, are starting to pop up in many industries. Many such initiatives rely heavily on analytic infrastructure to auto-generate fresh predictions from new data, but some also lean heavily on human subject matter experts to train and guide the underlying predictive models. Check this Wikipedia page for further depth and scope on this trend.
What sorts of industries are best suited to spawn prediction markets of their own? Essentially, the proving grounds are any markets that are disrupted easily, often, and/or severely.
To what extent is human subject matter expert (SME) "training" necessary in prediction markets? In more stable, well-understood markets, predictive models can include all relevant factors and be trained by SMEs with ample experience in the business environment they're modeling. But these assumptions fall apart in periods of market disruption, when it may not be clear what all the "relevant factors" are and when none of the recognized SMEs has any relevant experience to draw on. That's because the experts' whole world is being turned topsy-turvy.
In times of disruption, the business analysts, risk management professionals, and other SMEs start to realize they don't have all the answers. So, hopefully, they meet regularly to review the changing landscape and sort through all the relevant variables, factors, relationships, and trends that are impacting their business, or are likely to. Much of the relevant intelligence will be in "soft" and "unstructured" formats (e.g., daily news, analyst research reports, social-media chatter) and from sources they may have never considered before (e.g., the new faddish things their children are obsessing over).
In disrupted markets, the SMEs will probably be the first to admit that they're groping for a valid consensus on the likely future. At these business crossroads, the premium will be on collaboration, sharing, and conversations among humans: within the enterprise, across the B2B value chain, with customers, and on social channels. Companies can't realistically delegate this high-level, ongoing pattern search (i.e., discerning the shape of the future) entirely to machines – not even, dare I say, to IBM Watson. The more disrupted the industry, the more intensive the collaborative business futurism must become.
Which markets fit the criteria of "disrupted easily, often, and/or severely?" Clearly, financial services should be at the top of the list, at least until we repeal the laws of macroeconomics. Also, any consumer-facing industry that relies heavily on the fickle fortunes of fashion. And now, with the onslaught of digital delivery, the full range of media, broadcasting, publishing, music, news, and so forth. And, of course, the retail industry, which is finding itself rapidly being "Amazoned" to kingdom come.
All of them sorely need prediction market services, with strong human SME guidance, to figure out whether they will still have a viable business model going forward.
July 25 Frictionless sandboxes? Tight governance keeps the analytic sands in the box
On-demand sandboxing is a key requirement in your big data initiatives. Data scientists need to quickly provision – and then just as quickly de-provision – very large big data analytic sandboxes. For example, spinning up petabyte sandboxes in the cloud to support fast development of MapReduce churn and upsell models is increasingly essential for optimization of marketing campaigns, ad placements, customer experiences, and the like.
One issue with on-demand sandboxing is the potential for (dare I say?) anarchy. Frictionless provisioning makes it far easier for ungoverned, nonconformant, siloed data-science sandboxes to spring up without central oversight. Many organizations' advanced analytics operations already suffer from excessive silo-ing across modeling teams. Often, we see separate data scientist teams for marketing, finance, supply chain, and other business functions. In such environments, there is often minimal sharing of development sandboxes, best practices, models, and personnel across teams, domains, and applications.
In addition, too many organizations implement haphazard model governance within and among their disparate data scientist initiatives. To avoid fostering an unmanageable glut of statistical models, your big data sandboxing environment should support strong life-cycle governance of models and other artifacts developed by your data scientists, regardless of what tools they use. At the very least, your data scientists should work in a sandboxing environment that enforces company-wide policies for model check-in/check-out, change tracking, version control, collaborative development, and validation.
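As a rough sketch of what check-in/check-out with change tracking could look like, consider the toy Python registry below. Everything in it is an assumption made for illustration: the `ModelRegistry` class, its method names, and the in-memory storage are hypothetical and do not correspond to any particular product's API.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry: one lock holder per model, append-only history."""
    versions: dict = field(default_factory=dict)  # model name -> list of (user, note, artifact)
    locks: dict = field(default_factory=dict)     # model name -> user holding the check-out

    def check_out(self, name, user):
        """Lock a model for one user and return its latest artifact (None if new)."""
        holder = self.locks.get(name)
        if holder is not None and holder != user:
            raise PermissionError(f"{name} is already checked out by {holder}")
        self.locks[name] = user
        history = self.versions.get(name, [])
        return history[-1][2] if history else None

    def check_in(self, name, user, artifact, note):
        """Record a new version with its author and change note, then release the lock."""
        if self.locks.get(name) != user:
            raise PermissionError(f"{user} has not checked out {name}")
        self.versions.setdefault(name, []).append((user, note, artifact))
        del self.locks[name]
        return len(self.versions[name])  # version number, starting at 1

# Made-up usage: one data scientist creates the first version of a churn model.
registry = ModelRegistry()
registry.check_out("churn", "alice")
v1 = registry.check_in("churn", "alice", {"coef": [0.4, 1.2]}, "initial fit")
```

The design choice worth noting is that the history is append-only: a check-in never overwrites an earlier version, so every change stays attributable to a person and a note no matter which modeling tool produced the artifact.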
Frictionless need not be the same as "free-for-all," and discipline need not stifle data-scientist creativity. Your sandboxing platforms and modeling tools should ensure consistent governance, automation, and managed collaboration across multidisciplinary teams working on your most challenging big data analytics initiatives.
July 27 NoSQL = no-disk?
NoSQL is an innovative new database ecosystem with few legacy approaches to constrain its evolution. It's developing in exciting new ways.
I noticed that Amazon Web Services had recently added a high-performance option for EC2 that dedicates NoSQL-database instances to solid state drives (SSDs). What caught my attention was their positioning of the option for acceleration of real-time-intensive, high-I/O applications of Cassandra and MongoDB. Among the real-time NoSQL use cases they called out are interactive web and mobile applications that rely on instantaneous response to user clicks and gestures. See the AWS blog for more details.
This latest announcement came on the heels of AWS' go-live a few months back of the DynamoDB service, a fully managed cloud NoSQL database that enables real-time performance through SSD storage and synchronous replication across their cloud service. AWS says the service is built to maintain consistent real-time latencies at any scale.
I haven't tested that performance claim in the lab, but it got me thinking. Is rotating storage a non-starter in the new world of NoSQL? Will the real-time, unstructured-data, read-intensive, high-IOPS requirements that are driving adoption of Cassandra and other NoSQL approaches spur adoption of SSD? Will SSD-based persistence, accelerating processing of the massive hash tables that constitute many NoSQL databases, take deeper root sooner in this space than in the rest of the big data arena? Many Hadoop deployments are batch-oriented, a fact that may retard adoption of SSD in that segment of the big data ecosystem.
What do you think?
At the end of the week, I look back over the 20-odd quick-hit themes I've introduced in the past 3 months, and have already drafted 3 new ones for the coming week (plus new chapters in a couple of other, established ones). Stay tuned.