Here are the quick-hit ponderings that I posted on various LinkedIn big data discussion groups this past week. I opened up three new themes–enterprise content warehouse, business process optimization, and big BI–and further developed the established themes of big data's optimal deployment model and NoSQL = no-disk. Here's what emanated from my cerebral cortex:
August 13: Enterprise content warehouse?
What's the core role of an enterprise data warehouse (EDW)? You might argue that the EDW's core function is as a governance hub: supporting policy-based persistence and management of an organization's official system of records, the proverbial "single version of the truth," for delivery to downstream business intelligence and analytics applications.
Usually, we all assume that the official records are structured data sets, hence that the EDW must be built on a relational database, or on some columnar or dimensional variant of relational. But is the notion of an all-structured "single version of the truth" still valid in the era of big data, where what we all refer to as the "EDW" also pulls in data from semi-structured and unstructured sources (and may store that data in native formats, or as binary large objects, though just as often transforming it to structured formats for downstream analytics)?
Architecturally, then, the EDW has already evolved into what we might call an "enterprise content warehouse." In its core role as a customer data hub, the modern EDW often links the official customer record to un/semi-structured objects such as the customer's digital files, folders and images, sourced from enterprise case management systems. The onslaught of social-sourced customer intelligence is also, no doubt, expanding the notion of what valuable non-relational data (e.g., sentiment, geospatial, behavioral) should be linked to the "single version of the truth."
Is the term "EDW" a vestige of the olden days? Should it make way for the new "ECW" of the big data era?
August 14: Big Data's optimal deployment model? Federation and its discontents
In classic enterprise data warehousing (EDW), the preferred topology is often some centralized model, due to advantages in performance, scalability, governance, security, reliability and management relative to federation and other decentralized approaches.
Even in today's big data era, federated deployment goes against the grain of many data analytics professionals, who prefer (all other factors considered) to consolidate as much data as possible. Considering that it's infeasible to store more than several dozen terabytes on a single Hadoop node, for example, organizations take the consolidation imperative up a level: put as much data as possible on a single multi-node cluster, ideally on a single vendor's platform (IBM, hopefully), and managed through an integrated stack of tools (once again, we've got your back there too).
Is federation necessarily a dirty word in big data generally, or Hadoop specifically? That may be overstating the aversion to this topology, which has ample use cases and examples in the business world where it often supplements EDW (would it be gauche of me to point to the fact that IBM has many customers who've been doing data federation for years, often in the context of a data warehousing program?).
Federation has started to come to Hadoop, but mostly in the speculative future tense. Most Hadoop deployments are in the single (albeit often very large multi-server) cluster camp. That fact is due, in part, to the historical lack of federation at the HDFS level. But the Hadoop 2.0 specifications support the ability to horizontally federate multiple independent HDFS namenodes and namespaces.
It will be interesting to see whether and to what extent HDFS federation is adopted by users pushing Hadoop's scalability barriers. What do you think are the prime use cases for that?
August 15: Business process optimization?
Next best action is a hot focus area under big data, advanced analytics, digital marketing, smarter commerce and other business imperatives. Enterprises have been doing next best action, in various forms, for years. Many companies continue to scale up and build out their next-best-action infrastructures, integrating a wide range of technologies.
Most of these next-best-action applications might be regarded as a species of business process optimization. What is business process optimization? My take is that it refers to next best action used to improve back-office outcomes (such as speed, efficiency, quality, agility and profitability), as well as (optionally) customer-facing outcomes (such as retention, upsell, satisfaction, response and acceptance rates).
Down deep, next best action refers to best practices for proactively guiding and optimizing any or all steps in one or more business processes. Process automation demands decision automation, which relies on programmatic elements, rather than humans exercising judgment, as next-best-action decision agents. It is enabled through "decision engines" of all shapes and sizes, including rules engines, workflow engines and recommendation engines, and powered by business rules, advanced analytics, orchestration models and other process content. Decision engines are usually set up to take as many automated decisions as they can in accordance with complex rulebases. They offload the most routine, repetitive, cut-and-dried decisions from human decision agents. But they still must escalate the “exception conditions” to people for manual resolution. Human beings, as “exception handlers,” are still very much in the loop on most automated business processes.
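To make the automate-or-escalate pattern concrete, here's a minimal Python sketch of a rules-driven decision engine. It's illustrative only: the `Rule` class, the `decide` function, the sample rules and the case fields (`type`, `amount`, `tenure_years`) are all hypothetical names I've made up for this example, not any product's API. Real rulebases are far larger and usually sit alongside analytics and workflow components.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Rule:
    name: str
    applies: Callable[[Dict], bool]  # does this rule cover the case?
    action: str                      # next best action to take automatically


def decide(case: Dict, rules: List[Rule]) -> Tuple[str, str]:
    """Automate the decision if any rule covers the case; otherwise escalate."""
    for rule in rules:
        if rule.applies(case):
            return ("automated", rule.action)
    # No rule matched: this is an exception condition for a human to resolve.
    return ("escalated", "no rule matched; route to a human exception handler")


# Hypothetical rulebase covering only routine, cut-and-dried decisions.
RULES = [
    Rule("small_refund",
         lambda c: c["type"] == "refund" and c["amount"] <= 50,
         "approve_refund"),
    Rule("loyal_upsell",
         lambda c: c["type"] == "inquiry" and c["tenure_years"] >= 3,
         "offer_upgrade"),
]

# Hypothetical incoming cases.
CASES = [
    {"type": "refund", "amount": 20, "tenure_years": 1},
    {"type": "refund", "amount": 900, "tenure_years": 5},
    {"type": "inquiry", "amount": 0, "tenure_years": 4},
]

for case in CASES:
    print(decide(case, RULES))
```

The small refund and the long-tenure inquiry are handled automatically, while the large refund falls through every rule and lands with a person. That fall-through is the design point: the engine only takes the decisions the rulebase explicitly covers.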
Decision management is the practice of determining what blend of decision automation and decision support is necessary to optimize next best action for the business process(es) of interest: customer-facing, back-office, or some combination thereof. To the extent that it makes sense to maximize decision automation for next best action, you should automate decisions so that only the higher-value exception conditions are escalated to one or more humans for judgment-based responses. To the extent that decision support is necessary, you should leverage your business intelligence, collaboration, knowledge management and human workflow environments in your next best action environment.
What do you think?
August 16: Big BI?
Keeping your business intelligence (BI) deployment simple can be a challenge. This is especially true as your user base grows and as the range of data sources, structured reports, metrics, dashboards, visualizations, and downstream applications grows with it. At some point in the evolution of your BI environment, it may have grown far too big for its own britches, and you may have to do some serious weed-whacking. You might need to apply a little bit of BI back in on itself to figure out which tables, reports and the like are still being used, how often, and by whom. This exercise should allow you to determine which should be deep-sixed so that you can reduce requirements and costs for hardware, software licenses, data center facilities and staff.
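Here's a rough Python sketch of what that "BI on BI" weed-whacking might look like, assuming you can export an access log of report name, user and access date from your BI platform. The log rows, the report and user names, the function names and the 180-day idle threshold are all hypothetical choices for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical access-log rows: (report name, user, access date).
ACCESS_LOG = [
    ("sales_daily", "ana", date(2012, 8, 10)),
    ("sales_daily", "raj", date(2012, 8, 14)),
    ("sales_daily", "ana", date(2012, 8, 15)),
    ("churn_q1", "lee", date(2011, 3, 2)),
    ("inventory", "raj", date(2012, 1, 20)),
]


def summarize_usage(log):
    """Per report: how often it was used, by whom, and when it was last used."""
    stats = defaultdict(lambda: {"hits": 0, "users": set(), "last": date.min})
    for report, user, day in log:
        entry = stats[report]
        entry["hits"] += 1
        entry["users"].add(user)
        entry["last"] = max(entry["last"], day)
    return dict(stats)


def retirement_candidates(stats, today, max_idle_days=180):
    """Reports whose most recent access is older than the idle threshold."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, s in stats.items() if s["last"] < cutoff)


usage = summarize_usage(ACCESS_LOG)
for name, s in sorted(usage.items()):
    print(name, s["hits"], sorted(s["users"]), s["last"])
print(retirement_candidates(usage, today=date(2012, 8, 16)))
```

One caveat: a report that nobody has ever opened won't show up in an access log at all, so in practice you'd also want to compare the log against the full report catalog before deciding what to deep-six.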
Most real-world BI serves its core functions quite well in "small data" territory. As a general rule, you should start and stay simple on your BI strategies and deployments unless you have a compelling reason to build a more complex BI and data warehousing system. Most BI is just focused on delivering basic reports, and you may not need fancy dashboards, predictive models or continuous data updates. That's because you may have just one or two data sources and only a few users. Keeping BI costs under control boils down to doing a good job of identifying the things that an organization really needs as part of the process of gathering requirements and building a BI business case. Not overbuying on hardware is one good way to achieve the cost-containment goal.
What do you think?
August 17: NoSQL = no-disk? Quantum computing in big data's future
NoSQL is the focus of ongoing boundary-pushing in big data evolution, and not just in storage.
In the very long run, it's becoming clear (at least to me) that NoSQL and other big data approaches are converging into some sort of platform singularity–a unified architecture–that none of us can quite put our virtual fingers on yet. It will involve the heart of today's big data–scale-out, shared-nothing massively parallel processing (MPP), optimized storage, dynamic query optimization, mixed workload management, hardware optimization–and then some. All SSD will be the least of it. The future big data fabric will persist complex content transparently, in diverse physical and logical formats, to an abstract, seamless grid of interconnected memory and disk resources; and deliver intelligence with sub-second delay to consuming applications. It will optimize parallel execution of advanced analytics and transactions on-demand and continuously across distributed processing, memory, storage and other resources. It will ensure diverse application service levels through an end-to-end, policy-driven, latency-agile grid.
One thing none of us has quite got our heads around is how we'll incorporate quantum computing–a growing inevitability–into big data. With the advances in this area, many of them being developed by IBM Research, you have to wonder how the distributed runtime execution environment for big data analytics will evolve in the next 10-20 years. Will CPUs, cores and other discrete processing elements in the big data fabric also be virtualized through some cosmic "entanglement"? (Here's where I'm, perhaps, a bit over my head on the "plumbing" of quantum computing.)
For those of you who think I'm talking purely blue sky, check out this week's IBM press release on our exciting work in the new field of "spintronics" for advanced storage or this recent article on other work we're doing related to quantum computing.
If we put on our thinking caps, what will quantum-powered big-data analytics allow us to do that we can't do now with MPP EDW, Hadoop, NoSQL and other approaches? For starters, complex real-time optimizations involving zillions of variables and simultaneous calculations will become a piece of cake. The most demanding of today's multi-scenario constraint-based optimization challenges will give way to seemingly unlimited processing power. Data scientists will be able to test so many alternate modeling scenarios concurrently that the most extraordinarily deep analyses will be indistinguishable from magic (to borrow the late Arthur C. Clarke's description of any "sufficiently advanced technology").
When will this all become feasible, technologically (I'm not even going to speculate about when you might see it embedded in generally available products)? Mark Ketchen of IBM’s Watson Research Center is on record as saying "I used to think it was 50 [years]. Now I’m thinking like it’s 15 or a little more. It’s within reach. It’s within our lifetime. It’s going to happen."
Within our lifetime? I'm almost 54. I suspect that it may happen not only while I'm alive, but while I'm still in the pink of my career in this industry. Technological advances are snowballing and converging at seeming light speed.
Does that make me a first-class space cadet? What do you think?
At the end of the week, I'm still getting my head around spintronics. And around a Wall Street Journal article I just read over breakfast: "Future of Data: Encoded in DNA".
What's NOT a potential data storage device?