The Enterprise Data Warehouse is Virtualizing into the Big-Data Cloud

Big Data Evangelist, IBM

What is an enterprise data warehouse (EDW)? You’d think the industry would have achieved some consensus on this topic by now, but you’d be mistaken.

The EDW holy wars continue even as big data pushes EDWs, however defined, to a more narrowly circumscribed niche. Some have argued that an EDW requires a database management system (DBMS), but is not, in itself, a DBMS. I’ve heard it said that a DBMS only becomes an EDW when it incorporates a schema and stores data. Still others argue that an EDW is something entirely distinct from a DW (without the “enterprise” modifier), a data mart, or an operational data store (ODS).

At heart, all of these concepts point to the foundation of any analytics initiative: an infrastructure for persisting, preparing and delivering intelligence to downstream applications. Fundamentally, a traditional DW is an analytics-optimized DBMS for storing and managing structured data. Its architecture usually incorporates some variant or blend of relational and/or dimensional logical schemas, as well as columnar and/or row-based physical structure and disk-based and/or in-memory persistence. Adding the “E” signifies that the DW:

  • Provides an analytics-optimized information persistence and delivery layer.
  • Aggregates information into integrated, nonvolatile, time-variant repositories under unified governance.
  • Organizes information into subject-area data marts that correspond with one or more business, process and/or application domains.
  • Supports flexible deployment topologies such as centralized, hub-and-spoke, federated, independent data marts and ODSes.
  • Enables unified conformance and governance of detailed, aggregated and derived information, as well as associated metadata and schemas, by business stakeholders.
  • Extracts, loads and consolidates information from sources through various approaches.
  • Governs the controlled distribution of information to various downstream repositories, applications and consumers.
  • Maintains the availability, reliability, scalability, load balancing, mixed workload management, backup and recovery, security and other robust platform features necessary to meet the most demanding, changing enterprise mix of analytics, data management and decision support workloads.

This points to a key trend in EDW evolution: the continued transformation of these infrastructures away from traditional centralized and hub-and-spoke topologies toward the new worlds of virtualized cloud architectures. The trend is toward virtualized big-data clouds that are geared both to the traditional EDW roles supporting BI and operational reporting, and to the new world of advanced analytics for social media analytics, sentiment analysis and many other compute-intensive functions. The EDW itself is evolving away from a single master “schema” and more toward a semantic abstraction layer and use of distributed in-memory information as a service. Under this new paradigm, the next-generation EDW will support virtualized access to the disparate schemas of the relational, dimensional and other repositories that constitute a logically unified cloud-based resource.

Future EDWs may totally lack the traditional underpinning of structured RDBMSes, especially as Hadoop, NoSQL and other approaches support virtualized data persistence. For traditional EDW professionals, that trend is a bit scary. They approach data virtualization in much the same way that astronomers contemplate dark matter. They know it’s essential to bind the entire sprawling cosmos into a unified system, but they’d prefer to train their telescopes on the brightest big-data clusters (Hadoop, NoSQL, etc.) instead.

I recently came across an article that introduces EDW professionals to the age-old topic of data virtualization (aka, data abstraction, data federation, enterprise information integration, etc.). It’s a good discussion, but doesn’t break any new ground. I couldn’t help thinking that it’s almost identical to data virtualization as it was discussed in the very recent past, before the term “big data” was dropped into the conversation. Beyond adding “graph,” “key/value” and “document” to the heterogeneity of databases being virtualized, this is nothing new.

That’s actually not a bad thing. If enterprises incorporate data virtualization middleware into their big-data strategies, they can ensure coexistence and interoperability of established technologies alongside Hadoop, in-memory, graph, document, key/value and other new approaches for storing, managing and analyzing new types of data. You can preserve and even expand the distributed, multi-tier, heterogeneous and agile nature of your big-data environment if you implement a virtualization capability in middleware, in the access layer, and in the management infrastructure. You can even mix and match different big-data “form factors”—software, appliance, cloud/SaaS, etc.—through virtualization.

Virtualization provides a unified interface to disparate resources, so that you can change, scale and evolve the back-end without breaking interoperability with tools and applications. One of the key enablers of big-data virtualization is the semantic abstraction layer, which enables simplified access to the disparate schemas of the RDBMS, Hadoop, NoSQL, columnar and other data management platforms that constitute a logically unified data/analytic resource. As an integration architecture, virtualization ensures logically unified access, modeling, deployment, optimization and management of big data as a heterogeneous resource. It is key if we’re to ensure that the fast-evolving big-data platform can meet all of these emerging imperatives:

  • Provide an analytic resource of elastic, fluid topology
  • Provide an all-consuming resource that ingests information originating in any source, format and schema
  • Provide a latency-agile resource that persists, aggregates and processes any dynamic mix of at-rest and in-motion information
  • Provide a federated resource that sprawls within and across value chains, spanning both private and public clouds, across the Smarter Planet
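
To make the idea of a semantic abstraction layer a little more concrete, here is a minimal Python sketch. It is an illustration only, not how any particular virtualization product works. The class names (VirtualSource, SemanticLayer), the two toy back ends and their field names are all hypothetical. The point it shows is that consumers query logical fields such as "customer" and "revenue", while each back end keeps its own physical schema behind a mapping.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Record = Dict[str, Any]

# Two hypothetical back ends with different physical schemas (toy data).
RELATIONAL_ROWS: List[Record] = [
    {"cust_id": 1, "cust_nm": "Acme", "rev_usd": 1200.0},
    {"cust_id": 2, "cust_nm": "Globex", "rev_usd": 800.0},
]
DOCUMENT_STORE: List[Record] = [
    {"_id": "c-3", "profile": {"name": "Initech"}, "metrics": {"revenue": 450.0}},
]


class VirtualSource:
    """One back end plus the mapping from its physical schema to logical fields."""

    def __init__(
        self,
        name: str,
        fetch: Callable[[], Iterable[Record]],
        mapping: Dict[str, Callable[[Record], Any]],
    ) -> None:
        self.name = name
        self.fetch = fetch
        self.mapping = mapping

    def rows(self) -> Iterable[Record]:
        # Translate each physical record into the shared logical schema.
        for record in self.fetch():
            yield {field: extract(record) for field, extract in self.mapping.items()}


class SemanticLayer:
    """Unified query interface over every registered source."""

    def __init__(self) -> None:
        self.sources: Dict[str, VirtualSource] = {}

    def register(self, source: VirtualSource) -> None:
        # Registering an existing name again swaps the back end in place;
        # consumers that only use logical fields are unaffected.
        self.sources[source.name] = source

    def query(
        self,
        fields: List[str],
        where: Optional[Callable[[Record], bool]] = None,
    ) -> List[Record]:
        results: List[Record] = []
        for source in self.sources.values():
            for row in source.rows():
                if where is None or where(row):
                    results.append({f: row[f] for f in fields})
        return results


layer = SemanticLayer()
layer.register(VirtualSource(
    "warehouse",
    lambda: RELATIONAL_ROWS,
    {"customer": lambda r: r["cust_nm"], "revenue": lambda r: r["rev_usd"]},
))
layer.register(VirtualSource(
    "documents",
    lambda: DOCUMENT_STORE,
    {
        "customer": lambda r: r["profile"]["name"],
        "revenue": lambda r: r["metrics"]["revenue"],
    },
))

# The consumer never sees cust_nm, rev_usd or the nested document fields.
big_accounts = layer.query(["customer"], where=lambda row: row["revenue"] >= 500)
print(big_accounts)
```

In this sketch, replacing the "warehouse" source with a differently structured store only requires registering a new mapping under the same name; the query at the bottom stays unchanged. A real deployment would also push filters down to each back end rather than scanning every record in the middle tier, which this toy version does not attempt.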

Clearly, the degree to which you need data virtualization depends in part on the complexity of your business requirements and technical environment, including the legacy investments with which it must all interoperate. It also depends on your tolerance for risk, complexity and headaches. Organizations are hybridizing any and all of these big-data technologies.