The Next Big "H" in Big Data: Hybrid Architectures

Fit-for-purpose big data platforms will play together under virtualization

Big Data Evangelist, IBM

In spite of what you may have heard, Hadoop is not the sum total of big data. Another big data "H"—hybrid—is becoming dominant, and Hadoop is an important (but not all-encompassing) component of it. In the larger evolutionary perspective, big data is evolving into a hybridized paradigm under which Hadoop, massively parallel processing (MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing, NoSQL, document databases, and other approaches support extreme analytics in the cloud.

Hybrid architectures address the heterogeneous reality of big data environments and respond to the need to incorporate both established and new analytic database approaches into a common architecture. The fundamental principle of hybrid architectures is that each constituent big data platform is fit-for-purpose to the role for which it's best suited. These big data deployment roles may include any or all of the following:

  • Data acquisition
  • Collection
  • Transformation
  • Movement
  • Cleansing
  • Staging
  • Sandboxing
  • Modeling
  • Governance
  • Access
  • Delivery
  • Interactive exploration
  • Archiving

In any role, a fit-for-purpose big data platform often supports specific data sources, workloads, applications, and users.

Hybrid is the future of big data because users increasingly realize that no single type of analytic platform is always best for all requirements. Also, platform churn—plus the heterogeneity it usually produces—will make hybrid architectures more common in big data deployments. The inexorable trend is toward hybrid environments that address the following enterprise big data imperatives:

  • Extreme scalability and speed: The emerging hybrid big data platform will support scale-out, shared-nothing massively parallel processing, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management.
  • Extreme agility and elasticity: The hybrid big data environment will persist data in diverse physical and logical formats across a virtualized cloud of interconnected memory and disk that can be elastically scaled up and out at a moment's notice.
  • Extreme affordability and manageability: The hybrid environment will incorporate flexible packaging/pricing, including licensed software, modular appliances, and subscription-based cloud approaches.

Hybrid architectures are already widespread in real-world big data deployments. The most typical are three-tier architectures, also called "hub-and-spoke." These environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the data acquisition, collection, staging, preprocessing, and transformation layer; relational-based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction layer.
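
As a rough illustration, the three-tier layout described above can be written down as a small lookup structure. This is only a sketch: the `Tier` class, the tier names, and the `platform_for` helper are hypothetical names invented for this example, and the role assignments simply mirror the example in the preceding paragraph rather than any product configuration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Tier:
    """One tier of a hypothetical three-tier (hub-and-spoke) layout."""
    name: str
    platform: str            # platform type, as named in the article
    roles: Tuple[str, ...]   # deployment roles this tier is fit for


# Assumed layout mirroring the example in the text; adjust to your environment.
TIERS = (
    Tier("staging", "Hadoop",
         ("acquisition", "collection", "staging",
          "preprocessing", "transformation")),
    Tier("hub", "MPP EDW", ("governance",)),
    Tier("access", "in-memory database", ("access", "interaction")),
)


def platform_for(role: str) -> str:
    """Return the platform type assigned to a deployment role."""
    for tier in TIERS:
        if role in tier.roles:
            return tier.platform
    raise KeyError(f"no tier handles role {role!r}")
```

Calling `platform_for("staging")` returns `"Hadoop"` under this assumed layout, while a role that no tier covers raises `KeyError`, which is one simple way to surface a gap in the fit-for-purpose mapping.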

The complexity of a hybrid architecture depends on the range of sources, workloads, and applications you're trying to support. In the back-end staging tier, you might need different preprocessing clusters for each of the disparate sources: structured, semi-structured, and unstructured. In the hub tier, you may need disparate clusters configured with different underlying data platforms (RDBMS, stream computing, HDFS, HBase, Cassandra, NoSQL, and so on) and corresponding metadata, governance, and in-database execution components. And in the front-end access tier, you might require various combinations of in-memory, columnar, OLAP, dimensionless, and other database technologies to deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.

Ensuring that hybrid big data architectures stay cost-effective demands the following multipronged approach to optimization of distributed storage:

  • Apply fit-for-purpose databases to particular big data use cases: Hybrid architectures spring from the principle that no single data storage, persistence, or structuring approach is optimal for all deployment roles and workloads. For example, no matter how well-designed the dimensional data model is within an OLAP environment, users eventually outgrow its constraints and demand more flexible decision support. Other database architectures (such as columnar, in-memory, key-value, graph, and inverted indexing) may be more appropriate for such applications, but they are not generic enough to address other, broader deployment roles.
  • Align data models with underlying structures and applications: Hybrid architectures leverage the principle that no fixed big data modeling approach—physical and logical—can do justice to the ever-shifting mix of queries, loads, and other operations. As you implement hybrid big data architectures, make sure you adopt tools that let you focus on logical data models, while the infrastructure automatically reconfigures the underlying big data physical data models, schemas, joins, partitions, indexes, and other artifacts for optimal query and data load performance.
  • Intelligently compress and manage the data: Hybrid architectures should allow you to apply intelligent compression to big data sets to reduce their footprint and make optimal use of storage resources. Also, some physical data models are more inherently compact than others (e.g., tokenized and columnar storage are more efficient than row-based storage), just as some logical data models are more storage-efficient (e.g., third-normal-form relational is typically more compact than large denormalized tables stored in a dimensional star schema).
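
To make the point about compact physical models concrete, here is a toy sketch of why tokenized and columnar layouts can take less space than row-based ones. The sample rows and the helper names (`to_columns`, `run_length_encode`, `tokenize`) are invented for illustration, and the encodings shown are generic textbook techniques, not the storage format of any particular product.

```python
from itertools import groupby

# Made-up sample data: (date, country, amount) stored row by row.
rows = [
    ("2013-06-01", "US", 10),
    ("2013-06-01", "US", 12),
    ("2013-06-01", "DE", 7),
    ("2013-06-02", "DE", 7),
    ("2013-06-02", "DE", 9),
    ("2013-06-02", "US", 11),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (a columnar layout)."""
    return [list(col) for col in zip(*rows)]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def tokenize(values):
    """Dictionary-encode values: return (dictionary, integer tokens)."""
    dictionary, index, tokens = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(index[value])
    return dictionary, tokens


dates, countries, amounts = to_columns(rows)
date_runs = run_length_encode(dates)                # six dates become two runs
country_dict, country_tokens = tokenize(countries)  # strings become small ints
```

Once the values of one column sit next to each other, the six date strings collapse to two runs and the country strings to a two-entry dictionary plus small integers. In a row-based layout those repeats are interleaved with other fields, so the same tricks are harder to apply.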

Yes, more storage tiers can easily mean more tears. The complexities, costs, and headaches of these multi-tier hybridized architectures will drive you toward greater consolidation, where it's feasible.

But it may not be as feasible as you wish.

The hybrid big data environment will continue the long-term trend away from centralized and hub-and-spoke topologies toward the new worlds of cloud-oriented and federated architectures. The hybrid platform is evolving away from a single master "schema" and toward database virtualization behind a semantic abstraction layer. Under this new paradigm, the hybrid big data environment will require virtualized access to the disparate schemas of the relational, dimensional, and other constituent DBMSs and repositories that together form a logically unified cloud-oriented resource.

Our best hope is that the abstraction/virtualization layer of the hybrid environment will reduce tears, even as tiers proliferate. If it can provide your big data professionals with logically unified access, modeling, deployment, optimization, and management of this heterogeneous resource, wouldn't you go for it?

The architectural centerpiece of this new hybridized landscape must be a standard query-virtualization or abstraction layer that supports transparent SQL access to any and all back-end platforms. SQL will continue to be the lingua franca for all analytics and transactional database applications. Consequently, big data solution providers absolutely must allow SQL developers to transparently tap into the full range of big data platforms, current and future, without modifying their code.
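
The idea of a query-virtualization layer can be sketched in miniature with Python's built-in `sqlite3` module. This is an analogy under loud assumptions, not a big data implementation: two attached in-memory SQLite databases stand in for separate back-end platforms, and the schema names (`warehouse`, `staging`), the `sales` tables, the `all_sales` view, and the `total_by_region` helper are all hypothetical.

```python
import sqlite3

# One connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")

# Two attached in-memory databases stand in for separate back-end platforms.
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS staging")

conn.execute("CREATE TABLE warehouse.sales (region TEXT, amount INTEGER)")
conn.execute("CREATE TABLE staging.sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO warehouse.sales VALUES (?, ?)",
                 [("east", 100), ("west", 250)])
conn.executemany("INSERT INTO staging.sales VALUES (?, ?)",
                 [("east", 40)])

# A TEMP view may span attached databases; it acts as the logical,
# unified table that application SQL is written against.
conn.execute("""
    CREATE TEMP VIEW all_sales AS
    SELECT region, amount FROM warehouse.sales
    UNION ALL
    SELECT region, amount FROM staging.sales
""")


def total_by_region(connection):
    """Application query: plain SQL with no knowledge of the back ends."""
    return connection.execute(
        "SELECT region, SUM(amount) FROM all_sales "
        "GROUP BY region ORDER BY region"
    ).fetchall()


totals = total_by_region(conn)  # [('east', 140), ('west', 250)]
```

The query in `total_by_region` never names a back end. If data moved from one store to another, only the view definition would change, which is the property the article asks of a real virtualization layer at much larger scale.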

Unfortunately, the big data industry still lacks a consensus query-virtualization approach. Today's big data developers must wrangle with a plethora of SQL-like languages for big data access, query, and manipulation, including HiveQL, CassandraQL, JAQL, Sqoop, SPARQL, Shark, and DrQL. Many, but not all, of these are associated with a specific type of big data platform, most often Hadoop. I'm including IBM BigSQL (currently in Technology Preview) in this list of industry initiatives.

The fact that we refer to many of these initiatives as "SQL-on-Hadoop" is a danger sign. We, as an industry, need to go one step beyond this idea. The big data arena threatens to split into diverse, siloed platforms unless we bring SQL fully into it all as a lingua franca.

Siloed query languages and frameworks threaten to ramp up the cost, complexity, incompatibility, risk, and unmanageability of multiplatform big data environments. And the situation is likely to grow more fragmented as big data hybrid deployments predominate.

The bottom line is that hybrid big data environments will degenerate into a mess of incompatible platforms unless the industry puts a renewed focus on standardization.

What do you think? Let me know in the comments.