Optimal Deployment for Big Data

Hadoop has evolved as a heterogeneous enterprise data warehouse for optimized modeling, deployment, and management

Big Data Evangelist, IBM

If the Apache Hadoop data warehouse is the future of enterprise big data, what is its optimal topology? Does the time-proven three-tier topology of enterprise data warehouses apply in this brave new world? Does it make sense to partition Hadoop clusters into separate specialized tiers of access, hub, and staging nodes? I'm not at all sure.

Revisiting the three-tier topology

As with traditional data warehouses, centralization of all Hadoop data warehouse functions onto single clusters—the one-tier topology—has its advantages in terms of simplicity, governance, control, and workload management. Hub-and-spoke Hadoop data warehouse architectures become important when you need to scale back-end transformations and front-end queries independently of each other, and perhaps also provide data scientists with their own analytic sandboxes for exploration and modeling.

However, the huge range of access points, applications, workloads, and data sources for any future Hadoop data warehouse demands an architectural flexibility that traditional data warehouses, with their operational business intelligence (BI) focus, have rarely needed. In the back-end staging tier, different preprocessing clusters might be needed for each of the disparate sources: structured, semi-structured, and unstructured.

In the hub tier, disparate clusters configured with different underlying data platforms may be needed. These underlying platforms include relational database management systems (RDBMSs), stream computing, the Hadoop Distributed File System (HDFS), Apache HBase, Apache Cassandra, and other NoSQL stores. Corresponding metadata, governance, and in-database execution components may also be necessary.

And in the front-end access tier, various combinations of in-memory, columnar, online analytical processing (OLAP), dimensionless, and other database technologies might be required. These technologies deliver the requisite performance on diverse analytic applications, ranging from operational BI to advanced analytics and complex event processing.

Yes, more tiers can easily mean more tears. The complexities, costs, and headaches of these multitier, hybridized architectures can drive you toward greater consolidation, where it's feasible. But it may not be as feasible as you wish.

The Hadoop data warehouse is expected to continue the long-term trend in data warehouse evolution: away from centralized and hub-and-spoke topologies and toward cloud-oriented and federated architectures. The Hadoop data warehouse itself is evolving away from a single master schema and toward database virtualization behind a semantic abstraction layer. Under this new paradigm, the Hadoop data warehouse will require virtualized access to the disparate schemas of the relational, dimensional, and other database management systems (DBMSs), as well as the other repositories, that together constitute a logically unified cloud-oriented resource.

Our best hope is that the abstraction and virtualization layer of the Hadoop data warehouse architecture will reduce tears, even as tiers proliferate. If it can provide big data professionals with logically unified access, modeling, deployment, optimization, and management of this heterogeneous resource, wouldn't you go for it?
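To make the idea of a semantic abstraction layer concrete, here is a minimal Python sketch. It assumes the layer is nothing more than a catalog that maps logical dataset names to whichever physical repository holds them. The class names, backend labels, and locations are all hypothetical, and a real virtualization layer would also have to translate queries and reconcile schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalSource:
    """Where a logical dataset actually lives (all values are illustrative)."""
    backend: str   # e.g. "rdbms", "hdfs", "hbase"
    location: str  # backend-specific table name or path


class SemanticLayer:
    """Hypothetical catalog: one logical namespace over many repositories."""

    def __init__(self):
        self._catalog = {}

    def register(self, logical_name, source):
        # Bind a logical dataset name to the repository that stores it.
        self._catalog[logical_name] = source

    def resolve(self, logical_name):
        # Callers ask for data by logical name only; the layer picks the backend.
        try:
            return self._catalog[logical_name]
        except KeyError:
            raise LookupError(f"no source registered for {logical_name!r}") from None


layer = SemanticLayer()
layer.register("customers", PhysicalSource("rdbms", "crm.customer_dim"))
layer.register("clickstream", PhysicalSource("hdfs", "/data/raw/clicks"))
```

An analyst working against this layer would call layer.resolve("clickstream") and never need to know which tier or platform answered. That is the sense in which tiers can multiply without the user seeing them.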

Federation and its discontents

In classic enterprise data warehousing, the preferred topology is often some centralized model because of advantages in performance, scalability, governance, security, reliability, and management relative to federation and other decentralized approaches. Even in today's big data era, federated deployment goes against the grain of many data analytics professionals, who prefer, all other factors being equal, to consolidate as much data as possible.

Because storing more than several dozen terabytes on a single Hadoop node is infeasible, for example, organizations take the consolidation imperative up a level by putting as much data as possible on a single multi-node cluster. Ideally, the data is placed on a single vendor's platform (IBM, hopefully) and managed through an integrated stack of tools.

Is federation necessarily a dirty word in big data generally, or Hadoop specifically? That description may be overstating the aversion to this topology, which has ample use cases and examples in the business world where it often supplements enterprise data warehousing. Would it be gauche of me to point to the fact that IBM has many customers who have been doing data federation for years, often in the context of a data warehousing program?

Federation has started to come to Hadoop, but mostly in the speculative future tense. Most Hadoop deployments are in the single-cluster camp, albeit often with very large multi-server clusters. That fact is due, in part, to the historical lack of federation at the HDFS level. But Hadoop 2.0 supports HDFS federation, which scales the name service horizontally across multiple independent NameNodes and namespaces. Whether and to what extent HDFS federation is adopted by business users pushing Hadoop's scalability barriers will be interesting to observe.
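One way to picture federated namespaces is as a client-side mount table that sends each path to the namespace responsible for it. The Python sketch below is a toy model of that routing idea only. It does not use any Hadoop API, and the mount points and namespace names are invented for illustration.

```python
import posixpath


class MountTable:
    """Toy router: maps path prefixes to independent namespaces (hypothetical)."""

    def __init__(self, mounts):
        # Normalize prefixes so "/warehouse/" and "/warehouse" behave the same.
        self._mounts = {prefix.rstrip("/") or "/": ns for prefix, ns in mounts.items()}

    def resolve(self, path):
        """Return the namespace whose mount point is the longest matching prefix."""
        path = posixpath.normpath(path)
        best = None
        for prefix, namespace in self._mounts.items():
            # Match whole path components only, so "/warehousex" is not "/warehouse".
            matches = prefix == "/" or path == prefix or path.startswith(prefix + "/")
            if matches and (best is None or len(prefix) > len(best[0])):
                best = (prefix, namespace)
        if best is None:
            raise LookupError(f"no namespace mounted for {path!r}")
        return best[1]


table = MountTable({
    "/warehouse": "nn-warehouse",        # illustrative namespace names
    "/warehouse/archive": "nn-archive",
    "/sandbox": "nn-sandbox",
})
```

With this table, a path under /warehouse/archive resolves to nn-archive, while other paths under /warehouse resolve to nn-warehouse. Each namespace can then grow on its own, which is the scalability argument for federation.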

Workload-optimized nodes

Workload-optimized hardware and software nodes are the key building blocks for every big data environment. In other words, appliances are the bedrock of all three Vs (volume, variety, and velocity) of the big data universe, regardless of whether a specific high-level topology is centralized, hub-and-spoke, federated, or some other configuration. Appliances are also a foundation for big data regardless of whether you deploy all these appliance nodes on premises or outsource some or all of them to a cloud software-as-a-service (SaaS) provider.

Massively parallel processing (MPP) data warehouse appliances such as IBM® Netezza® appliances are big data appliances, as is any vendor solution that pre-integrates and pre-optimizes Hadoop and other data analytics platforms for the three Vs. If so, what are the core categories of workloads that optimized big data appliances need to support? Those workloads fall into three principal categories: storage, processing, and development.

Big data storage

A big data appliance can be a core building block in an enterprise data storage architecture. Chief uses may be for archiving, governance, and replication, as well as for discovering, acquiring, aggregating, and governing multistructured content. The appliance should provide the modularity, scalability, and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network (SAN) architecture such as IBM provides.

Big data processing

A big data appliance should support massively parallel execution of advanced data processing, manipulation, analysis, and access functions. It should support the full range of advanced analytics and some functions traditionally associated with enterprise data warehouses, BI, and OLAP. It should have all the metadata, models, and other services needed to handle such core analytics functions as query, calculation, data loading, and data integration. Alternatively, it should handle a subset of these functions and interface through connectors with analytic platforms such as IBM Smart Analytics System and Netezza Analytics.

Big data development

A big data appliance should support big data modeling, mining, exploration, and analysis. The appliance should provide a scalable sandbox with tools that allow data scientists, predictive modelers, and business analysts to interactively and collaboratively explore rich information sets. It should also incorporate a high-performance analytic runtime platform on which these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel processor, memory, storage, and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining, and predictive modeling into new frontiers of Hadoop MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis, and other resource-intensive big data processing.

Many big data appliances will support combinations of two or all three of these types of workloads within specific nodes or specific clusters. Many handle low-latency and batch jobs with equal agility.
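The three workload categories above can be treated as capability flags on each node, as in the following Python sketch. The node names and their capabilities are made up for illustration, and the sketch assumes nothing about how any actual appliance advertises what it supports.

```python
from enum import Flag, auto


class Workload(Flag):
    """The three workload categories discussed above."""
    STORAGE = auto()
    PROCESSING = auto()
    DEVELOPMENT = auto()


# Hypothetical inventory: each node supports one or more workload categories.
nodes = {
    "archive-01": Workload.STORAGE,
    "mpp-01": Workload.STORAGE | Workload.PROCESSING,
    "sandbox-01": Workload.PROCESSING | Workload.DEVELOPMENT,
}


def nodes_for(required, inventory):
    """Return the names of nodes that support every workload in `required`."""
    return sorted(
        name for name, supported in inventory.items() if required in supported
    )
```

Asking for nodes_for(Workload.PROCESSING, nodes) returns both the MPP node and the sandbox node, while asking for storage and processing together returns only the MPP node. A scheduler could use the same test to place a job on a node optimized for it.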

A common architectural approach for transactions and analytics

Businesses everywhere are incorporating big data into their analytics infrastructure, tooling, and applications. However, analytics represents just one hemisphere in the world of database computing. The other one—transactions—is just as important, and increasingly depends on embedded analytic functions, such as clickstream analysis and decision automation, to ensure continuous optimization. In fact, next-best action depends intimately on analytics-optimized transactions, interactions, messages, and offers.

In the broader evolutionary picture, analytics and transactions will share a common big data infrastructure, encompassing storage, processing, memory, networking, and other resources. More often than not, these workloads will run on distinct performance-optimized integrated systems, but will interoperate through a common architectural backbone.

Deploying a big data infrastructure that does justice to both analytic and transactional applications can be challenging, especially when platforms that are optimized to handle each type of workload are lacking. But the situation is improving. A key milestone in the evolution of big data toward agile support for analytics-optimized transactions was October 9, 2012, when the IBM PureData™ System appliance family was released.

This family offers workload-specific expert integrated systems, combining hardware and software, for both analytics and transactions. IBM PureData System for Transactions provides new workload-optimized systems for transactions, IBM PureData System for Analytics offers data warehousing and advanced analytics, and IBM PureData System for Operational Analytics delivers real-time business intelligence, online analytical processing, and text analytics.

These big data platforms all incorporate the following core design principles of the IBM PureSystems™ family of integrated systems:

  • Patterns of expertise for built-in solution best practices
  • Scale-in, -out, and -up capabilities
  • Cloud computing–ready deployment
  • Clean-slate designs for optimal performance
  • Integrated management for maximum administrator productivity

Taken together, these principles enable the PureData platforms to realize fast business value, reduce total cost of ownership, and support maximum scalability and performance on a wide range of analytics and transactional workloads. These same principles are also the architectural backbone for the IBM PureApplication™ System cloud application and IBM PureFlex™ System integrated infrastructure platforms. The entire PureSystems platform family unites analytics, transactions, middleware, and operating environment technologies within a common converged architecture.


This article comprises a blog series by James Kobielus, big data evangelist at IBM. It is offered for publication in association with the Big Data and Enterprise Architecture Conference 2013, November 20–22 in Washington, D.C., sponsored by Data Management Forum and featuring a big data analytics presentation by James Kobielus.
