Workload-optimized Systems: Built for and Building the Big Data Cloud

Big Data Evangelist, IBM

Big data is evolving into a cloud ecosystem. For example, it’s clear that Hadoop has already proven its core role in the big data ecosystem: as a petabyte-scalable staging, transformation, pre-processing and refinery cloud for unstructured content and embedded execution of advanced analytics.

Workload-optimized hardware/software nodes are the key building block for every big data cloud. In other words, appliances, also known as expert integrated systems, are the bedrock of all three “Vs” of the big data universe, regardless of whether your specific high-level topology is centralized, hub-and-spoke, federated or some other configuration, and regardless of whether you’ve deployed all of these appliance nodes on premises or are outsourcing some or all of them to a cloud/SaaS provider.

Within the coming 2-3 years, expert integrated systems will become a dominant approach for enterprises to put Hadoop and other emerging big data approaches into production. Already, appliances are the principal approach in the core big data platform market: enterprise data warehousing solutions that implement massively parallel processing (as pioneered by IBM Netezza).

What are the core categories of workloads that we’ll need to build optimized big data appliances to support within cloud environments? Those workloads fall into the following principal categories:

  • Big-data storage: A big data appliance can be a core building block in an enterprise data storage architecture. Chief uses may be for archiving, governance and replication, as well as for discovering, acquiring, aggregating and governing multistructured content. The appliance should provide the modularity, scalability and efficiency of high-performance applications for these key data consolidation functions. Typically, it would support these functions through integration with a high-capacity storage area network architecture such as IBM provides.
  • Big-data processing: A big data appliance should support massively parallel execution of advanced data processing, manipulation, analysis and access functions. It should support the full range of advanced analytics, as well as some functions traditionally associated with EDWs, BI and OLAP. It should have all the metadata, models and other services needed to handle such core analytics functions as query, calculation, data loading and data integration. And it should handle a subset of these functions and interface through connectors to analytic platforms such as IBM Smart Analytic Systems and Netezza Analytics.
  • Big-data development: A big data appliance should support big data modeling, mining, exploration and analysis. The appliance should provide a scalable “sandbox” with tools that allow data scientists, predictive modelers and business analysts to interactively and collaboratively explore rich information sets. It should incorporate a high-performance analytic runtime platform where these teams can aggregate and prepare data sets, tweak segmentations and decision trees, and iterate through statistical models as they look for deep statistical patterns. It should furnish data scientists with massively parallel CPU, memory, storage and I/O capacity for tackling analytics workloads of growing complexity. And it should enable elastic scaling of sandboxes from traditional statistical analysis, data mining and predictive modeling, into new frontiers of Hadoop/MapReduce, R, geospatial, matrix manipulation, natural language processing, sentiment analysis and other resource-intensive types of big data processing.

A big-data appliance should not be a stand-alone server, but, instead, a repeatable, modular building block that, when deployed in larger cloud configurations, can be rapidly optimized to new workloads as they come online. Many appliances will be configured to support mixes of two or all three of these types of workloads within specific cloud nodes or specific clusters. Some will handle low-latency and batch jobs with equal agility in your cloud. And still others will be entirely specialized to a particular function that they perform with lightning speed and elastic scalability. The best appliances, like IBM Netezza, facilitate flexible re-optimization by streamlining the myriad deployment, configuration and tuning tasks across larger, more complex deployments.

You may not be able to forecast with fine-grained precision the mix of workloads you’ll need to run on your big-data cloud two years from next Tuesday. But investing in the right family of big-data appliance building blocks should give you confidence that, when the day comes, you’ll have the foundation in place to provision resources rapidly and efficiently.

Related Posts on Workload-optimized Systems

James is blogging all week about topics related to workload-optimized systems.

Check back all week for other posts in the series.