Hadoop Meets the Mainframe

InfoSphere BigInsights brings the agility and flexibility of Hadoop to System z environments

Product Marketing Manager, IBM

As the value of data-driven decisions becomes clear, organizations are increasingly seeking to incorporate mainframe data along with data from other sources into their analytics models. After all, big data is about all the data, and the mainframe is often where the most precious data resides. The recent release of IBM® InfoSphere® BigInsights™ software, the enterprise-grade Apache Hadoop framework, for Red Hat Enterprise Linux on the IBM System z® platform is designed to open up advanced analytics possibilities for organizations while helping protect sensitive data and keep it secure.


The road to Hadoop

When storage was expensive and data gathering was difficult and costly, IT departments had to be discerning about what data to keep and how to store it. To retain only the data needed to run business applications, they thought in terms of well-designed schemas, relational models, and downsampling and aggregating data.

For new analytics workloads, however, many organizations have turned to Hadoop. Hadoop evolved to handle a different set of challenges: how to wade through vast amounts of unstructured or semistructured data cost-efficiently, and how to use parallel processing techniques to work with data sets too large to be handled through conventional means. As Hadoop has matured and become increasingly accessible to the enterprise, IT organizations now have the opportunity to think about data management differently.
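The parallel processing model that Hadoop popularized is MapReduce: input is split into pieces, a map step turns each piece into key-value pairs, the framework groups the pairs by key, and a reduce step combines each group. The following is a minimal single-process sketch of that idea in plain Python, counting words in a few made-up log lines. The function names and sample data are illustrative assumptions; this is not Hadoop or InfoSphere BigInsights code, and nothing here runs in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, blocks of lines would be mapped on different nodes.
    # Here the map calls simply run one after another in a single process.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for machine log data.
    sample = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(sample))
```

Because each map call depends only on its own line and each reduce call only on its own key, both steps can be spread across many machines, which is what lets this style of processing scale to very large data sets.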

Organizations deploying Hadoop are more likely to want to keep all the data in case they need it in the future; evolve schemas on the fly; and use innovative, fast-evolving software tools to extract knowledge from raw data directly. The ability to augment the mainframe with these capabilities enables organizations deploying the System z platform to perform the following tasks:

  • Shift select workloads, such as extract-transform-load (ETL) processing and analytics, to Hadoop.
  • Facilitate ad hoc analysis of mainframe data extracts, including transaction and log data, without compromising the integrity of critical source data.
  • Deploy advanced tools and skills to support quick and low-cost application development that can boost productivity.

In essence, organizations can benefit from both the security, reliability, and integrity of the mainframe and the agility and flexibility of Hadoop-based tools.


Flexibility on or off the mainframe

InfoSphere BigInsights is a multiplatform application. The same standard Hadoop distribution can run on commodity Intel processor–based servers, IBM PowerLinux™ servers, or Linux on System z mainframe partitions. This versatility allows organizations to apply common skill sets across multiple environments. Some sites may elect to deploy InfoSphere BigInsights on commodity clusters. However, for security or information governance reasons, other sites may decide to run InfoSphere BigInsights on the mainframe itself.

InfoSphere BigInsights incorporates a standard open source Hadoop distribution along with enhanced capabilities that organizations can elect to leverage depending on their requirements. InfoSphere BigInsights includes the following key features:

  • Big SQL: This rich, ANSI-compliant SQL with massively parallel processing (MPP) query optimization provides standard SQL access to native Hadoop data in the Hadoop Distributed File System (HDFS) framework and other data sources.
  • BigSheets: This spreadsheet-style data-manipulation and visualization tool is aimed at business professionals, allowing them to access data sets without needing any additional programming.
  • Application accelerators: These tools help speed the development of new applications involving machine log data, social media data, or textual data.
  • Advanced IDE: A comprehensive Eclipse-based IDE helps simplify the development and maintenance of big data applications.
  • Management capabilities: Auditing helps tighten security and access control, and monitoring lets administrators keep track of applications from a centralized dashboard.

Today’s mainframe systems are highly flexible and support applications that can be scaled vertically and horizontally. They also make extensive use of virtualization. The Integrated Facility for Linux (IFL) on System z is a specialized processor designed specifically to run Linux workloads. In theory, a single mainframe IFL processor with System z virtualization (IBM z/VM®) can support hundreds of virtual machine instances.

Testing with Hadoop workloads on System z to date has involved configurations ranging from two IFLs up to 40 IFLs, demonstrating near-linear scalability across a range of standard Hadoop benchmarks [1]. With up to 101 user-configurable IFLs supported per IBM zEnterprise® EC12 mainframe [2], plenty of headroom can be available to run substantial Hadoop workloads efficiently.
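One common way to make "near-linear scalability" concrete is scaling efficiency: the speedup over a baseline run divided by the increase in processors. A value of 1.0 means runtime fell exactly in proportion to the added processors. The sketch below shows that arithmetic in Python; the function name is an assumption, and the runtimes in the usage example are hypothetical placeholders, not figures from the cited benchmark report.

```python
def scaling_efficiency(base_units, base_seconds, units, seconds):
    """Return scaling efficiency of a run relative to a baseline run.

    1.0 means perfectly linear scaling; values below 1.0 mean the larger
    configuration delivered less speedup than its extra processors suggest.
    """
    if min(base_units, units) <= 0 or min(base_seconds, seconds) <= 0:
        raise ValueError("processor counts and runtimes must be positive")
    speedup = base_seconds / seconds        # how much faster the job ran
    resource_ratio = units / base_units     # how many times more processors
    return speedup / resource_ratio


if __name__ == "__main__":
    # Hypothetical runtimes for illustration only (not measured results):
    # a job taking 100 s on 2 IFLs and 5.5 s on 40 IFLs.
    print(round(scaling_efficiency(2, 100.0, 40, 5.5), 2))
```

With these made-up inputs, 20 times the processors yields a little over 18 times the speed, so the efficiency comes out just above 0.9, which is the kind of result usually described as near-linear.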


The right balance

While not the first Hadoop distribution for mainframes, InfoSphere BigInsights does provide important new capabilities for System z, including ease of management and IBM and third-party database integration features. With InfoSphere BigInsights, organizations deploying System z mainframes can strike just the right balance between economy and operational security.

InfoSphere BigInsights for Linux on System z [3] can be downloaded from the IBM Fix Central support site, and a complimentary Quick Start edition [4] can be downloaded from the IBM website. Please share any thoughts or questions in the comments.

[1] “The Elephant on the Mainframe,” IBM Systems and Technology Group, April 2014.
[2] “IBM zEnterprise EC12 Technical Guide,” IBM Redbooks, Ivan Dobos et al., December 2013.
[3] IBM Fix Central website.
[4] IBM InfoSphere BigInsights Quick Start Edition website.