Achieving new milestones in open analytics

Big Data Evangelist, IBM

The foundational role of Apache Hadoop in open analytics ecosystems is undisputed. It is the clear focus of data science, cognitive computing and big data analytics ecosystems everywhere. Hadoop provides an open platform on which today’s data scientists build innovative analytics. Collectively, Hadoop, Apache Spark, R and other open source tools and languages provide a growing stack of open source analytics code.

Several announcements at Strata + Hadoop World 2016 showed that this growing open source stack continues to develop. The key open analytics ecosystem milestones signal advances in interoperability, partnering and platform integration.

Interoperability framework

The interoperability milestone was the release by ODPi of core specifications for interoperability and certification. Specifically, ODPi, a nonprofit group of which IBM is a charter member, introduced the first runtime specification, test suite and reference build for its Hadoop interoperability framework. The new ODPi Runtime Specification 1.0 builds on and aligns with relevant open source initiatives under the Apache Software Foundation (ASF):

  • It descends from Hadoop 2.7.
  • It features the Hadoop Distributed File System (HDFS), YARN and MapReduce components.
  • It leverages Apache Bigtop for packaging, testing and configuration of Hadoop-based applications.
  • It includes guidelines on how to incorporate additional compatible functionality and make the source code available through Apache community processes. 

The new ODPi Test Suite links tests directly to lines in the ODPi Runtime Specification, and the new ODPi Reference Build helps developers verify that their builds comply with the runtime specification.

Taken together, these new ODPi deliverables enable developers to build applications once and certify them to run across diverse Hadoop distributions. Later in 2016, ODPi plans to release its Operations Specification, a follow-on component that is expected to help users improve installation and management of Hadoop and Hadoop-based applications. This specification covers Apache Ambari, the open source project for provisioning, managing and monitoring Hadoop clusters.

Partnering initiative

For the partnering milestone, IBM announced the pilot of its Open Analytics Ecosystem, an initiative that launches at Spark Summit West in early June 2016. Under the program, IBM plans to build relationships within the open analytics community directly with business leaders, application makers and technology experts. Among open analytics codebases, this partnership program focuses principally on Spark.

By the end of 2016, IBM plans to have signed up more than 100 actively participating open ecosystem partners, and IBM is expected to reach out to a wide range of potential community members. They include professionals in development operations, data analysis, data engineering, data science, data visualization, application development, data architecture and others who want to get involved in strategic business initiatives around Spark. Key partnering criteria focus on the level of nonmonetary contributions to Spark and other open source data analytics communities. Contributions such as code, resources and training are expected to drive efforts by IBM to build a robust ecosystem for open analytics. These contributions should prove critical in determining partners’ status in the program:

  • Have they attended, hosted and spoken at open analytics meetups?
  • Have they contributed complimentary tools, course materials and expert training for open analytics?
  • Have they contributed bug fixes, feature requirements and project tracking for open source analytics projects?
  • Have they contributed applications and use cases, architectures with design plans and complete data products? 

Partners will benefit from participating in several areas. Their business leaders stand to gain training and skills-development opportunities, membership in a partner advisory council, use case examples and more. Development experts are expected to benefit from code contribution training, roadmap reviews, expert code reviews, discounted reference books, co-development of prototypes and integrations and more. Application builders can enjoy access to reference architectures, design workshops, data sources, free open analytics technology, assigned agile coaches and more.

Platform integration

IBM’s delivery of new Spark capabilities for z Systems mainframes and customized partner solutions for using Spark on z/OS represented the platform integration milestone. Available now, the new offering enables data scientists to analyze data in place and in memory on the mainframe of origin without the need to first extract, transform and load it. Developers and data scientists can use their existing expertise with programming languages such as Scala, Python, R and SQL. In addition, optimized data abstraction services help simplify access to enterprise data in traditional formats such as IMS, Virtual Storage Access Method (VSAM), IBM DB2 for z/OS, partitioned data set extended (PDSE) or System Management Facilities (SMF).

In terms of fostering an open analytics ecosystem around Spark on z/OS, this announcement means that developers in every industry and geography can access and analyze data that’s already stored on mainframes. The z/OS Platform for Apache Spark and partner solutions are expected to enable data scientists and data wranglers, who are charged with gathering data from different sources, to use the formats and tools they prefer for collecting and analyzing data. In addition, z Systems has established a new GitHub organization for developers to collaborate and build tools around Spark on z/OS.

Visit the Spark community site to participate in the open analytics ecosystem around Spark, and join ODPi to learn more about the open ecosystem of big data. Also be sure to start your open analytics journey today at the Open For Data site.