Spark and R: The deepening open analytics stack
Open source approaches continue to disrupt established markets and foster amazing innovations. In the IT world, open source software is penetrating every pore of established vendor ecosystems and recrystallizing all platforms, tools and applications into new, highly agile configurations. Open source software platforms are successful not because they’re perfect, but because they boost productivity throughout the economy by accelerating reuse, sharing, collaboration and innovation within entire industry ecosystems.
The Spark trajectory
The centerpiece of open data science is Apache Spark, which has matured rapidly into an open analytics platform for robust in-memory, machine learning, graph and streaming analytics. See my recent blog posts on Spark in the larger context of open source initiatives (platforms, ecosystems, languages, tools, application programming interfaces (APIs), expertise and data) and on Spark in the context of the Apache Hadoop open data platform.
In 2016, Spark has continued its trajectory toward mainstream adoption in big data, advanced analytics, data science, Internet of Things and other application domains. In addition to Spark, the core components of the deepening open analytics stack include Hadoop and the R programming language. Taken together, these open source tools constitute the core workbenches that data scientists use to craft innovative applications in every sector of the economy.
Data scientists are the core developers in this new era, and they have strong feelings about the open analytics stacks at their disposal. Their productivity depends directly on the ease of use, performance and integration of their core open analytics development platforms, tools and libraries, including Spark, R and Hadoop.
Programming language bindings
For working data scientists, a key innovation came to market in the summer of 2015 when the release of Spark 1.4 was announced at Computerworld. The overview of that announcement notes that its chief new feature is SparkR, a language binding for R programming in Spark projects. At Spark Summit 2015, I summarized the Spark community’s plans to roll out additional language bindings beyond R, Python, Java and Scala.
The SparkR binding, based on the DataFrame API, is a significant addition to the core Spark codebase. It enables R developers to access Spark's scale-out parallel runtime, leverage Spark's input and output formats and call directly into Spark SQL. In this way, R, which was designed to work only on a single computer, can now run large jobs across multiple cores on a single machine and across massively parallel server clusters. As a result, R has become a full-blown big data analytics development tool for the era of Spark-accelerated in-memory, machine learning, graph and streaming analytics.
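To make the scale-out idea concrete, here is a minimal conceptual sketch in plain Python of the partition, aggregate and combine pattern that a parallel runtime applies to a grouped mean. It is not SparkR or any Spark API: the function names (`split_into_partitions`, `partial_aggregate`, `combine`, `grouped_mean`) and the sample rows are my own illustrative assumptions, and a thread pool stands in for a cluster only to keep the example self-contained, so it shows the shape of the computation rather than any real speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists (hypothetical helper)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_aggregate(partition):
    """Compute a per-key [sum, count] using only the rows in one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return dict(acc)


def combine(partials):
    """Merge the per-partition [sum, count] pairs and finish the mean per key."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (subtotal, count) in partial.items():
            total[key][0] += subtotal
            total[key][1] += count
    return {key: subtotal / count for key, (subtotal, count) in total.items()}


def grouped_mean(rows, num_partitions=4):
    """Grouped mean computed partition by partition, then combined."""
    partitions = split_into_partitions(rows, num_partitions)
    # Assumption: threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    return combine(partials)


# Made-up sample data: (group key, numeric value) pairs.
rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 30.0), ("a", 5.0)]
result = grouped_mean(rows, num_partitions=2)
print(result)
```

The point of the design is that each worker only ever sees its own slice of the data and returns a small summary, so the final merge stays cheap regardless of how many rows were processed. A data scientist writing a grouped aggregation against a distributed DataFrame gets this division of labor from the runtime rather than coding it by hand.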
On 6 June 2016, IBM will make important announcements aimed at ensuring that R, Spark and open data science continue to drive innovative business applications. At the Apache Spark Maker Community Event, IBM will host an evening for data scientists, data application developers and data engineers. The event features special announcements, a keynote, a panel discussion and a hall of innovation. Several leading industry figures have committed to participate:
- John Akred, CTO at Silicon Valley Data Science
- Matthew Conley, data scientist at Tesla Motors
- Ritika Gunnar, vice president of offering management at IBM Analytics
- Todd Holloway, director of content science and algorithms at Netflix