As part of IBM's ongoing commitment to Hadoop and the broader open source ecosystem, IBM is joining forces with Databricks, Cloudera, Intel and MapR to broaden support for Apache Spark. IBM's goal is to provide enterprise customers with access to the latest innovations around big data and analytics.
What is Spark and what does it mean for Hadoop?
Spark is an open source engine for fast, large-scale data processing that can be used with Hadoop, with claimed speeds up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. As was true during the early enthusiasm around Hadoop, Spark should not be thought of as a single, standalone platform for analytics; it works alongside existing investments to support a wide variety of data types and analytics workloads.
Hadoop is a significant innovation as it helps solve two major challenges around big data and analytics:
- The first is the ability to process poly-structured data (structured, semi-structured and unstructured)
- The second is the ability to do this at large scale and at low cost
These benefits have driven initial interest in and adoption of Hadoop; however, early adopters are seeing the need for additional capabilities before they can deploy Hadoop as a first-class citizen in their analytics architecture. Four key areas of Hadoop need to mature in order to drive wider adoption: performance, a reduction in the specialized skills required, data governance and deep integration with existing technologies.
What's exciting about Spark is that it helps increase the performance of Hadoop. A major source of latency in Hadoop is MapReduce, which works well for the large batch jobs it was originally designed for but is not well suited to more varied and complex enterprise workloads. Spark provides a distributed framework and engine for processing data in Hadoop (it can also be used without Hadoop) that takes advantage of in-memory processing. Early benchmarks show performance improvements of 10 to 100 times over MapReduce. IBM recognized the MapReduce latency issue from the very beginning of its Hadoop offering and added "Adaptive MapReduce" to its BigInsights Hadoop platform, helping increase performance 6 to 20 times. IBM's support for Spark continues the strategy of providing performance-improving capabilities for Hadoop.
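To make the in-memory point concrete, here is a toy sketch in plain Python. It is not Spark or Hadoop code, and the function names disk_pipeline and memory_pipeline are hypothetical. It assumes a job made of several passes over the same data: the first function writes every intermediate result to disk and reads it back, much as a chain of MapReduce jobs would, while the second keeps the working set in memory between passes.

```python
import json
import os
import tempfile
import time


def disk_pipeline(records, passes, workdir):
    """Chained-batch style: every pass reads its input from disk and writes its output back."""
    path = os.path.join(workdir, "stage_0.json")
    with open(path, "w") as f:
        json.dump(records, f)
    for i in range(passes):
        with open(path) as f:
            data = json.load(f)              # reload the previous stage from disk
        data = [x + 1 for x in data]         # the actual "work" of this pass
        path = os.path.join(workdir, f"stage_{i + 1}.json")
        with open(path, "w") as f:
            json.dump(data, f)               # persist the intermediate result
    with open(path) as f:
        return json.load(f)


def memory_pipeline(records, passes):
    """In-memory style: the working set stays in RAM between passes."""
    data = list(records)
    for _ in range(passes):
        data = [x + 1 for x in data]
    return data


if __name__ == "__main__":
    records = list(range(200_000))
    with tempfile.TemporaryDirectory() as workdir:
        start = time.perf_counter()
        on_disk = disk_pipeline(records, 10, workdir)
        disk_seconds = time.perf_counter() - start

    start = time.perf_counter()
    in_memory = memory_pipeline(records, 10)
    memory_seconds = time.perf_counter() - start

    # Both approaches compute the same result; only the I/O pattern differs.
    assert on_disk == in_memory
    print(f"disk-backed passes: {disk_seconds:.3f}s, in-memory passes: {memory_seconds:.3f}s")
```

The timings will vary by machine and say nothing about real cluster performance; the sketch only shows why avoiding a round trip to disk between passes can matter for iterative workloads.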
This support for Spark also continues IBM’s history of innovation and investment in big data and analytics, including:
- IBM BigInsights. Hadoop offering that provides open source Hadoop components along with additional enterprise-class capabilities
- InfoSphere Streams. World’s first, industry-leading, ultra-low-latency streaming analytics platform
- BigSQL. A full-featured, high-performance SQL engine for Hadoop
- Text Analytics Engine. Highly accurate framework and engine for unstructured data analysis
- BigR. R deployed on Hadoop to leverage Hadoop distributed processing capabilities
- Big Match. Highly scalable entity matching for master data management
- Identity Insights (G2). Context computing engine
- IBM Research innovations in the Early Access Program. Machine learning, graph analytics, sentiment analysis
These IBM innovations, combined with open source innovations such as Spark, are a great step forward in providing our clients with the widest range of analytics on big data. For more information on IBM Big Data & Analytics, please visit ibm.com/bigdata. For more information on Apache Spark, please visit spark.apache.org.