Spark speeds towards the next data processing revolution
“A mighty flame followeth a tiny spark.” —Dante Alighieri
If you know anything about Apache Spark, you know that its chief claim to fame is speed. With in-memory processing, Spark can cut data-processing times by a factor of ten to a hundred or more compared to traditional MapReduce. Spark is also highly flexible, supporting a wider variety of workloads than traditional systems. But “faster” and “highly flexible” don’t do justice to the sea change for data processing that is enabled by Spark.
My first experience with the Internet was through a 300-baud modem. When 1200 baud came along, life got better, and when I upgraded to 2400 baud, I thought I had died and gone to heaven. But those changes were incremental. The emails I was sending and files I was transferring went faster than before, but my horizons had not broadened much.
Broadband’s arrival was a different story. The quantum improvement in speed, along with the fact that it was always on, enabled vast new application areas that simply did not exist before. Video, gaming, streaming music, real-time communication, video conferencing and more suddenly took the Internet by storm. These applications in turn enabled and attracted a new class of Internet end user, and the modern Internet era was born. Not until the recent mobile explosion did we see another revolution as significant.
We now stand on the brink of a similar revolution in data processing applications, all driven by the quantum improvement in speed enabled by Spark. Suddenly, the world of data analytics can be interactive rather than batch oriented. Combine interactivity with more intelligent software and more user-friendly tools than ever before, and a new class of front-office business users is coming to data analytics. Brand managers, ad campaign managers, marketing managers and executives of all types can now get fast insights into their business directly.
Overstating the importance of this change is difficult. Before Spark, business users had to rely on IT for their insights. They didn’t have the skills required, and, just as important, they didn’t have the time to spend wrangling data, writing queries and waiting for results. The typical time to answer a question was measured in days, and only then after all the required systems were set up. That first-time setup could easily take months. While Spark does not solve all the problems of data setup, it does enable business users to ask questions and get answers interactively, in a single sitting, instead of waiting for days and chasing down people for help. This ability is huge. For the first time, business users have the power to answer their own questions based on data analysis.
Spark fundamentally changes the types of questions that can be asked. As the speed of business keeps accelerating, the value of knowing what happened pales in comparison to the value of knowing what is happening right now. Imagine the power of knowing how your ad campaign is performing this very minute. Imagine the benefit of knowing where sales are spiking or tanking every day; where inventory is low or high as sales are made; and which patients are responding to care and which are not—all right now.
Spark enables these kinds of insights through the combination of its in-memory, lightning-fast processing and its ability to stream in new data in real time. Spark Streaming enables a new class of real-time data applications that open up a whole new world of possibilities. Real-time sales data, Internet of Things data, campaign-performance data and more can all be streamed into Spark and on to the screens of business users at the speed of Spark.
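To make the idea concrete, here is a minimal sketch in plain Python of what "streaming new data into in-memory state" looks like: small batches of events arrive, and running totals held in memory are updated so the latest numbers are always available. This is not Spark's API, only an illustration of the pattern. The `RunningCounts` class, the event fields and the campaign names are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable


class RunningCounts:
    """Hypothetical helper: keeps per-campaign totals in memory."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def update(self, batch: Iterable[Dict[str, Any]]) -> Dict[str, int]:
        # Fold one micro-batch of events into the in-memory totals.
        for event in batch:
            self.totals[event["campaign"]] += event["clicks"]
        # Return a snapshot of the current state for display.
        return dict(self.totals)


# Made-up micro-batches standing in for a live feed of campaign events.
batches = [
    [
        {"campaign": "spring_sale", "clicks": 3},
        {"campaign": "newsletter", "clicks": 1},
    ],
    [
        {"campaign": "spring_sale", "clicks": 2},
    ],
]

counts = RunningCounts()
snapshots = [counts.update(batch) for batch in batches]
latest = snapshots[-1]
print(latest)
```

Because the totals never leave memory, each new batch only costs the work of processing that batch, which is why the answer on the screen can keep pace with the incoming data. A real Spark application would distribute this work across a cluster and handle failures, which this sketch does not attempt.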
Clearly, Spark is aptly named. It is sparking the next revolution in data analytics and fast-cycle big data processing. Like bandwidth, Spark is an enabler of great things to come. It makes the things we’re already doing faster than ever, but more importantly it opens up new vistas that were only imagined before. Getting there will require a lot of work and application development, but the spark has been ignited. The fire is sure to follow.
You can learn more about IBM’s deep commitment to Spark and engagement with the partner ecosystem by visiting the following online resources:
- IBM Spark
- IBM Spark Technology Center
- “IBM Announces Major Commitment to Advance Apache Spark, Calling It Potentially the Most Significant Open Source Project of the Next Decade,” IBM press release, June 2015.
- IBM Big Data University Spark Fundamentals course
- More Spark blogs and resources