Does Spark Streaming qualify as stream computing?

InfoSphere Streams Product Manager

There’s no commonly agreed-upon definition of stream computing. IBM InfoSphere Streams has for several years sought to differentiate itself from complex event processing (CEP) in three key areas:

  1. Single node versus clustered runtime. While this is changing, CEP engines have typically run on a single node. Streams applications, by contrast, are compiled and deployed with parts of the application spread automatically across one or more compute nodes to perform both parallel and pipeline processing, delivering very high scalability.
     
  2. Expressive programming model. Nearly all CEP engines use either a rules engine to evaluate if-then-else style rules or an in-memory SQL database to perform continuous SQL queries as data arrives and is loaded into the database. These programming models limit the types of analytic models that can be built (see the sketch after this list). IBM Research considered these paradigms when its investigation began in 2003, but decided they were not powerful enough to implement customer application requirements, which demanded a fully programmatic model. Streams Processing Language (SPL) is a declarative language, higher level than Java, C++ or a rules language, designed for handling streaming data. SPL makes it possible to create very powerful analytic models, ranging from oil field optimization to predicting the onset of disease.
     
  3. Structured versus unstructured data support. CEP engines typically handle structured data, such as the data you would insert into a database. Streams and SPL were designed to analyze all manner of data, from the waveform data used to monitor solar flares to image, video and acoustic data types.
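
To make the contrast in item 2 concrete, here is a small, purely illustrative Scala sketch; the event shape, names and thresholds are all invented, and real CEP products and SPL each have their own syntax. A stateless if-then-else rule is easy to express in any paradigm; a fully programmatic model also lets you keep arbitrary state across events, such as a per-sensor moving average:

```scala
case class Event(sensorId: String, temp: Double)

// CEP-style rule: a stateless if-then-else evaluated on each event.
def rule(e: Event): Option[String] =
  if (e.temp > 100.0) Some(s"ALERT ${e.sensorId}") else None

// Fully programmatic model: arbitrary state across events (here, an
// exponential moving average per sensor), which pure rule or
// continuous-SQL styles make awkward to express.
def spikeAlerts(events: Iterator[Event], alpha: Double = 0.2): Iterator[String] = {
  var ema = Map.empty[String, Double]
  events.flatMap { e =>
    val next = ema.get(e.sensorId).fold(e.temp)(prev => alpha * e.temp + (1 - alpha) * prev)
    ema += e.sensorId -> next
    if (e.temp > next * 1.5) Some(s"SPIKE ${e.sensorId}") else None
  }
}
```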

Streams can also handle complex event processing, and so some have considered it simply an extension of CEP rather than a new category: stream computing. But with reports like The Forrester Wave: Big Data Streaming Analytics Platforms, Q3 2014, and all the new stream computing entries in the marketplace, it seems clear that this is indeed a new category.

Wikipedia provides this definition: "In computer science, a stream is a sequence of data elements made available over time. A stream can be thought of as a conveyor belt that allows items to be processed one at a time rather than in large batches. Streams are processed differently from batch data—normal functions cannot operate on streams as a whole, as they have potentially unlimited data and, formally, streams are codata (potentially unlimited), not data (which is finite)."
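
The conveyor belt idea is easy to sketch in a few lines of Scala (purely illustrative; the names are invented): a potentially unbounded stream is consumed one element at a time and is never materialized as a whole:

```scala
// A potentially unbounded stream of readings, produced lazily on demand.
val sensorReadings: LazyList[Double] =
  LazyList.continually(scala.util.Random.nextGaussian())

sensorReadings
  .map(c => c * 9.0 / 5.0 + 32.0) // each element is transformed as it arrives
  .take(5)                        // only a finite prefix can ever be observed
  .foreach(println)               // the stream as a whole is never held in memory
```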

So should Spark Streaming be considered stream computing? It does fit the description above for a clustered runtime, an expressive programming model and support for unstructured data. But it is really only a small API layer on top of Spark, and it decidedly cannot process data one element at a time: it collects a minimum of one half second of data into a batch and then processes that batch.
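
The batch orientation is visible right in the API. Here is a minimal Spark Streaming job (the host and port are hypothetical); a batch duration is a required argument to the StreamingContext, because every operation runs against a timed micro-batch rather than against individual records:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object MicroBatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MicroBatchDemo").setMaster("local[2]")

    // The batch interval is mandatory: here, the practical minimum of 500 ms.
    // All DStream operations below execute once per batch, not per record.
    val ssc = new StreamingContext(conf, Milliseconds(500))

    val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
    lines.map(_.toUpperCase).print() // runs on each half-second batch

    ssc.start()
    ssc.awaitTermination()
  }
}
```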

This batch orientation has several limitations:

  • Most obviously, it cannot deliver low latency: the batch interval puts an absolute floor of 0.5 seconds on latency.
     
  • Using a batch paradigm, there is considerable overhead to start up and tear down batch jobs. In a pipeline where a series of Java programs processes streaming data, a task is launched in a Spark Streaming slave only after 0.5 seconds (or more) of data has been collected. With multiple stages in the application, each stage of the pipelined process also requires tasks to be launched. The Spark programming guide discusses how these costs can be reduced, but the basic architecture spends machine cycles starting and stopping tasks, cycles that are then no longer available to analyze the data.
     
  • Data serialization and deserialization add overhead as well. Each task that is part of the overall program must take the batch of data and separate it into individual records, and then, when complete, pack the results back into a batch (see the sketch after this list).
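
A conceptual Scala sketch of that per-stage round trip; this illustrates the idea rather than Spark's actual internals, and every name here is invented:

```scala
// Each pipeline stage pays to unpack the batch, does its real work,
// and then pays again to re-pack the results for the next stage.
def runStage[A, B](batch: Array[Byte],
                   deserialize: Array[Byte] => Seq[A], // overhead on the way in
                   work: A => B,                       // the actual analytic work
                   serialize: Seq[B] => Array[Byte]    // overhead on the way out
                  ): Array[Byte] =
  serialize(deserialize(batch).map(work))
```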

Since Spark Streaming is really micro-batching, I do not believe that it qualifies: a small batch is still a batch, not a stream. The architectural overhead of batching, task launching and serialization impacts the overall throughput and latency that many customers require.

IBM InfoSphere Streams, in contrast, handles each record as it arrives, delivering very low latency and very high throughput for streaming applications. It also automatically spreads work across a clustered runtime, takes a fully programmatic approach to analytics and supports unstructured data, truly delivering on the stream computing paradigm.