Streams versus Storm: What’s in a name?
Across cultures, since the beginning of time, humans have given great importance to the naming of offspring. Children are named for ancestors, significant events or natural wonders. Names give meaning to people, places and objects.
Today, corporations compete for naming rights to public buildings and parks, and celebrities try to outdo each other with unconventional baby names. In fact, the New York Times reports the name Cheese is trending. What if you have two children? Cheese and Crackers?
The importance of naming is also relevant when it comes to big data technologies. In the case of real-time analytic processing, many organizations think of InfoSphere Streams or Apache Storm. Streams and Storm definitely conjure up very different images. Picture a stream, perhaps in the magnificent mountains of Yorktown Heights, New York, where InfoSphere Streams was developed. A gentle flow of water with greenery and a peaceful setting come to mind. Compare that with a picture of a storm. Thoughts of thunder, lightning and other severe weather pop up.
In a recent performance benchmark designed to analyze both the quantitative differences in performance and qualitative differences in application development, InfoSphere Streams proves superior to Apache Storm in terms of speed, efficiency and scalability. The names aptly reflect the experience of deploying each real-time analytic technology.
InfoSphere Streams outperforms Apache Storm by 2.6 to 12.3 times in terms of throughput while simultaneously consuming 5.5 to 14.2 times less CPU time. Furthermore, the throughput and CPU time gaps widen as data volume, degree of parallelism and number of processing nodes grow.
InfoSphere Streams handles heavy loads much better; for example, it makes more effective use of available CPU capacity. Apache Storm's noticeable performance degradation on meaningful workloads (typical of streaming analysis) means that the cost of application logic is very high. As a result, Apache Storm is in most cases not suited to production applications such as geospatial analytics, deep network inspection and call data record analysis.
The sophisticated and robust engineering of InfoSphere Streams lets it scale in a near-linear way and handle high workloads efficiently, with minimal performance degradation and a low resource usage footprint. That ability emerged from this benchmark study as the obvious differentiator for InfoSphere Streams. So what was done? A direct comparison of Streams and Storm in a real-world use case: email classification.
Email has become a primary and pervasive source of both personal and business communication. At the same time, it is the de facto medium for spam dissemination, so much so that more than 90 percent of the total global email volume is spam. The annual cost of this unwanted and, at times, malicious content to US organizations is estimated to be in excess of $20 billion. To stay ahead, organizations use spam detection solutions that must spot malicious email in real time. The Streams and Storm benchmark study was based on a core component of a spam detection application: the preprocessing of email in real time.
- What was built? A preprocessor for a behavior-based spam detection application
- How does it work? Seven-step real-time email preprocessing, including file read, decompression, serialization, filtering and scoring
- What was the goal? Analyze the quantitative differences in performance and qualitative differences in application development
- Key findings: Streams applications take less time to build and perform better while using less hardware
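To make the preprocessing steps above more concrete, here is a minimal Python sketch of a chained, streaming-style pipeline. It is not the benchmark's code, and it is written in neither Streams nor Storm. It covers only the five steps named above (the study describes seven), and the gzip format, the JSON record layout, the "drop messages with no sender" filter and the recipient-count score are all assumptions made for illustration.

```python
import gzip
import json
from email import message_from_bytes


def read_files(paths):
    """File read: yield the raw bytes of each compressed email file."""
    for path in paths:
        with open(path, "rb") as handle:
            yield handle.read()


def decompress(blobs):
    """Decompression: gunzip each blob (gzip is an assumed format)."""
    for blob in blobs:
        yield gzip.decompress(blob)


def serialize(raw_messages):
    """Serialization: parse each email and emit a compact JSON record."""
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        to_header = msg.get("To", "")
        record = {
            "sender": msg.get("From", ""),
            "recipients": [r.strip() for r in to_header.split(",") if r.strip()],
            "subject": msg.get("Subject", ""),
        }
        yield json.dumps(record)


def keep_valid(lines):
    """Filtering: drop records with no sender (a hypothetical rule)."""
    for line in lines:
        record = json.loads(line)
        if record["sender"]:
            yield record


def score(records):
    """Scoring: placeholder feature, the recipient count per message."""
    for record in records:
        record["score"] = len(record["recipients"])
        yield record


def run_stages(blobs):
    """Chain the in-memory stages; each message flows through one at a time."""
    return score(keep_valid(serialize(decompress(blobs))))


def preprocess(paths):
    """Full pipeline, starting from compressed files on disk."""
    return run_stages(read_files(paths))
```

Because every stage is a generator, a message moves through the whole chain before the next one is read, which loosely mirrors how a streaming engine passes tuples between operators. A real deployment would run these stages in parallel across processing nodes, which is where the benchmark's throughput and CPU time differences show up.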
The moral of the story: names have more significance than you think.
Want to learn more? Check out StreamsDev, the InfoSphere Streams developer community for and by developers.
Want to try InfoSphere Streams for yourself? No problem. Download and go with InfoSphere Streams Quick Start Edition.