Why a Build-It-Yourself Approach Doesn’t Cut It
InfoSphere Streams offers impressive, real-time analysis of data in motion for complex analytics
What do IBM® InfoSphere® Streams stream computing, Apache Storm distributed real-time computation, and Loggly cloud-based log management have in common? Each sets its sights on offering organizations log management performance at scale, valuable analytical insights, and productivity.
Client organizations often ask IBM to compare InfoSphere Streams with the alternatives: maintaining the status quo, or building infrastructure in-house rather than buying it. Interestingly, Manoj Chaudhary at Loggly posed a similar question in a blog post1 discussing Loggly’s experience with Storm versus alternative approaches and why the organization decided to stop using Storm in operations requiring high performance. Bear in mind, however, that Loggly still deploys Storm further downstream, where there is reduced stress on the system.
Even without all the details, the blog post makes clear that Loggly’s use of Storm, and the architecture built around it, had to change to meet its business requirements. The problem was combining guaranteed log delivery with the required performance. Storm could support only 200,000 events per second (EPS) on average when per-tuple acking was turned off, and a much slower 80,000 EPS on average when per-tuple acking (that is, guaranteed delivery) was turned on. In its quest to optimize performance, Loggly decided to move away from Storm, and as a result it could support sustained rates of more than 100,000 EPS, which presumably is inclusive of end-to-end processing.2
Impressive performance, but at what cost?
Digging a little deeper, a Loggly presentation offers more detail: slide 28 presents a chart showing that Loggly’s optimized C++ threaded log collectors can support a 250,000 EPS ingest on an m2.2xlarge (four cores) Amazon Web Services (AWS) instance.3 That result is impressive, but what level of development effort was required to reach that performance level? The AWS instance type on that slide may also be in error, because slide 38 in the presentation documents using c1.xlarge instances (eight cores). This instance difference is also mentioned 43 minutes into a complementary YouTube presentation of the case study.4
Here’s where the comparison comes into play. In support of a client organization’s request to compare InfoSphere Streams to Storm, IBM published a white paper that provides a stark, apples-to-apples comparison of throughput and processor utilization in a representative application: an email processing scenario for spam detection.5 For the tested benchmark application and scenarios, InfoSphere Streams delivered 2.6 to 12.3 times the throughput of Storm while simultaneously consuming 5.5 to 14.2 times less processor time than Storm.6 Further, the throughput and processor time gaps widened as data volume, degree of parallelism, and/or number of processing nodes grew (see figure). In other words, Storm consumed more hardware and delivered far less performance than InfoSphere Streams in the benchmark tests.
Single-node throughput in a real-time statistical features calculation pipeline for streaming email
When hardware, software, development, and maintenance costs are included in a multiyear return on investment (ROI) calculation, InfoSphere Streams demonstrated significant cost-effectiveness. Loggly, on the other hand, learned from experience that Storm did not perform sufficiently for its needs, and it decided to build its own infrastructure.
IBM also published a log processing benchmark,7 including data, code, and results, in response to a request by a client organization to evaluate several alternatives in a blind comparison. Setting that benchmark against Loggly is admittedly not an apples-to-apples comparison, because Loggly is designed to deliver a comprehensive log management solution. Still, a comparison can be drawn using one small step in the process: the collector or ingest operation.
InfoSphere Streams running on a single 8-core host can sustain 3,500,000 EPS and, as the benchmark shows, can scale up to 14,000,000 EPS using four 8-core hosts and 10 Gigabit Ethernet (10GbE) connectivity.8 That benchmark, however, used 70-byte messages, whereas Loggly processes 300-byte messages. Even if InfoSphere Streams performance is discounted by the size ratio of roughly 4.3 (300 divided by 70), the single-host figure is still about 814,000 EPS, compared with 250,000 EPS for the Loggly collector on an 8-core host. And if the Loggly core count was in fact four, halving the InfoSphere Streams figure to match a 4-core host still leaves about 407,000 EPS.
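The discounting above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the throughput and message-size figures are the ones quoted in this article, while the function name `discounted_eps` and the assumption that throughput scales linearly with message size and with core count are simplifications introduced here, not results from either benchmark.

```python
# Figures quoted in the article.
STREAMS_EPS_8CORE = 3_500_000   # InfoSphere Streams, 70-byte messages, one 8-core host
STREAMS_MSG_BYTES = 70
LOGGLY_MSG_BYTES = 300
LOGGLY_COLLECTOR_EPS = 250_000  # Loggly C++ collector, per its presentation


def discounted_eps(base_eps, base_bytes, target_bytes, base_cores=8, target_cores=8):
    """Scale a throughput figure down for larger messages and fewer cores.

    Assumes (hypothetically) that throughput falls linearly with message
    size and with core count; real systems rarely scale this cleanly.
    """
    size_factor = round(target_bytes / base_bytes, 1)  # 300 / 70 rounds to 4.3
    return base_eps / size_factor * (target_cores / base_cores)


eps_8core = discounted_eps(STREAMS_EPS_8CORE, STREAMS_MSG_BYTES, LOGGLY_MSG_BYTES)
eps_4core = discounted_eps(STREAMS_EPS_8CORE, STREAMS_MSG_BYTES, LOGGLY_MSG_BYTES,
                           target_cores=4)

print(f"8-core estimate: {eps_8core:,.0f} EPS "
      f"({eps_8core / LOGGLY_COLLECTOR_EPS:.1f}x the Loggly collector)")
print(f"4-core estimate: {eps_4core:,.0f} EPS "
      f"({eps_4core / LOGGLY_COLLECTOR_EPS:.1f}x the Loggly collector)")
```

Rounded to the nearest thousand, the two estimates match the 814,000 and 407,000 EPS figures in the text. The ratio against the 250,000 EPS collector depends on which core count applies to the Loggly measurement, which the source presentation leaves ambiguous.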
Build-it-yourself infrastructure uncertainty
Was building rather than buying the right choice for Loggly? Many IBM client organizations would answer “no,” and that answer doesn’t preclude them from innovating and creating competitive value. Recognize, though, that Loggly also concluded that Storm cannot solve every streaming problem, whether for performance or functional reasons, and so it chose a custom, build-it-yourself path. The collectors very likely deliver functional value to Loggly’s clients, but how long they took to develop is unknown, and what they do beyond ingesting, validating, and parsing is largely a guess.
However, on the comparable ingest example, InfoSphere Streams offers up to three times the performance of the optimized C++ threaded log collector developed by Loggly. Imagine how much time Loggly may have saved, or how it may have otherwise applied an expanded budget to enhance high-value capabilities. Like IBM’s client organizations, Loggly could potentially deploy infrastructure cost-effectively and help improve developer productivity by implementing InfoSphere Streams. There is still time, perhaps for Loggly Gen3?
Please share any thoughts or questions in the comments.
1,2 “What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope,” by Manoj Chaudhary, Loggly blog post, April 2014.
3 “Unmeltable Infrastructure at Scale: Using Apache Kafka, Twitter Storm, and ElasticSearch on AWS,” by Jim Nisbet and Philip O’Toole, AWS re:Invent SlideShare presentation, November 2013.
4 “Infrastructure at Scale: Using Apache Kafka, Twitter Storm, and ElasticSearch,” YouTube video presentation of a Loggly case study, November 2013.
5,6 Based on benchmark testing performed April 2014 documented in “Of Streams and Storms: A Direct Comparison of IBM InfoSphere Streams and Apache Storm in a Real-World Use Case – Email Processing,” by Andrew Bainbridge, Eric Bouillet, Zubair Nabi, and Chris Thomas, IBM Research Dublin and IBM Software Group Europe white paper, April 2014.
7,8 InfoSphere Streams applications for implementing a set of complex event processing applications for analyzing log messages in a downloadable Streams Applications Zip file, developerWorks, February 2012.