Analyzing the evolution of streaming analytics architectures
Part 2 of 2
The Lambda Architecture is a widely used framework for combining streaming and batch analytics in a big data world. Nathan Marz coined the term based on his experience working on distributed data processing systems at well-known organizations such as Twitter. It describes three layers, each with its own responsibilities and characteristics, that can be implemented with different technologies. IBM provides products that can be used to implement the Lambda Architecture.
Part one of this two-part series looks at the business drivers and technological innovations that are bringing streaming analytics into the mainstream in many industries and for many uses. It also discusses the evolving architecture of streaming analytics in hybrid, big data analytics infrastructures. This second and concluding installment takes a more granular look at the architectural issues around streaming data technologies.
Acting on the data
The Lambda Architecture handles both streaming and historical data, but like many architectures it has pros and cons that determine its effectiveness for specific requirements. It treats all data as immutable and recalculates updated batch views across all the historical data in much the same way as Apache Spark Streaming. This recalculation increases the amount of data stored and the computation required to maintain the batch views. For some requirements, the resource cost of this approach can outweigh the benefits it provides.
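The cost of full recalculation can be illustrated with a minimal Python sketch. It is not taken from any Lambda Architecture implementation: the in-memory master_dataset, append_event and recompute_batch_view are hypothetical names chosen for illustration, and a real batch layer would run on a distributed store.

```python
from collections import defaultdict

# Hypothetical append-only master data set: events are never updated in place.
master_dataset = []

def append_event(user, amount):
    """Record a new immutable fact; existing events are left untouched."""
    master_dataset.append((user, amount))

def recompute_batch_view():
    """Rebuild the batch view from scratch by scanning every stored event."""
    view = defaultdict(int)
    scanned = 0
    for user, amount in master_dataset:
        view[user] += amount
        scanned += 1
    return dict(view), scanned

append_event("alice", 10)
append_event("bob", 5)
view_1, scanned_1 = recompute_batch_view()

append_event("alice", 7)                    # one new event arrives
view_2, scanned_2 = recompute_batch_view()  # yet every stored event is rescanned
```

In this sketch the number of events scanned grows with the full history rather than with the amount of new data, which is the resource trade-off described above.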
In addition, the Lambda Architecture is based on keeping up-to-date views of data for traditional request-response style queries. This approach is useful for evolving analytics toward fast response times. Leaders in this space, however, are acting on the data as it happens rather than making repeated queries. Idea Cellular offers one example.
How does the Lambda Architecture map to the chief streaming and batch components of the IBM analytics platform? IBM BigInsights can be used for the Lambda Architecture’s batch layer, which holds a repository of data and pre-computes the batch views. IBM InfoSphere Streams can be used for the speed layer, which computes the incremental real-time views. The serving layer’s batch and real-time views can be stored in BigInsights and queried using the Big SQL capabilities of BigInsights. Organizations can implement variations of this layer to suit specific needs.
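The division of labor among the three layers can be sketched in a few lines of Python. This is an illustrative, single-process sketch only: it does not use BigInsights, InfoSphere Streams or Big SQL, and build_batch_view, SpeedLayer and query are hypothetical names standing in for the batch, speed and serving layers.

```python
from collections import Counter

def build_batch_view(historical_events):
    """Batch layer: pre-compute counts per key over the stored history."""
    return Counter(key for key, _ in historical_events)

class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""
    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, key):
        self.realtime_view[key] += 1

def query(batch_view, realtime_view, key):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view[key] + realtime_view[key]

# Stored history as hypothetical (key, payload) pairs.
history = [("page_a", 1), ("page_b", 2), ("page_a", 3)]
batch_view = build_batch_view(history)

speed = SpeedLayer()
speed.on_event("page_a")   # arrives after the batch view was built
speed.on_event("page_c")   # a key the batch view has not seen yet

total_a = query(batch_view, speed.realtime_view, "page_a")
total_c = query(batch_view, speed.realtime_view, "page_c")
```

The query result is complete only because both views are consulted: the batch view covers history, and the real-time view covers whatever arrived since the last batch run.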
InfoSphere Streams also provides an alternative when used in conjunction with a resilient data feed such as Apache Kafka. This alternative uses similar components but computes all views once through the stream processing engine, removing the need to constantly re-compute batch views over the entire data set. Streaming jobs can also re-compute views over historical data, either by replaying Kafka's own log of the data or by re-feeding the data to Kafka from Apache Hadoop.
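The replay idea can be sketched in Python as follows. The sketch deliberately avoids the real Kafka and InfoSphere Streams APIs: RetainedLog is a hypothetical in-memory stand-in for a retained, offset-addressed feed, and RunningTotal is a hypothetical streaming job.

```python
class RetainedLog:
    """Stand-in for a resilient feed: an append-only list addressed by offset."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Yield records in order, starting at the given offset."""
        for record in self._records[offset:]:
            yield record

class RunningTotal:
    """Streaming job: updates its view once per record, with no batch rescans."""
    def __init__(self):
        self.view = {}

    def process(self, record):
        key, value = record
        self.view[key] = self.view.get(key, 0) + value

log = RetainedLog()
live_job = RunningTotal()
for record in [("sensor_1", 4), ("sensor_2", 1), ("sensor_1", 6)]:
    log.append(record)
    live_job.process(record)  # each record is processed once as it arrives

# Re-computing history: start a fresh job and replay the log from offset 0.
replayed_job = RunningTotal()
for record in log.read_from(0):
    replayed_job.process(record)
```

Because the same job logic serves both the live feed and the replay, there is one code path to maintain instead of separate batch and speed implementations.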
Designed to be open, InfoSphere Streams packages and leverages about 70 open source components, from the Apache Batik toolkit to the Eclipse Framework to Apache ZooKeeper. In early 2014, IBM created an open source site for InfoSphere Streams on GitHub. The site now hosts about 35 projects, ranging from adapters for the Hadoop Distributed File System (HDFS), the MongoDB database and the Apache HBase database to functional toolkits such as a regular expression (regex) evaluation tool.
Some organizations also consider using Spark with their existing Hadoop or database environment. In that case, Spark Streaming can handle the streaming requirements, or InfoSphere Streams can be used because it has analytics toolkits and integrates with the Spark machine-learning library (MLlib).
Detecting insight in data streams
Like building a house or trying to find the best way to keep cool during the hot summers, choosing an architectural model means finding one well suited to the project at hand. After deciding on an architectural model, the next step is to think about how to implement the solution. Stream computing continuously integrates and analyzes data in motion to deliver analytics, and it consists of both a development environment and a high-speed runtime architecture. It also enables organizations to detect insights (risks and opportunities) in data streams that can be detected only at a moment’s notice. Without it, high-velocity data flows from sources such as market data, Internet of Things sensors, mobile devices, clickstreams and transactions remain largely unnavigable.
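As a rough illustration of analyzing data in motion, the following Python sketch flags a reading as soon as it arrives if it is well above the recent average. It is a toy example under stated assumptions, not how any particular product works: detect_spikes, the window size and the threshold are all hypothetical choices.

```python
from collections import deque

def detect_spikes(readings, window_size=3, threshold=1.5):
    """Yield (index, value) when a value exceeds threshold times the mean of the prior window."""
    window = deque(maxlen=window_size)  # keeps only the most recent readings
    for index, value in enumerate(readings):
        if len(window) == window_size:
            baseline = sum(window) / window_size
            if value > threshold * baseline:
                yield index, value      # emitted immediately, not after a batch run
        window.append(value)

# The input is consumed as an iterator, one reading at a time.
stream = iter([10, 11, 9, 10, 30, 12, 11])
alerts = list(detect_spikes(stream))
```

The detector holds only a small window of state and never stores or rescans the full stream, which is what lets this style of processing keep up with high-velocity sources.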
Many stream computing solutions provide a development environment that allows developers to build applications in the language of their choice. For example, both InfoSphere Streams and Spark Streaming support Scala and Java. Some vendors provide integrated development environments (IDEs) or consoles for administrators and developers to monitor application health and performance. Integration with business intelligence (BI), warehousing or visualization tools such as Zoomdata, Datawatch, IBM Cognos or Microsoft Excel is essential to making streaming data easy to use. Much more can be said about the features of stream computing solutions, but the development environment and the high-speed runtime are the basic building blocks.
Even while enjoying some time at the beach this summer or journeying to a favorite vacation spot, take some time to think about data streams. Is your organization creating them or connecting to them? How can you analyze them? The answers to these questions can no doubt make your work highly productive when you return from the beach or vacation.
InfoSphere Streams is now available in a cloud-based offering for developers looking to try it out quickly. Learn more about IBM streaming analytics in an analyst report. Also, give the Quick Start edition a trial run. If performance is a top concern, take a look at this InfoSphere Streams benchmark study. And then experience the full power of the IBM advanced analytics portfolio, including InfoSphere Streams. Also, register today for Insight 2015, the premier forum for the insight economy.