The Role of Stream Computing in Big Data Architectures

Analyzing data in motion is a critical capability in a balanced big data infrastructure

Big Data Evangelist, IBM

Stream computing is often the outlier in discussions about big data architectures, but it shouldn't be. Its core role is to deliver velocity, meaning extremely low-latency processing of data in motion, and it does not rely on high-volume storage to do its job.

By contrast, the big data platforms that often gain the most mindshare—the massively parallel processing architectures underlying enterprise data warehouses, Apache Hadoop, and other analytics databases—usually require high-volume storage. This storage can have a considerable physical footprint within the data center and is therefore generally more visible than a stream computing architecture, which might be distributed across smaller servers in many data centers.

Clearly, a balanced big data architecture (one that enables maximum velocity, volume, and variety) needs stream computing to supplement and integrate with other approaches. From an architectural standpoint, a comprehensive big data platform provides a latency-agile resource that persists, aggregates, and processes any dynamic mix of at-rest and in-motion information. It's best to think of this comprehensive big data fabric as consisting of multiple fit-for-purpose platforms* that incorporate specialized data persistence architectures: short-latency persistence (caching) of in-motion data in stream computing, and long-latency persistence (storage) of at-rest data in the enterprise data warehouse, Hadoop, and so on. Each fit-for-purpose persistence platform can be optimized to execute the various analytic models, workloads, and jobs associated with the type of data it is designed to handle.
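To make the two kinds of persistence concrete, here is a minimal Python sketch. It is an illustration only, not how InfoSphere Streams or any warehouse is implemented; the names `InMotionCache`, `AtRestStore`, and `ingest` are hypothetical, and a plain list stands in for long-term storage.

```python
from collections import deque


class InMotionCache:
    """Short-latency persistence: holds only the newest readings in memory."""

    def __init__(self, max_readings):
        # A bounded deque silently discards the oldest reading when full.
        self._window = deque(maxlen=max_readings)

    def add(self, reading):
        self._window.append(reading)

    def mean(self):
        return sum(self._window) / len(self._window) if self._window else None


class AtRestStore:
    """Long-latency persistence: keeps every reading for later batch analysis."""

    def __init__(self):
        self._rows = []  # hypothetical stand-in for a warehouse table or Hadoop file

    def append(self, reading):
        self._rows.append(reading)

    def mean(self):
        return sum(self._rows) / len(self._rows) if self._rows else None


def ingest(readings, cache, store):
    """Route each arriving reading to both persistence tiers."""
    for reading in readings:
        cache.add(reading)
        store.append(reading)


cache = InMotionCache(max_readings=3)
store = AtRestStore()
ingest([1, 2, 3, 4, 5], cache, store)
print(cache.mean())  # 4.0, the mean of the 3 newest readings
print(store.mean())  # 3.0, the mean of all 5 readings
```

The same data feeds both tiers, but each answers a different question: the cache reflects what is happening right now, while the store supports analysis over the full history.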

The practical distinctions between these fit-for-purpose big data platforms are blurring. Stream computing architectures increasingly process many of the same types of analytics that may also be executed on Hadoop or enterprise data warehouse (EDW) platforms. In addition, stream computing platforms supplement the out-of-the-box multi-latency capabilities of EDW, Hadoop, and other big data platforms. For example, all of IBM's core big data platforms (IBM® InfoSphere® Streams stream computing, IBM InfoSphere BigInsights™ Hadoop-based analytics, and IBM PureData® EDW software) can execute MapReduce models for advanced analytics. InfoSphere Streams rapidly ingests, analyzes, and correlates information as it arrives from real-time sources.
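As a rough picture of what correlating information as it arrives can mean, the Python sketch below processes events one at a time and flags any key seen from more than one source within a short time window. It is not InfoSphere Streams code; the `correlate` function, the `(timestamp, source, key)` event shape, and the sample data are all assumptions made for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


def correlate(events, window_seconds):
    """Yield (timestamp, key, sources) when a key has been seen from
    two or more distinct sources within the last window_seconds."""
    recent = defaultdict(deque)  # key -> deque of (timestamp, source)
    for timestamp, source, key in events:
        seen = recent[key]
        # Evict observations that have fallen out of the time window.
        while seen and timestamp - seen[0][0] > window_seconds:
            seen.popleft()
        seen.append((timestamp, source))
        sources = {s for _, s in seen}
        if len(sources) > 1:
            yield timestamp, key, sorted(sources)


# Hypothetical sample feed: (timestamp in seconds, source, key)
sample = [
    (0, "web", "user-1"),
    (2, "payments", "user-2"),
    (4, "payments", "user-1"),
    (30, "web", "user-2"),
]
matches = list(correlate(sample, window_seconds=10))
print(matches)  # [(4, 'user-1', ['payments', 'web'])]
```

Only a small, bounded amount of recent state is held per key, which is why this style of processing does not depend on high-volume storage.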

In the recently released InfoSphere Streams 3.0, IBM has taken its stream computing functionality to the next level of scale and sophistication. Whether deployed on its own or in conjunction with other platforms, InfoSphere Streams now accomplishes the following:

  • Handles simple and extremely complex analytics with agility
  • Scales for computational intensity
  • Supports a wide range of relational and non-relational data types
  • Analyzes continuous, massive volumes of data at rates up to petabytes per day
  • Performs complex analytics of heterogeneous data types including text, images, audio, voice, VoIP, video, web traffic, email, GPS data, financial transaction data, satellite data, sensor data, and any other type of digital information that is relevant to your business
  • Leverages sub-millisecond latencies to react to events and trends as they are unfolding, while it is still possible to improve business outcomes
  • Adapts to rapidly changing data forms and types
  • Seamlessly deploys applications on computer clusters of any size
  • Meets current reaction time and scalability requirements with the flexibility to evolve with future changes in data volumes and business rules
  • Supports rapid development of new applications that can be mapped to a variety of hardware configurations and adapted to shifting priorities
  • Provides security and information confidentiality for shared information
  • Provides a new IBM Netezza loader to improve performance when loading data into the data warehouse platform
  • Integrates out of the box with IBM Information Server for improved exchange of information with IBM InfoSphere DataStage and tighter interoperability with extract, transform, and load (ETL) jobs in DataStage
  • Integrates out of the box with the latest release of BigInsights (version 2.0)

In addition to all these new features, InfoSphere Streams works with several optional IBM solution accelerators for custom development of key real-time big data applications. Accelerators supported in this release include the Time Series Accelerator, Geospatial Accelerator, IBM Accelerator for Telecommunications Event Data Analytics, IBM Accelerator for Social Data Analytics, and IBM Accelerator for Machine Data Analytics. InfoSphere Streams 3.0 also comes standard with several toolkits (financial, mining, complex event processing, and advanced text analytics) to help low-latency big data projects in any of these areas reach successful outcomes quickly.

In many ways, stream computing—as implemented in InfoSphere Streams—is a full-fledged, enterprise-grade runtime engine and development platform for the vast range of real-time big data applications. But stream computing can also be deployed outside of big data environments as a low-latency data integration technology for operational business intelligence, business event monitoring, and other applications that don’t require large volumes or wide varieties of data types.

In other words, stream computing can also play a central role in otherwise small data applications.

* "Introducing Fit-for-Purpose Architectures," by Tom Deutsch, IBM Data magazine, April 2012.

What do you think? Let me know in the comments.
