
Processing real-time data streams with visual analytics

CEO, Zoomdata, Inc.

People forget there’s a good reason why we call it “streaming data” when we talk about a platform for delivering analytics around the Internet of Things (IoT) and other real-time challenges such as fraud detection, recommendation engines, sentiment analysis and more.

Visual analytics tools should be able to process real-time, streaming feeds and correlate them with more static and historical data stores. At Zoomdata, we think about data like water. This design philosophy for visual analytics profoundly changes how you approach real-time data streams as well as stored data. Both are fluid.

Let's go back in history.

When talking about streaming, think about why we analyze data the way we do today. Traditionally we batched data, putting it into data warehouses and data lakes. We did this because the technology of the day could only process data in batches; processing it as streams was too expensive. But recently new technology, including Kafka, Spark and IBM InfoSphere Streams, has made it practical to handle data as real-time streams.
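
To make that shift concrete, here is a minimal sketch (in Python, using the open-source kafka-python client) of consuming transaction events from Kafka as they arrive rather than waiting for a nightly batch. The topic name, broker address and event fields are assumptions for illustration, not part of any particular product.

# A minimal sketch: reading transaction events as a continuous stream.
# Assumes a local Kafka broker and a hypothetical "transactions" topic;
# uses the kafka-python client.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transactions",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Each event is handled the moment it arrives, not hours later in a batch.
for message in consumer:
    event = message.value
    print(event.get("amount"), event.get("region"))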

Let's consider the earliest forms of recorded data. In ancient marketplaces, people would barter spices, shells or other goods. Eventually they started using some form of currency, and by doing so they were creating a stream of transactions. Those commerce transactions happened in public, creating observable transaction streams, but no one kept track of them; they simply vanished into the air.

Then people began to analyze their companies and started tracking finances and business performance with crude tools like ledgers and the abacus. They’d write down transactions as they came in, capturing them in books and scrolls. Eventually, we developed calculators and then computers that could analyze these streams.

Somewhere along the way we lost the inherent streaming nature of data. Maybe it was when we started punching the data onto punch cards for those early IBM mainframes. Perhaps it was the advent of storing data on magnetic media. Or it could have been the necessity to move data from one place to another using only physical magnetic media.

Why do we still batch-process data?

In the past we had to, because data architecture was constrained by how data could be moved. Databases were not built for streaming, and networks lacked the bandwidth to move data continuously and reliably. Deploying a streaming architecture was extremely complicated and expensive, so it was reserved for truly time-sensitive needs. Streaming began with stock tickers (as an aside, running the stock ticker was the highest-paid non-bonus job on Wall Street in the 1920s).

The typical data environment today still handles data as it did before streaming technology existed. Transactions occur in one place, then the data is batched up and shipped to data warehouses for storage and analysis, and any analysis runs against those batches. The whole Hadoop infrastructure was initially designed for batch processing, which means even some of the most modern technology frameworks are built around batch.

It's time for a change.

Ideally, we should stream data the moment it is created into a single point of record where the data lives. Doing so gives us complete transparency into the data's lineage and lets us analyze against a single source, so we know we are operating on the most recent, correct data without duplication.
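
As one hedged illustration of what streaming data at the moment of creation could look like, the sketch below has an application publish each transaction to the stream as soon as the record exists, so the stream itself serves as the point of record. It again uses the kafka-python client; the topic name, broker address and record fields are hypothetical.

# A sketch of writing each record to the stream at the moment it is created,
# instead of exporting it to a warehouse in a later batch. Topic, broker and
# fields are hypothetical; uses the kafka-python client.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def record_transaction(amount, region):
    # Publishing the event *is* the act of recording the transaction.
    event = {"amount": amount, "region": region, "ts": time.time()}
    producer.send("transactions", event)  # hypothetical topic name

record_transaction(42.50, "EMEA")
producer.flush()  # make sure the event is actually delivered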

By putting the analysis directly where the data is stored (what many call the data lake or data hub concept), we not only gain access to the freshest and cleanest data, we also leverage the power of the computers storing that data by processing locally. Bringing our questions to the data is far more efficient than copying the data onto another system.
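
One way to picture bringing the questions to the data is a query engine such as Spark running on the same cluster that holds the data lake, so the aggregation executes where the data already sits instead of after copying it elsewhere. The sketch below is only illustrative; the storage path and column names are assumptions.

# A sketch of analysis in place: Spark runs the aggregation on the cluster
# that stores the data lake rather than on a copy in a separate system.
# The path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analysis-in-place").getOrCreate()

# Read the freshest data directly from the lake (hypothetical location).
transactions = spark.read.parquet("s3://data-lake/transactions/")

# The computation is pushed to the nodes that hold the data.
totals = transactions.groupBy("region").agg(F.sum("amount").alias("total"))
totals.show()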

As human users of technology, we're already used to dealing with streams of data. Music that we once shipped and played in batches, on physical media, is now enjoyed as streams, and streaming video is how most of us get our movies and TV shows. It's more efficient (you don't need the whole catalog in your house) and it's there whenever you need it. Users expect streams today. Streaming data, and analyzing those streams, should be the default, enabling businesspeople to interact with data simply.

We face a world where the volume and variety of data are increasing dramatically. Computers already make decisions on data without humans in the loop; consider algorithmic trading, airbags and more. For some systems it makes sense to let computers analyze the streams, because the computer simply outperforms a human there. But for most applications, you still want and need humans to be able to explore and analyze the data.

Get more information on Zoomdata’s visual analytics tool for streaming data.