Getting the Big Data Ball Rolling

Data in motion is an important part of the big data story, and here’s why

Executive IT Specialist, Competitive and Product Strategy, IBM Analytics, IBM

Several years ago, IBM introduced the IBM® Smarter Planet® concept and its complementary tagline: “Instrumented. Interconnected. Intelligent.” This concept seemed like a vision into the far future. It looked like an interesting view of things to come, but how would we get there and what would make this concept intelligent?

Sure, we knew about technologies such as radio-frequency identification (RFID), active or passive, and we knew about sensors being deployed in many places. Still, the Smarter Planet concept seemed like a future that might or might not happen. Then, all of a sudden, we started hearing about big data, smartphones, smart meters, network-connected (smart) appliances, connected cars, and the Internet of Things. The future, it seemed, had arrived.

A lot of discussions have taken place about how to store all that data and analyze it. But those discussions often left out an important piece of the puzzle: data in motion and the distinction between it and data at rest.

Data at rest

Data at rest has been a staple of data processing. It still is crucial, and there are no reasons why it should not stay that way for a long, long time. Data at rest is used in transaction processing systems, data marts, and data warehouses. The benefits of data at rest are well documented, so there is no need to rehash them here. Instead, consider two shortcomings.

The first shortcoming is that data must be stored before it can be processed. This requirement introduces a delay between receiving the data and generating information. It also means that a completed analysis represents only a snapshot of the state of the enterprise at the time the data was loaded. As long as we consciously evaluate this lag and find it acceptable, everything is okay. In some situations, however, the interval between receiving the data and generating actionable information makes the data-at-rest approach unacceptable. At the limit, data may arrive as fast as, or faster than, it can be loaded. In that case the data-at-rest approach is, at best, not a complete answer.

The second shortcoming relates to data retrieval for analytics. Data repositories use strategies such as indexing and partitioning to speed up data retrieval. For analytics, the best organization appears to be a well-planned partitioning scheme. Still, there is always overhead in retrieving the appropriate subset of data, and whichever strategy is used, you generally retrieve more data than you really want. In some cases, the overhead comes from having to re-create the context for the latest data, such as performing a total-to-date calculation: each time the snapshot is updated, the entire calculation must be redone.
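The cost of redoing a total-to-date calculation can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `total_from_snapshot` and `RunningTotal` are hypothetical, the stored snapshot is modeled as a plain list, and no particular IBM product or repository is implied. The first function rescans everything stored each time the snapshot changes; the class keeps the context (the total so far) and updates it with each new value.

```python
from typing import Iterable, List


def total_from_snapshot(snapshot: Iterable[float]) -> float:
    """Data-at-rest style: rescan every stored value on each refresh."""
    return sum(snapshot)


class RunningTotal:
    """Incremental style: keep the total so far and update it per value."""

    def __init__(self) -> None:
        self.total = 0.0

    def add(self, value: float) -> float:
        # Only the new value is touched; earlier values are never reread.
        self.total += value
        return self.total


# Hypothetical sample amounts, used only to compare the two approaches.
stored: List[float] = []
running = RunningTotal()
for amount in [10.0, 20.0, 5.0]:
    stored.append(amount)                      # store first ...
    batch_total = total_from_snapshot(stored)  # ... then rescan everything
    stream_total = running.add(amount)         # versus one addition
    assert batch_total == stream_total
```

Both approaches give the same answer; the difference is that the rescan grows with the size of the snapshot, while the incremental update does a constant amount of work per new value.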

Data in motion

Before anything can be done with data, it has to be in motion. When data is read from a file, that action takes the file data and puts it in motion. Object-oriented programming has a similar concept: objects in memory (those in motion) must be differentiated from objects at rest (those stored in a repository).

Data in motion is about processing data as it flows through the system, also known as streaming data. When a piece of information such as a tuple arrives, you need to decide what to do with it. You may also have to keep a window of tuples available for processing. Imagine calculating a moving average: you must keep all the tuples that contribute to the average and adjust the window each time a new tuple arrives, adding the newcomer and dropping the oldest.
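The moving-average case can be sketched in a few lines of Python. This is a minimal illustration, not how any particular streaming product implements windows: the class name `MovingAverage`, the count-based window, and the sample values are all assumptions made for the example. Each arriving value is added to a fixed-size window held in memory, the oldest value is evicted once the window is full, and the average is updated without rescanning the window.

```python
from collections import deque
from typing import Deque


class MovingAverage:
    """Count-based sliding window over a stream of numeric values."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window: Deque[float] = deque()
        self.total = 0.0

    def push(self, value: float) -> float:
        """Add a new value, evict the oldest if needed, return the average."""
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # The window is full: drop the oldest value from it.
            self.total -= self.window.popleft()
        return self.total / len(self.window)


# Hypothetical readings arriving one at a time.
average = MovingAverage(size=3)
for reading in [1.0, 2.0, 3.0, 4.0]:
    print(reading, average.push(reading))
```

Only the values currently in the window are kept in memory, so the work per arriving value stays constant no matter how much data has already flowed through.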

Data in motion offers continuous processing that is as fast as possible, in part because everything processed is always in memory. It provides a real-time analytical paradigm to answer business needs that may require insights at sub-second response times. Further details in this area will emerge in upcoming installments of this column.

Comprehensive big data

The point of this discussion is that a comprehensive big data solution has to include the option to process both data at rest and data in motion. IBM has a big data architecture that takes this option and more into account for a future-proof solution.

Please share your thoughts and questions about big data, and specifically data in motion, in the comments. And look for continued discussions online and in future articles.
