Big Data: The Data Velocity Discussion
If there’s more and more data arriving and time isn’t expanding[i], then data must be arriving at greater and greater velocity.
In my last post I talked about Variety in the Volume, Variety, Velocity triumvirate. There’s more to be said about that, but first I’d like to take a run at Velocity. We’ve got used to the idea that you load stuff into a database (or other data store) and then you take a look at it. That’s just too slow for lots of operational decision-making processes. And if you think about it, as the volume of data available increases, the bar is constantly rising on real-time analysis. But for many kinds of decisions, you just need the data that comes with the event you want to decide about: is this a fraudulent transaction? Was this call dropped?
So why put it in a database first anyway? Why not assess the data and make the decision as it streams past in real time?
This is why a big data strategy has to encompass the analysis of data in motion. Analyzing data in motion requires technology to evaluate and take action on data ‘events’ as they happen. And that might be thousands or hundreds of thousands of events per second. In most cases an event will require no action – typically, analysis of data in motion means monitoring a stream looking for exceptions or exceptional patterns. But the rate at which the events occur – the velocity at which the data arrives – means there is no time to store it before applying the analytics.
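To make the idea concrete, here is a minimal sketch (not InfoSphere Streams itself, just an illustration of the pattern) of evaluating each event as it arrives and acting only on exceptions, without ever landing the raw stream in a database first. The event values and the threshold are illustrative assumptions.

```python
def monitor(events, threshold=3.0):
    """Flag readings that deviate sharply from a running baseline."""
    count, mean, m2 = 0, 0.0, 0.0  # Welford's online mean/variance
    for value in events:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        if count > 10:  # wait for a stable baseline before alerting
            std = (m2 / (count - 1)) ** 0.5
            if std > 0 and abs(value - mean) > threshold * std:
                yield count, value  # exception: act now, nothing is stored

# Most events pass silently; only the outlier surfaces.
stream = [10.0] * 50 + [55.0] + [10.0] * 20
alerts = list(monitor(stream))
```

The point of the sketch is that the analytic (here, a running mean and standard deviation) is updated per event in constant memory, so it keeps up no matter how fast the stream arrives.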
The ‘IBM data babies’ video on YouTube is a great customer story for IBM’s InfoSphere Streams product[ii]. The Toronto Hospital for Sick Kids uses Streams to monitor all the neonatal patients in real time. Now some of us may prefer to visualize Nurse in her crisp starched uniform, clipboard in hand, ticking off the cute, snuffling, sleepy babies one by one, but second-by-second monitoring is going to tell us there’s a problem a lot sooner in most cases, and Nurse can get on with something more urgent.
PNNL – Pacific Northwest National Laboratory – is a joint Streams and Netezza story in the back end, with a bunch of other IBM technology integrating the whole project. They analyze the events from the instrumentation of their smart grid in real time, looking for exceptions and failures of components in the grid and providing feedback to consumers to help them manage consumption. PNNL then captures the events in the Netezza box and analyzes them for patterns to make predictions about grid behavior. That’s an example of two uses of the same data, with two different big data technologies delivering two different ‘insights’.
That’s the key for Streams: you need to know now! You can’t wait to collect the data, load it into the warehouse or Hadoop grid and analyze it.
But often you do want to load it into the warehouse (or Hadoop grid) afterward to get extra value from further analysis. In fact it’s a use-case pattern in itself – real-time analysis on streaming data followed by predictive analysis of the same data at rest. This comes up because Streams is scoring a model live against the arriving data, while the predictive analytics seeks to improve that model so that Streams becomes ‘smarter’ at monitoring the event stream. You develop the model, you analyze the collected data over time to refine it, and you feed the improved model back to Streams.
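The feedback loop described above can be sketched in a few lines. This is a hedged illustration, not Streams or SPSS code: the “model” is just a threshold learned from labelled history, standing in for a real predictive model, and all the event values and labels are made up.

```python
def score(event, model):
    """Live scoring: flag the event if it exceeds the current threshold."""
    return event > model["threshold"]

def refine(history, labels):
    """Batch analysis at rest: pick the threshold that best separates
    the labelled history (a stand-in for real predictive analytics)."""
    candidates = sorted(set(history))
    best = max(candidates,
               key=lambda t: sum((h > t) == l for h, l in zip(history, labels)))
    return {"threshold": best}

model = {"threshold": 100}              # initial hand-set model
history, labels, flagged = [], [], []
for event, truth in [(40, False), (120, True), (80, True), (150, True)]:
    flagged.append(score(event, model)) # real-time decision on the stream
    history.append(event)               # ...and keep the same data at rest
    labels.append(truth)

model = refine(history, labels)         # offline pass produces a better model
```

With the initial model, the event of value 80 slips through unflagged; after the batch refinement lowers the threshold, the live scorer would catch it – the same “smarter over time” loop the hospital example below relies on.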
The Toronto Hospital for Sick Children used predictive analytics on the data collected from the babies to identify patterns that potentially indicated an infection developing, long before it was otherwise visible. The hospital then added those patterns back into Streams, and they now believe they are spotting some blood infections up to 24 hours sooner.
Before I finish, there is another place where Velocity gets talked about, which is Ad-hoc Analytics.
Ad-hoc analytics is about complex analytic queries that access and analyze data in unplanned ways. These are not the simple data-retrieval queries that just need to be answered quickly and that are the bread and butter of operational data stores. I’m talking about sieving through potentially billions of records and applying complex statistics to find the answer.
Conventional relational databases don’t like ad-hoc queries, but the IBM Netezza guys will always encourage the customer to throw ad-hoc queries in during the execution of their Proof of Concept (PoC) projects – queries they didn’t include in the PoC workload spec. That really throws competitors whose databases are tuned and indexed for a specific query workload; IBM Netezza’s unique architecture means it takes them in its stride.
One place this shows up in the real world is what data scientists call ‘speed of thought’ analysis: a data analyst tries something, gets a result, changes their ideas based on the result and works incrementally towards the real insight they are looking for – which they may not have known they were looking for when they started. It’s a common work pattern for data analysts, and BSkyB in the UK are a great story for this – up to 100 analysts doing ‘speed of thought’ analysis on their Netezza box. But of course this is for relational data only. Let’s come back to ad-hoc when we talk about unstructured data.
[i] - I didn’t check with Stephen Hawking, so I may be out of date on this one.
[ii] - and watch the ‘IBM data babies’ behind-the-scenes cut if you want to know what IBM spends its marketing dollars on.