Real-time Analytics - Low Latency and High Velocity
Well, there’s real-time, then there’s real-time, then there’s real-time. As so often with me, this post was first drafted on a plane, and when it comes to in-flight technology, real-time means very real-time. But after that, there’s pretty much a spectrum. Real-time fraud analytics for a credit card check is a few seconds at most; real-time for in-store market basket scoring to print a money-off coupon[i] is a few tenths of a second. Real-time update of marketing campaigns to mobile targets might be a few minutes.
So the first question I always ask when discussing real-time analytics is, “Can you give me a frinstance?” – a concrete example to understand precisely the realness of the real-time. Then you know what the target is, and you can start to drill down on how idealized the target is. Everybody wants instant, but the real target is to beat the point at which the answer arrives too late, or at which the delay starts to detract from the value rather than add to it. (A credit card check that took a minute would cost more in customers driven away from that card than it could gain in better fraud detection.) In reality, there will be a curve of diminishing value over time, so the target will depend on what each extra increment of speed costs. You’re probably thinking, “Oh great! Another variable in my technology choices!” But press on. It’s valuable data for making the decision.
The “frinstance” example gives you something else to start working on – the complexity of the analytics. You can generally score an instance against a model pretty quickly – you won’t have to access a mass of data to do that. Toronto Hospital for Sick Kids is a good example of this. They “score” readings from newborn babies to look for patterns that indicate whether they are in the early, otherwise undetectable, stages of a blood infection.
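To show why scoring is the cheap end of the spectrum, here is a minimal sketch in Python. It assumes a simple logistic model whose coefficients were fitted earlier by some batch job; the feature names, weights and bias are made up for illustration and have nothing to do with the hospital's actual method. The point is only that scoring touches one reading and a handful of stored numbers, not the historical data.

```python
import math

# Hypothetical coefficients, as if produced offline by a batch training job.
WEIGHTS = {"feature_a": -1.2, "feature_b": 0.8}
BIAS = -0.5


def score(reading):
    """Score one incoming reading against the stored model.

    The work is a dot product plus a logistic squash, so the cost depends
    on the number of features, not on how much history the model was
    trained on.
    """
    z = BIAS + sum(weight * reading[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(score({"feature_a": 0.3, "feature_b": 1.1}))
```

However the model was built, the real-time path only needs the coefficients in memory, which is why this kind of use case can hit very tight latency targets.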
But if you need to do market basket analysis on a month’s data first, you can forget it. And somewhere in between there are requirements that need, say, moving averages or percentiles. And if they need to be accurate (not approximations against last night’s data, for example) and if the data is very rapidly changing, then you have another profile of use case.
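For the in-between case, a rough sketch of what “accurate moving averages or percentiles” means in practice might look like the following. It is one possible approach, not a recommendation of a product: a fixed-size window over the most recent values, with a running total for the mean and a sorted copy for nearest-rank percentiles. The class name `SlidingWindow` and the window size are my own assumptions for the example.

```python
import math
from bisect import bisect_left, insort
from collections import deque


class SlidingWindow:
    """Exact mean and percentiles over the last `size` values seen."""

    def __init__(self, size):
        self.size = size
        self.values = deque()  # values in arrival order
        self.ordered = []      # the same values, kept sorted
        self.total = 0.0       # running sum of the window

    def add(self, x):
        self.values.append(x)
        insort(self.ordered, x)
        self.total += x
        if len(self.values) > self.size:
            # Evict the oldest value from both views of the window.
            old = self.values.popleft()
            self.total -= old
            del self.ordered[bisect_left(self.ordered, old)]

    def mean(self):
        return self.total / len(self.values)

    def percentile(self, p):
        """Nearest-rank percentile for p in (0, 100]."""
        rank = max(1, math.ceil(p / 100 * len(self.ordered)))
        return self.ordered[rank - 1]


if __name__ == "__main__":
    window = SlidingWindow(size=3)
    for value in [1, 2, 3, 4]:
        window.add(value)
    print(window.mean(), window.percentile(50))
```

Every arrival updates the answer, so it is never stale the way last night's batch figure is. The price is that the whole window has to be held and maintained at the arrival rate, which is exactly what makes this a different profile from simple scoring.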
It’s a class of use cases that illustrates again why big data is not a one-size-fits-all problem. I’ve written recently about relational databases and Hadoop for data-at-rest analytics and about streaming data analytics. This category is somewhere between data at rest and data in motion. It’s typically very high velocity (for example, network instrumentation data or web logs) but the analytics requires more than “scoring” instances or short duration patterns against a model derived from batch analysis of the stored data. Where the arrival rate is very, very high, it affects the basis of the calculated model so that refreshing it by batch, maybe once a day, is too infrequent. In these cases, you need real-time analytics against all the data. And that’s a different class of use cases to the ones I’ve discussed so far.
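As a small illustration of keeping the basis of a model current without waiting for a batch refresh, here is a sketch using Welford's online algorithm for a running mean and variance. I've chosen it only because it is a well-known way to fold each new value into summary statistics as it arrives; a real system in this category would maintain far richer state, and the class name `RunningStats` is just for this example.

```python
class RunningStats:
    """Mean and sample variance updated one value at a time (Welford)."""

    def __init__(self):
        self.n = 0       # number of values seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the new mean.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; defined as 0.0 until there are two values.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
    print(stats.mean, stats.variance())
```

The statistics reflect every value that has arrived, with constant work per arrival and no second pass over stored data, which is the property a once-a-day batch refresh cannot give you.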
I’ve tried to provide my own “frinstances” here, as ever. And I realize they are all stories I’ve told before, but I prefer to use publicly available stories over anonymized case histories and projects-in-progress. The nice thing is that the fund of available stories is ever growing, far faster than the list of projects I’m personally familiar with. I’ll just have to make time to research them all myself so I don’t get too boring.