The 'Museum Clickers' of Big Data
A few weeks ago a colleague of mine over on the Acunu blog wrote a brief post illustrating the difference between batch analytics and real-time analytics – using the idea of a museum attendant with a clicker counting visitors, versus gathering up and counting all the entrance ticket stubs.
I liked the analogy and I started playing with it in my head. How about a count of leavers? Then I know how many people are in the museum. Or put clickers on each collection, so I know how many people have been to see the Egyptian collection; and so on and so forth.
But of course some of these are actually batch use cases. I am quite happy to wait to find out how many visitors we had today. And I’m happy to store the data so I can look at trends of visitor traffic and maybe correlate that with headline special exhibitions or whatever.
The key to my colleague Theo’s analogy, which I have just diluted here, is the need to know now, not just the ability to know now. Would I do something different now if I knew the answer to that question now?
Picking up where I left off in my previous post, let’s look at some real-time use cases to try to identify where an organization can find its own high-value, low-latency analytics opportunities. A typical one is monitoring and controlling a network of components. They might be:
- Masts, switches, etc. in a mobile telecoms network
- Producers, consumers, switches, transformers, connections etc. in a smart grid
- Stations, parts, operators, etc. in a manufacturing or assembly plant
In these cases, we’re monitoring status and exception events from all the components, raising alerts on out-of-tolerance values and typically visualizing key metrics and trends in a dashboard. These are high-value use cases because we do something differently now in response to certain conditions in the data.
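As a rough illustration of "raising alerts on out-of-tolerance values", here is a minimal Python sketch. The `Event` shape, the metric names and the tolerance bands are all invented for the example; a real deployment would take them from the monitored system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    component: str  # e.g. a mast, a transformer or an assembly station
    metric: str
    value: float


# Hypothetical tolerance bands: metric -> (low, high). Not from any real system.
TOLERANCES = {
    "temperature_c": (0.0, 85.0),
    "voltage_v": (220.0, 240.0),
}


def check(event, tolerances=TOLERANCES):
    """Return an alert message if the event is out of tolerance, else None."""
    band = tolerances.get(event.metric)
    if band is None:
        return None  # no tolerance defined for this metric
    low, high = band
    if low <= event.value <= high:
        return None
    return (f"ALERT {event.component}: {event.metric}={event.value} "
            f"outside [{low}, {high}]")


def alerts(events):
    """Yield alert messages for a stream of events, one event at a time."""
    for event in events:
        message = check(event)
        if message is not None:
            yield message
```

Each event is judged on its own as it arrives, which is what keeps the latency low: nothing has to be collected before the check can run.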
The key technology requirement is low-latency analysis on high-velocity data. This is typically challenging for databases: events arrive at very high rates (thousands per second or more) while queries run in parallel against that rapidly changing data. Even OLTP databases and operational data stores struggle here (it is doable, but at high cost). Other data stores (for example Hadoop with MapReduce) are designed for batch, so low-latency (sub-second) analytics is not a use case for them.
That leaves in-memory databases and NoSQL databases as candidates. It also leaves complex event processing (CEP) technologies, which are not strictly databases, since they don’t store the data, but do have the capability to handle high arrival rates.
So ... how to sort these sheep from the goats?
Mapping use cases onto technology solutions is never as clear-cut as we’d like; there is always overlap. In this instance, it’s mostly a function of how much data you need to collate to answer the queries. If you just need to look for exceptions, or even short-term exceptional patterns, that you can define in advance, then you only need the data as it arrives: you assess it, act on it (if necessary) and discard it. CEP engines are fine for this. You can get into a debate with a database vendor about whether their technology can do the job equally well or more cheaply while also covering other use cases, but that doesn’t disqualify CEP; it just means the database vendors overlap with it.
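To make "assess it, act on it and discard it" concrete, here is a small Python sketch of a predefined short-term pattern: a burst of errors from one component within a time window. It is not how any particular CEP product works; the class name, the threshold and the window length are assumptions for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Fire when one component reports `threshold` errors within `window_s` seconds."""

    def __init__(self, threshold=3, window_s=10.0):
        self.threshold = threshold
        self.window_s = window_s
        # component -> timestamps of its recent errors; nothing else is kept
        self._recent = defaultdict(deque)

    def on_event(self, component, timestamp, is_error):
        """Process one event; return True if the burst pattern has matched."""
        if not is_error:
            return False  # assessed and discarded straight away
        times = self._recent[component]
        times.append(timestamp)
        # Discard anything that has fallen out of the window
        while times and timestamp - times[0] > self.window_s:
            times.popleft()
        return len(times) >= self.threshold
```

The only state held is the handful of timestamps inside the window, which is why this style of processing copes with high arrival rates without storing the data.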
If you need to look at historical data before you can decide whether this is an exceptional event, for example to compare with equivalent system behavior an hour ago, a day ago or a week ago, then you need a database, and CEP alone won’t cut it. The decision between in-memory and NoSQL might depend simply on data volumes. Right now there are still limitations on in-memory database sizes, so if it’s really big data, you’ll be going NoSQL. If it’s not-so-big data (a few terabytes, say) you can do that in memory, although cost might enter the equation: a disk-based but memory-oriented NoSQL database might make just as good use of the extra memory spend anyway.
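By contrast, a check against history needs the past readings to be stored somewhere. The Python sketch below uses a plain dictionary as a stand-in for the database; the `History` class, the exact-timestamp lookup and the 25% tolerance are simplifying assumptions, not a recommendation.

```python
from statistics import mean

HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY


class History:
    """Stand-in for a real data store: keeps every reading per component."""

    def __init__(self):
        self._readings = {}  # component -> {timestamp: value}

    def record(self, component, timestamp, value):
        self._readings.setdefault(component, {})[timestamp] = value

    def lookup(self, component, timestamp):
        return self._readings.get(component, {}).get(timestamp)


def is_exceptional(history, component, timestamp, value, tolerance=0.25):
    """Compare a reading with the same component an hour, a day and a week ago."""
    past = [history.lookup(component, timestamp - lag) for lag in (HOUR, DAY, WEEK)]
    past = [v for v in past if v is not None]
    if not past:
        return False  # no baseline available, so we cannot call it exceptional
    baseline = mean(past)
    return abs(value - baseline) > tolerance * abs(baseline)
```

The important difference from the streaming case is that `History` can never discard data until the longest lookback (here a week) has passed, so storage grows with the arrival rate.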
I started out thinking that a high arrival rate was going to be a differentiator, but it seems to be less pivotal than I thought. NoSQL databases (like Cassandra) were designed for it, as was CEP, and in-memory databases don’t seem to struggle with high ingest rates. However, sustaining high arrival rates in a data store necessarily leads to high volumes of stored data, and that starts to stress in-memory databases.
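A back-of-the-envelope calculation shows how quickly a sustained arrival rate turns into stored volume. The figures used here (10,000 events per second, 200 bytes per event) are purely illustrative assumptions, and the sketch ignores compression, replication and indexing overhead.

```python
TB = 10**12  # decimal terabyte, in bytes


def stored_bytes(events_per_second, bytes_per_event, retention_days):
    """Raw bytes accumulated at a steady arrival rate over the retention period."""
    seconds = retention_days * 24 * 60 * 60
    return events_per_second * bytes_per_event * seconds


# Illustrative assumptions only: 10,000 events/s at 200 bytes each
month = stored_bytes(10_000, 200, 30)   # 5,184,000,000,000 bytes
year = stored_bytes(10_000, 200, 365)   # 63,072,000,000,000 bytes
print(f"30 days:  {month / TB:.1f} TB")
print(f"365 days: {year / TB:.1f} TB")
```

Under these assumptions a month of data is about 5 TB, still in "a few terabytes" territory, while a year is about 63 TB, which is where the choice of store starts to matter.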
Conclusion? There are lots of emerging real-time use cases across industries, and neither conventional RDBMS nor Hadoop MapReduce meets them satisfactorily. So we appear to be entering the next phase of Big Data – Fast Data, anyone?