Streams of Fun - in Parallel and Real Time
As a kid, I used to love playing with puddles of water during rainy days (Unlike the millennials, we Gen-X did live closer to nature, didn’t we?). My favourite game was operating the “marbles cleaning factory.” The factory would get water from the puddle via my hand-constructed stream. It was quite easy – I would just use my hands to plough through the sand and a long stream would be ready. I would imagine my factory doing so well and a lot of imaginary customers would queue up. So I would have to open a second factory – and here is the part I used to love always. Then I would just fork the stream by ploughing another line and in a minute, the second factory was up and operating with water from the forked stream! My anxious imaginary customers would now clap and cheer so loudly!
In Knowesis, we built a real time big data solution using IBM Infosphere Streams for Telecommunication Service Providers. Our engine, called Sift, takes in streams of raw data from telecom network elements and compute a lot of behavioural indicators (e.g. the number of calls made over a month, average volume of data downloaded over a week etc).
All of these need to be computed in real time – as soon as the raw data is emitted out. On top of this, the system also throws out event triggers when a certain criteria is met (e.g. if the daily call volume exceeds the maximum number of calls so far). Such triggers are valuable for the telecom operators if they are available in real time. The traditional approaches to solving this scenario involve data flowing into data warehouses and massive amount of crunching being done on them. This mostly would take an elapsed duration of about 3 days, which was highly ineffective and a common gripe of the telecom marketing managers.
So we built this indicator computing engine pretty quickly. There is no magic here. Step one - Analyse and collect all the indicators that you want to build. Step 2 - Build them using java, mathematics (statistics) and software engineering best practices. After that, it takes a very little effort to convert your java components into Streams operators (more on it on a different post) and wire them all up using SPL (Streams Processing Language). We pumped calls from one end, and our engine computes indicators and throws events on the other end. It worked so predictably well. Until the time we wanted to handle more inputs – a huge amount of them!
Now, I am back to my childhood “marble cleaning factory” situation again. There is more demand! But the solution is pretty simple isn’t it? Just fork another stream and start a parallel factory to handle the additional demand.
This is when the InfoSphere Streams programming paradigm thrilled me. It took only a few minutes to change the codes to create another parallel stream and a few more minutes to declaratively tell the Streams RunTime where to run the newly forked factory.
Things worked predictably well again. To make the number of parallel factories dynamic, we used the Streams multimode capability to generate the codes based on an externally configured variable. The last step is a little sophisticated. But until that point in time, it was as easy as me ploughing through the sand to create a new stream.
In my long career as an IT Architect, I only remembered travelling through rogue waves. Technology was never ready for the critically big solutions that my customers wanted. I have always lived on the edge that way. With that background, you can imagine how thrilled I am to be able to create true real-time, parallel processing applications with such rapidity. More than anything, it gives me the same amount of fun as my “marbles cleaning factory” days. That you cannot replace with anything, can you? :)