Analyzing time series data with stream processing and machine learning
Real-time detection and classification of signals or events present in time series data is a fairly common need. Stereotypical examples include identifying high-risk conditions in ICU data streams or classifying signals present in acoustic data from diagnostic or monitoring sensors. Using a combination of stream processing and machine learning is an agile and highly capable approach. It can effectively scale to large, fast data streams and adapt to evolving problem spaces.
Extracting useful information from time series data has become harder as data volumes and speeds have increased, even as the opportunities to use that information have grown. This article describes a highly adaptable and scalable data-driven method for extracting relevant information. With very little modification, the same design pattern can be applied to a wide variety of application domains. Examples include classification of signals present in acoustic data, anomaly detection in maintenance telemetry, and real-time recognition of important conditions in streaming medical data.
In situations where one or more conditions are already known, this approach can be applied to specifically detect those conditions. However, it can also be used for applications where the patterns are evolving over time (concept drift) or there is little or no advance knowledge of what signals or patterns might be present or significant. In this case, the approach can be used to both identify different categories of behavior and classify the current state into one of them. Yet another way it can be applied is to learn what is “normal” even if that tends to change over time, and then alert when something happens that is unusual (anomaly detection).
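The last case, learning what is "normal" while allowing it to drift, can be illustrated with a very small sketch. The detector below is not taken from any particular product; it is one simple option, an exponentially weighted running mean and variance over a single scalar value. The class name and the `alpha`, `threshold` and `warmup` parameters are hypothetical choices made for illustration.

```python
import math


class DriftingBaseline:
    """Tracks an exponentially weighted mean and variance of a scalar stream.

    Illustrative only: alpha, threshold and warmup are assumed values,
    not recommendations.
    """

    def __init__(self, alpha=0.05, threshold=4.0, warmup=20):
        self.alpha = alpha          # how quickly the baseline follows drift
        self.threshold = threshold  # alert when deviation exceeds this many std devs
        self.warmup = warmup        # samples to observe before alerting at all
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Return True if x looks unusual; otherwise fold x into the baseline."""
        self.count += 1
        if self.count == 1:
            self.mean = x
            return False
        deviation = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = (
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        if not is_anomaly:
            # Adapt only on normal samples so an outlier does not shift the baseline.
            self.mean += self.alpha * deviation
            self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly


# A slowly rising level with a small oscillation, followed by one large spike.
detector = DriftingBaseline()
stream = [10 + 0.01 * i + 0.5 * math.sin(i) for i in range(200)] + [30.0]
flags = [detector.update(x) for x in stream]
```

In this sketch the slow rise is absorbed into the baseline and only the final spike is flagged. A real deployment would more likely score whole feature vectors than a single value, but the adapt-then-compare idea is the same.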
Accomplishing all the previously mentioned goals requires a data-driven approach that combines stream processing and machine learning. Stream processing is used to generate feature vectors (fingerprints) representing the current characteristics of the signal in a form that can be used by machine learning technologies. The machine learning part can use the feature vectors to build a model of the system behavior as well as score future input against that model. In certain cases, the model can also be used to predict future state.
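Before a feature vector can be computed, a continuous stream usually has to be cut into short, often overlapping, windows. The article does not specify how this is done, so the following is only a minimal sketch of one common way to do it in plain Python. The function name `sliding_windows` and its `size` and `step` parameters are hypothetical.

```python
from collections import deque


def sliding_windows(samples, size, step):
    """Yield windows of `size` consecutive samples, advancing `step` samples each time.

    Works on any iterable, including an unbounded generator, because only the
    most recent `size` samples are kept in memory.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    window = deque(maxlen=size)  # old samples fall off the left automatically
    countdown = size             # samples still needed before the next window
    for sample in samples:
        window.append(sample)
        countdown -= 1
        if countdown == 0:
            yield list(window)
            countdown = step


# Windows of 4 samples with 50 percent overlap.
windows = list(sliding_windows(range(8), size=4, step=2))
```

Each window produced this way would then be handed to the feature extraction step, and the resulting vector passed on to the model for training or scoring.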
One of the characteristics of this approach is that the techniques used for both the stream processing portion and the machine learning operation are not domain-specific. In other words, common general-purpose algorithms can be used to automatically work with different kinds of data. Generating the feature vectors involves combining standard numerical processing techniques such as the Fourier transform (FT), the discrete wavelet transform (DWT) and cepstrum analysis, along with various numerical or statistical values such as normalized root mean square (RMS). Although it is possible to use a single algorithm to generate a shorter feature vector, selectivity (the ability to discriminate between two similar signals) is greatly improved by joining multiple algorithms to produce a longer feature vector. Historically, this was problematic because of the computational load and volume of data, but modern tools such as IBM InfoSphere Streams can effectively scale to extremely large and fast data even for computationally intensive operations such as those described above. Similarly, the availability of cloud-hosted machine learning tools such as the open source Spark and the H2O-based Sparkling Water package running on IBM BigInsights now makes it possible to address much larger data sets than ever.
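To make the idea of joining several algorithms into one longer feature vector concrete, here is a rough sketch that concatenates a normalized RMS value, the share of spectral energy in a few frequency bands, and the first few cepstral coefficients. It uses a naive discrete Fourier transform from the standard library so that it runs without extra packages; a production system would use an optimized FFT and would likely add wavelet features as well. All function names, the choice to normalize RMS by the window's peak amplitude, and the band and coefficient counts are assumptions made for illustration.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform; adequate for short example windows."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def normalized_rms(window):
    """RMS divided by peak amplitude (one possible normalization, assumed here)."""
    peak = max(abs(x) for x in window) or 1.0
    return math.sqrt(sum(x * x for x in window) / len(window)) / peak


def band_energies(window, bands):
    """Fraction of spectral energy in each of `bands` equal-width frequency bands."""
    half = len(window) // 2
    power = [abs(c) ** 2 for c in dft(window)[1:half + 1]]  # skip DC, keep up to Nyquist
    total = sum(power) or 1.0
    width = len(power) / bands
    return [
        sum(power[int(b * width):int((b + 1) * width)]) / total
        for b in range(bands)
    ]


def cepstrum(window, count):
    """First `count` real-cepstrum coefficients (inverse DFT of the log magnitude)."""
    log_magnitude = [math.log(abs(c) + 1e-12) for c in dft(window)]
    n = len(window)
    # The log magnitude of a real signal is real and symmetric, so the forward
    # DFT divided by n gives the same real values as the inverse DFT.
    return [c.real / n for c in dft(log_magnitude)[:count]]


def feature_vector(window, bands=4, cepstral=4):
    """Concatenate the individual representations into one longer vector."""
    return [normalized_rms(window)] + band_energies(window, bands) + cepstrum(window, cepstral)


# Two test tones of equal amplitude but different frequency.
low_tone = [math.sin(2 * math.pi * 2 * t / 32) for t in range(32)]
high_tone = [math.sin(2 * math.pi * 14 * t / 32) for t in range(32)]
low_features = feature_vector(low_tone)
high_features = feature_vector(high_tone)
```

The two tones have the same normalized RMS, so that value alone cannot tell them apart. The band energies separate them clearly, which is a small-scale example of the improved selectivity that comes from combining representations.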
Different types of machine learning
In situations where the pattern or signal you are looking for is already known, some form of supervised machine learning algorithm can be applied to the feature vectors. A wide range of supervised algorithms may be applicable, including Support Vector Machines (SVM), Random Forest, Naïve Bayes and Neural Networks. Semi-supervised variations of these may also apply if only a relatively small amount of labeled training data is available. Unsupervised techniques are appropriate when there is no preexisting set of labeled data on which to train, or when the result is not initially known. Examples of unsupervised algorithms include K-means clustering (if the number of categories is known, or X-means if it isn’t), Hidden Markov models and Principal Component Analysis. Recently, Deep Learning has become very popular; it has the added benefit of being able to work in both unsupervised and supervised modes. Reinforcement learning is yet another kind of machine learning but is more suited to fully autonomous applications.
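As an illustration of the unsupervised case, the sketch below runs a bare-bones K-means (Lloyd's algorithm) over a handful of feature vectors and then assigns a new vector to the nearest learned category. It is written in plain Python for readability and is not how Spark or H2O implement clustering. The function names and the deterministic farthest-point initialization are assumptions chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    # Deterministic farthest-point initialization (an illustrative choice;
    # libraries typically use randomized schemes such as k-means++).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the result has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def classify(vector, centroids):
    """Score a new feature vector by returning the index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: math.dist(vector, centroids[j]))


# Six tiny two-dimensional "feature vectors" forming two obvious groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(vectors, k=2)
current_state = classify((9, 9), centroids)
```

The same two steps, building a model from feature vectors and then scoring new vectors against it, apply when a supervised algorithm is used instead; only the training data (labeled rather than unlabeled) and the model type change.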
Regardless of which type of machine learning is relevant to a particular use case, the feature vectors remain the same and would not necessarily have to change between different use-case domains.
Some of the more significant benefits of this approach are scalability, capability and adaptability. InfoSphere Streams can easily handle very large numbers of high data rate signals to generate the feature vectors, and cloud-based machine learning can handle the large volumes of data produced. Selectivity is enhanced because long feature vectors containing multiple representations and many calculated attributes can be used. And the design pattern is extremely adaptable to many different domains. That is, the same code can be used with very little modification for practically any type of time series data.
Learn more about IBM InfoSphere Streams and see how it is different from other stream processing platforms and ideally suited for enterprise-class, low-latency analytics. You can also check out the links above to get more information about Hadoop-based machine learning.