Practical applications for which Spark sparkles brightly
It is important to recognize when statistical modeling and exploration requirements align with the core capabilities of Apache Spark. Several sweet-spot Spark features are essential to many low-latency analytics applications across diverse industries and lines of business:
Spark’s core runtime engine is well suited to any analysis that requires a speed boost from parallelized execution entirely in memory, without the need to write out result sets after each pass through the data. We live in a world that is inexorably pushing toward larger pools of shared memory than ever across distributed clouds, streams and other platforms. Growing adoption of in-memory analytics and transactional technologies, such as IBM BLU Acceleration, shows that there is no turning back to disk-centric computing platforms.
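The multi-pass, in-memory pattern described above can be sketched in plain Python (this is an illustration of the concept, not Spark code; the `load_once` function and the statistics computed are hypothetical stand-ins). The point is that the data set is materialized in RAM once and then reused across several analysis passes, with no intermediate result sets written to disk between passes:

```python
# Pure-Python sketch of the in-memory, multi-pass pattern that Spark's
# runtime exploits: load a data set once, keep it cached in RAM, and run
# several analysis passes over it without writing intermediate results out.

def load_once():
    # Stand-in for reading a large data set from distributed storage.
    return [float(x) for x in range(1, 101)]

cached = load_once()  # analogous to caching a data set in cluster memory

# Pass 1: mean, computed over the in-memory data
mean = sum(cached) / len(cached)
# Pass 2: variance, reusing the same cached data rather than re-reading it
variance = sum((x - mean) ** 2 for x in cached) / len(cached)
# Pass 3: count of points beyond two standard deviations
std = variance ** 0.5
outliers = sum(1 for x in cached if abs(x - mean) > 2 * std)

print(mean, variance, outliers)
```

In a disk-centric framework, each of those three passes would typically re-read the input and write its result out; holding the working set in memory is what removes that overhead.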
As the cost of dynamic random access memory continues to decline, in-memory processing is expected to become the predominant architecture for all users, uses and data. Big data increasingly occupies huge pools of virtualized memory that span many cloud-based servers. Throughout the industry, the price of solid-state persistence is expected to continue to decline, making petabyte-memory cloud platforms more feasible and budget-friendly over the coming decade.
And as next-generation processors with thousands of cores expand the addressable memory, we’re likely to see dozens of terabytes of RAM per server as a mainstream technology in that same time frame. In the world of big data, if you’re a data scientist building models in Spark, you may build a starter in-memory platform with 10–25 TB of priority core data. But you also may want the ability to scale it out over time as your investigations call for exploration of additional sources and modeling of increasing numbers of variables and scenarios.
Streaming analytics is the heart of modern low-latency, event-processing applications, of mobility-enabling Internet of Things devices and data, and of other applications that operate on live, in-motion data. It helps ensure a continuous flow of alerts, notifications, events, sensor data, transactions, social media updates, video and audio streams, and other types of information between all endpoints and infrastructure services. The requirement for continuous low-latency acquisition, transmission and analysis of disparate data demands stream-computing infrastructures that can execute massively parallel machine-learning algorithms throughout a distributed fabric.
The Spark streaming engine and application programming interface (API) support capture of data in motion and analysis in real time. Bear in mind, though, that Spark streaming imposes latency and throughput restrictions because of its underlying reliance on minibatch processing rather than true stream computing. For the most latency-sensitive use cases, such as cybersecurity, mediation of telecommunications records or financial trading, a dedicated stream-computing platform such as IBM InfoSphere Streams is advised.
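The minibatch model behind that latency caveat can be sketched in plain Python (again, an illustration of the idea rather than the Spark streaming API; the one-second interval and the event feed are hypothetical). Events are grouped into short fixed intervals and each interval is processed as a unit, so end-to-end latency is at least one batch interval:

```python
# Pure-Python sketch of minibatch stream processing: events arriving on a
# live feed are bucketed into fixed time intervals, and each bucket is
# processed as one batch. A true stream-computing engine would instead
# process each event as it arrives.

BATCH_INTERVAL = 1.0  # seconds; a plausible minibatch duration

# (timestamp_seconds, value) pairs standing in for a live sensor feed
events = [(0.1, 5), (0.4, 3), (1.2, 7), (1.9, 1), (2.5, 4)]

def minibatch(events, interval):
    batches = {}
    for ts, value in events:
        batch_id = int(ts // interval)            # assign event to an interval
        batches.setdefault(batch_id, []).append(value)
    # Emit one aggregate per batch, in time order
    return [sum(vals) for _, vals in sorted(batches.items())]

per_batch = minibatch(events, BATCH_INTERVAL)
print(per_batch)  # one sum per one-second window
```

Nothing in a batch can be emitted until its interval closes, which is exactly why minibatching trades a floor on latency for simpler, high-throughput batch execution.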
For anti-fraud processes, influence analysis, sentiment monitoring, market segmentation, engagement optimization and other applications in which complex patterns must be rapidly identified, graph analytics is fundamental. It involves discovering, mapping and visualizing relationships among individuals, groups, terms, objects or other entities, and is often used to mine behaviorally expressed connections, relationships and affinities among them.
Modelers may incorporate any behavioral information into social graph models, including Facebook status updates, portal clickstreams, geospatial coordinates, smartphone-sourced mobile data, transaction records, interest profiles, call-detail records and usage logs. Depending on the amount of data, the complexity of models and the range of applications, graph analysis can require a considerable amount of processor, I/O bandwidth and other big data platform resources.
Spark’s GraphX graph-analysis engine is well suited to these challenges. It can associate records with vertices and edges in a graph and provides a collection of expressive computational primitives. GraphX distributes graphs as tabular data structures for parallel, fault-tolerant, in-memory execution, typically with far less code than other graph abstractions require. It also enables end users to interactively load, transform and compute on massive graphs.
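The core idea of representing a graph as tables can be sketched in plain Python (a conceptual illustration, not the GraphX API; the vertex names and edges are made up). A vertex table and an edge table together describe the graph, and a graph computation, here vertex degree, becomes an aggregation over the edge rows:

```python
# Pure-Python sketch of a graph held as two tables, the representation
# style GraphX uses to distribute graphs for parallel execution:
# a vertex table (id -> attributes) and an edge table of (src, dst) rows.
from collections import Counter

vertices = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]  # rows of the edge table

# Degree per vertex, expressed as a tabular aggregation over edge rows
degree = Counter()
for src, dst in edges:
    degree[src] += 1
    degree[dst] += 1

named_degrees = {vertices[v]: degree[v] for v in vertices}
print(named_degrees)
```

Because both tables are ordinary row collections, they can be partitioned across a cluster and processed with the same parallel, fault-tolerant machinery used for any other tabular data.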
Machine learning analytics
For many advanced analytics use cases, machine learning is a tool that boosts data scientists’ productivity and can uncover hidden patterns that even the best data scientists may have overlooked. Machine learning models can enable analytics algorithms to learn from fresh feeds of data without constant human intervention and without explicit programming. The approach allows data scientists to train a model on an example data set, and then leverage algorithms that automatically generalize and learn both from that example and from fresh data feeds.
Spark also provides a scalable machine learning library (MLlib) that consists of common learning algorithms, utilities and APIs. Chief among these components are classification, collaborative filtering, dimensionality reduction, clustering and regression, along with underlying optimization primitives. These algorithms grow increasingly effective as the data accessed by Spark tools scales in volume, velocity and variety. Many Spark machine learning applications, ranging from speech and facial recognition to clickstream processing, search-engine optimization (SEO) and recommendation engines, may be described as “sense-making analytics.” These applications involve continuous monitoring of feeds whose semantic patterns, context and importance need to be inferred from the stream.
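The train-then-generalize pattern at the heart of this approach can be sketched in plain Python (an illustration of the workflow, not MLlib code; the training pairs and fresh observations are invented). A model is fitted once on example data, then applied to fresh records without further human intervention:

```python
# Pure-Python sketch of the machine learning workflow described above:
# fit a one-variable linear model on an example data set by ordinary
# least squares, then score fresh, unlabeled data. MLlib provides
# distributed versions of this and far richer algorithms.

train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) examples

# Closed-form ordinary least squares for y = a*x + b
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

fresh = [5.0, 6.0]  # new observations arriving on a feed
predictions = [a * x + b for x in fresh]
print(predictions)
```

In the distributed setting the fitting step runs in parallel across the cached training partitions, which is why these algorithms benefit directly from Spark's in-memory execution as data volumes grow.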
The sense to be auto-extracted from streams, through Spark MLlib, may be any blend of mobile, social, clickstream, sentiment and other statistical patterns. As such, Spark’s machine learning tools can offer a fundamental toolset for building a Smarter Planet that can sense and react to dynamic, distributed patterns such as terrorist activity, disaster response, weather conditions, traffic congestion and energy grids.
Clearly, Spark’s potential applications are wide ranging. Conceivably, many creative blends of social data analytics, mobile data analytics, Internet of Things data analytics and other advanced, leading-edge big data frontiers may benefit from Spark.
However, there’s no fixed cookbook or decision tree to follow in sifting through the field of potential applications. Data scientists simply need to keep their minds open to new Spark possibilities. For example, Spark is geared for high-performance, sensor-driven Internet of Things applications of all types.
Consider that every iPhone contains an accelerometer, the internal sensor that gauges the phone’s movement and tilt and thereby automatically adjusts the orientation in which the interface is rendered to the end user. Accelerometer-sourced data is important both locally on the device, where it supports dynamic functions such as rendering and navigation, and, through the Internet of Things, for services that drive real-time personalization, contextualization, experience management and location intelligence. When parsing these accelerometer-fed Internet of Things applications down to their constituent data and analytics plumbing, you arrive at this set of Spark capabilities:
- iPhone data in in-memory analytics: The stringent real-time nature of in-motion applications demands that most data caching, storage and processing be done at RAM speed—both on the Internet of Things endpoints themselves and in the Internet of Things cloud that serves them.
- iPhone data in streaming analytics: The requirement for continuous low-latency acquisition, transmission and analysis of accelerometer and other device-level machine data demands stream computing infrastructures that can execute massively parallel, machine-learning algorithms throughout a distributed fabric.
- iPhone data in graph analytics: The need for contextualization of accelerometer and other device-level sensor data in geospatial, behavioral, event, interest, experience and social graphs requires a data fabric that can process complex edge-and-node graph data structures.
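The device-level sensing at the root of all three capabilities can be made concrete with a small sketch (hypothetical and simplified; real orientation logic uses all three axes, filtering and hysteresis). It derives a coarse screen orientation from a raw two-axis accelerometer reading, the kind of per-event computation that such a stream would carry out continuously:

```python
# Hypothetical sketch of accelerometer-driven orientation sensing:
# classify a screen orientation from the angle of the gravity vector
# in the screen plane, as reported by the device accelerometer.
import math

def orientation(ax, ay):
    # Angle of the gravity vector in the screen plane, in degrees
    angle = math.degrees(math.atan2(ax, ay))
    if -45 <= angle < 45:
        return "portrait"
    if 45 <= angle < 135:
        return "landscape-left"
    if angle >= 135 or angle < -135:
        return "portrait-upside-down"
    return "landscape-right"

# Gravity mostly along +y: device held upright
print(orientation(0.0, 1.0))
# Gravity along +x: device rotated onto its side
print(orientation(1.0, 0.0))
```

Multiply this one tiny computation by millions of devices emitting readings many times per second, and the case for in-memory, streaming and graph-capable analytics fabrics becomes clear.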
High-performance, sensor-driven Internet of Things applications such as these are low-hanging fruit for Spark to make a real difference in the fabric of a mobile, connected and global society. Not surprisingly, a creative data scientist in France is already hard at work applying Spark in a project to develop smartphone-user, physical-recognition algorithms that leverage Internet of Things data in real time. I’d be surprised if others everywhere are not doing equally brilliant work using this and other data to deliver unprecedented new mobility and Internet of Things applications.
Spark sparks imagination. Please let us know what disruptive innovations it can unleash in your organization. And if you’re hungry for more information on Spark, get started learning Spark today, by registering for Spark Summit in San Francisco, California, June 15–17, 2015.