
Spark: The operating system for big data analytics

Lead Spark Engineer, Platfora

Apache Spark has enjoyed tremendous momentum over the past two years. A growing number of organizations have adopted it as a flexible, high-performance addition to their big data environments. Over that time, perceptions of Spark have evolved: first it was an alternative to MapReduce, then a big data processing engine and development environment, and now it is increasingly described as an analytics operating system. Each perception is accurate, and the progression from one stage to the next shows how quickly Spark has taken on a central role within many organizations' big data strategies.

Recent Spark development has certainly reflected this trend, but developments across the wider big data ecosystem are also likely to give Spark a more prominent role. For example, version 2.0 of MapReduce positions that technology as essentially just one more application that runs on Yet Another Resource Negotiator (YARN). Apache Hadoop in the narrow sense of the Hadoop Distributed File System (HDFS) plus MapReduce is no longer the central technology. Instead, Hadoop now more accurately describes an entire ecosystem of technologies, with Spark emerging as the driving technology for the next generation of analytics within that ecosystem.

Empowering developers, enabling new applications 

Whatever analytics challenge a business faces, it typically finds it can meet that challenge more easily with Spark than without it. The reasons arise directly from the technology's fundamental architecture. Spark provides a well-defined core on which developers can build new layers of abstraction. That core is built around Resilient Distributed Datasets (RDDs), an abstraction over distributed memory that enables programmers to perform fault-tolerant, in-memory computations across many machines.
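
To make that concrete, here is a minimal sketch of working with RDDs from the Spark shell, assuming a SparkContext named sc (as the shell provides) and an illustrative log file path on HDFS:

    // Build an RDD from a text file in HDFS (the path is illustrative).
    val lines = sc.textFile("hdfs:///data/events.log")

    // Transformations are lazy; nothing runs until an action is called.
    val errors = lines.filter(line => line.contains("ERROR"))

    // Cache the filtered RDD in cluster memory for repeated, fault-tolerant reuse.
    errors.cache()

    // Actions trigger the distributed computation across the cluster.
    val errorCount = errors.count()
    val sample = errors.take(10)

If a node fails, Spark recomputes only the lost partitions from the RDD's lineage, which is what makes the in-memory model fault tolerant.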

In addition to providing a simple and straightforward way of reasoning about data sets, Spark can be easily embedded in specialized applications. For example, when various development and research teams first set out to build the machine learning, graph and SQL processing layers on top of Spark, they didn’t have to do it 100 percent from scratch. Spark’s abstraction layer was—and remains—flexible and expressive enough to enable all the different capabilities that the technology supports today. 

Contrast that capability with the raw MapReduce framework, which is not as conducive to different workloads. While solutions for graph, SQL, streaming and other workloads that Spark supports can be—and have been—developed on MapReduce, such development projects require significantly more work up front. And the resulting solutions are often more isolated, less conducive to interoperation and less performant than the equivalent solutions in Spark.

For example, while Apache Hive and Apache Mahout were both built on MapReduce and are effective individual applications, they remain completely separate projects, and moving data between the two can be painful and time-consuming. In contrast, Spark developers can use a single model for SQL, machine learning, graph and stream processing, or any other combination of modern big data workloads.
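
As a rough sketch of that single model, the snippet below queries a hypothetical customer table with Spark SQL and feeds the resulting DataFrame straight into an MLlib estimator. It assumes Spark 1.x-style APIs and an sqlContext such as the shell provides; the column names and the choice of k-means clustering are purely illustrative:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.clustering.KMeans

    // Query data with Spark SQL; the result is an ordinary DataFrame.
    val customers = sqlContext.read.parquet("hdfs:///warehouse/customers.parquet")
    customers.registerTempTable("customers")
    val features = sqlContext.sql(
      "SELECT age, total_spend, visits FROM customers WHERE active = true")

    // Hand the same DataFrame to MLlib with no export/import step in between.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "total_spend", "visits"))
      .setOutputCol("features")
    val model = new KMeans().setK(5).fit(assembler.transform(features))

The SQL result and the machine learning input are the same object, so there is no data movement between separate systems.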

Application developers who build on Spark typically end up with more concise code than they would without it. That conciseness reflects a basic software design principle: writing less code without compromising functionality is always beneficial. The model is also highly conducive to application-specific extensions, because components and programs can be composed without developers needing to worry much about the plumbing between them. Because Spark enables streamlined development now and reduces the cost of maintenance later, teams can spend more time on business and application requirements rather than on connecting and maintaining a large portfolio of disparate technologies.

Applying Spark to new analytics use cases

Spark already has a flexible and expressive Data Source application programming interface (API) that provides a simple abstraction for connecting to existing data management technologies. When it's released, Spark 2.0 is expected to extend these capabilities by enabling streams of structured data to be ingested from other data management systems. With these extensions, Spark enables applications to blend historical and real-time analysis, opening up a wide new set of use cases that extend well beyond the typical big data analytics scenarios of even a year or two ago.
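
For a sense of how the Data Source API looks in practice, the sketch below reads from a relational database and from JSON files through the same DataFrameReader interface; connection details and paths are illustrative, and third-party connectors (for example for Cassandra or MongoDB) plug into the same pattern when their packages are on the classpath:

    // Read from a relational database through the built-in JDBC data source
    // (URL, table and driver details are illustrative).
    val orders = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")
      .option("dbtable", "orders")
      .load()

    // Read JSON files through the built-in JSON data source.
    val events = sqlContext.read.format("json").load("hdfs:///raw/events/")

    // Both results are DataFrames, so they can be joined, filtered and
    // aggregated together regardless of where the data originally lived.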

For example, consider a large organization in which one business group uses a legacy relational database management system (RDBMS) for transaction analysis, another uses MongoDB for customer analytics and a third uses Apache Kafka for real-time fraud detection. To perform any holistic analysis across all three groups, the data needs to be collected, cleaned, curated and made consumable in an efficient and timely way. Legacy tooling can handle the RDBMS, and some newer systems may support both the RDBMS and MongoDB, but it is unlikely that any of them can match the volume and latency guarantees that a specialized stream processing system provides.

This scenario is where Spark really shines. Today, the legacy RDBMS and MongoDB systems can be accessed through Spark's existing Data Source abstractions, while the Kafka stream can be consumed with Spark Streaming. Although Spark SQL and Spark Streaming are still separate modules, the amount of code required to use both in a holistic analysis is certainly manageable.
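
A condensed sketch of that combination, assuming shell-style sc and sqlContext, the spark-streaming-kafka package for the Kafka 0.8 direct stream, and illustrative connection details and column names (MongoDB would be read through its own connector in the same DataFrameReader style):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder

    // Batch side: the legacy RDBMS exposed through the JDBC data source.
    val transactions = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/payments")
      .option("dbtable", "transactions")
      .load()

    // Streaming side: the Kafka topic read with the direct stream API.
    val ssc = new StreamingContext(sc, Seconds(10))
    val swipes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "kafka-host:9092"), Set("card-swipes"))

    // Each micro-batch of JSON messages is joined against the batch data.
    swipes.foreachRDD { rdd =>
      val events = sqlContext.read.json(rdd.map(_._2))
      events.join(transactions, "account_id").show()
    }

    ssc.start()
    ssc.awaitTermination()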

Starting with Spark 2.0, however, all three of these sources are expected to be accessible through the DataFrame and Dataset abstractions. By putting all the technologies behind a single API, Spark can reduce the complexity of implementing and maintaining major analytics projects.
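
The sketch below shows roughly how such a job could look under the unified 2.0-style API, assuming a SparkSession and a Kafka source package for Structured Streaming (which may ship separately from the core release); the names, options and toy aggregation are illustrative, not a definitive implementation:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("unified-analysis").getOrCreate()

    // Batch source: the legacy RDBMS through the JDBC data source.
    val transactions = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/payments")
      .option("dbtable", "transactions")
      .load()

    // Streaming source: the Kafka topic as a streaming DataFrame.
    val swipes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-host:9092")
      .option("subscribe", "card-swipes")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload")

    // Both sides are DataFrames, so the same operations and optimizer apply.
    // Here the stream is simply counted; a real job would parse the payload
    // into columns and join it against the static transactions table.
    val counts = swipes.groupBy().count()
    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()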

Using simplicity as a differentiator 

Of course, MapReduce can also ingest multistructured and real-time data from RDBMS, NoSQL and streaming systems, but doing so carries a lot of baggage. The core MapReduce abstractions are not amenable to such a diverse set of processing paradigms, and the framework itself is not optimized for iterative or near-real-time workloads at scale. As a result, any real-world project is likely to suffer from a long development and testing cycle; performance, reliability and latency concerns; and a high cost of maintenance.

The complexity described previously isn't isolated to teams; it's also a challenge for the individual developer who wants to pull things together quickly and iterate on a data set for an analytics prototype or ad hoc analysis. With Spark, these lightweight projects are easy: download Spark from the website, start the shell and explore your data. You also don't have to make many changes when you operationalize the analysis. Even though the scale and deployment change in production, the code can stay exactly the same, so a solution developed on a laptop can be deployed directly on a 1,000-node Hadoop cluster.
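
For example, the same spark-shell snippet can run on a laptop or against a YARN cluster simply by changing how the shell is launched; the data path and column names below are illustrative:

    // Launch locally with ./bin/spark-shell, or against a cluster with
    // ./bin/spark-shell --master yarn; the code below is identical either way.
    import org.apache.spark.sql.functions.desc

    val clicks = sqlContext.read.json("/data/clickstream/")
    val topPages = clicks.groupBy("page").count().orderBy(desc("count"))
    topPages.show(20)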

Spark’s rich abstractions ensure that, whatever the size of the data, the transformations will remain the same. Extract, transform and load (ETL) processing and data preparation into Hadoop are the same. Only the data volume changes.
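
A small illustrative ETL sketch along those lines, with hypothetical paths and fields, that is identical whether it processes a laptop-sized sample or the full production volume:

    import org.apache.spark.sql.functions.{col, to_date}

    // Raw, semi-structured input lands in HDFS (paths and fields are illustrative).
    val raw = sqlContext.read.json("hdfs:///landing/orders/")

    // The same cleanup logic runs unchanged regardless of data volume.
    val cleaned = raw
      .filter(col("order_id").isNotNull)
      .withColumn("order_date", to_date(col("created_at")))

    // Prepared data is written back into Hadoop as Parquet, partitioned by date.
    cleaned.write.partitionBy("order_date").parquet("hdfs:///warehouse/orders/")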

Making a safe bet

Often, a new initiative is introduced on MapReduce with a big push, only for the team to find it isn't quite right. A few months later no one is using it, and you have to start back at square one. With Spark, you have the flexibility to keep building on what you have already implemented, plus the support of a growing community of Spark users and developers.

Using Spark is a safe bet. It’s the least risky decision you can make. When selecting products to integrate into their business, organizations are increasingly looking for those that play nicely with Spark. In particular, they need solutions such as Platfora, which open up the power of Spark with guardrails and operational efficiencies, and enable organizations to focus on business problems, not infrastructure maintenance.

Spark is quickly becoming one of the central analytics technologies for many large and diverse organizations. It makes it straightforward to bring data into, and move data out of, whatever domain you work in. Real technical advances around iterative processing, combined with an approachable environment and tool set for development, make all the difference. That combination sets Spark apart as not just an alternative to MapReduce, but a true operating system for big data analytics.

Take your Spark journey to the next step. IBM invites you to a complimentary three-month trial of IBM Analytics for Apache Spark and IBM Cloudant. Use Spark in the cloud to conduct fast in-memory analytics on Cloudant JavaScript Object Notation (JSON) data. Sign up today, and also receive complimentary software-as-a-service (SaaS)–based Startup Advisory Services to help you accelerate your time to results.