How to build an all-purpose big data engine with Hadoop and Spark
Interview with Rohan Vaidyanathan and Niru Anisetti
Apache Hadoop and Apache Spark are complex technologies, and many organizations misunderstand how to use the two together. Investing in both technologies enables a broad set of big data analytics and application development use cases.
Niru Anisetti is program director in the product management team for Spark offerings and next-generation big data platforms at IBM, and Rohan Vaidyanathan is senior offering manager at IBM and a leading light on the IBM Cloud Data Services team. Anisetti is an award-winning product specialist who has a background in software engineering and product development with experience in both Hadoop and Spark. And while working in the big data space during the past three years, Vaidyanathan has witnessed an explosion in the number and variety of organizations adopting big data technologies such as Hadoop and Spark. He’s also observed the recent trend to leverage data services in the cloud. We recently spoke to Anisetti and Vaidyanathan to explore some of the common misconceptions about Hadoop and Spark and help us understand the unique strengths of using the two architectures together.
When you speak to clients about big data technologies, do you get the impression they have a good understanding of the pros and cons of Hadoop and Spark?
Rohan Vaidyanathan: Many companies that have already invested in Hadoop, Spark or both absolutely know what they are doing. But a large group of organizations also exists that is on the edge of adopting big data, and a few key misconceptions are out there that often cause problems as these organizations start trying to define their new big data architecture.
For example, many articles online talk about Spark as a successor to Hadoop, or even as a replacement for Hadoop. Spark can execute a job 10 to 100 times faster than Hadoop, or to be precise, MapReduce; but there’s much more to Spark than just the runtime aspects of a cluster.
Niru Anisetti: My short response is, “no.” Many companies that we have started engaging with around Spark are just exploring analytics with their sample data. The best guess estimate at IBM is that around 90 percent of organizations are challenged to find analytics solutions with good return on investment (ROI) and are still in the planning stage. We need to do better at dispelling some of the myths and misunderstandings, if we’re going to help clients move forward on their big data journeys.
Then let’s try to clear up some of that confusion. How can we explain the real differences between Hadoop and Spark?
Anisetti: I like to use the analogy of a car. Spark is like a high-performance engine; it powers the work that you want to do with your data, and it can be bolted to all kinds of different chassis: data platforms such as object storage, IBM Cloudant or Hadoop. Hadoop can provide one of the possible storage layers that fuel the Spark engine with data.
Vaidyanathan: The key point is that Spark has no notion of storage within it. If you’re a data scientist using Spark in a Jupyter notebook to run some ad hoc analysis on a small data set residing in an object store, that’s fine. But what happens when you discover some exciting new way of gaining insight into that data, and you want to operationalize it on a massive scale with huge data sets and thousands of users? You need a data platform to ingest the data, store it, manage it and keep it secure. And you also need to add a robust framework for data governance to help you maintain quality and provide traceability.
Spark doesn’t provide those broader functionalities; it’s purely an engine for high-speed distributed data processing. Of course, Spark is an incredibly exciting technology and has all sorts of cool use cases, from stream processing to machine learning to real-time analytics, which is why we’re using it as an engine for more than 25 IBM products. But most real-world use cases also require additional capabilities such as governance, which means you need more than just Spark on its own.
And that’s where Hadoop comes in?
Vaidyanathan: Exactly. Hadoop is a broad ecosystem of open source components that aims to address almost every aspect of working with big data. That ecosystem includes data processing engines (most famously MapReduce, and now Spark), but it also includes projects such as Apache Ranger for security.
Another common component that you see in almost all big data architectures, including those that leverage Spark, is Apache Hadoop Distributed File System (HDFS). As a scalable, flexible file storage platform for big data clusters, HDFS remains the go-to option when you have large quantities of file-based data that you want to analyze using Spark.
In addition, with the IBM distribution of Hadoop, IBM BigInsights, we can also provide enterprise-grade integration and data governance tools with IBM BigInsights BigIntegrate and IBM BigInsights BigQuality. These solutions bring the same powerful extract, transform and load (ETL) and quality management capabilities to the big data arena that we’ve tested extensively in hundreds of more traditional database and data warehouse environments over many years.
When you are using Spark, the results will only be as good as the input, so it’s vital to have good-quality, clean and accurate data before you begin your analysis. Combining tools such as BigIntegrate and BigQuality with Spark means you not only get answers quickly, but you can also be confident that those answers are correct.
Can Spark be seen as a successor to MapReduce as a big data analytics engine, rather than as a successor to Hadoop as a whole?
Anisetti: That’s true to an extent; many use cases exist in which companies have traditionally used MapReduce and in which Spark really shines in comparison. For example, iterative procedures typically run many times faster on Spark than they did on MapReduce. MapReduce needs to read from and write to a file with every iteration, whereas Spark can keep the data in memory, run all the iterations and write the results to disk only when it has finished.
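The point about iterative workloads can be illustrated with a toy sketch in plain Python. It uses neither the Spark nor the Hadoop API; the function names (`step`, `run_with_disk_round_trips`, `run_in_memory`) and the JSON state file are hypothetical, chosen only to contrast a loop that reloads and rewrites its intermediate state on every pass with one that keeps the state in memory and writes once at the end.

```python
import json
import os
import tempfile


def step(values):
    """One pass of a toy iterative job: move each value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def run_with_disk_round_trips(values, iterations, workdir):
    """Reload the state from a file before each pass and write it back afterwards."""
    path = os.path.join(workdir, "state.json")
    with open(path, "w") as f:
        json.dump(values, f)
    file_ops = 1  # initial write of the input
    for _ in range(iterations):
        with open(path) as f:
            current = json.load(f)
        with open(path, "w") as f:
            json.dump(step(current), f)
        file_ops += 2  # one read and one write per iteration
    with open(path) as f:
        result = json.load(f)
    file_ops += 1  # final read of the result
    return result, file_ops


def run_in_memory(values, iterations, workdir):
    """Keep the state in memory across all passes and write only the final result."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    with open(os.path.join(workdir, "result.json"), "w") as f:
        json.dump(current, f)
    return current, 1  # a single write at the end


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0]
    with tempfile.TemporaryDirectory() as workdir:
        disk_result, disk_ops = run_with_disk_round_trips(data, 5, workdir)
        mem_result, mem_ops = run_in_memory(data, 5, workdir)
    print("file operations with round trips:", disk_ops)
    print("file operations in memory:", mem_ops)
```

Both functions compute the same values; only the number of file operations differs, and that gap grows with the number of iterations. On a real cluster the files would live in a distributed file system, so each round trip costs far more than it does in this local sketch.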
Besides performance, two other key reasons why Spark is often preferred to MapReduce are usability and portability. MapReduce jobs need to be coded in Java, which is a relatively low-level language, and they tend to be very sensitive to the specific configuration of the Hadoop cluster for which they were written. A job that works on one cluster may not run on another.
By contrast, Spark not only supports Java, but it also supports Python, R and Scala, which are much easier languages to learn and much more widely used by data scientists. Moreover, a Spark job can run in any Spark environment, from a laptop to the largest cluster. So it gives users much more flexibility and supports the classic data science lifecycle of moving from small-scale exploration to large-scale operationalization.
Does this mean the end for MapReduce? Or are there use cases in which it still has advantages over Spark?
Vaidyanathan: Spark does emulate MapReduce with equal or better performance, and, as we discussed earlier, working with Spark is certainly easier. As a result, many people are choosing Spark for new workloads.
But it is not the end of MapReduce yet. Many applications written with MapReduce and its associated infrastructure are still in use today, and moving them to Spark would require changes to that infrastructure as well. A new application may be written in MapReduce or Spark depending on what those underlying dependencies are.
The broader Hadoop ecosystem includes components such as Apache Cassandra, Apache HBase, Apache Hive and so on, many of which have started to adopt Spark. Until the Spark versions of these projects stabilize and become pervasive, we can expect to continue to see the use of MapReduce.
To sum up, how should companies think about the relationship between Hadoop and Spark when they are planning their big data architecture?
Vaidyanathan: If they’re serious about building a big data architecture, they almost certainly need both Hadoop and Spark. The choice doesn’t come down to deciding between the two; it’s about using both technologies to address the appropriate parts of the problem they are trying to solve.
Anisetti: If you need a high-speed data processing engine and you need to ingest, store and manage genuinely large and growing data sets, the combination of Hadoop and Spark is incredibly powerful. With Spark as an engine and the Hadoop ecosystem providing storage, governance and other important auxiliary capabilities, the door is open to solving a huge range of today’s big data challenges.