Spark and Hadoop: Taking big data to the next level

Vice President of Product, Platfora

Spark is a key contributor to the evolution of the big data landscape, but it is important to put its growth into the appropriate context.

As businesses continue to expand their big data footprints, what they can accomplish using Hadoop becomes an increasingly urgent question. In opening up new capabilities within Hadoop, Spark demonstrates how rapidly both business and technology requirements are evolving. The newly released version 1.4 marks a major step toward recognition of Spark as the standard in-memory cluster computing framework. This release includes important advances toward making Spark more accessible to data scientists and analysts alike while also adding speed and stability for enterprise operations. For example, Spark now supports the R statistical programming language via SparkR, reflecting the growth in demand for R in enterprise settings. Quantitative analysts who used to work in SAS or MATLAB are flocking to R because of its openness and community of users. SparkR provides an on-ramp for those who use R to work in the world of big data.

Spark has also expanded its machine learning capabilities, allowing organizations to assemble and execute more complex pipelines than ever. Additionally, with each new release, the MLlib and GraphX libraries ship with a more complete set of built-in algorithms. At the same time, the ability to use popular window functions in Spark SQL is useful for business users wanting to look at data over specified periods, such as through year-over-year analysis.
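To make the year-over-year idea concrete, here is a minimal sketch of a window-function query. It uses Python's built-in sqlite3 module as a stand-in engine so that it runs without a Spark installation; the table name, column names and revenue figures are hypothetical and invented for illustration. The same LAG ... OVER (PARTITION BY ... ORDER BY ...) query shape should carry over to Spark SQL, but that is an assumption to check against the Spark version in use.

```python
import sqlite3

# Hypothetical yearly revenue figures, invented purely for illustration.
rows = [
    ("east", 2013, 100.0),
    ("east", 2014, 120.0),
    ("east", 2015, 150.0),
    ("west", 2013, 200.0),
    ("west", 2014, 180.0),
]

# LAG looks back one row within each region (ordered by year), so the
# difference is the change from the previous year. The first year in each
# region has no previous row, which yields NULL (None in Python).
YOY_QUERY = """
SELECT region, year, revenue,
       revenue - LAG(revenue) OVER (PARTITION BY region ORDER BY year) AS yoy_change
FROM sales
ORDER BY region, year
"""


def year_over_year(sales_rows):
    """Load the rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, revenue REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sales_rows)
        return conn.execute(YOY_QUERY).fetchall()
    finally:
        conn.close()


result = year_over_year(rows)
for row in result:
    print(row)
```

The point of the window clause is that the comparison with the prior year happens inside a single query, with no self-join and no procedural loop, which is what makes this kind of analysis approachable for users who know SQL but not MapReduce.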

Across the board, Spark has been driving major shifts in how businesses that rely on Hadoop do big data: making advanced analytics a reality with out-of-the-box capabilities; lowering the technical bar from expertise in MapReduce and Java to a basic understanding of databases and scripting; opening up new options for SQL access to Hadoop data; delivering faster results; and eliminating concerns about which Hadoop distribution a business uses. All these developments can be mapped to the increasing maturity of the Hadoop ecosystem and the increasing centrality, and criticality, of big data capabilities within the enterprise.

Going negative on Hadoop?

Accordingly, it was surprising for many to hear that Gartner had “gone negative” on the Hadoop market in its recent research, which describes a downward trend in expectations for Hadoop. However, Gartner’s assessment of where things are going is a lot more upbeat than the phrase “Trough of Disillusionment” might imply to those unfamiliar with Gartner’s model. As Gartner explains it, the trough is “a stage all emerging technologies must go through before leveling off into broad deployment.” It reflects the shift in expectations that occurs as a technology evolves from being speculative and experimental to being a broadly operational framework. In other words, reality must begin to kick in.

Moreover, putting any consideration of inflated expectations aside, the adoption numbers show that the Hadoop glass is half full—at the very least. Unfortunately, such numbers are hard to digest within a media cycle that feeds mainly on hype. Meanwhile, Apache Spark is newer than Hadoop proper and is upstream from it in terms of both hype and adoption. Considering the arc Spark has followed from obscurity to being a core technology for big data—all in a surprisingly short time—Spark clearly deserves the attention it is getting. But Spark’s growth needs to be understood in the context of the ecosystem in which it is occurring lest we misinterpret what is happening.

One unfortunate narrative claims that Spark’s growth is somehow coming at the expense of the growth of Hadoop. We even hear talk about “replacing” Hadoop with Spark. Of course, Spark no more replaces Hadoop than a transmission replaces a car. Have you ever heard a description of how someone has stopped using Excel “because now we use pivot tables”? More accurately, someone has stopped using standard Excel worksheets because pivot tables provide even more capability. But that doesn’t mean the person is no longer using Excel; rather, it means using Excel to do things not possible in regular Excel worksheets. Likewise, many businesses are now using Spark to do things they couldn’t do using other parts of the Hadoop infrastructure, such as MapReduce.

The difference Spark makes

The boost that Spark is providing is the latest in a series of milestones that Hadoop has cleared along the way. First came the establishment of the Hadoop Distributed File System (HDFS) as the right storage platform for big data. Next came the recognition of YARN as the resource allocation and management framework of choice for big data environments. Then came the realization that no single processing framework will solve every problem. MapReduce is a groundbreaking technology, but it doesn’t address every situation.

So along came Spark.

Far from replacing Hadoop, Spark is continuing to enhance Hadoop environments, making them stronger and more capable. In fact, Spark is one of the key enablers of Hadoop’s journey from the Trough of Disillusionment to the Plateau of Productivity. Together, Spark and Hadoop are taking big data to the next level.
