Avoiding code bloat as data scientists embrace Spark

Big Data Evangelist, IBM

Big data analytics platforms are the core development workbenches of the new generation of developers. As I stated in a LinkedIn blog post last year, big data applications draw their vital force from the machine learning, cognitive computing and other analytics algorithm libraries that execute within massively parallel clusters.

As Apache Hadoop gives way to Apache Spark as the preferred big data analytics development platform, we need to realize that this fundamental imperative stays constant. The exploratory big data platform (call it a sandbox, a refinery, a lake or a reservoir) is where most Spark developers, also known as data scientists, will spend most of their productive hours. If you fail to provide them with a rich library of algorithms and models, you’ll make it difficult for them to uncover data-driven insights in deep-learning, graph, streaming and other advanced analytics projects that pull from Hadoop Distributed File System (HDFS) and other big data repositories.

In addition to the algorithm libraries themselves, Spark, in the five years since it was open sourced, has continued to add new execution engines that are optimized for superior performance in massively scalable big data clusters. MapReduce, the execution heart of Hadoop, increasingly feels like legacy technology because YARN was released almost three years ago. The same applies to Apache Mahout, which has been the principal machine-learning library optimized for MapReduce.

As I noted in my aforementioned blog post, the principal Spark machine learning library (MLlib) is starting to become ubiquitous, thanks to its incorporation into Spark development tools. But MLlib is far from the only library that Spark developers are relying on. As the recent blog post, "Spark Turns Five Years Old," by Matei Zaharia notes, the core Spark code base has also added other specialized libraries (namely, Spark Streaming, GraphX and Spark SQL) to run on its core execution engine. MLlib and its sister libraries, most of which are less than two years old, can all interoperate smoothly and efficiently with each other in enabling fast parallel execution of the principal Spark-enabled data analytic functions.

Call this platform bloat, if you will, because, according to Zaharia, the lines of code in the libraries now dwarf those in the core Spark kernel by four to one. Not just that, but the libraries are expected to continue to bulk up in the coming years. “They also represent the single largest standard library available for big data,” Zaharia said, “making it easy to write applications that span all stages of the data lifecycle…. In future years, I expect these libraries to grow significantly, with the aim to build as rich a toolset for big data as the libraries available for small data.”

The volume of community contributions to the core Spark distribution over the past two years is no surprise, and Zaharia presents an interesting historical graph attesting to this trend. The pace should probably continue to grow as Spark is pressed into a growing range of big data analytics deployments. These deployments for Spark are expected to encompass cloud computing, streaming analytics, Internet of Things fog computing and more. New specialized algorithm and model libraries will almost certainly emerge from the Spark community to fully address this growing list of functional requirements.

All of these considerations raise an issue that may stall Spark’s seemingly inevitable drive toward universal adoption in big data analytics. As the convergence and evolution platform for many strands of big data analytics, does Spark risk becoming too bloated for its own good? Will Spark’s algorithm libraries, application programming interfaces (APIs), and tooling become too formidably complex for development of lightweight applications?

From the start, mainstream developer adoption of MapReduce and Hadoop has been hampered by the code base’s complexity. Will the embryonic Spark platform and tooling market learn from this recent history, or bog down in its own overstuffed ambitions?

To learn more about Spark and the open data platform offering, please join IBM and our partners at Spark Summit, June 15–17, 2015 in San Francisco, California. 
