Pushing the MapReduce Efficiency Envelope
Suitable frameworks help data scientists apply statistical models for increasingly hybridized big data infrastructures
One of the clichés of 1950s science-fiction B movies is the arrogant robot declaring that some stupid human suggestion “does not compute.” Or maybe that cliché was from the Lost in Space television series that ran three seasons from 1965 to 1968. Regardless, the concept that something does not compute collides head-on with humanity’s implicit faith in computing’s applicability to any problem that’s reducible to data and algorithms. Nevertheless, does not compute aligns with several notions with which most real-world computer scientists are familiar. The bottom line is whether a system can even conceivably be optimized for a specific algorithmic workload.
Finding the crux of the problem
For starters, those familiar with the theoretical computer science concept of a Turing machine1 know that the ability to address a problem algorithmically doesn’t mean the problem is definitively solvable that way. The term used to describe such problems is undecidable.2 Their undecidability—in other words, the fact that constructing a single algorithm that always leads to a correct yes or no answer to that problem is impossible—represents an intrinsic property of the problem. The property of undecidability is what can be established with certainty through logic and mathematics. That property cannot be changed, no matter which algorithms or how much computing power is thrown at the problem.
Another sense of does not compute is the notion that a problem’s algorithmic solution is infeasible. The problem may be theoretically decidable, but its definitive solution may require more processor time than the age of the universe, or far more processing, memory, storage, and I/O resources than currently exist on earth. Some not-entirely-impossible science-fiction scenario, perhaps involving quantum analytics, may make a solution feasible at some point in the future. But people who think they stand a decent chance of building, much less optimizing, some magical cloud-based resource to process such an algorithmic workload today are fooling themselves.
Yet another sense of does not compute is the concept of inefficient. A problem may be quite decidable and have a feasible algorithmic solution using available technology. But the algorithm may be so resource-consumptive, requiring inordinate processor, memory, storage, and bandwidth capacity to execute satisfactorily, that running it would cost more than the expected benefit. A cost-effective algorithmic solution is unlikely to be found until the IT economics improve, the organization upgrades to a more efficient hardware platform, or the algorithms tuned for the platform become more efficient.
When considering the efficiency of an algorithm, it’s easy to overlook a key constraint that has nothing to do with the underlying hardware. If the code was written to a software framework that constrains the algorithm’s efficiency, pushing the workload to the maximum performance theoretically possible on any given hardware platform won’t be possible. If the framework imposes unnecessary complexities that degrade workload performance, the developers’ hands are tied.
Matching the framework to the statistical model
In the world of big data, the most prevalent software development framework is Apache Hadoop’s MapReduce.3 Though it’s not the only distributed-processing programming model for big data, MapReduce imposes its own intrinsic complexities on any algorithm written in it. Consequently, even when scaling out MapReduce code on the biggest, fastest, most massively parallel cloud platform, these inherent complexities act as a drag on algorithmic performance.
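To make that imposed structure concrete, here is a minimal sketch of the rigid map, shuffle, and reduce phases into which every MapReduce job must be decomposed. It’s plain Python rather than Hadoop; the word-count task and all function names are illustrative assumptions, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for each word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group every intermediate value by key,
    # as the framework does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: collapse all counts for a single word into a total
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

Even this toy version shows the overhead in question: the shuffle step materializes and regroups every intermediate pair, whether or not the algorithm actually needs that full regrouping.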
With the foregoing in mind, I took great interest in Jeremy Kun’s article, “On the Computational Complexity of MapReduce,” which takes a deep dive into the computational complexities of MapReduce.4 The article is thick with advanced mathematics, much of which is completely over my head. But Kun’s core research questions are right to the point, calling attention to the intrinsic distributed-processing design of an algorithm written in MapReduce or any other framework. For example, Kun asks, “What computations are possible in a model of parallel computation where no processor has enough space to store even one-thousandth of the input?”
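Kun’s question about processors that cannot hold even a sliver of the input can be made concrete with a toy sketch (again plain Python, with hypothetical names): each simulated worker sees only its own small split of the data, and only tiny partial results ever flow between rounds.

```python
def distributed_sum(data, num_workers):
    # Partition the input so no single worker ever holds the full dataset
    splits = [data[i::num_workers] for i in range(num_workers)]
    # Round 1 (map): each worker reduces only its own small split
    partials = [sum(split) for split in splits]
    # Round 2 (reduce): combine the tiny per-worker results
    return sum(partials)

print(distributed_sum(list(range(1000)), num_workers=100))  # 499500
```

A sum parallelizes trivially under this constraint; Kun’s analysis asks which harder computations remain possible when every processor is restricted this way.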
What Kun is exploring mathematically are the limits of MapReduce from an algorithmic feasibility and workload efficiency standpoint. By implication, he’s creating an analytical framework for assessing how MapReduce can evolve to push these boundaries, or how developers can apply alternative programming models to big data analytics workloads for which MapReduce is inherently limited.
I’d like to see a decision-support framework that compares and contrasts MapReduce and alternative computation platforms—such as Apache Spark—according to the most important technical criteria. These criteria would include their comparative applicability to various problem domains, their performance and scalability on various parallel-processing platforms, their interpretability and elegance in delivering insights, and so forth.
As big data evolves into increasingly hybridized infrastructures that distribute computations across MapReduce, Spark, and other platforms, data scientists will need help in deciding which environment is best suited for developing and executing each type of statistical model they create.
Please share any thoughts or questions in the comments.
1 “On Computable Numbers, with an Application to the Entscheidungsproblem,” by A. M. Turing, Proceedings of the London Mathematical Society, 1936.
2 “Turing’s Undecidability Theorem,” Encyclopedia Britannica, Inc.
3 “What is MapReduce?” Hadoop website at IBM.com.
4 “On the Computational Complexity of MapReduce,” by Jeremy Kun, Math ∩ Programming blog, October 2014.