Boosting performance of open big data platforms

Distinguished Engineer, Power Systems, IBM

Ever hear of the buffalo theory? If not, a quick visit to a favorite search engine should provide the complete and entertaining story behind it. The first principle of the theory is that a herd of buffalo can only move as fast as the slowest buffalo in the herd. The slowest buffalo are the first ones targeted by predators and hunters, which drives a process of natural selection that improves the general speed and well-being of the herd as a whole.

Another way of looking at the theory is as a means of bringing the herd into better balance. Optimizing the performance and scalability of a big data platform is in many ways analogous to culling the herd: you continually identify and address the next barrier to performance or scalability. The theory is also about maintaining the correct balance between the many moving parts that make up the full solution going forward. The buffalo theory has added depth, but its second principle is best left as good fodder for a future blog.

Working in harmony

The volume of data and the rate at which new data is generated continue to rise steadily. The field of data science continues to advance, driving rapid development of new analytics algorithms and easy-to-use distributed frameworks that can be used to mine new insights from all that data. This data explosion has resulted in a fairly complex collection of scale-out infrastructure and software components that all need to be managed, tuned and balanced to work well together.

The open source communities and projects are essentially doing their own culling of the herd, as new projects emerge to tackle the next obstacle that is preventing the next advancement from taking shape. The Apache Spark project is a recent and obvious example of this emergence. Inspired by the desire to overcome the I/O performance limitations of Apache Hadoop and MapReduce, Spark uses an in-memory architecture that strives to pipeline as many operations as possible on the same copy of in-memory data to minimize I/O.
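The pipelining idea can be illustrated with a small sketch. This is plain Python, not Spark code, and the function names and the sample operations (doubling, then filtering) are assumptions chosen only for illustration. The first function stores every intermediate result in full, much as a chain of disk-backed jobs would; the second streams each record through all the operations in one pass.

```python
from typing import Iterable, List


def staged(records: List[int]) -> int:
    """Materialize each intermediate result before starting the next step."""
    doubled = [r * 2 for r in records]             # full copy after step 1
    kept = [r for r in doubled if r % 3 == 0]      # full copy after step 2
    return sum(kept)


def pipelined(records: Iterable[int]) -> int:
    """Stream each record through every step without intermediate copies."""
    doubled = (r * 2 for r in records)             # lazy: nothing computed yet
    kept = (r for r in doubled if r % 3 == 0)      # lazy: chained onto step 1
    return sum(kept)                               # one pass drives the chain


if __name__ == "__main__":
    data = list(range(10))
    print(staged(data), pipelined(data))
```

Both functions return the same answer. The difference is how many times the data is copied along the way, and those copies are the cost an in-memory pipeline tries to avoid.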

The first performance and scalability challenge then is how to keep up with the latest open software progressions, adopting them and getting them to work together. This challenge is where the IBM Open Platform for Apache Hadoop can help. It provides a collection of the latest versions of Hadoop ecosystem components that have been tested, tuned and packaged for easy consumption. This collection also paves the way to exploit even more advanced big data and analytics software tools offered with IBM InfoSphere BigInsights.

The next performance and scalability challenge exists below the open software, in the physical infrastructure: the full scale-out architecture that represents the compute, networking and storage sprawl. The critical approach for the physical infrastructure is balancing compute, memory, networking bandwidth and storage to meet the needs of big data analytics workloads. This approach isn’t really a new concept; performance analysis has commonly been an exercise in determining whether a workload is compute bound, memory bound or I/O bound and then finding the solution to alleviate the bottleneck. What is new is the complexity of doing this analysis in a rapidly evolving, open, big data, scale-out cluster environment.
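As a rough sketch of that classic exercise, the snippet below labels a workload by its most heavily used resource. The function name, the resource labels and the 0.85 threshold are all hypothetical, and the utilization figures are made up for illustration rather than measured.

```python
from typing import Dict


def find_bottleneck(utilization: Dict[str, float], threshold: float = 0.85) -> str:
    """Name the busiest resource if it is at or above the (assumed) threshold."""
    resource, level = max(utilization.items(), key=lambda item: item[1])
    return f"{resource} bound" if level >= threshold else "balanced"


if __name__ == "__main__":
    # Illustrative utilization fractions between 0 and 1, not real measurements.
    sample = {"compute": 0.40, "memory": 0.55, "I/O": 0.95}
    print(find_bottleneck(sample))  # prints "I/O bound"
```

In a real cluster the hard part is gathering trustworthy utilization numbers across many nodes and software layers, which is exactly the complexity described above.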

Yesterday’s problem may have been how to efficiently and economically store and access vast quantities of data increasing at infinite scale. Today’s challenge may be the rate at which that data can be ingested, organized and understood. And tomorrow’s enigma may be how quickly new machine-learning or graph computation can occur on that data as new breakthroughs in data science and analytics are deployed.

Architecture that handles complex challenges

Scale-out architecture has many benefits, including the ability to grow resources easily by adding nodes to a cluster in a way that naturally locates compute resources close to the data. It also provides new and economical availability models, because distributed computing frameworks that can cope with the loss of individual cluster nodes take on much of the burden of reliability and availability. Scale-out architecture can present challenges as well. While adding more nodes to a cluster is easy in principle, imbalance grows unless the resources of the individual nodes are being used efficiently. The result can be wasted compute resources, saturated networks and unneeded management complexity.

The next best practice in big data platform performance and scalability is to maximize the performance and efficiency of the individual cluster nodes first, and then scale out with those well-tuned and balanced nodes. This practice can result in a much more efficient scale-out architecture that supports the same amount of work on much smaller clusters. The other benefit is an infrastructure that can adapt to and absorb the new pressures thrown at it when new workloads are deployed that may otherwise reveal a new bottleneck.
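A back-of-the-envelope calculation shows why tuning nodes first pays off. The sketch below assumes a simplified model in which every node has the same peak capacity and achieves the same fraction of it; the function name and all the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def nodes_needed(total_work: float, node_capacity: float, efficiency: float) -> int:
    """Nodes required when each node delivers node_capacity * efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return math.ceil(total_work / (node_capacity * efficiency))


if __name__ == "__main__":
    # Illustrative units of work; not benchmark results.
    print(nodes_needed(1000, 10, 0.4))  # poorly balanced nodes: 250
    print(nodes_needed(1000, 10, 0.8))  # well-tuned nodes: 125
```

Under this simple model, doubling per-node efficiency halves the cluster needed for the same work. Real clusters also pay for network and coordination overhead that the model leaves out.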

Again, consider Spark as a case in point. Its in-memory design, which can drastically improve performance by minimizing disk I/O, shifts more pressure to the compute, memory and networking resources and the interconnects between them. Moreover, the Spark community itself has embarked on a new project, Tungsten, which is all about taking Spark performance to the next level by optimizing to the metal. In other words, the project seeks to bring Spark into better balance in its use of physical resources, addressing compute, cache and memory efficiency as the next bottlenecks to Spark performance. Tungsten is also motivated by advances in the physical infrastructure and hardware, along with a desire to exploit them from Spark: optimized compilers, fast networks, flash memory, graphics processing units (GPUs) and field-programmable gate array (FPGA) accelerators.

Open ecosystem deployments

IBM has experienced firsthand the benefits of properly balancing physical resources through its own deep analysis of Spark performance on IBM Power Systems. Additional details are available in a recent IBM developerWorks blog post, but the gist is that Spark benefits greatly from a good balance of compute thread density, large caches and ample memory and I/O bandwidth.

In regard to Spark’s vision for the future with Tungsten, another open ecosystem exists that could provide even more bare-metal optimizations for Spark. The OpenPOWER Foundation is an open ecosystem similar to what exists for open source software, but it is focused on open hardware innovation. OpenPOWER partners are developing advanced new forms of coherent accelerators, networking, and solid-state drive (SSD) and flash technologies to attach to Power Systems.

Learn more about Spark as a power tool for the modern data scientist. And be sure to register for IBM Insight 2015, October 26–29, 2015, in Las Vegas, Nevada, where you can see a presentation on the advantages of running Spark on Power Systems.