Explore the true elasticity of Spark

Research Staff Member, IBM

How well can Apache Spark analytics engines respond to changing workload demands and resource availability? We will be speaking on this topic at Spark Summit 2015 on 15–17 June in San Francisco, California. This post gives you a high-level preview of that talk. We invite you to attend the presentation and to share your thoughts and experiences with elastic auto-scaling in Spark. Auto-scaling is an evolving best practice that will become ever more critical as users increasingly rely on Spark for heavy-duty analytic data processing. The ability of Spark engines to address these challenges will greatly determine their efficiency, usefulness and adoption in real-world deployments.

Among the auto-scaling challenges that affect Spark price-performance, the chief concerns are how accurately we can predict workload performance, how quickly new resources can be brought into clusters, and how cost-effectively analytic data processing can be carried out. The Spark Summit presentation will discuss our study of the effectiveness of Spark elasticity when deployed on popular resource managers, namely Mesos and YARN. Lessons from this work will enable tuning of an efficient auto-scaling infrastructure for Spark in cloud environments.

In particular, we’ve been investigating how well analytics workloads running on Spark behave as nodes are added to and removed from the underlying clusters. We’ve examined the strengths and weaknesses of Mesos and YARN, the principal resource managers, with respect to scaling.

For each of them, we evaluated Spark elasticity on such key metrics as workload runtime, CPU utilization and network bandwidth consumption. We also analyzed the impact of changing vital scheduling parameters (for example, locality wait time, resource re-offer interval) on these metrics.
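As a rough illustration of the knobs involved (the values below are illustrative, not tuning recommendations), the locality wait time is a standard Spark scheduling property set in `spark-defaults.conf`:

```properties
# spark-defaults.conf — how long the scheduler waits for a data-local
# slot before falling back to a less-local one (Spark default: 3s)
spark.locality.wait        3s

# Finer-grained overrides are also available per locality level:
spark.locality.wait.node   3s
spark.locality.wait.rack   3s
```

On the Mesos side, the resource re-offer cadence is governed by the Mesos master (for example, its `--allocation_interval` flag controls how frequently resources are allocated and offered to frameworks).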

Our baseline experimental setup was a 6-node cluster, with each node configured with 4 CPUs, 8 GB of memory and a 100 GB hard disk drive. Scaling up involved adding four new nodes “instantaneously.” Benchmarks included machine-learning algorithms such as KMeans and Logistic Regression, the PageRank graph algorithm, and SQL queries using Spark SQL.

What have we discovered so far? Significant findings include:

  • Similar benefits from auto-scaling can be realized with Mesos and YARN, enabling efficient, on-demand resource utilization.
  • Adding new nodes can significantly decrease Spark workload execution time, but the benefits vary greatly by workload. In particular, shuffle-intensive workloads with short-running tasks, such as PageRank, benefit less from adding new nodes due to increased cross-node traffic and low CPU utilization.
  • Factors contributing to scaling benefits include data locality preference of tasks and whether tasks are CPU-bound.
  • Simple enhancements to the Spark core can improve the Spark auto-scaling mechanism—for example, allowing for the dynamic adjustment of scheduling parameters such as locality wait time, and incorporating simple resource-monitoring mechanisms such as CPU utilization.
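For instance, Spark’s dynamic allocation feature lets the engine grow and shrink its executor count as the task backlog changes. A minimal sketch of the relevant `spark-defaults.conf` entries (values illustrative; the time-string syntax assumes Spark 1.4):

```properties
# Let Spark request and release executors as demand changes
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         10

# How long an executor may sit idle before being released
spark.dynamicAllocation.executorIdleTimeout  60s

# Required so shuffle files outlive the executors that wrote them
spark.shuffle.service.enabled                true
```

Note that this mechanism reacts only to scheduler backlog; the enhancements above would extend it to react to runtime signals such as CPU utilization as well.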

The presentation will include detailed benchmark numbers and the implications for your Spark deployments.

Begin learning more about Spark today, and register for Spark Summit 2015. Plus, check out IBM BigInsights 4.0, an enhanced solution with Spark support.  

Join fellow data scientists June 15, 2015 at Galvanize, San Francisco for a Spark community event. Hear how IBM and Spark are changing data science and propelling the Insight Economy. Sign up to attend in person, or sign up for a reminder and watch the livestream on the day of the event.

Co-authored by Min Li, IBM Research staff member.