Answers to your burning questions about Apache Spark

Digital Marketing Manager, Big Data & Analytics, IBM

Widespread interest in Apache Spark was reflected in a recent CrowdChat. The discussion began with a simple question, “What is Spark?”, to gauge the audience’s understanding of the technology. Not surprisingly, it turned out to be the most popular of the six questions posed during the CrowdChat. By the end of the hour, the event had reached nearly 3 million participants and drawn close to 1,300 page views. Here is a summary of the questions and a few noteworthy responses.

What is Spark?

Essentially, Spark is a next-generation, cluster-computing solution; runtime processing environment; and development framework for in-memory advanced analytics. – James Kobielus, Big Data Evangelist at IBM

Apache Spark is an in-memory, distributed computing engine specifically designed to perform machine learning. – Himanshu Mehra, Software Developer at InfoObjects Inc.

At its most primitive and basic level, think of Spark as a way to distribute Python or Scala functions across a cluster. – Andrew C., President at Mammoth Data Inc.

Why is Spark important?

Spark enables organizations to do more and deeper analytics with less coding and faster response times compared to typical MapReduce applications. – James Kobielus, Big Data Evangelist at IBM

To speed up the computing task. But how much performance gain can be achieved using Spark alongside Hadoop? – Thulasiram Valleru

Spark unlocks a whole new spectrum of business value that we have yet to fully comprehend. Its importance is equivalent to talking about Linux at the time and trying to predict Facebook. – Joel Horwitz, Director of Portfolio Marketing, IBM Analytics Platform

What are the primary use cases for Spark, in both cross-industry and vertical industry applications?

We’re seeing Spark in finance, in particular, but in other industries as well. Anywhere that a single computer would take too long to execute a piece of functional code is a good candidate. – Andrew C., President at Mammoth Data Inc.

We’ve seen Spark adoption in financial services, healthcare, retail, energy and other industries. – Jason Schroedl, Vice President, Marketing at BlueData Inc.

Spark SQL is great for reading data from legacy databases and doing fast computation. – Ali Khanafer, Data Scientist
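The "single computer would take too long" case mentioned above can be pictured with a small sketch. This is not Spark code: it is a local, standard-library analogy in which threads stand in for cluster workers, and the helper names (`split_into_partitions`, `apply_to_partition`, `distributed_map`) are hypothetical, chosen only for illustration. It shows the general pattern of splitting data into partitions, applying the same function to each partition in parallel, and gathering the results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split items into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional item.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def apply_to_partition(func, partition):
    """Run func on every element of a single partition."""
    return [func(x) for x in partition]


def distributed_map(func, items, num_workers=4):
    """Apply func to all items, one partition per (simulated) worker.

    Assumption: threads stand in for cluster nodes; a real cluster engine
    would ship the function and the partitions to separate machines.
    """
    partitions = split_into_partitions(list(items), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(
            pool.map(lambda part: apply_to_partition(func, part), partitions)
        )
    # Flatten the per-partition results back into one list, in order.
    return [value for part in results for value in part]


def square(x):
    return x * x


squares = distributed_map(square, range(10), num_workers=3)
```

In Spark itself, the roughly corresponding steps would be creating a distributed dataset (for example with `SparkContext.parallelize`) and calling `map` on it; the sketch above only mirrors that shape on a single machine.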

How does Spark build on and evolve users’ investments in Apache Hadoop, streaming, machine-learning and predictive analytics technologies?

The introduction of the DataFrame application programming interface (API) in Spark 1.3 is awesome and one step closer to the R API with powerful predictive analytics. – Ali Khanafer, Data Scientist

Spark accesses data lakes built on the Hadoop Distributed File System (HDFS), complements the low-latency streaming of a platform such as IBM Streams, and adds machine-learning libraries that extend what is already in Hadoop and other big data analytics platforms. – James Kobielus, Big Data Evangelist at IBM

How do data scientists and data-driven application developers enhance their skills and upgrade their tools to make the most of Spark?

Making the most of Spark involves experience and experimentation, both of which are critical to learning. – Tim Crawford, Chief Information Officer (CIO) of AVOA, strategic advisor, board member and keynote speaker

This question is the critical impasse: how do you have the proper background in mathematics and business, and yet also in distributed computing, to be effective? How do you acquire the domain knowledge? These questions represent the hard part, the sweat. – Andrew C., President, Mammoth Data Inc.

How will data science shape the future of business?

Data science will have an impact on key decisions about everything from productivity to consumer behavior. – Avi Patwardhan, Product Marketing, IBM Analytics

Data science will play a critical role in continuing to define new buyer cohorts and markets, and refining target audiences to find out more about behaviors, needs and preferences at the individual level. – Kimberly Madia, Product Marketing Manager, IBM InfoSphere Streams, IBM Software

Learn more about Spark

If you missed this CrowdChat, a complete transcript is now available. In addition, another #SparkInsight CrowdChat will take place on June 9, 2015 at 11 a.m. ET. We look forward to your participation. Add this CrowdChat to your calendar today.