Answers to more of your burning questions about Spark

Digital Marketing Manager, Big Data & Analytics, IBM

Spark technology has a lot of potential, we concluded after our second #SparkInsight CrowdChat on June 9. In fact, we think Spark is going to evolve into something really big.

Although we had discussed many features of Spark in our previous chat, we got the ball rolling by again starting off with a simple question: “What is Spark?” After that, there was no stopping it. We discussed seven questions touching on various pressing Spark-related topics and by the end of the hour had reached almost 3 million users and recorded nearly 1,500 page views.

This is what John Furrier, Cofounder of CrowdChat, had to say after the second #SparkInsight CrowdChat:

“This CrowdChat in our opinion will be a historic record of the state of the industry & minds of the leaders, influencers and subject matter experts. THANKS to everyone for participating.”

Here are the questions we discussed during the chat, along with a few popular responses we received:

What is Spark?

  • “An Apache project focusing on server cluster architecture.” – Ira Michael Blonder
  • “Spark is an in-memory distributed computing platform which essentially executes your functional Scala or Python code across the cluster as opposed to one machine. It also allows for micro-batching which tastes like streaming, but less filling.” – Andrew C. Oliver
  • “Spark is a in-memory distributed data processing engine, it is alternative to Map-Reduce framework. It has very rich data ingestion connectors and higher order functions for solve bigdata problems.” – Rahul Kumar
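
To make the micro-batching idea from the answers above a little more concrete, here is a rough sketch in plain Python. It deliberately does not use Spark's API or a cluster; the helper names micro_batches and count_words and the sample lines are invented for illustration. It only shows the general shape: chop an incoming stream into small batches, then run the same functional map-and-reduce step on each batch.

```python
from collections import Counter
from functools import reduce
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of at most batch_size items from an iterable."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """Map each line to its word counts, then reduce them into one Counter."""
    per_line = map(lambda line: Counter(line.lower().split()), batch)
    return reduce(lambda a, b: a + b, per_line, Counter())


# Hypothetical input standing in for a continuous stream of text lines.
lines = [
    "spark is fast",
    "spark is in memory",
    "streams arrive in small batches",
]

# Process the stream two lines at a time, keeping a running total.
running = Counter()
for batch in micro_batches(lines, batch_size=2):
    running += count_words(batch)
```

In a real Spark deployment the per-batch work would be distributed across the cluster rather than run in a single loop; this sketch runs on one machine purely to show the pattern.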

What are the important components of a Spark implementation?

  • “A good rational use case, expertise in Python and/or Scala and expertise in managing a large cluster (or cloud deployment). A lot of memory doesn't hurt :-)” – Andrew C. Oliver
  • “Clear separation of importing data (from SQL, HBase, etc) and distributed computation.” – Ali Khanafer

What is Spark’s optimal niche within diversified big data analytics ecosystems?

  • “The equivalent of the ETL slot in a SQL architecture. In other words after the warehouse and before the applications.” – Ira Michael Blonder
  • “Low-latency and streaming scenarios... theoretically across and blending multiple analysis modes—SQL, Graph, Machine Learning.” – Doug Henschen

How does Spark improve data scientist productivity?

  • “Spark improves data scientist productivity by enabling faster iterative development and refinement of statistical models fed by fresh continuous streams of low-latency data.” – James Kobielus
  • “Traditional tools don't allow data scientists to impact the business. We see a huge opportunity to scale the work of data scientists. Easy access to data has to be as easy as using Lotus 123 in early PC days.” – John Furrier
  • “The Spark REPL is an excellent way for data scientists to prototype solutions without having to submit code to the cluster all the time, leading to better feedback and iterative development.” – Ian Pointer

How mature is Spark as a big data analytics tool?

  • “It is getting there. Still do not have a large support community. It is often hard to get help with memory issues, etc.” – Ali Khanafer
  • “If you want maturity buy a mainframe. If you want high end capabilities and to solve problems that computers haven't been able to solve efficiently before, then maybe you want something more like Spark. In competitive industries you don't wait.” – Andrew C. Oliver
  • “Spark needs improvements in areas like security and integration with broader BI tools.” – Avi Patwardhan

What should you look for in a commercial Spark tool?

  • “You should look for an ODP-conformant Hadoop platform that includes the latest Spark version; for a high-performance in-memory distributed cluster to process Spark algorithms in massive parallelism; and for full vendor support & training.” – James Kobielus
  • “Since the history is not extensive IMO it is too early to set standards beyond the norm usually required by enterprise computing consumers.” – Ira Michael Blonder

How is Spark likely to evolve?

  • “With the traction already happening, Spark has a lot of potential so it [could] evolve to something really big.” – Mark van Rijmenam
  • “Well now that Gartner is looking at the solutions in the ‘big data’ space I think it's safe to say enterprise computing consumers are going to go there. So I think slow for next year or two and then a big move up.” – Ira Michael Blonder
  • “Spark will likely become the default data science engine/OS. It will be abstracted in the future for those with no SW dev background to still manage to do deep analytics.” – Ali Khanafer

If you’re interested in more insight surrounding Spark, read the transcript of the entire chat.

Interested in knowing more about Spark? Sign up to join us at Galvanize in San Francisco on June 15 for a Spark community event. Hear how IBM and Spark are changing data science and propelling the insight economy.

Can't attend in person? Watch the livestream, and sign up to get a reminder on the day of the event.