Who uses Spark and why?

Manager of Portfolio Strategy, IBM

Earlier in 2015, as part of the launch of the Spark Technology Center, IBM declared Apache Spark the analytics operating system. Spark holds potential in three key areas: power of data, simplicity of design and speed of innovation.

Fast, powerful, simple trajectory

Spark unlocks the power of data by handling large-scale data with speed. It abstracts the complexity of data access across many landing zones, such as the Hadoop Distributed File System (HDFS), relational databases, fast-moving data streams and other distributed file systems. This abstraction means that organizations can focus more on building data products and less on the low-level details of data access.
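To make the abstraction idea concrete, here is a minimal plain-Python sketch. It is an analogy only and does not use Spark's API. The reader functions, the `total_by_key` helper and the sample strings are all hypothetical, invented for illustration, with the strings standing in for real data stores. The point is that once every source yields rows in one common shape, the analysis code no longer needs to know where the rows came from.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample data standing in for two different "landing zones".
CSV_SOURCE = "region,amount\neast,10\nwest,5\neast,7\n"
JSON_LINES_SOURCE = '{"region": "west", "amount": 3}\n{"region": "north", "amount": 4}\n'


def read_csv_rows(text):
    """Yield each CSV record as a dict with a numeric amount."""
    for record in csv.DictReader(io.StringIO(text)):
        yield {"region": record["region"], "amount": float(record["amount"])}


def read_json_rows(text):
    """Yield each JSON-lines record as a dict with a numeric amount."""
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            yield {"region": record["region"], "amount": float(record["amount"])}


def total_by_key(rows, key="region", value="amount"):
    """Source-agnostic analysis: sum `value` per `key` over any iterable of rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def combined_totals():
    """Run the same analysis over rows gathered from both sources."""
    rows = list(read_csv_rows(CSV_SOURCE)) + list(read_json_rows(JSON_LINES_SOURCE))
    return total_by_key(rows)
```

In Spark itself, a comparable role is played by the DataFrame reader, which exposes formats such as JSON, CSV and JDBC tables behind one interface, so downstream code works on the resulting data regardless of origin.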

In terms of design, the simplicity of Spark is badly needed. Organizations are using a wide variety of technologies to deliver analytics, and they tend to be tied to different workloads such as batch, iterative, business intelligence (BI) or real-time workloads. Each has different programming models, methods of data access and visualizations. Spark supports a number of widely used programming languages such as Python and Scala. This support, combined with the ability to work with a multitude of different data sources, helps make life much easier for data science professionals. A number of integrated, high-level tools for machine learning and streaming data help further reduce development time and foster the building of highly intelligent applications.

And Spark accelerates delivery of insight with in-memory processing across a distributed framework. Contrary to some views, Spark doesn’t just offer faster access to data in Apache Hadoop. The value of Spark lies in its ability to enable more people than ever to collaborate when accessing data, applying analytics and deploying deep intelligence in every type of application: Internet of Things, web, mobile, social, business process and others.

Who uses Spark and why do they use it? Key users of Spark are data engineers, data scientists and application developers. They use Spark for several important reasons:

  • Spark is open and accelerates community innovation.
  • Spark can run some workloads up to 100 times faster than MapReduce, particularly when the data fits in memory.
  • Spark handles all kinds of data for large-scale data processing.
  • Spark supports agile data science, letting teams iterate rapidly.
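
One reason behind the speed and rapid-iteration points in the list above is that iterative jobs can keep their working data in memory instead of reloading it on every pass. The following plain-Python sketch illustrates that idea only. It does not use Spark, and `CountingSource`, `refine_mean`, `iterate_reloading` and `iterate_cached` are hypothetical names invented for illustration. It counts how often a slow source is read under each strategy.

```python
class CountingSource:
    """Stand-in for a slow data store; counts how many times it is fully read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def refine_mean(data, estimate, step=0.5):
    """One iteration: move the estimate part of the way toward the data's mean."""
    mean = sum(data) / len(data)
    return estimate + step * (mean - estimate)


def iterate_reloading(source, iterations):
    """Reload the data from the source on every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine_mean(source.load(), estimate)
    return estimate


def iterate_cached(source, iterations):
    """Load once, keep the data in memory and reuse it on every iteration."""
    data = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine_mean(data, estimate)
    return estimate
```

Both strategies produce the same estimate, but the cached version reads the source once rather than once per iteration. In Spark, the comparable step is marking a dataset as cached so that later passes reuse it from memory.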

Data scientists

Among their responsibilities, data scientists need to identify patterns, trends, risks and opportunities in data. They need to tell a story with data and discover new, actionable insights. This group also builds new algorithms and models that can move data science into the application.

Spark helps data scientists by supporting the entire data science workflow, from data access and integration to machine learning and visualization using the language of choice—which is typically Python. It also provides a growing library of machine-learning algorithms through its machine-learning library (MLlib).

Data engineers

The role of data engineers is primarily to act as a bridge between the data scientist and the application developer. Data engineers implement machine-learning algorithms at scale and put the right data system—Hadoop, graph databases, Cloudant NoSQL, relational databases and streaming and in-memory data stores—to work for the job at hand.

Spark helps data engineers by providing the ability to abstract data access complexity—Spark doesn’t care what the data store is. It also enables near-real-time solutions at web scale, such as pipelined machine-learning workflows.

Application developers

Developers build applications that leverage advanced analytics in partnership with the data scientist and the data engineer. They follow agile design methodologies, optimize performance and meet service-level agreements (SLAs).

Spark helps application developers through its support of widely used analytics application languages such as Python and Scala. It helps eliminate programming complexity by providing libraries such as MLlib, and it can simplify development operations (DevOps). Spark also makes embedding advanced analytics into applications easy.

Community movement

Spark is poised to move beyond a general processing framework. General classes of applications are moving to Spark, including compute-intensive applications and applications that require input from data streams such as sensors or social data. Compute-intensive applications can benefit from in-memory processing, and applications requiring streaming data tend to be intelligent and provide advanced analytics that can engage end users such as healthcare providers or equipment operators.
