
Bridging Spark analytics to cloud data services

Big Data Evangelist, IBM

Apache Spark adds significant value to big data analytics projects. Spark’s sweet spot is as the workbench of choice for data science professionals, who interactively and iteratively explore, build and tune statistical models from data in the cloud. It is well suited for the next generation of machine-learning applications that require low-latency distributed computations.

Apply Spark to a range of projects

Data scientists use Spark to ask bigger, tougher questions of their data and get better answers in less time. They also use it to iterate rapidly in model development and hypothesis testing, and to deliver scalable big data analytics quickly. In the process, data scientists can spend more time on delivery and innovation than was possible before, empower their organization with superior insights and deliver business value quickly from their projects.

Data science professionals use Spark in many types of projects. One of its principal uses is in logical data warehousing, providing an in-memory environment for building fast data marts that leverage the Hadoop Distributed File System (HDFS) storage layer as well as other components of the Spark ecosystem. In data lake environments, data scientists use Spark to pull data from HDFS, NoSQL and other databases; transform and cleanse it in an Apache Hadoop cluster; and then bring it into machine-learning and interactive statistical exploration applications.
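
The data lake flow described above (pull, cleanse, hand off) can be sketched in PySpark roughly as follows. This is a minimal illustration rather than a reference implementation: the field names `name` and `amount`, the function names and the input and output paths are all hypothetical, and the cleansing rule is kept as plain Python so it can be checked without a cluster.

```python
def cleanse(record):
    """Normalize one raw record; return None when it cannot be used.

    The field names "name" and "amount" are hypothetical.
    """
    try:
        amount = float(record.get("amount"))
    except (TypeError, ValueError):
        return None  # missing or non-numeric amount
    name = str(record.get("name") or "").strip().lower()
    if not name:
        return None  # blank or missing name
    return {"name": name, "amount": amount}


def run_pipeline(input_path, output_path):
    """Pull JSON records, cleanse them in parallel, store the result as Parquet."""
    # Imported here so cleanse() can be exercised without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-lake-cleanse").getOrCreate()
    raw = spark.read.json(input_path)  # for example, an hdfs:// URI
    cleaned = (
        raw.rdd
        .map(lambda row: cleanse(row.asDict()))       # apply the rule per record
        .filter(lambda rec: rec is not None)          # drop unusable records
        .map(lambda rec: (rec["name"], rec["amount"]))
    )
    # An explicit schema avoids inference problems when few records survive.
    result = spark.createDataFrame(cleaned, schema="name string, amount double")
    result.write.mode("overwrite").parquet(output_path)
    spark.stop()
```

The Parquet output is then ready to be loaded by a downstream machine-learning or exploration job. In practice the same rule could also be expressed with DataFrame column functions, which usually runs faster than a Python function applied row by row.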

Regardless of the type of project in which data scientists deploy Spark, they typically leverage Spark SQL, its native query engine; MLlib, its embedded machine-learning library; Spark Streaming, its stream-computing engine; and/or GraphX, its graph analytics engine. Data scientists may also choose to bridge their Spark implementations across a broad hybrid ecosystem of open source cloud data services.
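
To illustrate how two of these components can work together, the sketch below aggregates data with Spark SQL and passes the result to a clustering model from Spark's machine-learning library. It assumes an `orders` DataFrame with hypothetical `customer_id` and `amount` columns; the query, the function name and the choice of k-means with two clusters are illustrative assumptions, not something the original post prescribes.

```python
# Spark SQL query that derives per-customer features (hypothetical columns).
FEATURE_QUERY = """
    SELECT customer_id,
           COUNT(*)    AS order_count,
           AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer_id
"""


def segment_customers(spark, orders, k=2):
    """Aggregate with Spark SQL, then cluster the customers with k-means."""
    # Imported lazily so this file loads on machines without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    # Expose the DataFrame to SQL under the table name used in FEATURE_QUERY.
    orders.createOrReplaceTempView("orders")
    per_customer = spark.sql(FEATURE_QUERY)

    # The estimator expects a single vector column of features.
    assembler = VectorAssembler(
        inputCols=["order_count", "avg_amount"], outputCol="features"
    )
    features = assembler.transform(per_customer)

    model = KMeans(k=k, seed=1, featuresCol="features").fit(features)
    # transform() adds a "prediction" column holding each customer's cluster.
    return model.transform(features)


# Example call, assuming an active SparkSession named `spark`:
#   orders = spark.createDataFrame(
#       [(1, 10.0), (1, 12.0), (2, 250.0), (2, 300.0), (3, 9.5)],
#       ["customer_id", "amount"],
#   )
#   segment_customers(spark, orders).show()
```

The feature preparation stays in SQL, where it is easy to read and change, while the modeling step uses the DataFrame-based machine-learning API on the query result.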

Deploy Spark for multiple platforms

Where hybrid environments are concerned, Spark complements Hadoop, NoSQL, relational databases and every other functional component within the multiplatform world of cloud data services. Depending on the applications within which Spark is used, the cloud data services ecosystem may include any or all of the following components: 

A Spark workbench needs to accomplish several functions to deliver results within any type of data science project: 

  • Enable simplified access to both a user’s on-premises-based Spark deployments and other cloud data services and data platforms
  • Provide easy access to built-in machine-learning libraries
  • Support fast, highly flexible and efficient coding and development of machine-learning analytics
  • Include extension libraries for SQL, DataFrames, streaming data, machine learning and graph analysis
  • Be accessible through Jupyter Notebooks
  • Integrate with a broad tool ecosystem
  • Be easy to use, reliable, always on, fully managed, risk free, open and offer pay-as-you-grow capability

Try out Spark for three months

You can start today to achieve these benefits. Data science professionals are encouraged to sign up for a complimentary, three-month trial of IBM Analytics for Apache Spark and IBM Cloudant that offers a no-risk, pay-as-you-go on-ramp to the power of Spark.

In this extended cloud-based trial, you can access in-support, fast, in-memory analytics on Cloudant JavaScript Object Notation (JSON) data. And you can enjoy 20 hours of complimentary software-as-a-service (SaaS) Startup Advisory Services to help make the most of cloud data services in data science business initiatives.

Start the complimentary trial today. And check out a video introduction to IBM Analytics for Apache Spark and a tutorial on using the service’s embedded machine-learning libraries.