
Experience the power of big data with Apache Spark and Cloud Pak for Data

Offering Manager at IBM Cloud Pak for Data, IBM

Data is all around us. By 2020, IDC expects the entire store of data to be as large as 44 zettabytes, amounting to a single bit of data for every star in the physical universe. Clearly, businesses have a lot of data, and it will continue to grow. We now need a method for storing increasing amounts of data at scale, and a way to economically digest all this data with a quick feedback loop.

Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Apache Spark provides tight feedback loops and lets you process multiple queries quickly and cost-effectively, with little overhead.

The latest release, IBM Cloud Pak for Data v2.5, has three key themes: Red Hat integration, new built-in capabilities such as Watson tools and runtimes, and a heavy focus on open source. Open source is widely adopted in enterprises, especially as products mature and vendors extend their reach. IBM is expanding support for open source technologies for our enterprise clients and helping ensure the governance of all open source within a business.

Cloud Pak for Data v2.5 includes the Analytics Engine service, which is powered by Apache Spark and integrated into the platform. It gives users a holistic experience for running serverless Spark jobs, with access control, logging, monitoring, and many other integrated services. Analytics Engine can run a variety of independent workloads on the Cloud Pak for Data cluster through Watson Studio, without requiring Jupyter notebooks. It can run Spark applications that use Spark SQL, including data transformation, data science, and machine learning jobs.
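To make the idea of a Spark SQL data transformation job concrete, here is a minimal PySpark sketch. It is an illustration only, not code from the Analytics Engine service: the `orders` view, its `order_date` and `amount` columns, and the function name `run_transformation` are all hypothetical.

```python
# Hypothetical query: the view and column names (orders, order_date, amount)
# are illustrative assumptions, not part of any Cloud Pak for Data schema.
DAILY_REVENUE_SQL = """
SELECT order_date,
       SUM(amount) AS total_revenue,
       COUNT(*)    AS order_count
FROM orders
GROUP BY order_date
ORDER BY order_date
"""


def run_transformation(input_path, output_path):
    """Read a CSV file, aggregate it with Spark SQL, and write Parquet."""
    # Imported inside the function so the module can be loaded (for example,
    # by a linter or a test) on a machine where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-revenue").getOrCreate()
    try:
        # Load the raw data and let Spark infer column types from the file.
        orders = (
            spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv(input_path)
        )
        # Expose the DataFrame to Spark SQL under the name used in the query.
        orders.createOrReplaceTempView("orders")
        daily = spark.sql(DAILY_REVENUE_SQL)
        # Overwrite any previous output so the job can be re-run safely.
        daily.write.mode("overwrite").parquet(output_path)
    finally:
        spark.stop()
```

Calling `run_transformation` with an input and an output path runs the whole job. How such a script is packaged and handed to Analytics Engine depends on the service's own job interface, which this sketch does not cover.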

Analytics Engine also provides an integrated experience for running and managing different kinds of analytics workloads in a single platform. Because Spark jobs run with dedicated resources, their performance is consistent and predictable.

A number of APIs are available for submitting Spark jobs and for managing workflows and logs. You can access and track running jobs in the Cloud Pak for Data user interface, alongside other tasks running on the cluster. You can also use push notifications to be alerted immediately when a Spark job fails.
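As a rough sketch of what submitting a job over an HTTP API can look like, the Python snippet below builds a POST request with the standard library. Everything specific in it is an assumption made for illustration: the `/jobs` path, the `application` and `arguments` payload fields, and the bearer-token header are hypothetical and are not taken from the Analytics Engine API. Consult the product documentation for the real endpoint and payload.

```python
import json
import urllib.request


def build_submit_request(base_url, token, application, arguments):
    """Build (but do not send) a job-submission request.

    The URL path and payload fields are hypothetical placeholders.
    """
    payload = {
        "application": application,    # hypothetical field: path to the Spark app
        "arguments": list(arguments),  # hypothetical field: arguments for the app
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the request separately from sending it makes the payload easy to inspect or log before anything reaches the cluster.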

The Analytics Engine service is 100 percent open source-supported, consistent with our core commitment to open source tools and platforms. Check out the infographic to learn more about open source on Cloud Pak for Data, or watch the on-demand webinar. If you’re ready to try the platform for yourself with your own data, consider a 7-day trial at no cost.