Bridging Spark analytics to cloud data services
Apache Spark adds significant value to big data analytics projects. Spark’s sweet spot is as the workbench of choice for data science professionals, who interactively and iteratively explore, build and tune statistical models from data in the cloud. It is well suited for the next generation of machine-learning applications that require low-latency distributed computations.
Apply Spark to a range of projects
Data scientists use Spark to ask bigger, tougher questions of their data and get better answers in less time. They also use it to iterate rapidly through model development and hypothesis testing and to deliver scalable big data analytics quickly. In the process, they can spend more time on delivery and innovation, give their organization superior insights and deliver business value sooner from their projects.
Data science professionals use Spark in many types of projects. One of its principal uses is in logical data warehousing, where it provides an in-memory environment for building fast data marts that draw on the Hadoop Distributed File System (HDFS) storage layer alongside components of the Spark ecosystem. In data lake environments, data scientists use Spark to pull data from HDFS, NoSQL and other databases; transform and cleanse it in an Apache Hadoop cluster; and then bring it into machine-learning and interactive statistical exploration applications.
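The data lake flow described above (pull from HDFS, cleanse, hand off to machine learning) can be sketched in PySpark roughly as follows. This is a minimal illustration, not a reference implementation: the comma-separated sensor layout, the function names `parse_reading` and `build_model`, the choice of k-means and the HDFS path are all assumptions made for the example. The cleansing rule is kept as a plain Python function so it can be checked without a cluster.

```python
import math
from typing import Optional, Tuple


def parse_reading(line: str) -> Optional[Tuple[str, float, float]]:
    """Cleanse one raw line of the (hypothetical) form 'sensor_id,temperature,humidity'.

    Returns None for malformed or implausible rows so they can be filtered out.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3 or not parts[0]:
        return None
    try:
        temperature, humidity = float(parts[1]), float(parts[2])
    except ValueError:
        return None
    # Reject NaN/inf temperatures and humidity outside 0-100 percent.
    if not math.isfinite(temperature) or not 0.0 <= humidity <= 100.0:
        return None
    return parts[0], temperature, humidity


def build_model(hdfs_path: str):
    """Pull text data from HDFS, cleanse it, and fit a k-means model with Spark ML."""
    # Imported here so the cleansing rule above can be used without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("data-lake-sketch").getOrCreate()

    # Pull: read raw lines from the storage layer.
    raw = spark.sparkContext.textFile(hdfs_path)

    # Transform and cleanse: parse each line and drop rejected rows.
    rows = raw.map(parse_reading).filter(lambda r: r is not None)
    df = spark.createDataFrame(rows, ["sensor_id", "temperature", "humidity"])

    # Hand off to machine learning: assemble a feature vector and cluster.
    assembler = VectorAssembler(
        inputCols=["temperature", "humidity"], outputCol="features"
    )
    features = assembler.transform(df)
    return KMeans(k=3, seed=1, featuresCol="features").fit(features)
```

With a Spark installation available, a call such as `build_model("hdfs:///data/readings.csv")` (a placeholder path) would return a fitted model; sources other than HDFS would replace only the read step.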
Regardless of the type of project in which data scientists deploy Spark, they typically leverage Spark SQL, its native query engine; MLlib, its embedded machine-learning library; Spark Streaming, its stream-computing engine; and/or GraphX, its graph analytics engine. Data scientists may also choose to bridge their Spark implementations across a broad hybrid ecosystem of open source cloud data services.
Where hybrid environments are concerned, Spark complements Hadoop, NoSQL, relational databases and every other functional component within the multiplatform world of cloud data services. Depending on the applications within which Spark is used, the cloud data services ecosystem may include any or all of the following components:
- Hadoop cloud services accelerate enterprise-grade, advanced analytics built on open source technology.
- Data warehousing cloud services analyze data where it resides, in the cloud, with a fully managed columnar data warehouse service, and use in-database predictive analytics and massively parallel processing (MPP) to do more with that data.
- Graph cloud services enable high-powered storage, query and visualization of data points, their connections and their properties.
- NoSQL cloud database services move application data closer to all the places it needs to be for uninterrupted data access—offline or online.
- Multi-workload SQL cloud databases deliver performance and high availability for mission-critical applications and analytics.
- Open source database-as-a-service (DBaaS) platforms support web and mobile applications in a scalable manner.
- Cloud-based data preparation and movement services enable developers to access, combine and transform data.
- Predictive analytics cloud services optimize the future with better decisions today, deployed directly into business processes.
A Spark workbench needs to accomplish several functions to deliver results within any type of data science project:
- Enable simplified access to a user’s on-premises Spark deployments as well as to other cloud data services and data platforms
- Provide easy access to built-in machine-learning libraries
- Support fast, highly flexible and efficient coding and development of machine-learning analytics
- Include extension libraries for SQL, DataFrames, streaming data, machine learning and graph analysis
- Be accessible through Jupyter Notebooks
- Integrate with a broad tool ecosystem
- Be easy to use, reliable, always on, fully managed, risk free and open, and offer pay-as-you-grow capability
Try out Spark for three months
You can start realizing these benefits today. Data science professionals are encouraged to sign up for a complimentary, three-month trial of IBM Analytics for Apache Spark and IBM Cloudant that offers a no-risk, pay-as-you-go on-ramp to the power of Spark.