An exciting data science experience on Hortonworks Data Platform

Director, Product Management, Hortonworks
Program Director, Offering Management, IBM

On June 13, 2017, Hortonworks and IBM announced an extension of our partnership. A key part of this partnership is the collaboration on IBM Data Science Experience (DSX). This collaboration is a win-win: it brings a production-ready, full-cycle data science experience to Hortonworks Data Platform (HDP) customers, and it gives DSX customers access to information stored within HDP data lakes, backed by an enterprise-grade compute grid.

For most businesses, data is a key competitive differentiator, and data science is increasingly employed to make full use of that data. Yet data science remains a complex undertaking. Data scientists are asked to excel in multiple complex disciplines, from data engineering and statistics to business domain knowledge. This challenge is substantial enough on small data; at the scale of big data it becomes harder still.

At Hortonworks we’re excited about DSX, which supports the complete data science lifecycle. DSX helps data scientists bring their familiar tools such as Jupyter and RStudio, wrangle data, create complex machine learning models, and deploy these models to production.

From Small Data, Small Learning To Big Data, Big Learning

Many machine learning (ML) models work well with down sampling and with small data. But increasingly, a large class of problems needs big data for better predictions. Deep learning, for example, is more effective with big data. The combination of big data and big compute, provided by big data platforms such as HDP with Data Science Experience, will unleash big learning: it will make data science more accessible and scalable, and it will leverage all of the enterprise data to make more accurate predictions.

Easier Data Science on Big Data

An increasing number of Hortonworks customers are moving to data science. Our customers leverage HDP to deliver machine learning use cases ranging from churn prediction and predictive maintenance to optimizing product placement and store layout.

Until now there has not been a unified tool for the complete data science lifecycle. Data science on big data meant struggling with data movement, Kerberos, notebook setup, and feature engineering across a plethora of tools, along with ad hoc collaboration and no standard tool to deploy machine learning to production. Data science is also such a fast-moving field that many practitioners struggle to keep up with the latest advances.

DSX addresses the entire data science lifecycle. It provides a choice of notebooks, collaboration features, tutorials, and the ability to deploy machine learning to production with Spark, R, Python, and other ML tools and languages.

Our customers will now be able to leverage the compute provided by Apache Hadoop YARN to make more accurate predictions with all of the data stored in their enterprise data lake. 

DSX already includes RStudio and Jupyter. IBM and Hortonworks are working together to include Apache Zeppelin within DSX.

Apache Zeppelin will continue to be a part of HDP and we will increase our investments in Zeppelin and data science to offer a more robust and feature-rich platform. 

How to get started

Talk to your Hortonworks or IBM account team about DSX and ask to see a demo. In the next few months we’ll work to improve the DSX integration with HDP and deliver the first Technical Preview of DSX with HDP. This technical preview will provide a way to evaluate DSX with HDP in a non-production environment. Later this year we will offer support for DSX with a production HDP environment.

Conclusion

The golden age of data science is coming. DSX, now on HDP, offers the industry’s leading platform for data science on big data. We want to thank our customers and IBM for making this possible. We are very excited about the future and look forward to working with our customers and IBM to deliver the best data science experience on big data.

Visit these pages to learn more about DSX and HDP.

See DSX and HDP at the Strata Data Conference, September 25-28, 2017, in New York City, New York.