Accelerating data applications with Jupyter Notebooks, Hadoop and Spark
An interview with Chris Snow
Chris Snow is a data and application architect at IBM who loves helping customers with their data architectures. With over 20 years of IT experience working with business stakeholders at all levels, Snow’s experience spans banking, government, insurance, manufacturing, retail and telecommunications. He is currently focused on IBM Cloud Data Services and emerging technologies such as big data streaming architectures. In his spare time, Snow leads, and contributes to, an open source project that provides executable examples for IBM BigInsights for Apache Hadoop. The examples kick-start your BigInsights projects, allowing you to move at warp speed on your big data use cases. The project can be found on GitHub.
The last time we spoke with you, we learned about your app development using BigInsights on Cloud and how it transforms the way you deliver new apps. Today, we are going to talk about your use of Jupyter Notebooks. How have notebooks transformed the way people do, and deliver, advanced analytics and data science?
One of the key roles of today’s data scientists and other data practitioners is exploratory analysis—handling data sets that you don’t know much about and understanding what they are and what’s potentially valuable about them. Sometimes new data scientists and practitioners have a tendency to jump straight in and start running their data through a machine learning algorithm. However, the first and most important thing to do when you get hold of a new data set is to make some plots and visualize the data.
A famous example is Anscombe’s quartet—four data sets that have almost identical statistical properties when you look at their standard summary statistics such as mean, variance, x/y correlation and linear regression. But when you plot the data, you realize that the four data sets are completely different.
That’s why visualization is so important; if you can see the shape of the data, you can start drawing some useful conclusions about what it means. And that’s the first reason why Jupyter Notebooks are such a great tool. Notebooks give people who work with data the ability to plot and visualize data sets very quickly and easily, see if they are skewed or have other problems, and decide how to work with the data to turn it into something valuable.
Those illustrations are quite helpful. What are the other advantages?
Notebooks have the potential to change the whole culture around the way data practitioners and data scientists report their results because they help you prove that your methods are sound and your work is reproducible. The code itself is all embedded in the notebook, where it can be audited and rerun against your data by other data users—and that capability means anyone who needs to work with the data, not just data scientists. Also, the notebooks are self-documenting because you can add narrative sections to explain what you did and why. That makes it easy for readers to follow your logic and check your methods and assumptions.
“Full integration between Jupyter Notebooks, Spark and Hadoop will enable you to build Spark jobs in a Notebook and run them directly against data in Hadoop. It’s going to be a huge step forward, because it will unlock the power of Notebooks on truly large datasets for the first time. That’s not just going to make life easier for data practitioners and data scientists—it is going to fundamentally change the types of analysis they do, and allow them to be much more ambitious and innovative.” —Chris Snow
Looking at a slightly longer-term view, notebooks very likely are going to spread far beyond the traditional domain of data science and data engineering. In the future, I think we’re going to see all kinds of business users benefiting from some type of Jupyter Notebooks application.
We’re hearing that even line-of-business (LOB) users will work in Jupyter Notebooks environments. Moving on to the use of Spark, how does Spark add to the Notebooks experience?
The problem with some of the standard programming tools used in Jupyter Notebooks, such as Python and R, is that they don’t scale very well over multiple cores or processors. If you’re running a notebook against a data set that fits on your laptop, that’s not really a problem. But if you’re working with a larger data set, or you’re doing something that involves a more sophisticated algorithm such as training a machine learning model, you’re quickly going to run into the limitations of the hardware, whether that limitation is in storage, RAM, CPU speed or all three.
Spark solves that problem. It distributes your data set across a cluster of computers, and using the processing power of each node of the cluster, it runs the algorithm on each part of the data set and then collects and collates the results. So you can do enormously complex calculations on extremely large amounts of data very quickly.
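The split, apply and collate pattern Snow describes can be sketched in plain Python. This is only an analogy: threads within one process stand in for the nodes of a Spark cluster, and `distributed_sum_sq` is a hypothetical stand-in for a real Spark job, which would run the same per-partition computation across machines.

```python
# Toy sketch of Spark's execution model: partition the data, run
# the same computation on each partition in parallel, then collate
# the partial results. Worker threads stand in for cluster nodes;
# real Spark distributes the partitions across machines.
from concurrent.futures import ThreadPoolExecutor

def partial_sum_sq(partition):
    # the "algorithm" applied to each slice of the data set
    return sum(x * x for x in partition)

def distributed_sum_sq(data, n_partitions=4):
    # split the data set into roughly equal partitions
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = pool.map(partial_sum_sq, partitions)  # run on each "node"
    return sum(partials)                                 # collect and collate

data = list(range(1, 1001))
print(distributed_sum_sq(data))  # → 333833500, same as sum(x*x for x in data)
```

The key property, which the sketch preserves, is that each partition is processed independently and only the small partial results travel back to be combined.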
“Jupyter Notebooks are such a great tool—they give people who work with data the ability to plot and visualize datasets very quickly and easily, see if they are skewed or have other problems, and decide how to work with the data to turn it into something valuable.” —Chris Snow
Managed Hadoop services integrated with Spark, such as IBM BigInsights on Cloud, make this solution even easier because you don’t have to set up and manage your own Spark cluster. You can just upload your data and use a prebuilt Spark cluster on a software-as-a-service [SaaS] model, so you can focus on the analysis and let someone else worry about the infrastructure.
The commercial aspects of BigInsights on Cloud also make it straightforward to get started: the basic plan is a simple pay-as-you-go model, and a more comprehensive enterprise subscription is available for larger organizations that want to take their usage of these technologies to the next level.
How about if you’ve got a big data set in Hadoop; can Spark and Jupyter Notebooks be used?
Yes, you can load a data set from Hadoop into a Spark cluster and run a notebook on top, just like you can with any other data source. But what we’re really excited about is what’s coming next for BigInsights on Cloud: full integration among Jupyter Notebooks, Hadoop and Spark. It is expected to enable you to build Spark jobs in a notebook and run them directly against data in Hadoop without needing to move the data at all.
“Spark enables you to do enormously complex calculations on extremely large amounts of data very quickly. Spark integrated with Hadoop, or Hadoop as a service, makes this even easier. …Once we make the connection between Notebooks, Spark and Hadoop, it could be the tipping point for truly wider adoption of big data technologies.” —Chris Snow
One of the big advantages of a big data architecture such as Hadoop is that it eliminates loading times because you’re moving the analytics engine to where the data is, rather than bringing the data to the engine. If you have a huge data set on Hadoop, you want to be able to take advantage of that data locality by distributing Spark jobs to each node of the cluster and running them on each node’s data. IBM already makes this process possible through the Spark engine within its Hadoop distribution, IBM BigInsights. However, at the moment, to use Spark on BigInsights, you need to use a command-line interface [CLI] rather than a notebook. For data scientists, that approach is a little bit cumbersome compared to the instant insight they get from a notebook.
When we achieve full integration, what will that mean for data practitioners and people who are responsible for analytics delivery?
It’s going to be a huge step forward because for the first time it will really unlock the power of Jupyter Notebooks on truly large data sets. Instead of taking a small sample of a huge data set and moving it into Spark for analysis, or using complicated command-line instructions to look at the whole data set, you can explore the data set in its entirety from the ease and comfort of the Jupyter Notebooks interface.
That’s not just going to make life easier for data practitioners; it is going to fundamentally change the types of analyses they do, and allow them to be much more ambitious and innovative. In fields such as machine learning, it’s also likely to make good results easier to achieve, with less need to develop clever algorithms. Research suggests that you generally get better results from looking at a larger data set with a simple algorithm than you do from a smaller data set with a more sophisticated one.
And by making exploration of big data easy, it is very likely going to open up opportunities for people who don’t have a formal data science or big data background to get involved in this kind of work. One of the big limiting factors in the uptake of Hadoop and Spark at the moment is that the skill sets required are relatively rare, and finding the right people is costly. Notebooks offer a gateway into big data that helps traditional business analysts take their first steps in exploring the value of types of data that they have never been able to analyze before. Once we make the connection among Jupyter Notebooks, Hadoop and Spark, that could very well be the tipping point for truly wider adoption of big data technologies.
If I’m potentially one of these new fledgling notebook users who wants to find out more about how IBM is bridging the gap between Jupyter Notebooks and Hadoop, where can I go to learn more?
First off, thousands of notebooks are available on GitHub. You can download and explore them to get an idea of the potential. Plus I’ll be doing a demo to show the potential of Jupyter Notebooks on BigInsights at IBM Insight at World of Watson 2016, 24–27 October 2016, at Mandalay Bay in Las Vegas, Nevada. If you’re planning to attend the conference, I’d love to see you there. And you can visit our websites for BigInsights and Spark to learn more about our commercial offerings around Hadoop and Spark.