Data science in the cloud
A year ago, I worked with large amounts of data stored on a server and on my laptop. Most of the data I used was in the form of text or binary files that were linked together with lookup tables. This setup worked because I was the only one using the data, but it also meant I had to be very careful not to make any mistakes in the data analyses.
Since then a lot has changed. First of all, I changed jobs, from research fellow in academia to developer advocate for IBM Cloud Data Services. I made this change because I was fascinated by the idea of working with, and learning more about, data science in the cloud. As a developer advocate, I now get to play with all the new tools that IBM offers and to write and talk about them.
The basics of data science have not changed, but the tools I use have. I am still using Python most of the time, but where I store my data has become much simpler. For me, a typical workflow goes through the following steps:
- Define the question
- Find the data
- Explore the data and find the best tools for the analysis
- Clean and store the data
- Visualize and summarize the cleaned data
- Present the results
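The steps above can be sketched as a tiny pipeline. This is only an illustration of the shape of the workflow; the function names and the toy weather records are made up for this example.

```python
# A minimal sketch of the workflow: find data, clean it, summarize it.
# The data and function names are illustrative, not from a real project.

def find_data():
    # In practice this would be an API call or a database query.
    return [{"city": "Brussels", "temp_c": 14}, {"city": "Austin", "temp_c": None}]

def clean(records):
    # Drop records that have missing values.
    return [r for r in records if all(v is not None for v in r.values())]

def summarize(records):
    temps = [r["temp_c"] for r in records]
    return {"n": len(temps), "mean_temp_c": sum(temps) / len(temps)}

print(summarize(clean(find_data())))
```

Each step feeds the next, which is also why the order matters: summarizing before cleaning would let missing values and outliers distort the results.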
For me, the first step, defining the question, means coming up with examples that show how to use new techniques and tools, or how to work with interesting data sets. This is different when you work for a company that needs to solve problems or wants insights from its data, to increase sales, for instance, or to understand customer behavior. But the workflow is pretty much the same.
When you know what question you want to answer, it is time to look for the right data, which can come in any format and size and from many different sources. After a first, quick exploration of the data, you have to decide how to use it and, if needed, where to store it. Often I use an application programming interface (API) to collect data and then, if the data is structured, store it in a data warehouse such as dashDB. When the data is unstructured, Cloudant or one of the Compose databases is an easy place to store it. All these databases have APIs that you can integrate into your code.
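As a concrete illustration of that collect-and-store step, here is a hedged sketch that pulls JSON records from an API with the requests package and writes each one as a document to Cloudant through its CouchDB-style HTTP API. The source URL, account name, and database name are placeholders, and real code would also need authentication.

```python
# Sketch: collect JSON from an API and store each record in Cloudant.
# The URLs below are placeholders, not real endpoints.
import requests

API_URL = "https://example.com/api/measurements"    # hypothetical source API
CLOUDANT_DB = "https://ACCOUNT.cloudant.com/mydb"   # hypothetical database URL

def to_document(record, source):
    # Wrap the raw record and add simple provenance metadata.
    return {"source": source, "payload": record}

def collect_and_store():
    records = requests.get(API_URL).json()
    for record in records:
        # POSTing to the database creates a document with a
        # server-generated _id (standard CouchDB-style behavior).
        resp = requests.post(CLOUDANT_DB, json=to_document(record, API_URL))
        resp.raise_for_status()
```

Keeping the raw payload inside each document, rather than reshaping it on the way in, makes it easier to redo the cleaning later if you change your mind about the format.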
A time-consuming step is often the cleaning and munging of the data into a format that you can use for the analysis. Things to sort out here are missing values, outliers, strange file formats and more. In a blog post I wrote earlier in 2016, I discussed some examples of how to do this with the pandas package in Python.
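A small example in that spirit, with made-up temperature readings: treat a sentinel value as missing and then drop the rows without a valid reading.

```python
# Cleaning with pandas: handle missing values and a sentinel outlier.
# The station readings are toy data for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "temp_c": [14.2, np.nan, 15.1, 999.0],  # NaN = missing, 999.0 = sentinel
})

# Treat the 999.0 sentinel as missing, then drop rows without a valid reading.
df["temp_c"] = df["temp_c"].replace(999.0, np.nan)
clean = df.dropna(subset=["temp_c"])
print(clean)
```

Whether you drop, fill, or flag such values depends on the question you defined in the first step; the important part is doing it explicitly so the choice is visible in the code.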
The fun part, the analysis of the data, begins when you have it all cleaned up and stored in a database. For me, the analysis is a very interactive process of trying out different statistical methods and different ways of presenting the data in graphs and maps. The analysis can be done in the cloud as well, for instance in the IBM Data Science Experience, where you can use Apache Spark for data analysis in Python and Scala notebooks or in RStudio. Collaborating and sharing your code is also easy there.
The Python notebooks are my favorite for data analysis. The main packages I use are Spark and requests, to load, select and summarize data from the various cloud databases, and matplotlib to make figures, all in Python. Recently, I also started using a new Python package called pixiedust, developed by the developer advocacy team, to quickly make different kinds of figures. Have a look at it; it is very easy to use and play with.
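For completeness, here is what the matplotlib part of that toolkit looks like in its simplest form, again with made-up daily temperatures. The off-screen backend line is only needed outside a notebook.

```python
# A small matplotlib sketch of the figure-making step (toy data).
import matplotlib
matplotlib.use("Agg")  # off-screen backend; in a notebook you can skip this
import matplotlib.pyplot as plt

temps_c = [14.2, 15.1, 13.8, 16.0]
fig, ax = plt.subplots()
ax.plot(range(len(temps_c)), temps_c, marker="o")
ax.set_xlabel("day")
ax.set_ylabel("temperature (C)")
fig.savefig("daily_temps.png")
```

In a notebook the figure renders inline, which keeps the plot right next to the code and the prose that explains it.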
The final step is presenting your results in a report, presentation or blog. Here, I always try to incorporate the foregoing steps to create an appealing story.
I would love to learn and hear more about the workflow of other data scientists; let me know how you work. I can be reached on Twitter @margrietgr. Or, let’s have a chat at the IBM booth at the Spark Summit, 25–27 October 2016, in Brussels.