What it really takes to be a data scientist
As a long-standing statistical modeling professional, I’ve seen what I do for a living go by many names. In the past few years, the term data science has come into vogue. I know a lot of definitions are out there, but here’s a simple one. I define data science as the ability to predict the near-term future or identify otherwise unknown facts about the present based on patterns from what has happened in the recent past.
The science of patterns and data mining
Data science is not a crystal ball; it won’t predict a very rare event or something that has never happened before. It also won’t provide accurate predictions if patterns change radically; the inability of models to predict the financial crisis of 2008 is a perfect example. In most cases, though, patterns of behavior do not change radically in a short amount of time. This characteristic means that a computer using advanced mathematical techniques can discern the patterns in data and create a model: a set of equations or rules that describes those patterns.
Flowing new data through the model is known as deploying the model, and dozens of use cases exist, from marketing to predictive maintenance to fraud detection to recruiting and retention and many others. The value found from deploying machine learning models has often been enormous, with paybacks in savings many times the initial investment. When I started in this field a number of years ago, this capability was enabled through data mining workbenches, and the people who deployed these models were known as data miners. Their job was to mine data for these otherwise hidden patterns.
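To make the fit-then-deploy cycle concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the transaction amounts are made up, the function names fit_threshold_rule and deploy are hypothetical, and a single-threshold rule stands in for the far richer techniques that a real workbench or library would apply.

```python
# Hypothetical historical records: (transaction_amount, was_fraud).
# The values are invented for illustration only.
history = [
    (12.0, False), (25.5, False), (40.0, False), (310.0, True),
    (18.0, False), (450.0, True), (95.0, False), (520.0, True),
]


def fit_threshold_rule(records):
    """Learn a rule of the form 'flag if amount > t' from past records.

    Picks the threshold t that misclassifies the fewest historical records.
    """
    amounts = sorted({amount for amount, _ in records})
    # Candidate thresholds: midpoints between consecutive distinct amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]

    def errors(t):
        # Count records where the rule's prediction disagrees with the label.
        return sum((amount > t) != label for amount, label in records)

    return min(candidates, key=errors)


def deploy(threshold, new_amounts):
    """Flow new, unlabeled data through the learned rule."""
    return [amount > threshold for amount in new_amounts]


threshold = fit_threshold_rule(history)   # the "model" is just this one number
flags = deploy(threshold, [30.0, 600.0])  # predictions for new transactions
print(threshold, flags)
```

The learned rule only reflects the pattern present in the historical records. If fraudsters changed their behavior radically, the threshold would stop being useful, which is the same limitation described above.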
Unfortunately, data mining acquired a negative connotation (it was mistakenly confused with data collection), and thus the industry started to refer to these practices as predictive analytics. If you are not an expert, you may simply treat these three terms (data mining, predictive analytics and data science) as having the same functional scope. Consequently, you may regard a tool such as IBM SPSS Modeler as a data science tool, a predictive analytics tool and a data mining workbench all at once.
To avoid confusion, I prefer to use the term data science to refer to them all, with a growing focus on the use of the machine learning capabilities of these tools. By the same token, referring to their core users as data scientists seems to work best. Rather than quibble over definitions, we instead need to concern ourselves with what data scientists do for a living and, consequently, what they need in their tools, platforms and collaboration environments to be as effective as possible at their jobs.
In that regard, I’d like to focus on the core issue of the primary skills that a professional data scientist needs. Some contend that data science can only be done through code and that a data scientist has to be an expert programmer and even have a Ph.D. However, I think many ways to approach data science are available.
As easy as flying an airplane
I like to use an analogy that I think aptly describes what is going on with data science. Consider flying a small airplane. Being an airplane pilot requires knowledge and expertise; one does not simply get into an airplane with no experience or knowledge and try to fly it. Ideally, a pilot receives both classroom training on the principles of flight and practical experience under supervision using all the buttons, levers, and instrumentation to fly the plane correctly.
What if a job description for a trained pilot were not just about the license or flight experience but also about knowing how to design a new type of airplane from fundamental principles? And what if that job description also required knowing how to actually put together an airplane, either from raw materials or from certain types of prebuilt modules? I bet most people would say being an airplane pilot should not require knowing how to design a new type of airplane or knowing how to build an airplane.
I would argue that doing data science in practice is a lot like being an airplane pilot. The data scientist needs knowledge and expertise to know what to do, how to prepare data and how to create models. However, what is really needed is knowing which techniques to use. In the vast majority of cases, the data scientist can take advantage of existing machine learning techniques and does not need a Ph.D., which is necessary to create new techniques.
And depending on the person, the data scientist may find it faster to use sophisticated software that already has a lot of capabilities than to build whole new software through code. Therefore, I would argue that restricting data science to Ph.D.s who know how to code artificially limits who can take advantage of this field. Indeed, many organizations that have deployed models created in IBM SPSS Modeler have obtained tremendous returns on investment (ROI) from their results without having to use code to create the models or have Ph.D.s work on the efforts.
The Data Science Experience
In recent years, open source code in Python and R has garnered a lot of popularity because these languages are quite flexible and powerful, and no software purchase is required to work in them. However, not everyone is a programmer or wants to program. In my own case, I first learned about statistics using a code-based software package. I found myself frustrated by the time required to properly create the code and ensure it was working correctly.
When I was first exposed to what became IBM SPSS Modeler, I found that it was much easier and quicker to get my needed analysis done than it had been by writing code. IBM announced IBM Data Science Experience (DSX) earlier this year as a way for data scientist coders to collaborate and work on data science programs in the most efficient way possible. What if the collaboration could be extended to the data scientist or analyst who wants to build predictive models but without code? DSX enables data scientists to choose their own preferred way to tackle the problem.
Gain more information on DSX and how to participate in the open beta. And to hear what Jane Hendricks and I have to say on this topic, please register for our sessions—1693A or 1693B—at IBM Insight at World of Watson 2016.