Data Scientist: Bringing True Science into the Business Process

Big Data Evangelist, IBM

If you think “data scientist” is a pretentious title, think again. Nothing could be more fundamental to science, to engineering, and to the continuous optimization of modern business processes.

So, you may ask, what is true science? And what exactly is a scientist? How can data scientists live up to this lofty ideal? Just as important, what legitimate role, if any, is there for true science in the conduct of business operations? And who else is essential, in terms of adjacent roles, in order to realize the full potential of data science in business?

Let’s start with that initial question. True science is simply the human activity of building and testing interpretive frameworks through controlled observation of real-world phenomena. That can be purely theoretical (i.e., with no real-world application intended) or purely practical (i.e., science applied in an engineering context). Or it may involve varying blends of theoretical depth and practical application. In fact, many practitioners of business-oriented data science draw on the behavioral sciences (such as psychology, sociology and microeconomics) when exploring the factors that drive human behaviors such as customer churn, purchasing and recommending.

That takes us to the next question. Whether theoretical or practical in orientation, a true scientist is essentially an investigator. To be a scientist, you must investigate phenomena to whatever depth of factual validity is necessary. Put more broadly, your core job is threefold. You must build descriptive, explanatory and predictive models of real-world phenomena in some specific domain. You must employ controlled observation procedures to acquire valid data pertinent to those phenomena. And you must use that data to iteratively test, confirm and, if necessary, revise or discard those models.

No one doubts any of that. So let’s take this to the next level. Every scientist must be, at least in part, a data scientist. Without a fine-grained ability to sift, sort, structure, categorize, analyze and present data, the scientist can’t bring coherence to their inquiry into the factual substrate of reality. Just as critical, a scientist who hasn’t drilled down into the heart of their data—or done the deep statistical analyses needed to call out the correlations, causes and trends from their observations—can’t effectively present or defend their findings. When they’re using their data science skills—or collaborating with others who specialize in such things—scientists wield statistical algorithms, interactive exploration and advanced visualization tools to uncover non-obvious patterns in the data.

OK, fine, you say, traditional science has a clear dependence on data science. And you can, of course, do data science without any “pure science” (i.e., theoretical investigations without specific practical applications) intended. And data science can be true science as long as you adhere to a strict scientific discipline of data-driven hypothesis testing with procedural controls, independent verification, peer review and all of that.

So why do you need true science (of any degree of analytical/procedural rigor) in business? Well, it comes down to a fundamental question: How can you manage without a detailed command of how your external and internal environment actually works at all relevant scales? Specifically, you need a steady feed of verifiable facts and a nuanced understanding of all of the relevant predictive variables in order to understand how your actions might influence the future in your favor. You must have a fine-grained command of correlations, time series, trends, forecasts, segmentations and other patterns in order to understand the key causal factors at work.

All of that is fundamental to the vision of the data-driven business. And it all depends on your ability to collect, prepare, model and govern the data and analytic resources with scientific rigor. At the heart of any science are the controls that help you isolate the key explanatory factors from those with little or no impact on the dependent variables of greatest interest.

In addition to their modeling discipline, data scientists may also conduct real-world experiments. This emerging best practice involves data scientists making iterative changes to the analytic models and other decision logic artifacts that are embedded in operational applications and business process platforms. It also involves monitoring performance metrics with each run of the analytic-driven application in order to determine which specific piece of process logic—predictive analytic models, deterministic business rules, process orchestrations, etc.—contributed the most to desired outcomes. In this way, you can ensure a closed feedback loop in which processes are steadily and systematically improved from run to run, under the oversight of data scientists and process domain specialists.
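To make the idea concrete, here is a minimal sketch of such a run-by-run experiment, often called a champion/challenger test. It is an illustration under stated assumptions, not a description of any particular product: the variant names, the simulated success rates and the helper functions (run_experiment, best_variant) are all hypothetical, and a real deployment would record actual business outcomes rather than simulated ones.

```python
import random
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Running tally of outcomes for one piece of decision logic."""
    runs: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0


def run_experiment(variants, outcome_for, n_runs, seed=0):
    """Randomly assign each process run to a variant and record the outcome.

    Random assignment is the control: it keeps other factors from
    systematically favoring one variant over another.
    """
    rng = random.Random(seed)
    stats = {name: VariantStats() for name in variants}
    for _ in range(n_runs):
        name = rng.choice(variants)
        stats[name].runs += 1
        if outcome_for(name, rng):
            stats[name].successes += 1
    return stats


def best_variant(stats):
    """Return the variant with the highest observed success rate."""
    return max(stats, key=lambda name: stats[name].success_rate)


# Hypothetical "true" success rates, used only to simulate outcomes here.
ASSUMED_RATES = {"champion_model": 0.10, "challenger_model": 0.14}


def simulated_outcome(name, rng):
    """Stand-in for a real business outcome, e.g. whether an offer was accepted."""
    return rng.random() < ASSUMED_RATES[name]


if __name__ == "__main__":
    results = run_experiment(list(ASSUMED_RATES), simulated_outcome, n_runs=20_000)
    for name, s in results.items():
        print(f"{name}: {s.runs} runs, success rate {s.success_rate:.3f}")
    print("Promote:", best_variant(results))
```

In practice the winning variant would become the new champion and the loop would repeat with a fresh challenger, and the data scientist would check that the observed difference is statistically meaningful before promoting anything.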

So, clearly, data scientists are fundamental to these practices, but they needn’t go it alone. In a business environment, data scientists usually don’t work in isolation. As with any scientists, they rely on a wide range of people in adjacent roles to help them do their jobs as effectively as possible.

Think about science generally. In the historical development of modern science, the specialization of roles continues to proliferate. But today’s professional science establishment is a relatively recent phenomenon. Until the 20th century, most professional scientists had to build and maintain their own laboratories, invent and calibrate their own instruments, painstakingly record their own observations, and concoct and promote their own theories.

Today’s professional scientists—of which data scientists are a key category—have it much easier. Whether they work with particle accelerators or linear regression models, scientists know they don’t need to be their own chief cooks and bottle washers. They can make science their day job and rely on a host of others for all of the necessary supporting tools and infrastructure.

We find the following broad division of labor in all of today’s scientific disciplines, including data science:

  • Investigation. In their capacity as an investigator into statistical and predictive patterns in the data, the data scientist often works with subject-matter experts and business analysts. Their joint data explorations might be in marketing, sales, logistics, finance, fraud, process engineering or any other field in which the domain expert and data scientist are knowledgeable.
  • Instrumentation. A true scientist uses instrumentation suited to the phenomena that they’re observing, modeling, testing and measuring. Without statistical modeling, predictive analysis and other tools, data scientists would not have the pattern-finding instrumentation on which they rely. Likewise, the underlying platform components—including data warehousing, visualization, integration and governance tools—are key pieces of the instrumentation that data scientists need for exploring deep data. Somebody has to provide all of these tools of the data scientist’s trade, hence the exploding ecosystem of big data solution providers such as IBM.
  • Institution. And a true scientist needs to make a steady living focusing on their investigations. The institutions that employ them may be public or private sector, nonprofit or commercial. The institutions that help them communicate and collaborate with other scientists may be professional associations, journals or other forums. Right now in data science, we see a huge push toward open source models of collaboration. This is most obvious in the area of open source platform/tool-focused communities such as Apache Hadoop and R, but it’s the trend in all collective areas of human investigation.

Business-oriented data scientists may never receive a Nobel Prize for their work, but that doesn’t make them any less scientific. The prize for their hard work is obvious: business success.

So, do you have any true data scientists in your organization?

