Data Scientists: Key Programmers in the Convergence of Big Data, Cloud, Streaming and Internet of Things

Big Data Evangelist, IBM

You don’t often think of data scientists as “programmers,” per se, but they are the pivotal application developers in the age of big data.

What, after all, is a “programmer”? Wikipedia’s definition of “programming” is instructive: “the comprehensive process that leads from an original formulation of a computing problem to executable programs.”

The assumption that there’s a discrete class of “computing problems,” as distinct from “business problems,” is a bit misleading. The people who grapple with problems related to fundamental computing architectures are “computer scientists.” The people who write the code that harnesses some computing architecture to address business problems (e.g., payroll processing, order fulfillment, data management, etc.) are “computer programmers.”

What a programmer does, fundamentally, is specify the executable logic that underlies some computer-driven approach for addressing a business problem. Looked at in a business process management context, data scientists and other programmers are aligned with their counterparts who specify the process logic that drives orchestration, workflow, choreography, routing, or whatever else you want to call it.

The IT industry has proliferated diverse quasi-synonyms for the same phenomenon: the logic-driven flow of content, context and control throughout a distributed business process. This core definition of logic-flow developer is agnostic to such concerns as where the execution engines reside in logic-driven processes (intermediaries, endpoints, or a mix of both), what the endpoints of this process might be (databases, applications, humans, devices, or what have you), and so forth.

All of that explains why I’m not averse to categorizing data scientists, in a business context, as another type of programmer. Even if they’re only doing exploratory R&D into deep data sets, they are focused on tackling specific business problems. And to the extent they’re developing statistical models for deployment into marketing, manufacturing and other operational business applications, they are essentially specifying the structured, repeatable logic that drives business computing applications.

The key practical difference between data scientists and other programmers, including those who develop orchestration logic, is that the former specify logic grounded in non-deterministic patterns (i.e., statistical models derived from propensities revealed inductively from historical data), whereas the latter specify logic whose basis is predetermined (i.e., if/then/else, case-based and other rules, procedural and/or declarative, that were deduced from functional analysis of some problem domain).
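
To make that contrast concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the 1000.0 cut-off and the toy history are hypothetical, and the "model" is just a one-variable threshold chosen to fit past records, standing in for a real statistical model.

```python
from typing import List, Tuple

def rule_based_flag(amount: float) -> bool:
    """Predetermined logic: an analyst deduced the cut-off and hard-coded it."""
    return amount > 1000.0  # hypothetical business rule

def learn_threshold(history: List[Tuple[float, bool]]) -> float:
    """Inductive logic: choose the cut-off that misclassifies the fewest
    historical records (a one-variable decision stump)."""
    def errors(threshold: float) -> int:
        return sum((amount > threshold) != label for amount, label in history)
    candidates = sorted(amount for amount, _ in history)
    return min(candidates, key=errors)

# Hypothetical labeled history: (transaction amount, was it flagged in hindsight?)
history = [
    (120.0, False), (300.0, False), (450.0, False),
    (700.0, True), (900.0, True), (1500.0, True),
]

threshold = learn_threshold(history)

def learned_flag(amount: float) -> bool:
    """Same shape as the rule, but the cut-off came from the data."""
    return amount > threshold

print(threshold)               # 450.0
print(rule_based_flag(700.0))  # False
print(learned_flag(700.0))     # True
```

Both functions end up as repeatable, executable logic. The difference is where the cut-off comes from: one was deduced up front, the other would shift if the historical data changed.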

The practical distinctions between data scientists and other programmers have always been a bit fuzzy, and they’re growing even blurrier over time. For starters, even a cursory glance at programming paradigms shows that core analytic functions—data handling and calculation—have always been the heart of programming. For another, many business applications leverage statistical analyses and other data-science models to drive transactional and other functions.

Furthermore, data scientists and other developers use a common set of programming languages, a fact driven home by this recent article. As the article states, the “top programming languages for analytics, data mining and data science” (in descending order) are R, Python and SQL. Of these, only R is specific to statistical modeling, whereas Python is general-purpose and SQL is used across virtually every kind of data-driven application. It’s no surprise that the highest growth in adoption is for Pig and Hive, which are key to MapReduce development in Hadoop environments, but it’s important to note that these compile down to MapReduce jobs on Hadoop’s Java-based runtime and can be extended with user-defined functions written in Java, Python, Ruby and other general-purpose programming languages.

Of course, data scientists differ from most other types of programmers in various ways that go beyond the deterministic vs. non-deterministic logic distinction mentioned above:

  • Data scientists have adopted analytic domain-specific languages such as R, SAS, SPSS and Matlab.
  • Data scientists specialize in business problems that are best addressed with statistical analysis.
  • Data scientists are often more aligned with specific business-application domains—such as marketing campaign optimization and financial risk mitigation—than the traditional programmer.

These distinctions primarily apply to what you might call the “classic” data scientist, such as multivariate statistical analysts and data mining professionals. But the notion of a “classic” data scientist might be rapidly fading away in the big-data era as more traditional programmers need some grounding in statistical modeling in order to do their jobs effectively—or, at the very least, need to collaborate productively with statistical modelers.

The convergence of disparate programming paradigms seems to be coming to a head, with data science at its heart. That thought came to me recently when I read a great IEEE Spectrum article on some new approach called “biologically inspired routing” in a “cognitive net.” In the article, author Antonio Liotta, professor of Communication Networks at Eindhoven (Netherlands) University of Technology, describes a network-routing-optimization approach that seems to depend on data-scientific development skills. It’s an approach that seems well-suited for continuous optimization of the complex logic flows that might bind big data analytics, cloud services, stream computing, and Internet of Things into a grand unified architecture.

The key tenet of biologically inspired routing, which merits consideration as a new general-purpose programming paradigm, is as follows: intelligent routing/forwarding can be executed on any and all network nodes, hardware and/or software, including both infrastructure nodes (e.g., traditional routers) and resource-constrained endpoints such as smartphones, environmental sensors and autonomous vehicles. With that in mind, the essential distributed flow pattern is as follows:

  • Embedded routing/forwarding engines in each node determine the best paths to queue and move data packets to their destinations.
  • Each engine continually analyzes environmental and control signals from its own device, along with signals fed from engines on network-adjacent devices.
  • Each engine has an autonomic controller that continuously absorbs environmental metrics, scores its own performance metrics, shares network knowledge with peers on adjacent devices, and uses data-driven machine learning—aka cognitive computing—to adjust its own execution plan accordingly.
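
The flow pattern above can be sketched in a few lines of Python. This is not Liotta's algorithm: the class and method names are hypothetical, and two simple, well-known techniques are swapped in as stand-ins for the learning step, namely an exponential moving average to score observed latency and epsilon-greedy exploration for trial-and-error learning.

```python
import random
from typing import Dict, List, Optional

class RoutingEngine:
    """Hypothetical per-node engine that learns which neighbor to forward to."""

    def __init__(self, neighbors: List[str], alpha: float = 0.5,
                 epsilon: float = 0.1, seed: Optional[int] = None) -> None:
        # Latency estimates in ms; starting at 0.0 is optimistic, so untried
        # neighbors look attractive until they have been measured.
        self.estimates: Dict[str, float] = {n: 0.0 for n in neighbors}
        self.alpha = alpha      # weight given to the newest observation
        self.epsilon = epsilon  # probability of exploring a random neighbor
        self.rng = random.Random(seed)

    def choose_next_hop(self) -> str:
        """Usually pick the neighbor with the lowest estimate; sometimes explore."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.estimates))
        return min(self.estimates, key=self.estimates.get)

    def observe(self, hop: str, latency_ms: float) -> None:
        """Score own performance: fold a measured latency into the estimate."""
        old = self.estimates[hop]
        self.estimates[hop] = (1 - self.alpha) * old + self.alpha * latency_ms

    def absorb_peer_estimates(self, peer: Dict[str, float],
                              trust: float = 0.25) -> None:
        """Share knowledge: blend in estimates reported by an adjacent engine."""
        for hop, estimate in peer.items():
            if hop in self.estimates:
                self.estimates[hop] = (1 - trust) * self.estimates[hop] + trust * estimate

# epsilon=0.0 disables exploration so this walk-through is repeatable.
engine = RoutingEngine(["a", "b"], alpha=0.5, epsilon=0.0)
engine.observe("a", 40.0)
engine.observe("b", 10.0)
first_choice = engine.choose_next_hop()   # "b" is currently faster
engine.observe("b", 90.0)                 # "b" becomes congested
second_choice = engine.choose_next_hop()  # the engine switches to "a"
engine.absorb_peer_estimates({"a": 100.0})
print(first_choice, second_choice, engine.estimates)
```

No path is hard-coded here. The forwarding decision is ordinary control flow, but what drives it is a set of estimates that keeps changing with the data each node sees and hears from its peers.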

Clearly, this approach blurs the distinction between deterministic and non-deterministic flow of control in a distributed network-based application. Liotta alludes to the breadth of this approach’s potential new-age applications in a passage that ends on a note that warms the hearts of us IBM-ers (well, I’ll speak for myself, at least):

  • “Scientists don’t know nearly enough about natural cognition to mimic it exactly. But advances in the field of machine learning—including pattern-recognition algorithms, statistical inference and trial-and-error learning techniques—are proving to be useful tools for network engineers. With these tools, it’s possible to create an Internet that can learn to juggle unfamiliar data flows or fight new malware attacks in a manner similar to the way a single computer might learn to recognize junk mail or play ‘Jeopardy!’”

Be that as it may, none of these sophisticated machine-learning algorithms will spontaneously emerge from the primordial goo of this converged new age. Under any foreseeable scenario, you will still need data scientists to build, monitor and tune all those sophisticated models.

In other words, you need new-age programmers with the ingenuity and skills of top-notch data scientists.