Boosting the productivity of the next-generation data scientist

Big Data Evangelist, IBM

Data science has become an axis of the 21st-century economy, and to no one’s surprise: Data scientists are key developers in the era of cognitive computing and open data. Indeed, their core focus is on developing repeatable data application artifacts—for example, machine learning and other statistical models—for deployment within always-on business environments.

Boosting data scientists’ productivity

Data scientists’ productivity can get a boost from teams that offer the right mix of individuals with diverse aptitudes, skills and roles. To deliver that boost, such teams must build processes and collaboration environments that help accelerate repeatable pipelines of patterned tasks across the data science lifecycle. Such tasks range from those that are largely manual (building statistical models, visualizing their performance against real-world data and explaining their results) to those that can be largely automated. The latter include such traditionally labor-intensive tasks as data discovery, profiling, sampling and preparation, as well as model building, scoring and deployment.

Open-source efforts are beginning to address the need for standardized machine learning pipelines to boost the productivity of data scientists within complex team environments. For example, O’Reilly’s Ben Lorica has discussed one such effort that aims to provide the following productivity benefits:

  • Automated machine learning ingestion and analysis of new data types—especially the image, audio and video content that is so fundamental to the streaming media and cognitive computing revolutions—through sample pipelines for computer vision and speech as well as data loaders for other data types
  • Scaled machine learning algorithm execution on massively parallel Spark-based runtimes, accelerating the training, iteration and refinement of sophisticated models for vision, speech and other media data
  • Streamlined end-to-end machine learning processing across myriad steps (data input through model training and deployment) and diverse tools and platforms through support for a standard API
  • Richly multifunctional machine learning processing pipelines through extensible incorporation of diverse data loaders, memory allocators, featurizers, optimizers and libraries, among other components
  • Benchmarking of the results of machine learning projects, in keeping with well-defined error bounds, to enable iterative refinement and reproducibility of model and algorithm performance
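
To make the idea of a standard pipeline API more concrete, here is a minimal sketch in plain Python. It is not the API of the effort Lorica discussed or of any real library: the Pipeline class, its and_then method, the three stages and the tiny held-out set are all hypothetical names and data invented for illustration. It shows only the general pattern of chaining interchangeable stages and then benchmarking the assembled pipeline against labeled data.

```python
from typing import Callable, List, Sequence, Tuple


class Pipeline:
    """Hypothetical chain of single-argument stages, applied left to right."""

    def __init__(self, stages: Sequence[Callable] = ()):
        self.stages = list(stages)

    def and_then(self, stage: Callable) -> "Pipeline":
        # Return a new pipeline so partial pipelines can be shared and reused.
        return Pipeline(self.stages + [stage])

    def __call__(self, item):
        for stage in self.stages:
            item = stage(item)
        return item


def tokenize(text: str) -> List[str]:
    """Loader/featurizer step: lowercase the text and split on whitespace."""
    return text.lower().split()


def keyword_score(tokens: List[str]) -> int:
    """Featurizer step: positive keyword count minus negative keyword count."""
    positive = {"good", "great", "fast"}
    negative = {"bad", "slow", "broken"}
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)


def threshold(score: int) -> str:
    """Stand-in 'model': label by the sign of the score (illustrative rule only)."""
    return "pos" if score > 0 else "neg"


def error_rate(pipeline: Callable, labeled: Sequence[Tuple[str, str]]) -> float:
    """Benchmark step: fraction of labeled examples the pipeline gets wrong."""
    wrong = sum(pipeline(x) != y for x, y in labeled)
    return wrong / len(labeled)


classifier = Pipeline().and_then(tokenize).and_then(keyword_score).and_then(threshold)

# Made-up held-out examples; the last one is deliberately hard for a keyword rule.
held_out = [
    ("great and fast service", "pos"),
    ("slow and broken build", "neg"),
    ("good docs", "pos"),
    ("not bad", "pos"),
]

print(error_rate(classifier, held_out))  # 0.25: only "not bad" is misclassified
```

Because every stage is an ordinary callable, a team could swap in a different featurizer or model without touching the rest of the chain, and the same error_rate check could be rerun after each change to keep results comparable.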

Empowering modern data science teams

Open-source tools—ranging from Spark, R and Python to Hadoop and Kafka and beyond—form the foundation of modern data science teams. Boosting the productivity of teams of data scientists, however, requires that all specialists—statistical modelers, data engineers, data application developers and subject matter experts alike—share an open-source productivity environment that allows them to do the following:

  • Access data from diverse data lakes, big data clusters, cloud data services and more.
  • Discover, acquire, aggregate, curate, prepare, pipeline, model and visualize complex, multistructured data.
  • Prototype and program data applications in Spark, R, Python and other languages for execution in in-memory, streaming and other low-latency runtime environments.
  • Tap into rich library algorithms and models for statistical exploration, data mining, predictive analytics, machine learning, natural language processing and other functions.
  • Develop, share and reuse data-driven analytic applications as composable microservices for deployment in hybrid cloud environments.
  • Secure, govern, track, audit and archive data, algorithms, models, metadata and other assets throughout their lifecycles.

How can you continually boost your data scientists’ productivity? On 6 June, IBM will announce its plans for making R, Spark and open data science a sustainable business reality.

At the Apache Spark Maker Community Event, join IBM for a stimulating evening centering on topics of keen interest to data scientists, data application developers and data engineers. The event features special announcements, a keynote, a panel discussion and a hall of innovation. Leading industry figures who have already committed to participate include John Akred, CTO of Silicon Valley Data Science; Ritika Gunnar, vice president of offering management for IBM Analytics; Todd Holloway, director of content science and algorithms at Netflix; and Matthew Conley, data scientist at Tesla Motors.

Register today for the Apache Spark Maker Community Event, or, if you can’t make it to the in-person event, sign up to watch a livestream of the event. We look forward to your participation!