It takes a team: Collaboration and workflow in open data science

Big Data Evangelist, IBM

Data science is a team activity. Ideally, you want to build a team in which diverse specialists pool their skills and knowledge, wield a core set of sophisticated productivity tools and collaborate flexibly and efficiently.

Data scientists produce a steady stream of machine-learning, predictive, segmentation and other advanced analytics models. A team's effectiveness depends on having advanced tools such as Apache Spark and languages such as R and Python at its disposal. It also depends on having tools to support creative design, agile collaboration and workflow management of data, algorithms, models and other artifacts. In addition, high-performance teams need distributed environments that automate as many pipeline functions as possible, from upfront data discovery and acquisition to downstream data wrangling, refinement, modeling, exploration and governance.

Dynamic creativity

Open environments are revolutionizing the data science landscape. Increasingly, the most innovative ideas for data science projects come from new participants, such as self-taught citizen data scientists and crowdsourcing communities. As they open up to fresh thinking, data science teams are becoming more dynamic, creative and productive than ever. This 21st-century data science collaboration will thrive on open tools and platforms that enable collaborative sharing of ideas, samples, templates, models, requests and feedback across geographies, projects and platforms.

As data science initiatives open up to include diversified knowledge ecosystems, team performance is expected to grow by orders of magnitude. In open data science environments, teams will be able to do all of the following:

  • Produce, refine and deploy a far wider range of machine learning and other statistical models and applications more rapidly than ever
  • Develop these artifacts in a much wider range of tools and languages than ever
  • Design more models that incorporate highly complex feature engineering and a wide range of predictors
  • Construct these models from much larger and more diversified libraries of algorithms than ever
  • Train and score the models on larger volumes and a greater variety of data sources more rapidly than ever
  • Accelerate data acquisition, transformation and preparation in a highly automated fashion
  • Deploy models into a much broader range of business applications more rapidly and efficiently than ever

Model-wrangling management

Though the productivity potential is undeniable, a flipside risk exists. Open data science teams may become too productive for their own good. High-performance teams may become swamped with more models—and with more versions of those models in various stages of iterative refinement—than they can easily and securely track and manage.

To mitigate these risks, teams need collaborative workflow environments that support continual tracking and control of their collective work product. These governance features will be essential for maximizing effectiveness and reducing wasted resources throughout the data science lifecycle.

Data science community event

How will you tap into the promise of open data science in collaborative development initiatives while also ensuring strong workflow, governance, security and management of these efforts? On 6 June 2015, IBM will share important announcements for making R, Spark and open data science a sustainable business reality. At the Apache Spark Maker Community Event, IBM is hosting an evening of topics of keen interest to data scientists, data application developers and data engineers.

The event features special announcements, a keynote, a panel discussion and a hall of innovation. Leading industry figures have committed to participate. They include John Akred, CTO at Silicon Valley Data Science; Matthew Conley, data scientist at Tesla Motors; Ritika Gunnar, vice president of offering management at IBM Analytics; and Todd Holloway, director of content science and algorithms at Netflix. Be sure to register for the in-person event. If you can’t attend in person, register to watch the livestream.