Next-generation data science: Acceleration for team productivity
Data science can be time-consuming work. Increasingly, data scientists focus on the nitty-gritty details of building repeatable artifacts—for example, machine-learning and other statistical models—for deployment within always-on environments.
Accelerating the pipeline of data-science development, deployment and optimization tasks is fundamental to team productivity. Several key pipeline tasks need to be accelerated:
- Discovery and acquisition of data from diverse data lakes, big data clusters, cloud data services and other data stores
- Accelerated ingestion and analysis of new data types—especially the image, audio and video content that is so fundamental to the streaming media and cognitive computing revolutions—through sample pipelines for computer vision and speech as well as data loaders for other data types
- Prototyping, programming and modeling of data applications in Apache Spark, R, Python and other languages for execution within in-memory, streaming and other low-latency runtime environments
- Development of data-driven applications using a reusable, composable library of algorithms and models for statistical exploration; data mining; predictive analytics; machine learning; natural-language processing; and other functions
- Scaled machine-learning algorithm execution on massively parallel Spark-based runtimes, accelerating the training, iteration and refinement of sophisticated models for vision, speech and other media data
- Streamlined end-to-end machine-learning processing across myriad steps—data input through model training and deployment—and diverse tools and platforms through support for standard application programming interfaces (APIs)
- Richly multifunctional machine-learning processing pipelines through extensible incorporation of diverse data loaders, memory allocators, featurizers, optimizers and libraries—among other components
- Benchmarking of the results of machine-learning projects, in keeping with well-defined error bounds, to enable iterative refinement and reproducibility of model and algorithm performance
- Security, governance, tracking, auditing and archiving of data, algorithms, models, metadata and other assets throughout their lifecycles
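The composability that several of these tasks call for, such as pipelines built from interchangeable data loaders, featurizers and models, can be illustrated with a minimal sketch. This uses only the Python standard library; the `Pipeline` class and stage names are hypothetical illustrations, not the API of any specific product.

```python
# Minimal sketch of a composable data-science pipeline.
# All names here (Pipeline, load, featurize) are illustrative only.
from statistics import mean, stdev

class Pipeline:
    """Chain of callables: each stage transforms the data and hands it on."""
    def __init__(self, *stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

def load(rows):
    # Data loader: parse raw comma-separated strings into numeric rows.
    return [[float(x) for x in row.split(",")] for row in rows]

def featurize(table):
    # Featurizer: standardize each column to zero mean, unit variance.
    cols = list(zip(*table))
    return [[(v - mean(c)) / stdev(c) for v, c in zip(row, cols)]
            for row in table]

# Stages snap together and can be swapped out independently.
pipeline = Pipeline(load, featurize)
features = pipeline.run(["1,10", "2,20", "3,30"])
```

Because each stage is just a callable with a common data-in, data-out contract, adding a new loader or optimizer means appending one more stage rather than rewriting the pipeline.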
The data science mind
Manual tasks chew up much of a data scientist's time on most projects. Take a look at some comments from real data scientists about the manual challenges they face in the normal course of doing their work:
- Michael Schmidt, data scientist and founder, Nutonian: “Gaining access to data and getting it into the proper format can be extremely time-consuming. Availability issues can range from not having sufficient volume or variety of data, to having extremely inconsistent or dirty data, where the effort to clean, filter or repair is so monumental that it increases the risk beyond what is tolerable. Once you have that data, you then need to find and interpret what it means. It’s a significant effort to set that up to make sure you’re getting the best results possible. And it’s an area in which I’d love to see more tools to make [the process] easier and faster. It’s a tedious process, but we’re starting to see some progress here.”
- Jeff Jonas, IBM fellow and chief scientist, context computing: “Complex data types yield large volumes of information to be analyzed. But it’s not the amount of data per se that consumes the most time; it’s getting it in the right format, augmenting it and figuring out what information might be missing. It’s an ongoing process that we have to perform again and again. While machines help with some of this effort, a large portion of the work depends on the human ability to theorize, interpret and explore the problem and the potential solution at a deeper level. With the right tools to streamline this task, we could spend a lot more time on actual modeling and driving value from the data.”
Move toward automation
Looking inside the mind of a data scientist reveals that lifecycle tasks range from those that are largely manual to those that can be largely automated. Typically manual tasks, such as those discussed previously, include building statistical models, visualizing their performance against real-world data and explaining their results. Functions that are increasingly being automated include data acquisition, cleansing, discovery, exploration, governance, modeling, preparation and transformation.
Also take a look at a recent LinkedIn blog post discussing the industry push to automate more functions in the data science lifecycle, and the limits to what can practically be offloaded to machines. In next-generation data science team environments, accelerating key tasks depends on all team members having access to the following shared resources:
- A single end-to-end execution environment within which data science pipeline tasks can be automated, per Ben Lorica’s discussion in a recent post
- A single analytics notebook framework that enables fast and flexible sharing of assets and interaction on common tasks throughout the data science lifecycle
- An on-demand, self-service collaboration environment such as the IBM Data Science Experience (DSX) that incorporates support for notebooks, pipeline automation and other essential capabilities
To accelerate your data science development and deployment pipeline, join the DSX. And if you’re a working data scientist, data engineer or data application developer, register to attend the IBM DataFirst Launch Event, 27 September 2016, in New York, New York. Engage with open source community leaders and practitioners, and learn how to accelerate your processes for putting data to work in your burgeoning cognitive business.