Next-generation data scientist: Harnessing an integrated development environment
Data scientists do exceptionally complex work. Their productivity depends on comprehensive, easy access to analytics tools, data and other assets in intricate, collaborative environments. A typical day in the working life of a data science professional may involve navigating the challenges of any or all of the following tasks:
- Sourcing: Acquire data from diverse data lakes, big data clusters, cloud data services and more
- Preparing: Discover, acquire, aggregate, curate, pipeline, model and visualize complex, multistructured data
- Modeling: Tap into libraries of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning, natural-language processing and interactive visualization among other functions
- Developing: Prototype and program data applications to be executed within in-memory, streaming, cloud and other runtime environments
- Governing: Secure, manage, track, audit and archive data, algorithms, models, metadata and other assets throughout their lifecycles
The typical data science team involves a multifaceted interplay of roles, functions and workflows. Each of the principal roles has to handle its own set of complexities.
As statistical modelers, data scientists drive each step in the lifecycle of data-driven applications. They bring a holistic view to solving problems that involve complex data, algorithms and statistical models. They engage in highly challenging collaborative arrangements that involve data engineers, developers, business analysts and others to ensure delivery of desired outcomes on business projects with data science at their core. Data scientists need a complex set of technical and business skills and knowledge.
Benjamin Skrainka, principal data scientist at Galvanize, said, “A data scientist needs expertise within multiple disciplines. You need to be good at databases. You need some knowledge of software engineering. You need to know some machine learning. And you need to know some statistics.”
Find out more about what working data scientists think about their critical role in data science.
Data engineers are the professionals who manage the process of gathering, organizing, cleansing and integrating data, and who ensure that data flows smoothly throughout the data science lifecycle. They implement and optimize the systems and processes required by other data science professionals and stakeholders, and they work with front-end developers when moving data science projects into production. Their roles are also beginning to blend into those of data scientists.
Jason Hill, senior big data engineer and scientist at CA Technologies, says, “The way that we handle it is to have everybody on one team. Traditionally, an engineer wrote the code, and a data scientist developed the algorithm—within silos in their own areas. Now they know each other’s role and work together. We have data scientists that can write code and work with the engineer to develop algorithms.”
Developers are the builders who craft analytics applications incorporating the algorithmic models that data scientists create, enabling business analysts and other users to realize preferred business outcomes from data science assets in their day-to-day work. Andy Gants, principal data scientist at Spare5, says that the ability of data scientists to collaborate with “a software development department [for] iterating within terms of implementing analyses into specific software features” is essential. “Learning how to perform with the software development department,” says Gants, “proved to be quite a challenge.”
Business analysts are the team’s knowledge workers, using self-service analytics tools to develop predictive analysis, machine learning and other data-driven models without coding and without requesting assistance from data scientists. Data scientists themselves need business analysis skills to do their jobs effectively.
According to Cliff Click, CTO at Neurensic, “Data scientists need a good blend of domain knowledge and business expertise. They need to be extremely inquisitive and relentless at figuring out how to solve a particular problem. That means digging into different approaches and alternatives—not just building models and running algorithms, but also interpreting the results to drive new business opportunities.”
It makes perfect sense for data science teams to use an integrated development environment (IDE) to boost their collaborative output. IDEs provide a comprehensive, extensible, open framework for accessing the tools, data and other resources needed to build, test and deploy executable assets into production environments.
As data science moves into the inner circle of next-generation developer competencies, data scientists can expect the emergence of an industry-standard IDE equivalent to the industry-standard Eclipse framework that software development professionals have been using for years.
IBM Data Science Experience (DSX) is the open IDE for team data science. DSX, which carries forward the functionality of IBM’s Data Scientist Workbench, offers several productivity features.
Open analytics workbench
DSX is an interactive, cloud-based, scalable and secure visual workbench for consolidating open-source tools, languages and libraries. It provides unified access to open-source tools and libraries—including Apache Spark, R, Python and Scala—as well as solutions from IBM, IBM partners such as RStudio and H2O.ai, and others through an extensible architecture.
Simplified data and model management
DSX provides built-in connectivity to diverse data sources, along with simplified data ingestion, refinement, curation and analysis capabilities, and it integrates IBM DataWorks for data integration and wrangling. Follow-on releases are expected to deliver productivity features such as data shaping, Spark pipeline deployment, SPSS analytics algorithms, automated modeling and data preparation, model management and deployment, advanced visualizations, text analytics, geospatial analytics, and integration with IBM Watson Analytics.
Team collaboration and learning environment
DSX provides a team collaboration environment within which data science professionals can connect with one another and rapidly deploy high-quality applications into production environments. It enables team members to access project dashboards and learning resources; fork and share projects; exchange development assets such as data sets, models, projects, tutorials and Jupyter notebooks; and deliver results. Follow-on releases are planned to include comments, user profiles, data science competitions, Zeppelin notebooks and real-time collaboration.
Learn more about why data science is a team sport, and how each role in the team can work collaboratively to drive data science success. Then master the art of data science today, and register to try out DSX for yourself.
Be sure to register for World of Watson, your new home for putting data to work.