Experiencing deeper productivity in open data science

Big Data Evangelist, IBM

Data scientists do exceptionally complex work, and their productivity depends on having tools and practices that streamline and accelerate the detailed tasks in which they immerse themselves. For well over a decade, developers have been using the industry-standard open-source Eclipse framework to simplify access to a comprehensive set of tools for building, testing and deploying application code. Accordingly, as data science moves into the inner circle of next-generation developer competencies, data scientists can expect the emergence of an industry-standard integrated development environment (IDE) equivalent to Eclipse.

To be truly useful, an open IDE for data science should facilitate the following development tasks:

  • Accessing data from diverse data lakes, big data clusters, cloud data services and more
  • Discovering, acquiring, aggregating, curating, preparing, pipelining, modeling and visualizing complex, multistructured data
  • Prototyping and programming data applications for execution in in-memory, streaming and other low-latency runtime environments
  • Tapping into libraries of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning and natural language processing, among other functions
  • Developing, sharing and reusing data-driven analytic applications as composable microservices for deployment in hybrid cloud environments
  • Securing, governing, tracking, auditing and archiving data, algorithms, models, metadata and other assets throughout their lifecycles
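
As a rough illustration of the aggregating, preparing and modeling tasks in the list above, the following minimal Python sketch combines records from two sources, drops incomplete rows and fits a simple line. The data, field names and function names are all hypothetical, chosen only for this example; they do not come from any IBM product or from the IDE described here.

```python
from statistics import mean

# Hypothetical records standing in for rows acquired from two different sources.
SOURCE_A = [
    {"spend": 1.0, "sales": 7.0},
    {"spend": 2.0, "sales": 9.0},
]
SOURCE_B = [
    {"spend": 3.0, "sales": 11.0},
    {"spend": 4.0, "sales": None},  # incomplete row, removed during preparation
    {"spend": 5.0, "sales": 15.0},
]


def aggregate(*sources):
    """Combine the records from several sources into one list."""
    return [row for source in sources for row in source]


def prepare(rows):
    """Keep only the rows in which both fields are present."""
    return [
        row for row in rows
        if row.get("spend") is not None and row.get("sales") is not None
    ]


def fit_line(rows):
    """Fit sales = slope * spend + intercept by ordinary least squares."""
    xs = [row["spend"] for row in rows]
    ys = [row["sales"] for row in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


clean_rows = prepare(aggregate(SOURCE_A, SOURCE_B))
slope, intercept = fit_line(clean_rows)
print(f"rows kept: {len(clean_rows)}, slope: {slope:.2f}, intercept: {intercept:.2f}")
```

In a real project each step would typically be handled by a dedicated library or service rather than hand-written functions; the sketch only shows how the tasks chain together.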

The IBM Data Science Experience is bringing an open IDE for data science a step closer to reality by offering the following features, as discussed at the Apache Spark Maker Community Event in San Francisco:

  • An interactive, cloud-based, scalable and secure visual workbench for consolidating open-source tools, languages and libraries and for collaborating within teams to rapidly put high-quality data science applications into production
  • Access to open-source tools and libraries (including Spark, R, Python and Scala) as well as solutions from IBM, IBM partners such as RStudio, and others through an extensible architecture
  • A unified environment in which data scientists and other analytics developers can connect with one another, access project dashboards and learning resources, fork and share projects, exchange development assets (datasets, models, projects, tutorials and Jupyter notebooks) and share results. Follow-on releases aim to add comments, user profiles, data science competitions, Zeppelin notebooks and real-time collaboration
  • Built-in connectivity to diverse data sources, along with simplified data ingestion, refinement, curation and analysis capabilities, and integration with IBM DataWorks for data integration and wrangling. Follow-on releases aim to add data shaping, Spark pipeline deployment, SPSS analytic algorithms, automated modeling and data preparation, model management and deployment, advanced visualizations, text analytics, geospatial analytics and integration with Watson Analytics

Available on IBM Cloud in open beta, the IBM Data Science Experience carries forward the functionality of IBM’s Data Scientist Workbench, which in the year after its release gained more than 7,000 registered users. IBM plans to offer Data Science Experience under a freemium license, extending to customers the option of paying for value-added service plans based on scale, storage and usage tiers.

IBM has also announced further milestones in its commitment to making Apache Spark an open analytics operating system, contributing SparkR code designed to help fully integrate the R language into the Spark environment. In addition to underlining its commitment to the open-source R language by joining the R Consortium, IBM has built Spark into the core of more than 50 of its analytics and commerce platforms, including IBM BigInsights for Apache Hadoop, Watson Analytics, SPSS Modeler and IBM Streams.

Experience the power of this data science IDE today, then explore how you can use Spark and R to build your own data science applications.