Bridging NoSQL databases into open data science initiatives

Big Data Evangelist, IBM

NoSQL databases play an important role in open data science. They provide a data storage, processing and access platform for unifying structured and unstructured data, as well as for preparing and cleansing these sources to drive high-performance statistical modeling and exploration.

NoSQL databases and hybrid data architecture

Data scientists use NoSQL platforms as the foundation for data lakes and refineries. As leading data scientist Lillian Pierson stated in a recent IBM Big Data & Analytics Hub post, NoSQL platforms such as IBM Cloudant offer many advantages in hybrid data architectures. Drawing from Pierson’s discussion and adding a few points of my own, I’d like to outline the following key advantages of NoSQL databases for open data science: 

  • Store data from structured, semistructured and unstructured sources.
  • Store and transmit all types of data in JavaScript Object Notation (JSON), a standard format built on key-value pairs that is easy for both humans and applications to read and write. JSON is especially well suited to dynamic web pages and is supported, natively or through standard libraries, in most mainstream programming languages.
  • Use SQL to access, query and manipulate both relational and nonrelational data types.
  • Store data in a schema-on-read model that is agnostic to data sources and downstream uses, while offering the flexibility and extensibility that data science projects require.
  • Implement distributed architectures that can scale horizontally to big data volumes, varieties and velocities, while also enabling the massively parallel processing (MPP) required by the most complex, resource-intensive data science pipeline functions.
  • Simplify the complex problem of managing application state at large scale, enabling application developers to store application state locally on devices—tablets, smartphones and wearable devices—while synchronizing that state with the nearest NoSQL cloud database instance.
  • Streamline complex data delivery scenarios by using cloud data services rather than on-premises software deployment to manage distributed database services.
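
To make the JSON and schema-on-read points above concrete, here is a minimal Python sketch that uses only the standard library. The sample documents, the field names and the read_with_schema helper are hypothetical, invented for illustration; they are not part of Cloudant or any other NoSQL product's API. The idea is that documents of different shapes are stored as-is, and each consumer applies its own schema only when it reads them.

```python
import json

# Hypothetical documents with different shapes, stored unchanged as JSON text.
raw_docs = [
    '{"_id": "a1", "type": "sensor", "temp_c": 21.5}',
    '{"_id": "a2", "type": "sensor", "temp_c": "19", "site": "lab"}',
    '{"_id": "a3", "type": "note", "text": "calibrated"}',
]


def read_with_schema(raw, schema):
    """Apply a schema at read time (hypothetical helper).

    Keeps only the fields named in `schema` and coerces each one to the
    requested type. Missing or uncoercible fields become None.
    """
    doc = json.loads(raw)
    out = {}
    for field, cast in schema.items():
        value = doc.get(field)
        try:
            out[field] = cast(value) if value is not None else None
        except (TypeError, ValueError):
            out[field] = None
    return out


# One downstream consumer cares only about temperature readings.
sensor_schema = {"_id": str, "temp_c": float}
readings = [read_with_schema(raw, sensor_schema) for raw in raw_docs]
```

A different consumer could read the same stored documents with a different schema, for example one that extracts only the text field, without any change to how the data was written.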

Where open data science initiatives are concerned, NoSQL databases complement Apache Hadoop, relational databases and other functional components within the multiplatform world of cloud data services. Depending on the applications within which NoSQL databases are used, the cloud data services ecosystem might bridge among any or all of the following services: 

To deliver results within any type of data science project, a NoSQL database needs to be deployable in a fully managed and secured environment and accessible on demand or through reserved enterprise instances. A NoSQL database also needs to incorporate a built-in connector to Apache Spark. The connector enables data scientists to load JSON data and analyze it in memory, use it in Spark notebooks for advanced analytics, or efficiently transform and filter it before writing it back to Cloudant or another data source.
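
The load, filter, transform and write-back flow can be sketched in plain Python. This is a stand-in for what a Spark job would do in memory, not the Spark connector itself; the sample documents and all function names are hypothetical. The one product-specific assumption is the shape of the final payload, which follows the {"docs": [...]} request body accepted by the CouchDB-style _bulk_docs endpoint.

```python
import json

# Hypothetical JSON documents as they might look after loading from a database.
docs = [
    {"_id": "o1", "customer": "acme", "amount": "120.50", "status": "paid"},
    {"_id": "o2", "customer": "acme", "amount": "80.00", "status": "void"},
    {"_id": "o3", "customer": "zen", "amount": "42.25", "status": "paid"},
]


def filter_paid(records):
    """Filter step: keep only documents whose status is 'paid'."""
    return [d for d in records if d.get("status") == "paid"]


def total_by_customer(records):
    """Transform step: sum the amount field per customer."""
    totals = {}
    for d in records:
        customer = d["customer"]
        totals[customer] = totals.get(customer, 0.0) + float(d["amount"])
    return totals


def to_bulk_payload(totals):
    """Write-back step: build a JSON body in the assumed {"docs": [...]} shape.

    Nothing is sent over the network here; the string is only constructed.
    """
    docs_out = [
        {"_id": f"total:{customer}", "customer": customer, "total": total}
        for customer, total in sorted(totals.items())
    ]
    return json.dumps({"docs": docs_out})


totals = total_by_customer(filter_paid(docs))
payload = to_bulk_payload(totals)
```

In a real deployment the filter and aggregation would run inside Spark across many partitions; the sketch only shows the order of the steps and the kind of data that moves between them.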

The NoSQL power experience

You can start achieving these benefits today. We encourage data science professionals to sign up for IBM Analytics for Apache Spark and IBM Cloudant, a complimentary three-month trial that offers a no-risk on-ramp to the power of NoSQL and open data science. In this extended, cloud-based trial, you gain access to supported, fast, in-memory analytics on Cloudant JSON data. You also receive 20 hours of complimentary software-as-a-service (SaaS) Startup Advisory Services to help you make the most of cloud data services in your data science business initiatives. At the same time, check out this video introduction to IBM Analytics for Apache Spark and this tutorial on using the service’s embedded machine-learning libraries.