Bridging NoSQL databases into open data science initiatives
NoSQL databases play an important role in open data science. They provide a data storage, processing and access platform for unifying structured and unstructured data, as well as for preparing and cleansing these sources to drive high-performance statistical modeling and exploration.
NoSQL databases and hybrid data architecture
Data scientists use NoSQL platforms as the foundation for data lakes and refineries. As leading data scientist Lillian Pierson stated in a recent IBM Big Data & Analytics Hub post, NoSQL platforms such as IBM Cloudant offer many advantages in hybrid data architectures. Drawing from Pierson’s discussion and adding a few points of my own, I’d like to outline the following key advantages of NoSQL databases for open data science:
- Store data from structured, semistructured and unstructured sources.
- Use SQL to access, query and manipulate both relational and nonrelational data.
- Store data in a schema-on-read model that is agnostic to data sources and downstream uses, while offering the flexibility and extensibility that data science projects require.
- Implement distributed architectures that can scale horizontally to big data volumes, varieties and velocities, while also enabling the massively parallel processing (MPP) required by the most complex, resource-intensive data science pipeline functions.
- Simplify the complex problem of managing application state at large scale, enabling application developers to store application state locally on devices—tablets, smartphones and wearable devices—while synchronizing that state with the nearest NoSQL cloud database instance.
- Streamline complex data delivery scenarios by using cloud data services rather than on-premises software deployment to manage distributed database services.
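The schema-on-read point in the list above can be made concrete with a minimal sketch. It assumes three hypothetical JSON documents and uses only the Python standard library, not any particular NoSQL product's API. The documents are kept exactly as they arrived, and a schema (which fields to keep, which types to coerce) is applied only when a specific analysis reads them. The document contents and the `read_sensor_readings` helper are illustrative assumptions.

```python
import json

# Hypothetical raw documents, stored as-is from different sources.
# Nothing forces them to share fields or types at write time.
RAW_DOCS = [
    '{"_id": "a1", "type": "sensor", "temp_c": "21.5", "ts": "2016-03-01T10:00:00Z"}',
    '{"_id": "b7", "type": "tweet", "text": "hello", "user": {"name": "ana"}}',
    '{"_id": "a2", "type": "sensor", "temp_c": 19, "site": "lab-2"}',
]


def read_sensor_readings(raw_docs):
    """Apply a schema at read time: keep sensor documents and coerce fields."""
    readings = []
    for raw in raw_docs:
        doc = json.loads(raw)
        if doc.get("type") != "sensor":
            continue  # other document types are ignored by this reader
        readings.append({
            "id": doc["_id"],
            "temp_c": float(doc["temp_c"]),      # accepts "21.5" or 19
            "site": doc.get("site", "unknown"),  # tolerate a missing field
        })
    return readings


readings = read_sensor_readings(RAW_DOCS)
for reading in readings:
    print(reading)
```

A different downstream use, such as text analysis of the tweet documents, could define its own reader over the same stored data without any migration of what is already there.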
Where open data science initiatives are concerned, NoSQL databases complement Apache Hadoop, relational databases and other functional components within the multiplatform world of cloud data services. Depending on the applications within which NoSQL databases are used, the cloud data services ecosystem might bridge among any or all of the following services:
- Hadoop cloud services accelerate enterprise-grade advanced analytics built on open source technology.
- Data warehousing cloud services analyze data where it resides, in the cloud, with a fully managed columnar data warehouse service that provides in-database predictive analytics and MPP.
- Graph cloud services enable high-powered storage, query and visualization of data points, their connections and their properties.
- Multiple-workload SQL cloud databases deliver performance and high availability for mission-critical applications and analytics.
- Open source database-as-a-service (DBaaS) platforms support web and mobile applications in a scalable manner.
- Cloud-based data preparation and movement services enable developers to access, combine and transform data.
- Predictive analytics cloud services improve decision making by deploying predictive models directly into business processes.
To deliver results within any type of data science project, a NoSQL database needs to be deployable in a fully managed and secured environment and accessible on demand or through reserved enterprise instances. It also needs a built-in connector to Apache Spark. The connector enables data scientists to load and analyze JSON data in memory, use it in Spark notebooks to conduct advanced analytics, or efficiently transform and filter the data before writing it back to Cloudant or another data source.
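The load, filter, transform and write-back flow described above can be sketched in plain Python. This is not the Spark connector's API: an in-memory list stands in for the documents read from the database, a dictionary stands in for the write-back target, and the order documents and the `transform` helper are hypothetical. In a real deployment, Spark would run the same steps in memory and in parallel.

```python
import json
from statistics import mean

# Hypothetical JSON documents standing in for a read from the source database.
source_docs = [json.loads(s) for s in (
    '{"_id": "o1", "region": "eu", "amount": 12}',
    '{"_id": "o2", "region": "eu", "amount": 8}',
    '{"_id": "o3", "region": "us", "amount": 30}',
    '{"_id": "o4", "region": "us", "amount": 20}',
    '{"_id": "o5", "region": "apac"}',
)]


def transform(docs, min_amount):
    """Filter out small or incomplete orders, then summarize per region."""
    amounts_by_region = {}
    for doc in docs:
        amount = doc.get("amount")
        if amount is None or amount < min_amount:
            continue  # drop documents that fail the filter
        amounts_by_region.setdefault(doc["region"], []).append(amount)
    return [
        {
            "_id": f"summary:{region}",
            "region": region,
            "count": len(amounts),
            "mean_amount": mean(amounts),
        }
        for region, amounts in sorted(amounts_by_region.items())
    ]


# Stand-in for the write-back target (Cloudant or another data source).
target_db = {}
for summary in transform(source_docs, min_amount=10):
    target_db[summary["_id"]] = summary

print(json.dumps(target_db, indent=2))
```

Filtering before aggregation keeps the written-back documents small, which is the same reason to transform and filter data before writing it back.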
The NoSQL power experience
You can start achieving these benefits today. We encourage data science professionals to sign up for the complimentary three-month trial of IBM Analytics for Apache Spark and IBM Cloudant, which offers a no-risk, pay-as-you-go on-ramp to the power of NoSQL and open data science. This extended, cloud-based trial gives you access to supported, fast, in-memory analytics on Cloudant JSON data. It also includes 20 hours of complimentary software-as-a-service (SaaS) Startup Advisory Services to help you make the most of cloud data services in your data science business initiatives. In addition, check out this video introduction to IBM Analytics for Apache Spark and this tutorial on using the service’s embedded machine-learning libraries.