Open for Data: Developing open, cloud-based data warehouse architectures
Open data environments are deepening their presence in many enterprise architectures. What are the chief functional components of an open environment for IBM Cloud Data Services?
The Cloud Data Services environment
In February 2016, IBM announced a broad expansion to its Cloud Data Services portfolio, under the Open for Data initiative. In the process, IBM articulated the main elements of an open Cloud Data Services environment that includes the following technologies:
- Hybrid open source data platforms: IBM Compose Enterprise provides a managed platform that enables you to rapidly deploy several open source data platforms in a hybrid architecture on the IBM Bluemix platform. The hybrid environment may include any or all of the following technologies:
  - MongoDB, a scalable NoSQL document database
  - etcd, a NoSQL distributed key-value store
  - Redis, an open source, NoSQL, in-memory data store that can be used as a database, cache and message broker
  - PostgreSQL, a relational database management system (RDBMS)
  - ElasticSearch, a scalable distributed search engine
  - RabbitMQ, message-broker software for message-oriented middleware
- Graph database: IBM Graph, a fully managed NoSQL graph database service for graph analytics that is based on Apache TinkerPop
- Predictive analytics REST APIs: Representational state transfer (REST) application programming interfaces (APIs) that can be called from any programming language, so that predictive models developed in IBM SPSS Modeler can be deployed programmatically on the cloud and invoked from any Bluemix application
- Publicly available data sets: Data sets offered through IBM Analytics Exchange, an open catalog of data sets that can be used for analysis or integrated into applications
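To make the idea of a language-neutral predictive analytics REST API concrete, here is a minimal Python sketch that builds, but does not send, a JSON scoring request. The base URL, the `/score/<model_id>` path and the `{"input": ...}` payload shape are assumptions made up for illustration; they are not the actual IBM API, and the real service documentation should be consulted for endpoint names and authentication.

```python
import json
from urllib.request import Request


def build_scoring_request(base_url: str, model_id: str, record: dict) -> Request:
    """Build (but do not send) a JSON POST request for a hypothetical scoring endpoint."""
    # Hypothetical path layout: <base_url>/score/<model_id>
    url = f"{base_url.rstrip('/')}/score/{model_id}"
    # Hypothetical payload shape: the record to score wrapped in an "input" field
    body = json.dumps({"input": record}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example record for an imaginary churn model; all names are illustrative only.
req = build_scoring_request(
    "https://example.invalid/api",
    "churn-model",
    {"tenure_months": 14, "plan": "basic"},
)
print(req.get_method(), req.full_url)
```

Because the request is plain HTTP plus JSON, the same call could be made from any language with an HTTP client, which is the point of exposing the models through REST.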
As I pondered the announcements, it became clear that this scalable, Bluemix-based Cloud Data Services platform supports two principal types of data workloads: operational and analytical.
Operational systems are now expected to store much more than transactional data and to scale to support digital channel access through web, mobile and social networks. Transactional databases such as IBM DB2 are already available on Bluemix. IBM Cloudant, a NoSQL document database, is also available for storing nontransactional session data, profile data and shopping cart data, all of which are needed to support development of cloud-based, scalable, operational web applications.
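As a rough illustration of why a document database suits data such as shopping carts, the sketch below models a cart as a single JSON-style document in Python. The field names (`_id`, `type`, `items`, `sku`, `unit_price_cents`, `quantity`) are my own assumptions for the example, not a schema prescribed by Cloudant or any other product.

```python
def add_item(cart: dict, sku: str, unit_price_cents: int, quantity: int = 1) -> dict:
    """Return a new cart document with the item added or its quantity increased."""
    # Copy the existing line items, keyed by SKU, so the input document is not mutated
    items = {item["sku"]: dict(item) for item in cart["items"]}
    if sku in items:
        items[sku]["quantity"] += quantity
    else:
        items[sku] = {
            "sku": sku,
            "unit_price_cents": unit_price_cents,
            "quantity": quantity,
        }
    return {**cart, "items": list(items.values())}


def cart_total_cents(cart: dict) -> int:
    """Sum the line totals; prices are kept in integer cents to avoid float rounding."""
    return sum(item["unit_price_cents"] * item["quantity"] for item in cart["items"])


# A cart is one self-contained document, so it can be read or written in one operation.
cart = {"_id": "cart:session-123", "type": "cart", "items": []}
cart = add_item(cart, "sku-1", 1999, quantity=2)
cart = add_item(cart, "sku-2", 500)
cart = add_item(cart, "sku-1", 1999)
print(cart_total_cents(cart))
```

The whole cart travels as one document, with no joins across tables, which is what makes this kind of nontransactional data a natural fit for a document store.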
The Open for Data initiative adds more choice by making third-party NoSQL data stores available, giving people both IBM and non-IBM options for building scalable operational web applications. These NoSQL database services can also be used for very high velocity ingest of transactional and nontransactional data. Overall, Cloud Data Services now supports a broad range of operational workloads.
Together with the data and analytical services already available on Bluemix, the aforementioned announcements make it possible to build out an end-to-end analytics platform on the cloud. Analytical services already available before the Open for Data announcement include these technologies:
- IBM BigInsights for Apache Hadoop
- Analytics for Apache Hadoop
- Apache Spark
- dashDB, a data warehouse-as-a-service (DWaaS) analytical database management system (DBMS)
- Streaming Analytics, powered by IBM Streams
- IBM DataWorks for data refinement and data integration
The Open for Data initiative adds IBM Graph and ElasticSearch into the mix. In terms of analytical workloads, these additions mean that IBM can provide support for the following technologies on the cloud:
- SQL and NoSQL databases, including traditional data warehouses
- Hadoop and Spark for exploratory analytics on multistructured data
- Streaming analytics on data in motion
- Predictive analytics as well as analytics with R
- Graph analytics
- Full-text search and text analytics
- Data integration
- Data sources such as Twitter and weather data, among many others
Spark can connect to all of it, including many of the aforementioned NoSQL data stores, which makes it possible to build scalable, in-memory analytics applications across multiple underlying analytical platforms. In addition, IBM Big SQL, part of the IBM BigInsights for Apache Hadoop service, supports not only SQL access to Hadoop data but also federated query processing, which enables you to join Hadoop and non-Hadoop data. ElasticSearch can also potentially index data across all data stores. The result is a powerful platform, all of which can be built on the cloud.
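The idea behind federated query processing, a single SQL statement joining data that lives in separate stores, can be sketched with nothing more than Python's built-in `sqlite3` module. To be clear, this is a stand-in and not Big SQL: the two attached SQLite databases merely play the roles of a warehouse and a Hadoop-resident data set, and the table and column names are invented for the example.

```python
import sqlite3

# "main" stands in for the data warehouse (non-Hadoop data).
conn = sqlite3.connect(":memory:")
# A second, separate in-memory database stands in for Hadoop-resident data.
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.clicks VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)

# One SQL statement joins tables that live in two different databases.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS clicks
    FROM main.customers AS c
    JOIN lake.clicks AS k ON k.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
conn.close()
```

The query author sees one SQL surface, while the engine takes care of reaching into each underlying store; a real federated engine adds pushdown, statistics and distributed execution on top of that basic idea.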
You may think, as I did initially, that something like IBM Graph would conflict with Spark GraphX, especially given that IBM has made a strategic commitment to Spark. I wondered what was happening here. However, I then noticed the very smart decision to make IBM Graph available with TinkerPop APIs. Because Spark GraphX also has a connector to TinkerPop API-compliant graph data stores, even these pieces snap together, allowing Spark to accelerate graph analytical processing by leveraging massively parallel processing (MPP).
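For readers less familiar with graph analytics, the kind of question a graph database answers can be sketched in a few lines of plain Python. This is not the TinkerPop or IBM Graph API; it is a toy adjacency list with invented vertex names, showing a two-hop "friends of friends" traversal of the sort a graph query language would express declaratively.

```python
from collections import defaultdict

# Directed "knows" edges between people; all names are made up for the example.
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("bob", "dave"),
    ("alice", "erin"),
    ("erin", "carol"),
]

# Build an adjacency list: vertex -> set of vertices it points to.
out_edges = defaultdict(set)
for source, target in edges:
    out_edges[source].add(target)


def two_hop(start: str) -> list:
    """Return vertices exactly two hops from start, excluding direct neighbours and start."""
    direct = out_edges.get(start, set())
    reached = set()
    for neighbour in direct:
        reached |= out_edges.get(neighbour, set())
    return sorted(reached - direct - {start})


suggestions = two_hop("alice")
print(suggestions)
```

On a graph with billions of edges, this same traversal is where a dedicated graph store and parallel processing earn their keep.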
Bluemix support for the Internet of Things and streaming data gives developers the ability to build operational analytics and integrate real-time recommendations and analytics into core applications.
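A simple way to picture analytics on data in motion is a sliding time window over an event stream. The Python sketch below is a toy, single-process illustration of that idea, not IBM Streams or the Streaming Analytics service; the class name, the 10-second window and the product keys are all assumptions for the example.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Count events per key over the most recent window_seconds of a stream."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs in arrival order
        self.counts = Counter()  # live counts for events still inside the window

    def add(self, timestamp: float, key: str) -> None:
        """Record an event, then evict events older than the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n: int = 1) -> list:
        """Return the n most frequent keys currently in the window."""
        return [key for key, _ in self.counts.most_common(n)]


# Product views arriving over time (timestamps in seconds).
window = SlidingWindowCounter(window_seconds=10)
for ts, product in [(0, "shoes"), (3, "hats"), (4, "shoes")]:
    window.add(ts, product)
trending_early = window.top()

window.add(15, "hats")  # the three earlier events now fall outside the window
trending_late = window.top()
print(trending_early, trending_late)
```

A real-time recommendation feature could read the current "trending" key from such a window and surface it in the application while the data is still arriving.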
Cloud-based data warehouse architecture
Clearly, IBM is providing the technology components that can be used to build a logical, cloud-based data warehouse architecture by combining traditional and advanced big data analytical technologies. IBM dashDB is a cloud data warehouse service ideal for analytics, reporting and business intelligence. Learn more about IBM dashDB and start your free trial.