Twisting the Kaleidoscope: Part 2
Hybrid-centric approaches to big data platforms bring NoSQL into clear focus
In part 1 of this column, I called attention to the kaleidoscopic conflation of diverse database computing platforms under the banner of NoSQL. In this concluding installment, I'll discuss how data professionals can avoid getting mired in the pointless "what is NoSQL?" definition wars. To that end, I will provide a practical framework for assessing the fitness and constraints of these diverse data platform options within hybridized big-data architectures.
In this era of constant database innovation, the range of established and new data architectures and platforms continues to expand. The chief architectural categories include those that address varying degrees of data at rest and data in motion. The principal market categories, which have diverse commercial and/or open source offerings, include relational and row-based, dimensional and online analytical processing (OLAP), columnar, file system, document, streaming, in-memory, graph, key-value, object, XML, tuple store, triple store, multivalue, and cell databases.
A growing range of commercial and open source offerings hybridize these approaches to varying degrees, and increasing numbers of them are being implemented within various public, private, and other clouds. Note that Apache Hadoop does not refer to a specific data storage architecture but instead to a range of architectures in different categories that can, to varying degrees, execute MapReduce processing models. These architectures include file systems such as Hadoop Distributed File System (HDFS), columnar databases such as Apache HBase, wide-column stores such as Apache Cassandra, relational databases, stream-computing environments, in-memory databases from various vendors, and so forth. The same applies to the sprawling menagerie of data platforms lumped under the terms NoSQL and NewSQL, which, depending on whom you talk to, may include innovative databases in any of these categories or in a narrow group of them.
The functionality of commercial and open source data platforms and associated tooling varies widely. Every data platform has its own specific capabilities, strengths, and limitations in key areas. These areas include query languages, supported data types, schemas and models, scalability, performance, latency, throughput, optimization, concurrency, transactionality, consistency, and workload management. They also include compression, governance, security, analytic libraries, solution accelerators, in-database execution, form factors, topologies, licensing approaches, manageability, total cost of ownership, legacy integration, and development tooling and interfaces. Frequently, cloud-based or cloud-enabling commercial and open source platforms that are positioned in different market segments overlap significantly in many of these features.
The optimal data platform will vary depending on the cloud-based deployment role that you envision for a specific database. The range of potential cloud deployment database roles is wide. At the very least, it includes specialized nodes for data acquisition, collection, preparation, cleansing, augmentation, transformation, preprocessing, and staging. In addition, there is data warehousing, aggregation, and governance; data access, query, delivery, visualization, and interaction; data exploration, modeling, and sandboxing; data streaming and complex event processing; archiving, auditing, and e-discovery; and transactional computing.
For example, Hadoop on HDFS is well suited for implementing a node for acquisition, collection, and preprocessing of unstructured data at rest. Depending on the nature of unstructured data and what you plan to do with it, various other file, document, graph, key-value, object, XML, tuple store, triple store, or multivalue databases may be more suitable. If you need strong transactionality and consistency on structured data in a governance hub, relational and row-based databases or so-called NewSQL databases may be worth looking into. If real-time, data-in-motion requirements predominate for data integration and delivery, investigate the stream-computing, in-memory, and other platforms optimized for low latency.
Optimal data platforms must integrate well with cloud-computing execution and development platforms, whether they are open source, such as Apache CloudStack or Cloud Foundry, or commercial. To the extent that data platforms also conform to open industry cloud-computing reference frameworks developed by the Cloud Standards Customer Council (CSCC), the National Institute of Standards and Technology (NIST), or other groups, so much the better.
Ideally, the selected big data and cloud computing platforms should integrate well with legacy or envisioned investments in analytics development and modeling tools. And they should integrate favorably with data discovery, acquisition, transport, integration, preprocessing, loading, and governance tools; deployment, management, and optimization tools; virtualization, abstraction, metadata, semantics, and federation tools; and business intelligence, search, query, reporting, and visualization tools.
Core advantages of NoSQL approaches
When framing data platform requirements in this feature-centric and hybrid-friendly fashion, the use cases for NoSQL, however defined, and other new approaches come into focus. One of the core advantages of the relational database management system (RDBMS) is its ability to ensure strong data consistency and transactionality, which comes from trading off varying degrees of availability and scalability. In contrast, NoSQL platforms often flip the equation: they provide advantages in availability and scalability that come from relaxing data consistency requirements.
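One common way this trade-off surfaces in practice is in the tunable read and write quorums that many distributed NoSQL stores expose. The sketch below is a toy model, not any vendor's actual API: with N replicas, requiring W write acknowledgments and R read acknowledgments such that R + W > N forces every read quorum to overlap the latest write quorum, yielding strong consistency; smaller quorums trade that guarantee for availability and latency.

```python
import random

N = 3  # replica count (illustrative)

class QuorumStore:
    """Toy replicated register illustrating the R/W quorum trade-off.

    With R + W > N, the read quorum must intersect the write quorum,
    so a read always sees the most recent acknowledged write. With
    R + W <= N, a read can miss every up-to-date replica and return
    stale data: availability gained at the cost of consistency.
    """

    def __init__(self, n=N):
        # Each replica holds a (version, value) pair; all start empty.
        self.replicas = [(0, None)] * n

    def write(self, value, w):
        # Bump the version and acknowledge the write on only w replicas.
        version = max(v for v, _ in self.replicas) + 1
        for i in random.sample(range(len(self.replicas)), w):
            self.replicas[i] = (version, value)

    def read(self, r):
        # Poll r replicas and return the highest-versioned value seen.
        return max(random.sample(self.replicas, r))[1]

# With R=2 and W=2 on N=3, quorum overlap guarantees reads see "v1".
store = QuorumStore()
store.write("v1", w=2)
assert store.read(r=2) == "v1"
```

With R=1 and W=1 on the same store, a read could land on the one replica that never received the write, which is exactly the relaxed-consistency behavior described above.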
Suppose you decide to explore Hadoop and NoSQL partly on the basis of these features. How do you test and evaluate the myriad commercial and open source offerings for their availability and scalability (and, to a lesser degree, their consistency and transactionality) under expected real-world scenarios?
A recent InformationWeek article provides an excellent comparative analysis of the availability of various NoSQL databases based on laboratory testing.1 It reports on research by Kyle Kingsbury, who evaluated “how strongly consistent these databases can be—that is, if you set up a network partition before data can be written everywhere and then you have a network recovery sometime later, what’s happened to your data?”
From a database consistency standpoint, one of the most interesting sections of the article is the following list of questions that Kingsbury says IT should address when testing the databases and when deploying them in a distributed fashion:
- What if we cut the network here?
- What if we isolate a master?
- What if there’s no connected majority?
- What if there’s asymmetric connectivity?
- How about intermittent connectivity?
- What if clocks are unsynchronized?
- What if a node pauses for a few minutes and then comes back?
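The first of these questions can be made concrete with a toy model. The sketch below (illustrative, not any product's behavior) shows what can happen when a partition splits a cluster, both sides keep accepting writes, and the store reconciles on recovery with last-write-wins (LWW), a common NoSQL default: one of the concurrent updates is silently discarded.

```python
def reconcile_lww(a, b):
    """Merge two divergent replicas, keeping the higher-timestamped
    (timestamp, value) entry for each key -- last write wins."""
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Before the partition, both replicas agree on the data.
left = {"cart": (1, ["book"])}
right = dict(left)

# During the partition, each side accepts a conflicting write.
left["cart"] = (2, ["book", "lamp"])   # client A adds a lamp
right["cart"] = (3, ["book", "pen"])   # client B adds a pen, later

# After network recovery, LWW keeps only the later write:
healed = reconcile_lww(left, right)
assert healed["cart"] == (3, ["book", "pen"])  # the lamp update is lost
```

Working through the remaining questions (isolated masters, unsynchronized clocks, paused nodes) against a candidate database in the lab exposes which of these silent-loss behaviors it actually exhibits.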
Another takeaway is Kingsbury’s characterization of the consistency profiles of specific NoSQL and other database management systems (DBMSs), as well as the roles for which they are best suited. The article lists a few of Kingsbury’s preferences: Redis for data caching, Apache Cassandra for very large clusters and high-volume writes, and Hadoop for extremely large data volumes in which slow queries are OK. He also prefers Riak for resolving conflicting versions of an object in a key-value store, Apache Zookeeper for small pieces of state data, and PostgreSQL for object-relational data.
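Riak's approach to conflicting versions differs from last-write-wins: rather than silently picking one update, it can surface concurrent versions as "siblings" for the application to resolve, using vector clocks to detect true conflicts. The following is a minimal sketch of the vector-clock comparison behind that behavior; the function name and structure are illustrative, not Riak's actual API.

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts of node -> counter).

    Returns 'equal', 'descends' (a supersedes b), 'precedes'
    (b supersedes a), or 'concurrent' (a true conflict).
    """
    keys = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(k, 0) >= vc_b.get(k, 0) for k in keys)
    b_ge = all(vc_b.get(k, 0) >= vc_a.get(k, 0) for k in keys)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "descends"
    if b_ge:
        return "precedes"
    return "concurrent"  # keep both versions as siblings

# A write that builds on another simply supersedes it...
assert compare({"node1": 2}, {"node1": 1}) == "descends"

# ...but independent writes on either side of a partition are
# concurrent, and the application must resolve the siblings.
assert compare({"node1": 1}, {"node2": 1}) == "concurrent"
```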
The bottom line, according to the InformationWeek article, is to identify the application's objectives and then select a database based on its application programming interface (API), performance, and consistency requirements. Kingsbury also pointed out that for many applications, “the best option is likely to have both strongly consistent data stores and strongly available data stores, and be clear on which is best suited for any particular data set.”
That advice demonstrates an inherently hybrid-centric approach to balancing these requirements. For any big data practitioner, it’s a more useful approach for twisting the NoSQL kaleidoscope into sharper focus.
Please share any thoughts or questions in the comments.
1 “The man who tortures databases,” by Joe Masters Emison, InformationWeek Software, September 2013.