The evolving shape of distributed databases in the Internet of Things

Big Data Evangelist, IBM

We all think we know what a database is—but do we really?

Most of us were schooled in database fundamentals grounded in the relational principles laid down by IBM-er Edgar F. Codd more than 40 years ago. On that foundation, relational databases layer the features we’ve come to associate with enterprise database management systems: transactionality, replication, indexing, caching and materialized views. Nevertheless, at the core of the discipline remain the 12 rules that Codd specified all those years ago.

At heart, a database remains an organized collection of data that represents a state of affairs in the subject domain as implemented through a consistent schematic model. Another way of expressing this is the notion that a database is a “global, shared, mutable state,” as discussed in this thought-provoking recent post by Martin Kleppmann. This definition combines four concepts, discussed here in reverse order of their sequence in that phrase:

  • State: A database is a stateful resource. This pivots on the concept of “state,” which refers to an identity and a value associated with a particular logical entity at a particular point in time, and whose accuracy, consistency and integrity are ensured through transactional guarantees such as ACID.
  • Mutable: A database is a mutable resource. This means that the values of the entities described in the underlying schematic model may change over time, but the identities of those entities remain constant over time.
  • Shared: A database is a shared resource. This means that the data is addressable and accessible via primary and secondary keys.
  • Global: A database is a global resource. This means that all information in the database implements a consistent schematic model which may change over time.

Most of these fundamentals haven’t changed in the intervening decades, though NoSQL databases, with their emphasis on “eventual consistency,” have pushed the transactionality bar into looser, less ACID-ic territory. With that in mind, I took great interest in Kleppmann’s discussion of “turning the database inside-out,” specifically with regard to his vision of the evolving database as an “always-growing collection of immutable facts.”

That sounds suspiciously like the data warehousing notion of a “single version of the truth,” but it hints at an architecture that might be applicable across a wider range of data management use cases. As Kleppmann describes it, his next-generation database architecture takes “streams of facts as they come in” and functionally processes them in real time. Considering that this hints at a grand synthesis of at-rest databases and in-motion data-stream processing, I paid even closer attention.

More concretely, Kleppmann describes this vision as a “distributed, durable commit log” that spans two or more streams and constitutes the entire virtual data collection accessible to applications. In other words, it’s a distributed event log in which the “immutable facts” are tagged with timestamps and which, like any commit log, can be replayed to reveal the entire state of affairs represented in the distributed data collection at any point in the past.
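That commit-log idea can be sketched in a few lines of Python. This is a toy, single-process version under my own assumptions; the `Event` type and `state_as_of` helper are illustrative names, not anything from Kleppmann's writing or from Samza:

```python
from dataclasses import dataclass

# Each immutable fact is an append-only entry tagged with a timestamp.
@dataclass(frozen=True)
class Event:
    timestamp: int   # logical or wall-clock time of the fact
    key: str         # identity of the entity the fact describes
    value: object    # the asserted value at that time

log = []  # the "distributed, durable commit log", reduced to an in-memory list

def append(event):
    """Writes never mutate anything in place; they only extend the log."""
    log.append(event)

def state_as_of(ts):
    """'Roll back' by replaying only the facts known up to time ts."""
    snapshot = {}
    for event in log:
        if event.timestamp <= ts:
            snapshot[event.key] = event.value
    return snapshot

append(Event(1, "sensor-7", 20.5))
append(Event(2, "sensor-7", 21.0))
append(Event(3, "sensor-7", 19.8))

print(state_as_of(2))  # {'sensor-7': 21.0} — the state of affairs at time 2
print(state_as_of(3))  # {'sensor-7': 19.8} — the current state
```

The point is that any past state is recoverable because nothing is ever overwritten; the "database" at time t is just a fold over the log's prefix up to t.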

The more I thought about this concept, the more I realized that it’s essentially what underlies the established notion of “temporal databases.” Essentially, temporal databases are audit logs for time-series discovery, persisting two (in Kleppmann’s words) “immutable facts” for any given state: the value back then, as it was known back then (transaction time), and the value back then, as it is known now (valid time). None of these concepts are new; they were proposed as extensions to the SQL standard (which, of course, derives from Codd’s groundbreaking work) in the mid-1990s and eventually incorporated into it with SQL:2011.
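The bitemporal distinction is easier to see in code than in prose. Here is a minimal sketch under my own assumptions, using hypothetical `valid_from` and `recorded_at` fields rather than the SQL standard's actual syntax:

```python
from dataclasses import dataclass

# A bitemporal fact carries both timelines: when the value held in the
# modeled world (valid time) and when the database learned it (transaction time).
@dataclass(frozen=True)
class BitemporalFact:
    key: str
    value: object
    valid_from: int    # when the value became true in the real world
    recorded_at: int   # when the fact entered the log (transaction time)

facts = [
    # At time 10 we recorded that sensor-7 read 20.0 starting at time 10.
    BitemporalFact("sensor-7", 20.0, valid_from=10, recorded_at=10),
    # At time 30 a late correction arrived: the reading at time 10 was really 19.5.
    BitemporalFact("sensor-7", 19.5, valid_from=10, recorded_at=30),
]

def value_at(key, valid_ts, known_at):
    """The value in effect at valid_ts, using only facts recorded by known_at."""
    candidates = [f for f in facts
                  if f.key == key
                  and f.valid_from <= valid_ts
                  and f.recorded_at <= known_at]
    if not candidates:
        return None
    # The most recently valid, most recently recorded fact wins.
    return max(candidates, key=lambda f: (f.valid_from, f.recorded_at)).value

print(value_at("sensor-7", 15, known_at=20))  # 20.0 — as it was known back then
print(value_at("sensor-7", 15, known_at=40))  # 19.5 — as it is known now
```

Because the correction is a new immutable fact rather than an overwrite, both answers remain queryable forever.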

With all of that in mind, I wondered whether Kleppmann was saying anything radically new or simply rephrasing temporal database concepts and applying them in a new application context. In his case, that new context is the distributed stream-processing framework Apache Samza.

As I thought about it, I realized that what Kleppmann is proposing is a fundamentally new framework that, though it obliquely draws on temporal database principles, goes well beyond databases as we know them, into a new world where the practical distinctions between data at rest and data in motion are dissolving. It’s all about temporal stream computing. It’s geared to a world, such as the Internet of Things (IoT), in which data collections are dynamic streams of time-stamped event-log messages. Immutability (data integrity) is enforced at the message level, so any view over the distributed logs can always be rolled back to a single “transaction time” version of the truth as known at any point in the past. Check out my post on how distributed event-logging infrastructure is the principal database abstraction in the IoT.

Kleppmann pretty much says all of this, though not in so many words. In “turning a database inside-out,” he argues that the “replication stream” should not be peripheral to the core concept of a database (per Codd’s framework) but central to it, serving as both transaction log and event stream. “Now,” under his conception, “each write is just an immutable event that you can append to the end of the transaction log. The transaction log is a really simple, append-only data structure.”

A key concept of his that I hadn’t previously considered is the central importance of materialized views in this new streaming big data world of distributed event logs. In a world where data at rest may become a quaint, outmoded concept, materialized views become the core access abstraction. As Kleppmann notes, “A materialized view is just a cached subset of the log, and you could rebuild it from the log at any time. There could be many different materialized views onto the same data: a key-value store, a full-text search index, a graph index, an analytics system and so on.”
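The “many views over one log” idea can be sketched directly. Both views below are deliberately toy-sized, and the function names are mine, not Kleppmann’s; the point is only that each view is a derived, disposable cache of the same log:

```python
# One append-only log of writes; each entry is (key, latest asserted value).
write_log = [
    ("device-1", "temperature high"),
    ("device-2", "battery low"),
    ("device-1", "temperature normal"),
]

def build_kv_view(entries):
    """Key-value materialized view: the latest write wins for each key."""
    view = {}
    for key, value in entries:
        view[key] = value
    return view

def build_text_index(entries):
    """Toy full-text view: word -> set of keys whose latest value mentions it."""
    index = {}
    for key, value in build_kv_view(entries).items():
        for word in value.split():
            index.setdefault(word, set()).add(key)
    return index

# Two different materialized views onto the same data, each rebuildable
# from scratch at any time by replaying the log.
kv_view = build_kv_view(write_log)
text_index = build_text_index(write_log)
print(kv_view["device-1"])   # 'temperature normal'
print(text_index["low"])     # {'device-2'}
```

Dropping and rebuilding a view is just re-running its build function over the log, which is exactly why Kleppmann can call a materialized view “just a cached subset of the log.”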

That’s a powerful concept. As it gains traction in real-world IoT and other stream-centric data environments, it could pave the way for a world where databases in the traditional sense occupy a less pivotal role in the end-to-end architecture. You won’t need to reconsider your commitment to relational, columnar or any other established database technology any time soon, as all of it plus Hadoop, NoSQL, in-memory and other newer architectures will play together very nicely within hybrid virtualized environments.

But the IoT’s advance will almost certainly put distributed stream-computing platforms, such as IBM InfoSphere Streams, in the driver’s seat of database evolution in the decades to come.

Get a glimpse of IBM's sophisticated new approaches for IoT analytics in the cloud and try a 30-day free trial for IBM IoT Foundation today.