In the big data scheme of things, you can talk about the "3 Vs," the "4 Vs," or as many "Vs" as your fevered imagination can spin out.
Personally, I'm partial to the "3 Vs" perspective, because volume, velocity and variety all point to the "big" dimension (more data, more speed, more sources) of this phenomenon. Bringing veracity (aka trustworthiness, quality, single version of the truth) into the discussion seems to be bringing in a dimension that's unrelated to the "more, more, more" of it all.
But when you shift the focus from "big" data to "all" data, the fourth V becomes a cleaner fit. Specifically, bringing veracity into the discussion can help you to incorporate an overlapping database paradigm, NoSQL, into the larger perspective. That's because veracity is akin to one of the defining features of NoSQL: the notion of "eventual consistency."
Some industry observers treat NoSQL as the future of all data platforms, but that promise runs up against the hard reality of eventual consistency. As discussed in this recent article by Dave Rosenthal, "eventual consistency means exactly that: the system is eventually consistent–if no updates are made to a given data item for a 'long enough' period of time, sometime after hardware and network failures heal, then, eventually, all reads to that item will return the same consistent value. It’s also important to understand if a client doesn’t wait 'long enough' they aren’t guaranteed consistency at all."
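To make the "long enough" caveat concrete, here is a toy, purely illustrative sketch of eventual consistency (invented for this post, not any real NoSQL product's API): a write is acknowledged after reaching a single replica, a read routed to a different replica returns stale data, and only after a simulated anti-entropy pass do all replicas converge on the same value.

```python
import time

class Replica:
    """One node in a toy eventually consistent store."""
    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

class ToyCluster:
    """Writes land on one replica; anti-entropy syncs the rest later."""
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[0].data[key] = (time.time(), value)

    def read(self, key, replica=1):
        # A read routed to a not-yet-synced replica sees old data.
        return self.replicas[replica].read(key)

    def anti_entropy(self):
        # "Long enough" later: propagate the newest version everywhere.
        for key in self.replicas[0].data:
            newest = max((r.data.get(key, (0, None)) for r in self.replicas),
                         key=lambda tv: tv[0])
            for r in self.replicas:
                r.data[key] = newest

cluster = ToyCluster()
cluster.write("balance", 100)
stale = cluster.read("balance")   # replica 1 has not synced yet
cluster.anti_entropy()
fresh = cluster.read("balance")   # consistent after convergence
print(stale, fresh)
```

Until the anti-entropy pass runs, the client that reads from replica 1 simply gets no (or old) data back, with no error to tell it so, which is exactly the veracity gap Rosenthal describes.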
Another way of phrasing this is that NoSQL is, at best, a "lagged veracity guarantee" approach or, at worst, if you're just too impatient to wait, a "low veracity" or "no veracity" one. A NoSQL database, such as Riak, Couchbase, MongoDB, DynamoDB or Cassandra, may support the "3 Vs" as well as any other distributed database but, owing to its architectural guarantee of eventual consistency, may not support the fourth V as well as, say, an MPP RDBMS.
That explains why I consider eventual consistency a double-edged sword. The feature helps explain why NoSQL databases can scale to an unprecedented extent, but also makes it highly unlikely that these new approaches can supplant RDBMSs from their core online transaction processing (OLTP) and enterprise data warehousing (EDW) platform roles as high-veracity enterprise-data repositories.
As the article states: "When an engineer builds an application on an eventually consistent database, they need to answer several tough questions every time that data is accessed from the database:
- What is the effect on the application if a database read returns an arbitrarily old value?
- What is the effect on the application if the database sees modifications happen in the wrong order?
- What is the effect on the application of another client modifying the database as I try to read it?
- What is the effect that my database updates have on other clients trying to read the data?"
All of which raises a huge issue for NoSQL databases: in their current incarnations, with this veracity limitation, can they evolve into OLTP and/or EDW roles? After all, the very concept of a "system of record" or "single version of the truth" depends on having a database that supports low-lagged, high veracity. By the way, anybody wanting a theoretical discussion of NoSQL's eventual consistency in the context of the "CAP Theorem" of distributed design principles should check out the discussion in the cited article or click through to this Wikipedia page.
Actually, high veracity can indeed be achieved with NoSQL databases, but it requires that application developers write tricky custom code for this purpose. This reality inspires little confidence in the ability of production NoSQL databases to enforce strong data consistency. In his article, Rosenthal quotes a Google paper: “We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level.”
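The "tricky custom code" in question often looks something like the following hypothetical sketch: a client-side quorum read with last-write-wins conflict resolution and read repair. The `Node` class and its `get`/`put` methods are invented stand-ins for a driver connection, not any real NoSQL client API.

```python
class Node:
    """Stand-in for one replica's client connection (hypothetical API)."""
    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key, (0, None))  # (version, value)

    def put(self, key, version, value):
        self.store[key] = (version, value)

def quorum_read(replicas, key, quorum):
    """Read from `quorum` nodes, keep the newest version, repair laggards."""
    responses = []
    for node in replicas:
        version, value = node.get(key)
        responses.append((version, value, node))
        if len(responses) >= quorum:
            break
    newest_version, newest_value, _ = max(responses, key=lambda r: r[0])
    # Read repair: push the winning version back to stale nodes we touched.
    for version, _, node in responses:
        if version < newest_version:
            node.put(key, newest_version, newest_value)
    return newest_value

# One node has the newer write; the others still hold the old version.
nodes = [Node(), Node(), Node()]
nodes[0].put("k", 2, "new")
nodes[1].put("k", 1, "old")
nodes[2].put("k", 1, "old")
winner = quorum_read(nodes, "k", quorum=2)
print(winner)
```

Even this simplified version has to pick version numbers, choose a conflict-resolution rule and decide when to repair, which is precisely the "extremely complex and error-prone" burden the Google paper complains about.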
This suggests a question that goes beyond technical engineering. Is it realistic to expect that NoSQL databases can ever pass the ACID (atomic, consistent, isolated, durable) test of transactionality? In the broader perspective, are we asking too much of NoSQL databases to expect that they can or should ever evolve in the direction of supporting strict, low-lagged consistency while also remaining partition-tolerant? In a world of fit-for-purpose databases, are NoSQL databases best suited for use cases where there is no strict requirement for low-lagged veracity guarantees to be enforced by the distributed data stores themselves?
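For contrast, here is what the ACID guarantee looks like when the database itself enforces it. SQLite is used purely as a convenient stand-in for an ACID-compliant RDBMS: a simulated crash between the two legs of a transfer rolls the whole transaction back, so no reader ever sees a half-finished update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash between the two legs")
        conn.execute("UPDATE accounts SET balance = balance + 50 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass  # the partial debit was rolled back automatically

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The application code never has to answer Rosenthal's four questions here; atomicity and consistency are solved at the database level, which is exactly what the Google paper argues for.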
In particular, what use cases might these be? At a recent NoSQL meetup held in conjunction with IBM Information On Demand, one speaker called attention to the application-specific nature of many NoSQL deployments. Typically, NoSQL applications have been written purposely to enforce consistency guarantees that the underlying distributed database clusters are incapable of maintaining on their own.
Where use cases are concerned, NoSQL clusters are usually deployed as tactical, horizontally scalable repositories for specific content management, document management, text analytics, semantic analysis, graph analysis and other multistructured data applications. Many of these deployments are for exploratory analysis, data staging and archiving, not for operational applications that require access to a continuously current "single version of the truth."
Lagged veracity, NoSQL-style, may be "good enough" data quality for these sorts of applications.