I recently started researching a longer piece I’m writing about big data architectures, and I realized something: big data is growing up. It has moved rapidly from a collection of critical but over-hyped technologies disrupting core relational database market segments to an emerging ecosystem poised to revolutionize data management and analytics.
When I was at Netezza, pre-IBM acquisition, our product was the first data warehouse appliance – a massively parallel, shared-nothing, SQL ACID database, in a box. The biggest customer reference databases were hundreds of terabytes, which is still a lot of data now. So we naturally picked up on the phrase “big data” and applied it liberally. We didn’t see many use cases that needed more data than we could handle, but we soon met people who said: “no, Netezza isn’t big data, because it’s relational and relational databases don’t scale.” But there was little old Netezza running petabyte-scale production databases and being told we weren’t big data. Hmm, something wrong there.
Of course we all soon figured out that big data isn’t just about scale, it’s about data sources you don’t want to impose a relational schema on, or not yet. It’s about high arrival rate data use cases that are prohibitively expensive or impossible for relational databases.
So a bunch of new data stores sprang up, to meet a bunch of different use cases. Hadoop, first and foremost, quickly became the de facto standard for big batch processing. Often, though not necessarily, the use case involved complex analytics, such as data mining: looking for patterns that signal significant opportunities. One high-value application is trying to recognize the many different characteristics of a customer and their history that might make them more susceptible to switching to a new supplier. Key-value stores, often where the value was a document, were soon being adopted for store-and-retrieve or store-and-query use cases where latency must be low and volume is high.
A good example is storing and retrieving the state of customers’ shopping carts on a global e-commerce site. Graph databases specialize in use cases where the links between things are what’s interesting.
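The shopping-cart pattern above can be sketched in a few lines. This is a minimal in-memory illustration, not any particular product’s API; the class and method names (`CartStore`, `save_cart`, `load_cart`) are hypothetical, and a real deployment would sit on a distributed key-value or document store rather than a Python dict:

```python
import json

class CartStore:
    """Toy key-value store: customer id -> serialized cart document."""

    def __init__(self):
        self._data = {}

    def save_cart(self, customer_id, cart):
        # The value is an opaque document -- no relational schema is
        # imposed on the cart's contents, which is the point of the pattern.
        self._data[customer_id] = json.dumps(cart)

    def load_cart(self, customer_id):
        # Retrieval is a single lookup by key: low latency at high volume.
        doc = self._data.get(customer_id)
        return json.loads(doc) if doc is not None else None

store = CartStore()
store.save_cart("cust-42", {"items": [{"sku": "A1", "qty": 2}], "currency": "USD"})
print(store.load_cart("cust-42"))
```

The design choice worth noting is that reads and writes are always whole-document operations keyed by customer id; you give up ad-hoc relational queries across carts in exchange for predictable, fast lookups.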
But the big data battle isn’t fought just between database technologies. Relational databases have a decades-old ecosystem around them (data integration, reporting, APIs and more) and each of these sectors has its own dynamics. So when new big data stores came along and changed the core of the ecosystem, this was disruptive not just to RDBMS incumbents, but to all the players in the surrounding sub-sectors as well.
Disruption is not new to big data. There are always the same four Rs for disrupted incumbents:
As an incumbent the first reaction is Refuse: refuse to believe the innovation will genuinely disrupt you. This is actually healthy; if you didn’t believe in your own vision you’d be blown about by every whim of every nearby marketer. The trick is to know when to move on.
The next step is typically Re-position. You’re an incumbent, you have existing territory to defend: a code base, a customer base and a market position. But, because PowerPoint is easier to re-engineer than code, step two is usually to adopt some variant of “We do that already,” which may be more or less true. Certainly if big data had ever been about just “big” we’d have been on pretty safe ground with that position at Netezza for almost all cases. And it certainly can be made truer with a little light engineering, in most cases.
However, eventually, the incumbent has to bite the bullet and Re-engineer—big time. That might be re-engineering an existing product or re-engineering the company with a new product or set of products. That was Netezza’s response; or rather Netezza was part of IBM’s response: making Netezza part of a wide portfolio of products to address a wide variety of big data use cases.
Of course the fourth R, Retreat, is for those incumbents who either never get past Refuse or don’t successfully navigate Re-position and Re-engineer.
Enter the new niche players
While the incumbents are wrestling to turn the ship around, the nimble incomers are sailing straight into the uncharted waters of opportunity. First they compete for the core use cases, then they compete to back-fill the equivalent ecosystem around the core.
So in big data back in 2011 we had data integration incumbents claiming big data was just a matter of a couple of new connectors, and three years later we have survivors (and thrivers!) from that era, as well as new integrators, particularly where the data sources or the use case are different. An example of this is live feeds for real-time analytics on high-velocity data.
And we have the same situation playing out everywhere else: reporting/data visualization, analytics libraries and languages, and so on.
Big data, the four Rs and the new kids on the block
So in a pretty short period of time, big data has moved rapidly along the acceptance curve, the data stores themselves dragging the rest of the sector in their hype-fuelled wake. And yet, despite the enormous hype and the inaccurate name, big data has arrived. You know it’s growing up because the ecosystem is maturing all around it. There are evolved incumbents who’ve successfully negotiated the four Rs, there are competitive incomers and there are occupiers of new use case niches. It’s still consolidating and there is a great deal to do yet, but the hype was, kind of, justified. The world of data management has changed very radically in a very short period of time.
New technologies and new philosophies
And that is the thought that provoked the longer piece I’m working on: the change in technology has to drive a change in philosophy. We’ve moved on from the ideal (often only an ideal) of an integrated Enterprise Data Warehouse to a federation of different technologies addressing different use cases and categories of use cases. Gartner has called this the Logical Data Warehouse. IBM has kept the name and redefined, or at least evolved, the meaning.
What I think it means is that we’ve moved away from a data-driven view of the warehouse (or whatever it’s called now) to a use case-driven approach. That means a proliferation of use case-specific architectures and a danger of data silos all across the enterprise. IBM’s vision is about addressing this from the technology perspective: what data is stored in which virtual or tin boxes and where?
But I’m also interested in how we strike the right balance between freeing up teams to deliver the right solution with maximum value for their specific challenge, and optimizing across all these solutions so the enterprise gets the best return on its corporate data investment. Is there a role for the corporate architect any more? If I can demonstrate ROI for my project to my stakeholders, why is it anyone else’s business? Those are questions that will bear some further examination. And I’m on it.