Big Data: The Data Variety Discussion
We'll start from the very beginning. It's a very good place to start...
Big data is all about Velocity, Variety and Volume, and the greatest of these is Variety. At least it causes the greatest misunderstanding.
Variety, in this context, alludes to the wide variety of data sources and formats that may contain insights to help organizations to make better decisions. Everything from our existing database records of customer purchases, to their tweets about us, to web-logs indicating their trail through our website, to audio files of their conversations with our reps in the call centre. That’s just a customer view to a retailer. You can add in videos from CCTV, readings from smart grids and other networks, instrumentation of all kinds on all kinds of appliances (the first big data app I ever conceived was the smart-fridge that tracked my beer consumption and re-ordered automatically).
Some people talk about structured and un-structured data. By structured they mean relational and by un-structured they mean everything else. Those people often have a vested interest in un-structured data analytics technology so it’s in their interest to define their product’s scope as widely as possible and to keep the data warehouse in its little relational box.
But there are countless examples of organizations using relational data warehouses to successfully process what, by that definition, is un-structured data. The one I always trot out here is Call Data Records (CDRs).
These are little packets of data, created on mobile telephone networks for every call. I think there are at least four per call and they contain data like caller id, callee, start time, end time, source network (e.g. Sprint), destination network (e.g. T-Mobile), etcetera. The network service providers use CDRs to bill us for our phone usage and to bill each other for the calls they complete on each other’s behalf. These billing apps, and particularly the reconciliation apps that allow a provider to check that its partners are not billing inaccurately, have to chew through enormous volumes (imagine the bill from T-Mobile to Sprint – with about a gazillion line items on it, each to be checked off against a line item on one of their millions of customers’ bills). It’s a simple app if you can handle that volume of data, and IBM has lots of customers who justified the cost of their IBM Netezza boxes on just that app. But that data isn’t relational. It comes straight off the network. It must be un-structured so it can’t be a relational data warehouse app! But it is structured. You know exactly what each byte of a call data record is. It’s totally structured – just not relational.
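To make the "you know exactly what each byte is" point concrete, here is a minimal sketch in Python. The record layout, the field widths and the network codes are all invented for illustration; a real CDR format will differ from network to network. The point is only that a fixed byte layout can be decoded mechanically into named fields, with no relational database in sight.

```python
import struct
from datetime import datetime, timezone

# Hypothetical fixed-width layout, invented for this example (not a real CDR spec):
#   caller   16 bytes ASCII, NUL-padded
#   callee   16 bytes ASCII, NUL-padded
#   start    unsigned 32-bit epoch seconds
#   end      unsigned 32-bit epoch seconds
#   src_net  unsigned 16-bit network code
#   dst_net  unsigned 16-bit network code
CDR_LAYOUT = struct.Struct(">16s16sIIHH")  # big-endian, no alignment padding

NETWORKS = {1: "Sprint", 2: "T-Mobile"}  # made-up codes


def parse_cdr(raw: bytes) -> dict:
    """Decode one fixed-width record into named fields."""
    caller, callee, start, end, src, dst = CDR_LAYOUT.unpack(raw)
    return {
        "caller": caller.rstrip(b"\x00").decode("ascii"),
        "callee": callee.rstrip(b"\x00").decode("ascii"),
        "start": datetime.fromtimestamp(start, tz=timezone.utc),
        "duration_s": end - start,
        "src_net": NETWORKS.get(src, "unknown"),
        "dst_net": NETWORKS.get(dst, "unknown"),
    }


# Build one sample record as raw bytes, then decode it again.
raw = CDR_LAYOUT.pack(b"15550100", b"15550199", 1_330_000_000, 1_330_000_125, 1, 2)
record = parse_cdr(raw)
print(len(raw), record["caller"], record["duration_s"], record["dst_net"])
# prints: 44 15550100 125 T-Mobile
```

Once every record is a flat set of named fields like this, loading it into warehouse tables for billing or reconciliation is a matter of volume rather than of structure.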
So here endeth lesson one:
‘The only data you process in the data warehouse is the relational data from operational CRM & ERP systems. Everything else goes in Hadoop’ is way too simplistic to be a useful guide when you’re building a big data strategy.
Another example might be sentiment analysis of tweets. Now I’ve heard relational database experts say that you can analyze tweets in a relational database, even though they are text + tags (definitely semi-structured). And you can easily enough model a relational schema to hold tweets (this is left as an exercise to the reader), but I tend to the view that you don’t need to if you have your friendly Hadoop cluster to hand. And this brings into play the next factor – economics.
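For readers who would rather not do the exercise, here is one possible answer, sketched with Python's built-in sqlite3 module. The table names, column names and sample tweets are all made up for illustration, and a production schema would carry far more (users, retweets, geo and so on). It simply shows that "text + tags" maps onto two tables without much fuss.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical two-table schema: one row per tweet, one row per (tweet, tag).
conn.executescript("""
    CREATE TABLE tweet (
        tweet_id   INTEGER PRIMARY KEY,
        author     TEXT NOT NULL,
        posted_at  TEXT NOT NULL,
        body       TEXT NOT NULL
    );
    CREATE TABLE tweet_tag (
        tweet_id   INTEGER NOT NULL REFERENCES tweet(tweet_id),
        tag        TEXT NOT NULL,
        PRIMARY KEY (tweet_id, tag)
    );
""")

# Invented sample data.
tweets = [
    (1, "alice", "2012-03-01T09:00:00Z", "Love the new store layout #retail #happy"),
    (2, "bob", "2012-03-01T09:05:00Z", "Checkout queue was endless #retail"),
]

for tweet_id, author, posted_at, body in tweets:
    conn.execute("INSERT INTO tweet VALUES (?, ?, ?, ?)",
                 (tweet_id, author, posted_at, body))
    # Pull the hashtags out of the free text into their own rows.
    tags = {w[1:].lower() for w in body.split() if w.startswith("#") and len(w) > 1}
    conn.executemany("INSERT INTO tweet_tag VALUES (?, ?)",
                     [(tweet_id, tag) for tag in sorted(tags)])

tag_counts = conn.execute(
    "SELECT tag, COUNT(*) FROM tweet_tag GROUP BY tag ORDER BY COUNT(*) DESC, tag"
).fetchall()
print(tag_counts)  # prints: [('retail', 2), ('happy', 1)]
```

Note what the schema does not do: the body column is still free text, so the sentiment part of sentiment analysis still needs some text processing somewhere, which is where the "do you need to?" question comes back in.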
The cost of storing a terabyte of data on a Hadoop cluster is a lot less than the cost of storing a terabyte in a corporate data warehouse. That’s because, in 2012, Hadoop is likely to be sitting on cheap commodity hardware, in some R&D corner of the enterprise without concerns (and extra expense) relating to governance, data cleansing and a host of other technologies and processes that are essential for managing proven high-value assets. Right now it’s a terabyte (or maybe a hundred terabytes) of unexplored data of uncertain value that might yield insight gold, though you don’t yet know how.
So why not start there? If and when you want to move your cool new Hadoop tweet analytics app into production you’ll have more costs to bear, to production-harden your R&D Hadoop install, but that’s a question for when the app has proven business value that will justify that cost.
So that’s why I split it out into Unstructured, Semi-Structured and Relational.
Relational speaks for itself – typically this is the standard fare for data warehouses – extracted from ERP and other operational systems. We already know what the data means and what its structure is.
Un-structured is at the other end of the spectrum. It might be in any form: text, audio, video. We definitely don’t know from looking at the data what it means – unless we apply human understanding to it.
Semi-structured is everything in between. Web logs in the form of XML documents, call data records from networks, statuses from components in a smart grid, GPS readings that locate a smart phone. None of these is relational data, but equally the internal structure of all of them is precisely known. So structured or unstructured? That’s not the point; the point is what you want to do with them.
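As a small illustration of "not relational, but precisely known", here is a sketch that flattens an XML web log into rows and then rebuilds each visitor's trail through the site. The log format, element name and attribute names are invented for this example; real web logs will look different.

```python
import xml.etree.ElementTree as ET

# Made-up log format for illustration only.
LOG_XML = """
<weblog>
  <hit ts="2012-03-01T09:00:00Z" session="s1" path="/home" status="200"/>
  <hit ts="2012-03-01T09:00:12Z" session="s1" path="/beer" status="200"/>
  <hit ts="2012-03-01T09:00:30Z" session="s2" path="/home" status="404"/>
</weblog>
"""


def hits_to_rows(xml_text: str) -> list:
    """Turn each <hit> element into a flat (session, ts, path, status) tuple."""
    root = ET.fromstring(xml_text.strip())
    return [
        (hit.get("session"), hit.get("ts"), hit.get("path"), int(hit.get("status")))
        for hit in root.iter("hit")
    ]


rows = hits_to_rows(LOG_XML)

# Rebuild each session's trail in time order.
trails = {}
for session, ts, path, status in sorted(rows, key=lambda r: (r[0], r[1])):
    trails.setdefault(session, []).append(path)

print(trails)  # prints: {'s1': ['/home', '/beer'], 's2': ['/home']}
```

The same rows could just as easily be loaded into a warehouse table or left as files on a Hadoop cluster, which is really the point: the format does not make the decision for you.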
In this post I’ve tried to indicate that there’s no cut-and-dried way of deciding how to analyze data just from its source, but I’ve only skimmed the surface and raised more questions than I’ve answered, so I’m planning a series of posts to look at all the other factors you need to consider in building big data analytics apps.