Semi-Structured Data Analytics: Relational or Hadoop Platform? Part 2

August 11, 2012

For some vendors, the only use case for unstructured data is to turn it into structured data to analyze it in a relational database. This is a legitimate use case, especially where you want to analyze it in conjunction with relational data, or where you want to make it available for what I called "speed of thought" analysis (more on that term below).

So if you need to get unstructured data into your relational warehouse, Hadoop is a good way of achieving that. Lots of customers also use Hadoop for ETL on semi-structured data, but there it overlaps with established data integration (DI) tooling; the leading players, including IBM's InfoSphere Information Server, are already there, and there is no knock-down argument one way or the other. And you'd be unlikely to use Hadoop to extract, transform and load from one relational database to another; why use a less-automated approach when the DI vendors have had years of practice tool-assisting the process?
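
To make that concrete, here is a minimal sketch of what a transform-before-load step might look like as a map-only Hadoop job in plain MapReduce Java. The input layout, field names and output format are all assumptions invented for the example, not any customer's schema or IBM tooling; the point is only that Hadoop can flatten semi-structured records into delimited rows a warehouse bulk loader can pick up.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Map-only job: flatten semi-structured event lines into tab-delimited
 *  rows that a warehouse bulk loader can ingest. The field layout below
 *  is an illustrative assumption, not any particular product's format. */
public class LogToWarehouseRows {

  public static class ParseMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input, e.g.: 2012-08-11T10:15:00 user=42 action=view item=ABC
      String[] tokens = line.toString().trim().split("\\s+");
      if (tokens.length < 4) {
        return; // skip malformed records rather than fail the whole job
      }
      String timestamp = tokens[0];
      String user = tokens[1].replace("user=", "");
      String action = tokens[2].replace("action=", "");
      String item = tokens[3].replace("item=", "");
      context.write(NullWritable.get(),
          new Text(timestamp + "\t" + user + "\t" + action + "\t" + item));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log to warehouse rows");
    job.setJarByClass(LogToWarehouseRows.class);
    job.setMapperClass(ParseMapper.class);
    job.setNumReduceTasks(0); // pure transform, no aggregation
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```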

There are many use cases that call for analytics on the data in its source, unstructured form – for example, low-latency analysis of very large volumes of instrumentation data (think Vestas).

But there is a wider set of use cases focused on innovation. In these cases we don't necessarily know what we're looking for, or we do but we're not sure it's there, or we don't know how to find it. This is exploratory analysis. It's Hadoop as a sandbox. It's not exploratory in the sense of learning about the technology (that's part of any new technology adoption) but exploratory in the sense of understanding what the data can tell us. For example, a telco in London used Hadoop to explore raw network CDRs (call detail records) to look for subscriber usage patterns that might indicate a propensity to churn. Retailers are experimenting with sensor data to track individuals in-store to try to optimize the layout of bricks-and-mortar stores – just as e-tailers use web clicks to optimize e-stores.
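
As an illustration of that first kind of exploration, the sketch below rolls raw CDRs up to calls and minutes per subscriber per week, a plausible opening move when hunting for declining-usage patterns. The comma-separated record layout is an assumption made up for the example, not a real telco format, and a genuine churn investigation would obviously go much further.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Rolls raw CDRs up to calls and minutes per subscriber per week, a first
 *  exploratory cut when looking for declining usage. The comma-separated
 *  field layout is invented for the example. */
public class CdrUsageByWeek {

  public static class CdrMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Assumed input: subscriberId,week,durationSeconds,...
      String[] fields = line.toString().split(",");
      if (fields.length < 3) {
        return; // tolerate malformed records
      }
      context.write(new Text(fields[0] + "_" + fields[1]), new Text(fields[2]));
    }
  }

  public static class UsageReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> durations, Context context)
        throws IOException, InterruptedException {
      int calls = 0;
      long seconds = 0;
      for (Text d : durations) {
        try {
          seconds += Long.parseLong(d.toString().trim());
          calls++;
        } catch (NumberFormatException e) {
          // ignore records with unparsable durations
        }
      }
      context.write(key, new Text(calls + "\t" + (seconds / 60)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cdr usage by week");
    job.setJarByClass(CdrUsageByWeek.class);
    job.setMapperClass(CdrMapper.class);
    job.setReducerClass(UsageReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```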

If the data is already in a relational warehouse, there are good analytic tools to explore it there, but if it’s not been loaded and if it’s high volume (expensive to load and manage in a warehouse), then a Hadoop cluster may be a better place to be – if you have access to the necessary analytic skills and tools.

People often talk about repeatable reporting as a natural relational warehouse use case, and in most organizations there is already a mass of repeated or repeatable reporting, analytics and query apps – all part of the standard workload on a relational warehouse; of course it makes no sense to migrate that to Hadoop. But if you want to do repeated analytics on a mass of short-lifetime data that is not in relational format to start with – and that maybe represents a lot of data for little valuable content – that smells like Hadoop spirit. A good example might be social media monitoring: looking for volumes of references to a product, category or competitor, and identifying vocal and critical customers (think "United Breaks Guitars"). There are huge volumes to trawl through; most of it will add no value beyond incrementing a counter, and once it's been counted it's done with.
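
A minimal sketch of that counting job in classic MapReduce might look something like this. The watch-list of terms and the one-post-per-line input format are assumptions made for the example; a real monitoring pipeline would also deal with tokenization, spam filtering and sentiment, all omitted here.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts how often each watched term appears across a mass of social
 *  media posts, one post per input line. Terms and format are illustrative. */
public class MentionCount {

  public static class MentionMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    // Hypothetical watch list; in practice this would come from configuration.
    private static final String[] TERMS = {"ourbrand", "competitorx", "competitory"};

    @Override
    protected void map(LongWritable offset, Text post, Context context)
        throws IOException, InterruptedException {
      String text = post.toString().toLowerCase();
      for (String term : TERMS) {
        if (text.contains(term)) {
          context.write(new Text(term), ONE); // increment the counter and move on
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      context.write(term, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mention count");
    job.setJarByClass(MentionCount.class);
    job.setMapperClass(MentionMapper.class);
    job.setCombinerClass(SumReducer.class); // pre-aggregate on each node
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```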

You might also be doing something analogous on sensor data, and if that sensor data is not going to end up in the relational warehouse anyway, there are "fewer moving parts", and probably cheaper ones, if you analyse it in its raw format. Incidentally, if it's data streaming off a network, it may be that IBM InfoSphere Streams, analysing it in motion, is the right tool, but let's not get distracted. If the data is acquired in large chunks (typical of social media data), or the analysis needs to reference large volumes of it at once, or is sufficiently complex (think identify, categorize, summarize, correlate, rank), that's not Streams.

A few years ago, when Hadoop was a wobbly baby elephant, ad hoc analytics in a relational data warehouse was a nightmare. Ad hoc (there's a clue in the name) means you don't know what question you're going to ask of the data, and often, once you have the answer, you find you really wanted to ask a slightly different question. Sometimes this is called evolutionary analysis, and it differs from the exploratory analysis we talked about before only in that you know what the structure and content of the data are – you just don't know what it can tell you.

In a conventional relational database, you typically optimize performance using a vast array of techniques such as data partitioning, indexing, cubing, summarizing, materialized views and so on. But that optimization is always aimed at making a particular query, or family of queries, perform well. So these unpredictable, constantly evolving queries can easily (normally!) be un-optimized, and our poor ad hoc analyst ends up waiting for interminable queries to complete while other users of the warehouse curse them for consuming masses of the limited resources available.

But in the last few years the IBM Netezza appliance has been solving exactly this problem, delivering ad hoc analysis with "speed-of-thought" analytic performance (think Nielsen[i] or BSkyB, among others).

So, would you do ad hoc analysis of relational data in Hadoop? Not really. As I said, it was a legitimate use case for the baby elephant to aspire to, but now that it has come of age, relational solutions have also emerged (specifically IBM Netezza), exploiting the same massively parallel architecture that Hadoop deploys.

The same is true, to some extent, for semi-structured data, which, as we discussed above, can go either way. But relational does, for the moment, have an advantage for ad hoc work, because the person who understands how to wrangle data in the most mathematically abstruse ways probably has more productive tools at her disposal in the relational world right now. In the Hadoop world, the choice is to write some more MapReduce Java classes, or to generate Java from tools that are not yet quite as productive or comprehensive.

Of course, if the data are genuinely unstructured, then the only difference from the exploratory analytics described above is that you do know what the data mean; you still want to perform this "suck-it-and-see" analysis. In which case it may be a toss-up between using Hadoop to analyse the data in its native form and transforming it to relational form to gain the possible benefit of relational-based analytic tooling. It's swings and roundabouts, so sometimes the decision will come down to skills. And by the way, there's a good case for cutting your Hadoop teeth (or tusks) on data you understand.

So to summarise: if it's data in motion (remember the babies being monitored), it has to be real-time. It has to be Streams. That's the easy one.

If it’s unstructured data, at rest, the best place to start is IBM InfoSphere BigInsights, though you may load data into the relational warehouse subsequently for further insight.

If it’s relational data, it’s unlikely you are going to move it to Hadoop.

If it’s semi-structured you have a choice, and you’ll be influenced by these other development factors:

  • If you don’t know what the data has to tell you and it’s unstructured data – you’re an analytic explorer & BigInsights gives you the capability to look at any data any way you like.
  • If it’s structured data already in the warehouse – well, a few years ago even that would have been a good candidate to move to Hadoop, because traditional data warehouses were not good at ad hoc analysis. There may be other reasons to use Hadoop, but relational ad-hoc query performance (at least on IBM Netezza) is no longer one of them.
  • At the other end of the spectrum, if it’s known, structured data and a known requirement – that’s relational data warehouse home territory.

[i] This is quite an old story, but still cool, and Nielsen have moved even further forward since.