Semi-Structured data analytics: Relational or Hadoop platform? Part 1
The term ‘semi-structured’ came into use to describe the bone that SQL, NoSQL and the other camps can really fight over.
Hardliners from both sides will claim you can use their product for all analytics. But truly unstructured data is much more naturally processed in Hadoop than in a relational database because you don’t need to worry about your evolving understanding of the data. By contrast, with relational, you have to impose a schema on any data before you can load it. And with text, audio, video or mixed media, you have to explore the actual data before you can understand it.
But what is semi-structured data?
Let’s start with an example. Call Data Records (CDRs) on a mobile telco’s network indicate, amongst other things, who called whom, when and for how long. Telcos use this basic information to prepare bills for their subscribers. And plenty of mobile providers have saved themselves significant amounts of money by also using CDRs for revenue assurance, cross-checking subscriber bills against the pass-through bills from partner telcos that completed their subscribers’ calls (see the Carphone Warehouse story).
The traditional approach to loading data warehouses is by cleansing, aggregating, summarizing, whatever-izing data from your operational systems – which are already relational. CDRs are not relational, although they are structured. In a raw state they are packets of bits but each bit is well understood. This little chunk is the timestamp, this is the caller number, this is the callee number etc. And those are all datatypes that can easily be mapped into relational schemas.
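To make that mapping concrete, here is a minimal sketch in Python. The record layout is invented for illustration (real CDR formats vary by vendor and switch), and the names `CDR_FORMAT`, `Cdr`, `parse_cdrs` and `load_into_table` are hypothetical. It assumes fixed-size, big-endian records holding a start timestamp, caller number, callee number and call duration, and it uses an in-process SQLite table as a stand-in for the warehouse.

```python
import sqlite3
import struct
from collections import namedtuple

# Hypothetical layout, assumed for this sketch only:
#   I = 4-byte unsigned start time (Unix seconds)
#   Q = 8-byte unsigned caller number
#   Q = 8-byte unsigned callee number
#   H = 2-byte unsigned duration in seconds
# ">" means big-endian with no padding between fields.
CDR_FORMAT = ">IQQH"
CDR_SIZE = struct.calcsize(CDR_FORMAT)  # 4 + 8 + 8 + 2 = 22 bytes

Cdr = namedtuple("Cdr", "start_ts caller callee duration_s")


def parse_cdrs(raw):
    """Yield one Cdr per fixed-size record in a raw byte string.

    The length of raw must be a whole multiple of CDR_SIZE.
    """
    for fields in struct.iter_unpack(CDR_FORMAT, raw):
        yield Cdr(*fields)


def load_into_table(conn, records):
    """Insert parsed records into a simple relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cdr ("
        "start_ts INTEGER, caller INTEGER, callee INTEGER, duration_s INTEGER)"
    )
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?, ?)", records)
    conn.commit()
```

Each chunk of the raw record becomes a typed column, which is all the schema imposition a record like this needs. Calling `load_into_table(sqlite3.connect(":memory:"), parse_cdrs(raw))` on a buffer of such records gives you a table you can query with ordinary SQL.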
In fact what I’m calling ‘semi-structured’ some people call ‘structured, but not relational’. It’s also often called instrumentation data, when it comes from sensors or other instrumented sources. So it can easily be transformed into relationally structured data, but it can equally be loaded directly into the Hadoop Distributed File System (HDFS) and processed in raw form there.
The general lesson to learn is that semi-structured data swings both ways, so the technology you use to deal with it must depend on other factors. For billing we’ll go relational – because we need to integrate with other data and systems in the relational warehouse. But for other applications, we might not.
For example, say that we were doing affinity analysis in our CDRs, trying to discover the social networks used by our subscribers. Our goal is to identify the influencers within those networks who represent perfect targets for our next handset launch campaign.
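As a rough illustration of where such an analysis might start, the sketch below ranks subscribers by how many distinct people they exchange calls with. This is only a crude proxy for influence that I am assuming for the example, not a full affinity analysis, and the function name `rank_by_distinct_contacts` and the sample subscriber IDs are hypothetical.

```python
from collections import defaultdict


def rank_by_distinct_contacts(calls):
    """Rank subscribers by number of distinct contacts.

    calls is an iterable of (caller, callee) pairs taken from CDRs.
    A call in either direction counts as a contact; repeat calls
    between the same pair and self-calls add nothing.
    Returns a list of (subscriber, contact_count), highest count first,
    with ties broken by subscriber ID.
    """
    contacts = defaultdict(set)
    for caller, callee in calls:
        if caller == callee:
            continue
        contacts[caller].add(callee)
        contacts[callee].add(caller)
    return sorted(
        ((subscriber, len(others)) for subscriber, others in contacts.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Made-up sample: A talks to three people, so A ranks first.
sample_calls = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("A", "B")]
ranking = rank_by_distinct_contacts(sample_calls)
```

A real campaign would weigh much more than contact counts (call frequency, duration, who calls first), but even this shape of computation runs happily on either platform: as a GROUP BY in a warehouse or as a MapReduce job over raw records.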
You can do that on either platform. You might make your decision based on the skills and tools you have. And it might rest on cost, because Hadoop, the open source core of BigInsights, was explicitly designed to run on grids of low-cost, commodity servers. So especially as you explore the data, and in this case as you develop your ideas about affinity analysis, a cheaper platform is easier to justify. It makes sense to do this kind of exploration away from your production warehouse platform, which only needs a subset of CDRs for billing. There’s much more data the network can give you (so-called XDRs) that might affect your analysis.
The way to resolve the potential fight about semi-structured data analytics is by looking to the next level. The data doesn’t lend itself exclusively to either platform. Is there a close affinity to existing relational systems and data? Is there a need to explore and evolve an understanding? Even these two questions are only rules of thumb to guide further investigation of real needs and current situations, which I plan to come back to in a later post.
There is a genuine overlap of capability between Hadoop and relational warehouses here, so we can’t make the decision just on data type. And anyone who says you can is just betraying their personal preference or maybe commercial alignment. It’s about use case and it’s about organizational circumstances and skills, but it’s not about data type.
If you want to know more about the big data story, can I recommend this virtual event, "Big Data: The Art of the Possible," on 28th June at 10 AM British Summer Time? It’s an IBM event but we’ve got analyst and industry speakers as well, including Philip Howard from Bloor Research. I’ll be lurking in the background of most of the sessions to take any questions.