Exploring Uncharted Data: Is there any insight out there?

VP product marketing, Acunu

The biggest table in any Netezza database that I know of has over 600 billion rows! That’s the claim made by our customer, Catalina Marketing.

So although most of the data in the world is not relational, there is a huge amount of relational data, and IBM technologies are more than capable of performing the most complex analytics on it. Netezza has extensive libraries of in-database analytic functions [1] to support SPSS, SAS, R and other analytic tools and languages. And Netezza’s special capability for handling ad-hoc queries means that if your data is relational, or can be mapped definitively to a relational schema, like the CDRs I wrote about in a previous post, it is a great platform for analytics. If!

And of course, in many cases by the time it has been mapped to a relational schema, like the smart grid data PNNL analyze first in Streams and then in Netezza [2], it may well already have delivered value, and the warehouse is the place to uplift that value.

But there are some occasions when you would consider analyzing relational data in Hadoop – when it’s cheaper to offload an application (think queryable archive) or when it was never cost-effective to address the use case in the first place. For most organizations the additional cost of acquiring the skills plus the additional development time mean these use cases are not yet being widely exploited, but as skill levels and confidence in the Hadoop platform rise, they will be.

Some vendors (that’ll be Oracle then) will tell you that the relational database is the only place for data to be analyzed, and that unstructured data technologies like Hadoop (the open source core of IBM’s BigInsights) are for parsing the unstructured data before loading it into the database – a sort of glorified ETL engine. Well you can do that with Hadoop, and there are good Hadoop use cases for data ingest and cleanse. I’ll get back to those in a later post, but if that’s all you do with Hadoop it’s not necessarily the optimum solution for your use case.

For example, if you want to do sentiment analysis of forum posts, blog posts, blog comments and tweets, the most obvious place to do it is in a Hadoop grid, because you don’t know at that stage what the relational schema to hold the data would look like. Sure, you can load it into a blob, but that’s not playing to an RDBMS’s strengths. Why not just write some Python or Java, in a series of MapReduce jobs that allow you to explore the data, its content and its structures [3]? That way you develop an understanding of what it has to tell you. It’s incremental analysis again, but this time your problem is not what queries you want to run, it’s what aspects of the data you want to look at in your queries. If the data is already in a relational schema that’s relatively straightforward, but if you’re still exploring what the data means, having to rebuild your schema every time you decide to change your analysis will really slow you down. Hadoop can be more flexible for this kind of exploration.
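To make that kind of structural exploration concrete, here is a minimal sketch in Python. It is not a real Hadoop job: the map and reduce steps run in-process on a handful of made-up records, and the record layout, the field names and the function names (`map_structure`, `reduce_counts`) are all assumptions for illustration. The idea is simply to count which combinations of fields turn up in each source, before anyone has committed to a schema.

```python
import json
from collections import Counter

# Hypothetical raw input; in a real job these lines would be read from files in HDFS.
RAW_LINES = [
    '{"source": "twitter", "user": "a", "text": "love the new phone"}',
    '{"source": "twitter", "user": "b", "text": "meh", "retweets": 3}',
    '{"source": "forum", "author": "c", "title": "Help", "body": "it broke"}',
    '{"source": "forum", "author": "d", "title": "Tip", "body": "try this"}',
    'not json at all',
]


def map_structure(line):
    """Map step: emit ((source, field signature), 1) for one input line."""
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # Keep count of what we cannot parse instead of failing the job.
        yield ("unparseable", ""), 1
        return
    signature = ",".join(sorted(k for k in record if k != "source"))
    yield (record.get("source", "unknown"), signature), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per key, as a reducer would after the shuffle."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run(lines):
    """Chain map and reduce locally over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_structure(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    for (source, signature), count in sorted(run(RAW_LINES).items()):
        print(source, signature or "-", count)
```

Run over real data, a report like this tells you which sources share a structure and which do not, which is exactly the knowledge you would need before designing a relational schema.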

The key is that to use an RDBMS to analyze, you need a schema – you have to have decided what the structure of the data is, and if you got that wrong you have to scrap it and reload your data into a redesigned schema. A good example might be if you are looking at share of mind in particular social media sources for your products and competitor products. If you then realize you’re looking at the wrong sources, or you need to add more, you have to change your database schema to accommodate the new data, reload it, and recode your queries and analytics. In Hadoop you can accommodate the change with a change of the code. This is what is often called exploratory analysis – not only do you not know what you’re looking for, you don’t know where to look. And there’s no doubt Hadoop, and especially IBM BigInsights, with its analytic accelerators and tooling, can prove more agile.

Though beware, because some Hadoop proponents are thinking of old-style relational databases when they talk about the difficulty of re-structuring. Netezza doesn’t have those constraints (indexing, partitioning, replicating etc.), so the balance shifts, but genuinely exploratory analysis can still be easier in Hadoop, especially because your Hadoop grid is not going to be subject to the same quality and security rigor as the data warehouse – home of highly sensitive, high-value corporate data assets. Other factors that will weigh in the pan will be skills, economics and attitude – am I using a Hadoop sandbox not just to evaluate this particular use case, but to build my skills and capability to bring a load more use cases into play?
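As a rough illustration of “a change of the code”, here is a small Python sketch of the share-of-mind example. Everything in it is hypothetical: the product names, the record layouts, and the helper names (`EXTRACTORS`, `share_of_mind`). The point it tries to show is that bringing in a new source means registering one more function, with no schema to redesign and nothing to reload.

```python
from collections import Counter

PRODUCTS = ["alpha", "beta"]  # hypothetical product names


def text_from_tweet(record):
    return record["text"]


def text_from_forum_post(record):
    return record["title"] + " " + record["body"]


def text_from_blog(record):
    return record["content"]


# One entry per source we currently understand (assumed layouts).
EXTRACTORS = {"twitter": text_from_tweet, "forum": text_from_forum_post}


def share_of_mind(records, extractors):
    """Return each product's share of the records that mention any product."""
    mentions = Counter()
    for source, record in records:
        extract = extractors.get(source)
        if extract is None:
            continue  # a source we have not learned to read yet
        words = extract(record).lower().split()
        for product in PRODUCTS:
            if product in words:
                mentions[product] += 1
    total = sum(mentions.values())
    return {p: mentions[p] / total for p in PRODUCTS} if total else {}


RECORDS = [
    ("twitter", {"text": "Alpha is great"}),
    ("forum", {"title": "Beta vs Alpha", "body": "beta wins"}),
    ("blog", {"content": "beta beta beta"}),
]

before = share_of_mind(RECORDS, EXTRACTORS)

# Adding the blog source is a one-line code change.
after = share_of_mind(RECORDS, {**EXTRACTORS, "blog": text_from_blog})
```

In this toy data the blog post is ignored in `before` and counted in `after`, so the result shifts as soon as the new source is registered.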

Exploratory analysis of huge volumes of unstructured data is a BigInsights sweet spot. But what do you do when you know you’ve found the high-value data (where to look) and the high-value analysis (what you’re looking for)? You may just continue to run that analysis on the new data as it arrives – as a production job. No problem with that, but you might want to combine it with other data in your warehouse for further analysis, and now that you’ve identified the data of value, it might well make sense to load that into the relational warehouse. In that case maybe you could describe BigInsights as an ingest tool, but without the ability to do agile exploratory analysis in the first place, you’d never have known what to extract and load. And that’s why Hadoop as an analytic sandbox is an invaluable tool.
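A sketch of that last step, under some loud assumptions: Python’s built-in `sqlite3` stands in for the warehouse (it is not Netezza), and the table names, columns and figures are invented. Once exploration has settled on what is worth keeping, the output has a fixed shape, so it can be loaded into a table and joined with data that already lives in the warehouse.

```python
import sqlite3

# Hypothetical output of the exploratory job: (day, product, mentions).
FINDINGS = [("day-1", "alpha", 120), ("day-1", "beta", 45)]

# Hypothetical data already in the warehouse: (day, product, units_sold).
SALES = [("day-1", "alpha", 10), ("day-1", "beta", 4)]


def build_warehouse():
    """Create an in-memory stand-in warehouse holding an existing sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, product TEXT, units_sold INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", SALES)
    return conn


def load_findings(conn, rows):
    """Load the findings, now that their structure is known and stable."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product_mentions "
        "(day TEXT, product TEXT, mentions INTEGER)"
    )
    conn.executemany("INSERT INTO product_mentions VALUES (?, ?, ?)", rows)
    conn.commit()


def mentions_vs_sales(conn):
    """Combine the loaded findings with the existing warehouse data."""
    query = (
        "SELECT m.product, m.mentions, s.units_sold "
        "FROM product_mentions AS m "
        "JOIN sales AS s ON s.day = m.day AND s.product = m.product "
        "ORDER BY m.product"
    )
    return conn.execute(query).fetchall()


conn = build_warehouse()
load_findings(conn, FINDINGS)
combined = mentions_vs_sales(conn)
```

The schema here is only three columns wide because the exploration has already decided which three things matter; that decision is the part the sandbox made possible.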

[1] Registration required to download this white paper.

[2] I talked about this and referenced the source in a previous post here

[3] I’ve heard people call this kind of data poly-structured, because it may have many structures – depending on how you want to look at it. I’m going to avoid the term because it has a Greek prefix on a Latin stem, which any philologist will tell you is a no-no.