Spatial Data in the TwinFin Analytic Appliance
I’ve been learning more about Netezza’s spatial data capabilities. It’s been a fairly specialized option up to now, but with the next release of the TwinFin software, which majors on heavyweight in-database analytics, spatial data processing becomes available to all our customers.
When I was looking at examples of spatial data analysis I found this, which I think is the first historical example of spatial data solving a real-world problem; I don’t think Netezza can claim any credit for this one. It is stunning how obvious the solution is when you see the image. But then I thought about it a little and realised: yeah, it’s obvious because Dr. Snow had already theorised that the water sources were the problem, and there were only about 100 data points, so you could see them. What if he had shown water sources plus housing density, shops (by different type of produce), enclosed areas (another theory was that bad air transmitted cholera) and what I might call communal drainage access points (we’re talking Victorian London here)? Even for such a small sample he would have had to use some pretty complicated mathematics to extract the pattern (something to do with discarding outliers and then looking for minimum mean-square distances from possible sources?). That’s where visualization of spatial data lets you down: it’s great for showing the results of data analysis, but doing the analysis means complex processing of the data. And then what if it were 10 million data points and any number of possible variables that affect what happens at them?
We’ve been working with an organization that wants to assess household insurance risk based on proximity to obvious hazards (flood-vulnerable sites, potential terrorist targets, fire hazards, etc.). That’s easy enough to do for a single policy (assuming you’ve got a database with the co-ordinates of all the hazards and of all the insurable locations). But what they want to understand is where they have excessive risk exposure, so that premiums can be based not just on individual risk but on the aggregated local risk. The theme that runs through the projects of many customers and prospects using spatial data is that processing spatial data within the database of the Netezza MPP box means that calculations that took days on specialist GISs now take hours or minutes. If you’ve got 8,000,000 locations, each represented by a row in one table, and you want to compute the minimum distance to each of 15,000 locations, represented as rows in another table, having hundreds of cores each processing a bit of that huge calculation in parallel is a big help. And there’s no programming, because spatial data manipulation is just built into the SQL. What that means is that questions that couldn’t be answered in time to make any difference are now being answered. And what I haven’t taken account of here is merging spatial and other data. The obvious possibility is the use of location as an aspect of demographics for marketers targeting mobile users. One day I’ll be walking down a street in a strange city and I’ll get a text telling me there’s an Apple store two blocks away (actually I originally wrote Armani, but Apple seemed a better choice in this context).
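To make the "no programming" point concrete, the minimum-distance calculation above can be written as a single query. This is only a sketch: the table and column names are hypothetical, and I’ve used the OGC-style `ST_Distance` function name here — the exact spatial function names and syntax in a given Netezza release may differ.

```sql
-- Hypothetical schema; OGC-style function name used for illustration.
-- insured_locations: ~8,000,000 rows, one geometry point per policy.
-- hazards:           ~15,000 rows, one geometry point per hazard site.
SELECT p.policy_id,
       MIN(ST_Distance(p.location, h.location)) AS nearest_hazard_dist
FROM insured_locations p
CROSS JOIN hazards h          -- every policy paired with every hazard
GROUP BY p.policy_id;
```

The cross join implies on the order of 120 billion distance evaluations, which is exactly the kind of work that gets spread across hundreds of cores when it runs inside the MPP database rather than in an external GIS.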
Anyway, I’ve got to put my spatial learning on hold for a week; I’m off to Oracle OpenWorld, and I’m expecting a frosty welcome from some of my old colleagues because we (Netezza) have upset some Oracle folk a little lately by trying to point out how TwinFin is different from Exadata (pdf) and why it’s inherently more suited to complex analytic processing of huge data volumes. So I’ll have my conciliatory hat on; we freely acknowledge that the Oracle database is great for OLTP, especially on Exadata, and we’re not even in that market. We’ll just have to agree to differ on the analytics side of the story and leave that to the customers, which I’m more than happy to do. If you’re at OOW, come and see us at Booth #3141, Moscone West.