A Seriously Happy Customer Exposes a Common Dilemma
"I cannot imagine life without Netezza." – from a tweet by "noogle" (Twitter, 11 May 2010)
[Photo credit: 2006 photo of “the scramble” intersection in the Shibuya district of Tokyo, courtesy of "Bantosh" and Wikipedia]
Late in May, members of my team and I were in Tokyo's ultra-bustling Shibuya district for a few days on our "worldwide whirlwind training tour", briefing the global field sales teams on the details of the TwinFin i-Class product offering. The late-night scene, hairstyles and outfits there border on the outrageously hip. There are high-def billboards, electronic gadgets, and of course the bright lights of retailers, bars, clubs and restaurants all through Shibuya. Advances in high technology are virtually second nature to the people there. So with that as the backdrop, imagine the surprise of hearing over a beer or two (see earlier reference to bars & clubs in Shibuya) that a customer "could not live without Netezza".
We're proud of our highly referenceable customer base at Netezza and our "easy to do business with" relationships with our customers and partners. In my six years with the company I've met a pretty fair number of really enthusiastic customers, including people who have held "welcome" parties for their Netezza systems, or who use "Netezza" as a verb ("Did you Netezza that data?") or even an adverb ("It's Netezza easy."). But I can't recall any customer who said that they "could not imagine living without Netezza".
Simple self-promotion is not the real point of this post though. The real point is the dilemma that noogle presents us with in his 40+ word tweet. It's something that business managers and analysts face on a daily basis: what is more important –
- being able to look for strategic and/or tactical competitive nuggets by performing SQL OLAP analytics on their full, atomic-level dataset; or
- looking for that guidance by using advanced analytical toolsets on subsets or aggregations of their data that are extracted from the data warehouse?
Here's the whole tweet by noogle in its original form:
データが莫大になると分析が不可能になる。少ないデータを複雑なアルゴリズムで分析するよりも、莫大なデータを単純なアルゴリズムで分析する方が有益。統計学とは逆。アホかという量のデータ分析の手助けするのがNetezza。もうあたしはNetezzaの無い世界では生きていけない。
And here's a translation of it into English [parenthetical comments and emphasis are mine]:
When data is huge, complex analytics are impossible. It’s far more beneficial analyzing massive data with simple [SQL] logic, rather than analyzing small data with complicated analysis. This is opposite of statistics [based on sampling techniques]. Analyzing data which is “crazy massive” is Netezza. I cannot imagine life without Netezza.
It turns out that noogle is a long-time user of advanced analytics and predictive techniques. He knows their value, but his tweet exposes a weakness of today's typical analytical environment. Because advanced analysis cannot be performed inside the database, most of that work (if performed at all) is done on external servers, using data sets that are extracted (filtered, sampled and/or aggregated) from the data warehouse.
The extraction step adds latency and limits the "currency" of the data. Depending on whom you ask, it also limits the accuracy of the results. For instance, looking at aggregations or samples may give you a sense of the "big picture" but will not necessarily uncover the needle in the haystack (e.g., fraud detection) or the impact of a long tail that can be exploited in a particular business.
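To put a rough number on the needle-in-the-haystack point, here is a minimal Python sketch. The table size, the count of fraudulent rows and the 1% sample are made-up figures for illustration only, and the function name is hypothetical; nothing here is Netezza or TwinFin code. It simply computes the chance that a simple random sample, drawn without replacement, contains none of the rare rows.

```python
def prob_sample_misses_all(total_rows, rare_rows, sample_rows):
    """Chance that a simple random sample (drawn without replacement)
    contains none of the rare rows.

    Equals C(total - rare, sample) / C(total, sample), computed here as a
    short product so that no huge binomial coefficients are needed.
    """
    if not (0 <= rare_rows <= total_rows and 0 <= sample_rows <= total_rows):
        raise ValueError("rare_rows and sample_rows must lie in [0, total_rows]")
    prob = 1.0
    for i in range(rare_rows):
        # Chance that the i-th rare row also falls outside the sample,
        # given that the previous rare rows did.
        prob *= max(total_rows - sample_rows - i, 0) / (total_rows - i)
    return prob


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 transactions, 5 of them fraudulent.
    total, fraud = 1_000_000, 5
    one_percent = prob_sample_misses_all(total, fraud, 10_000)
    full_scan = prob_sample_misses_all(total, fraud, total)
    print(f"1% sample misses every fraud row with probability {one_percent:.3f}")
    print(f"Full scan misses every fraud row with probability {full_scan:.3f}")
```

Under these assumed figures, a 1% sample misses all five fraudulent rows about 95% of the time, while a scan of the full data set cannot miss them. Real fraud detection is of course more involved than finding known rows, so treat this as arithmetic behind the intuition rather than a model of any actual workload.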
So noogle's choice is to use the analytic horsepower of TwinFin over the sampling techniques. But anyone limited to the set-based logic of SQL, even when aided by user-defined functions, is again limited in the predictive visibility that those tools can provide. Faced with the dilemma, this customer chose to analyze all the data rather than statistically sample it and perform advanced analytics on the sample set. Having an answer to that dilemma is precisely what has driven the advent of the i-Class functionality for TwinFin.
We're excited about TwinFin i-Class, but I'm interested in what others may have to say about this. Does your company employ advanced analytical techniques, and how have you reconciled the "sampling" versus "full data set" question in your business? And what, from your perspective, are the prospects and pitfalls of doing "crazy complicated analytics on crazy massive data" all in one simple, high-performance data warehouse appliance?