Granting a Gateway to Powerful Analytics: Part 1

Prevent direct business intelligence access to data structures for agile data analysis

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

Many business intelligence (BI) tools have such a breadth of functionality that their loyal aficionados claim they can slice bread and julienne fries. Hype aside, the semantic layers and the rich drag-and-drop and point-and-click capabilities really do enable analysts to be far more productive than they can be with primitive query interfaces. Building an interface from one of these animals into the IBM® PureData™ Analytics platform powered by IBM Netezza® technology would seem to be a no-brainer. The appliance does analytics. The BI tool is born for analytics. It's a match made in heaven.

Why, then, do even the most experienced line-of-business users of the tool, and even product support engineers, become flummoxed, dazed, and confused when attempting to implement analytics through Netezza? Is there a trick here? Is there perhaps some kind of machine voodoo they must perform in the evening hours under a full moon? Maybe they need to first pour potions into its network interfaces to squeeze out the BI nectar they so desperately crave?

The problem isn't really about the Netezza appliance or the BI tool. The Netezza machine is a very physical environment. After the data is laid out on the machine's physics and optimized to deliver top performance, the BI tool should not be allowed to dictate additional, usually highly inefficient, data structures.

Adapting business intelligence to the platform

As a case in point, one very popular BI tool likes to lay down star schemas but really prefers snowflake schemas. Snowflakes depend on indexes to perform, and Netezza has no indexes, so Netezza never met a snowflake schema it liked. If the BI tool really wants a snowflake, how do its end users accommodate this schema? The answer is to adapt the tool to the Netezza machine, but never to allow the BI tool to dictate the physical data structures. They will be consummately inefficient no matter how elegant they look on paper.

Another important issue arises when the BI tool needs to perform a complex multi-way join or a complex left-outer join. Some tools know to break this problem into smaller parts; others try to build one big, fat query. The smaller-parts formula works best in Netezza, but not all BI tools are created equal.

As a simple example, one tool broke the problem into three query stages with intermediate tables in between. The first query summarized the table's quantities from 100 million rows into 100 thousand rows. The next two queries then filtered the unwanted results from this set, yielding around 500 rows. The entire operation required several minutes to complete. But look at what the machine was asked to do: summarize the entire table into a result set of which more than 99 percent was about to be discarded. That is not efficient. The more effective approach, which takes mere seconds to execute, is to apply the two filters up front and then perform the summary on the radically reduced subset of data. This makes perfect sense, but there is no hint, coding trick, or metadata configuration that can compel the BI tool to run the filters before the summary.
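The ordering issue can be illustrated in miniature. The sketch below uses Python's sqlite3 module with a tiny hypothetical sales table (the table, columns, and filter values are illustrative stand-ins, not from the article's actual workload) to show that filtering before summarizing returns the same answer as summarizing everything and then filtering, while asking the engine to aggregate only the rows that survive the filters:

```python
import sqlite3

# Hypothetical stand-in for a 100-million-row fact table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, product TEXT, qty INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "widget", 10),
        ("east", "gadget", 5),
        ("west", "widget", 7),
        ("west", "gadget", 3),
        ("east", "widget", 2),
    ],
)

# Inefficient pattern: summarize the whole table first, then discard
# almost all of the summary with filters applied afterward.
summarize_then_filter = cur.execute(
    """
    SELECT region, product, total
    FROM (SELECT region, product, SUM(qty) AS total
          FROM sales
          GROUP BY region, product)
    WHERE region = 'east' AND product = 'widget'
    """
).fetchall()

# Efficient pattern: apply the filters up front, then summarize only
# the radically reduced subset.
filter_then_summarize = cur.execute(
    """
    SELECT region, product, SUM(qty) AS total
    FROM sales
    WHERE region = 'east' AND product = 'widget'
    GROUP BY region, product
    """
).fetchall()

# Same answer either way; only the amount of work differs.
assert summarize_then_filter == filter_then_summarize
print(filter_then_summarize)  # [('east', 'widget', 12)]
```

On five rows the difference is invisible, but on 100 million rows the first pattern aggregates everything before throwing nearly all of it away, which is exactly the behavior described above.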

Getting past a simple flaw

More important than working around this particular flaw is a larger principle: the BI tool should not have direct access to the physical structures at all. In a PureData Analytics system powered by Netezza, agility is a key strength, and the freedom to tune data structures for enhanced performance must be preserved. Recall that deriving stratospheric performance from Netezza is not about tuning queries, but about configuring data on the machine's physics. Once the BI tool is allowed direct access to the physical structures, the freedom to change or tune them, whether to boost performance or to support new end-user bases, is abdicated. Any change in the physical structure then forces a change in the BI tool; the two are tightly coupled. Is this really the intended result?

Part 2 of this article dives into a simple and elegant approach that overcomes these issues, frees the BI tool to do its job well, and maintains top performance in the machine. Please share any thoughts or questions in the comments.