Stress Relief: Part 2
Exploit the simple-by-design characteristics of Netezza technology to avoid the slowdown of system stress
After discussing several introductory concepts in part 1, along with the primary constructs at our fingertips for achieving performance glory, tying them all together with a bow might be hard to do in one follow-up article. Nevertheless, in keeping with the IBM® Netezza® technology’s where-not-to-look principle for scanning data, be aware of some common pitfalls so that you know what not to do. For example, consider an organization with a table containing more than 160 billion rows. It gets sub-second response time on the majority of queries, and only a few of them take longer than two seconds. This response is possible because the zone maps align with the primary query path. ORGANIZE ON can establish this alignment fairly transparently. However, select only one ORGANIZE key to start with. Using more than one ORGANIZE key requires careful consideration and analysis of query history, because the keys are not independent; they operate like a nested group-by or nested sort.
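As a minimal sketch, a single ORGANIZE key can be declared when the table is created, or added later and applied with a groom pass. The table and column names here are hypothetical, and the exact GROOM options vary by release, so verify against the documentation for your appliance:

```sql
-- Hypothetical fact table: one distribution key, one ORGANIZE key.
-- ORGANIZE ON seeds the zone maps for the primary query path.
CREATE TABLE sales_fact (
    sale_date   DATE,
    store_id    INTEGER,
    product_id  INTEGER,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (store_id)
ORGANIZE ON (sale_date);

-- Changing the ORGANIZE key on an existing table only takes effect
-- after a groom physically reorders the records:
ALTER TABLE sales_fact ORGANIZE ON (sale_date);
GROOM TABLE sales_fact RECORDS ALL;
```

Because the keys nest like a nested sort, a second ORGANIZE key would only help queries that also constrain on `sale_date`; start with one and consult query history before adding more.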
For distribution, opt for a common join key between the fact table and the largest, most active dimension. The other dimensions won’t matter nearly as much as these two relationships. The more tables that can share this key, the better, but the machine won’t leverage the key unless it actually appears in the join. Start with only one distribution key. Keep in mind that when multiple keys are in the distribution list, a join must reference all of them to get the desired co-location effect, and that is rarely true of most joins. Pick one and run with it. More importantly, leaving the distribution key out of a join because it does not technically belong there is a mistake. In one scenario, an organization distributed both its order-header table and its order-detail table on order-header-id, but its queries never actually used that key in the join. Interestingly, using the key in the join would not affect the functional outcome of the query. Adding the distribution key to the join not only boosted performance by more than five times, it also increased machine concurrency, because it made the join more efficient.
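The scenario above can be sketched as follows; the table and column names are illustrative, not taken from the organization in question. Both tables distribute on the shared key, and the join names that key explicitly so each data slice can join locally:

```sql
-- Co-locate header and detail rows on the same data slice.
CREATE TABLE order_header (
    order_header_id BIGINT,
    customer_id     BIGINT,
    order_date      DATE
) DISTRIBUTE ON (order_header_id);

CREATE TABLE order_detail (
    order_detail_id BIGINT,
    order_header_id BIGINT,
    product_id      INTEGER,
    quantity        INTEGER
) DISTRIBUTE ON (order_header_id);

-- Including the distribution key in the join condition lets the
-- machine join slice-to-slice with no data redistribution,
-- even though the key adds nothing to the functional result.
SELECT h.order_date, d.product_id, d.quantity
FROM order_header h
JOIN order_detail d
  ON d.order_header_id = h.order_header_id;
```

Omitting `d.order_header_id = h.order_header_id` from a join between these tables would force a redistribution or broadcast, which is exactly the five-times penalty described above.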
The IBM PureData™ System for Analytics host can enable query history tracking. Column access statistics and table access statistics are two key views. After the testers have given the model a good shake, examine these views to find high-traffic query paths. A key strategy in this examination is to reduce join pressure by copying dimensional data into the fact table when particular columns have high traffic. For example, one organization that moved just 10 high-traffic columns into its fact table boosted 95 percent of its queries by a factor of 10 or more. This enhancement isn’t hard to do or maintain, and a periodic review of query history can reveal additional high-traffic paths. Keep in mind that in Netezza technology, purist data models don’t always fly. Stay away from snowflake models; they increase join pressure. Star schemas are functionally elegant and useful, but in traditional technologies they are essentially contrived models required to shore up the performance weaknesses of the machine. There is no need for such contrivances in Netezza, so even if deviating from the star model feels wrong, do it when it will radically boost power. There are no star schema cops to arrest us.
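A hedged sketch of mining those statistics: the query history database stores access counts in versioned objects (the `_1` suffix and the column names below are assumptions that depend on the history configuration loaded on your appliance, so check the history schema before running this):

```sql
-- Which columns of a given table do queries actually touch?
-- Object and column names assume a version-1 history schema.
SELECT columnname, COUNT(*) AS touches
FROM "$hist_column_access_1"
WHERE tablename = 'SALES_FACT'
GROUP BY columnname
ORDER BY touches DESC
LIMIT 20;
```

Columns that surface at the top of this list, but live in a dimension, are the candidates worth copying into the fact table to relieve join pressure.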
If a query has been beaten down to its most optimal form, continued reductions will not yield any significant result. If the performance is still unacceptable, the data itself must be refactored. Netezza technology is very physical, and optimization takes place next to the physics, with the data, not in the query. To find more performance, it’s probably time to reconsider the distribution key, the zone map strategy, or both. For example, I once witnessed a group of engineers swarming around a Netezza machine the same way I have watched vendor engineers swarm around their own machines during a proof-of-concept tournament using Netezza technology. This observation was alarming to say the least. They were myopically focused on writing the best query or building the best summary table, all to avoid the obvious: the underlying table was distributed on the wrong key. When I was called into a meeting because a query’s duration seemed terribly long, they explained that they had 10 billion rows in one table, 3 billion rows in another, and had clocked the machine at joining 1 billion rows per minute. The inner join of these two tables should have taken 3 minutes but was instead taking 30. That number, 10 times 3, screamed “Cartesian.” I claimed that one of the tables must have duplicates, but they swore duplication wasn’t possible. The query had a group-by clause, which I suspected was masking the query’s real workload, so I asked them to remove it and perform a simple count with the remainder. The query produced a count of nearly 30 billion rows. Bingo. There were duplicates in the data, but the engineers continued to deny it. Then we ran a simple select count(*), count(distinct keyname) from the table. With no duplicates, those two numbers should match. They did not. The face of the team’s leader went noticeably pale as he darted from the room.
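The forensic check from the story takes two queries; `big_table` and `keyname` are placeholders for the suspect table and its supposedly unique key:

```sql
-- If keyname is truly unique, these two counts match.
SELECT COUNT(*), COUNT(DISTINCT keyname)
FROM big_table;

-- When they don't, this lists the offending keys and
-- how many copies of each exist.
SELECT keyname, COUNT(*) AS copies
FROM big_table
GROUP BY keyname
HAVING COUNT(*) > 1
LIMIT 100;
```

On an appliance, both queries are a single scan each, so this check costs minutes, not the year the team spent working around the symptom.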
One of the others quipped, “We’ve been having trouble with that table for over a year, and nobody ever bothered to check for duplicates.” I didn’t point out that we had just solved the problem in less than an hour with simple forensics. I also didn’t point out that I should not have taken their word for the table’s contents. Trust, but verify. Hunting and killing performance bugs in an appliance is simple and straightforward. The only complication comes from attempting to map performance-tuning experience from other platforms into the appliance solution, which causes many to deny the obvious: Netezza is an appliance that is simple by design. Please share any thoughts or questions in the comments.