Stress Relief: Part 1

Alleviate the perfect storm of system stress by optimizing <br>Netezza data warehouses

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

After some time, a system running IBM® PureData™ System for Analytics powered by Netezza® technology data warehousing may exhibit signs of stress. Competitors may claim that this stress is caused by having too much data because vast amounts of data can make the appliance run slowly, but that’s not the case. While every IBM TwinFin® data warehouse appliance should be sized according to its workload, signs of stress are usually not because there is so much data. After all, high data volume is the reason for procuring the appliance in the first place. There’s no such thing as too much data. So how do we address the issue of stress? Stress for a Netezza machine typically arises when functional data, operational processes, end-user query frequency, and complexity all create a perfect storm. This perfect storm is actually a good thing because it means people are using the technology voraciously. The experience just needs to be optimized.  

An extreme approach

To begin, consider something radical. When first rolling out the device, throttle back the users to about 50 percent of the machine’s capacity, but don’t tell the developers or end users. This note of caution is very important because they will want the entire machine and all of its luxurious power, which can come at a long-term price. With virtually unlimited power, people can get sloppy in how they write queries or handle data. They may plop the un-optimized data onto the platform, whip up some queries to address it, and proceed to ride off into the sunset with 10 times the performance compared with the prior technology. However, a universal maxim of high-powered machines is that they make lousy data models look stellar, and sloppy queries even more so. The inefficiency of both inferior data models and untidy queries that eat the machine from the inside out is not obvious. The impact is not like a corrosive acid or anything like that, but more like accelerated aging. Make no mistake; other platforms suffer the same issue, yet very often their sales teams actually encourage organizations procuring them to avoid making any changes when migrating over. Doing so stands up the solution more quickly and starts the accelerated aging process. Sooner than they should have to, these organizations will want an upgrade, which is just fine for vendors who want to sell more hardware.  

Simple physics

Optimizing Netezza appliance technology in an IBM PureData System for Analytics application is really easy and straightforward. If optimization isn’t performed during the first migration, there’s really no time like the present to refactor and recover lost capacity. Typically, several primary misconceptions usually result from organizations trying to bridge concepts from other technologies into the Netezza technology. In part, the inexperienced try to map their understanding of the following logical capabilities when Netezza is a very physical machine:

  • Indexes: An index is a way to find the exact location of information. In systems of scale, indexes constrain and do not assist; ultimately, searching the index takes more time than retrieving data. Indexes are a bane of data warehousing, and eliminating them is a stroke of pure genius.
  • Partitions: From legacy technology, partitions are merely contrived performance props to assist a weak database engine. They anticipate and constrain query paths. Zone maps replace partitions and broaden end-user query paths without anticipating or constraining end users.
  • Zone maps: In Netezza data warehousing, zone maps provide a means to skip unwanted information. This approach is actually an anti-indexing strategy that naturally aligns with the data structures. Zone maps leverage the where-not-to-look principle, which is the only truly scalable approach for bulk scanning analytics. IBM DB2® with BLU Acceleration data management leverages “Synopsis Tables” in the same way.
  • Distribution: One way to get the whole machine working on a query is distribution. It spreads the data across parallel disks and processors. If tables share the same distribution keys, the result is deep joining power that preserves machine concurrency. Bad distribution strategies can saturate the machine with inefficiency. It’s all about the physics.
  • Compression: In other technologies, compression has a significant cost. In Netezza data warehouses, compression actually boosts power. Disk I/O creates most of the power drain on bulk scanning analytics. If the data is compressed by a factor of 4 to 1, I/O drag is reduced by 75 percent. Again, it’s all physical.
  • Referential enforcement: In legacy technology, referential enforcement originated to support transactional integrity. It didn’t really help much for warehousing because it had to be turned off before bulk-loading data. When re-enabling the constraint, if anything was wrong it would have to be corrected in the loading application, not the database. If only the loading application can fix it, what good is the database constraint anyhow? Netezza tosses referential enforcement but leverages referential rules for on-demand optimization of end-user query paths. In a data warehouse, this leverage is how people actually implement referential rules. The appliance simply aligns with industry practice.

Be sure to check out part 2 of this article for a deeper look into what these concepts mean for optimizing performance while keeping data healthy. In the meantime, please share any thoughts or questions in the comments.