Large-Scale Processing in Netezza: Part 1
Transitioning from ETL to ELT
CIO: Why is that uber-powered [commodity RDBMS] system running out of steam? Didn’t we just upgrade? MANAGER: Yes, but the upgrade didn’t take. CIO: Didn’t take? Sounds like a doctor transplanting an organ. Do you mean the CPUs rejected it? (laughing) MANAGER: (soberly) No, just the users. Still too slow. CIO: That hardware plant cost us [X] million dollars and it had better get it done or I’ll dismantle it for parts. I might dismantle your prima-donna architects with it. MANAGER: I can’t explain it. All the white papers said… (drifts into commodity mumbo-jumbo) CIO: I took the brochure-tour too. All of your people wore T-shirts with the vendor’s logo last week. I see they’re back to plain polo shirts. Did something happen? MANAGER: The expected numbers aren’t there. We’re tuning. CIO: Tuning? They’re off by orders of magnitude. You’re not tuning your way out of this. MANAGER: (tuning out) Yes, I know. (shuffles away mumbling) Don’t reduce managers to mush. It’s just not fair. Why saddle them with the wrong hardware for the job and then expect it to produce what it cannot, because it was never designed to? Outside of a database, “bulk data processing” means one thing: extract, transform, load (ETL) with dozens of tools to choose from. These choices can quickly cascade into a domino effect of necessary infrastructure such as an ETL machine, plus people who know the tool and how to apply it. Scalability is not just about hardware, but also logistics; the goal is to produce more functionality without an army of developers to support it. But what if tool jockeys aren’t required? What if a little UNIX knowledge—plus a good handle on both SQL and our own information—is the core skill set for a large-scale, very successful data processing environment? Bulk data processing inside a database machine comes in a single flavor: an insert/select statement using SQL, usually between intermediate database tables and/or persistent database tables (aka ELT). Apart from the SQL-generation mechanics to support this (no handcrafted SQL statements for business logic!), let’s look under the hardware covers to determine if this approach is viable for larger scales of data. General-purpose platforms are “horizontal,” which means that the CPUs are physically separated from disk drives, connected by a backplane, and shared-everything at the hardware level. Copying means pulling large quantities of data from the disks and over the backplane, then through the CPUs as well as through software with logical representations of the tables, and finally back down the backplane and back onto the disks. With this flow, the data meets itself coming and going on the (saturated) backplane. This necessitates big, fat, monolithic queries because we have to deal with the data while it’s in our hands, so to speak. This model, Multiple Instruction Multiple Data (MIMD), simply does not scale for bulk processing. An IBM® Netezza® platform, by contrast, is a purpose-built, proprietary assembly of carefully optimized commodity parts where CPUs are mated to a disk drive (a shared-nothing hardware). When we create a table, it exists once logically but parts of it also reside physically on each CPU/disk combination—so 10 of these CPU/disk pairs would store a 1,000-record table in 10 locations with 100 records each. A single query addresses all the CPUs at once; this means that the time required to move 10 groups of 100 records in parallel is one-tenth of the time required to move the entire 1,000 as a single block. This model is known as Single Instruction Multiple Data (SIMD). It scales radically for bulk processing, and it’s why a TwinFin® with 90 or more CPUs can process billions of records in a fraction of the time it takes a similarly powered general-purpose competitor. In Part 2 of this article, we’ll take a deeper dive into how this model can radically change the data warehousing process.