Large-Scale Processing in Netezza: Part 2
Transforming more than just data
In Part 1 of this article, we looked at how the IBM® Netezza® platform can process very large, complex data using simple SQL insert/select statements. Many who teeter on the decision to move logic from their ETL machine into Netezza get hung up on a simple question: can we scale the approach? In short, the SQL insert/select statement carries some logistics we must harness before it can serve as a solid foundation. Consider this:
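A minimal example of the explicit form (the table and column names here are hypothetical, chosen only for illustration):

```sql
-- Explicit target column list, with one source rule per target column
INSERT INTO sales_fact (sale_id, sale_date, amount_usd)
SELECT s.sale_id,
       s.sale_date,
       s.amount * r.exchange_rate
FROM   staging_sales s
JOIN   currency_rates r ON r.currency_code = s.currency_code;
```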
Seems simple, or perhaps simplistic. How about this?
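The same load, sketched without the target column list (again with hypothetical names):

```sql
-- The target column list is implied; the select columns are matched
-- to the target columns strictly by position
INSERT INTO sales_fact
SELECT s.sale_id,
       s.sale_date,
       s.amount * r.exchange_rate
FROM   staging_sales s
JOIN   currency_rates r ON r.currency_code = s.currency_code;
```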
Note that in the preceding query, the insert phrase's column list is implied. Some of you will recognize the very significant danger here: if we later add a column into the middle of the pack, it will offset the other columns and misalign them with the target columns. Some say that at least one of the target columns will break... but will it? I’ve seen misaligned queries run for days or months before anyone caught them, because the offset columns happened to align with acceptable target data types even though the content was all wrong. We must bring this issue under control.

Now what if this were a fact table with 150 columns? The insert/select clauses immediately become unwieldy and unmanageable. What if we need to add or subtract columns, or update the query to account for a data model change? These conditions functionally freeze the queries, making the solution brittle: little changes will shatter its operation. Our best approach is to line up the target columns with the source rules (the elements of the SQL select phrase that will fill the target columns):1
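One way to sketch such a template (hypothetical names again): each select line carries its target column as a trailing comment, so line N of the select list always pairs with line N of the insert list, and a scanning tool can verify the pairing mechanically.

```sql
INSERT INTO sales_fact (
  sale_id,
  sale_date,
  amount_usd
)
SELECT
  s.sale_id,                    -- sale_id
  s.sale_date,                  -- sale_date
  s.amount * r.exchange_rate    -- amount_usd
FROM staging_sales s
JOIN currency_rates r ON r.currency_code = s.currency_code;
```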
We can extend a template like this to the limits of the database machine. The target columns will always line up, and the source/select columns will never offset. Does it sound like we’re signing up for a bit more than just sticking with the ETL tool? Don’t be fooled. The object is to simplify, not complicate, the logistics of generating a SQL statement. This structure helps us maintain the query without a painful impact review: it’s easy enough to scan with simple tools and compare its content with the catalog contents to find, apply, or simply report changes.

Keep in mind, however, that any time a data model changes, practically all of the columns are pass-through.1 We can therefore concentrate on the exceptions, potentially automating over 90 percent of the data model’s reconciliation. Turnaround for a new data model then becomes agile, perhaps very agile. With this kind of power over query generation, we now have the freedom to manufacture lots of queries without losing logistical control. Can we scale the approach? Yes, and it’s already being done. A more advanced form of this template drives the framework engine,2 and it’s not uncommon to see dozens of such queries execute in a controlled order toward a functional goal, which keeps the solution simple to maintain and reuse.

A primary logistical difference between MIMD and SIMD: with MIMD we’re forced into fewer, more complex SQL statements, which are necessarily serialized because they saturate the hardware. SIMD encourages us to use more, and simpler, SQL statements and to run them in parallel, because the machine dispatches them so quickly. The SIMD version finishes faster, by orders of magnitude, than its MIMD counterpart. General-purpose MIMD platforms require initial (and perpetual) complexity, but purpose-built SIMD platforms allow us to begin simply and stay simple forever. And simple means things are easier to maintain, extend, and troubleshoot, creating the foundation for agility throughout the life of the system.
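As a sketch of the catalog comparison mentioned above: Netezza exposes system views such as _v_relation_column, from which a simple tool can pull a table's current column list and diff it against the template's target columns to report model drift. The view and column names below are as commonly documented; verify them against your release.

```sql
-- List the current columns of a table, in position order,
-- for comparison against the template's target column list
SELECT attnum, attname, format_type
FROM   _v_relation_column
WHERE  name = 'SALES_FACT'
ORDER  BY attnum;
```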
1 Compleat Netezza, Copyright © 2012 VMII. ISBN: 978-1-4610-9574-3. Excerpted by permission.
2 Netezza Data Integration Framework, Copyright © 2007–2012 Brightlight Consulting. All rights reserved.