Easing the Heavy Lifting of Bulk Data Processing

Generate SQL from a template, and voilà! Portable cross-database references

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

In launching a data warehouse flow architecture, a happy existence forms between the extract-transform-load (ETL) tool and the IBM® PureData™ System for Analytics powered by IBM Netezza® technology. Data arrives and is ensconced in its various tables, and end users consume the information at mind-blowing speeds. However, when the data starts to grow and the complexity increases, the ETL heavy-lifting integration operations experience elevated stress, resulting in a simple change in the winds of sentiment. Where should we perform all this heavy lifting, especially considering that Netezza possesses order of magnitude more power than the ETL platform? We may first encounter this situation when a needed operational post-process requires us to execute some massive operation on data already in the box. Processing data on the way is one thing; pulling data out en masse, transforming it, and putting it back—for example, lift and shift—is quite another. Why? Data in the box is already parallelized. Taking it out will, by definition, serialize it. Even if the ETL tool can re-parallelize it, it would have to re-serialize it to put it back. And then, network latency and any serialization of the data on the local storage area network (SAN) occurs. The additional drag effect—latency—that is created can jeopardize allocated time to close the operation.

Falling short when overhead is unacceptable

ETL tools have a latency factor that affects their data round-trip viability for such an operation. For example, if we have 500 records to update or affect, we’ll incur the pain of pulling these records out of the database, applying the change, and then loading them back to the database. But what if these records exist among 5 million records in a table containing 50 million records, and we need to pull all 5 million records, determine what needs to change, and then put the changes back into the database? This activity has a time cost that will only grow as the data increases. And if we’re doing this operation with a platform such as Oracle, Microsoft SQL Server, or any commodity load-balancing engine, doing this kind of operation in a database would take many times longer than using the ETL tool. Both will take a protracted duration to complete; we just want the shortest path. However, with PureData powered by Netezza technology, this equation is reversed dramatically. Performing this sort of operation in the database always requires a fraction of the time the ETL tool requires, especially considering the round-trip of data going into and out of the box. With complexity and data growing, and timelines becoming compressed, one can rapidly encroach on the finite processing timeline. But what if the operation is part of an intraday turnaround? What if an end user requests some rapid-turnaround, deep-crunch analytic process that precludes the viability of an ETL tool? Can we use the ETL tool for in-the-box query processing—SQL transforms? For example, we have one client that needs 40 heavy-lifting SQL operations over billions of rows, executed inside of 60 seconds. The queries themselves take a total of 35 seconds, leaving only 25 seconds of overhead remaining. Even if the ETL tool executed SQL transforms only, it would gobble that 25 seconds in the first few operations, considering they can take anywhere from 5 to 10 seconds in latency for each. Ten seconds of latency for 40 queries is 400 extra seconds, which completely blows the 60-second window. This example demonstrates an extremely aggressive timeline to make a point. The ETL tool is not the appropriate tool to support in-the-box operations if overhead is unacceptable.

Going with templates for SQL generation

The latency of the ETL tool is particularly unacceptable for organizations that have moved to the PureData and Netezza technology platform. Similarly, the bitter pill they may have once swallowed in attempting such operations within a traditional commodity database can make them hesitant to move forward with this new, powerful platform. Note some of the following valid concerns that exist around embracing this approach:

  • Out-in-the-open SQL statements that are handcrafted and without proper logic or error controls
  • Lack of operational and logistical control of the SQL operations that run freely
  • Lack of visibility of the runtime state, especially if running inside a stored procedure
  • Handcrafted SQL such that any change to the data model can initiate a rewrite
  • Optional SQL tasks that create dynamic tables or assets for later operations that can break if the table is not present—linear dependency
  • Lack of visibility into linear processing and data lineage at the table and column level
  • Inability to make sure the developers write the most efficient SQL all the time—and automatically detect when they don’t

While this list seems daunting, extensive production testing and rollout have shown that the foundation pattern used to resolve these and other concerns is not complex, and aligns with the simplicity of Netezza capabilities. The answer is in generating the SQL from a template. The template simply sets up some basic instructions at the top as a header, lists the given target table’s columns, and provides a way to put a SQL rule to the right of the designated column. It sets a free-form filter clause where both custom join and logic are inserted. This approach offers the following highly beneficial fallouts:

  • Lineage: If the SQL is generated from the template, the column- and table-level lineage can likewise be generated. They will always be in sync.
  • Catalog dependency: If the database and table used to derive the template’s contents are included, validating the template against its original source is easy, which helps automatically sync them. Think about it; most data model changes are pass-through columns. This single aspect can help eliminate 90 percent of the impact review.
  • Developer harness: Developers work with the template and nothing else. They perform no creative engineering or out-of-balance priorities.
  • High performance: By examining the template’s contents, which is more structured than free-form text, we can establish automation around assessing its efficiency. For example, is it an update operation that is neglecting to use the distribution key? We can examine other deep-detail efficiency issues, and designate which templates should run independently—side by side. This designation is not an option with a stored procedure.
  • Operation: If a controller is picking up the templates, formulating SQL, and running them, there is high visibility simply having it stream running status to a log or reporting progress to a progress table. We can apply global controls to all the SQL operations without relying on developer expertise to deliberately include it or neglect it.

Deriving a little magic from the template

Is there a trick to formulating such a template? Actually, many examples already exist of common table-definition formats that dovetail directly into specific needs; fortunately, they are all easy to rip through with a simple shell script. The following example provides a public domain format featuring the aforementioned header, body, and filter clause:

TARGET_DATABASE=REPORTDB TARGET_TABLE=EMPLOYEE <other actions here> [TARGET COLUMNS] [EMP_ID] a.empl_id_num [EMP_NAME] a.empl_lname || ‘, ‘ || a.empl_fname [EMP_SSN] a.empl_ssn [EMP_START_DATE] a.empl_st_dt [EMP_STATUS] a.empl_status_indicator ############## [FILTER CLAUSE] From $SOURCEDB_EMPLOYEE_MASTER

Some subtle magic takes place in this template. First, the target columns are called out specifically, and the SQL rule associated with the column stands alone—the rule is not commingled. We won’t have to find it through arbitrary SQL parsing. Second, manufacturing a valid query for this template is a short step away. More importantly, a significant hurdle in managing SQL transforms—and a bane for stored procedures—is effective configuration management of the cross-database references and table relations; they never get lost when promoting the application from development to testing to production. By pushing the entire reference into a $variable, the reference is completely portable. And the capability to map table dependencies through it provides an enhanced side benefit. Of course, this article can only scratch the surface of what such a template can do. Whether the end result is text-related or part of a visualized point-and-click interface, the intent is to keep the developer in a fairly constrained swim lane of activity while maintaining objective, deterministic, and highly scalable business logic. Please provide your thoughts or questions in the comments.