The Joy of Continuous Models for Streaming Data

Set up scalable, on-demand data processing in PureData for Analytics powered by Netezza

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

The latest wave of big data requirements sees raw data streaming at high velocity and volume to various ports of call. Data is integrated on the fly at a breathless pace. Most back-end processing aficionados are well aware of the batch-style, load-and-query model. Get everything processed in the dead of night or during the wee morning hours, and make sure the data is available at first light so end users can pound the daylights out of it until midnight. Rinse and repeat. This batch model is the standard solution for many implementations of IBM® PureData™ for Analytics powered by Netezza® technology appliances.


Increasing numbers of analytic solutions are moving to a continuous model because they are global and the sun literally never sets on their processing day. The markets close in Singapore, and the end users there want their analytical results processed and underway before the close of markets in the UK, and certainly before the close of markets in the US. If they wait until all markets close, there are not enough hours remaining to get anything done before the end users in Singapore need it. No, they need their data processed as the world turns for the specific reason of avoiding daytime drama—pun intended.

Shining a light on streaming models

Those people in the industry who might say, "Oh, Netezza, it's just for batch," should reconsider that sentiment: Netezza shines brightly in the streaming model and requires very little additional consideration to do so. Yes, some simple rules apply, but they are so easily accommodated that organizations deploying Netezza should jump at the chance to establish a continuous model.

Some refer to this model as a frictionless model because the data doesn't touch any external disk drives prior to arrival. It might enter a queue or other temporary holding area, but nothing is officially written to disk. There are good and bad aspects to implementing processing this way. On the good side, it avoids the processing drag of external disk drives. On the bad side, recovery after a processing failure isn't easy, because external storage provides a natural checkpoint for recovery.

A simple appliance such as a two-slice toaster offers an analogy. The toaster may suffice for a single person, after a marriage, and even after adding a child. As the family grows, a four-slice toaster makes for more efficient breakfasts than the two-slice variety. A 400-slice toaster, however, presents a very different problem: no longer toasting bread but managing bread, particularly getting bread to and from the toaster to keep it busy and avoid having any slots sit idle.

This example demonstrates the essential problem of streaming data into a Netezza appliance: data has to arrive at speeds that keep the machine busy, which is very difficult to do even on high-bandwidth networks. There is no known technology that can outpace the intake capacity of a PureData for Analytics powered by Netezza technology machine. With this knowledge, organizations need to optimize the surrounding infrastructure so that it does not become the bottleneck that starves the machine.

Bursting data flow

Although writing to disk may incur a penalty in the outside world, it doesn't materially affect the flow of streaming data to the Netezza machine. Netezza likes to see a burst of data that is a gigabyte or more; anything less can diminish its efficiency. For example, Netezza can load a million records in one second, or one record in one second. Because a loading event carries an immutable overhead, wise management of that overhead deserves consideration. A queuing mechanism that collects data and bursts it through a thresholding algorithm can significantly increase throughput.

What does such a thresholding algorithm look like? Most queuing mechanisms support collect-and-forward processing. The Ab Initio platform, for example, calls this a compute point: a declaration of the record count or data size that triggers the threshold, so the platform naturally bursts data to the machine within these boundaries.

What if data is running slow, such as between midnight and 6:00 A.M., and the threshold is not crossed for many hours? Rather than waiting for the size, the program can burst whatever it has on timed boundaries. If the setting is every 15 minutes, the queue is flushed even if it doesn't meet the size threshold. Wait a second: isn't the most efficient load over a gigabyte? Sure, but that volume presumes data is collecting rapidly, in which case premature bursts should be avoided.

Consider this scenario. One group had millions of records streaming from hundreds of collection servers, each producing small files containing hundreds of records at a very rapid pace. Should the group execute a load on each file? Of course not; rather, it queued up the files until reaching the collection threshold and then burst them all over at once. This approach is very efficient and easy to manage. More importantly, it is easy to recover from in case of data loss.
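The collect-and-burst pattern described above can be sketched in a few lines. This is a minimal illustration, not a real loader: the `loader` callback, class name, and thresholds are assumptions standing in for whatever bulk-load mechanism feeds the appliance, and a production version would also run the time check on a background timer rather than only when a record arrives.

```python
import time

class BurstQueue:
    """Collect records and flush ("burst") them when either a size
    threshold or a time threshold is crossed. The loader callback is a
    stand-in for the actual bulk-load step (an assumption, not a
    Netezza API)."""

    def __init__(self, loader, size_threshold_bytes=1 << 30, max_wait_seconds=900):
        self.loader = loader                      # called with the buffered records
        self.size_threshold = size_threshold_bytes  # e.g. 1 GB for an efficient load
        self.max_wait = max_wait_seconds            # e.g. 900 s = 15 minutes
        self.buffer = []
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def add(self, record: bytes):
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        self._maybe_flush()

    def _maybe_flush(self):
        size_hit = self.buffered_bytes >= self.size_threshold
        time_hit = time.monotonic() - self.last_flush >= self.max_wait
        if self.buffer and (size_hit or time_hit):
            self.flush()

    def flush(self):
        # Burst everything collected so far, even if under the size
        # threshold (this is the timed-boundary flush).
        if self.buffer:
            self.loader(self.buffer)
            self.buffer = []
            self.buffered_bytes = 0
        self.last_flush = time.monotonic()
```

In use, slow overnight traffic is flushed by the 15-minute timer, while daytime traffic crosses the size threshold long before the timer fires.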
Another client organization had established memory queues to bring data from various sources into the Netezza machine. One day it realized that one of the queues had gone offline and that none of its contents had been integrated into the whole. Previously, the organization would have had to delete the data from the machine and then restart the load from a known checkpoint across all sources, because the integration logic lived in the extract-transform-load (ETL) tool. With Netezza-scale data, this method is not only impractical, it is impossible. What is needed instead is a way to recover the lost data and reintegrate it into what has already been processed.
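One common way to make that reintegration possible is to tag every burst with a batch identifier, so a lost or partially loaded batch can be replayed idempotently instead of wiping and reloading everything. The sketch below uses Python's built-in sqlite3 as a stand-in for the appliance; the table, columns, and batch-ID format are illustrative assumptions, not the client's actual schema.

```python
import sqlite3

# Tag every burst with a batch_id so a failed or lost batch can be
# reloaded without deleting everything and starting over.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (batch_id TEXT, key INTEGER, value REAL)")

def load_batch(conn, batch_id, rows):
    # Idempotent reload: remove any partial copy of this batch first,
    # then insert it whole. Re-running after a failure is safe.
    conn.execute("DELETE FROM fact WHERE batch_id = ?", (batch_id,))
    conn.executemany(
        "INSERT INTO fact (batch_id, key, value) VALUES (?, ?, ?)",
        [(batch_id, k, v) for k, v in rows],
    )
    conn.commit()

load_batch(conn, "2024-01-01T00:15", [(1, 10.0), (2, 20.0)])
# The same batch replayed after a queue outage does not duplicate rows:
load_batch(conn, "2024-01-01T00:15", [(1, 10.0), (2, 20.0)])
count = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
```

Because the delete-then-insert is keyed on the batch ID, recovery touches only the affected batch, which is exactly what the all-or-nothing checkpoint approach could not do.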

Applying a scalable approach

One of the primary problems with processing data at scale is the need to reprocess it after some period of time has elapsed. For example, a financial services client ages its records by calculating certain values and applying them to the whole; the line-of-business users then see real-time aging of the information that directly affects their analytics results. Another group must perform regular error correction on already-processed data: its upstream feeds send down corrections, and those corrections must be applied in a timely manner. Many other data-refresh examples abound.

Some end users also require that their data be reprocessed on demand. They want to punch a button and have their designated data plumbed, with algorithms applied and results available, often in less than a minute.

In a Netezza machine with billions of records at breathtaking scale, data reprocessing can take some very brute-force turns if not handled correctly. For example, another client organization was pulling the data from the Netezza appliance into the ETL tool, reprocessing it, and reloading it. The organization was accustomed to this protocol because other database engines require it: a conventional engine such as Microsoft SQL Server or Oracle cannot easily process the data internally using, for example, a SQL-only transform. Such an operation could bring the platform to its knees, dim the lights, and cause blackouts on the Eastern seaboard.

With Netezza, however, this same protocol requires the data to fully serialize on its way out of the machine, travel over the network, re-parallelize in the ETL tool, serialize over the network again, and re-parallelize in the Netezza machine. Whew! That's a lot of work, and for transport only. A secondary problem is the processing itself: many of the integration operations perform heavy-lifting joins and set-based scans that put increasing pressure on the ETL platform.
As the ETL platform experiences stress, those idle processors inside the Netezza machine start looking very tempting indeed. If an organization forgoes the ETL-centric approach and instead applies SQL transforms to the problem, the data can move rapidly inside the machine using massively parallel data movements without ever leaving it. The difference in total duration is sometimes astounding: once-multihour operations finish in minutes when handled internally. Getting this level of turnaround is definitely worth the investment because it directly affects the scalability and capacity of the solution.

What does this approach mean for on-demand processing, or for reprocessing stale or erroneous data? With proper planning, organizations can easily implement such solutions and have the best of a hybrid model. The ETL tool gets the data sourced and into the machine, and then SQL transforms performed inside the machine finish it off. The integration logic and the integration pressure are now inside a massively parallel machine. Moreover, late-arriving data doesn't have to be pulled out into the ETL tool again; it is simply reprocessed internally. If an end user needs reprocessing on demand, no problem: just build that solution and leverage the machine's power.

How much easier is this approach than doing it with ETL? At one site, the developers demonstrated an ETL flow that, when printed, literally stretched across the length of three eight-foot tables. It had all kinds of reprocessing logic and backwashing of the data. One of the SQL-transform developers converted all of that logic into three SQL statements. What does that say about using power to simplify both logic and processing?
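As a tiny concrete illustration of finishing data with a set-based transform inside the database, the sketch below lands raw rows once and then summarizes them with a single INSERT ... SELECT, with no export/transform/reload round trip. Python's sqlite3 stands in for the appliance here (on Netezza the same statement would run in parallel across the machine), and the table and column names are assumptions for the example only.

```python
import sqlite3

# Land the raw data once, then transform it in place with set-based SQL
# instead of round-tripping it through an ETL tool.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_trades (account TEXT, amount REAL);
    CREATE TABLE account_summary (account TEXT, total REAL, trade_count INTEGER);
    INSERT INTO raw_trades VALUES ('A', 100.0), ('A', 50.0), ('B', 25.0);
""")

# One set-based statement replaces an export/transform/reload cycle:
conn.execute("""
    INSERT INTO account_summary (account, total, trade_count)
    SELECT account, SUM(amount), COUNT(*)
    FROM raw_trades
    GROUP BY account
""")
summary = conn.execute(
    "SELECT account, total, trade_count FROM account_summary ORDER BY account"
).fetchall()
# summary == [('A', 150.0, 2), ('B', 25.0, 1)]
```

The design point is that the data never leaves the database between the load and the transform, so there is no serialize/re-parallelize cost on either side of the network.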

Powering efficiency for complex processing

Netezza appliances are very powerful machines that are versatile enough for a wide variety of solutions. This power gives organizations the ability to simplify operations that are artificially complex precisely because their platforms are underpowered for the task. PureData for Analytics powered by Netezza technology alleviates that artificial complexity and applies its capabilities to help simplify and clarify the solution. Once organizations are in this zone, scalability is naturally preserved and can take them to dizzying heights. Please share any thoughts or questions in the comments.