The Agony and Ecstasy of Scalable Disaster Recovery

Deploy PureData for Analytics, powered by Netezza technology, to set up scalable disaster recovery

Senior Principal Consultant, Brightlight Business Analytics, a division of Sirius Computer Solutions

The average data manager may have a somewhat tense existence if the systems aren’t protected from common outages. Whether for regular backup or for fully operational hot-swappable data centers, the ever-present need to protect information assets requires deliberate attention, extra infrastructure, and often secondary supporting technologies that dovetail with the solution base in a noninvasive manner. When introducing a solution that implements IBM® PureData™ for Analytics, powered by IBM Netezza® technology, the sheer scale achievable by even the smallest of these machines has the power to crush an infrastructure and lay waste to highly advanced and complex backup architectures.

Imagine a museum filled with priceless treasures. It has air-conditioning systems that pull human body gases into the floor and away from the artifacts. It has special lighting so that commodity lamps do not artificially age or wash out the colors in the finest artworks. Many other painstakingly installed features serve to deliberately and delicately preserve each piece in a manner befitting its construction and underlying parts. Then one day, Godzilla plants his foot across the museum and, just by showing up, undoes everything in a matter of seconds.

Not that Netezza is Godzilla, but it is certainly harnessing one, or several, Godzilla-sized solutions on the inside. The scale of data inside the machine can be a point of pride for business managers, stakeholders, and the like. It can also be a source of continuous anxiety for those who are charged with protecting it the same way they protect the data in other machines.

When Netezza data is at rest, it is nominally compressed by a factor of four or more. This compression means a 10 TB data store can decompress to 40 TB if extracted by a commodity backup technology. Consider a particular Netezza site that has a single table that is 35 TB in size, compressed four times. As a result, a commodity backup utility performing a full backup must extract 140 TB of data. Regardless of whether the utility intends to compress the data again for later storage, the full 140 TB is what actually hits the network, which could translate into many long days to transfer just this one table.
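To put rough numbers on that claim, here is a minimal back-of-the-envelope sketch in Python. The 35 TB table and the four-times compression ratio come from the example above; the 1 Gbps sustained link speed is purely an assumption chosen for illustration, and real throughput depends on the site.

```python
# Back-of-the-envelope transfer estimate for a decompressed full backup.
# ASSUMED_LINK_GBPS is an illustrative assumption, not a measured figure.

def decompressed_tb(compressed_tb: float, ratio: float) -> float:
    """Size that hits the network when data is extracted uncompressed."""
    return compressed_tb * ratio


def transfer_days(size_tb: float, link_gbps: float) -> float:
    """Days needed to move size_tb (decimal terabytes) over a sustained link."""
    bits = size_tb * 1e12 * 8           # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)  # gigabits per second -> bits per second
    return seconds / 86_400             # seconds -> days


TABLE_TB = 35            # compressed size of the single table in the example
COMPRESSION_RATIO = 4    # nominal ratio cited in the article
ASSUMED_LINK_GBPS = 1.0  # hypothetical sustained bandwidth

full_tb = decompressed_tb(TABLE_TB, COMPRESSION_RATIO)
print(f"Decompressed extract: {full_tb:.0f} TB")
print(f"Days if decompressed: {transfer_days(full_tb, ASSUMED_LINK_GBPS):.1f}")
print(f"Days if kept compressed: {transfer_days(TABLE_TB, ASSUMED_LINK_GBPS):.1f}")
```

Whatever link speed is plugged in, the decompressed transfer takes four times as long as the compressed one, which is the heart of the problem.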

Applying a highly effective backup technology

With Netezza, bridging or adapting the device to the outside data management world is a common theme. The onboard nzbackup utility can extract the data into flat files without decompressing it, and the onboard nzrestore utility can put it back the same way. This capability allows a commodity product to manage these much smaller flat files without requiring a direct interface to the Netezza machine. It also intrinsically provides for a checkpoint-based extract rather than a monolithic extract, which means it doesn’t choke trying to take everything at once.

In data warehousing, the protocol for a long-running operation is inherently understood. If a given job can run in two hours but only a three-hour window is available, a failure more than an hour into the run leaves too little time to start over. The job’s checkpoints need to be configured so that it can recover from the last checkpoint and continue rather than having to start from scratch. Yet, in full backup scenarios, many data warehouses treat these operations as monolithic. Netezza backups can end up spanning many hours and even days. Keep in mind that if the backup operation fails and has to begin again, it loses all the time it has already spent. Clearly, the backup operation must chop the work into manageable chunks so that if one chunk should fail, a simple restart from the failed chunk can be made without going back to the very beginning.

When having a casual chat with the backup administrator about the new machine’s needs, each initial response will be preceded by the same two words: “What the…?” The reason is that such vast quantities have never been spoken of, nor planned for, and the amount of data coming from the Netezza machine is expected to eclipse the combined contents of all the administrator’s other machines.

The most effective backup technology for a Netezza machine is … another Netezza machine. Onboard tools and utilities can easily and rapidly extract and redeploy information across these machines without decompression or any egregious file management logistics. Applying automation so that operators have less to worry about is easy. In fact, at Brightlight there are fully automated inter-Netezza replication and archiving (delete-behind) operations that are omnidirectional, many to one, or one to many, and they throttle the workload to keep it from swamping the network bandwidth.
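The idea of chopping a backup into restartable chunks can be sketched in a few lines of Python. This is only an illustration of the checkpoint pattern, not how nzbackup or Brightlight's automation works internally: the table names, the checkpoint file, and the flaky_extract function are all hypothetical stand-ins, and a real job would call the actual extract utility where the stand-in sits.

```python
# Illustrative checkpointed job: finished chunks are recorded on disk so a
# rerun skips them. All names here are hypothetical stand-ins.
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Set


def run_with_checkpoints(chunks: Iterable[str],
                         process: Callable[[str], None],
                         checkpoint_file: Path) -> Set[str]:
    """Process each chunk once, persisting progress after every success."""
    done: Set[str] = set()
    if checkpoint_file.exists():
        done = set(json.loads(checkpoint_file.read_text()))
    for chunk in chunks:
        if chunk in done:
            continue                    # finished in an earlier run
        process(chunk)                  # may raise; checkpoint stays intact
        done.add(chunk)
        checkpoint_file.write_text(json.dumps(sorted(done)))
    return done


def demo() -> List[str]:
    """Simulate one failure, then a restart that resumes at the failed chunk."""
    tables = ["customers", "orders", "shipments"]
    attempts: List[str] = []
    failed_once = False

    def flaky_extract(table: str) -> None:
        nonlocal failed_once
        attempts.append(table)
        if table == "orders" and not failed_once:
            failed_once = True
            raise RuntimeError("simulated network failure")

    with tempfile.TemporaryDirectory() as tmp:
        checkpoint = Path(tmp) / "backup.checkpoint.json"
        try:
            run_with_checkpoints(tables, flaky_extract, checkpoint)
        except RuntimeError:
            pass                        # first run dies partway through
        run_with_checkpoints(tables, flaky_extract, checkpoint)
    return attempts


print(demo())
```

In the demo, the second run does not touch the table that already completed; it picks up at the chunk that failed, so the time already spent is not lost.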

Bringing scalable disaster recovery into play

This discussion now takes an unexpected turn. The machine’s internal scalability requires administrators to think differently, perhaps even overthink the need to protect its information assets. Replicating from a production Netezza server to another similar server at a secondary site provides a means to objectively verify that the data assets are preserved. But what happens in case of a disaster? Two primary scenarios arise: in one scenario the data center is destroyed, and in another the Netezza machine goes offline.

If the primary site goes down as the result of a true disaster such as a fire, flood, or tornado, then the off-site secondary Netezza machine is essentially orphaned. Nothing is feeding it because the primary Netezza machine is offline. However, what if the primary site is live but only the primary Netezza machine is down? Is the intention to point the business intelligence (BI) tools to the secondary Netezza machine, and push the extract-transform-load (ETL) work to this machine, through a very thin network pipe between the primary and secondary sites? A preliminary test of this constrained bandwidth reveals all: having that secondary machine at the secondary site is highly problematic if the primary machine has an outage. It cannot be operated from a distance.

But what if two fully operational sites are humming? In this case, the primary Netezza machine doesn’t need to feed the secondary machine because it’s fed by the secondary ETL machines at the secondary site. Perhaps installing a sync check would be helpful so an operator could tell if the two machines were keeping up with each other.

This scenario underscores a common misconception of disaster recovery (DR) in terms of scale rather than function. Consider a radical thought. If the DR protocol is entirely Netezza-centric, meaning data managers care only if the primary Netezza machine crashes and not the entire data center, the DR Netezza machine should be physically colocated with the primary machine in the primary data center. That way, if the primary machine has an outage, the secondary one can take over and enjoy the same scalable bandwidth as the primary machine. Yet, there are many sites that have their Netezza-centric DR machine hosted in a physically distant location, and available only through a thin network pipe that won’t support the machine’s scale if it has to go live. Of course, many of them only care to back up the data and not actually go live with the secondary machine, and a remote machine is well suited to that purpose.

As with all things associated with keeping a backup, the backup is worthless if it doesn’t readily, and accurately, support restoration. After all, nobody really cares how long a backup cycle takes to complete. Nobody’s watching, and it’s happening in the background. But in case of a data loss, the need to restore it is often a top priority with fire-breathing end users and managers watching the restoration as if the administrators are living in a fishbowl. In this situation, the restoration had better be right, and it had better be right quick.

Why then do the vast majority of backup tools and approaches focus on the speed of the backup process and not the restore process? When it comes to systems of scale, the speed of restoration trumps the speed of backup. In fact, having the backup run a little longer than it normally would is even acceptable if the extra time is used to manufacture metadata and operational collateral that ultimately supports a high-speed restoration process. What does this situation say about a commodity backup that may require many days to extract and keep the information safe? It means that many days will be required to put it back. Suffice it to say, many days is not really a scalable option.
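For the two-live-sites scenario, the sync check mentioned above could be as simple as comparing per-table row counts gathered from each machine. The following Python sketch assumes those counts have already been collected by whatever query mechanism a site uses; the table names and counts are invented for illustration, and a production check would likely compare more than row counts.

```python
# Illustrative sync check: report tables whose row counts differ between two
# machines. The sample counts are made up; collecting them is out of scope.
from typing import Dict, Optional, Tuple

Drift = Dict[str, Tuple[Optional[int], Optional[int]]]


def sync_report(primary: Dict[str, int], secondary: Dict[str, int]) -> Drift:
    """Return {table: (primary_rows, secondary_rows)} for mismatched tables.

    A table missing from one side shows up with None for that side.
    """
    drift: Drift = {}
    for table in sorted(set(primary) | set(secondary)):
        p_rows = primary.get(table)
        s_rows = secondary.get(table)
        if p_rows != s_rows:
            drift[table] = (p_rows, s_rows)
    return drift


# Hypothetical counts from each site.
primary_counts = {"customers": 1_000, "orders": 52_000, "shipments": 48_500}
secondary_counts = {"customers": 1_000, "orders": 51_200}

for table, (p_rows, s_rows) in sync_report(primary_counts, secondary_counts).items():
    print(f"{table}: primary={p_rows} secondary={s_rows}")
```

An empty report means the two machines agree on every table checked; anything else tells the operator exactly where one site has fallen behind.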

Helping keep scalability simple

PureData for Analytics, powered by Netezza technology, is designed to be standardized on simplicity. Embracing easy, scalable approaches for backup, restore, and disaster recovery means high reliability, less guesswork, and of course, scalability with the machine itself. Organizations cannot afford to have their data management processes fall behind. Fortunately, Netezza never sleeps. Its data is always moving and always growing. Replicate between Netezza machines with Netezza-compressed data to help get highly scalable and reliable backups. Focus the infrastructure on high-speed restoration as the highest priority. Please share any thoughts or questions in the comments.
