
Optimizing Backup Images for Data Deduplication Devices

Taking advantage of DB2 Version 9.7 deduplication device support

DB2 for LUW Offering Manager, IBM

Data deduplication is dramatically improving database environments by minimizing storage requirements, accelerating backup and recovery, and reducing network traffic. But before the release of FixPack 3 for the IBM® DB2® Version 9.7 database, if you wanted to optimize a DB2 backup image for a deduplication device, you had to set several BACKUP DATABASE command options appropriately. Otherwise, you ran the risk of generating a data stream in which the deduplication device you were backing up to could not identify redundant “chunks” of data.

To make backing up a DB2 database to deduplication devices easier (and to make the deduplication of backup images more effective), IBM introduced the DEDUP_DEVICE option for the BACKUP DATABASE command in DB2 v9.7, FixPack 3, and improved its behavior in FixPack 4. In this column, I’ll describe what deduplication is, and I’ll explain how it is often implemented. I’ll also show you how DB2 backup operations are performed, both when the DEDUP_DEVICE option of the BACKUP DATABASE command is specified and when it is not. Finally, I’ll provide some recommendations on how to optimize DB2 backup images for deduplication devices if you are using a version of DB2 earlier than v9.7, FixPack 4.

What is data deduplication and how is it implemented?

Data deduplication (sometimes called “intelligent compression” or “single-instance storage”) is a specialized form of data compression that’s designed to eliminate redundant data. Much like other forms of compression, deduplication works by inspecting data and identifying sections that have identical byte patterns. When such patterns are found, only one unique instance of the data is written to storage; duplicate occurrences are replaced with a “data pointer” that references the previously stored version. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be physically stored (and transferred across a network) when deduplication is used can be greatly reduced.

For example, suppose an email system contains 100 instances of the same 4 MB attachment. If this email system is backed up without deduplication, all 100 instances of the attachment are saved, requiring 400 MB of storage. However, if the same email system is backed up to a deduplication device, only one instance of the attachment is actually stored; each subsequent instance merely references the copy that was saved. Thus, the 400 MB of storage needed to back up the system is reduced to 4 MB!

As mentioned earlier, most deduplication devices work by comparing relatively large “chunks” of data, such as entire files or large portions of files. Each chunk examined is assigned an identifier, which is typically calculated using a cryptographic hash function. Many implementations assume that if two identifiers are identical, the corresponding data is identical. Other implementations forgo this assumption, preferring instead to do a byte-by-byte comparison to verify that data with the same identifier is indeed the same. Either way, if a particular chunk of data is found to already exist in the deduplication namespace, that chunk is replaced with a link to the data that has already been stored. Later, when the deduplicated data is accessed and a link is encountered, the link is replaced with the data it refers to. Of course, this whole process is transparent to end users and applications.

Typically, deduplication is performed using one of two methodologies: “in-line” or “post-process.” With in-line deduplication, hash calculations and lookups are performed before data is written to disk. Consequently, in-line deduplication significantly reduces the raw disk capacity needed, since not-yet-deduplicated data is never written to disk. For this reason, in-line deduplication is often considered the most efficient and economical deduplication method available. However, because it takes time to perform hash calculations and lookups, in-line deduplication can slow some operations down, although certain in-line deduplication vendors have achieved performance comparable to that of post-process deduplication.

With post-process deduplication, all data is written to storage before the deduplication process is initiated. The advantage to this approach is that there is no need to wait for hash calculations and lookups to complete before data is stored. The drawback is that a greater amount of available storage is needed initially since duplicate data must be written to storage for a brief period of time. This method also increases the lag time before deduplication is complete.

How a traditional DB2 backup operation works

To understand how the DEDUP_DEVICE option of the BACKUP DATABASE command optimizes DB2 backup images for deduplication devices, it helps to know how data is normally processed when a backup operation is initiated. When a DB2 backup operation begins, one or more buffer manipulator (db2bm) threads are started. These threads are responsible for accessing data in the database and streaming it to one or more backup buffers. Likewise, one or more media controller (db2med) threads are started. These threads are responsible for writing data residing in the backup buffers to files on the target backup device. (The number of db2bm threads used is controlled by the PARALLELISM option of the BACKUP DATABASE command; the number of db2med threads used is controlled by the OPEN n SESSIONS option.) Finally, a DB2 agent (db2agent) thread is assigned the responsibility of directing communication between the buffer manipulator threads and the media controller threads. This process can be seen in Figure 1.

Figure 1: The DB2 backup process model.
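To make the mapping between these options and the threads they control concrete, consider the following command. (This is an illustrative sketch; the database name SAMPLE and the use of Tivoli Storage Manager as the backup target are assumptions, not requirements.)

    db2 "BACKUP DATABASE sample USE TSM OPEN 2 SESSIONS WITH 6 BUFFERS BUFFER 4096 PARALLELISM 4"

With these options, DB2 starts four db2bm threads (PARALLELISM 4) to read table space data, starts two db2med threads (OPEN 2 SESSIONS) to write two output streams, and allocates six backup buffers of 4,096 4 KB pages (16 MB) each for those threads to share.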

Normally, data retrieved by the buffer manipulator (db2bm) threads is multiplexed across all of the output streams being used by the media controller (db2med) threads—and there is no deterministic pattern to the way in which data is placed in the output streams used. (This behavior is shown in Figure 2.) As a result, when the output streams are directed to a deduplication device, the device thrashes in an attempt to identify chunks of data that have already been backed up.

Figure 2: Default database backup behavior. Note that the metadata for a table space will appear in an output stream before any of its data and that empty extents are never placed in an output stream.


How the DEDUP_DEVICE option alters backup behavior

When the DEDUP_DEVICE option is used with the BACKUP DATABASE command, data retrieved by the buffer manipulator (db2bm) threads is no longer multiplexed across the output streams being used by the media controller (db2med) threads. Instead, as data is read from a particular table space, all of that table space’s data is sent to one, and only one, output stream. Furthermore, data for a particular table space is always written in order, from lowest to highest page. As a result, a predictable and deterministic pattern of data emerges in each output stream, making it easy for a deduplication device to identify chunks of data that have already been backed up. Figure 3 illustrates this change in backup behavior when the DEDUP_DEVICE option of the BACKUP DATABASE command is used.

Figure 3: Database backup behavior when the DEDUP_DEVICE option is specified.
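In practice, enabling this behavior is a one-keyword change to the backup command. Here is a minimal sketch, assuming a database named SAMPLE and Tivoli Storage Manager as the interface to the deduplication device:

    db2 "BACKUP DATABASE sample USE TSM OPEN 4 SESSIONS DEDUP_DEVICE"

Because each table space is written, in page order, to a single output stream, multiple sessions can still be opened; DB2 simply directs different table spaces to different streams.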

So, just how much of a difference does this simple change in behavior make? According to IBM, one customer that was backing up 4 TB in 6.5 hours and seeing very little in the way of deduplication was able to reduce the time needed to perform the backup by one hour while achieving a deduplication result of 11:1!

Backing up to a deduplication device when the DEDUP_DEVICE option is not available

If you want to back up a DB2 database to a deduplication device and you are using a version of DB2 earlier than v9.7, FixPack 4, there are several things you can do to optimize your backup images for deduplication:

  • If possible, use a buffer size of 16,384 (specified with the BUFFER option of the BACKUP DATABASE command; the value is expressed in 4 KB pages, so this equates to a 64 MB buffer). If there is not enough memory available to support this setting, use a buffer size of 8,192 (32 MB) instead. Larger buffers will often improve the factoring ratio on a deduplication device.
  • Set parallelism (using the PARALLELISM option of the BACKUP DATABASE command) to the minimum value needed to read data at the backup rate recommended by the vendor of the deduplication device you’re using.
  • Set the number of sessions (using the OPEN n SESSIONS option of the BACKUP DATABASE command) to the minimum value that will allow DB2 to write data at the backup rate recommended by the vendor of the deduplication device you’re using.
  • Determine the appropriate number of backup buffers with the formula PARALLELISM value + OPEN n SESSIONS value + 2, and set the number of buffers accordingly (using the WITH n BUFFERS option of the BACKUP DATABASE command).

Usually, the smaller the parallelism and the number of sessions used, the better the deduplication factoring ratio. But keep in mind that this improvement comes at the cost of a longer backup window. The example below puts these recommendations together.
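Here is what a tuned backup command for an older DB2 version might look like. (As before, the database name SAMPLE and the TSM target are assumptions for illustration.)

    db2 "BACKUP DATABASE sample USE TSM OPEN 1 SESSIONS WITH 5 BUFFERS BUFFER 16384 PARALLELISM 2"

In this sketch, a PARALLELISM value of 2, plus 1 session, plus 2 yields the 5 backup buffers specified with the WITH 5 BUFFERS option, and BUFFER 16384 requests the recommended 64 MB buffer size.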

The primary reason for the introduction of the DEDUP_DEVICE option in DB2 v9.7, FixPack 3, was to optimize DB2 backup images for deduplication devices and to simplify backup operations when such devices are used as the target. (FixPack 4 contained some enhancements that improved the behavior when this option is used.) So if you’re using a deduplication device for backup and recovery, and you’re running DB2 v9.7, FixPack 4 or later, take advantage of this feature. Chances are it will shorten your backup window and improve your deduplication results.

Special thanks to Dale McInnis, senior technical staff member—DB2 availability architect, for providing information on how DB2 backup operations are performed—with and without the DEDUP_DEVICE option specified—and for suggesting how to back up to a deduplication device when older versions of DB2 are used.