DB2 Indexes and Query Performance: Part 1
Weigh costs and benefits when considering additional indexes
For plenty of you, indexes are the duct tape of DB2 performance fixes. Got a response-time issue? Add an index to the target table. In truth, reducing the elapsed time of a DB2-accessing query or batch job frequently involves creating a new index; however, while an additional index is very often a solution to a DB2 performance problem, it is not always the right solution. In this first part of a two-part series, I’ll point out the costs that go along with the benefits of DB2 indexes, sketch out some rules for when to consider other solutions, and talk about ways to create more effective indexes. Next issue, I’ll describe some query performance-tuning actions that do not involve new-index creation and that you can implement when a table is already heavily indexed and you are leery of adding more.
There’s no free lunch
Indexes are definitely not without cost. Every additional index that you define on a table adds to the costs associated with your database. Start with CPU overhead: every index on a table multiplies the CPU cost of each insert or delete operation, or the CPU cost of any update that changes the value of an indexed column. Insert a row into a table with three indexes, and DB2 must add a corresponding entry to an index three times. Delete a row from a table with 10 indexes, and DB2 must remove all index entries associated with the deleted rows.
All of this index maintenance activity involves DB2 page requests—called GETPAGEs in an IBM® DB2® for z/OS® system and logical reads in a DB2 for Linux, UNIX, and Windows [LUW] environment—and these tend to be the primary determinant of the CPU cost of SQL statement execution.
Extra indexes also add to the CPU consumption of certain DB2 utilities, such as the REORG utility. Indexes defined on keys that are not continuously ascending will naturally become less well-organized over time. New entries must be placed in particular index leaf pages (because index-key sequencing is strictly enforced, whereas data-row clustering in a table is not), and if the page into which an entry must be placed is already full (not uncommon when the key is not continuously ascending), the page will be split and some of the entries will be relocated to a formerly empty page that might be a long way away in the index structure. REORG will restore a disorganized index to a well-organized state, at a cost of some CPU consumption.
Other utilities that get more expensive as more indexes are defined on a table include LOAD and RUNSTATS (index statistics in the DB2 catalog must be kept up-to-date), and the utilities that perform index-to-data consistency checking (that would be CHECK INDEX for DB2 for z/OS, and INSPECT with INDEXDATA for DB2 for LUW). Indexes also have a disk space cost, although that doesn’t need to be as brutal as it would have been a few years ago, thanks to the index compression features of DB2 9 for z/OS and DB2 9.7 for LUW. On both platforms, you can now get large space savings (often 50 percent or more) with little in the way of additional CPU overhead.
How many indexes?
As the number of indexes on a table grows, eventually the aggregate CPU cost of maintaining the indexes can outweigh whatever performance gains a new index might deliver. But where is that tipping point? The answer depends on the specific situation, but I generally don’t like to see more than 4 or 5 indexes on a table in an online transaction processing (OLTP) environment, or more than 8 to 10 indexes on a table in a data warehouse system. I’m usually OK with a larger number of indexes per table in a data warehouse, because the emphasis there is typically on optimizing data retrieval performance, and query search arguments tend to be less predictable in a business intelligence setting.
Now, these index-limitation guidelines are more rules of thumb than rules to live by. Suppose a table in an OLTP database already has 6 or 7 indexes. Would I give a thumbs-up to adding another? Maybe, if the potential performance payoff is really big—and when I say big, I mean that the new index is expected to reduce the CPU cost of one (or several) of the more expensive queries in the system by 90 percent or more. Even if the proposed new index looks like a performance winner, before going ahead with implementation I might want to see if there are any good tuning alternatives that do not involve creating a new index (that will be the topic of Part 2 of this column).
If you’re going to do it, do it right
When you do add an index to a table to cut the CPU cost and run time of a query, you want to get the most out of that new index as practically possible. That means, among other things, getting the order of the key columns right if it’s a multi-index key. Why is that important? Because a query will run faster—sometimes much faster—if the query’s predicates (its search arguments) match more columns of an index key; and when it comes to DB2’s rules for matching predicates with index-key columns, order matters.
Here’s a simplified (and therefore not all-inclusive) version of the rules for the most common predicate types, which include equal-type, range (for example, > or <=), and in-list (for example, COL1 IN ('DOG', 'CAT')):
- For a multicolumn index key, predicate matching starts with the high-order column and proceeds as far as possible with other key columns, one at a time, in the order of their sequence within the index key. Put another way, matching starts at the far left of the index key and proceeds rightward.
- Getting a match on the column in position n+1 in the index key requires matching on the column in position n with an equal-type predicate or an in-list predicate. In other words, once a key column has been matched with a range predicate, there won’t be any matching on columns to the right of that column.
- Matching stops when a column in the key is skipped. That is, if you get a match on the column in position n of the key, and there is not a matching predicate for the column in position n+1, there can’t be a match on the column in position n+2.
So, for example, if you have a query with predicates COL1 > 2 AND COL2 = 'BRICK', an index defined on COL2, COL1 will give you two predicate/key-column matches, while an index on COL1, COL2 will give you one match (refer to rule 2 in the preceding list: matching stops once a key column has been matched with a range predicate).
Use indexes, but choose wisely
Indexes can really turbocharge DB2 query performance, but if you put too many on a table, you could end up with response time and throughput going in the wrong direction. Start out conservatively, with just a few—maybe two or three—on each table in your database. That way, you’ll have some reserve index-add capacity, and you’ll probably be able to define additional indexes that will look really good from a cost-benefit perspective. Just don’t overdo it, and don’t waste precious index-add capital on indexes that deliver only marginal query performance benefits—you’re looking for the big wins here.
In my next column, I’ll show you that there are plenty of things you can do to improve query response times without creating new indexes. Combining those techniques with judicious use of additional indexes is the two-pronged performance-tuning approach that is most likely to get you where you want to go.