Blogs

New ORDER BY Information: Part 2

The impact of using the RANDOM index option on ORDER BY sort avoidance

In Part 1 of this series, aspects of ORDER BY were discussed. Part 2 covers the impact of using the CREATE/ALTER INDEX RANDOM order option on sort avoidance. It also covers the issue of coding only GROUP BY COL1, COL2 versus coding both GROUP BY COL1, COL2 and a follow-up ORDER BY COL1, COL2 in a SQL statement.

A brief review of sort avoidance

Sort syntax such as ORDER BY and GROUP BY does not necessarily cause a data sort. With the appropriate access path, the ORDER BY or GROUP BY requirement can be met without sorting. (An appropriate access path means that an index is available and used so that the index drives the access to the data. DB2 can then return qualified rows to the program in the desired order, the order of the index.)

But what happens when the order of the index column is random? Before DB2 9 for z/OS, we could CREATE indexes with columns that had ascending key values and/or descending key values. As of DB2 9, we can also CREATE indexes with columns that we designate as RANDOM. When we use this option, key values are encoded so that the values are stored randomly. Note that identical key values will encode the same and will therefore be contiguous. Let’s look at an example of index data where one of the columns is RANDOM. Assume an index on DEPTNO ASCENDING, LASTNAME RANDOM, EMPNO ASCENDING in which the decoded values are as follows:

Notice that DEPTNO is in ascending order. LASTNAME is in random order within each DEPTNO, but like values are contiguous. EMPNO is in ascending order within the like values of LASTNAME.

The costs and benefits of RANDOM

Why would anyone want to keep index data in random order? Currently, online transaction processing (OLTP) workloads in a data sharing environment can experience contention on index pages, especially during INSERTs, when an application program INSERTs into indexes with columns that contain current timestamps or ever-increasing values. These column types often create insertion hot spots on index pages. Then applications must wait to acquire busy index pages.

We can use randomized index key columns to reduce contention, but at what cost? There may be more CPU usage and getpage operations, as well as more index page read and write I/Os. The RANDOM option is useful when ascending insertions or hot spots cause this contention but the resulting cost is not prohibitive.

Another possible issue is that while each distinct column value is stored randomly, like-kind values are contiguous. Therefore a randomized index column can relieve contention problems for sets of similar or sequential values, but it’s no help with identical values. Identical values encode the same, and each is inserted at the same place on the index tree.

Here are some interesting facts (some beneficial and some not) about the RANDOM column option.

It allows equality predicate lookups, such as LASTNAME = :LN, but it does not support matching range lookups, such as LASTNAME > :LN. The option can be used in nonmatching index scans in a screening fashion and as part of index-only access. Even though values are stored as random encoded values, we can retrieve the original, decoded value of the column. The option causes RUNSTATS to populate HIGH2KEY and LOW2KEY with the original, decoded value of the column. Finally, the RANDOM column option cannot be specified in the following cases:

  • For an index key column that is of variable length
  • If the index is created with the NOT PADDED option
  • As part of the GENERATE KEY USING clause
  • If the RANDOM column is used to determine the partition location of a table row

Also, DB2 cannot use random order index columns as part of a sort merge join. If a join is between one table that has an ascending index on the join column and a second table that has a randomized index column, the indexes are not in the same order and cannot be merged.

Sort avoidance with RANDOM columns

Now, what happens if the index that would have been ideal for DB2 to use to avoid a data sort is not an ordered list, but rather random?

For the examples that follow, we will use our three-column index on DEPTNO, LASTNAME RANDOM, EMPNO to determine how DB2 might (or might not) use an index for sort avoidance.

1) SELECT DEPTNO, LASTNAME, … FROM BIGTABLE
  WHERE DEPTNO in ('A01', 'B01')
  ORDER BY DEPTNO

For this ORDER BY, DB2 has no problem with the RANDOM column. The RANDOM column is not part of our ORDER BY. We can match on one column, and if we use the index the traditional way, the data will come back in DEPTNO order.

            A01              Yothers …
            A01              Josten …
            A01              Josten …
            A01              Miller …
            A01              Miller …
            A01              Miller …
            B01              Zagelow …
            B01              Bossman …
            B01              Bossman …

2) SELECT EMPNO, … FROM BIGTABLE
  WHERE DEPTNO = 'A01'
       AND LASTNAME = 'MILLER'
  ORDER BY EMPNO

Again, there is no problem here with the RANDOM column. Because like-kind values of the RANDOM column are stored together, DB2 can do an equal lookup, matching on two columns of our index. And, if DB2 uses the index the traditional way, the data will come back in EMPNO order.
        14567 …
        14568 …
        14569 …

3) SELECT LASTNAME, EMPNO, … FROM BIGTABLE
  WHERE DEPTNO = 'A01'
  ORDER BY LASTNAME, EMPNO

Here DB2 can match on only one column, and since LASTNAME is random, DB2 must do a data sort to bring our rows back in LASTNAME, EMPNO order.

              Josten                       21234 …
              Josten                       21235 …
              Miller                       14567 …
              Miller                       14568 …
              Miller                       14569 …
              Yothers                     12345 …

4) SELECT LASTNAME, EMPNO, … FROM BIGTABLE
  WHERE DEPTNO = 'A01'
  AND LASTNAME BETWEEN 'MILLER' AND 'YOTHERS'
  ORDER BY LASTNAME, EMPNO

In this example, DB2 can match only on DEPTNO and must do an index scan, screening on LASTNAME within that one value of DEPTNO to find all of the index entries where LASTNAME is between our low value and our high value. The index rows will not be in LASTNAME order within DEPTNO A01. Therefore, DB2 must sort to satisfy the ORDER BY.
              Miller                        14567 …
              Miller                        14568 …
              Miller                        14569 …
              Yothers                     12345 …

5) SELECT LASTNAME
  FROM BIGTABLE
  WHERE DEPTNO = 'A01'

In this example, there is no ORDER BY. In what order will our rows be returned? Since both LASTNAME and DEPTNO are part of the index, we will get index-only access. We have one matching predicate on DEPTNO. Since DB2 will only use the index the “traditional” way for index-only access, the rows will be returned just like our index data, in groups of identical LASTNAMEs in random order.
                        Yothers
                        Josten
                        Josten
                        Miller
                        Miller
                        Miller

6) SELECT DEPTNO, LASTNAME, COUNT(*)
  FROM BIGTABLE
  GROUP BY DEPTNO, LASTNAME

In this example, DB2 will not have to sort. Our index is in DEPTNO order, and within that order, in groups of identical (but unordered) values of LASTNAME. Therefore, using control break logic, DB2 can return distinct DEPTNO, LASTNAME combinations without sorting. The rows will be in DEPTNO order but will not be in LASTNAME order within each DEPTNO. The answer is accurate because we did not ask that the rows be returned in ORDER, only GROUPed.
            A01               Yothers               1
            A01               Josten                2
            A01               Miller                3
            B01               Zagelow               1
            B01               Bossman             2

Because LASTNAME is random in the index, if we want our rows to be in LASTNAME order within DEPTNO, we must add an ORDER BY clause. This requirement to add an ORDER BY clause in addition to the GROUP BY clause is new with the RANDOM option. Before DB2 9, we needed to add an ORDER BY only if we had more than one index that could be chosen to do the grouping. Remember, we do not have an ASC or DESC option on GROUP BY. Therefore, if we have an INDEX on COL1 ASC, COL2 DESC, as well as an index on COL1 DESC, COL2 DESC, either index can be used to avoid the GROUP BY sort, and the one chosen will determine the order of the output rows. To avoid unpredictable output in this situation, we would need to code the ORDER BY clause (even before DB2 9).

A more realistic example

Consider the following realistic example of a situation where the RANDOM option might be preferable.

A user defines an index on the INVOICE_NUMBER column of the INVOICE_MASTER table in ASCENDING order. The user then inserts rows with an ascending sequence of values for the INVOICE_NUMBER column (0000100, 0000200, 0000300, ...). The index rows will be inserted at the end of the index, creating a hot spot. In this particular index, the user looks up specific invoices only by their INVOICE_NUMBER—in other words, only equality predicates are applied to the INVOICE_NUMBER column. To reduce contention, DROP and re-CREATE the index using the RANDOM option.

With the RANDOM specification, the unique INVOICE_NUMBER values will now be stored randomly throughout the index, preventing the contention that results from always inserting values at the end. Because the common use of this index is for looking up specific invoices by INVOICE_NUMBER, the new RANDOM ordering option provides a good solution.

A caution regarding RANDOM

We must be careful with the RANDOM option and ensure that our solution is not creating more cost from increased CPU, I/O, and sort overhead. The option is best used when we are actually solving a contention problem, when our predicates on the random column are EQUAL predicates or, if not EQUAL, the random columns do not often appear in ORDER BY clauses. The benefit must outweigh the additional cost or the RANDOM option will cause more problems than it solves. And, once we begin using the RANDOM option, we must be very careful to code both a GROUP BY and an ORDER BY when order matters.