What’s the Big Deal About Query Optimization? Part 2
Take account of recent improvements and where query optimization may be headed
Though the complexity of SQL queries has grown significantly, recent enhancements have kept query optimization moving forward. The first installment of this series focused on the challenges that increasingly complex SQL poses for relational databases. This concluding installment looks at several techniques, representing recent enhancements, that database administrators (DBAs) can use to optimize queries in relational databases.
Capitalizing on recent advancements
Predicate transitive closure (PTC) is nothing more than the capability to derive a predicate that the end user did not specify. The following inference provides a simple example:
A = B and B = C; therefore, A = C
This closure can apply to host variables, as follows:
WHERE T1.C1 = T2.C1 AND T1.C1 = ? generates T2.C1 = ?
Or across tables, as follows:
WHERE T1.C1 = T2.C1 AND T1.C1 = T3.C1 generates T2.C1 = T3.C1
This closure can occur for equal predicates, range predicates, IN lists, and so on. In addition, it allows the optimizer to push a predicate from the outer select to a subselect or the reverse.
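The derivation step can be sketched as computing the transitive closure of the equality predicates the user wrote. The following is a minimal illustration in Python, using a union-find structure; the function name and input encoding are assumptions for the sketch, not any product's API:

```python
from itertools import combinations

def transitive_closure(equalities):
    """Derive implied equality predicates from those the user wrote.

    equalities: list of (column, column) pairs, e.g. ("T1.C1", "T2.C1").
    Returns every derivable pair, including the originals.
    """
    # Union-find: group columns that are transitively equal.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in equalities:
        parent[find(a)] = find(b)

    # Collect each equivalence class, then emit every pair within it.
    groups = {}
    for col in parent:
        groups.setdefault(find(col), []).append(col)
    derived = set()
    for members in groups.values():
        derived.update(combinations(sorted(members), 2))
    return derived

# WHERE T1.C1 = T2.C1 AND T1.C1 = T3.C1 generates T2.C1 = T3.C1
closure = transitive_closure([("T1.C1", "T2.C1"), ("T1.C1", "T3.C1")])
```

The same grouping idea extends to range predicates and IN lists: any predicate that holds for one member of an equivalence class can be generated for the others.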
Conversion of left outer join to inner join
When a predicate exists on the null-supplying (right) side of a left outer join, the join can be converted to an inner join. For example:
FROM EMP E
LEFT OUTER JOIN DEPT D
ON E.DEPTNO = D.DEPTNO
WHERE D.DIVISION = 'ARC'
is converted to:
FROM EMP E
INNER JOIN DEPT D
ON E.DEPTNO = D.DEPTNO
WHERE D.DIVISION = 'ARC'
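The conversion is valid because any EMP row the outer join preserves without a DEPT match carries NULLs in the DEPT columns, and a NULL can never satisfy D.DIVISION = 'ARC'. A small Python simulation of the two join semantics (illustrative data and helper names, not an actual SQL engine) makes the equivalence concrete:

```python
# Toy tables: BOB's department 99 has no matching DEPT row.
EMP = [{"ENAME": "ANNE", "DEPTNO": 10},
       {"ENAME": "BOB", "DEPTNO": 99}]
DEPT = [{"DEPTNO": 10, "DIVISION": "ARC"}]

def left_outer_join(emp, dept):
    out = []
    for e in emp:
        matches = [d for d in dept if d["DEPTNO"] == e["DEPTNO"]]
        if matches:
            out.extend({**e, **d} for d in matches)
        else:
            out.append({**e, "DIVISION": None})  # NULL-extended row
    return out

def inner_join(emp, dept):
    return [{**e, **d} for e in emp for d in dept
            if d["DEPTNO"] == e["DEPTNO"]]

# WHERE D.DIVISION = 'ARC' -- None (NULL) never qualifies
pred = lambda row: row["DIVISION"] == "ARC"
outer_result = [r for r in left_outer_join(EMP, DEPT) if pred(r)]
inner_result = [r for r in inner_join(EMP, DEPT) if pred(r)]
```

After the filter, the NULL-extended row for BOB is gone either way, so the optimizer is free to use the cheaper inner-join strategies.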
Histograms provide the standard technique for estimating selectivity factors for predicates on a single table. They come in various flavors, including equi-width, equi-height (also called equi-depth), and max-diff. Of these, equi-height is the most popular because it is typically the most accurate. Consider an example that illustrates both the reality (see Figure 3) and how the optimizer sees things (see Figure 4).
Figure 3. The reality of varying numbers of customers across equal time intervals
Figure 4. The query optimizer’s view of selectivity factors without a histogram (left) for one set of equal time intervals and with a histogram (right) for another set of different time intervals
Uncertainty and safe optimization
One way to illustrate uncertainty and safe optimization is to consider the following example. Suppose a person has a choice of two routes to work each morning, either by taking the freeway or by taking surface streets. The freeway normally takes 30 minutes, but on a bad day the commute can take up to two hours. The alternative route requires an average of 45 minutes, but even on a not-so-good day can take up to one hour. Which route is the best one for this person to take?
Safe optimization is a process in which the optimizer quantifies cost uncertainty and incorporates it during access path selection, especially index selection. When two indexes have close cost estimates, the optimization process considers the uncertainty of matching and screening predicates when choosing the most efficient index. It may opt for an index with a slightly higher cost estimate if that index carries less cost uncertainty (see Figure 5). Safe optimization attempts to balance the average case against the worst case. For example, in a defensive optimization, an access path with an average of 2 seconds and a worst case of 4 seconds is preferred over an access path with an average of 1 second and a worst case of 10 seconds.
Figure 5. The optimization process when quantifying cost during access path selection
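One simple way to model this defensive choice is to score each candidate by its average cost plus a penalty proportional to its worst-case spread. The following sketch applies that scoring to the commute example above; the scoring formula and risk_weight parameter are assumptions for illustration, not how any particular optimizer weighs uncertainty:

```python
def safest_plan(candidates, risk_weight=0.5):
    """Pick the plan minimizing avg_cost + risk_weight * (worst - avg).

    candidates: dict of name -> (avg_cost, worst_case_cost).
    A higher risk_weight makes the choice more defensive."""
    def score(item):
        _, (avg, worst) = item
        return avg + risk_weight * (worst - avg)
    return min(candidates.items(), key=score)[0]

# The commute example: the freeway averages 30 minutes but can take
# 120 on a bad day; surface streets average 45 with a worst case of 60.
routes = {"freeway": (30, 120), "surface_streets": (45, 60)}
```

With any meaningful risk weight, the slower-on-average but more predictable route wins, which is exactly the trade-off safe optimization makes between two close index candidates.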
An eye to the future
The basic idea behind pioneering work on parametric optimization3 is the recognition that a single access path may not be good for all variations of host variables. One size does not fit all. Multiple access paths for a single query are generated and stored (for example, index access for pending transactions versus a table space scan for confirmed transactions). Different values of host variables are then routed to different access paths. If this optimization sounds like REOPT or bind peeking, it is similar, but there is no runtime creation of the access path: one of several pre-generated paths is selected (see Figure 6).
Figure 6. The selection of pre-generated paths
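The run-time portion then reduces to a cheap lookup. The sketch below shows the routing idea for the pending-versus-confirmed example; the plan names, status values, and routing rule are hypothetical, and no real product exposes this as an API:

```python
# Pre-generated at bind time: one plan per region of the
# host-variable space (names are illustrative only).
PLANS = {
    "index_access": "use the index on STATUS (few qualifying rows)",
    "tablespace_scan": "scan the table (most rows qualify)",
}

def choose_plan(status_value):
    """Route a host-variable value to a pre-generated access path.

    Pending transactions are assumed rare, so an index pays off;
    confirmed transactions are the bulk of the table, so a scan is
    cheaper. Nothing is optimized at run time: selection only."""
    return "index_access" if status_value == "PENDING" else "tablespace_scan"
```

The open questions in the list that follows are exactly about this lookup: how many plans to keep, and how cheap the routing decision can be made.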
From a query optimization perspective, organizations face some of the following challenges:
- How many access paths per SQL statement need to be stored?
- How can the runtime overhead of deciding the right access path be kept low?
- Are there additional memory constraints?
- Is guessing wrong still possible?
- As a matter of end-user control, is there the capability to specify special values? For example, program x always asks for a wide range of dates, while program y always asks for a narrow range of dates.
The basic idea behind a learning optimizer is to monitor queries as they execute (that is, self-tuning), with no end-user action required. It compares the optimizer’s estimates with actuals at each step in a query execution plan and then computes adjustments to its estimates that may be used during future optimizations of similar queries. Ideally, estimation errors can also trigger re-optimization of a query in mid-execution (see Figure 7).
Figure 7. The learning optimizer process of comparing estimated cardinalities with actual cardinalities
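The feedback loop can be sketched as keeping, per predicate, a correction factor derived from past estimate-versus-actual comparisons. The class and its methods below are an illustrative simplification, not any product's implementation:

```python
class LearningFeedback:
    """Remember, per predicate, how far off the estimate was, and
    apply that correction the next time a similar query is optimized."""

    def __init__(self):
        self.adjustments = {}  # predicate signature -> correction factor

    def observe(self, predicate, estimated, actual):
        # e.g. estimated 100 rows, actual 10,000 -> factor of 100.0
        self.adjustments[predicate] = actual / max(estimated, 1)

    def adjusted_estimate(self, predicate, raw_estimate):
        # Unseen predicates fall back to the raw estimate (factor 1.0).
        return raw_estimate * self.adjustments.get(predicate, 1.0)

fb = LearningFeedback()
fb.observe("DIVISION = ?", estimated=100, actual=10_000)
```

A mid-execution trigger would simply compare the running actual cardinality at a plan step against the stored estimate and re-optimize when the ratio exceeds some threshold.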
At some point, the optimizer will not know what the end user wants, and some form of end-user control is essential. For this control to be possible, the following items are necessary:
- Criticality of the query—similar to the optimization level in IBM® DB2® for Linux, UNIX, and Windows data management—or the amount of time an end user should be prepared to wait—for example, in an ad hoc query versus a batch job run many times
- Degree of paranoia, or the safety factor—average versus worst case
- Simple hint process as part of an access path only—for example, just table order, just a specific index, and so on
- Anti-hint—anything but the chosen access path
- Feedback to the end user on which access paths were the runners-up, making it even easier than before to create a hint
Focusing on the right problems
In a recent blog post, Dr. Guy Lohman asks whether query optimization is a solved problem.4 According to Lohman, the accuracy of cardinality estimation is the key area in which errors are made by orders of magnitude. Researchers seem to be focusing on problems they can solve rather than the ones they should be solving. As Lohman states in the post, “So c’mon, folks, let’s attack problems that matter, those that account for optimizer disasters, and stop polishing the round ball.”
Query optimization is not easy, after all. Cost estimation remains challenging in the presence of host variables (as in static SQL), data skew, and correlation. While huge improvements to the optimizer have been made, and continue to be made, some form of adaptive optimization (usage-based analysis) is essential, and enhanced end-user control (search directives) seems inevitable.
Please share any thoughts or questions in the comments.
3 “Picasso Database Query Optimizer Visualizer,” version 2.1, Database Systems Lab, Indian Institute of Science, February 2011.
4 “Is Query Optimization a ‘Solved’ Problem?” by Guy Lohman, Association for Computing Machinery (ACM) Special Interest Group on Management of Data (SIGMOD) blog, April 2014.
- “Towards a Robust Query Optimizer: A Principled and Practical Approach,” by Brian Babcock and Surajit Chaudhuri, Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, June 2005.
- “Query Optimizers: Time to Rethink the Contract?” Proceedings of the 2009 ACM SIGMOD Conference, Providence, Rhode Island, June 2009.
The author thanks Dr. Guy Lohman at IBM for his advice and enlightening discussions on this topic.