What’s the Big Deal About Query Optimization? Part 1
SQL queries have grown quite complex, and the challenges for query optimization need to be considered
From the early days of SQL, database administrators (DBAs) have specified the what without worrying about the how. The prevailing assumption was that query optimization takes care of the how. A lot has changed since the days of relational databases, and SQL queries have become increasingly complex. The amount of data has grown exponentially, and the service-level expectations have become tighter than ever. What was good then is simply not acceptable today.
This situation offers an excellent opportunity to look at some of the challenges DBAs face with SQL queries and how some vendors are addressing them. In addition, there is the future of query optimization to consider, what it may look like, and why perfection may not be attainable. The first installment of this series takes a look those challenges, and the concluding installment focuses on current progress and future expectations for query optimization.
Understanding the challenges
From a programmer’s point of view, there is a SQL statement input into a query optimizer, some magic happens, and voilà†–a fantastic query plan results. If only the process were that simple. The reality is quite different and is much more complex (see Figure 1). Moreover, the factors that influence the access path selection are quite varied (see Figure 2).
Figure 1. The reality of query optimization in today’s database systems
Figure 2. The complexity of access path selection
Selecting an access path can be challenging. Tapio Lahdenmäki—a performance management consultant with a long career at IBM and the author of several books—classifies query optimization problems into two broad types.1 The first involves the possibility that optimization is not explored, while the second is for the possibility that it is evaluated incorrectly:
- Type 1: The optimizer does not see the best access path. In this case, a SQL statement can be broken up—for example, change OR to UNION, use multiple statements instead of one, or rewrite the SQL statement as follows:
change YEAR(BIRTH_DATE) = 2011 to BIRTH_DATE BETWEEN '2011-01-01' AND '2011-12-31'
- Type 2: The optimizer cost estimates are incorrect. This problem is far more serious. More detailed and accurate stats can be collected, plug stats, use SQL tricks, or use optimization hints.
Another problem to consider is the enormity of the exhaustible list of all possible access paths. Many database professionals tend to underestimate this problem by orders of magnitude. For example, the simple query number 8 of the standard Transaction Performance Processing Council ad hoc decision support (TPC-H) benchmark—which is a relatively simple eight-way join—has 22 million access path choices.2
Today’s processing requirements are such that many users want the time allotted to optimize a query limited to just a few seconds at best. They cannot wait hours to create the perfect access path. However, a highly critical part of the access path creation process is cardinality estimation—answering the question, “Given the predicates, how many rows are expected to be returned?” Cardinality estimation is the weakest link and is inherently flawed because of three major factors: data skew, correlation, and host variables along with how uncertainty is handled.
Data not uniformly distributed can create a data skew occurrence, which can be point skewed—for example, a status code in an employee table with active = 95 percent and retired = 5 percent. Why does this point skew matter? In the first case, the expected filter factor is estimated at 0.50 because there are two distinct values, while the actual can be 0.95 or 0.05. The distribution can also be range skewed—for example, the salary range is USD30,000 to USD300,000 with 90 percent of employees earning less than USD100,000. In this case, the range skew matters because the filter factor for a range predicate such as SALARY BETWEEN :x AND :y is calculated as follows:
The reality is quite different, of course, and histograms, which are discussed in part 2 of this series, are one way to handle this problem.
An inaccurate estimation can occur when there is correlation across two columns. For example, the filter factors for the following two predicates are multiplied together as if the two predicates—and columns—are independent:
AND MODEL = 'CIVIC'
Of course, the reality is quite different. Who else but Honda makes a Civic? In this case, the two predicates are 100 percent correlated and should not be multiplied together to arrive at the filter factor.
Host variables and handling uncertainty
The third factor is the presence of host variables. This area is very typical for all end user–driven queries so that the SQL contains a generic predicate such as the following and the actual value is supplied by the end user:
In this case, the access path is created with no regard to the actual value unless REOPT—or Oracle’s bind peeking—is used with the associated overhead.
A problem related to host variables is the concept of uncertainty. If the optimizer has a choice between an access path with a low average cost but a potentially higher worst-case cost, should an access path be chosen over an alternative with a higher average cost but a guaranteed lower worst-case cost?
Just like cardinality estimation models, cost models used by query optimizers are also imperfect. The query optimizer does not model all hardware characteristics, runtime conditions, and physical data layouts. Although such design choices can obviously lead to reliability problems, there are often reasonable compromises that can be made. These compromises are not a major area of concern.
Given all these challenges, enhancements have progressed. The second installment of this series discusses some recent improvements and some expected, promising advances.
Please share any thoughts or question in the comments.
1 “Relational Database Index Design and the Optimizers,” by Tapio Lahdenmäki and Mike Leach, Wiley Interscience, July 2005.
2 “SQL Query Optimization: Why Is It So Hard To Get Right,” PASS Summit 2010 Keynote, by David J DeWitt, November 2010.