Optimizing Database Queries: Exploring the Heuristic and Cost-Based Approaches
A comparison of the two main approaches to finding the most efficient way to execute a database query: the heuristic approach and the cost-based approach.
I think everyone has used the explain command at least once, or has at least heard of it. This command shows the query execution plan, but how exactly the DBMS arrives at that plan remains a mystery. And how does the DBMS know that the chosen plan is optimal? Does it check all possible options?
In this article, I will try to give a little idea of how query optimizers work from a theoretical point of view.
Let's start with the fact that there are two main approaches to finding the most efficient implementation option: Heuristic and Cost-Based Approaches.
The Heuristic Approach
The heuristic approach in query optimization relies on predefined rules and guidelines to guide the optimization process. It involves using a set of static heuristics or principles to determine the execution plan without considering the actual cost of each operation. These heuristics are typically based on the experience and knowledge of database experts and aim to capture common optimization patterns and best practices.
One of the key advantages of the heuristic approach is its simplicity and ease of implementation. By following a set of predefined rules, the optimizer can quickly generate an execution plan for a given query. This makes the heuristic approach particularly useful in scenarios where there is a need for fast query optimization or when the cost-based approach is computationally expensive.
Common heuristic techniques used in query optimization include rule-based optimization and index selection. Rule-based optimization involves applying a set of predefined rules to transform the initial query into an equivalent, more efficient form. These rules may include algebraic transformations, predicate pushdown, or join reordering. Index selection, another heuristic technique, involves identifying the most appropriate indexes to speed up query execution based on heuristics such as selectivity and cardinality estimation.
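For instance, the index-selection heuristic can be reduced to a fixed selectivity threshold. Below is a minimal sketch of such a rule (the threshold value, the function name, and the numbers are invented for illustration; real systems use far richer rules):

def choose_access_path(table_rows, matching_rows, has_index, threshold=0.1):
    # Heuristic: an index scan pays off only if the predicate filters out most rows.
    selectivity = matching_rows / table_rows
    if has_index and selectivity <= threshold:
        return "index scan"
    return "full table scan"

print(choose_access_path(table_rows=1_000_000, matching_rows=500, has_index=True))      # index scan
print(choose_access_path(table_rows=1_000_000, matching_rows=600_000, has_index=True))  # full table scan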
However, the heuristic approach has its limitations. Since it relies on predefined rules, it may not account for the dynamic nature of data distributions or changing workload characteristics. As a result, the heuristic approach may produce suboptimal plans in certain scenarios. Furthermore, the heuristics themselves may be based on general assumptions that do not hold true for all situations. Thus, while the heuristic approach provides quick optimization solutions for many queries, it may not always deliver the best performance.
To overcome these limitations, a cost-based approach to query optimization has been developed. This approach, which we will explore in the next section, takes into account the estimated cost of executing different query plans. By considering the actual costs, the cost-based approach aims to generate more optimal execution plans tailored to the specific characteristics of the data and workload.
In summary, the heuristic approach offers a practical and efficient way to optimize queries by following predefined rules and guidelines. It provides quick optimization solutions but may not always yield the best performance due to its static nature and inability to consider actual costs. Nevertheless, the heuristic approach remains valuable in situations where time constraints or computational limitations make the cost-based approach less feasible.
The Cost-Based Approach
Let's move on to a more interesting, in my opinion, approach - the cost-based approach. Its idea is to evaluate the effectiveness of an execution plan based on some metric, such as the execution time, the number of physical disk accesses, the sizes of intermediate fetches from the database, and so on.
The first thing the optimizer does with this approach is to build a query tree. Let's figure out what it is.
A query tree is a tree structure, the nodes of which contain logical operators corresponding to individual query operations.
In other words, something like this:
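(A rough sketch for a query such as SELECT u.name FROM user u JOIN orders o ON o.user_id = u.id WHERE u.age > 50; the operator names and the orders table are illustrative, and real optimizers draw such trees in their own notation.)

PROJECT (u.name)
  └─ FILTER (u.age > 50)
       └─ JOIN (o.user_id = u.id)
            ├─ SCAN (user u)
            └─ SCAN (orders o)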
For each such tree, you can apply a transformation - logical or physical. Logical transformations generate new trees by changing the structure of the original ones, and physical transformations replace logical operators with their specific implementations, called physical operators, without changing the structure of the tree. So, for example, the logical JOIN operator can be replaced with a physical LOOP JOIN or MERGE JOIN.
Transformations are performed based on some set of rules consisting of a template and a substitution. The template shows which expression this rule can be applied to, and the substitution shows what will be the result of the changes.
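For example, a few such rules can be written schematically as template/substitution pairs (the notation below is purely illustrative; every optimizer stores its rules in its own internal format):

Template: JOIN(A, B)               Substitution: JOIN(B, A)                  (join commutativity)
Template: JOIN(JOIN(A, B), C)      Substitution: JOIN(A, JOIN(B, C))         (join associativity)
Template: JOIN(A, B), logical      Substitution: LOOP JOIN(A, B) or MERGE JOIN(A, B), physical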
Having figured out what a query tree is, let's introduce two more definitions: logical equivalence and the search area.
Query trees are called logically equivalent if they produce the same output. Logically equivalent query trees, together with their physical implementations, form a group of equivalent queries.
To reduce the number of logically equivalent expressions, they can be converted into so-called group expressions. At each node of the tree of such an expression, there is a logical operator that accepts groups of equivalent queries as input and not just individual queries.
The search area is the set of all possible logically equivalent query trees. The initial search area is the logical tree obtained by parsing the given query. As each tree node is replaced with its logical equivalents, the search area expands. The final search area is thus the set of all query trees equivalent to the original query - more precisely, a single tree in which each node holds the full equivalence group of the corresponding node in the original tree. I agree it sounds complicated, but the theory is what it is.
So, with the definitions figured out, now let's go directly to the assessment of the cost of the plan. It consists of three steps: cardinality assessment, cost model application, and plan enumeration. Let's consider each separately.
Stage 1. Evaluation of Cardinality
For those who did not know or have forgotten, let me remind you that cardinality is the size of the result set returned by some expression. Most often, this means the number of tuples produced by the query, but sometimes it can also refer to the number of pages. Obviously, if you already have the result at hand, calculating its cardinality is trivial - but can it be estimated without executing the query itself?
Yes, and there are different approaches to this. The two most popular are sampling-based estimation and the use of profiles (pre-collected statistics).
The first option is to analyze the query "in miniature": for example, instead of joining two complete tables, join two small samples of their tuples. The resulting estimate is then extrapolated to the full join.
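A minimal sketch of the idea, assuming tables are represented as lists of dicts (the 1% sample rate and the scaling formula are simplifications chosen for illustration):

import random

def estimate_join_cardinality(table_a, table_b, key, sample_rate=0.01):
    # Join small random samples of both tables and scale the count back up.
    sample_a = [row for row in table_a if random.random() < sample_rate]
    sample_b = [row for row in table_b if random.random() < sample_rate]
    matches = sum(1 for a in sample_a for b in sample_b if a[key] == b[key])
    # Each matching sampled pair stands for roughly 1 / sample_rate**2 real pairs.
    return matches / sample_rate ** 2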
In the second case, some additional information about the database tables is used - for example, their size, the number of unique values, the most common values, etc. Based on these data, you can "estimate" the size of the result of a particular query. Such estimates are rough, and to refine them, optimizers often resort to histograms - information about the distribution of values in a particular column. When constructing a histogram, the value range of the column is divided into intervals, and for each interval, the number of values falling into it is counted. The intervals themselves can be defined in different ways: equal in range (equi-width) or equal in the number of values (equi-depth).
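For instance, a simple equi-width histogram over an age column could be built like this (the bucket width and the sample data are arbitrary choices for the sketch):

from collections import Counter

def equi_width_histogram(values, bucket_width):
    # Count how many values fall into each fixed-width interval [k, k + bucket_width).
    return Counter((v // bucket_width) * bucket_width for v in values)

ages = [18, 19, 22, 25, 31, 33, 34, 47, 52, 58]
print(equi_width_histogram(ages, bucket_width=10))   # e.g. Counter({30: 3, 10: 2, 20: 2, 50: 2, 40: 1})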
In this case, the cardinality is estimated using another property of queries - selectivity. Selectivity is the ratio of the number of rows returned by a query to the total number of rows in the table. That is, if the user table contains records of 100 people, 25 of whom are under 50 and 75 are older, then the selectivity of the query SELECT * FROM user WHERE age > 50 will be 75/100 = 0.75.
Let's get back to the evaluation of cardinality. Suppose the user table has 1000 records, and according to the distribution histogram, the number of users with age=20 is 100 and the number of users with sex='male' is 500. Then the query SELECT * FROM user WHERE age=20 AND sex='male' would be evaluated as follows: age=20 gives a selectivity of 100/1000 = 0.1, and sex='male' gives 500/1000 = 0.5. Assuming the two predicates are independent, the final selectivity is 0.1 * 0.5 = 0.05, and the cardinality of the query can thus be estimated as 1000 * 0.05 = 50 records.
Well, if there is no data on the distribution of values in a certain column, you can fall back on a number of assumptions - for example, that the values are evenly distributed.
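Putting these pieces together, a deliberately simplified estimator might look like the sketch below (the per-value statistics format, the independence assumption, and the uniform-distribution fallback are all explicit simplifications):

def selectivity(stats, total_rows, column, value, distinct_values=None):
    # With statistics: the fraction of rows holding this value; otherwise assume uniformity.
    if column in stats:
        return stats[column].get(value, 0) / total_rows
    return 1 / distinct_values if distinct_values else 1.0

def estimate_cardinality(total_rows, selectivities):
    # Multiply the selectivities, assuming the predicates are independent.
    result = total_rows
    for s in selectivities:
        result *= s
    return result

# The example from the text: 1000 users, age=20 -> 100 rows, sex='male' -> 500 rows.
stats = {"age": {20: 100}, "sex": {"male": 500}}
s_age = selectivity(stats, 1000, "age", 20)        # 0.1
s_sex = selectivity(stats, 1000, "sex", "male")    # 0.5
print(estimate_cardinality(1000, [s_age, s_sex]))  # 50.0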
Stage 2. Applying the Cost Model
The cost model specifies the mapping of the resulting plan tree to a certain number that will indicate its final cost.
In turn, the cost of the plan is the sum of the costs of all the operators included in it. The cost of an operator can depend on many factors. Some of them are predetermined and depend, for example, on the hardware on which the DBMS is deployed. Others are calculated dynamically - for example, the operator's cardinality estimate, how full the buffers are, or whether free threads are available for parallel execution.
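As an illustration, a toy cost model might charge each operator a per-page I/O component and a per-tuple CPU component and then sum the operator costs over the plan (the constants, operators, and table names below are invented for the sketch; real cost models are calibrated and far more detailed):

IO_COST_PER_PAGE = 1.0       # invented constant
CPU_COST_PER_TUPLE = 0.01    # invented constant

def seq_scan_cost(pages, rows):
    return pages * IO_COST_PER_PAGE + rows * CPU_COST_PER_TUPLE

def hash_join_cost(build_rows, probe_rows):
    # Roughly: hash every build-side tuple once, then probe once per probe-side tuple.
    return (build_rows + probe_rows) * 2 * CPU_COST_PER_TUPLE

def plan_cost(operator_costs):
    # The cost of a plan is the sum of the costs of its operators.
    return sum(operator_costs)

print(plan_cost([
    seq_scan_cost(pages=100, rows=10_000),    # scan the user table
    seq_scan_cost(pages=500, rows=50_000),    # scan the orders table
    hash_join_cost(build_rows=10_000, probe_rows=50_000),
]))   # the number is meaningful only when compared with alternative plans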
Stage 3. Enumeration of Plans
Plan enumeration is the choice of the optimal join order. It is important to keep an eye on the plan build time when a single query joins many tables, since the number of possible join orders grows explosively. Most often, this problem is solved by applying a number of heuristics at the expense of the optimality of the final plan.
The two most commonly used enumeration algorithms are bottom-up plan tree traversal using dynamic programming and top-down plan tree traversal using memoization.
First Approach: Bottom-Up
The first algorithm we'll look at is bottom-up tree traversal using dynamic programming. Here, to compute the optimal plan at the i-th level of the tree, all plans at the (i-1)-th level are computed first, and the best of them is selected. Obviously, in its pure form this approach degenerates into an exhaustive enumeration of all options, so for practical use it has to be restricted. P. G. Selinger proposed considering only left-deep trees, in which the elements on the left branch have the smallest cardinality estimates.
Approximately like this:
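(A minimal sketch of the idea; the scan_cost and join_cost functions stand in for the cost model described above, and the table sizes in the usage example are made up.)

from itertools import combinations

def best_left_deep_plan(tables, scan_cost, join_cost):
    # best[S] holds (cost, plan) for the cheapest left-deep plan joining the table set S.
    best = {frozenset([t]): (scan_cost(t), t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            candidates = []
            for right in s:                           # the table joined last, on the right
                left_cost, left_plan = best[s - {right}]
                candidates.append((left_cost + join_cost(left_plan, right), (left_plan, right)))
            best[s] = min(candidates, key=lambda c: c[0])   # keep only the cheapest plan per set
    return best[frozenset(tables)]

# Toy usage with made-up table sizes and a deliberately naive cost function:
sizes = {"user": 1_000, "orders": 50_000, "items": 10_000}
print(best_left_deep_plan(
    list(sizes),
    scan_cost=lambda t: sizes[t],
    join_cost=lambda left_plan, right: sizes[right],
))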
To understand why this works, let's look at an example. Suppose we have three tables B, C, and D. Let A be the table that results from joining tables C and D.
Assume that some contains operation returns true if the table passed as the first argument contains a tuple that can be naturally joined with the tuple passed as the second argument (i.e., the values of attributes with matching names are equal).
Let's first try to join A with B (with A as the outer table), and then B with A. In the first case, the join process can be described by the following pseudocode:
result = []
for a in A:                   # scan every tuple of the intermediate table A
    if contains(B, a):        # is there a tuple b in B that joins with a?
        result.append(a x b)  # collect the joined tuple (pseudocode notation)
If we assume that the contains method checks each record of table B in turn to find the desired tuple, then the above operation will obviously have quadratic complexity. But since B is a base table permanently stored in the database, an index can be built on it, which allows faster lookups and lowers the asymptotic complexity of the operation.
In the second case, the process can be represented by the following (outwardly very similar) pseudocode:
result = []
for b in B:                   # scan every tuple of the base table B
    if contains(A, b):        # is there a tuple a in A that joins with b?
        result.append(b x a)  # collect the joined tuple (pseudocode notation)
It would seem that the reasoning here should be similar, but it is not. Since table A is an intermediate result of the computation, it has no predefined indexes, which means that the complexity of the operation in this case will always be quadratic, regardless of any indexes on table B. So, in theory, such trees should not be considered at all: why bother, if the option is obviously inefficient?
Second Approach: Top-Down
Another approach, based on memoization, is to build a tree whose nodes are groups of operators and to use these groups to assemble the final result. Each operator in a group is assigned a cost estimate, and at every level one "winner" is selected - the best operator in its group. The set of groups considered while searching for the optimal result is called the MEMO. In many modern optimizers, the result of choosing the optimal variant for each node is remembered in order to reduce the search space later on.
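A minimal sketch of this idea, mirroring the bottom-up version above (the MEMO is reduced to a plain dictionary keyed by sets of tables; scan_cost and join_cost are again stand-ins for the real cost model):

def best_plan(tables, scan_cost, join_cost, memo=None):
    # memo maps a frozenset of tables (a "group") to its winner: (cost, plan).
    memo = {} if memo is None else memo
    group = frozenset(tables)
    if group in memo:
        return memo[group]                 # reuse the winner chosen earlier
    if len(group) == 1:
        (t,) = group
        memo[group] = (scan_cost(t), t)
        return memo[group]
    candidates = []
    for right in group:                    # peel off one table at a time
        left_cost, left_plan = best_plan(group - {right}, scan_cost, join_cost, memo)
        candidates.append((left_cost + join_cost(left_plan, right), (left_plan, right)))
    memo[group] = min(candidates, key=lambda c: c[0])
    return memo[group]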
Comparing Heuristic and Cost Approaches
Both the heuristic and cost-based approaches have their strengths and weaknesses in query optimization. Let's compare these approaches based on several key factors:
- Optimality: The cost-based approach, by considering the actual costs, has the potential to generate more optimal execution plans tailored to the specific data and workload characteristics. In contrast, the heuristic approach relies on predefined rules and guidelines, which may not always lead to the best performance.
- Adaptability: The cost-based approach excels in adaptability, as it can dynamically adjust decisions based on changing data distributions and workload characteristics. The cost model and cardinality estimation enable the optimizer to make more informed decisions. On the other hand, the heuristic approach lacks the ability to adapt and may produce suboptimal plans in scenarios where data characteristics change over time.
- Computational Complexity: The heuristic approach is generally computationally less expensive compared to the cost-based approach. It relies on predefined rules and guidelines, enabling quick optimization solutions. In contrast, the cost-based approach involves cost estimation, cardinality estimation, and exploration of plan alternatives, which can be computationally intensive. However, advancements in query optimization algorithms and techniques have reduced the computational overhead of the cost-based approach.
- Data and Workload Considerations: The heuristic approach may be more suitable in cases where the data distribution is relatively stable, and the workload characteristics are well-known. It can provide quick optimization solutions based on predefined rules, which are often designed to capture common optimization patterns. The cost-based approach, on the other hand, shines in scenarios with changing data distributions and evolving workloads, as it can adapt and generate plans based on actual cost estimates.
- Query Complexity: The complexity of the query itself can influence the choice between heuristic and cost-based approaches. Simple queries with few joins and filters may benefit from the simplicity and quick optimization solutions offered by the heuristic approach. On the other hand, complex queries involving multiple tables, complex joins, and aggregations may benefit from the cost-based approach, which can explore a larger search space and consider the estimated costs of various plan alternatives.
In practice, the choice between the heuristic and cost-based approaches depends on a variety of factors, including the nature of the data, workload characteristics, query complexity, and performance requirements. In some cases, a hybrid approach that combines elements of both approaches may be adopted to leverage their respective strengths.
It's worth noting that the field of query optimization continues to evolve, and ongoing research efforts aim to develop more advanced techniques that bridge the gap between the heuristic and cost-based approaches. These efforts aim to combine the efficiency of heuristic methods with the adaptability and optimality of cost-based methods, providing even better optimization solutions for complex query workloads.
To sum up the comparison, both the heuristic and cost-based approaches play important roles in query optimization. While the heuristic approach offers simplicity and quick optimization solutions, the cost-based approach provides adaptability and the potential for more optimal plans. By understanding their characteristics and trade-offs, database practitioners can make informed decisions and employ the most appropriate approach based on the specific requirements of their systems and queries.
Conclusion
Query optimization is a critical component of database management systems, aiming to enhance the performance and efficiency of query execution. In this article, we explored two main approaches to query optimization: the heuristic approach and the cost-based approach. Each approach has its strengths and weaknesses, and understanding their characteristics is essential for making informed decisions in optimizing queries.
The heuristic approach offers simplicity and quick optimization solutions based on predefined rules and guidelines. It is particularly suitable in scenarios with stable data distributions and well-known workload characteristics. The cost-based approach, on the other hand, considers the estimated costs of executing different query plans, enabling adaptability and the potential for more optimal plans. It excels in scenarios with changing data distributions and evolving workloads.
The choice between the heuristic and cost-based approaches depends on various factors, including the nature of the data, workload characteristics, query complexity, and performance requirements. In practice, a hybrid approach that combines elements of both approaches may be employed to leverage their respective strengths.
Advancements in query optimization algorithms and techniques continue to bridge the gap between the heuristic and cost-based approaches. Ongoing research efforts aim to develop more advanced optimization techniques that offer the efficiency of heuristic methods combined with the adaptability and optimality of cost-based methods. These advancements will further enhance the performance of query execution in complex and dynamic environments.
As database systems continue to handle ever-increasing volumes of data and more complex queries, the importance of query optimization cannot be overstated. Efficient query execution directly impacts the overall performance and responsiveness of applications and systems. By leveraging the appropriate optimization techniques, organizations can derive valuable insights from their data more efficiently and effectively.
In conclusion, query optimization remains a fascinating and evolving field, with the heuristic and cost-based approaches serving as two key pillars. By understanding their principles, techniques, and trade-offs, database practitioners can navigate the optimization landscape and employ the most appropriate approach to ensure optimal query performance in their systems. Through ongoing research and advancements, the quest for even more efficient and adaptable query optimization techniques continues, setting the stage for future innovations in this critical area of database management.