570 likes | 1.49k Views
Query Optimization in Database Systems. Lindsey Blackstock CSc 6710. Overview. Description of query optimization and important concepts Survey of query optimization research Query optimization example. Basics.
E N D
Query Optimization in Database Systems Lindsey Blackstock CSc 6710
Overview • Description of query optimization and important concepts • Survey of query optimization research • Query optimization example
Basics • Query optimization describes the process by which a DBMS determines a “reasonably efficient strategy” [Elmasri 550] for executing high-level queries, such as SQL queries. • Like other high-level language statements, a query is scanned to determine tokens and parsed to check syntax. The query is also validated to ensure that the attribute and relation names used in the query are meaningful in the context of the current database schema.
Basics • After a query has been scanned, parsed, and validated, the DBMS must choose from a set of possible execution strategies for the query, and this is known as query optimization. • The part of the DBMS responsible for selecting an execution strategy is the query optimizer, and it passes the selected strategy to the query-code generator which creates the code and sends it off to be processed.
The SELECT Operation • There are several ways to implement a SELECT operation: • Linear Search (brute force) • Binary Search • Using a primary key • Using a primary key for multiple records • Using a clustering index for multiple records • Using a B+ tree index on an equality comparison Similarly, the JOIN, SELECT, and PROJECT operations have multiple implementation techniques.
Cost-Based Optimization • For a particular query, a single SELECT operation could have several solutions, each of which has an associated cost. • Cost of executing a query: • Access cost to secondary storage • Storage cost • Computation cost • Memory usage cost • Communication cost
Cost-Based Optimization • The possible solutions for a query are represented as nodes (states) in a strategy space. • Each state has an associated cost • The goal of a query optimizer is to search all states to find the node of least cost (global minimum) • Downhill move – the cost of the destination state is lower than the cost of the source state. • A state is a local minimum “if, in all paths starting at that state, any downhill move comes after at least one uphill move.” [Ioannadis 2]
Optimization Research • “Left-deep vs. Bushy Trees: an analysis of strategy spaces and its implications for query optimization” • Left-deep and bushy trees are types of join processing trees, where the leaves are base relations, the internal nodes are join operators, and the edges are directed and indicate a flow of data • If all internal nodes have at least one child, the tree is deep. Otherwise, it is bushy. • Left-deep trees have all (inner) joins as base relations.
Research • The authors consider two strategy spaces: one containing only left-deep trees (L) and one containing both left-deep and bushy trees (A). • Connection cost – the path of least cost between any two nodes (strategies) • The authors concern themselves with two parameters: • Cost distribution of local minima (i.e., which local minima have cost close to global min.) • Connection cost of low local minima (used to determine whether the connection costs between local minima are close to their costs)
Research • Strategy spaces are divided into three categories based on cost distribution. These describe the “shape” of the cost function for the space • A1. Most local minima are low • A2. Some are low, but most are high • A3. Local minima are scattered at all costs
Research • Similarly, the strategy space is divided into three categories based on connection cost of low local minima • 1. The connection cost is close to the cost of the low local minima. • 2. The connection cost is not close to the cost of the low local minima, but is still relatively low. • 3. Connection cost is much higher than the cost of the low local minima
Research • The authors consider three optimization algorithms • Iterative Improvement (II), which performs local optimizations by selecting random downhill moves until a local minimum is reached. II outputs the least cost local min. • Simulated Annealing (SA), which starts at a random state and performs random moves. If the move is uphill, it is only selected with a certain probability. This probability decreases during the process. Thus, uphill moves are more likely at the beginning of the search, and they prevent getting “stuck” in a local minimum • Two-phase optimization (2PO), which is a combination of II and SA. II is run for a time, and the output is used as the starting point for SA. SA is given a low probability for uphill moves.
Research • Without going into too much detail, the authors’ results are as follows, and are based on various statistical observations • Query optimization in space A is easier than in space L, even though space A is much larger than space L • Space L has shape A1-B1 or A2-B2, and A has shape A2-B1 or A2-B2 • Due to the fact that 2PO takes advantage of the best qualities of the other two, it is the algorithm of choice for optimizing these types of strategy spaces
Heuristic Optimization • Where cost-based optimization relies on determining a solution that minimizes cost, heuristic optimization uses a set of relational algebra rules to transform an initial query tree (will be discussed in example) into an more efficient optimized query tree. • One heuristic optimization algorithm is described on Elmasri p. 576, and it will be used in the following example.
Heuristic Optimization Example Query: Find the last names of players who were drafted after 2000 who have worked on a play called ‘halfback_pass’. SQL: SELECT Lname FROM PLAYER, WORKS_ON, PLAY WHERE PlayName = ‘halfback_pass’ AND PlayNo = PlayNum AND Pssn = Ssn AND DraftDate > ‘2000-12-31’
Heuristic Optimization Example PROJECT(Lname) SELECT(PlayName = ‘halfback_pass’ AND PlayNo = PlayNum AND Pssn = Ssn AND DraftDate > ‘2000-12-31’) x x PLAY PLAYER WORKS_ON
Heuristic Optimization Example • What’s wrong with the tree? • It should work, but it is not optimal • Why not? As is, we would have to find the Cartesian product of the PLAYER, WORKS_ON, and PLAY files. Depending on the size and number of records in each table, the Cartesian product could be huge and would contain lots of unneeded data.
Heuristic Optimization Example • First, move SELECT operations down the query tree • Second, perform the more restrictive SELECT operations first • Third, replace CARTESIAN PRODUCT and SELECT combinations with JOIN operations • Finally, move PROJECT operations down the query tree • This is called heuristic optimization
References • Ramez Elmasri and Shamkant Navathe, Fundamentals of Database Systems, 5th Ed., Boston, Pearson-Addison Wesley, 2007, Chapter 15 • Yannis E. Ioannidis, Younkyung Cha Kang, Left- deep vs. bushy trees: an analysis of strategy spaces and its implications for query optimization, ACM SIGMOD Record , Proceedings of the 1991 ACM SIGMOD international conference on management of data SIGMOD '91, Volume 20 Issue 2, 1991
THE END Thanks for your attention.