Query Optimization and Intro to Transactions

Query Optimizationand Intro to Transactions R&G, Chapter 15 Chapter 16 Lecture 18

Administrivia • Homework 3 due next Tuesday, March 20 by end of class period • Homework 4 available on class website • Implement nested loops and hash join operators for (new!) minibase • Due date: April 10 (after Spring Break) • Midterm 2 is 3/22, 1 week from today • In class, covers lectures 10-17 • Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall • Internships at Google this summer… • See http://www.postgresql.org/developer/summerofcode • Booth at the UCB TechExpo this Thursday: • http://csba.berkeley.edu/tech_expo.html • Contact josh@postgresql.org

Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB Review We are here • Query plans are a tree of operators that compute the result of a query • Optimization is the process of picking the best plan • Execution is the process of executing the plan

Sid, COUNT(*) AS numbers GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Example Select S.sid, COUNT(*) AS number FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” GROUP BY S.sid Sailors: Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key) 40,000 sid values 10 ratings Reserves: Tuples 40 bytes long, 100 tuples/page, 1000 pages Clustered B+ tree on bid (key), Unclustered B+ on sid 100 distinct bid values Boats Tuples 50 bytes long, 80 tuples/page, 100 pages Unclustered B+, Clustered Hash on color, 50 distinct color values

Sid, COUNT(*) AS numbers GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Pass 1: Best access method for each relation • Sailors • No predicates, so File Scan is best • Reserves • No predicates so File Scan is best • What about Index Scan with B+ index on bid for Reserves? • Keep it in mind…tuples will come out in join order…might come in handy later for a join on bid… Sailors: Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key) 40,000 sid values 10 ratings Reserves: Tuples 40 bytes long, 100 tuples/page, 1000 pages Clustered B+ tree on bid (key), Unclustered B+ on sid 100 distinct bid values

Sid, COUNT(*) AS number GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Pass 1: Best access method for each relation • Boats • File Scan • 100 I/Os • B+ Index Scan on color • 2-3 + 80*100/50 = • 3 + 160 I/Os = 163 I/Os • Hash Index Scan on color • 1.2 + (80*100/50)/80 = • 1.2 + 2 I/Os Always keep around just in case Book says to keep it because of interesting order Cheapest Boats Tuples 50 bytes long, 80 tuples/page, 100 pages Unclustered B+, Clustered Hash on color, 50 distinct color values

bid=bid sid=sid bid=bid Reserves Reserves Color=red Color=red Sailors Reserves Boats Boats sid=sid Sailors Reserves Pass 2 • For each of the plans in pass 1: • generate left deep plans for joins- consider different order and join methods X • Question: what about SailorsXBoats?

Color=red bid=bid bid=bid Reserves Reserves Color=red Boats Boats Pass 2 • First consider which pass 1 plans to use for access path • Hash index(color) Boats, File scan Reserves • B+(color), File Scan Reserves • File scan Reserves, File Scan Boats Boats: B+ tree(color), Hash(color), File Scan Sailors: File Scan Reserves: File Scan

Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan sid=sid Sailors Reserves sid=sid Sailors Reserves Pass 2 • First consider which pass 1 plans to use for access path • File scan Reserves, File Scan Sailors • File Scan Sailors, File Scan Reserves • B+(sid), File Scan Reserves • Note: book also includes these plans: • File Scan Sailors (outer) with Boats (inner) • Boats hash on color with Sailors (inner) • Boats Btree on color with Sailors (inner) • Would you agree these should be considered?

Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan Color=red bid=bid bid=bid Reserves Reserves Color=red Boats Boats Pass 2 • Now consider join methods and consider all access paths for inner If you replace file scan of Reserves with B+ index scan, Index Nested Loops is a good choice! Sort-Merge could also be a good choice because Tuples come out in bid-order. • Hash index(color) Boats, File scan Reserves • B+(color), File Scan Reserves • File scan Reserves, File Scan Boats Reserves: Clustered B+ tree on bid (key), Unclustered B+ on sid Boats Unclustered B+, Clustered Hash on color, 50 distinct color values

Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan sid=sid Sailors Reserves sid=sid Sailors Reserves Pass 2 • Now consider join methods • File scan Reserves, File Scan Sailors • File Scan Sailors, File Scan Reserves • B+(color), File Scan Reserves Exercise: Which plans would you keep for this set of joins?

bid=bid Reserves From pass 2 Color=red Sort-merge Boats B+ index(bid) Reserves Hash index(color) Boats Pass 3 and beyond • For each of the plans retained from Pass 2, taken as the outer, generate plans for the next join • For example, let’s take the sort-merge plan for BoatsxReserves and add in Sailors: Sort-merge sid=sid Note that this plan will produce tuples in sid order; interesting order for the upcoming GROUP BY. Sailors B+ index(sid) Reserves • GROUP BY, ORDER BY, AGGREGATES are all considered after the join plans are chosen.

Nested Queries (Subqueries) • Nested queries are parsed into their own ‘query block’ and optimized separately. • An uncorrelated subquery can be computed once per query. -> Optimizer can choose a plan for the subquery and add temp operator to ‘cache’ the subquery results for use in the rest of the query • Two kinds: Uncorrelated Correlated SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.day = ’01/31/07’) SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Plan for outer query TEMP Subquery plan

Nested Queries Correlated • Optimizer chooses a plan for subquery, and treats it as a subroutine to be invoked once per tuple produced by the outer query block. SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer value Plan for outer query Invoked once per tuple produced by outer query block Subquery plan Equivalent non-nested query: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103

Nested Queries Correlated • Sometimes correlated subqueries can be rewritten as a non-correlated query • When rewriting, need to be careful to preserve semantics Returns at most 1 tuple per sailor; need to add DISTINCT to ensure same semantics in rewritten query Equivalent non-nested query: SELECT DISTINCT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103 SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) • Rewritten query gives the optimizer more choices for optimization, and it is more likely to choose a good plan

Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB DB21 DB22 Taking it one step further… • My area of research is data integration This is a very common case in the world today: Query over multiple DBMS’s • What if I could use the query planning and execution of a DBMS to query over multiple, different DBMSs, and other data sources as well? Other DBs Files Spreadsheets …. Oracle Postgres

Taking it one step further… • Heterogenous Data Integration leads to all sorts of fun with optimization … • Compensation becomes possible… • You can do complex SQL queries over simple data sources…even if a spreadsheet can’t do joins, the DBMS optimizing the query can! • Knowing what other sources can do becomes a factor • Oracle, DB2, Postgres all support slightly different versions of SQL • If the data source isn’t a DBMS, how do I know what operations it can perform (e.g. can a spreadsheet do projects? Filters? Joins?) • Where to do the work becomes a factor… • e.g., Should I always push as much of query as possible to another DBMS? • Network cost becomes a factor… • When do we ship tuples from original source to DBMS processing the complete query?

Summary • Query optimization is an important task in a relational DBMS. • Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). • Two parts to optimizing a query: • Consider a set of alternative plans. • Must prune search space; typically, left-deep plans only. • Must estimate cost of each plan that is considered. • Must estimate size of result and cost for each plan node. • Key issues: Statistics, indexes, operator implementations.

Transaction Management Overview R & G Chapter 16 There are three side effects of acid. Enhanced long term memory, decreased short term memory, and I forget the third. - Timothy Leary

transaction Data Definition query Query Compiler Transaction Manager Schema Manager Execution Engine Logging/Recovery Concurrency Control Buffer Manager LOCK TABLE Storage Manager BUFFERS BUFFER POOL Components of a DBMS DBMS: a set of cooperating software modules

Concurrency Control & Recovery • Very valuable properties of DBMSs • without these, DBMSs would be much less useful • Based on concept of transactions with ACID properties (yep…they’re baa-aack…)

Statement of Problem • Concurrent execution of independent transactions • utilization/throughput (“hide” waiting for I/Os.) • response time • fairness • Example: T2 wins, but if there was a slight delay, maybe T1 would win -> DBMS wants to ensure a predictable outcome for concurrent users t0: t1: t2: t3: t4: t5: T1: tmp1 := read(X) tmp1 := tmp1 – 20 write tmp1 into X T2: tmp2 := read(X) tmp2 := tmp2 + 10 write tmp2 into X

Statement of problem (cont.) • Arbitrary interleaving can lead to • Temporary inconsistency (ok, unavoidable) • “Permanent” inconsistency (bad!) • Need formal correctness criteria.

Definitions • A program may carry out many operations on the data retrieved from the database • However, the DBMS is only concerned about what data is read/written from/to the database. • transaction- a sequence of read and write operations (read(A), write(B), …) • DBMS’s abstract view of a user program

Correctness criteria: The ACID properties • Atomicity: All actions in the Xact happen, or none happen. • Consistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent. • Isolation: Execution of one Xact is isolated from that of other Xacts. • D urability: If a Xact commits, its effects persist.

Atomicity of Transactions A • Two possible outcomes of executing a transaction: • Xact mightcommitafter completing all its actions • or it could abort(or be aborted by the DBMS) after executing some actions. • DBMS guarantees that Xacts are atomic. • From user’s point of view: Xact always either executes all its actions, or executes no actions at all.

Mechanisms for Ensuring Atomicity A • Main approach: LOGGING • DBMS logs all actions so that it can undothe actions of aborted transactions. • Logging used by modern systems, because of need for audit trail and for efficiency reasons.

transaction T Transaction Consistency C • “Consistency” - data in DBMS is accurate in modeling real world and follows integrity constraints • User must ensure transaction consistent by itself • I.e., if DBMS consistent before Xact, it will be after also • Key point: consistent database S1 consistent database S2

Transaction Consistency (cont.) C • Recall: Integrity constraints • must be true for DB to be considered consistent • Examples: 1.FOREIGN KEY R.sid REFERENCES S 2.ACCT-BAL >= 0 • System checks ICs and if they fail, the transaction rolls back (i.e., is aborted). • Beyond this, DBMS does not understand the semantics of the data. • e.g., it does not understand how interest on a bank account is computed

Isolation of Transactions I • Users submit transactions, and • Each transaction executes as if it was running by itself. • Concurrency is achieved by DBMS, which interleaves actions (reads/writes of DB objects) of various transactions. • Many techniques have been developed. Fall into two basic categories: • Pessimistic – don’t let problems arise in the first place • Optimistic – assume conflicts are rare, deal with them after they happen.

Example I • Consider two transactions (Xacts): T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.06*A, B=1.06*B END • 1st xact transfers $100 from B’s account to A’s • 2nd credits both accounts with 6% interest. • Assume at first A and B each have $1000. What are the legal outcomes of running T1 and T2??? • $2000 *1.06 = $2120 • There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. But, the net effect must be equivalent to these two transactions running serially in some order.

Example (Contd.) I • Legal outcomes: A=1166,B=954 or A=1160,B=960 • Consider a possible interleaved schedule: T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • This is OK (same as T1;T2). But what about: T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • Result: A=1166, B=960; A+B = 2126, bank loses $6 • The DBMS’s view of the second schedule: T1: R(A), W(A), R(B), W(B) T2: R(A), W(A), R(B), W(B)

Formal Properties of Schedules I • Serial schedule: Schedule that does not interleave the actions of different transactions. • Equivalent schedules:For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule. • Serializable schedule: A schedule that is equivalent to some serial execution of the transactions. (Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )

Query Optimization and Intro to Transactions

Query Optimization and Intro to Transactions

Presentation Transcript

Query Optimization

Query Optimization

Intro to Optimization

Query Optimization

Intro To j Query

QUERY OPTIMIZATION AND QUERY PROCESSING

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Processing and Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query Optimization

Query optimization

Query Optimization

Query Optimization

Introduction to Query Processing and Query Optimization Techniques

Query Optimization

Query Optimization