330 likes | 438 Views
Query Optimization and Intro to Transactions. R&G, Chapter 15 Chapter 16 Lecture 18. Administrivia. Homework 3 due next Tuesday, March 20 by end of class period Homework 4 available on class website Implement nested loops and hash join operators for (new!) minibase
E N D
Query Optimizationand Intro to Transactions R&G, Chapter 15 Chapter 16 Lecture 18
Administrivia • Homework 3 due next Tuesday, March 20 by end of class period • Homework 4 available on class website • Implement nested loops and hash join operators for (new!) minibase • Due date: April 10 (after Spring Break) • Midterm 2 is 3/22, 1 week from today • In class, covers lectures 10-17 • Review will be held Tuesday 3/20 7-9 pm 306 Soda Hall • Internships at Google this summer… • See http://www.postgresql.org/developer/summerofcode • Booth at the UCB TechExpo this Thursday: • http://csba.berkeley.edu/tech_expo.html • Contact josh@postgresql.org
Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB Review We are here • Query plans are a tree of operators that compute the result of a query • Optimization is the process of picking the best plan • Execution is the process of executing the plan
Sid, COUNT(*) AS numbers GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Example Select S.sid, COUNT(*) AS number FROM Sailors S, Reserves R, Boats B WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = “red” GROUP BY S.sid Sailors: Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key) 40,000 sid values 10 ratings Reserves: Tuples 40 bytes long, 100 tuples/page, 1000 pages Clustered B+ tree on bid (key), Unclustered B+ on sid 100 distinct bid values Boats Tuples 50 bytes long, 80 tuples/page, 100 pages Unclustered B+, Clustered Hash on color, 50 distinct color values
Sid, COUNT(*) AS numbers GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Pass 1: Best access method for each relation • Sailors • No predicates, so File Scan is best • Reserves • No predicates so File Scan is best • What about Index Scan with B+ index on bid for Reserves? • Keep it in mind…tuples will come out in join order…might come in handy later for a join on bid… Sailors: Tuples 50 bytes long, 80 tuples/page, 500 pages Unclustered B+ and Hash on sid (key) 40,000 sid values 10 ratings Reserves: Tuples 40 bytes long, 100 tuples/page, 1000 pages Clustered B+ tree on bid (key), Unclustered B+ on sid 100 distinct bid values
Sid, COUNT(*) AS number GROUPBYsid sid=sid Sailors bid=bid Reserves Color=red Boats Pass 1: Best access method for each relation • Boats • File Scan • 100 I/Os • B+ Index Scan on color • 2-3 + 80*100/50 = • 3 + 160 I/Os = 163 I/Os • Hash Index Scan on color • 1.2 + (80*100/50)/80 = • 1.2 + 2 I/Os Always keep around just in case Book says to keep it because of interesting order Cheapest Boats Tuples 50 bytes long, 80 tuples/page, 100 pages Unclustered B+, Clustered Hash on color, 50 distinct color values
bid=bid sid=sid bid=bid Reserves Reserves Color=red Color=red Sailors Reserves Boats Boats sid=sid Sailors Reserves Pass 2 • For each of the plans in pass 1: • generate left deep plans for joins- consider different order and join methods X • Question: what about SailorsXBoats?
Color=red bid=bid bid=bid Reserves Reserves Color=red Boats Boats Pass 2 • First consider which pass 1 plans to use for access path • Hash index(color) Boats, File scan Reserves • B+(color), File Scan Reserves • File scan Reserves, File Scan Boats Boats: B+ tree(color), Hash(color), File Scan Sailors: File Scan Reserves: File Scan
Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan sid=sid Sailors Reserves sid=sid Sailors Reserves Pass 2 • First consider which pass 1 plans to use for access path • File scan Reserves, File Scan Sailors • File Scan Sailors, File Scan Reserves • B+(sid), File Scan Reserves • Note: book also includes these plans: • File Scan Sailors (outer) with Boats (inner) • Boats hash on color with Sailors (inner) • Boats Btree on color with Sailors (inner) • Would you agree these should be considered?
Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan Color=red bid=bid bid=bid Reserves Reserves Color=red Boats Boats Pass 2 • Now consider join methods and consider all access paths for inner If you replace file scan of Reserves with B+ index scan, Index Nested Loops is a good choice! Sort-Merge could also be a good choice because Tuples come out in bid-order. • Hash index(color) Boats, File scan Reserves • B+(color), File Scan Reserves • File scan Reserves, File Scan Boats Reserves: Clustered B+ tree on bid (key), Unclustered B+ on sid Boats Unclustered B+, Clustered Hash on color, 50 distinct color values
Boats: B+ tree(bid), Hash(bid), File Scan Sailors: File Scan Reserves: File Scan sid=sid Sailors Reserves sid=sid Sailors Reserves Pass 2 • Now consider join methods • File scan Reserves, File Scan Sailors • File Scan Sailors, File Scan Reserves • B+(color), File Scan Reserves Exercise: Which plans would you keep for this set of joins?
bid=bid Reserves From pass 2 Color=red Sort-merge Boats B+ index(bid) Reserves Hash index(color) Boats Pass 3 and beyond • For each of the plans retained from Pass 2, taken as the outer, generate plans for the next join • For example, let’s take the sort-merge plan for BoatsxReserves and add in Sailors: Sort-merge sid=sid Note that this plan will produce tuples in sid order; interesting order for the upcoming GROUP BY. Sailors B+ index(sid) Reserves • GROUP BY, ORDER BY, AGGREGATES are all considered after the join plans are chosen.
Nested Queries (Subqueries) • Nested queries are parsed into their own ‘query block’ and optimized separately. • An uncorrelated subquery can be computed once per query. -> Optimizer can choose a plan for the subquery and add temp operator to ‘cache’ the subquery results for use in the rest of the query • Two kinds: Uncorrelated Correlated SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.day = ’01/31/07’) SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Plan for outer query TEMP Subquery plan
Nested Queries Correlated • Optimizer chooses a plan for subquery, and treats it as a subroutine to be invoked once per tuple produced by the outer query block. SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) Nested block to optimize: SELECT * FROM Reserves R WHERE R.bid=103 AND S.sid= outer value Plan for outer query Invoked once per tuple produced by outer query block Subquery plan Equivalent non-nested query: SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103
Nested Queries Correlated • Sometimes correlated subqueries can be rewritten as a non-correlated query • When rewriting, need to be careful to preserve semantics Returns at most 1 tuple per sailor; need to add DISTINCT to ensure same semantics in rewritten query Equivalent non-nested query: SELECT DISTINCT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND R.bid=103 SELECT S.sname FROM Sailors S WHERE EXISTS (SELECT * FROM Reserves R WHERE R.bid=103 AND R.sid=S.sid) • Rewritten query gives the optimizer more choices for optimization, and it is more likely to choose a good plan
Query Optimization and Execution Relational Operators Files and Access Methods Buffer Management Disk Space Management DB DB21 DB22 Taking it one step further… • My area of research is data integration This is a very common case in the world today: Query over multiple DBMS’s • What if I could use the query planning and execution of a DBMS to query over multiple, different DBMSs, and other data sources as well? Other DBs Files Spreadsheets …. Oracle Postgres
Taking it one step further… • Heterogenous Data Integration leads to all sorts of fun with optimization … • Compensation becomes possible… • You can do complex SQL queries over simple data sources…even if a spreadsheet can’t do joins, the DBMS optimizing the query can! • Knowing what other sources can do becomes a factor • Oracle, DB2, Postgres all support slightly different versions of SQL • If the data source isn’t a DBMS, how do I know what operations it can perform (e.g. can a spreadsheet do projects? Filters? Joins?) • Where to do the work becomes a factor… • e.g., Should I always push as much of query as possible to another DBMS? • Network cost becomes a factor… • When do we ship tuples from original source to DBMS processing the complete query?
Summary • Query optimization is an important task in a relational DBMS. • Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). • Two parts to optimizing a query: • Consider a set of alternative plans. • Must prune search space; typically, left-deep plans only. • Must estimate cost of each plan that is considered. • Must estimate size of result and cost for each plan node. • Key issues: Statistics, indexes, operator implementations.
Transaction Management Overview R & G Chapter 16 There are three side effects of acid. Enhanced long term memory, decreased short term memory, and I forget the third. - Timothy Leary
transaction Data Definition query Query Compiler Transaction Manager Schema Manager Execution Engine Logging/Recovery Concurrency Control Buffer Manager LOCK TABLE Storage Manager BUFFERS BUFFER POOL Components of a DBMS DBMS: a set of cooperating software modules
Concurrency Control & Recovery • Very valuable properties of DBMSs • without these, DBMSs would be much less useful • Based on concept of transactions with ACID properties (yep…they’re baa-aack…)
Statement of Problem • Concurrent execution of independent transactions • utilization/throughput (“hide” waiting for I/Os.) • response time • fairness • Example: T2 wins, but if there was a slight delay, maybe T1 would win -> DBMS wants to ensure a predictable outcome for concurrent users t0: t1: t2: t3: t4: t5: T1: tmp1 := read(X) tmp1 := tmp1 – 20 write tmp1 into X T2: tmp2 := read(X) tmp2 := tmp2 + 10 write tmp2 into X
Statement of problem (cont.) • Arbitrary interleaving can lead to • Temporary inconsistency (ok, unavoidable) • “Permanent” inconsistency (bad!) • Need formal correctness criteria.
Definitions • A program may carry out many operations on the data retrieved from the database • However, the DBMS is only concerned about what data is read/written from/to the database. • transaction- a sequence of read and write operations (read(A), write(B), …) • DBMS’s abstract view of a user program
Correctness criteria: The ACID properties • Atomicity: All actions in the Xact happen, or none happen. • Consistency: If each Xact is consistent, and the DB starts consistent, it ends up consistent. • Isolation: Execution of one Xact is isolated from that of other Xacts. • D urability: If a Xact commits, its effects persist.
Atomicity of Transactions A • Two possible outcomes of executing a transaction: • Xact mightcommitafter completing all its actions • or it could abort(or be aborted by the DBMS) after executing some actions. • DBMS guarantees that Xacts are atomic. • From user’s point of view: Xact always either executes all its actions, or executes no actions at all.
Mechanisms for Ensuring Atomicity A • Main approach: LOGGING • DBMS logs all actions so that it can undothe actions of aborted transactions. • Logging used by modern systems, because of need for audit trail and for efficiency reasons.
transaction T Transaction Consistency C • “Consistency” - data in DBMS is accurate in modeling real world and follows integrity constraints • User must ensure transaction consistent by itself • I.e., if DBMS consistent before Xact, it will be after also • Key point: consistent database S1 consistent database S2
Transaction Consistency (cont.) C • Recall: Integrity constraints • must be true for DB to be considered consistent • Examples: 1.FOREIGN KEY R.sid REFERENCES S 2.ACCT-BAL >= 0 • System checks ICs and if they fail, the transaction rolls back (i.e., is aborted). • Beyond this, DBMS does not understand the semantics of the data. • e.g., it does not understand how interest on a bank account is computed
Isolation of Transactions I • Users submit transactions, and • Each transaction executes as if it was running by itself. • Concurrency is achieved by DBMS, which interleaves actions (reads/writes of DB objects) of various transactions. • Many techniques have been developed. Fall into two basic categories: • Pessimistic – don’t let problems arise in the first place • Optimistic – assume conflicts are rare, deal with them after they happen.
Example I • Consider two transactions (Xacts): T1: BEGIN A=A+100, B=B-100 END T2: BEGIN A=1.06*A, B=1.06*B END • 1st xact transfers $100 from B’s account to A’s • 2nd credits both accounts with 6% interest. • Assume at first A and B each have $1000. What are the legal outcomes of running T1 and T2??? • $2000 *1.06 = $2120 • There is no guarantee that T1 will execute before T2 or vice-versa, if both are submitted together. But, the net effect must be equivalent to these two transactions running serially in some order.
Example (Contd.) I • Legal outcomes: A=1166,B=954 or A=1160,B=960 • Consider a possible interleaved schedule: T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • This is OK (same as T1;T2). But what about: T1: A=A+100, B=B-100 T2: A=1.06*A, B=1.06*B • Result: A=1166, B=960; A+B = 2126, bank loses $6 • The DBMS’s view of the second schedule: T1: R(A), W(A), R(B), W(B) T2: R(A), W(A), R(B), W(B)
Formal Properties of Schedules I • Serial schedule: Schedule that does not interleave the actions of different transactions. • Equivalent schedules:For any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule. • Serializable schedule: A schedule that is equivalent to some serial execution of the transactions. (Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )