580 likes | 687 Views
CSG131 Summary. Donghui Zhang. What we learned. Storage (Chp 8 & 9) B+-tree (Chp 10) Hash index (Chp 11) External sort (Chp 13) Query evaluation Concurrency control (Chp 17) Recovery (Chp 18) Project: simple DBMS implentation, NEUStore SB-tree, R-tree. Tracks. Arm movement.
E N D
CSG131 Summary Donghui Zhang
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree.
Tracks Arm movement Arm assembly Components of a Disk Spindle Disk head • The platters spin (say, 90rps). • The arm assembly is moved in or out to position a head on a desired track. Tracks under heads make a cylinder(imaginary!). Sector Platters • Only one head reads/writes at any one time. • Block size is a multiple of sector size (which is fixed).
Accessing a Disk Page • Time to access (read/write) a disk block: • seek time (moving arms to position disk head on track) • rotational delay (waiting for block to rotate under head) • transfer time (actually moving data to/from disk surface) • Seek time and rotational delay dominate. • Seek time varies from about 1 to 20msec • Rotational delay varies from 0 to 10msec • Transfer rate is about 1msec per 4KB page • Key to lower I/O cost: reduce seek/rotation delays! Hardware vs. software solutions?
Index Types • A primary index is an index which controls the actual storage of a table. Typically this index is built using the primary key of the table. • A data entry is one record of the table. • A secondary index is an index which is built using some other attribute(s). • A data entry contains a set of RIDs.
secondary index 21: {10, 20} 33: {10, 20, 30} 55: {10, 30} An Example primary index 3325 6632 10 30 20
DB Buffer Management in a DBMS Page Requests from Higher Levels • Data must be in RAM for DBMS to operate on it! • For each page, dirty bit, pin count. BUFFER POOL disk page free frame MAIN MEMORY DISK choice of frame dictated by replacement policy
Page Formats: Variable Length Records Rid = (i,N) Page i • Can move records on page without changing rid; so, attractive for fixed-length records too. Rid = (i,2) Rid = (i,1) N Pointer to start of free space 20 16 24 N . . . 2 1 # slots SLOT DIRECTORY
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree
Example B+ Tree Root 17 24 5 13 30 39* 2* 3* 5* 7* 8* 19* 20* 22* 24* 27* 38* 29* 33* 34* 14* 16*
B+ Trees in Practice • Typical order: 100. Typical fill-factor: 67%. • average fanout = 133 • Can often hold top levels in buffer pool: • Level 1 = 1 page = 8 KB • Level 2 = 133 pages = 1 MB • Level 3 = 17,689 pages = 145 MB • Level 4 = 2,352,637 pages = 19 GB • With 1 MB buffer, can locate one record in 19 GB (or 0.3 billion records) in two I/Os!
B+-tree index • Structure • Search • Insert • Delete • Bulk-loading a B+-tree • Aggregation Query • SB-tree
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree
LOCAL DEPTH 2 Extendible Hashing Bucket A 16* 4* 12* 32* GLOBAL DEPTH 2 2 Bucket B 00 5* 1* 21* 13* 01 • Directory is array of size 4. • To find bucket for r, take last `global depth’ # bits of h(r); we denote r by h(r). • If h(r) = 5 = binary 101, it is in bucket pointed to by 01. 2 10 Bucket C 10* 11 2 DIRECTORY Bucket D 15* 7* 19* DATA PAGES • Insert: If bucket is full, splitit (allocate new page, re-distribute). • If necessary, double the directory. (As we will see, splitting a • bucket does not always require doubling; we can tell by • comparing global depth with local depth for the split bucket.)
Insert h(r)=20 (Causes Doubling) 2 LOCAL DEPTH 3 LOCAL DEPTH Bucket A 16* 32* 32* 16* GLOBAL DEPTH Bucket A GLOBAL DEPTH 2 2 2 3 Bucket B 5* 21* 13* 1* 00 1* 5* 21* 13* 000 Bucket B 01 001 2 10 2 010 Bucket C 10* 11 10* Bucket C 011 100 2 2 DIRECTORY 101 Bucket D 15* 7* 19* 15* 7* 19* Bucket D 110 111 2 3 Bucket A2 4* 12* 20* DIRECTORY 12* 20* Bucket A2 4* (`split image' of Bucket A) (`split image' of Bucket A)
If insert 22 …… Example of Linear Hashing Let hLevel=(a*key+b) % 4 hLevel+1=(a*key+b) % 8 • On split, hLevel+1 is used to re-distribute entries. Level=0, N=4 Level=0 PRIMARY h h OVERFLOW h h PRIMARY PAGES 0 0 1 1 PAGES PAGES Next=0 32* 32* 44* 36* 000 00 000 00 Next=1 Data entry r 9* 5* 9* 5* 25* 25* with h(r)=5 001 001 01 01 30* 30* 10* 10* 14* 18* 14* 18* Primary 10 10 010 010 bucket page 31* 35* 7* 31* 35* 7* 11* 11* 43* 011 011 11 11 (This info is for illustration only!) (The actual contents of the linear hashed file) 100 44* 36* 00
Example of Linear Hashing Let hLevel=(a*key+b) % 4 hLevel+1=(a*key+b) % 8 Level=0 h OVERFLOW h PRIMARY 0 1 PAGES PAGES Important notice: after buckets 010 and 011 are split, Next moves back To 000! 32* 000 00 9* 25* 001 01 Next=2 30* 10* 14* 18* 22* 10 010 31* 35* 7* 11* 43* 011 11 100 44* 36* 00 101 5* 01
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree
General External Merge Sort • More than 3 buffer pages. How can we utilize them? • To sort a file with N pages using B buffer pages: • Pass 0: use B buffer pages. Produce sorted runs of B pages each. • Pass 2, …, etc.: merge B-1 runs. INPUT 1 . . . . . . INPUT 2 . . . OUTPUT INPUT B-1 Disk Disk B Main memory buffers
Cost of External Merge Sort • Number of passes: • Cost = 2N * (# of passes) • E.g., with 5 buffer pages, to sort 108 page file: • Pass 0: = 22 sorted runs of 5 pages each (last run is only 3 pages) • Pass 1: = 6 sorted runs of 20 pages each (last run is only 8 pages) • Pass 2: 2 sorted runs, 80 pages and 28 pages • Pass 3: Sorted file of 108 pages • Note: for merge join, no need to write result.
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree
query compiling query execution query optimization answer SQL query parse parse tree convert logical query plan execute apply laws statistics Pi “improved” l.q.p pick best estimate result sizes {(P1,C1),(P2,C2)...} l.q.p. +sizes estimate costs consider physical plans {P1,P2,…..}
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Compiling Convert • Input: a parse tree. • Output: a logical query plan. • Algorithm: followed by . • E.Name(E.SSN<5000 AND E.Age>50(E) ) • Alternatively, a l.q.p tree.
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Consider Physical Plans • Associate each RA operator with an implementation scheme. • Multiple implementation schemes? Enumerate all. Plan 1 (always work!) on-the-fly scan
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Consider Physical Plans Plan 2 on-the-fly range search in SSN index
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Consider Physical Plans Plan 3 on-the-fly range search in Age index, follow pointers to SSN index
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Query Optimization Estimate Costs sample input • Assume for table E: • Schema = (SSN: int, Name: string, Age: int, Salary: int) • T(E) = 100 tuples. • For attribute SSN: • V(E, SSN)=100, min(E, SSN)=0000, max(E, SSN)=9999 • For attribute Age: • V(E, Age)=20, min(E, Age)=21, max(E, Age)=60 • Primary index on SSN: 3 level B+-tree, 50 leaf nodes. • Secondary index on Age: 2 level B+-tree, 10 leaf nodes, every leaf entry points to 3.5 pageIDs (on average). • Assumptions: all B+-tree roots are pinned. Can reach the first leaf page of a B+-tree directly. • Memory buffer size: 2 pages.
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Estimate Costs • Cost = 50. (The primary index has 50 leaf nodes. Assume we can reach the first leaf page of a B+-tree directly.) Plan 1 (always work!) on-the-fly scan
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Estimate Costs • Cost = 25. SSN<5000 selects half of the employees, so 50/2=25 leaf nodes. • Note: if condition is E.SSN>5000, needs 1 more I/O. Plan 2 on-the-fly range search in SSN index
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 #I/Os in the SSN index #I/Os in the Age index E.Name E.SSN<5000 AND E.Age>50 Emp E Query Optimization Estimate Costs • Cost = 10/4 + 20/4 * 3.5 = 21. Plan 3 on-the-fly range search in Age index, follow pointers to SSN index
SELECT E.Name FROM Emp E WHERE E.SSN<5000 AND E.Age>50 Pick! Query Optimization Pick Best
SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 E.Name, D.Dname E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Emp E Dept D Query Compiling Convert • Algorithm: then then . • E.Name. D.Dname(E.Did=D.Did AND E.SSN<5000 AND D.budget=1000(ED) ) • The l.q.p tree:
SELECT E.Name, D.Dname FROM Emp E, Dept D WHERE E.Did=D.Did AND E.SSN<5000 AND D.budget=1000 Query Compiling Apply Laws • Always always: (try to) replace with ! • Also, push down. E.Name, D.Dname E.SSN<5000 D.budget=1000 Emp E Dept D
Four Join Algorithms • Iteration join (nested loop join) • Merge join • Hash join • Join with index
Example E D over common attribute Did • E: • T(E)=10,000 • primary index on SSN, 3 levels. • |E|= 1,000 leaf nodes. • D: • T(D)=5,000 • primary index on Did. 3 levels. • |D| = 500 leaf nodes. • Memory available = 101 blocks
Iteration Join Use our memory (1) Read 100 blocks of D (2) Read all of E (using 1 block) + join (3) Repeat until done • I/O cost = |D| + |D|/100 * |E| = • 500 + 5*1000 = 5,500.
Merge Join Cost • Recall that |E|=1000, |D|=500. And |D| is already sorted on Did. • External sort E: pass 0, by reading and writing E, produces a file with 10 sorted runs. Another read is enough. • No need to write! Can pipeline to join operator. • Cost = 3*1000 + 500 = 3,500.
Simple example hash: even/odd R1 R2 Buckets 2 5 Even 4 4 R1 R2 3 12 Odd: 5 3 8 13 9 8 11 14 2 4 8 4 12 8 14 3 5 9 5 3 13 11
Hash Join Cost • Read + write both E and D for partitioning, then read to join. • Cost = 3 * (1000 + 500) = 4,500.
Join with index (Conceptually) For each r E do Find the corresponding D tuple by probing index. • Assuming the root is pinned in memory, Cost = |E| + T(E)*2 = 1000 + 10,000*2 = 21,000.
What we learned • Storage (Chp 8 & 9) • B+-tree (Chp 10) • Hash index (Chp 11) • External sort (Chp 13) • Query evaluation • Concurrency control (Chp 17) • Recovery (Chp 18) • Project: simple DBMS implentation, NEUStore • SB-tree, R-tree
ACID Properties of Transactions • Atomicity: all actions are carried out, or none. • Consistency: each transaction preserves the consistency of the database if executed by itself. (The users make sure of this. E.g. transfer money…) • Isolation: transactions are isolated from the effect of concurrently scheduling other transactions. • Durability: the effect of a committed transaction should last, even if system crash before all changed are flushed to disk.
Scheduling Transactions • Serial schedule: Schedule that does not interleave the actions of different transactions. • Equivalent schedules:For any database state, the effect (on the set of objects in the database) of executing the first schedule is identical to the effect of executing the second schedule. • Serializable schedule: A schedule that is equivalent to some serial execution of the transactions. (Note: If each transaction preserves consistency, every serializable schedule preserves consistency. )
Example • Conflict serializable serializable. • Serializable conflict serializable T1: R(A) W(A), Commit T2: W(A), Commit T3: W(A), Commit T1 T2 • precedence graph: • a node for each transaction • an arc from Ti to Tj if an • action in Ti precedes and • conflicts with an action in • Tj. T3
Deadlock Prevention • Assign priorities based on timestamps. Assume Ti wants a lock that Tj holds. Two policies are possible: • Wait-Die: It Ti has higher priority, Ti waits for Tj; otherwise Ti aborts • Wound-wait: If Ti has higher priority, Tj aborts; otherwise Ti waits • If a transaction re-starts, make sure it has its original timestamp
Database Tables Pages Tuples Multiple-Granularity Locks • Why? If a transaction needs to scan all records in a table, do we really want to have a lock on all tuples individually? Significant locking overhead! • Put a single lock on the table! A lock on a node implicitly locks all decendents. contains
Allowed Sharings IS IX S X -- Ö Ö Ö Ö Ö -- IS Ö Ö Ö Ö IX Ö Ö Ö Ö S Ö Ö Ö X
Tree Locking Algorithm • Search: Start at root and go down; repeatedly, S lock child then unlock parent. • Insert/Delete: Start at root and go down, obtaining X locks as needed. Once child is locked, check if it is safe: • If child is safe, release all locks on ancestors. • Safe node: Node such that changes will not propagate up beyond this node. • Inserts: Node is not full. • Deletes: Node is not half-empty.
ROOT Do: 1) Search 38* 2) Delete 38* 3) Insert 45* 4) Insert 25* A Example 20 B 35 F C 23 38 44 H G I D E 20* 22* 23* 24* 35* 36* 38* 41* 44*
Kung-Robinson Model • Xacts have three phases: • READ: Xacts read from the database, but make changes to private copies of objects. • VALIDATE: Check for conflicts. • WRITE: Make local copies of changes public. old modified objects ROOT new