160 likes | 249 Views
ICS 421 Spring 2010 Query Evaluation (ii). Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa. What do these queries have in common ?. SELECT S.sname FROM Sailors S WHERE S.rating >5 ORDER BY S.age. SELECT DISTINCT S.sname FROM Sailors S.
E N D
ICS 421 Spring 2010Query Evaluation (ii) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa Lipyeow Lim -- University of Hawaii at Manoa
What do these queries have in common ? SELECTS.sname FROMSailors S WHERES.rating>5 ORDER BYS.age SELECTDISTINCTS.sname FROMSailors S SELECTS.age, AVG(S.rating) FROMSailors S GROUP BY S.age Lipyeow Lim -- University of Hawaii at Manoa
The Sort Operator • Sorting is a classic problem in computer science! • Data requested in sorted order • e.g., find students in increasing gpa order • Sorting is first step in bulk loading B+ tree index. • Sorting useful for eliminating duplicate copies in a collection of records (Why?) • Sort-merge join algorithm involves sorting. • Problem: sort 100Gb of data with 1Gb of RAM. • why not virtual memory? Lipyeow Lim -- University of Hawaii at Manoa
Two-Way External Merge Sort • Pass 0: • Read a page, sort it in memory, write it to disk • Only one buffer page is needed • Pass 1, 2, 3, 4 …: • Read two (sorted) pages, merge them to fill output page, flush output page when full. • 2 input pages and 1 output page are needed INPUT 1 OUTPUT INPUT 2 Main memory buffers Disk Disk Lipyeow Lim -- University of Hawaii at Manoa
Two-Way Merge Sort: Example 6,2 2 Input file 3,4 9,4 8,7 5,6 3,1 PASS 0 1,3 2 1-page runs 3,4 2,6 4,9 7,8 5,6 PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9 Lipyeow Lim -- University of Hawaii at Manoa
Two-Way Merge Sort: Analysis 6,2 2 Input file 3,4 9,4 8,7 5,6 3,1 • Input file has N pages • Each pass reads N pages and writes N pages. • The number of passes = log2 N + 1 • So total cost is = 2N(log2 N + 1) • Idea:Divide and conquer: sort subfiles and merge PASS 0 1,3 2 1-page runs 3,4 2,6 4,9 7,8 5,6 PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9 Lipyeow Lim -- University of Hawaii at Manoa
K-Way External Merge Sort • What if we have more memory ? • Sort a file with N pages using B buffer pages: • Pass 0: • read in B pages, sort all B pages in memory, write to disk as 1 run, repeat until all N pages are sorted -- outputs N/B sorted runs • Pass 1,2,…: • Use B-1 buffer pages as input and perform (B-1)-way merge to fill 1 output buffer page. INPUT 1 . . . . . . INPUT 2 . . . OUTPUT INPUT B-1 Disk Disk B Main memory buffers Lipyeow Lim -- University of Hawaii at Manoa
K-Way Merge Sort: Analysis • B=5 buffer pages, N=108 pages • Pass 0: 108/5 =22 sorted runs of 5 pages each • Pass 1: 22/4 = 6 sorted runs of 20 pages each • Pass 2: 6/4 = 2 sorted runs of 80 & 28 pages • Pass 3: 1 sorted file of 108 pages • Number of passes = logB-1N/B+ 1 • Each pass still reads N pages and writes N pages • Total number of I/O’s = 2N * (logB-1N/B+1) Lipyeow Lim -- University of Hawaii at Manoa
Selection Operator • Index vs Table Scan • Multiple Indexes • Eg. Use index(age) & index(rating) for “age>20 AND rating>9” • Intersect RID sets using bloom filters • Eg. Use index(age) & index(rating) for “age>20 OR rating>9” • Union RID sets Lipyeow Lim -- University of Hawaii at Manoa
Projection Operator • Two steps: • Remove unwanted columns • Eliminate duplicates • How do we do step 2 ? • External merge sort • Scan sorted data to remove duplicates • Optimization: combine the 2 steps into merge sort: • Remove unwanted columns in Pass 0. • Subsequent passes can remove duplicates whenever they are encountered. SELECTDISTINCTS.sname FROMSailors S Lipyeow Lim -- University of Hawaii at Manoa
Hash Join R equijoin S on sid • Partition R into k partitions using hash function h1(R.sid) • Partition S into k partitions using hash function h1(S.sid) • Foreach partition i • Build inmemory hash table H(R[i]) for R[i] using h2(R.sid) • Foreach row in S[i] • Probe H(R[i]) • Output join tuples <r,s> • Works only on equi-joins • Total I/Os = 2*NPages(R) + 2*NPages(S) + NPages(R) + NPages(S) = 3 * [Npages(R) + Npages(S)] • Can be applied in a distributed DBMS with hash partitions on the join attribute! Lipyeow Lim -- University of Hawaii at Manoa
Example SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname R.sid=S.sid On the fly • Nested Loop Join cost 1K+ 100K*500 • On the fly selection and project does not incur any disk access. • Total disk access = 500001K (worst case) S.rating>5 AND R.bid=100 On the fly Nested Loop Join Reserves Sailors (SCAN) (SCAN) Lipyeow Lim -- University of Hawaii at Manoa
Example: Predicate Pushdown SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • Nested Loop Join requires materializing the inner table as T1. • With 50% selectivity, T1 has 250 pages • With 10% selectivity, outer “table” in join has 10K tuples • Disk accesses for scans = 1000 + 500 • Writing T1 = 250 • NLJoin = 10K * 250 • Total disk access = 2500.175 K (worst case) Nested Loop Join Temp T1 R.bid=100 S.rating>5 (SCAN) (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa
Example: Sort Merge Join SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • Sort Merge Join requires materializing both legs for sorting. • With 50% selectivity, T1 has 100 pages • With 10% selectivity, T2 has 250 pages • Disk accesses for scans = 1000 + 500 • Writing T1 & T2 = 100 + 250 • Sort Merge Join = 100 log 100 + 250 log 250 + 100+250 (assume 10 way merge sort) • Total disk access = 52.8 K Sort Merge Join Temp T1 Temp T2 R.bid=100 S.rating>5 (SCAN) (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa
Example: Index Nested Loop Join SELECTS.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 ANDS.rating>5 πS.sname 10% 50% R.sid=S.sid On the fly • With 10% selectivity, selection on R has 10K tuples • Disk accesses for scan = 1000 • Index Nested Loop Join = 10K*( 1 + log10 500) = 37K • Total disk access = 38 K S.rating>5 On the fly Index nested Loop Using Index on S.sid R.bid=100 (SCAN) Reserves Sailors What happens if we make the left leg the inner table of the join ? Lipyeow Lim -- University of Hawaii at Manoa
Join Ordering 500 10K 30K 10K • Independent of what join algorithm is chosen, the order in which joins are perform affects the performance. • Rule of thumb: do the most “selective” join first • In practice, left deep trees (eg. the right one above) are preferred --- why ? A C 20K 30K 10K 20K C B B A Lipyeow Lim -- University of Hawaii at Manoa