460 likes | 592 Views
IO-Top-k: Index-access Optimized Top-k Query Processing. Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum VLDB 2006, Seoul, Korea. Setup.
E N D
IO-Top-k:Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum VLDB 2006, Seoul, Korea
Setup Pre-computed index-lists over multiple attributes combine scores by some monotonic aggregation function: . res + .zoom - .price lists are accessible by both sorted and random accesses Goal: find the top-k items with highest total scores single numeric score for every item for each attribute
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: lists sorted by score
current score best-score List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 1 read one item from every list Candidates min top-2 score: 0.6 maximum score for unseen items: 2.1 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 2 read one item from every list Candidates min top-2 score: 0.9 maximum score for unseen items: 1.8 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 3 read one item from every list Candidates min top-2 score: 1.3 maximum score for unseen items: 1.3 lists sorted by score min-top-2 < best-score of candidates no more new items can get into top-2 but, extra candidates left in queue
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 4 read one item from every list Candidates min top-2 score: 1.3 maximum score for unseen items: 1.1 lists sorted by score min-top-2 < best-score of candidates no more new items can get into top-2 but, extra candidates left in queue
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 5 read one item from every list Candidates min top-2 score: 1.6 maximum score for unseen items: 0.8 lists sorted by score no extra candidate in queue Done!
Top-k algorithms • NRA performs only sorted accesses (SA) (No Random Access) • Random access (RA) • lookup actual (final) score of an item • costlier than SA (100 – 100,000 times), cR/cS:= (cost of RA)/(cost of SA) • often very useful • CA (Combined Algorithm), (Fagin et al., 2001) • one RA after every cR/cS SAs • total cost of SA ~ total cost of RA • Measure of effectiveness (access cost): #SA + cR/cS x #RA • Full-merge: compute scores for all items followed by partial sort • simple and efficient • important baseline for any top-k algorithm • Problems with NRA, CA • high bookkeeping overhead: cannot beat full-merge in runtime • for “high” values of k, gain in even access cost not significant
Top-k algorithms • Greedy heuristics for sorted access scheduling, based on crude estimate of scores (Guntzer, Balke, Kiessling, ITCC 2001) • RankSQL: ordering of binary rank joins at query planning time(Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05) • Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different) • MPro(Chang and Hwang, SIGMOD 2002) • Upper, Pick(Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04) • Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05) • Main related previous works: NRA, CA
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 1: same as NRA not necessarily round robin Candidates min top-2 score: 0.6 maximum score for unseen items: 2.1 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 2 not necessarily round robin Candidates min top-2 score: 0.9 maximum score for unseen items: 1.4 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 3 not necessarily round robin Candidates min top-2 score: 1.3 maximum score for unseen items: 1.1 lists sorted by score min-top-2 < best-score of candidates potential candidate for top-2
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 4: random access for item 83 not necessarily round robin Candidates min top-2 score: 1.6 maximum score for unseen items: 1.1 lists sorted by score no extra candidate in queue random access for item 83 Done! fewer sorted accesses carefully scheduled random access
Outline • Our contributions • Inverted block-index data structure • Sorted access scheduling • Random access scheduling • Lower bound • Experiments • Conclusion
Inverted block-index Lists are first sorted by score
sort each block by item-id Inverted block-index Lists are first sorted by score Top-k algorithm with block-index split into blocks full-merge 1 1 1 blocks are sorted by item ids, efficiently merged by full-merge! 2 2 2 full merge and so on… pruned 3 3 3 • Choose block size balancing disk seek time and data transfer rate • Low overhead: prune once every round
Sorted access scheduling General Paradigm Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index scans to different depths in lists
31 32 Sorted access scheduling Knapsack for Score Reduction (KSR) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) Inverted Block-Index
31 32 Sorted access scheduling Knapsack for Score Reduction (KSR) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) • Candidate item d is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change: no benefit • In list 2, d is not yet seen. If we scan one block (block22) from list 2 • with high probability d will not be not found in that block: best-score bd of d decreases by 22 • Benefit of block B in list i dB (1 - Pr[d found in B]) ~ dB sum taken over all candidates d not yet seen in list i Inverted Block-Index item d [sd,bd] scanned till some depth
Random access scheduling Redundant random accesses of CA • CA: one RA after every cR/cS SAs • Many RAs turn out to be redundant • Our strategy: two-phase algorithm • First sorted access rounds only, then switch to random access: no redundant random access • Switch from SA to RA, when • max-score for unseen ≤ min-top-k score • estimated RA-cost ≤ total SA-cost so far • cost of SA ~ cost of RA CA: RA for item d But d is found anyway in subsequent SA round need to estimate cost of RA
current min-top-3 score Random access scheduling Estimating number of random accesses A crude upper estimate: #of items in queue random access candidate items sorted by best score: CA style best-scores current scores pruned Each random access can prune some candidates, so better estimate of #RAs necessary lists scanned till some depths by sorted access
current min-top-3 score Random access scheduling Estimating number of random accesses candidate items sorted by best score: CA style random accesses item d [sd,bd] bd d is pruned If there are at least three items before d with final score > bd, d will be pruned before random access
current min-top-3 score Random access scheduling Estimating number of random accesses candidate items sorted by best score: CA style random accesses item d [sd,bd] bd next: RA for d If there are less than three items before d with final score > bd, a random access for d must be made
j-1 items Random access scheduling Estimating number of random accesses • Let d be the j-th item dj by best-score ordering • For all i < j, define random variables Fi,j Fi,j = 1 if final-score(di) > the best-score(d), 0 otherwise • We compute Pr[Fi,j = 1] using histogram of the score distributions of the lists • Observation: Pr[RA is made for d] = Pr[F1,j+ + Fj-1,j < k] • Expected #of random accesses j Pr[F1,j+ + Fj-1,j < k] the sum is taken over all candidate items candidate items sorted by best score item d [sd,bd] bd current min-top-k score For General k: There will be random access for d if and only if #of items before d with final score > bd is less than k
Experiments: estimate of RA TREC Terabyte data, TREC 2005 adhoc task queries After all sorted accesses #items in queue, #RA estimated and #RA actually done queue size Total RA for 50 queries queue size EST DONE EST DONE
Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000 Schedule 2 9 x 10000 + 1000 x 12 = 102,000
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000 Schedule 2 9 x 10000 + 1000 x 12 = 102,000 Schedule 3 12 x 10000 + 1000 x 3 = 123,000 … … … … … … … … Lower bound … … 102,000 carefully engineered dynamic programming to try out all schedules
Experiments: TREC TREC Terabyte benchmark collection • over 25 million documents, 426 GB raw data • 50 queries from TREC 2005 adhoc task CA 250 4,000,000 NRA CA full merge full merge NRA average running time (milliseconds) average cost (#SA + 1000 x #RA) IO-Top-k (OUR) 100 IO-Top-k (OUR) lower bound 0 0 10 50 100 200 500 10 50 100 200 500 k k
full merge NRA CA IO-Top-k (OUR) lower bound Experiments: HTTP logs FIFA World Cup HTTP logs • World cup 1998 • 1.3 billion HTTP requests • schema Log( interval, user-id, bytes ) • aggregated for each user within one-day intervals • typical query: find k users with most usage during June 1-10
full merge CA NRA IO-Top-k (OUR) lower bound Experiments: IMDB IMDB movie data • more than 375,000 movies, 1,200,000 persons • attributes: Title, Genre, Actors, Description • 20 human generated queries
Conclusion We presented • An inverted block-index data structure • efficient: optimizes disk access • performs fast merge in blocks, minimizes overhead • Integrated sorted access and random access scheduling • SA scheduling: maximizes benefit of scanning blocks • RA scheduling: effectively estimate RA-cost at every round • postpone RA till the end of all SA: save redundant RAs • Lower Bound • shows that our algorithm is close to the best possible
31 32 Sorted access scheduling Knapsack for Benefit Aggregation (KBA) • Pre-compute expected score eij of an item seen in block j of list i : (average score of the block) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) Inverted Block-Index
31 32 Sorted access scheduling Knapsack for Benefit Aggregation (KBA) • Pre-compute expected score eij of an item seen in block j of list i : (average score of the block) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) • Candidate item d is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change • In list 2, d is not yet seen. If we scan one block from list 2 • either d is found in that block: score sd of d increases, expected increase = e22 • or d is not found in that block: best-score bd of d decreases by 22 • Benefit of block B in list i d eB Pr[d found in B] + B (1 - Pr[d found in B]) The sum is taken over all candidates d not yet seen in list i Inverted Block-Index item d [sd,bd]
j-1 items Random access scheduling: details Estimating number of random accesses • Let d be the j-th items by best-score ordering • For all i < j, Define random variables Fi,j which takes value 1 if final score of the i-th item is greater than the best-score of d, 0 otherwise • Compute Pr[Fi,j = 1] using the expected score gain of the i-th item from lists where it is not yet seen • Also define a random variable Rj which takes value 1 if a random access is made for d, 0 otherwise • Observation: Pr[Rj = 1] = Pr[F1,j+ + Fj-1,j < k] • Let Xj := F1,j + + Fj-1,j • Assume Fi,js are independent, then Xj follows Poisson distribution with mean i Pr[Fi,j = 1] • We can compute Pr[Xj < k] using the incomplete gamma function • Expected number of random accesses is j E(Rj) = j Pr[Rj = 1] = j Pr[Xj < k] the sum is taken over all candidate items candidate items sorted by best score item d [sd,bd] bd current min-top-k score There will be random access for d if and only if #of items before d with final score > bd is less than k
Other Experiments TREC Terabyte collection indexed with BM25 scores varying query size title fields: average size 3 description fields: average size 8 For different values of cost of RA compared to cost of SA CR/S ratio: 100, 1000 and 10000 3,000,000 20,000,000 cost cost 0 0 query size: 3 query size: 8 (cost of RA)/(cost of SA)