460 likes | 480 Views
This paper introduces a novel top-k query processing algorithm that optimizes access to index lists over multiple attributes. The algorithm combines scores using a monotonic aggregation function to find the top-k items with the highest total scores. It employs a sorted access and random access strategy to efficiently retrieve results. The proposed algorithm significantly reduces sorted accesses and carefully schedules random accesses, improving performance. The study compares the algorithm with existing NRA and CA methods, highlighting its effectiveness in reducing runtime overhead.
E N D
IO-Top-k:Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum VLDB 2006, Seoul, Korea
Setup Pre-computed index-lists over multiple attributes combine scores by some monotonic aggregation function: . res + .zoom - .price lists are accessible by both sorted and random accesses Goal: find the top-k items with highest total scores single numeric score for every item for each attribute
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: lists sorted by score
current score best-score List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 1 read one item from every list Candidates min top-2 score: 0.6 maximum score for unseen items: 2.1 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 2 read one item from every list Candidates min top-2 score: 0.9 maximum score for unseen items: 1.8 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 3 read one item from every list Candidates min top-2 score: 1.3 maximum score for unseen items: 1.3 lists sorted by score min-top-2 < best-score of candidates no more new items can get into top-2 but, extra candidates left in queue
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 4 read one item from every list Candidates min top-2 score: 1.3 maximum score for unseen items: 1.1 lists sorted by score min-top-2 < best-score of candidates no more new items can get into top-2 but, extra candidates left in queue
List 2 List 1 List 3 Top-k algorithms: example Fagin’s NRA Algorithm: round 5 read one item from every list Candidates min top-2 score: 1.6 maximum score for unseen items: 0.8 lists sorted by score no extra candidate in queue Done!
Top-k algorithms • NRA performs only sorted accesses (SA) (No Random Access) • Random access (RA) • lookup actual (final) score of an item • costlier than SA (100 – 100,000 times), cR/cS:= (cost of RA)/(cost of SA) • often very useful • CA (Combined Algorithm), (Fagin et al., 2001) • one RA after every cR/cS SAs • total cost of SA ~ total cost of RA • Measure of effectiveness (access cost): #SA + cR/cS x #RA • Full-merge: compute scores for all items followed by partial sort • simple and efficient • important baseline for any top-k algorithm • Problems with NRA, CA • high bookkeeping overhead: cannot beat full-merge in runtime • for “high” values of k, gain in even access cost not significant
Top-k algorithms • Greedy heuristics for sorted access scheduling, based on crude estimate of scores (Guntzer, Balke, Kiessling, ITCC 2001) • RankSQL: ordering of binary rank joins at query planning time(Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05) • Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different) • MPro(Chang and Hwang, SIGMOD 2002) • Upper, Pick(Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04) • Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05) • Main related previous works: NRA, CA
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 1: same as NRA not necessarily round robin Candidates min top-2 score: 0.6 maximum score for unseen items: 2.1 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 2 not necessarily round robin Candidates min top-2 score: 0.9 maximum score for unseen items: 1.4 lists sorted by score min-top-2 < best-score of candidates
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 3 not necessarily round robin Candidates min top-2 score: 1.3 maximum score for unseen items: 1.1 lists sorted by score min-top-2 < best-score of candidates potential candidate for top-2
List 2 List 1 List 3 Our algorithm: IO-Top-k Round 4: random access for item 83 not necessarily round robin Candidates min top-2 score: 1.6 maximum score for unseen items: 1.1 lists sorted by score no extra candidate in queue random access for item 83 Done! fewer sorted accesses carefully scheduled random access
Outline • Our contributions • Inverted block-index data structure • Sorted access scheduling • Random access scheduling • Lower bound • Experiments • Conclusion
Inverted block-index Lists are first sorted by score
sort each block by item-id Inverted block-index Lists are first sorted by score Top-k algorithm with block-index split into blocks full-merge 1 1 1 blocks are sorted by item ids, efficiently merged by full-merge! 2 2 2 full merge and so on… pruned 3 3 3 • Choose block size balancing disk seek time and data transfer rate • Low overhead: prune once every round
Sorted access scheduling General Paradigm Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index
Sorted access scheduling General Paradigm • We assign benefits to every block of each list • Optimization problem • Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized • We can show: this problem isNP-Hard, the well known Knapsack problem reduces to it • But, the number of blocks to choose and number of lists to choose from are small: we can solve itby enumerating all possibilities • We choose the schedule with maximum benefit, and continue to next round Inverted Block-Index scans to different depths in lists
31 32 Sorted access scheduling Knapsack for Score Reduction (KSR) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) Inverted Block-Index
31 32 Sorted access scheduling Knapsack for Score Reduction (KSR) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) • Candidate item d is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change: no benefit • In list 2, d is not yet seen. If we scan one block (block22) from list 2 • with high probability d will not be not found in that block: best-score bd of d decreases by 22 • Benefit of block B in list i dB (1 - Pr[d found in B]) ~ dB sum taken over all candidates d not yet seen in list i Inverted Block-Index item d [sd,bd] scanned till some depth
Random access scheduling Redundant random accesses of CA • CA: one RA after every cR/cS SAs • Many RAs turn out to be redundant • Our strategy: two-phase algorithm • First sorted access rounds only, then switch to random access: no redundant random access • Switch from SA to RA, when • max-score for unseen ≤ min-top-k score • estimated RA-cost ≤ total SA-cost so far • cost of SA ~ cost of RA CA: RA for item d But d is found anyway in subsequent SA round need to estimate cost of RA
current min-top-3 score Random access scheduling Estimating number of random accesses A crude upper estimate: #of items in queue random access candidate items sorted by best score: CA style best-scores current scores pruned Each random access can prune some candidates, so better estimate of #RAs necessary lists scanned till some depths by sorted access
current min-top-3 score Random access scheduling Estimating number of random accesses candidate items sorted by best score: CA style random accesses item d [sd,bd] bd d is pruned If there are at least three items before d with final score > bd, d will be pruned before random access
current min-top-3 score Random access scheduling Estimating number of random accesses candidate items sorted by best score: CA style random accesses item d [sd,bd] bd next: RA for d If there are less than three items before d with final score > bd, a random access for d must be made
j-1 items Random access scheduling Estimating number of random accesses • Let d be the j-th item dj by best-score ordering • For all i < j, define random variables Fi,j Fi,j = 1 if final-score(di) > the best-score(d), 0 otherwise • We compute Pr[Fi,j = 1] using histogram of the score distributions of the lists • Observation: Pr[RA is made for d] = Pr[F1,j+ + Fj-1,j < k] • Expected #of random accesses j Pr[F1,j+ + Fj-1,j < k] the sum is taken over all candidate items candidate items sorted by best score item d [sd,bd] bd current min-top-k score For General k: There will be random access for d if and only if #of items before d with final score > bd is less than k
Experiments: estimate of RA TREC Terabyte data, TREC 2005 adhoc task queries After all sorted accesses #items in queue, #RA estimated and #RA actually done queue size Total RA for 50 queries queue size EST DONE EST DONE
Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000 Schedule 2 9 x 10000 + 1000 x 12 = 102,000
block size 10,000 Lower bound: what is the best possible? Try every possible SA-schedule Count essential number of RAs that must be done #SA CR/CS x #RA = Total cost Schedule 1 6 x 10000 + 1000 x 75 = 135,000 Schedule 2 9 x 10000 + 1000 x 12 = 102,000 Schedule 3 12 x 10000 + 1000 x 3 = 123,000 … … … … … … … … Lower bound … … 102,000 carefully engineered dynamic programming to try out all schedules
Experiments: TREC TREC Terabyte benchmark collection • over 25 million documents, 426 GB raw data • 50 queries from TREC 2005 adhoc task CA 250 4,000,000 NRA CA full merge full merge NRA average running time (milliseconds) average cost (#SA + 1000 x #RA) IO-Top-k (OUR) 100 IO-Top-k (OUR) lower bound 0 0 10 50 100 200 500 10 50 100 200 500 k k
full merge NRA CA IO-Top-k (OUR) lower bound Experiments: HTTP logs FIFA World Cup HTTP logs • World cup 1998 • 1.3 billion HTTP requests • schema Log( interval, user-id, bytes ) • aggregated for each user within one-day intervals • typical query: find k users with most usage during June 1-10
full merge CA NRA IO-Top-k (OUR) lower bound Experiments: IMDB IMDB movie data • more than 375,000 movies, 1,200,000 persons • attributes: Title, Genre, Actors, Description • 20 human generated queries
Conclusion We presented • An inverted block-index data structure • efficient: optimizes disk access • performs fast merge in blocks, minimizes overhead • Integrated sorted access and random access scheduling • SA scheduling: maximizes benefit of scanning blocks • RA scheduling: effectively estimate RA-cost at every round • postpone RA till the end of all SA: save redundant RAs • Lower Bound • shows that our algorithm is close to the best possible
31 32 Sorted access scheduling Knapsack for Benefit Aggregation (KBA) • Pre-compute expected score eij of an item seen in block j of list i : (average score of the block) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) Inverted Block-Index
31 32 Sorted access scheduling Knapsack for Benefit Aggregation (KBA) • Pre-compute expected score eij of an item seen in block j of list i : (average score of the block) • Pre-compute score reduction ij of every block of each list : (max-score of the block – min-score of the block) • Candidate item d is already seen in list 3. If we scan list 3 further, score sd and best-score bd of d do not change • In list 2, d is not yet seen. If we scan one block from list 2 • either d is found in that block: score sd of d increases, expected increase = e22 • or d is not found in that block: best-score bd of d decreases by 22 • Benefit of block B in list i d eB Pr[d found in B] + B (1 - Pr[d found in B]) The sum is taken over all candidates d not yet seen in list i Inverted Block-Index item d [sd,bd]
j-1 items Random access scheduling: details Estimating number of random accesses • Let d be the j-th items by best-score ordering • For all i < j, Define random variables Fi,j which takes value 1 if final score of the i-th item is greater than the best-score of d, 0 otherwise • Compute Pr[Fi,j = 1] using the expected score gain of the i-th item from lists where it is not yet seen • Also define a random variable Rj which takes value 1 if a random access is made for d, 0 otherwise • Observation: Pr[Rj = 1] = Pr[F1,j+ + Fj-1,j < k] • Let Xj := F1,j + + Fj-1,j • Assume Fi,js are independent, then Xj follows Poisson distribution with mean i Pr[Fi,j = 1] • We can compute Pr[Xj < k] using the incomplete gamma function • Expected number of random accesses is j E(Rj) = j Pr[Rj = 1] = j Pr[Xj < k] the sum is taken over all candidate items candidate items sorted by best score item d [sd,bd] bd current min-top-k score There will be random access for d if and only if #of items before d with final score > bd is less than k
Other Experiments TREC Terabyte collection indexed with BM25 scores varying query size title fields: average size 3 description fields: average size 8 For different values of cost of RA compared to cost of SA CR/S ratio: 100, 1000 and 10000 3,000,000 20,000,000 cost cost 0 0 query size: 3 query size: 8 (cost of RA)/(cost of SA)