KLEE : A Framework for Distributed Top-k Query Algorithms. Sebastian Michel Peter Triantafillou Gerhard Weikum VLDB 2005 Presented by Amrita Tamrakar. Overview. Problem Statement KLEE The Histogram Bloom Structure Candidate Filtering Conclusion.
KLEE: A Framework for Distributed Top-k Query Algorithms
Sebastian Michel, Peter Triantafillou, Gerhard Weikum
VLDB 2005
Presented by Amrita Tamrakar
Overview
• Problem Statement
• KLEE
• The Histogram Bloom Structure
• Candidate Filtering
• Conclusion
Problem Statement: A query with t terms whose index lists are spread across m peers P1 ... Pm. Each peer Pj stores the inverted index list for one term. The top-k result is a list of (docID, TotalScore) pairs sorted by TotalScore, where the TotalScore of a docID is a monotonic aggregation of that document's scores in all m index lists.
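To make the definition concrete, here is a minimal sketch of the exact (centralized) top-k result, assuming summation as the monotonic aggregation. The function name `top_k` and the toy index lists are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def top_k(index_lists, k):
    """Return the k (docID, TotalScore) pairs with the highest TotalScore.

    index_lists: one list of (docID, score) pairs per peer.
    Aggregation is a plain sum (an assumption; any monotonic function fits
    the problem statement).
    """
    totals = defaultdict(float)
    for postings in index_lists:
        for doc_id, score in postings:
            totals[doc_id] += score
    # Sort by descending total score, breaking ties by docID.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical toy data: three peers, one index list each.
lists = [
    [("d1", 0.75), ("d2", 0.5), ("d3", 0.25)],
    [("d2", 0.5), ("d3", 0.5)],
    [("d3", 0.5), ("d1", 0.125)],
]
result = top_k(lists, 2)
```

This is the answer that the distributed algorithms try to reach or approximate without shipping every (docID, score) pair to one place.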
[Figure: example index lists t1, t2, t3 and the peers P0, P1, P2, P3.]
t1: d78 0.9, d23 0.8, d10 0.8, d1 0.7, d88 0.2, …
t2: d64 0.8, d23 0.6, d10 0.6, d10 0.2, d78 0.1, …
t3: d10 0.7, d78 0.5, d64 0.4, d99 0.2, d34 0.1, …
Problem Definition: P0 is the peer where the query is initiated.
Costs to be considered:
• network consumption
• per-peer load
• latency (query response time)
• processing
Naïve Solution
• All m peers send their complete index lists to Pinit, which then executes a centralized TA-style (threshold algorithm) method, or
• Pinit executes TA itself and accesses the remote index lists one entry at a time (more message rounds needed!).
KLEE:
• Different philosophy: approximate answers!
• Efficiency:
  • reduces (docId, score)-pair transfers
  • no random accesses at each peer
• Two pillars:
  • the HistogramBlooms structure
  • the Candidate List Filter structure
KLEE Steps:
• Exploration Step: get a better approximation of the min-k score threshold (topKScore).
• Optimization Step: decide whether to run 3 or 4 steps.
• Candidate Filtering: a docID is a good candidate if it is high-scored at many peers.
• Candidate Retrieval: get all good docID candidates.
Histogram Bloom Structure
Each peer pre-computes, for each index list, an equi-width histogram with:
• a Bloom filter for each cell
• the average score per cell
• the upper/lower score
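The structure above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: scores are assumed to lie in [0, 1], each docID is assumed to appear once per list, and a plain Python set stands in for the per-cell Bloom filter. The name `build_histogram` and the dictionary keys are hypothetical.

```python
def build_histogram(postings, num_cells, low=0.0, high=1.0):
    """Summarise one index list as an equi-width histogram over scores.

    Each cell records its lower/upper score bound, the docIDs that fall
    into it (a set here, standing in for a Bloom filter) and the average
    score of those docs (None for an empty cell).
    """
    width = (high - low) / num_cells
    cells = [
        {"lower": low + i * width, "upper": low + (i + 1) * width,
         "docs": set(), "sum": 0.0}
        for i in range(num_cells)
    ]
    for doc_id, score in postings:
        # Clamp so that score == high lands in the last cell.
        idx = min(int((score - low) / width), num_cells - 1)
        cells[idx]["docs"].add(doc_id)
        cells[idx]["sum"] += score
    for cell in cells:
        count = len(cell["docs"])
        cell["avg"] = cell["sum"] / count if count else None
        del cell["sum"]
    return cells

# Hypothetical index list for one term.
postings = [("d1", 0.875), ("d2", 0.625), ("d3", 0.125), ("d4", 0.75)]
cells = build_histogram(postings, num_cells=4)
```

Replacing the set with a real Bloom filter trades exact membership for a fixed, small number of bits per cell, which is what makes the summary cheap to ship to the query initiator.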
Bloom Filter
• A space-efficient probabilistic data structure used to test whether an element is a member of a set
• A vector V of m bits, initially all set to 0
• k hash functions, each with range 1…m
• Insert n docs by hashing their ids and setting the corresponding bits
• Trade-off: accuracy vs. efficiency
[Figure: a Bloom filter with 4 hash functions, inserting an element a ∈ A.]
Given a query b, we check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, then b is definitely not in the set A; if all of them are 1, b is probably in A (false positives are possible).
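A minimal Bloom filter sketch along these lines is shown below. The class and method names are hypothetical, and deriving the k hash functions from salted SHA-256 digests is an assumption made for the example; the slides do not say which hash functions KLEE uses. Positions are 0-based here rather than 1…m.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits        # m in the slides
        self.num_hashes = num_hashes    # k in the slides
        self.bits = [0] * num_bits      # vector V, initially all 0

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash
        # with the hash-function index (an illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set";
        # True means "probably in the set" (false positives possible).
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(num_bits=64, num_hashes=4)
for doc_id in ["d78", "d23", "d10"]:
    bf.add(doc_id)
```

Every inserted docID is always reported as present; a docID that was never inserted is usually, but not always, reported as absent. Larger `num_bits` lowers the false-positive rate at the cost of more bytes on the wire.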
[Figure: Exploration Step. Coordinator peer P0 holds the current top-k and the candidate set. Each cohort peer (Pi, Pj) sends the top k entries of its index list plus a histogram of c cells, each cell carrying a Bloom filter of b bits; the topKScore / m threshold is marked on each index list.]
Exploration Step: To calculate topKScore, Pinit has to fill in missing scores. A candidate document may be missing from the index list entries received from some peer Pj. In that case Pinit uses the Bloom filters of that peer to find the histogram cell in which the document may lie and takes the average score of that cell as the document's score at Pj. After replacing all missing scores, Pinit computes the top-k list and takes the score of the k-th document in the list as topKScore.
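The estimation described above could look roughly like the sketch below. Several things are assumptions made for illustration: scores are aggregated by summation, Python sets stand in for the per-cell Bloom filters, the first matching cell is used, and a document found in no cell contributes 0. The function name and argument layout are hypothetical.

```python
def estimate_top_k_score(peer_tops, peer_cells, k):
    """Estimate topKScore at the query initiator.

    peer_tops[j]:  dict docID -> exact score, from the top entries sent by peer j.
    peer_cells[j]: list of (members, avg_score) pairs summarising the rest of
                   peer j's index list, highest-scoring cell first.
    """
    candidates = set().union(*peer_tops)
    totals = {}
    for doc in candidates:
        total = 0.0
        for top, cells in zip(peer_tops, peer_cells):
            if doc in top:
                total += top[doc]      # exact score is known
            else:
                # Missing score: use the average of the first cell whose
                # filter reports the doc, or 0.0 if no cell does.
                total += next((avg for members, avg in cells if doc in members), 0.0)
        totals[doc] = total
    ranked = sorted(totals.values(), reverse=True)
    return ranked[k - 1] if len(ranked) >= k else 0.0

# Hypothetical data for two peers.
peer_tops = [{"d1": 0.875, "d2": 0.75}, {"d2": 0.5, "d3": 0.5}]
peer_cells = [[({"d3"}, 0.25)], [({"d1"}, 0.125)]]
top_k_score = estimate_top_k_score(peer_tops, peer_cells, k=2)
threshold = top_k_score / len(peer_tops)   # the topKScore / m threshold
```

With real Bloom filters a lookup can return a false positive, so the estimate may be too high for some documents; that is one source of the approximation in the final result.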
Candidate List Filter Matrix
• Goal: filter out unpromising candidate documents in step 2
• Estimate the maximum number of docs that are above the min-k / m threshold (Maximum_size_candidate_list)
• Send this number and the threshold to the peers
[Figure: number of documents plotted against score, with the topKScore / m threshold marked.]
Candidate List Filter Matrix (CLFM)
• Each peer returns a Bloom filter that “contains” all docs above the topKScore / m threshold.
• For m peers, the m filters form the rows 1 .. m of the Candidate List Filter Matrix.
• Select all columns with at least R bits set.
[Figure: CLFM rows
010101001011110101001001010101001
010010011001011111001001010111110
..
101010101010100110010010011110000
and the resulting filter, labelled "Redefined CLF":
000000001000000100000000000100000]
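The column selection can be sketched as below, treating each peer's Bloom filter as a bit string of equal length. The function name `select_columns` and the four-bit toy filters are hypothetical.

```python
def select_columns(filters, r):
    """Combine per-peer filters into one candidate list filter.

    filters: list of equal-length bit strings, one row per peer.
    r:       minimum number of rows that must have a bit set for the
             corresponding column to be kept.
    Returns a bit string with 1 exactly in the columns that have at
    least r bits set.
    """
    width = len(filters[0])
    if any(len(row) != width for row in filters):
        raise ValueError("all filters must have the same length")
    return "".join(
        "1" if sum(row[i] == "1" for row in filters) >= r else "0"
        for i in range(width)
    )

# Hypothetical matrix for three peers.
matrix = ["0101", "0110", "1100"]
combined = select_columns(matrix, r=2)
```

A larger R keeps fewer columns, so fewer documents are retrieved in the final phase, at the risk of dropping a document that scores highly at only a few peers.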
[Figure: KLEE: Candidate Filter. Coordinator peer P0 holds the current top-k, the candidate set and the candidate filter matrix. Cohort peers Pi and Pj each show an index list ordered by score with the top k entries and the min-k / m threshold marked, and exchange bit-vector filters with P0.]
[Figure: the same candidate filter diagram, with an early stopping point marked on the index list in addition to the min-k / m threshold.]
Conclusion
• KLEE: approximate top-k algorithms for wide-area networks
• Significant performance benefits at only small penalties in result quality
• Flexible framework for top-k algorithms, allowing trade-offs between
  • efficiency and result quality, and
  • bandwidth savings and the number of communication phases
• Various fine-tuning parameters