Top-K Query Evaluation with Probabilistic Guarantees
Martin Theobald, Gerhard Weikum, Ralf Schenkel
Presenter: Avinandan Sengupta
Presentation Outline
• Introduction to Top-k query processing
• The threshold algorithm and its variants
• Are we solving the right problem?
• A probabilistic algorithm
• Implementation Details
• Results
• Conclusion
Data and a Query
• Data: hypothetical DB of NASDAQ-traded stocks (rows are objects, columns are attributes); data collated from Google Finance
• Query: top 10 midcap stocks with low β
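Hypothetical Graded Lists (made fit for consumption by Top-k processors)
• Aggregate function: $f = 0.5 \cdot \mathrm{P/E} + 1.0 \cdot \beta^{-1} + 1.0 \cdot \mathrm{MCap}$, where the coefficients 0.5, 1.0, and 1.0 are the weights
• P/E grade (normalization): $\mathrm{PE}_j / \text{highest PE}$
• $\beta^{-1}$ grade (normalization): $\beta_j^{-1} / \max_j(\beta_j^{-1})$
• MCap grade: based on how close the market cap is to the midcap median (≅ 4.5B); normalized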
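The grading and aggregation above could be sketched as follows. This is only an illustration: the tickers and figures are made up, and the market-cap grade (1 at the midcap median, falling linearly to 0 at the farthest market cap in the data) is an assumption, since the slide only says the grade reflects closeness to the median.

```python
def grade_stocks(stocks, midcap_median=4.5e9):
    """Return {ticker: aggregate score} using the weights 0.5, 1.0, 1.0."""
    max_pe = max(s["pe"] for s in stocks.values())
    max_inv_beta = max(1.0 / s["beta"] for s in stocks.values())
    # Assumed closeness grade: 1 at the median, 0 at the farthest market cap.
    max_dev = max(abs(s["mcap"] - midcap_median) for s in stocks.values())
    scores = {}
    for ticker, s in stocks.items():
        pe_grade = s["pe"] / max_pe                      # PE_j / highest PE
        beta_grade = (1.0 / s["beta"]) / max_inv_beta    # beta_j^-1 / max beta^-1
        mcap_grade = 1.0 - abs(s["mcap"] - midcap_median) / max_dev
        scores[ticker] = 0.5 * pe_grade + 1.0 * beta_grade + 1.0 * mcap_grade
    return scores


# Hypothetical tickers and values, for illustration only.
stocks = {
    "AAA": {"pe": 10.0, "beta": 0.5, "mcap": 4.5e9},
    "BBB": {"pe": 20.0, "beta": 1.0, "mcap": 6.5e9},
    "CCC": {"pe": 5.0, "beta": 2.0, "mcap": 0.5e9},
}
scores = grade_stocks(stocks)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Each per-attribute grade, sorted in descending order, would form one of the graded lists that a top-k processor reads.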
Top-k results [Figure: output of the Top-k Processor]
Fagin’s Threshold Algorithm (TA)
• Access the n lists in parallel (sorted access, one row at a time).
• As an object o_i is seen, perform random accesses to the other lists to find the complete score for o_i.
• Do the same for all objects in the current row.
• Now compute the threshold τ by aggregating the scores in the current row (with summation as the aggregate function, τ is their sum).
• The algorithm stops once k objects have been found with a score of at least τ.
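A minimal sketch of the steps above, assuming summation as the aggregate function and that every object appears in every list. The function name, the dict-based random access, and the sample lists are illustrative choices, not taken from the slides.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return (top_k, rows_read) under summation as the aggregate.

    lists: one list per attribute, each holding (object, score) pairs sorted
    by descending score. Assumption: every object appears in every list.
    """
    # Random access is modelled with one dict per list: object -> score.
    lookup = [dict(lst) for lst in lists]
    full_scores = {}
    top_k = []
    for row in range(len(lists[0])):
        # Sorted access: read the current row of every list.
        for lst in lists:
            obj = lst[row][0]
            if obj not in full_scores:
                # Random access to all lists to complete the score.
                full_scores[obj] = sum(table[obj] for table in lookup)
        # Threshold: aggregate (sum) of the scores in the current row.
        tau = sum(lst[row][1] for lst in lists)
        top_k = heapq.nlargest(k, full_scores.items(), key=lambda kv: kv[1])
        if len(top_k) == k and top_k[-1][1] >= tau:
            return top_k, row + 1
    return top_k, len(lists[0])


list_1 = [("a", 0.875), ("b", 0.75), ("c", 0.25), ("d", 0.125)]
list_2 = [("b", 0.875), ("a", 0.5), ("d", 0.375), ("c", 0.25)]
result, rows_read = threshold_algorithm([list_1, list_2], k=2)
print(result, rows_read)  # [('b', 1.625), ('a', 1.375)] 2
```

In this run the threshold drops from 1.75 in the first row to 1.25 in the second, at which point both candidates score at least τ and the scan stops halfway through the lists.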
TA with No Random Access (TA-NRA)
• Access the n lists in parallel.
• For an item a, compute its (B)est score, where $\mathrm{high}_k$ is the last seen score for the kth attribute:
  $B_a = f\bigl(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\ f\{\mathrm{high}_k \mid k \notin \text{seen-attributes}(a)\}\bigr)$
• and its (W)orst score:
  $W_a = f\bigl(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\ f\{0 \mid k \notin \text{seen-attributes}(a)\}\bigr)$
• Halt when k distinct objects have been seen and there is no object m outside the Top-k list whose $B_m \ge W_k$, where $W_k$ is the smallest W value in the Top-k list.
• This means that we also maintain a table of all seen objects with their W/B scores.
• Running Top-k list: contains the k objects with the largest W values; ties are broken with B values.
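The bookkeeping above can be sketched as follows, assuming summation as the aggregate function and lists of equal length. The function name and sample lists are illustrative; unseen objects are covered by treating the sum of the last seen scores as their best possible score.

```python
def ta_nra(lists, k):
    """Return (top_k, rows_read); top_k holds (object, worst, best) triples."""
    n = len(lists)
    seen = {}  # object -> {list index: score seen in that list}
    ranked, worst, best = [], {}, {}
    depth = len(lists[0])
    for row in range(depth):
        high = []  # last seen score per list
        for i, lst in enumerate(lists):
            obj, score = lst[row]
            seen.setdefault(obj, {})[i] = score
            high.append(score)
        # Worst score: unseen attributes count as 0.
        worst = {o: sum(s.values()) for o, s in seen.items()}
        # Best score: unseen attributes count as the last seen score.
        best = {o: worst[o] + sum(high[i] for i in range(n) if i not in s)
                for o, s in seen.items()}
        # Running top-k: largest W first, ties broken with B.
        ranked = sorted(seen, key=lambda o: (worst[o], best[o]), reverse=True)
        if len(ranked) < k:
            continue
        w_k = worst[ranked[k - 1]]
        rivals = [best[o] for o in ranked[k:]]
        # Best score of any still-unseen object (conservative if none remain).
        rivals.append(sum(high))
        if all(b < w_k for b in rivals):
            return [(o, worst[o], best[o]) for o in ranked[:k]], row + 1
    return [(o, worst[o], best[o]) for o in ranked[:k]], depth


list_1 = [("a", 0.875), ("b", 0.75), ("c", 0.25), ("d", 0.125)]
list_2 = [("b", 0.875), ("a", 0.5), ("d", 0.375), ("c", 0.25)]
top_k, rows_read = ta_nra([list_1, list_2], k=2)
print(top_k, rows_read)
```

Note that the `seen` table grows with every new object read, which is the space cost the last bullets above refer to.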
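Issues with TA and TA-NRA
• High space-time costs: TA relies on random accesses, and TA-NRA has to maintain every seen object with its W/B scores
• Overly conservative: a candidate is kept for as long as its best-score bound could still reach the top-k, however unlikely that is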
Are we solving the right problem?
• Is random access possible in most common scenarios?
  • Web content
  • XML data, hierarchical data sets
• Does the user need an exact top-k query result?
  • Or is she satisfied with an approximation?
How about an approximate solution?
• Can we remove candidates (objects that we think can still make it into the top-k list) from consideration early on in the process?
• Goal: reach a solution quickly
Pictorially... Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
Probabilistic TA-NRA - 1
• Predict the total score of an item for which a partial score is known
• Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA
• Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)
Probabilistic TA-NRA - 2
• If this probability is sufficiently low (below a threshold), drop the item from the candidate list.
• The probabilistic prediction involves computing the convolution of the score distributions of different index lists.
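A minimal sketch of this pruning test, under several assumptions that are not in the slides: each unseen list's score distribution is given as a discrete probability mass function over the values 0, step, 2·step, ...; the scores in different lists are independent; the aggregate is a sum; `min_k` stands for the score an item must exceed to enter the top-k; and `epsilon` is the pruning threshold. All names are hypothetical.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q (discrete)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


def prob_exceeds(partial_score, unseen_pmfs, step, min_k):
    """P[partial_score + sum of unseen scores > min_k]."""
    total = [1.0]  # distribution of an empty sum: all mass at 0
    for pmf in unseen_pmfs:
        total = convolve(total, pmf)
    delta = min_k - partial_score  # what the unseen scores must still add
    return sum(p for idx, p in enumerate(total) if idx * step > delta)


def should_drop(partial_score, unseen_pmfs, step, min_k, epsilon):
    """Drop the candidate if it is unlikely to reach the top-k."""
    return prob_exceeds(partial_score, unseen_pmfs, step, min_k) < epsilon


# Two unseen lists; each score is 0, 0.5, or 1.0 with the given probabilities.
pmf = [0.5, 0.25, 0.25]
p = prob_exceeds(partial_score=0.5, unseen_pmfs=[pmf, pmf], step=0.5, min_k=2.0)
print(p)  # 0.0625
print(should_drop(0.5, [pmf, pmf], 0.5, 2.0, epsilon=0.1))  # True
```

A conservative best-score test would keep this candidate, since its best possible score (0.5 + 1.0 + 1.0 = 2.5) exceeds 2.0; the probabilistic test drops it because that outcome has only a 6.25% chance under the assumed distributions.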
Score Distribution of Lists - How? [Figure: pdf plotted against score, with score-axis values 0.59, 0.65, and 1.0 marked; numbered annotations: 1 Median, 2 curve fitting, 3 Parameter fitting]
What it is and What it is not
• Probabilistic guarantees are not about query run-times but about query result quality
• Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set
The Math Set of seen attributes for an object Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
More Math... Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
What distributions to consider?
• Uniform distribution
  • simplest assumptions
  • convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds
• Poisson estimations
  • efficiently evaluated; provide a reasonable fit for tf*idf-based score distributions for Web corpora
• Histograms
  • when the above methods fail
  • involve non-trivial computation (done offline per list)
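To illustrate how a tail bound can stand in for an exact convolution in the uniform case, the sketch below uses the plain Hoeffding inequality, P[S − E[S] ≥ t] ≤ exp(−2t² / Σ(b_i − a_i)²), for independent scores S_i uniform on [0, high_i]. This is a simpler stand-in, not the generalized Chernoff-Hoeffding bound the slide refers to; the function and parameter names are hypothetical.

```python
import math


def hoeffding_tail_bound(highs, delta):
    """Upper bound on P[S_1 + ... + S_m >= delta].

    Assumes independent S_i, each uniform on [0, highs[i]], so that
    E[S_i] = highs[i] / 2 and each range has width highs[i].
    """
    mean = sum(h / 2 for h in highs)
    t = delta - mean
    if t <= 0:
        return 1.0  # the bound says nothing at or below the mean
    return math.exp(-2 * t * t / sum(h * h for h in highs))


# Two unseen lists whose last seen scores are both 1.0; the candidate
# still needs 1.8 more points to beat the current top-k.
bound = hoeffding_tail_bound([1.0, 1.0], delta=1.8)
print(bound)
```

The bound costs a handful of arithmetic operations per candidate, whereas an exact answer would need the distribution of the sum. The price is looseness: the bound may overestimate the probability, which keeps some candidates longer than necessary but does not drop any that an exact computation would keep.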
Solving Convolutions? Difficult
• Even when each PDF is a uniform distribution, computing the convolution exactly becomes difficult
• Use alternative techniques instead of convolution
• Off-load computation to available probabilistic engines – OpenMaple, etc.
Queue Management Source: http://www.mpi-inf.mpg.de/~mtb/pub/imprs-topk-xml_poster.pdf (author’s webpage)
Results Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
Performance as a function of ε Source: Paper
Precision of probabilistic predictors for tf*idf, Uniform-, and Zipf-distributed scores Source: Paper
Conclusion
• New algorithms were developed based on probabilistic score predictions
• They trade off a small amount of top-k result quality for a drastic reduction of sorted accesses
• Intelligent management of priority queues for efficient implementation was presented
• The aggregation function was assumed to be summation
• Future work is to be based on ranked retrieval of XML data and integration into the XXL search engine