Top-K Query Evaluation with Probabilistic Guarantees

Top-K Query Evaluation with Probabilistic Guarantees • Authors: Martin Theobald, Gerhard Weikum, Ralf Schenkel • Presenter: Avinandan Sengupta

Presentation Outline • Introduction to Top-k query processing • The threshold algorithm and its variants • Are we solving the right problem? • A probabilistic algorithm • Implementation Details • Results • Conclusion


Data and a Query • Hypothetical DB of NASDAQ-traded stocks (rows are objects, columns are attributes); data collated from Google Finance • Query: top 10 midcap stocks with low β

Hypothetical Graded Lists (made fit for consumption by top-k processors) • Aggregate function: f = 0.5*P/E + 1.0*β⁻¹ + 1.0*MCap (the coefficients are weights) • P/E grade: PEj / highest PE • β⁻¹ grade: β⁻¹j / max(β⁻¹j) • MCap grade: based on how close the market cap is to the midcap median (≅ 4.5B), normalized

Top-k results • [Figure: the top-k processor and its top-k results]

Fagin’s Threshold Algorithm (TA) • Access the n lists in parallel (sorted access, one row at a time). • As an object oi is seen, perform random accesses to the other lists to find the complete score for oi. • Do the same for all objects in the current row. • Compute the threshold τ by applying the aggregate function to the scores in the current row; for summation, τ is their sum. • The algorithm stops once k objects have been found with a score of at least τ.
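The TA steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it assumes summation as the aggregate function, in-memory lists of (object, score) pairs sorted by descending score, and a score of 0 for objects missing from a list. The function name `threshold_algorithm` and the sample data are hypothetical.

```python
import heapq

def threshold_algorithm(lists, k):
    """Sketch of TA with summation as the aggregate (an assumption).

    lists: one list per attribute, each holding (object_id, score)
    pairs sorted by score in descending order.
    Returns (top-k as (object_id, total_score) pairs, rows read).
    """
    # Dictionaries stand in for random access into each list.
    lookup = [dict(lst) for lst in lists]
    totals = {}  # complete score of every object seen so far
    depth = max(len(lst) for lst in lists)
    for row in range(depth):
        row_scores = []
        for lst in lists:
            if row >= len(lst):
                row_scores.append(0.0)  # list exhausted
                continue
            obj, score = lst[row]
            row_scores.append(score)
            if obj not in totals:
                # Random access: fetch the object's score in every list.
                totals[obj] = sum(lk.get(obj, 0.0) for lk in lookup)
        tau = sum(row_scores)  # threshold for the current row
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, row + 1
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1]), depth

# Hypothetical graded lists for two attributes.
list1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list2 = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
top2, rows_read = threshold_algorithm([list1, list2], k=2)
```

On this sample the scan stops after two rows, because both objects in the running top-2 already score at least τ.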

TA with No Random Access (TA-NRA) • Access the n lists in parallel. • For an item a, compute its best score Ba = f { f {scorej | j ∈ seen-attributes(a)}, f {highk | k ∉ seen-attributes(a)}}, where highk is the last score seen in the list for the kth attribute. • Also compute its worst score Wa = f { f {scorej | j ∈ seen-attributes(a)}, f {0 | k ∉ seen-attributes(a)}}. • Maintain a running top-k list containing the k objects with the largest W values; ties are broken by B values. • Halt when k distinct objects have been seen and there is no object m outside the top-k list with Bm ≥ Wk, where Wk is the smallest W value in the top-k list. • This means that a table of all seen objects with their W and B scores must also be maintained.
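The best-score/worst-score bookkeeping of TA-NRA can be sketched as follows. This is an illustrative sketch under stated assumptions: summation as the aggregate, in-memory lists sorted by descending score, and an extra check (standard in NRA, but not spelled out on the slide) that a completely unseen object, whose best score is the sum of the high values, cannot enter the top-k either. The name `nra` and the sample data are hypothetical.

```python
def nra(lists, k):
    """Sketch of TA-NRA with summation as the aggregate (an assumption).

    lists: one list per attribute, each holding (object_id, score)
    pairs sorted by score in descending order. Sorted access only.
    Returns (top-k as (object_id, W, B) triples, rows read).
    """
    n = len(lists)
    partial = {}      # object -> {list index: score seen there}
    high = [0.0] * n  # last score seen in each list
    depth = max(len(lst) for lst in lists)
    top = []
    for row in range(depth):
        for i, lst in enumerate(lists):
            if row < len(lst):
                obj, score = lst[row]
                partial.setdefault(obj, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # exhausted list contributes nothing more
        # Worst score: unseen attributes count as 0.
        worst = {o: sum(s.values()) for o, s in partial.items()}
        # Best score: unseen attributes count as the list's high value.
        best = {o: worst[o] + sum(high[i] for i in range(n) if i not in s)
                for o, s in partial.items()}
        # Rank by W, ties broken by B.
        ranked = sorted(partial, key=lambda o: (worst[o], best[o]),
                        reverse=True)
        top = [(o, worst[o], best[o]) for o in ranked[:k]]
        if len(top) == k:
            w_k = top[-1][1]
            # Assumption: also rule out objects not seen in any list.
            unseen_ok = sum(high) < w_k
            if unseen_ok and all(best[o] < w_k for o in ranked[k:]):
                return top, row + 1
    return top, depth

# Hypothetical graded lists for two attributes.
list1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list2 = [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2)]
top1, rows_read = nra([list1, list2], k=1)
```

Note that the table `partial` grows with every distinct object seen, which is the space cost the next slide refers to.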

Issues with TA and TA-NRA • High space-time costs • Overly conservative

Are we solving the right problem? • Is random access possible in most common scenarios? • Web content • XML data, hierarchical data sets • Does the user need an exact top-k query result? • Or is she satisfied with an approximation?

How about an approximate solution? • Can we remove candidates (objects that we think could make it into the top-k list) from consideration early in the process? • Reach a solution quickly

Pictorially... Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)

Probabilistic TA-NRA - 1 • Predict the total score of an item for which a partial score is known • Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA • Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)

Probabilistic TA-NRA - 2 • If this probability is sufficiently low (below a threshold), drop the item from the candidate list. • The probabilistic prediction involves computing the convolution of the score distributions of different index lists.
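The pruning test described on these two slides can be sketched with discrete histograms. This is a simplified illustration, not the authors' code: it assumes summation as the aggregate, independent lists, equal-width histograms whose bucket b stands for the score b*width, and a score threshold equal to the current k-th worst score (called `min_k` here). All function and parameter names are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent bucketed scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def exceed_probability(histograms, delta, width):
    """P[sum of the unseen scores > delta].

    histograms: one list per unseen attribute; entry b is the
    probability that the score equals b * width (an assumption).
    """
    dist = [1.0]  # distribution of an empty sum: all mass at 0
    for h in histograms:
        dist = convolve(dist, h)
    return sum(mass for b, mass in enumerate(dist) if b * width > delta)

def keep_candidate(worst, min_k, unseen_histograms, epsilon, width):
    """Keep the item only if it can still beat min_k with probability
    at least epsilon; otherwise it is dropped from the candidate list."""
    delta = min_k - worst  # score the unseen attributes must still add
    return exceed_probability(unseen_histograms, delta, width) >= epsilon

# Hypothetical histogram over the scores 0.0, 0.5 and 1.0.
h = [0.5, 0.3, 0.2]
p = exceed_probability([h, h], delta=1.2, width=0.5)
dropped = not keep_candidate(0.8, 2.0, [h, h], epsilon=0.2, width=0.5)
```

In this sample the item needs more than 1.2 additional points from two unseen lists; the convolved histogram puts a probability of about 0.16 on that, so with ε = 0.2 the item is dropped.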

Score Distribution of Lists - How? • [Figure: pdf plotted against score (axis up to 1.0), with the median and the values 0.59 and 0.65 marked; numbered steps include parameter fitting and curve fitting]

What it is and What it is not • Probabilistic guarantees are not about query run-times but about query result quality • Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set

The Math Set of seen attributes for an object Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)

More Math... Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)

What distributions to consider? • Uniform distribution • simplest assumptions • convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds • Poisson estimations • efficiently evaluated, provides a reasonable fit for tf*idf based score distributions for Web corpora • Histograms • when above methods fail • Involves non-trivial computation (done offline per list)
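For the uniform case, a tail bound avoids computing the convolution explicitly. The sketch below uses the standard Hoeffding inequality, P[S − E[S] ≥ t] ≤ exp(−2t² / Σ(bi − ai)²), as a stand-in; it is not the generalized Chernoff-Hoeffding bound mentioned on the slide. It assumes independent scores, each uniform on [0, highi], and the function name is hypothetical.

```python
import math

def hoeffding_exceed_bound(highs, delta):
    """Upper bound on P[S > delta], where S is the sum of independent
    scores and score i is uniform on [0, highs[i]] (an assumption).
    """
    mean = sum(h / 2.0 for h in highs)  # E[S] for uniform scores
    t = delta - mean
    if t <= 0:
        return 1.0  # the bound says nothing below the mean
    return math.exp(-2.0 * t * t / sum(h * h for h in highs))

# Two unseen lists whose last seen scores are both 1.0.
bound = hoeffding_exceed_bound([1.0, 1.0], delta=1.5)
```

Because this is only an upper bound, pruning on it is conservative: an item is dropped only when even the bound falls below the pruning threshold.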

Solving Convolutions? Difficult • When the PDF is a uniform distribution, solving the convolution becomes difficult • Use techniques other than explicit convolution • Off-load the computation to available probabilistic engines such as OpenMaple

Queue Management Source: http://www.mpi-inf.mpg.de/~mtb/pub/imprs-topk-xml_poster.pdf (author’s webpage)

Results Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)

Performance as a function of ε Source: Paper

Precision of probabilistic predictors for tf*idf, Uniform-, and Zipf-distributed scores Source: Paper

Conclusion • New algorithms were developed based on probabilistic score predictions • They trade off a small amount of top-k result quality for a drastic reduction in sorted accesses • Intelligent management of priority queues for an efficient implementation was presented • The aggregation function was assumed to be summation • Future work: ranked retrieval of XML data and integration into the XXL search engine

Thanks!
