Evaluating Top- K Selection Queries

Evaluating Top-KSelection Queries Surajit ChaudhuriMicrosoft Research Luis GravanoColumbia University

Motivating Example Find 4-bedroom houses priced at $350,000 • Exact matches often too restrictive • Rank of houses that are closest to specification more desirable

Motivating Example (cont.) Find 4-bedroom houses priced at $350,000 • House 1: 5 bedrooms; $400,000; Score=0.9 • House 2: 4 bedrooms; $485,000; Score=0.8 • House 3: 6 bedrooms; $785,000; Score=0.3

Top-K Queries over Precise Relational Data • Support approximate matches with minimal changes to the relational engine • Initial focus: Selection queries with “equality” conditions

Outline • Definition of top-k queries • Execution alternatives • Mapping of top-k queries to selection queries • Experiments

Top-K Selection Queries • Specify an n-dimensional target point • Define scoring function • Specify k Answer:k objects with the best score for the target point (i.e., the “top k” objects)

Specifying Top-K Queries using SQL • Select * • From R • Order [k] By Scoring_Function

Scoring Functions Measure Degree of Match • Assume attributes defined over metric space • Score on any one attribute is well defined • How to aggregate scores across attributes?

Scoring Functions • Normalize attribute scores to be in [0,1] range • Combine scores using popular aggregate functions • Min • Euclidean • Sum, Max, …

Some Example Scoring Functions Let q=(q1, …, qn) be the target point and t=(t1, …, tn) a tuple: • Min(q, t) = min{1-|q1-t1|, …, 1-|qn-tn|} • Euclidean(q, t) = 1- sqrt((q1-t1)2/n+ … + (qn-tn)2/n)

Executing Top-K Queries • Known techniques require at least one sequential scan (or a functional index) • Evaluate Scoring_Function for each tuple • Sort tuples [Carey & Kossman ‘97; ‘98] • Question: How to avoid sequential scans? Exploit implicit selectivity of top-k queries

Mapping a Top-K Query to a Selection Query • Determine a search score S such that: • Expected # of tuples with score > S is k • No false dismissals • Turn the condition that score > S into a range selection condition(s) • Evaluate selection query using existing query processor and access paths

Mapping a Top-K Query to a Selection Query 4-bedrooms; $350,000; k=10 • Retrieve all tuples with score > 0.5 (at least k=10 tuples expected) • Analyze scoring function to determine selection range: Bedrooms: [3, 5] and Price: [$250K, $450K]

Mapping a Search Score to a Selection Range For search score S , target point q=(q1, q2),and scoring function Min: Selection range: • t1 IN [q1 - (1.0-S), q1 + (1.0-S)] • t2IN [q2 - (1.0-S), q2 + (1.0-S)]

Determining a Search Score • Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:Score of t should be at least that of t’ • Therefore, Score cannot be high “far away” from target • Sphere for Euclidean • Box for Min …centered at target point “Tightness” of enclosing range varies with scoring functions b c a

The Min Scoring Function

The Euclidean Scoring Function

Comments on Mapping • Search score determines efficiency, not correctness • Issues in efficiency: • Avoid retrieving too many tuples • Avoid retrieving fewer than k top tuples (restarts) How to determine good search scores?

Determining Search Scores • Find k points in data • Compute their score • Set search score to lowest score Challenges: • Determining the initial k points to optimize execution • Taking original query into account

Using Histograms 11 4 Q 20 10

Picking KRepresentative “Tuples” • Collapse histogram bucket to a single representative point • Furthest from Q in bucket (“NoRestarts”) • Closest to Q in bucket (“Restarts”) • Assign bucket frequency to the single representative point • Include closest representative points until we have k tuples

Using Histograms:“NoRestarts” 11 4 Q 20 10

Using Histograms:“Restarts” 11 4 Q 20 10

Other Strategies for Determining Search Scores • Calculate search score for: • n = NoRestarts (“pessimistic” extreme) • r = Restarts (“optimistic” extreme) • Use intermediate scores: • Inter1 = (2n + r)/3 • Inter2= (n + 2r)/3 0 NoRestarts Restarts 1

Evaluating the Generated Selection Query • Sequential scan • Intersection of a set of indexes, followed by data access • Special case: index-only access

Indexes and Statistics • Indexes n-dim (concatenated-key) B-trees • Statistics • MaxDiff as base 1-dim histogram • Multidimensional histograms: AVI, Phased, MHist

Experimental Evaluation • Is mapping to selection queries an effective technique? • Sensitivity of relevant parameters: • Scoring functions • Data skew and dimensionality • Statistics

Data Generation Characterized by Z = <z1, …, zn> • Generate N tuples by Zipfian distribution z1 • Group tuples by attr1 • For a partition with attr1 = a with N1 tuples: • Generate N1values w1, ..., wN1 using Zipfian distribution z2 • Create pairs (a, w1), …, (a, wN1) • Repeat steps to fill in all attribute values

Metrics for Comparison • Fraction of data tuples accessed may be compared to: • Ideal: k • Worst case: size of data set • % of restarts

Exploring Limits • Intrinsic limitations of range-query approach: • Enclose actual top-k tuples in tight n-rectangle • Retrieve all tuples in n-rectangle Less than 1% of database tuples in n-rectangle(k=10; 100,000 tuples) • Effect of retrieving tuples with score > S using an n-rectangle

Effect of Scoring Functions • Min has little/no gap between target region and enclosing n-rectangle As k increases, fraction of retrieved tuples grows slowest for Min • Euclidean performs worse Less tight n-rectangle

Tuples with Score > S v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

Effect of Mapping Strategies and Histograms • Multidimensional histograms aid computation of tight search scores • NoRestarts dominates at high data skew

Tuples Retrieved v. Data Skew(PHASED histogram of 5KB; n=3)

Restarts v. Data Skew(PHASED histogram of 5KB; n=3)

Related Work (1) • [Fagin ‘96; ‘98] • Multimedia attributes with query “subsystem” • Multiple index scans • Independence assumption • [Chaudhuri & Gravano ‘96] • Multimedia attributes with query “subsystem” • Map top-k queries to “selection” queries • Independence assumption • Limited scoring functions

Related Work (2) • [Carey & Kossman ‘97; ‘98] Optimized sorting phase using k • Nearest-neighbor literature • [Donjerkovic & Ramakrishnan ‘99] • Probabilistic optimization framework • No multidimensional scoring functions • Independence assumptions

Summary • Defined mapping of top-k queries to traditional selection queries Exploit existing database statistics and query processors • Studied effect of scoring functions, data skew, statistics on mapping Full experimental analysis forthcoming!

Tuples Retrieved v. Histogram Size(Euclidean; n=3; Z21)

Tuples Retrieved v. n(PHASED histogram of 5KB; Z21)

Restarts v. n(PHASED histogram of 5KB; Z21)

Tuples Retrieved v. k(PHASED histogram of 5KB; Z21; n=3)

Restarts v. k(PHASED histogram of 5KB; Z21; n=3)

Restarts v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

Tuples Retrieved v. Histogram Size(Census Database; PHASED)

Tuples Retrieved v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

The Sum Scoring Function

The Max Scoring Function

Evaluating Top- K Selection Queries

Evaluating Top- K Selection Queries

Presentation Transcript

Supporting Top- k join Queries in Relational Databases

Evaluating Top-k Queries Over Web-Accessible Databases

Answering Top-k Queries Using Views

Top-k Queries on Temporal Data

Evaluating Probability Threshold k-Nearest-Neighbor Queries over Uncertain Data

Top- k Queries on Uncertain Data

Fast Algorithms for Top-k Personalized PageRank Queries

Answering Top-k Queries Using Views

6 Rank Aggregation and Top-k Queries

Supporting top-k join queries in relational databases

Supporting top-k join queries in relational databases

Sliding-window Top-k Queries on Uncertain Streams

Answering Top-k Queries Using Views

Evaluating Queries over Route Collections

Cleaning Uncertain Data for Top-k Queries

Supporting Top- k join Queries in Relational Databases

Best Position Algorithms for Top-k Queries

Continuous Top-k Dominating Queries

Evaluating top-k Queries over Web-Accessible Databases

Reverse Top- k Queries

Evaluating Top-k Queries over Web-Accessible Databases

Supporting Top- k join Queries in Relational Databases