480 likes | 630 Views
Evaluating Top- K Selection Queries. Surajit Chaudhuri Microsoft Research Luis Gravano Columbia University. Motivating Example. Find 4-bedroom houses priced at $350,000 Exact matches often too restrictive Rank of houses that are closest to specification more desirable.
E N D
Evaluating Top-KSelection Queries Surajit ChaudhuriMicrosoft Research Luis GravanoColumbia University
Motivating Example Find 4-bedroom houses priced at $350,000 • Exact matches often too restrictive • Rank of houses that are closest to specification more desirable
Motivating Example (cont.) Find 4-bedroom houses priced at $350,000 • House 1: 5 bedrooms; $400,000; Score=0.9 • House 2: 4 bedrooms; $485,000; Score=0.8 • House 3: 6 bedrooms; $785,000; Score=0.3
Top-K Queries over Precise Relational Data • Support approximate matches with minimal changes to the relational engine • Initial focus: Selection queries with “equality” conditions
Outline • Definition of top-k queries • Execution alternatives • Mapping of top-k queries to selection queries • Experiments
Top-K Selection Queries • Specify an n-dimensional target point • Define scoring function • Specify k Answer:k objects with the best score for the target point (i.e., the “top k” objects)
Specifying Top-K Queries using SQL • Select * • From R • Order [k] By Scoring_Function
Scoring Functions Measure Degree of Match • Assume attributes defined over metric space • Score on any one attribute is well defined • How to aggregate scores across attributes?
Scoring Functions • Normalize attribute scores to be in [0,1] range • Combine scores using popular aggregate functions • Min • Euclidean • Sum, Max, …
Some Example Scoring Functions Let q=(q1, …, qn) be the target point and t=(t1, …, tn) a tuple: • Min(q, t) = min{1-|q1-t1|, …, 1-|qn-tn|} • Euclidean(q, t) = 1- sqrt((q1-t1)2/n+ … + (qn-tn)2/n)
Executing Top-K Queries • Known techniques require at least one sequential scan (or a functional index) • Evaluate Scoring_Function for each tuple • Sort tuples [Carey & Kossman ‘97; ‘98] • Question: How to avoid sequential scans? Exploit implicit selectivity of top-k queries
Mapping a Top-K Query to a Selection Query • Determine a search score S such that: • Expected # of tuples with score > S is k • No false dismissals • Turn the condition that score > S into a range selection condition(s) • Evaluate selection query using existing query processor and access paths
Mapping a Top-K Query to a Selection Query 4-bedrooms; $350,000; k=10 • Retrieve all tuples with score > 0.5 (at least k=10 tuples expected) • Analyze scoring function to determine selection range: Bedrooms: [3, 5] and Price: [$250K, $450K]
Mapping a Search Score to a Selection Range For search score S , target point q=(q1, q2),and scoring function Min: Selection range: • t1 IN [q1 - (1.0-S), q1 + (1.0-S)] • t2IN [q2 - (1.0-S), q2 + (1.0-S)]
Determining a Search Score • Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:Score of t should be at least that of t’ • Therefore, Score cannot be high “far away” from target • Sphere for Euclidean • Box for Min …centered at target point “Tightness” of enclosing range varies with scoring functions b c a
Comments on Mapping • Search score determines efficiency, not correctness • Issues in efficiency: • Avoid retrieving too many tuples • Avoid retrieving fewer than k top tuples (restarts) How to determine good search scores?
Determining Search Scores • Find k points in data • Compute their score • Set search score to lowest score Challenges: • Determining the initial k points to optimize execution • Taking original query into account
Using Histograms 11 4 Q 20 10
Picking KRepresentative “Tuples” • Collapse histogram bucket to a single representative point • Furthest from Q in bucket (“NoRestarts”) • Closest to Q in bucket (“Restarts”) • Assign bucket frequency to the single representative point • Include closest representative points until we have k tuples
Using Histograms:“NoRestarts” 11 4 Q 20 10
Using Histograms:“Restarts” 11 4 Q 20 10
Other Strategies for Determining Search Scores • Calculate search score for: • n = NoRestarts (“pessimistic” extreme) • r = Restarts (“optimistic” extreme) • Use intermediate scores: • Inter1 = (2n + r)/3 • Inter2= (n + 2r)/3 0 NoRestarts Restarts 1
Evaluating the Generated Selection Query • Sequential scan • Intersection of a set of indexes, followed by data access • Special case: index-only access
Indexes and Statistics • Indexes n-dim (concatenated-key) B-trees • Statistics • MaxDiff as base 1-dim histogram • Multidimensional histograms: AVI, Phased, MHist
Experimental Evaluation • Is mapping to selection queries an effective technique? • Sensitivity of relevant parameters: • Scoring functions • Data skew and dimensionality • Statistics
Data Generation Characterized by Z = <z1, …, zn> • Generate N tuples by Zipfian distribution z1 • Group tuples by attr1 • For a partition with attr1 = a with N1 tuples: • Generate N1values w1, ..., wN1 using Zipfian distribution z2 • Create pairs (a, w1), …, (a, wN1) • Repeat steps to fill in all attribute values
Metrics for Comparison • Fraction of data tuples accessed may be compared to: • Ideal: k • Worst case: size of data set • % of restarts
Exploring Limits • Intrinsic limitations of range-query approach: • Enclose actual top-k tuples in tight n-rectangle • Retrieve all tuples in n-rectangle Less than 1% of database tuples in n-rectangle(k=10; 100,000 tuples) • Effect of retrieving tuples with score > S using an n-rectangle
Effect of Scoring Functions • Min has little/no gap between target region and enclosing n-rectangle As k increases, fraction of retrieved tuples grows slowest for Min • Euclidean performs worse Less tight n-rectangle
Tuples with Score > S v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)
Effect of Mapping Strategies and Histograms • Multidimensional histograms aid computation of tight search scores • NoRestarts dominates at high data skew
Related Work (1) • [Fagin ‘96; ‘98] • Multimedia attributes with query “subsystem” • Multiple index scans • Independence assumption • [Chaudhuri & Gravano ‘96] • Multimedia attributes with query “subsystem” • Map top-k queries to “selection” queries • Independence assumption • Limited scoring functions
Related Work (2) • [Carey & Kossman ‘97; ‘98] Optimized sorting phase using k • Nearest-neighbor literature • [Donjerkovic & Ramakrishnan ‘99] • Probabilistic optimization framework • No multidimensional scoring functions • Independence assumptions
Summary • Defined mapping of top-k queries to traditional selection queries Exploit existing database statistics and query processors • Studied effect of scoring functions, data skew, statistics on mapping Full experimental analysis forthcoming!
Restarts v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)
Tuples Retrieved v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)