1 / 48

Evaluating Top- K Selection Queries

Evaluating Top- K Selection Queries. Surajit Chaudhuri Microsoft Research Luis Gravano Columbia University. Motivating Example. Find 4-bedroom houses priced at $350,000 Exact matches often too restrictive Rank of houses that are closest to specification more desirable.

zoheret
Download Presentation

Evaluating Top- K Selection Queries

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluating Top-KSelection Queries Surajit ChaudhuriMicrosoft Research Luis GravanoColumbia University

  2. Motivating Example Find 4-bedroom houses priced at $350,000 • Exact matches often too restrictive • Rank of houses that are closest to specification more desirable

  3. Motivating Example (cont.) Find 4-bedroom houses priced at $350,000 • House 1: 5 bedrooms; $400,000; Score=0.9 • House 2: 4 bedrooms; $485,000; Score=0.8 • House 3: 6 bedrooms; $785,000; Score=0.3

  4. Top-K Queries over Precise Relational Data • Support approximate matches with minimal changes to the relational engine • Initial focus: Selection queries with “equality” conditions

  5. Outline • Definition of top-k queries • Execution alternatives • Mapping of top-k queries to selection queries • Experiments

  6. Top-K Selection Queries • Specify an n-dimensional target point • Define scoring function • Specify k Answer:k objects with the best score for the target point (i.e., the “top k” objects)

  7. Specifying Top-K Queries using SQL • Select * • From R • Order [k] By Scoring_Function

  8. Scoring Functions Measure Degree of Match • Assume attributes defined over metric space • Score on any one attribute is well defined • How to aggregate scores across attributes?

  9. Scoring Functions • Normalize attribute scores to be in [0,1] range • Combine scores using popular aggregate functions • Min • Euclidean • Sum, Max, …

  10. Some Example Scoring Functions Let q=(q1, …, qn) be the target point and t=(t1, …, tn) a tuple: • Min(q, t) = min{1-|q1-t1|, …, 1-|qn-tn|} • Euclidean(q, t) = 1- sqrt((q1-t1)2/n+ … + (qn-tn)2/n)

  11. Executing Top-K Queries • Known techniques require at least one sequential scan (or a functional index) • Evaluate Scoring_Function for each tuple • Sort tuples [Carey & Kossman ‘97; ‘98] • Question: How to avoid sequential scans? Exploit implicit selectivity of top-k queries

  12. Mapping a Top-K Query to a Selection Query • Determine a search score S such that: • Expected # of tuples with score > S is k • No false dismissals • Turn the condition that score > S into a range selection condition(s) • Evaluate selection query using existing query processor and access paths

  13. Mapping a Top-K Query to a Selection Query 4-bedrooms; $350,000; k=10 • Retrieve all tuples with score > 0.5 (at least k=10 tuples expected) • Analyze scoring function to determine selection range: Bedrooms: [3, 5] and Price: [$250K, $450K]

  14. Mapping a Search Score to a Selection Range For search score S , target point q=(q1, q2),and scoring function Min: Selection range: • t1 IN [q1 - (1.0-S), q1 + (1.0-S)] • t2IN [q2 - (1.0-S), q2 + (1.0-S)]

  15. Determining a Search Score • Monotonicity: Consider tuple t that is no further from target than t’ on any attribute:Score of t should be at least that of t’ • Therefore, Score cannot be high “far away” from target • Sphere for Euclidean • Box for Min …centered at target point “Tightness” of enclosing range varies with scoring functions b c a

  16. The Min Scoring Function

  17. The Euclidean Scoring Function

  18. Comments on Mapping • Search score determines efficiency, not correctness • Issues in efficiency: • Avoid retrieving too many tuples • Avoid retrieving fewer than k top tuples (restarts) How to determine good search scores?

  19. Determining Search Scores • Find k points in data • Compute their score • Set search score to lowest score Challenges: • Determining the initial k points to optimize execution • Taking original query into account

  20. Using Histograms 11 4 Q 20 10

  21. Picking KRepresentative “Tuples” • Collapse histogram bucket to a single representative point • Furthest from Q in bucket (“NoRestarts”) • Closest to Q in bucket (“Restarts”) • Assign bucket frequency to the single representative point • Include closest representative points until we have k tuples

  22. Using Histograms:“NoRestarts” 11 4 Q 20 10

  23. Using Histograms:“Restarts” 11 4 Q 20 10

  24. Other Strategies for Determining Search Scores • Calculate search score for: • n = NoRestarts (“pessimistic” extreme) • r = Restarts (“optimistic” extreme) • Use intermediate scores: • Inter1 = (2n + r)/3 • Inter2= (n + 2r)/3 0 NoRestarts Restarts 1

  25. Evaluating the Generated Selection Query • Sequential scan • Intersection of a set of indexes, followed by data access • Special case: index-only access

  26. Indexes and Statistics • Indexes n-dim (concatenated-key) B-trees • Statistics • MaxDiff as base 1-dim histogram • Multidimensional histograms: AVI, Phased, MHist

  27. Experimental Evaluation • Is mapping to selection queries an effective technique? • Sensitivity of relevant parameters: • Scoring functions • Data skew and dimensionality • Statistics

  28. Data Generation Characterized by Z = <z1, …, zn> • Generate N tuples by Zipfian distribution z1 • Group tuples by attr1 • For a partition with attr1 = a with N1 tuples: • Generate N1values w1, ..., wN1 using Zipfian distribution z2 • Create pairs (a, w1), …, (a, wN1) • Repeat steps to fill in all attribute values

  29. Metrics for Comparison • Fraction of data tuples accessed may be compared to: • Ideal: k • Worst case: size of data set • % of restarts

  30. Exploring Limits • Intrinsic limitations of range-query approach: • Enclose actual top-k tuples in tight n-rectangle • Retrieve all tuples in n-rectangle Less than 1% of database tuples in n-rectangle(k=10; 100,000 tuples) • Effect of retrieving tuples with score > S using an n-rectangle

  31. Effect of Scoring Functions • Min has little/no gap between target region and enclosing n-rectangle As k increases, fraction of retrieved tuples grows slowest for Min • Euclidean performs worse Less tight n-rectangle

  32. Tuples with Score > S v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

  33. Effect of Mapping Strategies and Histograms • Multidimensional histograms aid computation of tight search scores • NoRestarts dominates at high data skew

  34. Tuples Retrieved v. Data Skew(PHASED histogram of 5KB; n=3)

  35. Restarts v. Data Skew(PHASED histogram of 5KB; n=3)

  36. Related Work (1) • [Fagin ‘96; ‘98] • Multimedia attributes with query “subsystem” • Multiple index scans • Independence assumption • [Chaudhuri & Gravano ‘96] • Multimedia attributes with query “subsystem” • Map top-k queries to “selection” queries • Independence assumption • Limited scoring functions

  37. Related Work (2) • [Carey & Kossman ‘97; ‘98] Optimized sorting phase using k • Nearest-neighbor literature • [Donjerkovic & Ramakrishnan ‘99] • Probabilistic optimization framework • No multidimensional scoring functions • Independence assumptions

  38. Summary • Defined mapping of top-k queries to traditional selection queries Exploit existing database statistics and query processors • Studied effect of scoring functions, data skew, statistics on mapping Full experimental analysis forthcoming!

  39. Tuples Retrieved v. Histogram Size(Euclidean; n=3; Z21)

  40. Tuples Retrieved v. n(PHASED histogram of 5KB; Z21)

  41. Restarts v. n(PHASED histogram of 5KB; Z21)

  42. Tuples Retrieved v. k(PHASED histogram of 5KB; Z21; n=3)

  43. Restarts v. k(PHASED histogram of 5KB; Z21; n=3)

  44. Restarts v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

  45. Tuples Retrieved v. Histogram Size(Census Database; PHASED)

  46. Tuples Retrieved v. Data Skew(Euclidean; PHASED histogram of 5KB; n=3)

  47. The Sum Scoring Function

  48. The Max Scoring Function

More Related