1 / 39

Top-k Query Processing and Optimization

Top-k Query Processing and Optimization. 198:541 (slides courtesy of Ihab F. Ilyas and Walid G. Aref). Outline. Motivation with Examples Query Model Top-k Selection Top-k Join Ranking in Relational Query Engines Summary Open Research Topics. Outline. Motivation with Examples

miracle
Download Presentation

Top-k Query Processing and Optimization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Top-k Query Processing and Optimization 198:541 (slides courtesy of Ihab F. Ilyas and Walid G. Aref)

  2. Outline • Motivation with Examples • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  3. Outline • Motivation with Examples • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  4. Motivation • Information Retrieval/ Database systems integration • IR: uncertainty andranking for effective retrieval • Database Systems: advanced data management • Supporting new data types is certainly not enough • True integration includes (among other things): • Indexing • Query processing and optimization • Query Language

  5. Motivation • Applications • Multimedia search by contents (multi-features/examples) • Middleware • Search engines • Data mining • New requirements • Multi-criteria ranking • Rank aggregation from external sources • Joining ranked infinite streams • Most applications are interested in the top-k results

  6. Example 1: Ranking in Multimedia Retrieval Query Color Histogram Edge Histogram Texture Video Database Color Histogram Edge Histogram Texture

  7. RANK( ) OVER in SQL 99 Example 2 SELECT h.id , s.name FROM houses h , schools s WHERE h.location = s.location ORDER BY h.price+10 x s.tuition STOP AFTER 10

  8. Example 2 (Cont’d) Houses Schools

  9. Outline • Motivation with Examples • Theoretical Foundation of Rank Aggregation • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  10. Query Model: Top-k Selection • One relation R with: • n attributes A1, … ,An • m scores s1, …., sm • Scores are expressions over the attributes • Example: s1 = A1 and s2=A2+A3 • A combining function F to compute total score • An example template: • SELECT some_attributes • FROM R • WHERE selection_condition • ORDER BY F(s1,…, sm) • STOP AFTER k

  11. Query Model: Top-k Join • m Relations R1, ….., Rm | Ri has: • n attributes • score attribute, si (can be an expression over other attributes) • A global score for a join result is computed as F(s1,…., sm) • An example template: SELECT some_attributes FROM R1,…..,Rm WHERE join_condition ORDER BY F(s1,…..,sm) STOP AFTER k

  12. Top-k Selection Queries • Approaches: • Filter/Restart method • Rank aggregation from multiple lists • Using indexes and materialized views

  13. Top-k Selection Filter/Restart Method [Carey and Kossman SIGMOD’ 97] [Donjerkovic and Ramakrishnan VLDB’99] [Bruno et al. TODS’02] [Chaudhuri et al. TKDE’04] • Map the top-k query to a selection predicate x > t • For multi-criteria ranking (multiple scoring attributes)  a range query t11< x1 <t12 and t21 < x2 < t22 • Estimate the cut-off t based on k and the data distribution

  14. L1.ID = L2.ID ID ID ID S2 S2 S1 S1 40 20 30 10 50 50 40 30 20 10 30 40 50 20 10 50 40 30 20 10 1 2 3 4 5 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists

  15. Top-k Selection Rank Aggregation from Multiple Lists • Assumptions: • Sorted Inputs on the individual scores • The combining function is monotone • Differ in the access capabilities on the lists • Sorted Access Only • Sorted + Random Access • Random Access Only ! • Differ in pipeline support • The output can serve as input to another instance • Most Algorithms can be cast as a specialization of the A* algorithm

  16. Random Access/ Probes Only Sorted Access Only BOTH Top-k Selection Rank Aggregation from Multiple Lists FA, TA, Quick-combine, Multi-Step NRA, Stream-combine MPro, Upper, Pick

  17. Top-k Selection Rank Aggregation from Multiple Lists • Sorted + Random Access Available • Go in the lists in parallel • Keep track of all “seen” objects • Update scores of objects at each step • Maintain a threshold T: an upper-bound for all the unknown scores • An object is qualified as a top-k if • The object’s combined score is known and is greater than T

  18. ID ID S2 S1 50 40 30 20 10 50 40 30 20 10 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists TA [Fagin et al. PODS’01], Quick-combine[Güntzer et al. VLDB’00], Multi-step[Nepal and Ramakrishna ICDE’99] Algorithm Sketch: F= S1 + S2 T = 100 T = 80 3: (80) 1: (70) 5: (60) 2: (60) 3: (80) 5: (60) Buffer Random Access

  19. Top-k Selection Rank Aggregation from Multiple Lists • No Random Access available • The combined score of an object is two parts: • “Seen”: From lists where we encountered the object • “Unseen”: Upper bound of all missing scores • Sounds familiar?  A* search • An object is qualified as a top-k if • The object’s lower-bound score is greater than the upper-bound score of all other objects

  20. ID ID S2 S1 50 40 30 20 10 50 40 30 20 10 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists NRA[Fagin et al. PODS’01], Stream-combine[Güntzer et al. ITCC’00] Algorithm Sketch: F= S1 + S2 5: (50 – 100) 3: (50 – 100) 5: (50 – 90) 3: (50 – 90) 1: (40 – 80) 2: (40 – 80) 3: (80 – 80) 1: (70 – 70) 5: (50 – 80) 2: (40 – 70) Buffer

  21. Top-k Selection Rank Aggregation from Multiple Lists • At least one attribute with sorted-access • Some with no Sorted Access (Probe Attributes) • Schedule the Probe Attributes using statistics in ascending order of their “probing” cost

  22. Top-k Selection Rank Aggregation from Multiple Lists Upper[Bruno et al. ICDE’02] MPro [Chang and Hwang SIGMOD’02], Algorithm Sketch: Upper-bound of the unseen scores Uunseen= 0.9 0.8 0.7 a: 0.85 b: 0.8 a: 0.9 b: 0.8 a: 0.75 b: 0.78 a: 0.75 b: 0.78 a: 0.75 c: 0.7 b: 0.78 a: 0.75 c: 0.7 a: 0.85 Candidates Queue

  23. m å = f v . Score v i i = i 1 Top-k SelectionMaterialized Views (PREFER) [Hristidis et al. SIGMOD’01] • Ranking function • Materialize a view that ranks the entire input relation on fv | v =(v1,v2,...,vm) (predefined weights) • For an input query weights q =(q1,q2,…,qm) • Get a prefix of the view based on v and q • Sort the prefix on fq • The top results are guaranteed to be in the prefix

  24. " Î < Þ < 1 t R , f ( t ) T f ( t ) f ( t ) , v v q q q v Top-k SelectionMaterialized Views (PREFER) • The prefix: • Determine a threshold Tv,q| v =(0.2,0.4,0.4) q =(0.1,0.6,0.3) Maximize fv(t) while maintaining inequality Tv,q = 14.26

  25. Top-k SelectionMaterialized Views (PREFER) • Multiple views • For each query q, a view v that covers q with some prefix constraint • 10 to 100 views is a typical number to cover the space of queries • Heuristic to cover the maximum number of queries using n views • View selection for an input query • Store view coverage in an R-tree • A query is a point in the space • Get the view that contains the query

  26. Outline • Motivation with Examples • Theoretical Foundation of Rank Aggregation • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  27. Top-k Join • m Relations R1, ….., Rm | Ri has: • n attributes • score attribute, si (can be an expression over other attributes) • A global score for a join result is computed as F(s1,…., sm) • An example template: SELECT some_attributes FROM R1,…..,Rm WHERE join_condition ORDER BY F(s1,…..,sm) STOP AFTER k

  28. Outline • Motivation with Examples • Theoretical Foundation of Rank Aggregation • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  29. Supporting Ranking in Relational Databases Approaches: • Map to a multi-dimensional range query (Filter/Restart) • User defined function • Core implementation as a query operator • New Algebra • The notion of ranked relations • Algorithms implement the ranking process

  30. Ranking in Relational DatabasesRange queries [Bruno et al. TODS’02] • Filter/Restart Method • Given a top-k selection query q over a relation R • Use multidimensional histogram H to estimate a search distance dq • The region (q,dq) contains is expected to include the top-k answers • Perform a range query on (q,dq) • Return top-k answers of the results • If #results < k, RESTART

  31. Ranking in Relational DatabasesRange queries • Using histograms • Create a small synthetic Relation S consistent with the histogram on R • Compute the Dist(q,t) for every tuple t in S • dq is the maximum distance between q and the top-k tuples in S • Building S • No restarts: dq is large enough more results need to be filtered • Restarts: restarts possible  less filtering

  32. Supporting Ranking in Relational Databases

  33. Outline • Motivation with Examples • Theoretical Foundation of Rank Aggregation • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

  34. Summary • Wide applicability of ranking queries as essential functionality in many applications warrants efficient handling by database systems • One step towards integrating IR effective retrieval and DB effective handling of data • Defined two flavors: • Top-k Selection Queries • Top-k Join Queries • Basic Techniques: • Filter / Restart • Rank Aggregation • Using Indexes and Materialized Views

  35. Summary • Rank Aggregation has solid theoretical background from voting and social selection • Many rank-aggregation algorithms available with similar core idea • Challenges • Unified framework for rank-aware query processing • Integrating rank-aggregation as a basic operation in practical relational database systems

  36. References (1) • [ACE+03] Walid G. Aref, Ann C. Catlin, Ahmed K. Elmagarmid, J. Fan, Moustafa A. Hammad, Ihab F. Ilyas, Mirette Marzouk, Sunil Prabhakar, and X. Zhu. VDBMS: A testbed facility for research in video database bench marking. ACM Multimedia Systems Journal, Special Issue on Multimedia Document Management Systems, 2003. • [ASC02] Sihem Amer-Yahia, SungRan Cho, Divesh Srivastava, Tree Pattern Relaxation, In EDBT, 2002 • [BCG02] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. TODS, 27(2), 2002. • [BGM02] Nicolas Bruno, Luis Gravano, Amélie Marian: Evaluating Top-k Queries over Web-Accessible Databases. In ICDE, 2002 • [CBC+00] YuanChi Chang, Lawrence Bergman, Vittorio Castelli, ChungSheng Li, MingLing Lo, and John R. Smith. The onion technique: indexing for linear optimization queries. In SIGMOD, 2000. • [CK97] Michael J. Carey and Donald Kossmann, On saying ``Enough already !” in SQL, SIGMOD, 1997 Tucson, Arizona

  37. References (2) • [CH02] Kevin ChenChuan Chang and Seung won Hwang. Minimal probing: supporting expensive predicates for top-k queries. In SIGMOD, 2002. • [Con85] M.J. Condorcet. Essai sur l'application de l'analyse a la probabilit e des decisions rendues a la puralite des voix, 1785. • [DKN+01] Cynthia Dwork, S. Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In World Wide Web, 2001. • [DR99] Donko Donjerkovic, Raghu Ramakrishnan: Probabilistic Optimization of Top N Queries. In VLDB 1999 • [Fag99] Ronald Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences (JCSS), 58(1), Feb 1999. • [FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In PODS, Santa Barbara, California, May 2001. • [GBK00] Ulrich G˜untzer, WolfTilo Balke, and Werner Kießling. Optimizing multifeature queries for image databases. In VLDB, September 10--14, Cairo, Egypt, 2000.

  38. References (3) • [GBK01] Ulrich G˜untzer, WolfTilo Balke, and Werner Kießling. Towards efficient multifeature queries in heterogeneous environments. In ITCC, 2001. • [HGP03] Vagelis Hristidis, Luis Gravano, and Yannis Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, Berlin, Germany, 2003. • [HKP01] Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. PREFER: A system for the efficient execution of multiparametric ranked queries. In SIGMOD, Santa Barbara, California, 2001 • [IAE02] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Joining Ranked Inputs in Practice. In VLDB, Honk-Kong, China, 2002. • [IAE03] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, Berlin, Germany, 2003. • [ISA+04] Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeff Vitter, and Ahmed K. Elmagarmid. Rank-aware Query Optimization. SIGMOD, Paris, France, 2004

  39. References (4) • [LCI+05] Chengkai Li, Kevin. C.-C. Chang,  Ihab F. Ilyas, and  Sumin Song RankSQL: Query Algebra and Optimization for Relational Top-k Queries. In Proceedings of the 2005 ACM SIGMOD Conference on Management of Data, Baltimore, Maryland (To Appear) • [NCS+01] Apostol Natsev, YuanChi Chang, John R. Smith, ChungSheng Li, and Jeffrey Scott Vitter. Supporting incremental join queries on ranked inputs. In VLDB, Rome, Italy, 2001. • [NR99] Surya Nepal and M. V. Ramakrishna. Query processing issues in image (multimedia) databases. In ICDE, Sydney, Australia, 1999. • [RGM03] Sriram Raghavan and Hector GarciaMolina. Complex queries over web repositories. In VLDB, Berlin, Germany, 2003. • [TPK+03] Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, and Divesh Srivastava. Ranked join indices, ICDE 2003. • [UF01] Tolga Urhan and Michael J. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, Roma, Italy, 2001.

More Related