Top-k Query Processing and Optimization

Top-k Query Processing and Optimization 198:541 (slides courtesy of Ihab F. Ilyas and Walid G. Aref)

Outline • Motivation with Examples • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

Motivation • Information Retrieval/ Database systems integration • IR: uncertainty andranking for effective retrieval • Database Systems: advanced data management • Supporting new data types is certainly not enough • True integration includes (among other things): • Indexing • Query processing and optimization • Query Language

Motivation • Applications • Multimedia search by contents (multi-features/examples) • Middleware • Search engines • Data mining • New requirements • Multi-criteria ranking • Rank aggregation from external sources • Joining ranked infinite streams • Most applications are interested in the top-k results

Example 1: Ranking in Multimedia Retrieval Query Color Histogram Edge Histogram Texture Video Database Color Histogram Edge Histogram Texture

RANK( ) OVER in SQL 99 Example 2 SELECT h.id , s.name FROM houses h , schools s WHERE h.location = s.location ORDER BY h.price+10 x s.tuition STOP AFTER 10

Example 2 (Cont’d) Houses Schools

Outline • Motivation with Examples • Theoretical Foundation of Rank Aggregation • Query Model • Top-k Selection • Top-k Join • Ranking in Relational Query Engines • Summary • Open Research Topics

Query Model: Top-k Selection • One relation R with: • n attributes A1, … ,An • m scores s1, …., sm • Scores are expressions over the attributes • Example: s1 = A1 and s2=A2+A3 • A combining function F to compute total score • An example template: • SELECT some_attributes • FROM R • WHERE selection_condition • ORDER BY F(s1,…, sm) • STOP AFTER k

Query Model: Top-k Join • m Relations R1, ….., Rm | Ri has: • n attributes • score attribute, si (can be an expression over other attributes) • A global score for a join result is computed as F(s1,…., sm) • An example template: SELECT some_attributes FROM R1,…..,Rm WHERE join_condition ORDER BY F(s1,…..,sm) STOP AFTER k

Top-k Selection Queries • Approaches: • Filter/Restart method • Rank aggregation from multiple lists • Using indexes and materialized views

Top-k Selection Filter/Restart Method [Carey and Kossman SIGMOD’ 97] [Donjerkovic and Ramakrishnan VLDB’99] [Bruno et al. TODS’02] [Chaudhuri et al. TKDE’04] • Map the top-k query to a selection predicate x > t • For multi-criteria ranking (multiple scoring attributes)  a range query t11< x1 <t12 and t21 < x2 < t22 • Estimate the cut-off t based on k and the data distribution

L1.ID = L2.ID ID ID ID S2 S2 S1 S1 40 20 30 10 50 50 40 30 20 10 30 40 50 20 10 50 40 30 20 10 1 2 3 4 5 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists

Top-k Selection Rank Aggregation from Multiple Lists • Assumptions: • Sorted Inputs on the individual scores • The combining function is monotone • Differ in the access capabilities on the lists • Sorted Access Only • Sorted + Random Access • Random Access Only ! • Differ in pipeline support • The output can serve as input to another instance • Most Algorithms can be cast as a specialization of the A* algorithm

Random Access/ Probes Only Sorted Access Only BOTH Top-k Selection Rank Aggregation from Multiple Lists FA, TA, Quick-combine, Multi-Step NRA, Stream-combine MPro, Upper, Pick

Top-k Selection Rank Aggregation from Multiple Lists • Sorted + Random Access Available • Go in the lists in parallel • Keep track of all “seen” objects • Update scores of objects at each step • Maintain a threshold T: an upper-bound for all the unknown scores • An object is qualified as a top-k if • The object’s combined score is known and is greater than T

ID ID S2 S1 50 40 30 20 10 50 40 30 20 10 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists TA [Fagin et al. PODS’01], Quick-combine[Güntzer et al. VLDB’00], Multi-step[Nepal and Ramakrishna ICDE’99] Algorithm Sketch: F= S1 + S2 T = 100 T = 80 3: (80) 1: (70) 5: (60) 2: (60) 3: (80) 5: (60) Buffer Random Access

Top-k Selection Rank Aggregation from Multiple Lists • No Random Access available • The combined score of an object is two parts: • “Seen”: From lists where we encountered the object • “Unseen”: Upper bound of all missing scores • Sounds familiar?  A* search • An object is qualified as a top-k if • The object’s lower-bound score is greater than the upper-bound score of all other objects

ID ID S2 S1 50 40 30 20 10 50 40 30 20 10 3 2 1 4 5 5 1 3 2 4 Top-k Selection Rank Aggregation from Multiple Lists NRA[Fagin et al. PODS’01], Stream-combine[Güntzer et al. ITCC’00] Algorithm Sketch: F= S1 + S2 5: (50 – 100) 3: (50 – 100) 5: (50 – 90) 3: (50 – 90) 1: (40 – 80) 2: (40 – 80) 3: (80 – 80) 1: (70 – 70) 5: (50 – 80) 2: (40 – 70) Buffer

Top-k Selection Rank Aggregation from Multiple Lists • At least one attribute with sorted-access • Some with no Sorted Access (Probe Attributes) • Schedule the Probe Attributes using statistics in ascending order of their “probing” cost

Top-k Selection Rank Aggregation from Multiple Lists Upper[Bruno et al. ICDE’02] MPro [Chang and Hwang SIGMOD’02], Algorithm Sketch: Upper-bound of the unseen scores Uunseen= 0.9 0.8 0.7 a: 0.85 b: 0.8 a: 0.9 b: 0.8 a: 0.75 b: 0.78 a: 0.75 b: 0.78 a: 0.75 c: 0.7 b: 0.78 a: 0.75 c: 0.7 a: 0.85 Candidates Queue

m å = f v . Score v i i = i 1 Top-k SelectionMaterialized Views (PREFER) [Hristidis et al. SIGMOD’01] • Ranking function • Materialize a view that ranks the entire input relation on fv | v =(v1,v2,...,vm) (predefined weights) • For an input query weights q =(q1,q2,…,qm) • Get a prefix of the view based on v and q • Sort the prefix on fq • The top results are guaranteed to be in the prefix

" Î < Þ < 1 t R , f ( t ) T f ( t ) f ( t ) , v v q q q v Top-k SelectionMaterialized Views (PREFER) • The prefix: • Determine a threshold Tv,q| v =(0.2,0.4,0.4) q =(0.1,0.6,0.3) Maximize fv(t) while maintaining inequality Tv,q = 14.26

Top-k SelectionMaterialized Views (PREFER) • Multiple views • For each query q, a view v that covers q with some prefix constraint • 10 to 100 views is a typical number to cover the space of queries • Heuristic to cover the maximum number of queries using n views • View selection for an input query • Store view coverage in an R-tree • A query is a point in the space • Get the view that contains the query

Top-k Join • m Relations R1, ….., Rm | Ri has: • n attributes • score attribute, si (can be an expression over other attributes) • A global score for a join result is computed as F(s1,…., sm) • An example template: SELECT some_attributes FROM R1,…..,Rm WHERE join_condition ORDER BY F(s1,…..,sm) STOP AFTER k

Supporting Ranking in Relational Databases Approaches: • Map to a multi-dimensional range query (Filter/Restart) • User defined function • Core implementation as a query operator • New Algebra • The notion of ranked relations • Algorithms implement the ranking process

Ranking in Relational DatabasesRange queries [Bruno et al. TODS’02] • Filter/Restart Method • Given a top-k selection query q over a relation R • Use multidimensional histogram H to estimate a search distance dq • The region (q,dq) contains is expected to include the top-k answers • Perform a range query on (q,dq) • Return top-k answers of the results • If #results < k, RESTART

Ranking in Relational DatabasesRange queries • Using histograms • Create a small synthetic Relation S consistent with the histogram on R • Compute the Dist(q,t) for every tuple t in S • dq is the maximum distance between q and the top-k tuples in S • Building S • No restarts: dq is large enough more results need to be filtered • Restarts: restarts possible  less filtering

Supporting Ranking in Relational Databases

Summary • Wide applicability of ranking queries as essential functionality in many applications warrants efficient handling by database systems • One step towards integrating IR effective retrieval and DB effective handling of data • Defined two flavors: • Top-k Selection Queries • Top-k Join Queries • Basic Techniques: • Filter / Restart • Rank Aggregation • Using Indexes and Materialized Views

Summary • Rank Aggregation has solid theoretical background from voting and social selection • Many rank-aggregation algorithms available with similar core idea • Challenges • Unified framework for rank-aware query processing • Integrating rank-aggregation as a basic operation in practical relational database systems

References (1) • [ACE+03] Walid G. Aref, Ann C. Catlin, Ahmed K. Elmagarmid, J. Fan, Moustafa A. Hammad, Ihab F. Ilyas, Mirette Marzouk, Sunil Prabhakar, and X. Zhu. VDBMS: A testbed facility for research in video database bench marking. ACM Multimedia Systems Journal, Special Issue on Multimedia Document Management Systems, 2003. • [ASC02] Sihem Amer-Yahia, SungRan Cho, Divesh Srivastava, Tree Pattern Relaxation, In EDBT, 2002 • [BCG02] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. TODS, 27(2), 2002. • [BGM02] Nicolas Bruno, Luis Gravano, Amélie Marian: Evaluating Top-k Queries over Web-Accessible Databases. In ICDE, 2002 • [CBC+00] YuanChi Chang, Lawrence Bergman, Vittorio Castelli, ChungSheng Li, MingLing Lo, and John R. Smith. The onion technique: indexing for linear optimization queries. In SIGMOD, 2000. • [CK97] Michael J. Carey and Donald Kossmann, On saying ``Enough already !” in SQL, SIGMOD, 1997 Tucson, Arizona

References (2) • [CH02] Kevin ChenChuan Chang and Seung won Hwang. Minimal probing: supporting expensive predicates for top-k queries. In SIGMOD, 2002. • [Con85] M.J. Condorcet. Essai sur l'application de l'analyse a la probabilit e des decisions rendues a la puralite des voix, 1785. • [DKN+01] Cynthia Dwork, S. Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In World Wide Web, 2001. • [DR99] Donko Donjerkovic, Raghu Ramakrishnan: Probabilistic Optimization of Top N Queries. In VLDB 1999 • [Fag99] Ronald Fagin. Combining fuzzy information from multiple systems. Journal of Computer and System Sciences (JCSS), 58(1), Feb 1999. • [FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In PODS, Santa Barbara, California, May 2001. • [GBK00] Ulrich G˜untzer, WolfTilo Balke, and Werner Kießling. Optimizing multifeature queries for image databases. In VLDB, September 10--14, Cairo, Egypt, 2000.

References (3) • [GBK01] Ulrich G˜untzer, WolfTilo Balke, and Werner Kießling. Towards efficient multifeature queries in heterogeneous environments. In ITCC, 2001. • [HGP03] Vagelis Hristidis, Luis Gravano, and Yannis Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, Berlin, Germany, 2003. • [HKP01] Vagelis Hristidis, Nick Koudas, and Yannis Papakonstantinou. PREFER: A system for the efficient execution of multiparametric ranked queries. In SIGMOD, Santa Barbara, California, 2001 • [IAE02] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Joining Ranked Inputs in Practice. In VLDB, Honk-Kong, China, 2002. • [IAE03] Ihab F. Ilyas, Walid G. Aref, and Ahmed K. Elmagarmid. Supporting top-k join queries in relational databases. In VLDB, Berlin, Germany, 2003. • [ISA+04] Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeff Vitter, and Ahmed K. Elmagarmid. Rank-aware Query Optimization. SIGMOD, Paris, France, 2004

References (4) • [LCI+05] Chengkai Li, Kevin. C.-C. Chang, Ihab F. Ilyas, and Sumin Song RankSQL: Query Algebra and Optimization for Relational Top-k Queries. In Proceedings of the 2005 ACM SIGMOD Conference on Management of Data, Baltimore, Maryland (To Appear) • [NCS+01] Apostol Natsev, YuanChi Chang, John R. Smith, ChungSheng Li, and Jeffrey Scott Vitter. Supporting incremental join queries on ranked inputs. In VLDB, Rome, Italy, 2001. • [NR99] Surya Nepal and M. V. Ramakrishna. Query processing issues in image (multimedia) databases. In ICDE, Sydney, Australia, 1999. • [RGM03] Sriram Raghavan and Hector GarciaMolina. Complex queries over web repositories. In VLDB, Berlin, Germany, 2003. • [TPK+03] Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, and Divesh Srivastava. Ranked join indices, ICDE 2003. • [UF01] Tolga Urhan and Michael J. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, Roma, Italy, 2001.

Top-k Query Processing and Optimization