Supporting Ranking in Queries: Score-based Paradigm. Russell Greenspan, CS 411, Spring 2004
Supporting Ranking in Queries Talk Outline • What • Why • How • “Out-of-the-box” support • “Smart” top-k processing
Ranking in Queries What is ranking in queries? • A mechanism to return only the top-k results • Closest matches to user-specified boolean criteria • Scoring results based on user-specified predicates • SELECT Address FROM HousesForSale ORDER BY Best(Size, Price) • Express similarity, relevance, or preference to a given query
What is ranking in queries? Definitions • Intuitive • Output an ordered list of k items such that every item in the list scores higher than every item left out • Formal • “Given retrieval size k and scoring function F, a ranked query returns a list K of k objects (i.e. |K| = k) with query scores, sorted in a descending order, such that F(t1, ..., tn) [u] > F(t1, ..., tn) [v] for all u in K and all v not in K.” [Chang, Hwang, 2002]
What is ranking in queries? Differences from traditional queries • How does this differ from traditional queries? • Traditional queries: • Do not stop processing until all results are computed • Do not focus on ranking tuples to best match the input query • Traditional boolean queries: • Do not return “close” matches • Can “over” or “under” match, producing too few or too many results
Ranking in Queries Why use ranking in queries? • Exact matches not required • Often, something “close enough” satisfies a user’s demands • Fuzzy matches desired • Multimedia/image matching, where the very nature of the query does not involve an exact match • Avoid unnecessary computations • Find the “best” answers quickly as opposed to all answers
Ranking in Queries How do we execute ranked queries? • “Out-of-the-box” support • Perform query as any other, then perform sort and return only first k rows • Why is this bad? • Lots of unnecessary processing • Waste of resources in intermediate results • If scoring function is expensive, could result in computation of unneeded scores • Can we do better?
How do we execute ranked queries? “Smart” Ranked Query Execution • Query Processing • Try to achieve significant reduction in query execution time • Use mid-query (i.e. as query executes) techniques to optimize query plan for top-k results • Consider minimal number of tuples necessary to return k results • Scoring Predicate • Consider expense of scoring function in determining optimal query plan
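To make the cost of the “out-of-the-box” approach concrete, here is a minimal sketch of it in Python. The listings, the `best` function (a stand-in for `Best(Size, Price)`), and the function name `naive_top_k` are all assumptions made for illustration; the point is only that every row is scored and sorted even though k rows are returned.

```python
def naive_top_k(rows, score, k):
    """Score every row, sort all of them, then keep only the first k."""
    scored = [(score(row), row) for row in rows]          # n scoring calls
    scored.sort(key=lambda pair: pair[0], reverse=True)   # full sort of n rows
    top = [row for _, row in scored[:k]]
    return top, len(scored)                               # rows + scoring calls made


# Hypothetical listings: (address, size in square feet, price in dollars).
houses = [
    ("12 Oak St", 1800, 250_000),
    ("3 Elm Ave", 2400, 310_000),
    ("77 Pine Rd", 1500, 150_000),
    ("9 Birch Ln", 2000, 400_000),
]


def best(house):
    """Assumed stand-in for Best(Size, Price): square feet per dollar."""
    _, size, price = house
    return size / price


top2, scoring_calls = naive_top_k(houses, best, 2)
print([address for address, _, _ in top2], scoring_calls)
```

Here two rows are returned but four scores are computed; with an expensive scoring function and a large table, that gap is the waste the following techniques try to remove.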
“Smart” Ranked Query Execution Two Areas of Research Focus • Top-k processing • Reducing number of tuples considered at each intermediate step • Assume minimal work necessary to retrieve items sorted by score (i.e. indexes on simple attributes) • Rank function • Reducing number of calls to ranking function • Assume rank calculation is expensive • Implementing unusual ranking function
“Smart” Ranked Query Execution Research and Techniques • Reducing number of tuples considered • Middleware/Multimedia • Garlic [Fagin, 1999] • CHITRA [Nepal, Ramakrishna, 1999] • Relational • STOP Operator [Carey, Kossmann, 1997] • Probabilistic [Donjerkovic, Ramakrishnan, 1999] • Statistical [Chaudhuri, Gravano, 1999] • Reducing number of calls to ranking function • MPro [Chang, Hwang, 2002] • Implementing unusual ranking function • AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]
“Smart” Ranked Querying (Middleware) – Garlic [Fagin, 1999] • Integrates data from different database systems or non-database data servers • Relational Query Set vs. “Sorted List” • Example: “Return the reddest covers of Beatles’ albums”, i.e. (Artist = ‘Beatles’) AND (AlbumColor LIKE ‘red’), where artists are stored relationally and album colors in a multimedia database • Assign grade to each object • Boolean grade either 0 or 1 • Fuzzy value 0<=x<=1 indicating closeness
Garlic [Fagin, 1999] Rank Processing Methods • How to combine two fuzzy values to retrieve top-k objects? • Inefficient • Consider graded sets of all objects by color and shape • Compute combined score for every object, then output top k objects • Efficient • Retrieve objects (sorted by grade) from each subsystem until there are at least k of the same objects in each set • Compute combined score for each of these k objects
Garlic [Fagin, 1999] Example Query • Example (using combined scoring function x * y): Return Top 2 Color = ‘red’ AND Shape = ‘round’
Garlic [Fagin, 1999] Inefficient vs. Efficient Processing • Inefficient • Calculate combined score for every object • Sort by score • Return top k objects • {G, B}
Garlic [Fagin, 1999] Inefficient vs. Efficient Processing • Efficient (Fagin’s A0 algorithm) • Consider ordered members from each set until there are k of the same object in each set • A1 = {G(.9), D(.8), B(.6)} • A2 = {E(.9), B(.8), G(.7)} • Calculate combined score for each of the k objects (the full A0 algorithm also fetches, by random access, the missing grades of every other object seen so far, here D and E, and scores them before choosing the top k) • G = .9 * .7 = .63 • B = .6 * .8 = .48 • Return these objects ordered by combined score • {G, B}
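The steps of Fagin’s A0 algorithm can be sketched in Python as follows. This is an illustrative sketch, not Garlic’s code: the function name `fagin_a0` is invented, and only the top three grades of each list (G, D, B for color; E, B, G for shape) come from the slide, while the remaining grades are made up so that every object has a grade in both lists. The sketch follows the full algorithm, so it also scores D and E after fetching their missing grades by random access.

```python
def fagin_a0(lists, combine, k):
    """lists maps a subsystem name to {object: grade}; combine must be monotone."""
    ranked = {name: sorted(grades.items(), key=lambda kv: kv[1], reverse=True)
              for name, grades in lists.items()}
    seen = {name: set() for name in lists}
    n_objects = len(next(iter(lists.values())))
    depth = 0
    # Phase 1: sorted access in lockstep until k objects were seen in every list.
    while depth < n_objects:
        for name in lists:
            obj, _ = ranked[name][depth]
            seen[name].add(obj)
        depth += 1
        if len(set.intersection(*seen.values())) >= k:
            break
    # Phase 2: random access for the missing grades of every object seen anywhere.
    candidates = set.union(*seen.values())
    scores = {obj: combine(*(lists[name][obj] for name in lists))
              for obj in candidates}
    # Phase 3: return the k best candidates by combined score.
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return top, depth


color = {"G": 0.9, "D": 0.8, "B": 0.6, "E": 0.3, "A": 0.1}   # E, A: assumed grades
shape = {"E": 0.9, "B": 0.8, "G": 0.7, "D": 0.4, "A": 0.2}   # D, A: assumed grades

top, depth = fagin_a0({"color": color, "shape": shape}, lambda x, y: x * y, 2)
print(top, depth)
```

Sorted access stops after three items per list, and object A is never touched. Scoring D and E matters in general: D’s shape grade is only known to be at most .7, so its product could have been as high as .56, above B’s .48.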
Garlic [Fagin, 1999] Conclusions • Why is this more efficient? • Incur expense of scoring function only for the objects retrieved before stopping (k in the example), as opposed to n times (where n is the total number of items) • Access each subsystem at least k and at most n times, as opposed to n times (again, where n is the total number of items)
“Smart” Ranked Querying (Middleware) – CHITRA [Nepal, Ramakrishna, 1999] • Expands on Fagin’s Garlic system by proposing a new “multi-step” processing algorithm • Experimental results show 50% improvement
CHITRA [Nepal, Ramakrishna, 1999] “Multi-step” Algorithm • Consider the next sorted item x from each subsystem i • Perform random access into every other subsystem to obtain the other grades of x • Add object to result set if its combined score is at least the threshold grade; quit when we have k objects • Threshold is the combined score of the grades of the items considered in the current iteration
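One possible reading of the multi-step algorithm, sketched in Python. This is an interpretation of the slide’s bullets rather than the published CHITRA code: the threshold is taken to be the combined score of the grades at the current sorted position in each subsystem, the function name `multi_step_top_k` is invented, and the grades beyond the top three of each list are assumed values.

```python
def multi_step_top_k(lists, combine, k):
    """lists maps a subsystem name to {object: grade}; combine must be monotone."""
    names = list(lists)
    ranked = {n: sorted(lists[n].items(), key=lambda kv: kv[1], reverse=True)
              for n in names}
    n_objects = len(ranked[names[0]])
    scores = {}
    for depth in range(n_objects):
        frontier = []
        for n in names:
            obj, grade = ranked[n][depth]        # sorted access
            frontier.append(grade)
            if obj not in scores:                # random access for the other grades
                scores[obj] = combine(*(lists[m][obj] for m in names))
        threshold = combine(*frontier)           # best score any unseen object can reach
        qualified = [o for o, s in scores.items() if s >= threshold]
        if len(qualified) >= k:
            qualified.sort(key=scores.get, reverse=True)
            return qualified[:k], depth + 1
    everything = sorted(scores, key=scores.get, reverse=True)
    return everything[:k], n_objects


color = {"G": 0.9, "D": 0.8, "B": 0.6, "E": 0.3, "A": 0.1}   # E, A: assumed grades
shape = {"E": 0.9, "B": 0.8, "G": 0.7, "D": 0.4, "A": 0.2}   # D, A: assumed grades
subsystems = {"color": color, "shape": shape}

by_min, steps_min = multi_step_top_k(subsystems, min, 2)
by_product, steps_product = multi_step_top_k(subsystems, lambda x, y: x * y, 2)
print(by_min, steps_min, by_product, steps_product)
```

Because the scoring function is monotone, no object that has not yet been seen can score above the threshold, which is why the loop may stop as soon as k scored objects reach it.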
CHITRA [Nepal, Ramakrishna, 1999] Example Query • Back to our example... Return Top 2 Color = ‘red’ AND Shape = ‘round’
CHITRA [Nepal, Ramakrishna, 1999] Example Scoring Functions Results • Consider two scoring functions as examples: • min[x, y] • [x * y]
CHITRA [Nepal, Ramakrishna, 1999] Conclusions • Why is this more efficient? • Requires fewer accesses to each subsystem • How do we know this algorithm is correct? • Proof by contradiction • Assume object z which should have been included • If Rank(z) > Rank(y), either: • y must have at least one subsystem rank smaller than all subsystem ranks of z • z must have at least one subsystem rank greater than all subsystem ranks of y • However, since Rank(z) < Threshold and Rank(y) >= Threshold, Rank(z) cannot be greater than Rank(y)
“Smart” Ranked Querying (Relational) – STOP Operator [Carey, et al, 1997] • Specifies extension to SQL-92 standard to allow limit on cardinality of result • STOP AFTER • Return subset of results from each section of query plan • Implement with STOP operator • STOP(N, D, E) where N is the number of desired tuples, D is the Sort Directive [asc, desc, none], and E is the Sort Expression • Heuristically determine when and how to apply
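The behavior of STOP(N, D, E) can be imitated in a few lines of Python. This is only a sketch of the operator’s semantics as listed above, not of its implementation in a query engine; the function name `stop` and the employee rows are invented for illustration.

```python
import heapq
from itertools import islice


def stop(rows, n, direction="none", expr=None):
    """Sketch of STOP(N, D, E): keep n rows, optionally the n best by expr."""
    if direction == "none":
        return list(islice(rows, n))              # any n rows, in arrival order
    if direction == "desc":
        return heapq.nlargest(n, rows, key=expr)  # n highest values of expr
    if direction == "asc":
        return heapq.nsmallest(n, rows, key=expr) # n lowest values of expr
    raise ValueError("direction must be 'asc', 'desc', or 'none'")


# Hypothetical EMP rows: (name, salary).
emp = [("Ann", 120), ("Bob", 110), ("Cy", 100), ("Dee", 90), ("Eli", 80)]

highest_paid = stop(emp, 2, "desc", lambda e: e[1])
lowest_paid = stop(emp, 2, "asc", lambda e: e[1])
any_two = stop(emp, 2)
print(highest_paid, lowest_paid, any_two)
```

`heapq.nlargest` keeps only n rows in memory at a time, which reflects why a STOP placed early in a plan can avoid a full sort of its input.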
STOP Operator [Carey, et al, 1997] Example query plans • Fig a shows traditional JOIN • Join all EMP to DEPT, sort, output top k • Fig b shows implementation of STOP operators • Based on cardinality estimates, only 20 rows of EMP need be joined with 30 rows of DEPT to produce top-k of 10
STOP Operator [Carey, et al, 1997] Conservative Heuristic • Ensures that every tuple in each intermediate result is guaranteed to generate at least one tuple of the overall query result • Advantages • No restarts from intermediate processing returning fewer than k results • Intermediate STOP operators take their N value from overall query k value • Disadvantages • Only inserts STOP operators where all remaining predicates are non-reductive (cannot use with multi-way joins)
STOP Operator [Carey, et al, 1997] Aggressive Heuristic • Applies STOP operator wherever it may be beneficial, thus reducing intermediate results to a greater degree • Choose N value using cardinality estimates • Requires RESTART operator when intermediate processing returns too few results
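A rough Python sketch of the aggressive heuristic with a restart. It makes several simplifying assumptions that are not from the slides: the join is reduced to a membership test on department ids, the restart simply reruns the plan with N doubled (a real RESTART operator would be smarter about reusing work), and the data and the name `top_k_join_aggressive` are hypothetical.

```python
import heapq


def top_k_join_aggressive(emps, joining_depts, k, guess):
    """Place STOP(guess) below a reductive join; restart with a larger N if needed."""
    n = guess
    restarts = 0
    while True:
        # STOP below the join: only the n highest-paid employees flow upward.
        prefix = heapq.nlargest(n, emps, key=lambda e: e[1])
        # The join may discard some of them (a reductive predicate).
        joined = [e for e in prefix if e[2] in joining_depts]
        if len(joined) >= k or n >= len(emps):
            return joined[:k], restarts
        n *= 2                 # assumed restart policy: double N and rerun
        restarts += 1


# Hypothetical EMP rows: (name, salary, dept_id); only depts 10 and 30 join.
emps = [("Ann", 120, 20), ("Bob", 110, 10), ("Cy", 100, 20),
        ("Dee", 90, 30), ("Eli", 80, 10), ("Fay", 70, 30)]

result, restarts = top_k_join_aggressive(emps, {10, 30}, k=2, guess=2)
print(result, restarts)
```

With an initial guess of 2 the join keeps only one employee, so one restart is needed; an initial guess of 4 would have avoided it at the cost of a larger intermediate result, which is the trade-off the heuristics weigh.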
STOP Operator [Carey, et al, 1997] Experimental Results • Which heuristic is better? • Depends on cardinality, expense of processing intermediate results, accuracy of prediction, etc. • With low expense of processing intermediate results, experimental results show aggressive overestimation to be the best
STOP Operator [Carey, et al, 1997] Experimental Results • Performance vs. Traditional (“out-of-the-box”) processing shows benefits in both indexed and non-indexed situations
“Smart” Ranked Querying (Relational) – Probabilistic [Donjerkovic, et al, 1999] • Introduces idea of ‘selection cutoff’ to produce top k results without requiring SORT • Quantifies the risk of fewer than k results being generated using inherent database statistics • “List the top 10 paid employees” becomes “List the employees whose salary is greater than x”, where x is determined by the distribution of employees’ salaries
Probabilistic [Donjerkovic, et al, 1999] Comparison with STOP Operator • In theory, likely to be cheaper to simply ‘select’ the necessary intermediate rows using cutoff (fig b) rather than performing sort and returning top-k (fig a)
Probabilistic [Donjerkovic, et al, 1999] Implementation • Leverage same statistics used by traditional query optimizer to guess cutoff • Histograms • Selectivity factors
Probabilistic [Donjerkovic, et al, 1999] Performance • For simple query using no indexes (return k highest paid employees, no index on ‘Salary’ attribute), easily outperforms traditional (scan, sort, return top k) • Also provides benefit to JOIN queries due to complexity of estimating join selectivity
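A small Python sketch of choosing a selection cutoff from a histogram. It uses a deliberately simple policy (take whole buckets from the top until they hold at least k rows) rather than the paper’s cost-based weighing of restart risk, which is not modeled here. The bucket layout, the salaries, and the function names are assumptions for illustration.

```python
def cutoff_from_histogram(buckets, k):
    """buckets: (low, high, count) with low inclusive. Walk down from the top
    bucket until the covered count reaches k; return that bucket's low bound."""
    covered = 0
    for low, _high, count in sorted(buckets, reverse=True):
        covered += count
        if covered >= k:
            return low
    return float("-inf")        # histogram holds fewer than k rows: no cutoff


def top_k_with_cutoff(salaries, buckets, k):
    x = cutoff_from_histogram(buckets, k)
    selected = [s for s in salaries if s >= x]     # cheap selection, no full sort
    if len(selected) < k:                          # stale statistics: restart
        selected = list(salaries)
    return sorted(selected, reverse=True)[:k], len(selected)


# Hypothetical salaries (in thousands) and an equi-width style histogram over them.
salaries = [30, 35, 42, 48, 51, 55, 58, 63, 71, 88]
buckets = [(30, 40, 2), (40, 50, 2), (50, 60, 3), (60, 70, 1), (70, 90, 2)]

top3, rows_sorted = top_k_with_cutoff(salaries, buckets, 3)
print(top3, rows_sorted)
```

Only three of the ten rows reach the sort. A tighter cutoff would pass fewer rows but raise the chance of a restart when the statistics are inaccurate.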
“Smart” Ranked Querying (Relational) – Statistical [Chaudhuri, Gravano, 1999] • Expansion of probabilistic model • Maps rank queries into boolean range queries • Works with a variety of scoring functions, including Min, Euclidean, and Sum
Statistical [Chaudhuri, Gravano, 1999] Expansion of probabilistic model • Consider multiple levels of ‘selection cutoff’, here referred to as ‘search score’ (Sq) • NoRestarts – score low enough to guarantee no restarts are ever needed • Restarts – score high enough that restarts might result • Intermediate – score between NoRestarts and Restarts
Statistical [Chaudhuri, Gravano, 1999] Implementation • Determine Sq from histograms • Choose bounding tuples in each bucket to ensure NoRestarts (fig a) or tight tuples to minimize selection but potentially require Restarts (fig b)
Statistical [Chaudhuri, Gravano, 1999] Implementation • Determine relational query to retrieve all tuples that score above Sq • Compute n-rectangle bounding such tuples • SELECT * FROM R WHERE (a1<=A1<=b1) AND ... AND (an<=An<=bn) • Compute score for all returned tuples • Output top-k tuples with score > Sq or rerun query with lower search score
Statistical [Chaudhuri, Gravano, 1999] Expansion of Fagin’s model • Expands Fagin’s ideas to relational queries • Substitute ‘search score’ query to determine top tuples for each subsystem • Use NoRestarts strategy to ensure that expensive re-querying is avoided
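The mapping from a search score to a range query can be sketched in Python for the Min scoring function. The assumptions here are mine, not the paper’s: attributes are normalized to [0, 1], the per-attribute similarity to the query point q is 1 - |a - q|, Sq is simply supplied by the caller rather than derived from histograms, and a restart lowers Sq by a fixed step. Function names and data are hypothetical.

```python
def min_score(t, q):
    """Min scoring: the worst per-attribute similarity to the query point."""
    return min(1 - abs(a - b) for a, b in zip(t, q))


def bounding_box(q, sq):
    """Every tuple scoring above sq lies within 1 - sq of q on each attribute."""
    r = 1 - sq
    return [(qi - r, qi + r) for qi in q]


def range_query(rows, box):
    """Stand-in for SELECT * FROM R WHERE a1<=A1<=b1 AND ... AND an<=An<=bn."""
    return [t for t in rows if all(lo <= a <= hi for a, (lo, hi) in zip(t, box))]


def top_k_via_range(rows, q, k, sq, step=0.2):
    restarts = 0
    while True:
        candidates = range_query(rows, bounding_box(q, sq))
        hits = sorted((t for t in candidates if min_score(t, q) > sq),
                      key=lambda t: min_score(t, q), reverse=True)
        if len(hits) >= k or sq <= 0:
            return hits[:k], restarts
        sq = max(0.0, sq - step)    # too few results: rerun with a lower search score
        restarts += 1


rows = [(0.5, 0.6), (0.35, 0.45), (0.9, 0.5), (0.1, 0.1), (0.55, 0.45)]
top2, restarts = top_k_via_range(rows, q=(0.5, 0.5), k=2, sq=0.92)
print(top2, restarts)
```

Starting at Sq = 0.92 the box admits a single tuple, so the query is rerun once with a lower search score. Starting lower would have avoided the restart but scored more tuples, which is the NoRestarts versus Restarts trade-off.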
“Smart” Ranked Querying (Rank) – MPro [Chang, Hwang, 2002] • Extends consideration of top-k querying to expensive predicates (monotonic only) • As opposed to other work, which assumes the expense of score calculation to be minimal • Attempt to minimize the number of scores calculated • Consider only Necessary Probes, i.e. only those calculations without which the top-k results cannot be found
MPro [Chang, Hwang, 2002] Determining if probe is necessary • An object’s lowest calculated score represents “ceiling score” (i.e. it is impossible for any other score for that object to raise its lowest score) • If “ceiling score” falls below top-k object’s complete score, object is ruled out and no further calculations on the object need be performed • Simple Example: • Consider scoring function like Min and top-1 results desired • If we know object A’s combined rank with respect to F(x) and F(y) is .8, and we calculate object B’s score with respect to F(x) to be .7, B’s score with respect to F(y) need not be calculated (its Min value cannot be higher than .7)
MPro [Chang, Hwang, 2002] Determining all necessary probes • Only objects with ceiling scores in the top-k need be further evaluated • If objects are kept in sorted order by current ceiling scores: • For any object u in the top-k slots, its next probe is necessary
MPro [Chang, Hwang, 2002] Minimal Probes Algorithm (MPro) • Priority queue initialization • Evaluate each object over first predicate (same as sequentially accessing objects sorted by x) • Necessary probing • Request from queue the object with highest ceiling score • Evaluate object over next predicate y • Update ceiling score and reinsert into queue • Stop when at least k objects have been completely scored (and output these objects)
MPro [Chang, Hwang, 2002] Further Applications • Incremental results • Output top k, resume processing where it left off for next k as user requests • Fuzzy joins • Consider join predicate in same manner • Parallel processing • Distribute necessary probes across multiple servers • Distribute data, calculate top-n over each chunk, merge results
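The priority-queue loop above can be sketched in Python with `heapq`. This is a minimal illustration and not the authors’ code: predicates are modeled as plain functions, an unevaluated predicate is assumed to contribute the maximum grade 1.0 to the ceiling score, only evaluations after the first predicate are counted as probes, and the grades for A, B, and C are hypothetical (A and B follow the slide’s Min example).

```python
import heapq


def mpro(objects, predicates, combine, k):
    """Probe only the object with the highest ceiling score until k are complete."""
    remaining = len(predicates) - 1
    heap = []
    for obj in objects:
        vals = [predicates[0](obj)]                 # initialization over predicate x
        ceiling = combine(vals + [1.0] * remaining) # unprobed predicates assumed 1.0
        heapq.heappush(heap, (-ceiling, obj, vals)) # negate: heapq is a min-heap
    results, probes = [], 0
    while heap and len(results) < k:
        neg_ceiling, obj, vals = heapq.heappop(heap)
        if len(vals) == len(predicates):            # fully scored and still on top
            results.append((obj, -neg_ceiling))
            continue
        vals = vals + [predicates[len(vals)](obj)]  # a necessary probe
        probes += 1
        ceiling = combine(vals + [1.0] * (len(predicates) - len(vals)))
        heapq.heappush(heap, (-ceiling, obj, vals))
    return results, probes


x = {"A": 0.9, "B": 0.7, "C": 0.5}     # cheap first predicate
y = {"A": 0.8, "B": 0.95, "C": 0.99}   # expensive predicate, probed on demand

top1, probes = mpro(["A", "B", "C"], [x.get, y.get], min, 1)
print(top1, probes)
```

For top-1, only A is probed over y: once A’s complete score of .8 is known, the ceilings of B (.7) and C (.5) can no longer beat it, so their y grades are never computed.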
MPro [Chang, Hwang, 2002] Experimental Results • On experimental dataset, over 96% of complete probes found to be unnecessary • Elapsed time significantly improved, from 21009 to 408 seconds for k = 10
“Smart” Ranked Querying (Rank) – AutoRank [Agrawal, et al, 2003] • Consider ranking of relational attributes in similar way to Information Retrieval (IR) • IDF Similarity • Extend TF-IDF based on frequency of occurrence of attribute values • QF Similarity • Use database workload to determine frequency with which attributes and attribute values are referenced • “Poor man’s relevance feedback” • ITA • Index-based top-k algorithm that exploits above ranking functions
AutoRank [Agrawal, et al, 2003] IDF Similarity • Extend TF (term frequency) • IR – frequency of terms in a document • Relational – frequency of values for an attribute • Extend IDF (inverse document frequency) • IR – total documents / documents containing term • Relational – tuples / tuples where attribute = value • For all tuples matching the queried value, IDF Similarity is the attribute’s IDF (for the queried value), and 0 otherwise
AutoRank [Agrawal, et al, 2003] QF Similarity • Consider problem of IDF where desired result is also the most frequent • Realty database where homes built in the last three years are most desired, but the few entries existing for old homes (with higher IDF) will be considered “top” • Instead, use frequency of occurrence of attribute values in executed queries to determine ranking (by examining workload) • Can extend workload analysis to draw comparative conclusions from attribute values queried together • Assume similarity between ‘Honda’ and ‘Toyota’ if users frequently look for cars by either of these manufacturers
AutoRank [Agrawal, et al, 2003] Implementation • Store approximate representations of IDF and QF values using smooth function • Minimal storage required • IDF and QF values can be quickly retrieved at runtime • ITA (Index-based Threshold Algorithm) • Use available, existing indexes (B+ trees) • Define threshold by computing best tuple in data not yet examined • Stop processing when similarity of this tuple is no greater than similarity of lowest ranking tuple in top-k buffer
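A short Python sketch of IDF similarity over categorical attributes. The slide gives IDF as a plain ratio; taking its logarithm, as is usual in IR, and summing the per-attribute values into one tuple score are assumptions of this sketch, as are the car rows and function names.

```python
import math
from collections import Counter


def idf_table(rows, attrs):
    """IDF per attribute value: log(total tuples / tuples where attribute = value)."""
    n = len(rows)
    return {a: {v: math.log(n / c)
                for v, c in Counter(r[a] for r in rows).items()}
            for a in attrs}


def idf_similarity(row, query, idf):
    """Sum the IDF of each queried value the row matches; a mismatch adds 0."""
    return sum(idf[a][v] for a, v in query.items() if row[a] == v)


# Hypothetical car listings.
rows = [
    {"make": "Honda", "color": "red"},
    {"make": "Honda", "color": "blue"},
    {"make": "Honda", "color": "red"},
    {"make": "Lotus", "color": "blue"},
]
idf = idf_table(rows, ["make", "color"])
query = {"make": "Lotus", "color": "red"}

ranked = sorted(rows, key=lambda r: idf_similarity(r, query, idf), reverse=True)
print(ranked[0])
```

No row matches both conditions, but the blue Lotus ranks first because matching the rare value ‘Lotus’ carries more weight than matching the common value ‘red’.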
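Query-frequency weights can be derived from a workload in a similar way. In this sketch each past query is modeled as a dictionary of the attribute values it referenced, and a value’s QF is its reference count divided by the count of the most-referenced value; that normalization, the workload contents, and the name `qf_table` are assumptions made for illustration.

```python
from collections import Counter


def qf_table(workload, attr):
    """QF per value: how often past queries referenced it, relative to the maximum."""
    counts = Counter(q[attr] for q in workload if attr in q)
    top = max(counts.values(), default=1)
    return {value: count / top for value, count in counts.items()}


# Hypothetical workload: the conditions of six earlier queries.
workload = [
    {"make": "Honda"}, {"make": "Honda"}, {"make": "Honda"},
    {"make": "Toyota"}, {"make": "Toyota", "color": "red"},
    {"make": "Lotus"},
]

qf = qf_table(workload, "make")
print(qf)
```

Values that users ask for often receive high weights regardless of how common they are in the data, which addresses the case where the most desired tuples are also the most frequent ones.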
AutoRank [Agrawal, et al, 2003] Experimental Results • Used large realtor database from http://homeadvisor.microsoft.com and MS SQL Server • Measured result-quality via user studies • For each test query, asked users to identify relevant and irrelevant tuples and compared results of QF and IDF queries to users’ responses • ITA judged to be more efficient than SQL Server’s Top-k operator when indexes exist
Conclusions • Clearly, an exciting and worthwhile field • Research has gone in several directions, but all of it shares roots in Fagin’s and Carey’s work • Combines many areas of computer science • Artificial Intelligence (Fuzzy Logic) • Information Retrieval
The Future • Implementation in major RDBMS vendors • Microsoft should be among the first to revamp their Top-K operator, as in-house research [Agrawal, et al, 2003] has provided a smarter, faster technique • Explore more complex ranking functions that cannot be easily mapped to range queries or used with indexes
References • M. J. Carey, D. Kossmann. On saying “enough already!” in SQL. SIGMOD Conference 1997: 219-230. • D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. VLDB 1999: 411-422. • R. Fagin. Combining Fuzzy Information from Multiple Systems. PODS 1996: 216-226. • S. Nepal, M. V. Ramakrishna. Query Processing Issues in Image (Multimedia) Databases. ICDE 1999: 22-29. • S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. VLDB 1999: 397-410. • K. C. Chang, S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. SIGMOD Conference 2002: 346-357. • S. Agrawal, S. Chaudhuri, G. Das, A. Gionis. Automated Ranking of Database Query Results. CIDR 2003.