Evaluating Top-k Queries over Web-Accessible Databases Nicolas Bruno Luis Gravano Amélie Marian Columbia University
“Top-k” Queries Natural in Many Scenarios
• Example: NYC Restaurant Recommendation Service.
• Goal: Find the best restaurants for a user:
  • Close to address: “2290 Broadway”
  • Price around $25
  • Good rating
• Query: specification of flexible preferences.
• Answer: best k objects for the distance function.
Attributes Often Handled by External Sources
• MapQuest returns the distance between two addresses.
• NYTimes Review gives the price range of a restaurant.
• Zagat gives a food rating to the restaurant.
“Top-k” Query Processing Challenges
• Attributes handled by external sources (e.g., MapQuest distance).
• External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat).
• Existing algorithms do not handle all types of interfaces.
Processing Top-k Queries over Web-Accessible Data Sources
• Data and query model
• Algorithms for sources with different interfaces
• Our new algorithm: Upper
• Experimental results
Data Model
• Top-k query: an assignment of weights and target values to attributes.
• Example target values: < $25, “2290 Broadway”, very good > (preferred price, address to be close to, preferred rating).
• Example weights: <4, 1, 2>; price is the most important attribute.
• Attribute scores are combined in a scoring function.
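To make the scoring function concrete, here is a minimal Python sketch of the weighted combination described above. The attribute names and the example scores are illustrative assumptions; only the <4, 1, 2> weights come from the slides.

# Minimal sketch of the data model's scoring function: a weighted
# average of per-attribute scores, each normalized to [0, 1].
# Attribute names and the example scores below are hypothetical.

def combined_score(scores: dict, weights: dict) -> float:
    """Combine attribute scores using a weighted average."""
    total_weight = sum(weights.values())
    return sum(weights[a] * scores[a] for a in weights) / total_weight

# Weights <4, 1, 2>: price is the most important attribute.
weights = {"price": 4, "distance": 1, "rating": 2}
scores = {"price": 0.8, "distance": 0.9, "rating": 0.5}  # hypothetical
print(round(combined_score(scores, weights), 2))  # 0.73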
Sorted Access Source (S-Source)
• Returns objects sorted by their scores for a given query.
• Interface: GetNextS
• Access time: tS(S)
• Example: Zagat
Random Access Source (R-Source)
• Returns the score of a given object for a given query.
• Interface: GetScoreR
• Access time: tR(R)
• Example: MapQuest
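The two access interfaces can be captured by something like the following Python sketch. The class and method names are our own assumptions; the slides only define GetNextS (sorted access, time tS(S)) and GetScoreR (random access, time tR(R)).

# Minimal sketch of the two source interfaces; names are hypothetical.
from abc import ABC, abstractmethod

class SSource(ABC):
    """Sorted-access source: streams objects in descending score order."""
    access_time: float  # tS(S)

    @abstractmethod
    def get_next(self, query):
        """Return the next (object_id, score) pair for this query."""

class RSource(ABC):
    """Random-access source: scores one given object at a time."""
    access_time: float  # tR(R)

    @abstractmethod
    def get_score(self, query, object_id) -> float:
        """Return this source's score for the given object."""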
Query Model
• Attribute scores are between 0 and 1.
• Sequential access to sources.
• Score ties are broken arbitrarily.
• No wild guesses.
• One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)
Query Processing Goals
• Process top-k queries over R-Sources.
• Return the exact answer to the top-k query q.
• Minimize query response time.
• The naïve solution (probing all sources for all objects) is too expensive.
Example: NYC Restaurants
• S-Source:
  • Zagat: restaurants sorted by food rating.
• R-Sources:
  • MapQuest: distance between two input addresses (user address: “2290 Broadway”).
  • NYTimes Review: price range of the input restaurant (target value: $25).
TA Algorithm for SR-Sources (Fagin, Lotem, and Naor, PODS 2001)
• Perform sorted access sequentially to all SR-Sources.
• Completely probe every object found, for all attributes, using random access.
• Keep the best k objects.
• Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold).
• Does NOT handle R-Sources.
Our Adaptation of the TA Algorithm for R-Sources: TA-Adapt
• Perform sorted access to the S-Source S.
• Probe every R-Source Ri for each newly found object.
• Keep the best k objects.
• Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold). (A sketch follows.)
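A minimal sketch of TA-Adapt, assuming the source interfaces sketched earlier and the weighted-average scoring function. The bookkeeping (a min-heap of the current best k) is our own choice, not prescribed by the slides.

import heapq

def ta_adapt(query, s_source, r_sources, weights, k):
    """TA-Adapt sketch. weights[0] belongs to the S-Source; weights[i]
    belongs to r_sources[i-1]. All scores are assumed to lie in [0, 1]."""
    total_w = sum(weights)
    top_k = []  # min-heap of (final_score, object_id)
    while True:
        obj, s_score = s_source.get_next(query)                  # sorted access
        r_scores = [r.get_score(query, obj) for r in r_sources]  # probe all
        final = (weights[0] * s_score +
                 sum(w * s for w, s in zip(weights[1:], r_scores))) / total_w
        heapq.heappush(top_k, (final, obj))
        if len(top_k) > k:
            heapq.heappop(top_k)                                 # keep best k
        # Unseen objects score at most s_score on S and 1.0 on each R-Source.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(top_k) == k and top_k[0][0] >= threshold:
            return sorted(top_k, reverse=True)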
An Example Execution of TA-Adapt
Setup: tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1.
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6

Object scores (S, R1, R2 → final): o1: 0.9, 0.1, 0.5 → 0.56; o2: 0.8, 0.7, 0.7 → 0.75; o3: 0.45, 0.6, 0.3 → 0.55.

Access sequence (threshold after each access):
1. Initially, threshold = 1.
2. GetNextS(q) returns o1; threshold = 0.95.
3. GetScoreR1(q, o1), GetScoreR2(q, o1); threshold stays 0.95.
4. GetNextS(q) returns o2; threshold = 0.9.
5. GetScoreR1(q, o2), GetScoreR2(q, o2); threshold stays 0.9.
6. GetNextS(q) returns o3; threshold = 0.725.
7. GetScoreR1(q, o3), GetScoreR2(q, o3); threshold stays 0.725.

For example, after the first sorted access returns o1 with Zagat score 0.9, no unseen object can score above (3·0.9 + 2·1 + 1·1)/6 = 0.95. After the third sorted access, the best object seen (o2, score 0.75) meets the threshold of 0.725, so TA-Adapt stops. Total execution time = 9.
Improvements over TA-Adapt
• Add a shortcut test after each random-access probe (TA-Opt).
• Exploit techniques for processing selections with expensive predicates (TA-EP):
  • Reorder accesses to R-Sources by best weight/time ratio (sketched below).
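A sketch of the reordering heuristic, under the assumption that each R-Source carries its access time as in the interface sketch above: probe first the source whose weight is largest relative to its cost.

# Order R-Sources by descending weight/time ratio, so that cheap,
# influential probes come first. The (source, weight) pairing is hypothetical.

def probe_order(r_sources, r_weights):
    """Return (source, weight) pairs sorted by weight / access_time."""
    return sorted(zip(r_sources, r_weights),
                  key=lambda pair: pair[1] / pair[0].access_time,
                  reverse=True)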
The Upper Algorithm
• Selects a pair (object, source) to probe next.
• Based on the property: the object with the highest upper bound will be probed before the top-k solution is reached.
[Diagram: score ranges distinguishing objects that are, or are not, among the top-k objects.]
An Example Execution of Upper
Setup: tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1.
Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT)/6

Access sequence (Upper always probes the object with the highest upper bound):
1. Initially, threshold = 1.
2. GetNextS(q) returns o1 (S-score 0.9, upper bound 0.95); threshold = 0.95.
3. GetScoreR1(q, o1) returns 0.1; o1's upper bound drops to 0.65.
4. GetNextS(q) returns o2 (S-score 0.8, upper bound 0.9); threshold = 0.9.
5. GetScoreR1(q, o2) returns 0.7; o2's upper bound drops to 0.8.
6. GetNextS(q) returns o3 (S-score 0.45, upper bound 0.725); threshold = 0.725.
7. GetScoreR2(q, o2) returns 0.7; o2's final score is 0.75, which is no less than the threshold (0.725) and every other upper bound (o1: 0.65, o3: 0.725).

By abandoning o1 as soon as its upper bound falls, Upper never probes o1 or o3 completely. Total execution time = 6, versus 9 for TA-Adapt on the same data.
The Upper Algorithm
• Choose the object with the highest upper bound.
• If some unseen object can have a higher upper bound:
  • Access the S-Source S.
• Else:
  • Access the best R-Source Ri for the chosen object.
• Keep the best k objects.
• If the top-k objects have final scores no lower than the maximum possible score of any other object, return the top-k objects.
• Interleaves accesses on objects. (A sketch follows.)
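A simplified Python sketch of Upper's main loop, reusing the interfaces sketched earlier. Two simplifications are ours, not the slides': the S-Source is assumed never to run out of objects before the answer is found, and "best R-Source" is approximated by the first unprobed one (the real algorithm chooses it by weight/time ratio and expected values).

def upper_bound(known, weights, total_w):
    """Best possible final score: unknown attribute scores set to 1.0."""
    return sum(w * (s if s is not None else 1.0)
               for w, s in zip(weights, known)) / total_w

def upper(query, s_source, r_sources, weights, k):
    total_w = sum(weights)
    seen = {}        # object_id -> [s_score, r1_score or None, ...]
    answer = []      # fully probed objects confirmed to be in the top-k
    last_s = 1.0     # S-score of the last object seen via sorted access
    while len(answer) < k:
        # Upper bound on any still-unseen object's score.
        unseen = (weights[0] * last_s + sum(weights[1:])) / total_w
        best = max(seen, default=None,
                   key=lambda o: upper_bound(seen[o], weights, total_w))
        best_b = upper_bound(seen[best], weights, total_w) if best else 0.0
        if best is None or unseen > best_b:
            obj, last_s = s_source.get_next(query)        # sorted access
            seen[obj] = [last_s] + [None] * len(r_sources)
        else:
            known = seen[best]
            missing = [i for i, s in enumerate(known) if s is None]
            if not missing:
                # Highest upper bound and fully probed: it belongs to the
                # answer, since no other object can beat its final score.
                answer.append((best_b, best))
                del seen[best]
            else:
                i = missing[0]                            # random access
                known[i] = r_sources[i - 1].get_score(query, best)
    return answer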
Selecting the Best Source
• Upper relies on expected values to make its choices.
• Upper computes the “best subset” of sources that is expected to:
  • Compute the final score of the k top objects.
  • Discard the other objects as fast as possible.
• Upper then chooses the best source in the “best subset”: the one with the best weight/time ratio.
Experimental Setting: Synthetic Data
• Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated).
• tR(Ri): integer between 1 and 10.
• tS(S) ∈ {0.1, 0.2, …, 1.0}.
• Metric: query execution time ttotal.
• Default: k = 50, 10,000 objects, uniform data.
• Results: average ttotal over 100 queries.
• Optimal assumes complete knowledge (unrealistic, but a useful performance bound).
Experimental Setting: Real Web Data
• S-Source: Verizon Yellow Pages (sorted by distance).
• R-Sources: [source table not preserved]
Experiments: Real-Web Data
[Chart: number of random accesses for each algorithm on the real web sources.]
Evaluation Conclusions
• TA-EP and TA-Opt much faster than TA-Adapt.
• Upper significantly better than all versions of TA.
• Upper close to optimal.
• Real-data experiments: Upper faster than the TA adaptations.
Conclusion
• Introduced the first algorithms for top-k processing over R-Sources.
• Adapted TA to this scenario.
• Presented new algorithms: Upper and Pick (see paper).
• Evaluated our new algorithms with both real and synthetic data.
• Upper close to optimal.
Current and Future Work
• Relaxation of the source model:
  • The current source model is limited.
  • Allow any number of R-Sources and SR-Sources.
  • Upper has good results even with only SR-Sources.
• Parallelism:
  • Define a query model for parallel access to sources.
  • Adapt our algorithms to this model.
• Approximate queries.
References
• Top-k queries:
  • Evaluating Top-k Selection Queries. S. Chaudhuri and L. Gravano. VLDB 1999.
• TA algorithm:
  • Optimal Aggregation Algorithms for Middleware. R. Fagin, A. Lotem, and M. Naor. PODS 2001.
• Variations of TA:
  • Query Processing Issues in Image (Multimedia) Databases. S. Nepal and M. V. Ramakrishna. ICDE 1999.
  • Optimizing Multi-Feature Queries for Image Databases. U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000.
• Expensive predicates:
  • Predicate Migration: Optimizing Queries with Expensive Predicates. J. M. Hellerstein and M. Stonebraker. SIGMOD 1993.
Relaxing the Source Model
[Chart comparing TA-EP and Upper under the relaxed source model.]
Upcoming Journal Paper
• Variations of Upper:
  • Selecting the best source.
  • Data structures.
  • Complexity analysis.
• Relaxing the source model:
  • Adaptation of our algorithms.
  • New algorithms.
• Variations of the data and query model to handle real web data.
Optimality
• TA instance optimal over:
  • Algorithms that do not make wild guesses.
  • Databases that satisfy the distinctness property.
• TAZ instance optimal over:
  • Algorithms that do not make wild guesses.
• No complexity analysis of our algorithms, but an experimental evaluation instead.