
Evaluating Top-k Queries over Web-Accessible Databases


Presentation Transcript


  1. Evaluating Top-k Queries over Web-Accessible Databases. Nicolas Bruno, Luis Gravano, Amélie Marian (Columbia University)

  2. “Top-k” Queries Natural in Many Scenarios • Example: NYC Restaurant Recommendation Service. • Goal: find the best restaurants for a user: close to the address “2290 Broadway”, price around $25, good rating. • Query: a specification of flexible preferences. • Answer: the best k objects for the distance function.

  3. Attributes often Handled by External Sources • MapQuest returns the distance between two addresses. • NYTimes Review gives the price range of a restaurant. • Zagat gives a food rating to the restaurant.

  4. “Top-k” Query Processing Challenges • Attributes handled by external sources (e.g., MapQuest distance). • External sources exhibit a variety of interfaces (e.g., NYTimes Review, Zagat). • Existing algorithms do not handle all types of interfaces.

  5. Processing Top-k Queries over Web-Accessible Data Sources • Data and query model • Algorithms for sources with different interfaces • Our new algorithm: Upper • Experimental results

  6. Data Model • Top-k query: an assignment of weights and target values to attributes. • Target values: <$25, “2290 Broadway”, very good> (preferred price, close-to address, preferred rating). • Weights: <4, 1, 2>, combined in a scoring function; price is the most important attribute.
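A minimal sketch of such a scoring function (the attribute ordering and the example scores are illustrative assumptions; only the weighted combination itself comes from the slide):

```python
# Hypothetical sketch of the slide's weighted scoring function.
# Attribute scores are assumed already normalized to [0, 1]
# (1 = perfect match with the target value).

def combined_score(scores, weights):
    """Weighted average of per-attribute scores; with weights <4, 1, 2>
    price is the most important attribute, as on the slide."""
    assert len(scores) == len(weights)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Example: price, address-distance, and rating scores for one restaurant.
print(combined_score([0.8, 0.5, 0.9], [4, 1, 2]))  # (3.2 + 0.5 + 1.8) / 7 = 0.7857...
```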

  7. Sorted Access Source (S-Source) • Returns objects sorted by their scores for a given query. • Example: Zagat. • Interface: GetNextS. • Access time: tS(S).

  8. Random Access Source (R-Source) • Returns the score of a given object for a given query. • Example: MapQuest. • Interface: GetScoreR. • Access time: tR(R).
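The two interfaces can be modeled as follows (a sketch: the method names mirror the slides' GetNextS and GetScoreR, while the class layout and Python signatures are assumptions). The later sketches in this transcript reuse these classes.

```python
from abc import ABC, abstractmethod

class SSource(ABC):
    """Sorted-access source: streams (object, score) pairs in descending
    score order for a query, e.g., Zagat restaurants by food rating."""
    access_time: float  # tS(S)

    @abstractmethod
    def get_next(self, query):  # GetNextS
        """Return the next (object_id, score) pair, or None when exhausted."""

class RSource(ABC):
    """Random-access source: scores one given object for a query,
    e.g., MapQuest distance between two addresses."""
    access_time: float  # tR(R)

    @abstractmethod
    def get_score(self, query, object_id):  # GetScoreR
        """Return the score of object_id for the query."""
```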

  9. Query Model • Attribute scores between 0 and 1. • Sequential access to sources. • Score ties broken arbitrarily. • No wild guesses. • One S-Source (or SR-Source) and multiple R-Sources. (More on this later.)

  10. Query Processing Goals • Process top-k queries over R-Sources. • Return the exact answer to the top-k query q. • Minimize query response time. • The naïve solution (access all sources for all objects) is too expensive.

  11. Example: NYC Restaurants • S-Source: Zagat (restaurants sorted by food rating). • R-Sources: • MapQuest: distance between two input addresses (user address: “2290 Broadway”). • NYTimes Review: price range of the input restaurant (target value: $25).

  12. TA Algorithm for SR-Sources (Fagin, Lotem, and Naor, PODS 2001) • Perform sorted access sequentially to all SR-Sources. • Completely probe every object found, for all attributes, using random access. • Keep the best k objects. • Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold). • Does NOT handle R-Sources.
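With one S-Source, the threshold has a simple closed form (a reconstruction from the slides' definitions, not a formula shown verbatim): unseen objects can score at most the last S score on the sorted attribute and 1 on every randomly accessed attribute, so

threshold = (wS · s_last + Σi wRi) / (wS + Σi wRi).

For example, with the weights <3, 2, 1> used on the next slide, a first Zagat score of 0.9 gives threshold = (3·0.9 + 2 + 1) / 6 = 0.95.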

  13. Our Adaptation of TA for R-Sources: TA-Adapt • Perform sorted access to the S-Source S. • Probe every R-Source Ri for each newly found object. • Keep the best k objects. • Stop when the scores of the best k objects are no less than the maximum possible score of unseen objects (the threshold). (A Python sketch follows.)
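A condensed sketch of TA-Adapt, reusing the source interfaces above (the heap bookkeeping and tuple shapes are assumptions; the control flow follows the four bullets):

```python
import heapq

def ta_adapt(query, s_source, r_sources, weights, k):
    """Sketch of TA-Adapt: sorted access to the single S-Source, then
    random probes of every R-Source for each newly found object.
    weights[0] belongs to the S-Source, weights[1:] to r_sources."""
    total_w = sum(weights)
    top = []  # min-heap of (final_score, object_id): the best k so far
    while True:
        nxt = s_source.get_next(query)
        if nxt is None:
            break  # S-Source exhausted
        obj, s_score = nxt
        r_scores = [r.get_score(query, obj) for r in r_sources]  # probe all R-Sources
        final = (weights[0] * s_score +
                 sum(w * s for w, s in zip(weights[1:], r_scores))) / total_w
        heapq.heappush(top, (final, obj))
        if len(top) > k:
            heapq.heappop(top)  # keep only the k best
        # Unseen objects score at most s_score on S and 1.0 on each R-Source.
        threshold = (weights[0] * s_score + sum(weights[1:])) / total_w
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen object can beat the current best k
    return sorted(top, reverse=True)
```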

  14. An Example Execution of TA-Adapt • Setup: tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1. • Final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6.

| Object | S (Zagat) | R1 (MapQuest) | R2 (NYTimes) | Final (rounded) |
|--------|-----------|---------------|--------------|-----------------|
| o1 | 0.9 | 0.1 | 0.5 | 0.57 |
| o2 | 0.8 | 0.7 | 0.7 | 0.75 |
| o3 | 0.45 | 0.6 | 0.3 | 0.48 |

Access sequence (threshold starts at 1): GetNextS(q) → o1, threshold = 0.95; GetScoreR1(q, o1); GetScoreR2(q, o1); GetNextS(q) → o2, threshold = 0.9; GetScoreR1(q, o2); GetScoreR2(q, o2); GetNextS(q) → o3, threshold = 0.725; GetScoreR1(q, o3); GetScoreR2(q, o3). Total execution time = 9.
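Tracing the stopping condition through this run: after the third GetNextS returns o3 with S score 0.45, the threshold falls to (3·0.45 + 2 + 1) / 6 = 0.725. The best fully probed object, o2, has final score (3·0.8 + 2·0.7 + 1·0.7) / 6 = 0.75 ≥ 0.725, so TA-Adapt stops after 9 unit-time accesses.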

  15. Improvements over TA-Adapt • Add a shortcut test after each random-access probe (TA-Opt; see the sketch below). • Exploit techniques for processing selections with expensive predicates (TA-EP). • Reorder accesses to R-Sources by best weight/time ratio.
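A sketch of the TA-Opt shortcut test under the same assumptions (the helper names are hypothetical; the idea from the slide is to re-check viability after every probe, not only after an object is fully probed):

```python
def upper_bound(s_score, r_scores, weights, total_w):
    """Best possible final score of a partially probed object: known
    scores count as-is, each unprobed R attribute counts as 1.0."""
    v = weights[0] * s_score
    v += sum(w * (1.0 if s is None else s) for w, s in zip(weights[1:], r_scores))
    return v / total_w

# TA-Opt's shortcut: after every single random probe, discard the object
# as soon as its upper bound falls below the k-th best final score so far.
def still_viable(s_score, r_scores, weights, total_w, kth_best_score):
    return upper_bound(s_score, r_scores, weights, total_w) >= kth_best_score
```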

  16. The Upper Algorithm • Selects a pair (object, source) to probe next. • Based on this property: the object with the highest upper bound will be probed before the top-k solution is reached. [Figure legend: object is one of the top-k objects / object is not one of the top-k objects.]

  17. An Example Execution of Upper • Same setup and objects as the TA-Adapt example: tS(S) = tR(R1) = tR(R2) = 1, w = <3, 2, 1>, k = 1; final score = (3·scoreZagat + 2·scoreMQ + 1·scoreNYT) / 6. Access sequence (threshold starts at 1): GetNextS(q) → o1 (S = 0.9), threshold = 0.95, U(o1) = 0.95; GetScoreR1(q, o1) → 0.1, U(o1) drops to 0.65; GetNextS(q) → o2 (S = 0.8), threshold = 0.9, U(o2) = 0.9; GetScoreR1(q, o2) → 0.7, U(o2) drops to 0.8; GetNextS(q) → o3 (S = 0.45), threshold = 0.725, U(o3) = 0.725; GetScoreR2(q, o2) → 0.7, final(o2) = 0.75, which beats U(o1) = 0.65, U(o3) = 0.725, and the unseen-object threshold 0.725. Total execution time = 6 (o1's R2 score and o3's R-Source scores are never retrieved).

  18. The Upper Algorithm • Choose the object with the highest upper bound. • If some unseen object could have a higher upper bound: access the S-Source S. • Else: access the best R-Source Ri for the chosen object. • Keep the best k objects. • If the top-k objects have final scores higher than the maximum possible score of any other object, return the top-k objects. • Interleaves accesses across objects. (A Python sketch follows.)
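A compact sketch of this loop, again reusing the interfaces above (assumptions: ties go to probing, the "best" R-Source is simplified to the next unprobed one, and the paper's best-subset selection on the next slide refines that choice):

```python
def upper(query, s_source, r_sources, weights, k):
    """Sketch of Upper: always work on the object with the highest
    upper bound, interleaving sorted and random accesses.
    weights[0] belongs to the S-Source, weights[1:] to r_sources."""
    total_w = sum(weights)
    s_last, s_done = 1.0, False
    objs = {}  # object_id -> {"s": S score, "r": R scores (None = unprobed)}

    def ubound(o):
        # Best possible final score: unprobed R attributes count as 1.0.
        v = weights[0] * o["s"] + sum(
            w * (1.0 if s is None else s) for w, s in zip(weights[1:], o["r"]))
        return v / total_w

    while True:
        unseen = 0.0 if s_done else (weights[0] * s_last + sum(weights[1:])) / total_w
        partial = {oid: o for oid, o in objs.items() if None in o["r"]}
        done = sorted(((ubound(o), oid) for oid, o in objs.items()
                       if None not in o["r"]), reverse=True)
        # Stop when k final scores beat every other upper bound, seen or unseen.
        if len(done) >= k and done[k - 1][0] >= max(
                [ubound(o) for o in partial.values()] + [unseen]):
            return done[:k]
        best = max(partial, key=lambda oid: ubound(partial[oid]), default=None)
        if best is None and s_done:
            return done[:k]  # database exhausted before k objects completed
        if best is None or unseen > ubound(partial[best]):
            nxt = s_source.get_next(query)      # sorted access to the S-Source
            if nxt is None:
                s_done = True
            else:
                oid, s_last = nxt
                objs[oid] = {"s": s_last, "r": [None] * len(r_sources)}
        else:
            i = partial[best]["r"].index(None)  # simplification: next unprobed R-Source
            partial[best]["r"][i] = r_sources[i].get_score(query, best)
```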

  19. Selecting the Best Source • Upper relies on expected values to make its choices. • Upper computes the “best subset” of sources that is expected to: compute the final score for the top-k objects, and discard the other objects as fast as possible. • Upper then chooses the best source in the “best subset”, by best weight/time ratio (see the sketch below).
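The weight/time heuristic itself is small enough to state directly (a sketch; the access_time attribute comes from the interface sketch above, the rest is assumed):

```python
# Rank R-Sources by how much score information they yield per unit of
# access time: probe high-weight, cheap sources first.
def order_r_sources(r_sources, r_weights):
    return sorted(zip(r_sources, r_weights),
                  key=lambda sw: sw[1] / sw[0].access_time,
                  reverse=True)
```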

  20. Experimental Setting: Synthetic Data • Attribute scores randomly generated (three data sets: uniform, Gaussian, and correlated). • tR(Ri): integer between 1 and 10. • tS(S) ∈ {0.1, 0.2, …, 1.0}. • Metric: query execution time ttotal. • Defaults: k = 50, 10,000 objects, uniform data. • Results: average ttotal over 100 queries. • Optimal assumes complete knowledge (unrealistic, but a useful performance bound).

  21. Experiments: Varying the Number of Objects Requested, k

  22. Experiments: Varying the Number of Database Objects, N

  23. Experimental Setting: Real Web Data • S-Source: Verizon Yellow Pages (restaurants sorted by distance). • R-Sources: [table of R-Sources not captured in the transcript].

  24. Experiments: Real-Web Data [chart: number of random accesses]

  25. Evaluation Conclusions • TA-EP and TA-Opt much faster than TA-Adapt. • Upper significantly better than all versions of TA. • Upper close to optimal. • Real data experiments: Upper faster than TA adaptations.

  26. Conclusion • Introduced first algorithm for top-k processing over R-Sources. • Adapted TA to this scenario. • Presented new algorithms: Upper and Pick (see paper) • Evaluated our new algorithms with both real and synthetic data. • Upper close to optimal

  27. Current and Future Work • Relaxation of the source model: the current source model is limited; allow any number of R-Sources and SR-Sources; Upper has good results even with only SR-Sources. • Parallelism: define a query model for parallel access to sources; adapt our algorithms to this model. • Approximate queries.

  28. References • Top-k queries: Evaluating Top-k Selection Queries, S. Chaudhuri and L. Gravano. VLDB 1999. • TA algorithm: Optimal Aggregation Algorithms for Middleware, R. Fagin, A. Lotem, and M. Naor. PODS 2001. • Variations of TA: Query Processing Issues in Image (Multimedia) Databases, S. Nepal and M.V. Ramakrishna. ICDE 1999; Optimizing Multi-Feature Queries for Image Databases, U. Güntzer, W.-T. Balke, and W. Kießling. VLDB 2000. • Expensive predicates: Predicate Migration: Optimizing Queries with Expensive Predicates, J.M. Hellerstein and M. Stonebraker. SIGMOD 1993.

  29. Real-web Experiments

  30. Real-web Experiments with Adaptive Time

  31. Relaxing the Source Model [chart comparing TA-EP and Upper]

  32. Upcoming Journal Paper • Variations of Upper • Select best source • Data Structures • Complexity Analysis • Relaxing Source Model • Adaptation of our Algorithms • New Algorithms • Variations of Data and Query Model to handle real web data

  33. Optimality • TA is instance optimal over: algorithms that do not make wild guesses, and databases that satisfy the distinctness property. • TAz is instance optimal over: algorithms that do not make wild guesses. • No complexity analysis of our algorithms; experimental evaluation instead.
