
Relax and Adapt: Computing Top-k Matches to XPath Queries

This presentation explores the challenges of efficiently identifying and ranking approximate matches to XPath queries, and strategies for addressing them. It introduces the Whirlpool architecture, which routes partial matches adaptively and prunes early those that cannot reach the top-k.

Presentation Transcript


  1. Relax and Adapt: Computing Top-k Matches to XPath Queries
     Amélie Marian (Columbia University)
     Joint work with: Sihem Amer-Yahia (AT&T Research), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research)

  2. Example
     • Heterogeneous XML data about books
     • Query: book[./info/title="Great Expectations" and ./info/author="Dickens" and ./edition="paperback"]
     • Query root node: distinguished node
     [Figure: the query tree pattern, book with children info and edition, where info has children author (Dickens) and title (Great Expectations) and edition is "paperback"; shown alongside heterogeneous book data trees in which these nodes appear in different positions]
     Amélie Marian - Columbia University

  3. XML Query Relaxation [Amer-Yahia et al. EDBT’02]
     • Tree pattern relaxations:
       - Leaf node deletion
       - Edge generalization
       - Subtree promotion
     [Figure: the query tree pattern (book, info, author, title, edition) compared with book data trees; the edition node is marked "edition?"]
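The three relaxations listed on this slide can be sketched on a small tree-pattern structure. This is a minimal illustration, not the authors' implementation: the `PatternNode` class, the `"pc"`/`"ad"` edge labels (parent-child and ancestor-descendant), and the three function names are assumptions introduced here, and value tests such as `="Dickens"` are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatternNode:
    """Node of a tree pattern (hypothetical structure, for illustration only)."""
    label: str
    edge: str = "pc"  # edge from the parent: "pc" = child (/), "ad" = descendant (//)
    children: List["PatternNode"] = field(default_factory=list)

    def find(self, label: str) -> Optional["PatternNode"]:
        """Return the node with this label (labels are assumed unique)."""
        if self.label == label:
            return self
        for child in self.children:
            hit = child.find(label)
            if hit is not None:
                return hit
        return None

    def parent_of(self, label: str) -> Optional["PatternNode"]:
        """Return the parent of the node with this label, or None."""
        for child in self.children:
            if child.label == label:
                return self
            hit = child.parent_of(label)
            if hit is not None:
                return hit
        return None

    def to_xpath(self) -> str:
        """Render the pattern as nested XPath predicates (value tests omitted)."""
        steps = {"pc": "./", "ad": ".//"}
        return self.label + "".join(
            "[" + steps[c.edge] + c.to_xpath() + "]" for c in self.children
        )


def delete_leaf(root: PatternNode, label: str) -> None:
    """Leaf node deletion: matches no longer need the given leaf."""
    node, parent = root.find(label), root.parent_of(label)
    if node is None or parent is None or node.children:
        raise ValueError(f"{label!r} is not a non-root leaf")
    parent.children.remove(node)


def generalize_edge(root: PatternNode, label: str) -> None:
    """Edge generalization: turn the parent-child edge above a node into a descendant edge."""
    node = root.find(label)
    if node is None or root.parent_of(label) is None:
        raise ValueError(f"{label!r} has no incoming edge")
    node.edge = "ad"


def promote_subtree(root: PatternNode, label: str) -> None:
    """Subtree promotion: re-attach a subtree to its grandparent as a descendant."""
    node, parent = root.find(label), root.parent_of(label)
    grandparent = root.parent_of(parent.label) if parent is not None else None
    if node is None or grandparent is None:
        raise ValueError(f"{label!r} has no grandparent")
    parent.children.remove(node)
    node.edge = "ad"
    grandparent.children.append(node)


# The example query from the previous slide, without value tests.
query = PatternNode("book", children=[
    PatternNode("info", children=[PatternNode("author"), PatternNode("title")]),
    PatternNode("edition"),
])
print(query.to_xpath())
promote_subtree(query, "title")   # title may now appear anywhere below book
print(query.to_xpath())
```

Each call produces a strictly more permissive pattern, which is what lets heterogeneous data trees still match approximately.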

  4. Top-k Queries over XML Data: Motivations and Challenges
     • Structure heterogeneity
       - Efficient identification of approximate matches
     • Top-k
       - Ranking of approximate matches based on similarity to query
       - Early pruning
     • Query processing cost
       - Cost increases with number of matches evaluated
     • Data explosion
       - Many approximate matches
       - XML path queries akin to joins
       - Prioritization to increase pruning

  5. Contributions
     • Whirlpool: adaptive architecture and top-k query processing strategy for XPath queries
       - Goal: early pruning of non-top-k partial matches
       - Approach: partial matches may follow different plans, and may be at different stages of query execution
     • Real prototype implementation of Whirlpool
     • Instantiation of Whirlpool for various “routing strategies” and “prioritization” alternatives

  6. Closely Related Work
     • Adaptive query processing
       - Eddies [Avnur and Hellerstein. SIGMOD’00]:
         - Dynamic query join plans to adapt to processing environment
         - No pruning
     • Adaptive top-k query processing
       - Upper [Bruno et al. ICDE’01]:
         - Prioritization of partial matches based on maximum possible scores
         - Adaptive routing based on scores
         - No joins

  7. Outline
     • Whirlpool Architecture
     • Query Processing
       - Strategy
       - Alternatives
     • Evaluation Settings
     • Evaluation Results

  8. Whirlpool Architecture
     [Figure: a router connected to one server per query node (book, info, edition, title, and author servers) and to the top-k set; the query tree pattern from the example is shown alongside]

  9. Whirlpool Architecture: Components
     • Top-k Set
       - Only one match with a given root node
       - Used for pruning
       - Complete matches are not processed further; incomplete matches are sent to the router
     • Router
       - The router queue is ordered by the partial matches' maximum possible final scores
       - Dynamically chooses which server to send each partial match to, based on the routing strategy
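The role of the top-k set can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the `TopKSet` class and its method names are hypothetical, and the pruning rule shown (drop a partial match whose maximum possible final score is below the current k-th best score) is one plausible reading of "used for pruning".

```python
class TopKSet:
    """Keeps the k best-scoring matches, at most one per root node (hypothetical sketch)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.best = {}  # root node id -> best score recorded for that root

    def threshold(self) -> float:
        """Score of the current k-th best match, or -inf while fewer than k are known."""
        if len(self.best) < self.k:
            return float("-inf")
        return min(self.best.values())

    def offer(self, root_id: str, score: float) -> None:
        """Record a match; a root keeps only its best score, and only k roots are kept."""
        if score > self.best.get(root_id, float("-inf")):
            self.best[root_id] = score
        if len(self.best) > self.k:
            worst = min(self.best, key=self.best.get)
            del self.best[worst]

    def can_prune(self, max_final_score: float) -> bool:
        """A partial match that cannot reach the k-th best score is safe to drop."""
        return max_final_score < self.threshold()


top = TopKSet(k=2)
top.offer("book1", 0.5)
top.offer("book2", 0.9)
top.offer("book1", 0.7)   # better match for the same root replaces the old one
top.offer("book3", 0.8)   # book1 is evicted: only the two best roots are kept
print(sorted(top.best.items()))
print(top.can_prune(0.75), top.can_prune(0.85))
```

Keeping a single entry per root matters because the query returns distinguished (root) nodes, so several matches rooted at the same book should not occupy several of the k slots.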

  10. Whirlpool Architecture: Components
      • Root server:
        - Generates candidate matches
      • Node servers:
        - Maintain a priority queue of partial matches
        - For each partial match that is processed:
          - Compute a set of extended partial (or complete) matches
          - Compute scores of new matches
          - Check partial matches against the current top-k set

  11. Query Processing Alternatives
      • Prioritization Strategies (at each server)
        - FIFO
        - Current Score
        - Maximum Possible Next Score
        - Maximum Possible Final Score
      • Routing Decisions (at the router)
        - Static
        - Score-based
          - Likely to increase score the most
          - Likely to increase score the least
        - Size-based
          - Likely to produce the fewest matches
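The four prioritization strategies differ only in the key used to order a server's queue, which the sketch below makes concrete. Everything named here is an assumption for illustration: the `PartialMatch` fields, the idea that each unvisited server has a known upper bound on the score it can add, and the `ServerQueue` class. "Next" is read as the score after the current server, and "final" as the score after all remaining servers.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PartialMatch:
    root_id: str
    current_score: float
    remaining: Dict[str, float]  # unvisited server -> max score it could add (assumed known)
    arrival: int = 0             # set by the queue when the match is enqueued


# Each key returns a priority; the match with the highest key is processed first.
def fifo_key(m: PartialMatch, server: str) -> float:
    return -m.arrival

def current_score_key(m: PartialMatch, server: str) -> float:
    return m.current_score

def max_next_score_key(m: PartialMatch, server: str) -> float:
    return m.current_score + m.remaining.get(server, 0.0)

def max_final_score_key(m: PartialMatch, server: str) -> float:
    return m.current_score + sum(m.remaining.values())


class ServerQueue:
    """Priority queue of partial matches waiting at one server (hypothetical sketch)."""

    def __init__(self, server: str, key: Callable[[PartialMatch, str], float]) -> None:
        self.server, self.key = server, key
        self._heap = []
        self._arrivals = itertools.count()

    def push(self, m: PartialMatch) -> None:
        m.arrival = next(self._arrivals)
        # heapq is a min-heap, so the key is negated; arrival order breaks ties.
        heapq.heappush(self._heap, (-self.key(m, self.server), m.arrival, m))

    def pop(self) -> PartialMatch:
        return heapq.heappop(self._heap)[2]


def first_processed(key) -> str:
    """Enqueue three sample matches at the title server and return the first one served."""
    queue = ServerQueue("title", key)
    queue.push(PartialMatch("a", 0.5, {"title": 0.1, "author": 0.4}))
    queue.push(PartialMatch("b", 0.6, {"title": 0.05}))
    queue.push(PartialMatch("c", 0.2, {"title": 0.5, "author": 0.1}))
    return queue.pop().root_id


for key in (fifo_key, current_score_key, max_next_score_key, max_final_score_key):
    print(key.__name__, first_processed(key))
```

On this sample the strategies disagree: match `b` has the best current score, `c` could gain the most at the title server, and `a` has the highest possible final score.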

  12. Evaluation Strategies
      • Lockstep (Static)
        - Partial matches follow the same execution plan
        - Partial matches have gone through exactly the same number of operations
      • Whirlpool Single-threaded (Adaptive)
        - Partial matches are adaptively routed
        - Process the partial match with the highest maximum possible final score (query processing similar to Upper)
        - Only one partial match is processed at a time
      • Whirlpool Multi-threaded (Adaptive)
        - The prioritization strategy at each server decides which partial match to process next at that server
        - The system determines which server processes next
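The single-threaded adaptive strategy can be sketched as one loop over a router queue. This is a simplified illustration rather than the actual Whirlpool algorithm: the function name, the `contributions` and `max_contribution` tables, and the `choose_server` callback are all hypothetical, server work is reduced to a table lookup, and a server that finds no matching node is assumed to contribute a score of 0.0 (the node is relaxed away).

```python
import heapq
import itertools


def whirlpool_single_threaded(roots, servers, contributions, max_contribution,
                              k, choose_server):
    """Return (top-k list of (root, score), number of server operations)."""
    tie = itertools.count()  # tie-breaker so the heap never compares payloads
    best = {}                # root -> best current score of any of its matches
    queue = []               # router queue: max-heap on maximum possible final score
    operations = 0

    def bound(score, visited):
        return score + sum(max_contribution[s] for s in servers if s not in visited)

    def threshold():
        if len(best) < k:
            return float("-inf")
        return sorted(best.values(), reverse=True)[k - 1]

    for root in roots:       # the root server generates the candidate matches
        heapq.heappush(queue, (-bound(0.0, frozenset()), next(tie), root, 0.0, frozenset()))

    while queue:
        neg_bound, _, root, score, visited = heapq.heappop(queue)
        if -neg_bound < threshold():
            break            # every remaining match has an even lower bound
        server = choose_server([s for s in servers if s not in visited])
        operations += 1
        # One extended match per node found; 0.0 if the server finds nothing.
        for gain in contributions.get((root, server), [0.0]):
            new_score, new_visited = score + gain, visited | {server}
            best[root] = max(best.get(root, 0.0), new_score)
            if len(new_visited) < len(servers):          # incomplete: back to the router
                new_bound = bound(new_score, new_visited)
                if new_bound >= threshold():             # otherwise pruned
                    heapq.heappush(
                        queue, (-new_bound, next(tie), root, new_score, new_visited))
    ranked = sorted(best.items(), key=lambda item: -item[1])[:k]
    return ranked, operations


servers = ["title", "author", "edition"]
max_contribution = {"title": 0.5, "author": 0.25, "edition": 0.125}
contributions = {
    ("b1", "title"): [0.5], ("b1", "author"): [0.25], ("b1", "edition"): [0.125],
    ("b2", "title"): [0.5], ("b2", "author"): [0.25],
    ("b3", "title"): [0.125],
    ("b4", "edition"): [0.125],
}
static = lambda unvisited: unvisited[0]                                   # fixed order
least_first = lambda unvisited: min(unvisited, key=max_contribution.get)  # score-based
top, ops = whirlpool_single_threaded(
    ["b1", "b2", "b3", "b4"], servers, contributions, max_contribution, 2, static)
print(top, ops)
```

Because scores never decrease as a match is extended, the k-th best current score is a safe pruning threshold, and the loop can stop as soon as the best remaining bound falls below it. Swapping `static` for `least_first` changes the order of server operations but not the answer.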

  13. Evaluation Metrics
      • Parameters:
        - Query size
        - Document size
        - k
        - Parallelism
        - Scoring function (tf.idf proposed in paper)
      • Measures:
        - Query execution time
        - Number of server operations
        - Number of partial matches created

  14. Evaluation Setting
      • C++ implementation, with POSIX threads
      • Default machine:
        - Red Hat 7.1 Linux
        - 1.4 GHz dual processor
        - 2 GB RAM
      • XML documents generated using the XMark generating tool
      • XPath queries chosen from XMark to illustrate different relaxations
      • XML nodes stored using Dewey encoding
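Dewey encoding labels each node with the path of child positions leading to it from the root, so structural relationships reduce to prefix tests on the labels. The sketch below shows the general idea only; the dotted-string format and the function names are assumptions, and the slides do not describe how the prototype stores or compares its labels.

```python
from typing import List, Tuple


def parse(dewey: str) -> Tuple[int, ...]:
    """Turn a label such as "1.2.3" into a tuple of child positions."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is a proper prefix of `descendant` (an ancestor-descendant edge)."""
    a, d = parse(ancestor), parse(descendant)
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(parent: str, child: str) -> bool:
    """True if `child` extends `parent` by exactly one component (a parent-child edge)."""
    return is_ancestor(parent, child) and len(parse(child)) == len(parse(parent)) + 1


def document_order(labels: List[str]) -> List[str]:
    """Sort labels in document order by comparing components numerically."""
    return sorted(labels, key=parse)


print(is_parent("1.2", "1.2.3"), is_ancestor("1.2", "1.2.3.1"))
print(is_ancestor("1.2", "1.20.3"))   # components are compared, not raw string prefixes
print(document_order(["1.10", "1.2", "1.2.1"]))
```

This is why the encoding suits relaxed tree patterns: generalizing a parent-child edge to an ancestor-descendant edge only swaps `is_parent` for `is_ancestor` on the same labels.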

  15. Comparison of Adaptive Routing Strategies
      Whirlpool-S and Whirlpool-M perform approximately the same number of server operations.

  16. Static Routing Strategies vs. Best Adaptive

  17. Effect of Parallelism

  18. Varying Query Size and k
      [Chart, log scale; annotated values: 60%, 48%, 20%]
      For large queries and high values of k, Whirlpool-M performs fewer server operations than Whirlpool-S (and is faster even on a one-processor machine): 27% fewer server operations for q3, k=75.

  19. Varying Query Size and Document Size
      [Chart annotation: almost twice as fast]

  20. Scalability
      Percentage of partial matches created by Whirlpool-M as a function of the maximum possible number of partial matches

  21. Conclusions
      • Efficient adaptive top-k query processing strategy
        - Minimize number of partial matches evaluated
        - Benefit from parallelism with little threading overhead
      • Adapt to different environments
        - Score distribution
        - Selectivity distribution
      • Extensive experimental evaluation
        - Good scalability
