This presentation explores the challenges of, and strategies for, efficiently identifying and ranking approximate matches to XPath queries. It introduces the Whirlpool architecture, which routes partial matches adaptively and prunes partial matches that cannot reach the top-k early.
Relax and Adapt: Computing Top-k Matches to XPath Queries
Amélie Marian (Columbia University)
Joint work with: Sihem Amer-Yahia (AT&T Research), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research)
Example
• Heterogeneous XML Data about books
• Query: book[./info/title=“Great Expectations”] and [./info/author=“Dickens”] and [./edition=“paperback”]
Query root node: Distinguished node
[Figure: tree diagrams of book documents built from the nodes book, info, edition (paperback), author (Dickens), and title (Great Expectations), arranged in different structures]
Amélie Marian - Columbia University
XML Query Relaxation [Amer-Yahia et al. EDBT’02]
• Tree pattern relaxations:
  • Leaf node deletion
  • Edge generalization
  • Subtree promotion
[Figure: the query tree pattern (book, info, edition (paperback), author (Dickens), title (Great Expectations)) shown against book data trees; labels: Query, Data, edition?]
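The three relaxations can be sketched on a toy encoding of the example query. This is an illustration only: the dictionary encoding, the function names, and the restriction to unique node labels are assumptions made here, not the representation used in [Amer-Yahia et al. EDBT’02].

```python
# Toy encoding (an assumption for illustration): each non-root query node maps
# to (parent label, axis), where "/" is a child edge and "//" a descendant edge.
QUERY = {
    "info": ("book", "/"),
    "edition": ("book", "/"),
    "title": ("info", "/"),
    "author": ("info", "/"),
}

def edge_generalization(pattern, node):
    """Relax the edge above `node` from child (/) to descendant (//)."""
    parent, _ = pattern[node]
    return {**pattern, node: (parent, "//")}

def leaf_deletion(pattern, node):
    """Drop a leaf node, making it optional in the match."""
    if any(parent == node for parent, _ in pattern.values()):
        raise ValueError(f"{node} is not a leaf")
    return {n: edge for n, edge in pattern.items() if n != node}

def subtree_promotion(pattern, node):
    """Re-attach the subtree rooted at `node` below its grandparent via //."""
    parent, _ = pattern[node]
    if parent not in pattern:
        raise ValueError(f"{node} has no grandparent")
    grandparent, _ = pattern[parent]
    return {**pattern, node: (grandparent, "//")}
```

For instance, `subtree_promotion(QUERY, "title")` attaches title to book with a descendant edge, so a book whose title is not located under info can still match.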
Top-k Queries over XML Data: Motivations and Challenges
• Structure heterogeneity
  • Efficient identification of approximate matches
• Top-k
  • Ranking of approximate matches based on similarity to query
  • Early pruning
• Query processing cost
  • Cost increases with number of matches evaluated
• Data explosion
  • Many approximate matches
  • XML path queries akin to joins
  • Prioritization to increase pruning
Contributions
• Whirlpool: adaptive architecture and top-k query processing strategy for XPath queries
  • Goal: early pruning of non-top-k partial matches
  • Approach: partial matches may follow different plans, and may be at different stages of query execution
• Real prototype implementation of Whirlpool
• Instantiation of Whirlpool for various “routing strategies” and “prioritization” alternatives
Closely Related Work
• Adaptive query processing
  • Eddies [Avnur and Hellerstein. SIGMOD’00]:
    • Dynamic query join plans to adapt to processing environment
    • No pruning
• Adaptive top-k query processing
  • Upper [Bruno et al. ICDE’01]:
    • Prioritization of partial matches based on maximum possible scores
    • Adaptive routing based on scores
    • No joins
Outline
• Whirlpool Architecture
• Query Processing
• Strategy
• Alternatives
• Evaluation Settings
• Evaluation Results
Whirlpool Architecture
[Figure: the query tree pattern (book, info, edition (paperback), author (Dickens), title (Great Expectations)) and the system components: Router, book server, info server, edition server, title server, author server, and Top-k Set]
Whirlpool Architecture: Components
• Top-k Set
  • Only one match with a given root node
  • Used for pruning
  • Complete matches are not processed further; incomplete matches are sent to the router
• Router
  • Router queue is ordered by the partial matches' maximum possible final scores
  • Dynamically chooses which server to send each partial match to, based on the routing strategy
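A minimal sketch of how a top-k set with one entry per root node can drive pruning. The class and method names are hypothetical, and the sketch assumes that a match's score only grows as it is extended, so a current score is a lower bound on the final score; the talk does not spell out these details.

```python
class TopKSet:
    """Toy top-k set keeping the best score per root node (hypothetical API)."""

    def __init__(self, k):
        self.k = k
        self.best = {}  # root node id -> best score seen so far for that root

    def offer(self, root, score):
        """Record a match; only the best-scoring match per root is kept."""
        if score > self.best.get(root, float("-inf")):
            self.best[root] = score

    def threshold(self):
        """k-th best score, or -inf while fewer than k roots are known."""
        if len(self.best) < self.k:
            return float("-inf")
        return sorted(self.best.values(), reverse=True)[self.k - 1]

    def can_prune(self, max_final_score):
        """A partial match that cannot reach the k-th best score is dropped."""
        return max_final_score < self.threshold()

    def top(self):
        """Return the k best (root, score) pairs, highest score first."""
        ranked = sorted(self.best.items(), key=lambda item: -item[1])
        return ranked[: self.k]
```

With k = 2 and scores 0.9, 0.5, and 0.6 for three distinct roots, the threshold is 0.6, so a partial match whose maximum possible final score is 0.55 can be discarded without further processing.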
Whirlpool Architecture: Components
• Root server:
  • Generates candidate matches
• Node servers:
  • Maintain priority queue of partial matches
  • For each partial match that is processed:
    • Compute a set of extended partial (or complete) matches
    • Compute scores of new matches
    • Check partial matches against current top-k set
Query Processing Alternatives
• Prioritization Strategies (at each server)
  • FIFO
  • Current Score
  • Maximum Possible Next Score
  • Maximum Possible Final Score
• Routing Decisions (at the router)
  • Static
  • Score-based
    • Likely to increase score the most
    • Likely to increase score the least
  • Size-based
    • Likely to produce the fewest matches
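The alternatives above can be read as two pluggable decisions: a priority function at each server and a server choice at the router. The sketch below is one possible reading, not the prototype's code; the match fields (`score`, `max_next`, `max_final`), the per-server estimates (`gain`, `matches`), and the strategy names are all hypothetical.

```python
import heapq
from itertools import count

# Prioritization at a server: each strategy maps a partial match to a priority.
PRIORITY = {
    "fifo": lambda m: 0,  # equal priority, so arrival order decides
    "current_score": lambda m: m["score"],
    "max_next_score": lambda m: m["max_next"],
    "max_final_score": lambda m: m["max_final"],
}

class ServerQueue:
    """Priority queue of partial matches held by one server."""

    def __init__(self, strategy):
        self.priority = PRIORITY[strategy]
        self.heap = []
        self.arrival = count()  # tie-breaker that preserves arrival order

    def push(self, match):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self.heap, (-self.priority(match), next(self.arrival), match))

    def pop(self):
        return heapq.heappop(self.heap)[2]

def route(remaining, strategy, estimates, static_order=()):
    """Pick the next server for a partial match among the `remaining` servers."""
    if strategy == "static":  # same fixed order for every partial match
        return next(s for s in static_order if s in remaining)
    if strategy == "max_score":  # likely to increase the score the most
        return max(remaining, key=lambda s: estimates[s]["gain"])
    if strategy == "min_score":  # likely to increase the score the least
        return min(remaining, key=lambda s: estimates[s]["gain"])
    if strategy == "min_matches":  # likely to produce the fewest matches
        return min(remaining, key=lambda s: estimates[s]["matches"])
    raise ValueError(f"unknown routing strategy: {strategy}")
```

Keeping the two decisions separate means any prioritization strategy can be combined with any routing strategy, which matches how the alternatives are listed.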
Evaluation Strategies
• Lockstep (Static)
  • Partial matches follow same execution plan
  • Partial matches have gone through exactly the same number of operations
• Whirlpool Single-threaded (Adaptive)
  • Partial matches adaptively routed
  • Process the partial match with the highest maximum final score (query processing similar to Upper)
  • Only one partial match processed at a time
• Whirlpool Multi-threaded (Adaptive)
  • Prioritization strategy at server decides which partial match to process next at server
  • System determines which server to process next
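The single-threaded adaptive strategy can be sketched as one loop over a global queue ordered by maximum possible final score. This is a heavily simplified illustration under stated assumptions: every server extends a partial match into exactly one new match (no fan-out), scores are integers that only grow, routing always picks the server with the largest possible gain, and the names `whirlpool_s`, `DATA`, and `MAX_GAIN` are invented here.

```python
import heapq

# Hypothetical toy input: per root node, the score each query node contributes.
DATA = {
    "b1": {"title": 5, "author": 3, "edition": 2},
    "b2": {"title": 2, "author": 3},
    "b3": {"author": 1},
}
MAX_GAIN = {"title": 5, "author": 3, "edition": 2}  # best possible gain per node

def whirlpool_s(data, max_gain, k):
    """Process one partial match at a time, highest maximum final score first."""
    topk = {}  # root -> current score (a lower bound on its final score)
    ops = 0    # number of server operations performed
    nodes = tuple(max_gain)
    start = sum(max_gain.values())
    queue = [(-start, root, 0, nodes) for root in sorted(data)]
    heapq.heapify(queue)
    while queue:
        neg_max, root, score, remaining = heapq.heappop(queue)
        if len(topk) >= k:
            kth = sorted(topk.values(), reverse=True)[k - 1]
        else:
            kth = float("-inf")
        if -neg_max < kth or not remaining:
            continue  # pruned (cannot reach the top-k) or already complete
        node = max(remaining, key=lambda n: max_gain[n])  # routing decision
        ops += 1
        score += data[root].get(node, 0)  # a missing node contributes nothing
        remaining = tuple(n for n in remaining if n != node)
        topk[root] = score
        max_final = score + sum(max_gain[n] for n in remaining)
        heapq.heappush(queue, (-max_final, root, score, remaining))
    ranked = sorted(topk.items(), key=lambda item: -item[1])
    return ranked[:k], ops
```

On this toy input with k = 1, `whirlpool_s(DATA, MAX_GAIN, 1)` returns `([("b1", 10)], 5)`: b2 and b3 are each dropped after a single server operation, where a lockstep plan without pruning would perform all 9 operations.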
Evaluation Metrics
• Parameters:
  • Query size
  • Document size
  • k
  • Parallelism
  • Scoring function (tf.idf proposed in paper)
• Measures:
  • Query execution time
  • Number of server operations
  • Number of partial matches created
Evaluation Setting
• C++ implementation, with POSIX threads
• Default machine:
  • Red Hat 7.1 Linux
  • 1.4 GHz dual processor
  • 2 GB RAM
• XML documents generated using the XMark generating tool
• XPath queries chosen from XMark to illustrate different relaxations
• XML nodes stored using Dewey encoding
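Dewey encoding labels each node with the path of child positions from the root, which turns structural tests into prefix comparisons. The sketch below shows the general technique with dot-separated string labels; the label format and function names are assumptions, since the talk does not describe the prototype's storage layout.

```python
def parse(dewey):
    """Turn a label such as "1.2.3" into a tuple of integers (1, 2, 3)."""
    return tuple(int(part) for part in dewey.split("."))

def is_ancestor(a, d):
    """True if node `a` is a proper ancestor of node `d`."""
    a_path, d_path = parse(a), parse(d)
    return len(a_path) < len(d_path) and d_path[: len(a_path)] == a_path

def is_parent(a, d):
    """True if node `a` is the parent of node `d`."""
    return is_ancestor(a, d) and len(parse(d)) == len(parse(a)) + 1
```

Comparing parsed components rather than raw strings matters: "1.2" is a string prefix of "1.20.3" but not an ancestor of it. The parent test corresponds to a child (/) edge and the ancestor test to a descendant (//) edge, which is what edge generalization relaxes.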
Comparison of Adaptive Routing Strategies
• Whirlpool-S (single-threaded) and Whirlpool-M (multi-threaded) perform approximately the same number of server operations
Static Routing Strategies vs. Best Adaptive
Effect of Parallelism
Varying Query Size and k (log scale)
[Figure annotations: 60%, 48%, 20%]
• For large queries and high values of k, Whirlpool-M performs fewer server operations than Whirlpool-S (and is faster even on a one-processor machine)! (27% fewer server operations for q3, k=75)
Varying Query Size and Document Size
[Figure annotation: Almost twice as fast]
Scalability
• Percentage of partial matches created by Whirlpool-M as a function of the maximum possible number of partial matches
Conclusions
• Efficient adaptive top-k query processing strategy
  • Minimize number of partial matches evaluated
  • Benefit from parallelism with little threading overhead
  • Adapt to different environments
    • Score distribution
    • Selectivity distribution
• Extensive experimental evaluation
  • Good scalability