Structure and Content Scoring for XML
Amélie Marian (Columbia University)
Joint work with:
• Sihem Amer-Yahia (AT&T Research Labs)
• Nick Koudas (University of Toronto)
• Divesh Srivastava (AT&T Research Labs)
• David Toman (University of Waterloo)

Motivations: XML Data Heterogeneity
• Heterogeneous XML data about books
• Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
• Query root node: distinguished node
[Figure: the query tree pattern (book, info, edition, author, title) next to heterogeneous book data trees]
Amélie Marian - Columbia University

XML Query Relaxation
• Tree pattern relaxations [Amer-Yahia et al. EDBT’02]:
  • Leaf node deletion
  • Edge generalization
  • Subtree promotion
[Figure: the book query, a relaxed form with edition marked optional ("edition?"), and the book data trees]

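A minimal Python sketch of the three relaxations, applied to the book query. The edge-set encoding and the function names are illustrative assumptions, not the authors' implementation. The sketch also assumes a common formulation in which leaf deletion and subtree promotion apply only to edges already generalized to //.

```python
# A query is a frozenset of (node, parent, axis) edges; the root has no edge.
# Encoding and names are hypothetical, chosen only for illustration.
QUERY = frozenset({
    ("info", "book", "/"),
    ("edition", "book", "/"),
    ("author", "info", "/"),
    ("title", "info", "/"),
})

def _edge(query, node):
    """Return the unique (node, parent, axis) edge that ends at `node`."""
    return next(e for e in query if e[0] == node)

def generalize_edge(query, node):
    """Edge generalization: parent/child (/) becomes ancestor/descendant (//)."""
    edge = _edge(query, node)
    return (query - {edge}) | {(node, edge[1], "//")}

def delete_leaf(query, node):
    """Leaf node deletion: drop a childless node reached through a // edge."""
    edge = _edge(query, node)
    is_leaf = all(parent != node for _, parent, _ in query)
    if not is_leaf or edge[2] != "//":
        raise ValueError(f"{node} is not a deletable leaf")
    return query - {edge}

def promote_subtree(query, node):
    """Subtree promotion: re-attach `node` (with its subtree) to its grandparent."""
    edge = _edge(query, node)
    parent_edges = [e for e in query if e[0] == edge[1]]
    if edge[2] != "//" or not parent_edges:
        raise ValueError(f"{node} cannot be promoted")
    grandparent = parent_edges[0][1]
    return (query - {edge}) | {(node, grandparent, "//")}

q1 = generalize_edge(QUERY, "title")  # info//title instead of info/title
q2 = promote_subtree(q1, "title")     # title now hangs off book via //
q3 = delete_leaf(q2, "title")         # title is no longer required
```

Each step produces a query that accepts at least the matches of the previous one, which is what lets relaxed queries reach heterogeneous data.
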
Motivations
• Top-k query processing is suitable for relaxed XML queries over heterogeneous collections
• Return the k XML nodes that are closest to the query structure
• Opportunity for more efficient query processing
• Need a scoring mechanism to identify the best k answers

Contributions
• Scoring mechanism for XML queries
• Data structures for top-k query processing
• Experimental evaluation

Scoring Functions Critical for Top-k Query Processing
• Top-k answer quality depends on the scoring function
• Efficient top-k query processing requires a scoring function that is:
  • Monotonic
  • Fast to compute
• Little attention has been given to scoring functions for structured and semi-structured data
• Scoring functions have been extensively studied over text data (e.g., tf.idf)
• The proposed scoring function for XML data is inspired by tf.idf

Adaptation of tf.idf to XML Queries

Scoring Function for XML Approximate Matches
Required properties:
• Exact matches should be scored higher than relaxed matches (idf)
• Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
• tf.idf, as used in IR, violates the properties above
• Ranking by idf, then breaking ties with tf, satisfies the properties
[Figure: example matches (a) and (b) for the book query, annotated score(a) >= score(b) and score(a) <= score(b)]

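The difference between the two combinations can be sketched in a few lines of Python. The `Match` record, its field names, and the numbers are made up for illustration; only the ordering rule (idf first, tf as tie-breaker) comes from the slide.

```python
from typing import NamedTuple

class Match(NamedTuple):
    node: str    # identifier of the distinguished node (hypothetical)
    idf: float   # idf of the query form this node satisfies
    tf: int      # number of matches found for this node

def rank(matches):
    """Order by idf first; use tf only to break idf ties."""
    return sorted(matches, key=lambda m: (m.idf, m.tf), reverse=True)

def rank_by_product(matches):
    """IR-style tf * idf ordering, shown only for contrast."""
    return sorted(matches, key=lambda m: m.tf * m.idf, reverse=True)

matches = [
    Match("exact", idf=1.228, tf=1),    # satisfies the original query once
    Match("relaxed", idf=1.049, tf=2),  # satisfies only a relaxation, twice
]

lexicographic_order = [m.node for m in rank(matches)]
product_order = [m.node for m in rank_by_product(matches)]
```

With these numbers the product ordering puts the relaxed match first (2 × 1.049 > 1 × 1.228), which breaks the first required property, while the lexicographic ordering keeps the exact match on top.
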
A Family of Scoring Methods for XML Path Queries
• Twig predicate: high quality, expensive computation
• Path predicates
• Binary predicates: low quality, fast computation
[Figure: the book query scored as a single twig, as a sum of path predicates, and as a sum of binary predicates]

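A rough Python sketch of how a twig could be split into smaller predicates. The dictionary encoding and function names are assumptions. The path decomposition simply takes root-to-leaf paths. The binary decomposition shown here (one predicate per non-root node, relating it to the root) is one plausible reading of the slide, not a confirmed definition.

```python
# Twig query as {node: (parent, axis)}; the root (distinguished node) has no entry.
# Encoding and names are hypothetical.
QUERY = {
    "info": ("book", "/"),
    "edition": ("book", "/"),
    "author": ("info", "/"),
    "title": ("info", "/"),
}
ROOT = "book"

def path_predicates(query, root):
    """Return one root-to-leaf path expression per leaf of the twig."""
    inner = {parent for parent, _ in query.values()}
    paths = []
    for leaf in (n for n in query if n not in inner):
        steps, node = [], leaf
        while node != root:              # walk up from the leaf to the root
            parent, axis = query[node]
            steps.append(axis + node)
            node = parent
        paths.append(root + "".join(reversed(steps)))
    return paths

def binary_predicates(query, root):
    """ASSUMPTION: relate every non-root node directly to the root."""
    return [root + (axis if parent == root else "//") + node
            for node, (parent, axis) in query.items()]

paths = path_predicates(QUERY, ROOT)
binaries = binary_predicates(QUERY, ROOT)
```

For the book query this yields three path predicates and four binary predicates, each of which would be scored separately and then summed.
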
Matrix Representation of Twigs
• Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query
• Matrix subsumption is used to compare tuples and queries
[Figure: a query over nodes a, b, c, d, e and a partial tuple (a1, b1, c1, d1, e1) shown as matrices with entries /, //, =, and X; the cases for e are labeled "not joined with e yet", "no matches for e", and "e1 matches"]

Representing Relaxed Query Patterns: DAG Structure
• Each child is more relaxed (has more matches) than its parent
• The idf of a child is no higher than the idf of its parent
• idf scores are accessible in constant time for any match (complete or partial) using a hash function
• Exhaustive algorithm to build the DAG
[Figure: DAG of the relaxations of a query over nodes a, b, c]

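An exhaustive construction of this kind can be sketched as a breadth-first search over one-step relaxations. The query shape (a chain a/b/c), the edge-set encoding, and the function names are assumptions. The relaxation rules follow the common formulation in which leaf deletion and subtree promotion require a // edge.

```python
from collections import deque

def relaxations(query):
    """Yield every query reachable from `query` by one relaxation step.

    A query is a frozenset of (node, parent, axis) edges (hypothetical encoding).
    """
    parents = {node: parent for node, parent, _ in query}
    has_child = set(parents.values())
    for edge in query:
        node, parent, axis = edge
        rest = query - {edge}
        if axis == "/":
            yield rest | {(node, parent, "//")}           # edge generalization
        else:
            if node not in has_child:
                yield rest                                # leaf node deletion
            if parent in parents:                         # parent is not the root
                yield rest | {(node, parents[parent], "//")}  # subtree promotion

def build_dag(query):
    """Map each relaxed query to the set of its direct (more relaxed) children."""
    dag = {query: set()}          # dict keys give hashed, constant-time lookup
    queue = deque([query])
    while queue:
        current = queue.popleft()
        for relaxed in relaxations(current):
            dag[current].add(relaxed)
            if relaxed not in dag:
                dag[relaxed] = set()
                queue.append(relaxed)
    return dag

chain = frozenset({("b", "a", "/"), ("c", "b", "/")})  # the query a/b/c
dag = build_dag(chain)
```

For the a/b/c chain this search reaches ten distinct queries, ending in the query that keeps only the root a.
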
Information stored in the DAG
• idf score information: idf = (1 + |a|) / (1 + |ap|), where |a| is the number of a nodes and |ap| is the number of a nodes that satisfy the query predicate
• For query processing:
  • Best possible score from here
  • Best possible score after each remaining join operation
• Number of matches (useful for tf)
[Figure: the relaxation DAG annotated with idf scores 1.228, 1.2, 1.195, 1.167, 1.195, 1.167, 1.156, 1.049, 1.156, and 1]

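As a small worked example, the idf formula and the hashed lookup could look like this in Python. The `DagEntry` record, the query strings used as keys, and all counts are made up for illustration.

```python
from dataclasses import dataclass

def idf(total_a: int, matching_a: int) -> float:
    """idf = (1 + |a|) / (1 + |ap|), as stated on the slide."""
    return (1 + total_a) / (1 + matching_a)

@dataclass(frozen=True)
class DagEntry:      # hypothetical record, one per relaxed query
    idf: float       # precomputed idf of this relaxation
    matches: int     # number of matching a nodes, later useful for tf

TOTAL_A = 1000  # made-up number of a nodes in the collection
# Made-up match counts for three relaxations of a query rooted at a.
MATCH_COUNTS = {"a/b/c": 813, "a//b//c": 952, "a": 1000}

# Hash table: one dictionary lookup returns everything stored for a relaxation.
TABLE = {q: DagEntry(idf(TOTAL_A, n), n) for q, n in MATCH_COUNTS.items()}
```

The most relaxed query, which every a node satisfies, gets idf 1, and stricter queries with fewer matches get higher values, in line with the scores shown in the figure.
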
Query Processing using the DAG
• Benefits:
  • Score computation is done in a preprocessing phase (using exact or approximate information)
  • Score access during query processing is done in constant time
  • Additional information needed for query processing is precomputed and accessed in constant time (e.g., score upper bound)
  • tf is estimated at runtime based on available information

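To illustrate why a precomputed score upper bound helps, here is a generic top-k pruning loop in Python. It is a textbook pattern, not the authors' query processor. The `(upper_bound, compute_score)` candidate shape is an assumption, with `compute_score` standing in for the remaining join work.

```python
import heapq

def top_k(candidates, k):
    """Return (k best scores in descending order, number of candidates evaluated).

    candidates: iterable of (upper_bound, compute_score) pairs, where
    compute_score is a zero-argument callable (hypothetical interface).
    """
    best = []        # min-heap holding the k best final scores so far
    evaluated = 0
    for upper_bound, compute_score in candidates:
        if len(best) == k and upper_bound <= best[0]:
            continue                      # cannot enter the top k: skip the work
        evaluated += 1
        score = compute_score()
        if len(best) < k:
            heapq.heappush(best, score)
        elif score > best[0]:
            heapq.heapreplace(best, score)
    return sorted(best, reverse=True), evaluated

candidates = [
    (1.228, lambda: 1.2),
    (1.2, lambda: 1.195),
    (1.167, lambda: 1.0),    # pruned: its bound is below the current k-th score
    (1.3, lambda: 1.25),
]
result = top_k(candidates, 2)
```

Because the bound is read in constant time, the pruning test costs almost nothing compared with the joins it avoids.
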
Quality/Space/Time tradeoff
• Binary Predicates
  • Smaller DAG (O(4q))
  • Faster pre-processing (and processing)
  • Lower quality (fewer possible scores)
• Path Predicates and Twig
  • DAG is O(4q^2/2) in space (still reasonable in practice)
  • More pre-processing
  • Higher quality (more differences between scores)

Experimental Setup
• Data:
  • Synthetic heterogeneous document collections generated with ToXgene
  • Real dataset: Wall Street Journal Treebank corpora
• Pregenerated queries exhibiting different sizes, query structures, and predicates
• Measures:
  • DAG size
  • DAG preprocessing time
  • Query processing time
  • Precision (percentage of returned top-k answers that are actual top-k answers, as given by Twig scoring)

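The precision measure can be written down directly. This Python sketch assumes answers are identified by hashable node ids and ignores ties at rank k. The function name and sample data are made up.

```python
def precision_at_k(returned, reference, k):
    """Percentage of the first k returned answers that are in the reference top k."""
    hits = len(set(returned[:k]) & set(reference[:k]))
    return 100.0 * hits / k

# Reference ranking from twig scoring versus a cheaper method (illustrative ids).
twig_ranking = ["n1", "n3", "n5", "n2"]
binary_ranking = ["n1", "n2", "n3", "n4"]
p = precision_at_k(binary_ranking, twig_ranking, 3)
```

Here two of the three returned answers (n1 and n3) are in the twig top 3, so the precision is two thirds, or about 66.7%.
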
XML Scoring Precision
XML Scoring Preprocessing Time
XML Scoring Real data

Conclusions
• Scoring method for XML queries
  • Inspired by tf.idf
  • Accounts for structure and content
  • Accounts for structural relaxations
• Efficient data structures to compute and access scores during top-k query processing
  • DAG
  • Matrix representation of queries and tuples
• Evaluation of the scoring methods' tradeoffs
  • Answer quality vs. preprocessing time

Related Work
• IR Scoring
  • Content only
• XML Scoring
  • Content with structure
  • XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  • None of these techniques accounts for structural relaxations (with the exception of our previous work [ICDE’05])
• XML Structural Relaxation
  • FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

Future Work
• Streaming scenarios
  • Incremental updates on the DAG
  • Approximate scoring
• Integration with approximate text scoring
  • Extend the proposed XML scoring function to handle text content approximation (e.g., misspellings)
  • Unify structure and content scores
• Quality evaluation (INEX)