Structure and Content Scoring for XML

Amélie Marian (Columbia University)
Joint work with: Sihem Amer-Yahia (AT&T Research Labs), Nick Koudas (University of Toronto), Divesh Srivastava (AT&T Research Labs), David Toman (University of Waterloo)

Motivations: XML Data Heterogeneity
• Heterogeneous XML data about books
• Query: book[./info[./title="Great Expectations" and ./author="Dickens"] and ./edition="paperback"]
• Query root node: distinguished node
[Figure: the query tree pattern (book with children info and edition (paperback); info with children title (Great Expectations) and author (Dickens)) next to book data trees that arrange the same elements differently]
Amélie Marian - Columbia University

XML Query Relaxation [Amer-Yahia et al. EDBT’02]
• Tree pattern relaxations:
  • Leaf node deletion
  • Edge generalization
  • Subtree promotion
[Figure: the book query pattern and book data trees, with an "edition?" annotation on the data]
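The three relaxations listed above can be sketched on a small tree-pattern encoding. This is a simplified illustration, not the formulation from [Amer-Yahia et al. EDBT’02]: the dictionary encoding, the function names, and the assumption that node labels are unique within a pattern are all hypothetical choices made for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: node label -> (parent label, edge type).
# "/" is a parent-child edge, "//" an ancestor-descendant edge.
Pattern = Dict[str, Tuple[Optional[str], str]]

QUERY: Pattern = {
    "book": (None, ""),        # query root (distinguished node)
    "info": ("book", "/"),
    "edition": ("book", "/"),
    "title": ("info", "/"),
    "author": ("info", "/"),
}


def is_leaf(pattern: Pattern, node: str) -> bool:
    """A node is a leaf if no other node names it as its parent."""
    return all(parent != node for parent, _ in pattern.values())


def edge_generalization(pattern: Pattern, node: str) -> Pattern:
    """Relax the edge above `node` from "/" to "//"."""
    parent, _ = pattern[node]
    if parent is None:
        raise ValueError("the root has no incoming edge")
    relaxed = dict(pattern)
    relaxed[node] = (parent, "//")
    return relaxed


def leaf_deletion(pattern: Pattern, node: str) -> Pattern:
    """Drop a non-root leaf from the pattern."""
    if pattern[node][0] is None or not is_leaf(pattern, node):
        raise ValueError("only non-root leaves can be deleted")
    relaxed = dict(pattern)
    del relaxed[node]
    return relaxed


def subtree_promotion(pattern: Pattern, node: str) -> Pattern:
    """Re-attach the subtree rooted at `node` to its grandparent with "//"."""
    parent, _ = pattern[node]
    if parent is None or pattern[parent][0] is None:
        raise ValueError("node needs a grandparent to be promoted")
    relaxed = dict(pattern)
    relaxed[node] = (pattern[parent][0], "//")
    return relaxed
```

For instance, `subtree_promotion(QUERY, "title")` yields a pattern in which title hangs off book through a descendant edge, so it would also match a book whose title is not nested under info.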

Motivations
• Top-k query processing is suitable for relaxed XML queries over heterogeneous collections
• Return the k XML nodes that are closest to the query structure
• Opportunity for more efficient query processing
• Need a scoring mechanism to identify the best k answers

Contributions
• Scoring mechanism for XML queries
• Data structures for top-k query processing
• Experimental evaluation

Scoring Functions Critical for Top-k Query Processing
• Top-k answer quality depends on the scoring function
• Efficient top-k query processing requires a scoring function that is:
  • Monotonic
  • Fast to compute
• Little attention has been given to scoring functions for structured and semi-structured data
• Scoring has been studied extensively over text data (e.g., tf.idf)
• Proposed scoring function for XML data is inspired by tf.idf

Adaptation of tf.idf to XML Queries

Scoring Function for XML Approximate Matches
Required properties:
• Exact matches should be scored higher than relaxed matches (idf)
• Distinguished nodes with several matches should be ranked higher than those with fewer matches (tf)
How to combine tf and idf?
• tf.idf, as used in IR, violates the above properties
• Ranking based on idf, then breaking ties using tf, satisfies the properties
[Figure: example book matches (a) and (b), with the annotations score(a) >= score(b) and score(a) <= score(b)]
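The "idf first, tf as tie-breaker" ordering can be sketched as a lexicographic sort. The `Answer` type, the function names, and the tf values below are hypothetical and serve only to illustrate the ranking rule; the product-based ranking is included for contrast.

```python
from typing import List, NamedTuple


class Answer(NamedTuple):
    node_id: str
    idf: float  # score of the (possibly relaxed) query this answer matches
    tf: int     # number of matches found under the distinguished node


def rank_idf_then_tf(answers: List[Answer], k: int) -> List[Answer]:
    """Order by idf first; tf only breaks ties between equal idf values."""
    return sorted(answers, key=lambda a: (a.idf, a.tf), reverse=True)[:k]


def rank_by_product(answers: List[Answer], k: int) -> List[Answer]:
    """IR-style tf * idf product, shown only for contrast."""
    return sorted(answers, key=lambda a: a.tf * a.idf, reverse=True)[:k]


answers = [
    Answer("exact", idf=1.228, tf=1),
    Answer("relaxed", idf=1.0, tf=3),
    Answer("exact_twice", idf=1.228, tf=2),
]

print([a.node_id for a in rank_idf_then_tf(answers, 3)])
print([a.node_id for a in rank_by_product(answers, 3)])
```

With these made-up numbers, the lexicographic ranking returns exact_twice, exact, relaxed, while the product ranking puts relaxed first, so a relaxed match outranks both exact matches and the first required property is violated.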

A Family of Scoring Methods for XML
• Twig predicate: high quality, expensive computation
• Path predicates
• Binary predicates: low quality, fast computation
[Figure: the book query shown as a single twig, as a sum of path queries, and as a sum of binary predicates]

Matrix Representation of Twigs
• Twigs (queries and tuples) can be represented by matrices that capture all relationships in the query
• Matrix subsumption is used to compare tuples and queries
[Figure: a query twig over nodes a, b, c, d, e and a partial tuple (a1, b1, c1, d1, e1), each with its matrix; annotations: "not joined with e yet", "no matches for e", "e1 matches"; the matrix entries are not recoverable from the transcript]

Representing Relaxed Query Patterns: DAG Structure
• Each child is more relaxed (has more matches) than its parent
• idf of a child is no higher than the idf of its parent
• idf scores are accessible in constant time for any match (complete or partial) using a hash function
• Exhaustive algorithm to build the DAG
[Figure: DAG of relaxed query patterns over nodes a, b, c]
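One way to picture an exhaustive DAG construction is a breadth-first expansion of one-step relaxations. The sketch below is an assumption-laden simplification, not the algorithm from the talk: a query is modeled as a set of binary predicates between the root and another node, and a predicate is relaxed either by generalizing "/" to "//" or, once it is already "//", by dropping it. All names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Tuple

Predicate = Tuple[str, str, str]  # (root label, edge type, node label)
Query = FrozenSet[Predicate]


def one_step_relaxations(query: Query) -> List[Query]:
    """All queries reachable by relaxing exactly one predicate by one step."""
    children = []
    for pred in sorted(query):
        root, edge, node = pred
        rest = query - {pred}
        if edge == "/":
            children.append(rest | {(root, "//", node)})  # generalize the edge
        else:
            children.append(rest)  # drop the predicate entirely
    return children


def build_dag(query: Query) -> Dict[Query, List[Query]]:
    """Map every relaxed query to its children, starting from the exact query."""
    dag: Dict[Query, List[Query]] = {}
    pending = deque([query])
    while pending:
        current = pending.popleft()
        if current in dag:
            continue  # already expanded via another parent
        dag[current] = one_step_relaxations(current)
        pending.extend(dag[current])
    return dag


query = frozenset({("book", "/", "edition"), ("book", "//", "title")})
dag = build_dag(query)
print(len(dag))
```

Because each DAG node is a hashable `frozenset`, it can serve directly as a dictionary key, which is one simple way to get the constant-time access the slide mentions.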

Information stored in the DAG
• idf score information: idf = (1 + |a|) / (1 + |ap|), where |a| is the number of a nodes and |ap| is the number of a nodes that satisfy the query predicate
• For query processing:
  • Best possible score from here
  • Best possible score after each of the remaining join operations
  • Number of matches (useful for tf)
[Figure: the DAG over nodes a, b, c annotated with idf scores 1.228, 1.2, 1.195, 1.167, 1.156, 1.049 and 1]
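The idf formula above can be checked with a few lines of arithmetic. The node counts below are invented for illustration; only the formula itself comes from the slide.

```python
def idf(total_nodes: int, matching_nodes: int) -> float:
    """idf = (1 + |a|) / (1 + |ap|), following the formula on the slide."""
    if not 0 <= matching_nodes <= total_nodes:
        raise ValueError("matching_nodes must lie between 0 and total_nodes")
    return (1 + total_nodes) / (1 + matching_nodes)


total = 1000  # hypothetical number of a nodes in the collection
matches = {"exact query": 200, "relaxed query": 500, "most relaxed query": 1000}

for name, count in matches.items():
    print(f"{name}: idf = {idf(total, count):.3f}")
```

A more relaxed query matches more nodes, so its idf can only go down, and a query that every a node satisfies gets an idf of exactly 1, which is consistent with the lowest score shown in the DAG figure.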

Query Processing using the DAG
• Benefits:
  • Score computation done in a preprocessing phase (using exact or approximate information)
  • Score access during query processing done in constant time
  • Additional information needed for query processing precomputed and accessed in constant time (e.g., score upper bound)
• tf estimated at runtime based on available information
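A minimal sketch of how precomputed scores and upper bounds might be consulted during top-k processing, assuming a plain hash table keyed by the predicates a partial match has satisfied so far. The table layout, the key format, the function names, and the numbers are all illustrative placeholders rather than the data structure used in the talk.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple


class DagEntry(NamedTuple):
    idf: float            # precomputed score of this (partial) pattern
    best_possible: float  # precomputed upper bound on any completion


# Hypothetical table filled in during the preprocessing phase.
SCORE_TABLE: Dict[FrozenSet[str], DagEntry] = {
    frozenset({"book/info"}): DagEntry(idf=1.049, best_possible=1.228),
    frozenset({"book//info"}): DagEntry(idf=1.0, best_possible=1.167),
    frozenset({"book/info", "book/edition"}): DagEntry(idf=1.228, best_possible=1.228),
}


def lookup(satisfied: Iterable[str]) -> DagEntry:
    """Constant-time (hash-based) access to the precomputed entry."""
    return SCORE_TABLE[frozenset(satisfied)]


def can_prune(satisfied: Iterable[str], kth_best_score: float) -> bool:
    """A partial match is useless if even its best completion cannot
    reach the score of the current k-th best answer."""
    return lookup(satisfied).best_possible < kth_best_score


print(can_prune({"book//info"}, kth_best_score=1.195))
print(can_prune({"book/info"}, kth_best_score=1.195))
```

The strict comparison keeps partial matches whose upper bound ties the current k-th score, since a tie could still be resolved in their favor by tf.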

Quality/Space/Time tradeoff
• Binary Predicates
  • Smaller DAG (O(4q))
  • Faster pre-processing (and processing)
  • Lower quality (fewer possible scores)
• Path Predicates and Twig
  • DAG is O(4q^2/2) in space (still reasonable in practice)
  • More pre-processing
  • Higher quality (more differences between scores)
(For some queries, some or all techniques may produce the same results)

Experimental Setup
• Data:
  • Synthetic heterogeneous document collections generated with ToXgene
  • Real dataset: Wall Street Journal Treebank corpora
• Pregenerated queries exhibiting different sizes, query structures and predicates
• Measures:
  • DAG size
  • DAG preprocessing time
  • Query processing time
  • Precision (percentage of top-k answers that are actual top-k answers, as given by Twig)
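The precision measure described above can be sketched as a set overlap between two ranked lists. This is a simplified reading that ignores ties at rank k; the function name and the node identifiers are hypothetical.

```python
from typing import List


def precision_at_k(candidate: List[str], reference: List[str], k: int) -> float:
    """Percentage of the candidate's top-k answers that also appear in the
    reference top-k (here, the ranking produced by Twig scoring)."""
    if k <= 0:
        raise ValueError("k must be positive")
    reference_top = set(reference[:k])
    hits = sum(1 for answer in candidate[:k] if answer in reference_top)
    return 100.0 * hits / k


binary_ranking = ["n1", "n2", "n3", "n4"]  # e.g. ranking under binary scoring
twig_ranking = ["n1", "n3", "n5", "n2"]    # reference ranking under twig scoring

print(f"{precision_at_k(binary_ranking, twig_ranking, 3):.1f}%")
```

In this made-up case two of the three top answers (n1 and n3) agree with the reference, so the precision is two thirds.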

XML Scoring Precision

XML Scoring Preprocessing Time

XML Scoring Real data

Conclusions
• Scoring method for XML queries
  • Inspired by tf.idf
  • Accounts for structure and content
  • Accounts for structural relaxations
• Efficient data structures to compute and access scores during top-k query processing
  • DAG
  • Matrix representation of queries and tuples
• Evaluation of the tradeoffs between scoring methods
  • Answer quality vs. preprocessing time

Related Work
• IR Scoring
  • Content only
• XML Scoring
  • Content with structure
  • XIRQL [XML&IR’00], JuruXML [SIGIR’03], IR-CADG [WebDB’04]
  • None of these techniques accounts for structural relaxations (with the exception of our previous work [ICDE’05])
• XML Structural Relaxation
  • FleXPath [SIGMOD’04], Kanza and Sagiv [PODS’01], Schlieder [EDBT’02], Delobel and Rousset [FMII’01]

Future Work
• Streaming scenarios
  • Incremental updates on DAG
  • Approximate scoring
• Integration with approximate text scoring
  • Extend proposed XML scoring function to handle text content approximation (e.g., misspellings)
  • Unify structure and content score
• Quality evaluation (INEX)
