Cooperative XML Query Answering: The CoXML Approach

Cooperative XML (CoXML) Query Answering

Motivation • XML has become the standard format for information representation and data exchange • The amount of XML data available on the web has increased explosively, e.g., • Bills at the Library of Congress • IEEE Computer Society publications • SwissProt – protein sequence databases • XMark – online auction data • … • Effective XML search methods are needed!

Challenges • XML schemas are usually very complex • E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths • It is often unrealistic for users to fully understand a schema before asking queries • Exact query answering is inadequate; approximate query answering is more appropriate!

Approach: CoXML • Derive approximate answers by relaxing query conditions, i.e., query relaxation • [Figure: diagram with the labels Query, Cooperative XML Query Answering, Approximate Answers, XML Database Engine, XML Documents]

Roadmap • Introduction • Background • CoXML • Related Work • Conclusion

XML Data Model • XML data is often modeled as an ordered labeled tree • Tree nodes: elements • Tree edges: element-nesting relationships • [Figure: sample data tree rooted at article (node 1), with element nodes title, author, name, title, year, body, section and content "Search engine spam detection", "2003", "XYZ", "IEEE Fellow", and "..a spam detection technique by content analysis…"]

XML Query Model • XML queries are often modeled as trees • Structure conditions: a set of query nodes connected by • Parent-to-child ('/'): directly connected • Ancestor-to-descendant ('//'): connected (either directly or indirectly) • Content conditions: • Either value predicates or keyword constraints on query nodes • Example: [Figure: query tree with nodes article, title, year, section and conditions "search engine", 2003, "spam detection"]

XML Query Answer • An answer for a query is a set of nodes in a data tree that satisfies both structure and content conditions • Example: [Figure: the query tree (article, title, year, section; "search engine", 2003, "spam detection") shown next to the data tree (article, title, author, name, title, year, body, section)]
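The matching idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the CoXML matcher: the class names `DataNode` and `QueryNode`, the substring test for keyword conditions, and the choice of a '//' edge above `section` are all assumptions made for the example. The function only checks whether a match exists rather than returning the matching node set.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    label: str
    text: str = ""
    children: list = field(default_factory=list)

    def descendants(self):
        # All nodes strictly below this one, in document order.
        for child in self.children:
            yield child
            yield from child.descendants()

    def full_text(self):
        # Own text plus the text of every descendant, lower-cased.
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(parts).lower()


@dataclass
class QueryNode:
    label: str
    keywords: str = ""   # keyword constraint; empty means no content condition
    edge: str = "/"      # edge from the parent: "/" or "//"
    children: list = field(default_factory=list)


def matches(q, d):
    """True if data node d satisfies query node q and q's whole subtree."""
    if q.label != d.label:
        return False
    if q.keywords and q.keywords.lower() not in d.full_text():
        return False
    for qc in q.children:
        # '/' looks only at children, '//' looks at all descendants.
        candidates = d.children if qc.edge == "/" else d.descendants()
        if not any(matches(qc, c) for c in candidates):
            return False
    return True


article = DataNode("article", children=[
    DataNode("title", "Search engine spam detection"),
    DataNode("author", children=[DataNode("name", "XYZ"),
                                 DataNode("title", "IEEE Fellow")]),
    DataNode("year", "2003"),
    DataNode("body", children=[
        DataNode("section", "..a spam detection technique by content analysis…")]),
])

query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("year", "2003"),
    QueryNode("section", "spam detection", edge="//"),
])

# Same query, but section must be a direct child of article.
strict_query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("year", "2003"),
    QueryNode("section", "spam detection", edge="/"),
])

print(matches(query, article), matches(strict_query, article))
```

With the '//' edge the sample article is an answer; with a '/' edge it is not, because `section` sits under `body`. That gap is what query relaxation is meant to close.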

XML Query Relaxation Types • Value relaxation: enlarging a value condition's search scope • Node relabel: changing the label of a node to a similar or more general label by domain knowledge • [Figure: the sample query tree (article; title "search engine", year 2003, section "spam detection") and two relaxed versions: value relaxation widens the year condition from 2003 to 2000-2005; node relabel changes article to document] [1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)

XML Query Relaxation Types • Edge generalization: relaxing a '/' edge to a '//' edge • Node deletion: dropping a node from a query tree • [Figure: the sample query tree (article; title "search engine", year 2003, section "spam detection") and relaxed versions of it; in the node-deletion example the title node is dropped and "search engine" appears directly under article]
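The two structural relaxation types can be written as small tree transformations. The sketch below is illustrative only: `QNode`, `generalize_edge`, and `delete_node` are hypothetical names, labels are assumed to be unique within a query, and two behaviors are assumptions rather than statements from the slides: when a node is deleted, its keyword conditions move to its parent (as the figure appears to show) and its children are reattached with '//' edges.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    keywords: list = field(default_factory=list)
    edge: str = "/"          # edge from the parent; unused on the root
    children: list = field(default_factory=list)


def walk(node):
    # Pre-order traversal of the query tree.
    yield node
    for child in node.children:
        yield from walk(child)


def generalize_edge(root, label):
    """Return a copy in which the edge above `label` is relaxed to '//'."""
    tree = copy.deepcopy(root)
    for node in walk(tree):
        if node.label == label:
            node.edge = "//"
    return tree


def delete_node(root, label):
    """Return a copy with the non-root node `label` dropped."""
    tree = copy.deepcopy(root)
    for parent in walk(tree):
        for i, child in enumerate(parent.children):
            if child.label == label:
                # Assumption: orphaned children hang off the parent via '//'.
                for grandchild in child.children:
                    grandchild.edge = "//"
                parent.children[i:i + 1] = child.children
                # Assumption: content conditions move up to the parent.
                parent.keywords = parent.keywords + child.keywords
                return tree
    return tree  # label not found (or it was the root): nothing to delete


query = QNode("article", children=[
    QNode("title", ["search engine"]),
    QNode("year", ["2003"]),
    QNode("body", children=[QNode("section", ["spam detection"])]),
])

no_title = delete_node(query, "title")
no_body = delete_node(query, "body")
loose_section = generalize_edge(query, "section")
```

Both functions return new trees and leave the original query untouched, so several relaxed variants can be derived from the same starting point.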

XML Relaxation Properties • Definition • Relaxation operation: an application of a relaxation type to a specific query node or edge • Lemma • Given a query tree with n applicable relaxation operations, there are potentially up to 2^n relaxed trees • Possible combinations: each of the n operations is either applied or not applied
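The 2^n bound in the lemma is just the number of subsets of the n operations. A short sketch makes the count concrete; the operation strings follow the gen/del notation used later in the deck, and the function name `relaxation_sets` is made up for this example.

```python
from itertools import combinations


def relaxation_sets(operations):
    """Yield every subset of the given relaxation operations."""
    for size in range(len(operations) + 1):
        for subset in combinations(operations, size):
            yield frozenset(subset)


ops = ["gen(e$1,$2)", "gen(e$1,$3)", "gen(e$3,$4)", "del($2)"]
all_sets = list(relaxation_sets(ops))
print(len(all_sets))  # 16, i.e. 2 ** 4, counting the empty set (the original query)
```

The bound is "up to" 2^n because some subsets may not yield a distinct or meaningful tree, for example generalizing an edge to a node that another operation in the same subset deletes.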

Challenges • Query relaxation is often user-specific • Different users may have different approximate matching specifications for a given query tree • How to provide user-specific approximate query answering? • A query with n relaxation operations has potentially up to 2^n relaxed queries • How to systematically relax a query? • Query relaxation generates a set of approximate answers • How to effectively rank the returned approximate answers?

CoXML System Overview • [Figure: architecture diagram. Components: Relaxation Engine, Ranking Module, Relaxation Index Builder, relaxation indexes (XTAH), XML Database Engine, XML Documents. Flow labels: relaxation language (RLXQuery), similarity metrics, query, relaxed query, exact answers, results, ranked results]

Roadmap • Introduction • Background • CoXML • Relaxation Language • Relaxation Indexes • Ranking • Evaluation • Testbed • Related Work • Conclusion

Relaxation Language • Motivation • Enabling users to specify approximate conditions in queries and to control the approximate matching process • RLXQuery - relaxation-enabled XQuery • Extends the standard XML query language (XQuery) with relaxation constructs & controls, such as • ~ : approximate conditions • ! : non-relaxable conditions • REJECT : unacceptable relaxations • AT-LEAST : minimum # of answers to be returned • RELAX-ORDER : relaxation orders among multiple conditions • USE: allowable relaxation types

RLXQuery Example
FOR $a in doc("bib.xml")//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
      ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
[Figure: query tree for the example, with nodes article, !title ("search engine"), year (2003), body, section ("spam detection"), and the condition labels t1 and t2]

Relaxation Index • Naïve approach • Generate all possible relaxed queries & iteratively select the best relaxed query to derive approximate answers • Exhaustive, but not scalable • Observation • Many queries share the same (or similar) tree structures • Our approach: relaxation index • Consider the structure of a query tree T as a template • Build indexes on the relaxed trees of T • Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T

Relaxation Index - XTAH • XTAH • A hierarchical multi-level labeled cluster of relaxed trees • Building an XTAH • Given a query structure template T, generate all possible relaxed trees • Each relaxed tree uses a unique set of relaxation operations • Cluster relaxed trees into groups based on relaxation operations and distances, similar to "suffix-tree" clustering

XTAH Example • A sample XTAH for the template structure T • Template structure T: article ($1) with children title ($2) and body ($3); body has the child section ($4) • gen(e$u, $v) – relaxing the edge between $u and $v • del($u) – deleting the node $u • [Figure: XTAH rooted at relax, with second-level groups edge_generalization, node_relabel, node_deletion; lower-level groups labeled {gen(e$1,$2)}, …, {gen(e$3, $4)}, ..., {del($2)}, …, {gen(e$1, $2), gen(e$3, $4)}, …, {gen(e$3, $4), gen(e$1,$3)}, …, {del($2), del($3)}; leaves are relaxed trees such as T1, T2, T3, T4, T6, T7]

XTAH Properties • Each group consists of a set of relaxed trees obtained by using similar relaxation operations • Efficient location of relaxed trees based on relaxation operations • The higher a group's level, the less relaxed the trees in the group • Relaxing queries at different granularities by traversing up and down the XTAH
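One way to get a feel for these properties is a toy hierarchy keyed by relaxation operations. The sketch below is not the XTAH construction, which clusters by operations and distances; it substitutes a plain prefix tree over sorted operation sets, chosen because it reproduces two of the listed properties: a group is located by following its operations, and groups nearer the root use fewer operations. All function names are hypothetical.

```python
from itertools import combinations


def build_hierarchy(operations):
    """Prefix tree in which every group is labeled by one set of operations."""
    root = {"label": (), "children": {}}
    ordered = sorted(operations)
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            node = root
            # Walk down one operation at a time, creating groups as needed.
            for depth in range(1, size + 1):
                prefix = subset[:depth]
                node = node["children"].setdefault(
                    prefix[-1], {"label": prefix, "children": {}}
                )
    return root


def count_groups(node):
    # Number of groups below (not including) the given node.
    return sum(1 + count_groups(child) for child in node["children"].values())


def groups_at_depth(node, depth):
    # Labels of all groups exactly `depth` levels below the given node.
    if depth == 0:
        return [node["label"]]
    found = []
    for child in node["children"].values():
        found.extend(groups_at_depth(child, depth - 1))
    return found


xtah_like = build_hierarchy(["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"])
print(count_groups(xtah_like))        # 7 groups, one per non-empty operation set
print(groups_at_depth(xtah_like, 1))  # the three single-operation groups
```

Moving down one level adds one operation to the label, which is a rough analogue of relaxing at a finer or coarser granularity by traversing the hierarchy.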

XTAH-Guided Query Relaxation • Problem • Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specification? • Approach • First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query • This step can be efficiently achieved by utilizing internal node labels • Then, iteratively search the XTAH for the best relaxed query
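The pruning step can be sketched as a recursive filter over group labels. Everything named here (`Op`, `Group`, `prune`) is hypothetical, and the sketch assumes two things the slides do not spell out: an internal group's label holds operations shared by every tree below it, and a '!' marker forbids deleting or relabeling the marked node while still allowing edge generalization next to it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    kind: str      # "edge_generalization", "node_relabel" or "node_deletion"
    nodes: tuple   # query nodes the operation touches, e.g. ("$1", "$2")


@dataclass
class Group:
    label: frozenset                 # operations shared by all trees below
    children: list = field(default_factory=list)


def prune(group, allowed_kinds, protected_nodes):
    """Return a pruned copy of `group`, or None if the group is unacceptable."""
    for op in group.label:
        if op.kind not in allowed_kinds:
            return None              # relaxation type excluded by USE
        touches_protected = set(op.nodes) & protected_nodes
        if op.kind in ("node_deletion", "node_relabel") and touches_protected:
            return None              # node marked non-relaxable with '!'
    kept = []
    for child in group.children:
        pruned_child = prune(child, allowed_kinds, protected_nodes)
        if pruned_child is not None:
            kept.append(pruned_child)
    return Group(group.label, kept)


gen12 = Op("edge_generalization", ("$1", "$2"))
gen34 = Op("edge_generalization", ("$3", "$4"))
del2 = Op("node_deletion", ("$2",))
del3 = Op("node_deletion", ("$3",))
rel1 = Op("node_relabel", ("$1",))

xtah = Group(frozenset(), [
    Group(frozenset({gen12}), [Group(frozenset({gen12, gen34}))]),
    Group(frozenset({gen34})),
    Group(frozenset({del2}), [Group(frozenset({del2, del3}))]),
    Group(frozenset({del3})),
    Group(frozenset({rel1})),
])

# USE (edge generalization, node deletion) with !title, where title is $2.
pruned = prune(xtah, {"edge_generalization", "node_deletion"}, {"$2"})
print([sorted(op.kind for op in g.label) for g in pruned.children])
```

Because the check runs on the label of an internal group, a single failed test discards the whole subtree, including the {del($2), del($3)} group, without visiting it.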

Query Relaxation Process Example • Sample RLXQuery: article with !title ("search engine"), year (2003), and body/section ("spam detection"), with condition labels t1 and t2 • Relaxation Control: USE (edge generalization, node deletion) AT-LEAST 20 • The template structure T: article ($1), title ($2), body ($3), section ($4) • [Figure: a sample XTAH for the template structure T, rooted at relax, with groups edge_generalization, node_relabel, node_deletion; {gen(e$1,$2)}, …, {gen(e$3, $4)}, ..., {del($2)}, …; {gen(e$1, $2), gen(e$3, $4)}, …, {gen(e$3, $4), gen(e$1,$3)}, …, {del($2), del($3)}; and relaxed trees T1, T2, T3, T4, T6, T7]

XTAH-Guided Query Relaxation • Problem • Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level? • Approach: M-tree • Assign representatives to internal groups • Representatives summarize distance properties of the trees within groups • Use representatives to guide the search path to the best relaxation candidate • [Figure: hierarchy of representatives R0, R1, R2, R3, R5, R8, R11 leading to relaxed tree j] [2] M-tree: An efficient access method for similarity search in metric space (P. Ciaccia et al., VLDB 97)
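The role of representatives can be illustrated with a best-first search over a small hierarchy. This is a simplified stand-in, not the M-tree algorithm: each group is assumed to carry a single summary value, `min_cost`, that is a lower bound on the cost of every leaf below it. The group names reuse the R labels from the figure, but the tree shape and the costs are invented for the example.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Group:
    name: str
    min_cost: float                  # assumed summary: lower bound on leaf cost below
    children: list = field(default_factory=list)


def best_leaf(root):
    """Return the cheapest leaf and the list of groups that were expanded."""
    counter = 0                      # tie-breaker so Group objects are never compared
    heap = [(root.min_cost, counter, root)]
    visited = []
    while heap:
        _, _, group = heapq.heappop(heap)
        visited.append(group.name)
        if not group.children:
            return group, visited    # first leaf popped has the smallest bound
        for child in group.children:
            counter += 1
            heapq.heappush(heap, (child.min_cost, counter, child))
    return None, visited


hierarchy = Group("R0", 0.2, [
    Group("R1", 0.2, [Group("R3", 0.6), Group("R5", 0.2)]),
    Group("R2", 0.5, [Group("R8", 0.5), Group("R11", 0.9)]),
])

leaf, visited = best_leaf(hierarchy)
print(leaf.name, visited)
```

In this toy run the search reaches R5 after expanding only R0 and R1; the subtree under R2 is never opened because its summary already rules it out.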

Ranking • Ranking criteria • Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes • Approach • Content similarity – extended vector space model • Structure similarity – tree editing distance with a model for assigning operation cost • Overall relevancy – a ranking model combining both content and structure similarities

Content Similarity • Traditional IR ranking: Vector Space Model • Content similarity between a query and a document • Term Frequency, Inverse Document Frequency • XML content ranking: Extended Vector Space Model • Content similarity between a query and an answer (i.e., a set of data nodes) • Weighted Term Frequency, Inverse Element Frequency

Weighted Term Frequency • Terms under different paths of a node weight differently • Example: [Figure: a query node section with the keyword condition "spam detection", and a data subtree rooted at section (node 5) whose title, paragraph, and reference descendants contain "Spam Detection By Content Analysis", "…an approach to detect spam by …", and "Spam detection taxonomy"] • The weighted term frequency for a term t in a node v is computed over the paths under v that contain t, where • pi: a path under the node v to a term t • m: # of different paths under the node v that contain the term t

Inverse Element Frequency • The more XML elements contain a term, the less disambiguating power the term has • E.g., the term "spam" is less disambiguating than the term "detection" • The inverse element frequency for a query term t is computed from N1 and N2, where • $u: a query node whose content condition contains the term t • N1: # of data nodes that match the structure condition related to $u • N2: # of data nodes that match the structure condition related to $u and contain t

Extended Vector Space Model • The content similarity between an answer A and a query Q is computed over all query nodes and their terms, where • n: # of nodes in Q • {$u1, …, $un}: the set of query nodes in Q • {v1, …, vn}: the set of data nodes in A, where vi matches $ui (1 ≤ i ≤ n) • |$ui.cont|: the number of terms in the content conditions on the node $ui • tij: a term in the content condition on the query node $ui
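The formulas on the three content-similarity slides did not survive in the text, so the sketch below fills them in with forms that are assumptions consistent with the listed symbols: weighted term frequency as a weighted sum over the m paths, inverse element frequency as log(N1 / N2) by analogy with inverse document frequency, and content similarity as the sum of their products over all query nodes and terms. The path weights and counts are invented for illustration.

```python
import math


def weighted_tf(path_counts, path_weights):
    """Sum of (path weight x term occurrences) over the paths containing the term.

    path_counts maps a path p_i to the number of occurrences of t under it.
    Paths without an explicit weight default to 1.0 (an assumption).
    """
    return sum(path_weights.get(path, 1.0) * count
               for path, count in path_counts.items())


def ief(n1, n2):
    """Assumed form log(N1 / N2); 0.0 when no matching node contains the term."""
    return math.log(n1 / n2) if n2 else 0.0


def cont_sim(answer_terms):
    """answer_terms holds, per matched query node, a list of (tf_w, ief) pairs."""
    return sum(tf * weight for terms in answer_terms for tf, weight in terms)


# Hypothetical weights: a hit in a title counts more than one in a reference.
weights = {"title": 1.0, "paragraph": 0.5, "reference": 0.1}
spam_counts = {"title": 1, "paragraph": 1, "reference": 1}

tf_spam = weighted_tf(spam_counts, weights)
ief_spam = ief(1000, 400)        # a common term: low weight
ief_detection = ief(1000, 50)    # a rarer term: higher weight
score = cont_sim([[(tf_spam, ief_spam), (tf_spam, ief_detection)]])
print(round(tf_spam, 3), round(score, 3))
```

A term that appears in every structurally matching node gets an inverse element frequency of zero under this form, so it contributes nothing to the ranking.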

Structure Distance Function • Both XML data and queries are modeled as trees • Similarities between trees are often computed by editing distances, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other tree • The structure distance between an answer A and a query Q can be measured as the total cost of relaxation operations used to derive A: struct_dist(A, Q) = cost(r1) + … + cost(rk) • {r1, …, rk}: the set of relaxation operations used to derive A • cost(ri): the cost for ri (0 ≤ cost(ri) ≤ 1)

Relaxation Operation Cost • Naïve approach • Assign uniform cost to all relaxation operations • Simple but ineffective • Our approach • Assign an operation cost based on the similarity between the two nodes being approximated by the operation • The closer the two nodes, the less the operation costs • ri: a relaxation operation • $u, $v: the two nodes that are being approximated by ri
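A small sketch shows how a similarity-based cost feeds into the structure distance. The mapping cost = 1 - similarity is an assumption chosen because it keeps costs in [0, 1] and makes closer nodes cheaper; the slides do not give the exact function, and the similarity values below are invented.

```python
def operation_cost(similarity):
    """Assumed cost model: cost falls linearly as node similarity rises."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def struct_dist(similarities):
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)


def uniform_dist(similarities, cost=1.0):
    """Baseline: every operation costs the same, whatever it approximates."""
    return cost * len(similarities)


# Hypothetical similarities between the node pairs each operation approximates.
answer_a = [0.8, 0.9]   # e.g. relabel article -> document, generalize body/section
answer_b = [0.2, 0.3]   # two operations between much less similar nodes

print(struct_dist(answer_a), struct_dist(answer_b))
print(uniform_dist(answer_a), uniform_dist(answer_b))
```

Under the uniform baseline both answers are equally distant from the query, while the similarity-based cost separates them, which is the distinction the slide argues for.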

Nodes Approximated By Relaxation Operations • [Figure: a query tree (article with title and body; section under body) next to relaxed trees illustrating node relabel (article becomes document), node deletion, and edge generalization; the trees are labeled T1, T2, T3, T4]

[Figure: content similarity and structure distance feed into overall relevancy]

Overall Relevancy Function • The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q) • Properties • sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0 • sim(A, Q) increases as cont_sim(A, Q) increases • sim(A, Q) decreases as struct_dist(A, Q) increases • Implementation • Uses α, a small constant between 0 and 1
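One function that satisfies all three properties is an exponential discount of the content similarity, sim = α^struct_dist × cont_sim. The slide's own formula is missing from the text, so this form is an assumption; the sketch only demonstrates that it behaves as the properties require.

```python
def overall_relevancy(cont_sim, struct_dist, alpha=0.5):
    """Assumed form: alpha ** struct_dist * cont_sim, with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if struct_dist < 0:
        raise ValueError("struct_dist must be non-negative")
    return (alpha ** struct_dist) * cont_sim


exact = overall_relevancy(3.0, 0.0)            # no relaxation: equals cont_sim
relaxed = overall_relevancy(3.0, 0.3)          # same content, some relaxation
more_relaxed = overall_relevancy(3.0, 1.5)     # same content, heavy relaxation
better_content = overall_relevancy(4.0, 0.3)   # more content similarity

print(exact, round(relaxed, 3), round(more_relaxed, 3), round(better_content, 3))
```

A smaller alpha penalizes structural relaxation more heavily, so the constant controls how much weight structure carries relative to content.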

Roadmap • Introduction • Background • CoXML • Relaxation Indexes • Relaxation Language • Ranking • Evaluation • Testbed • Related Work • Conclusion

Evaluation Studies • INEX (Initiative for the Evaluation of XML Retrieval) • Similar to TREC for text retrieval • Document collections • Scientific articles from IEEE Computer Society 1995 – 2002 • About 500 MB • Each article consists of 1500 XML nodes on average • Queries • Strict content and structure (SCAS) • Vague content and structure (VCAS) • Gold standard • Relevance assessment provided by INEX

Evaluation of Content Similarity • Datasets: INEX 03 test collection • Query sets: 30 SCAS queries • Comparisons: 38 submissions in INEX 03 • [Figure: precision-recall curve (x-axis: Recall, y-axis: Precision); Avg. Precision 0.3309]

Evaluation of the Cost Model • Dataset: INEX 05 test collection • Query set: 22 simple VCAS queries • Evaluation metric: normalized extended cumulative gain (nxCG) • The official evaluation metric used in INEX 05 • Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users accumulated up to the rank i • E.g., nxCG@10, nxCG@25, nxCG@50, … • Cost Models: • UCost: uniform cost for each relaxation operation (Baseline) • SCost: our proposed cost model
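The idea behind nxCG@i can be shown with a simplified computation: the gain accumulated over the first i returned results, divided by the gain an ideal ranking would have accumulated by the same rank. This sketch leaves out the official metric's treatment of overlapping elements and quantization, and the gain values are invented.

```python
def nxcg(ranked_gains, ideal_gains, i):
    """Simplified nxCG@i: cumulative gain at rank i over the ideal cumulative gain.

    ranked_gains: gain of each returned result, in rank order.
    ideal_gains:  gains of all relevant items for the query, in any order.
    """
    if i < 1:
        raise ValueError("i must be at least 1")
    achieved = sum(ranked_gains[:i])
    ideal = sum(sorted(ideal_gains, reverse=True)[:i])
    return achieved / ideal if ideal else 0.0


run = [1.0, 0.0, 0.5]            # hypothetical gains of a system's top 3 results
pool = [1.0, 1.0, 0.5, 0.5]      # hypothetical gains of all relevant items

print(nxcg(run, pool, 1), nxcg(run, pool, 2), nxcg(run, pool, 3))
```

A value of 1.0 at rank i means the system did as well as any ranking could have done up to that rank.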

Retrieval performance improvements with semantic cost model • Query set: all content-and-structure queries in INEX 05 • [Figure: nxCG@10 for each pair (α, cost model), where α is the constant in the overall relevancy function] • Assigning relaxation operations different costs based on the similarities of the nodes being operated on improves retrieval performance! • nxCG@25 and nxCG@50 yield similar results

Evaluation of the Cost Model • Result • [Table: each cell gives nxCG@10 for a given pair (α, cost model), where α is the constant in the overall relevancy function, with the % of improvement over the baseline] • Utilizing node similarities to distinguish costs of different operations improves retrieval performance! • Similar results are observed using nxCG@25 and nxCG@50

Expressiveness of the Relaxation Language • INEX 05 Topic 267
<inex_topic topic_id="267" query_type="CAS">
  <castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
  <description>Articles containing "digital libraries" in their title.</description>
  <narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>
• Expressing Topic 267 using RLXQuery
FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b

Effectiveness of the Relaxation Control • Expressing Topic 267 with RLXQuery
FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
• Results: [Figure: results for Topic 267, annotated "Perfect accuracy"] • Relaxation control enables the system to provide answers with greater relevancy!

Evaluation of the Ranking Function • Dataset: INEX 05 test collection • Query set: 4 official VCAS queries with available relevance assessments • Comparison: top-1 submission in INEX 05 • Results • The systematic relaxation approach enables our system to derive more approximate answers! • Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!

Roadmap • Introduction • Background • CoXML • Relaxation Indexes – XTAH • Relaxation Language – RLXQuery • Ranking • Evaluation • Testbed • Related Work • Conclusion

CoXML Testbed • [Figure: testbed architecture. Input: RLXQuery. Components: RLXQuery Parser, RLXQuery Preprocessor, Relaxation Controller, Relaxation Manager, Database Manager, Ranking Module, Relaxation Index Builder, XTAH, XML Database Engine, XML Documents. Output: Approximate Answers] • Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian

Relaxation Examples using the Testbed
