Extending PRIX for Similarity-based XML Query

Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

Agenda
• System Architecture Introduction
• Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
• Structural-based Similarity Search
  - Adapting PRIX algorithm
  - Indexing
  - Query Processing
  - Structural Similarity Computation
• Similarity Computation and Ranking
• Discussion & Conclusion

System Architecture Introduction

Query Expansion (I)
An Example:
• Tags in a sample query: {title, Praveen Rao, information retrieval}
• Keywords: {title, Praveen, Rao, information, retrieval}
• Keyword Extensions: {{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
• Valid Keyword Extensions: {{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}
(Continued on next page)

Query Expansion (II)
• Tag Extensions: {{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}
• Valid Tag Extensions: {{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}
• Query Expansions:
  - {{title}, {Praveen Rao}, {modern information retrieval}}
  - {{A claim on theory of computation}, {Praveen Rao}, {modern information retrieval}}
  - ……
• Valid Queries:
  - {{title}, {Praveen Rao}, {modern information retrieval}}
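The step from keyword extensions to tag extensions in the example looks like a Cartesian product over the alternatives of each keyword, followed by a check against values that actually occur in the data. A minimal sketch under that assumption follows; the function names and the `known_values` list are hypothetical, and the validity test (every keyword appears as a word in a stored value) is only one plausible reading of "valid".

```python
from itertools import product


def tag_extensions(keyword_extensions):
    """Combine the alternatives of each keyword into candidate tag phrases."""
    return [list(combo) for combo in product(*keyword_extensions)]


def matching_values(candidate, known_values):
    """Return stored values that contain every keyword of the candidate.

    Assumption: a candidate is 'valid' when at least one stored value
    contains all of its keywords as whole words (case-insensitive).
    """
    words = [w.lower() for w in candidate]
    return [v for v in known_values
            if all(w in v.lower().split() for w in words)]


# Valid keyword extensions of the tag "information retrieval" (from the slide).
info_retrieval = [["data", "entropy", "information"], ["retrieval", "recovery"]]
candidates = tag_extensions(info_retrieval)

# Hypothetical stored values, taken from the valid tag extensions on the slide.
known_values = [
    "modern information retrieval",
    "A survey on information retrieval",
    "information recovery",
]

for cand in candidates:
    print(cand, "->", matching_values(cand, known_values))
```

With two keywords that have three and two alternatives, the product yields the six combinations listed under Tag Extensions; only those backed by a stored value survive as valid.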

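Semantic Similarity Computation
• Similarity between query q and one of its extensions q’
  - t: tag in query q
  - t’: tag in query q’
  - n: number of tags in q
  - m: number of keywords in tag t
  - k_i, k_i’: the i-th keyword of t and of t’
• Keyword-level similarity:
  \mathrm{sim}(k_i, k_i') = \begin{cases} 1, & \text{if } k_i = k_i' \\ \alpha \ (0 \le \alpha < 1), & \text{if } k_i \ne k_i' \end{cases}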
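The slide preserves the keyword-level rule (1 for equal keywords, α otherwise) and the counts n and m, but not the formula that combines them. The sketch below assumes the simplest combination: average the keyword scores over the m keywords of a tag, then average the tag scores over the n tags of q. That averaging, the default α = 0.5, and all function names are assumptions for illustration, not the authors' stated method.

```python
def keyword_sim(k, k_prime, alpha=0.5):
    """1 if the keywords are equal, otherwise alpha (0 <= alpha < 1)."""
    return 1.0 if k == k_prime else alpha


def tag_sim(t, t_prime, alpha=0.5):
    """Assumed: mean keyword similarity over the m keywords of tag t.

    Keywords are paired by position; a keyword with no counterpart in
    t_prime is treated as a mismatch.
    """
    m = len(t)
    total = 0.0
    for i in range(m):
        counterpart = t_prime[i] if i < len(t_prime) else None
        total += keyword_sim(t[i], counterpart, alpha)
    return total / m


def semantic_sim(q, q_prime, alpha=0.5):
    """Assumed: mean tag similarity over the n tags of query q."""
    n = len(q)
    return sum(tag_sim(t, tp, alpha) for t, tp in zip(q, q_prime)) / n


q = [["title"], ["Praveen", "Rao"], ["information", "retrieval"]]
q_ext = [["title"], ["Praveen", "Rao"], ["information", "recovery"]]
print(semantic_sim(q, q_ext))  # two exact tags, one tag with a single replaced keyword
```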

Indexing: Prix (PRüfer sequences for Indexing Xml)

Indexing: Prix (PRüfer sequences for Indexing Xml)
• AD-Label (Ancestor-Descendant)
• Indexing structure in DB
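For readers unfamiliar with Prüfer sequences, the following sketch shows how a labeled tree can be turned into a labeled sequence (LPS) and a numbered sequence (NPS) when nodes are numbered in postorder. With postorder numbering, repeatedly deleting the smallest-numbered leaf removes nodes in the order 1, 2, 3, ..., so entry i of each sequence describes the parent of node i. The tuple-based tree encoding and the function name are assumptions made for this example.

```python
def prufer_sequences(tree):
    """Return (LPS, NPS) for a tree given as (label, [children]).

    Nodes are numbered in postorder. Entry i (1-based) of the LPS is the
    label of the parent of node i; entry i of the NPS is that parent's
    postorder number. The root contributes no entry.
    """
    parent_label = {}
    parent_number = {}
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1  # postorder number of this node
        for c in child_numbers:
            parent_label[c] = label
            parent_number[c] = counter
        return counter

    root_number = visit(tree)
    lps = [parent_label[i] for i in range(1, root_number)]
    nps = [parent_number[i] for i in range(1, root_number)]
    return lps, nps


# Small example tree: a has children b and c; c has child d.
example = ("a", [("b", []), ("c", [("d", [])])])
lps, nps = prufer_sequences(example)
print(lps, nps)
```

In this example b, d, c, a receive postorder numbers 1 to 4, so the sequences list the parents of nodes 1, 2 and 3 in that order.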

Query Processing
• Procedure
  - Filtering
    - Based on subsequence matching
    - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query
  - Refinement
    - Connectivity
    - Gap Consistency
    - Frequency Consistency

Subsequence Matching
• Definition
  - Example:
    * Good results: media, mult, mm, ted, tia, etc.
• Why does it work?
• It is not enough; more refinements are needed
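A subsequence keeps the relative order of symbols but allows gaps between them. The sketch below is a plain linear-time check, not the index-based filtering that PRIX performs. The source string is not preserved on the slide; "multimedia" is assumed here because every listed result is a subsequence of it.

```python
def is_subsequence(query, document):
    """True if the symbols of query occur in document in the same order,
    with arbitrary gaps allowed between them."""
    remaining = iter(document)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each query symbol must be found after the previous one.
    return all(symbol in remaining for symbol in query)


source = "multimedia"  # assumed example string, not given on the slide
for candidate in ["media", "mult", "mm", "ted", "tia", "dim"]:
    print(candidate, is_subsequence(candidate, source))
```

The same function works on lists of tag labels, for example a query LPS checked against a document LPS.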

Refinement #1
• Concept of Dummy Nodes
  - PRIX offers only partial match
  - Solution: extend PRIX to the leaf level
  - Example:
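A regular Prüfer sequence records only parents, so leaf labels never appear in it. One way to bring leaves into the sequence is to hang a dummy child under every leaf before building the sequence. The sketch below assumes that reading; the `DUMMY` label and the function names are hypothetical. It reproduces the sequences baca and caba that appear later in these slides for the two orderings of a tree a with children b and c.

```python
DUMMY = "#"  # hypothetical placeholder label for dummy nodes


def add_dummy_leaves(node):
    """Return a copy of the tree in which every leaf gets one dummy child."""
    label, children = node
    if not children:
        return (label, [(DUMMY, [])])
    return (label, [add_dummy_leaves(child) for child in children])


def lps(node, parent_label=None):
    """Labeled Prüfer sequence under postorder numbering: for every
    non-root node, in postorder, emit the label of its parent."""
    label, children = node
    out = []
    for child in children:
        out.extend(lps(child, label))
    if parent_label is not None:
        out.append(parent_label)
    return out


tree_bc = ("a", [("b", []), ("c", [])])
tree_cb = ("a", [("c", []), ("b", [])])

print("".join(lps(tree_bc)))                    # leaves b and c are absent
print("".join(lps(add_dummy_leaves(tree_bc))))  # leaves now appear
print("".join(lps(add_dummy_leaves(tree_cb))))
```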

Refinement #2
• Connection vs Connectionless
  - Definition
  - How to check it?
  - If not connected, then what?
  - Solution: apply penalty
  - Example (Disconnected By Gap):
  - Example (Disconnected By Unknown):
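One way to test connectivity of a match, sketched here as an illustration rather than as the exact check used in this work: in a postorder-numbered Prüfer sequence, position i of the NPS holds the parent of node i, so each matched position stands for one edge (i, NPS[i]). The matched edges form a single connected subtree exactly when only one node among them lacks its own upward edge. Function names are hypothetical.

```python
def top_nodes(nps, positions):
    """Nodes that appear as a parent in the matched edges but whose own
    edge to their parent was not matched.

    nps: numbered Prüfer sequence of the document (postorder numbering).
    positions: 1-based positions of the matched subsequence in nps.
    """
    children = set(positions)                      # node i is the child at position i
    parents = {nps[p - 1] for p in positions}      # parent of each matched child
    return parents - children


def is_connected(nps, positions):
    """The matched edges form one subtree iff there is exactly one top node."""
    return len(top_nodes(nps, positions)) == 1


# Document tree: a has children b and c; c has child d.
# Postorder numbers: b=1, d=2, c=3, a=4, so the NPS is [4, 3, 4].
doc_nps = [4, 3, 4]
print(is_connected(doc_nps, [1, 3]))  # edges b-a and c-a share the root a
print(is_connected(doc_nps, [1, 2]))  # edges b-a and d-c are not linked
```

When the check fails, the number of top nodes gives a rough count of disconnected pieces, which could feed a penalty.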

Refinement #3
• Checking for Gap Consistency
  - Gap consistency depends on the gaps of the Prüfer sequence
  - How to check it?
  - Determines whether the query tree is a subset of the search domain
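A sketch of a gap-consistency test, assuming the definition commonly given for PRIX: for each pair of adjacent entries, the gap in the query NPS and the gap in the matched part of the document NPS must both be zero, both positive, or both negative, and the query gap must not exceed the document gap in absolute value. The function name and the sample numbers are illustrative only.

```python
def gap_consistent(query_nps, matched_nps):
    """Compare gaps between adjacent entries of the two sequences.

    query_nps: NPS of the query tree.
    matched_nps: entries of the document NPS at the matched positions.
    """
    if len(query_nps) != len(matched_nps):
        return False
    for i in range(len(query_nps) - 1):
        gap_q = query_nps[i + 1] - query_nps[i]
        gap_d = matched_nps[i + 1] - matched_nps[i]
        same_sign = ((gap_q == 0 and gap_d == 0)
                     or (gap_q > 0 and gap_d > 0)
                     or (gap_q < 0 and gap_d < 0))
        if not same_sign or abs(gap_q) > abs(gap_d):
            return False
    return True


query_nps = [4, 3, 4]
print(gap_consistent(query_nps, [7, 5, 7]))  # gaps -2, +2 against -1, +1
print(gap_consistent(query_nps, [7, 7, 7]))  # document gaps are 0, query gaps are not
```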

Refinement #4
• Checking for Frequency Consistency
  - Frequency consistency depends on gap consistency and on the occurrences in the NPS
  - How to check it?
  - Determines whether the query tree is an exact match in the search domain
  - If not frequency consistent, then what?
  - Solution: apply penalty
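A sketch of a frequency-consistency test, assuming the definition commonly given for PRIX: two positions hold the same number in the query NPS exactly when they hold the same number in the matched part of the document NPS. In tree terms, nodes that share a parent in the query must share a parent in the match, and vice versa. The function name and sample numbers are illustrative only.

```python
def frequency_consistent(query_nps, matched_nps):
    """Check that equal entries occur at the same pairs of positions
    in both sequences."""
    if len(query_nps) != len(matched_nps):
        return False
    n = len(query_nps)
    for i in range(n):
        for j in range(i + 1, n):
            same_in_query = query_nps[i] == query_nps[j]
            same_in_match = matched_nps[i] == matched_nps[j]
            if same_in_query != same_in_match:
                return False
    return True


query_nps = [4, 3, 4]
print(frequency_consistent(query_nps, [7, 5, 7]))  # first and third share a parent in both
print(frequency_consistent(query_nps, [7, 5, 8]))  # parents differ in the match
```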

Structure Similarity
• Calculations are based on edit distances, which are transformed into penalty values
• Each mismatched node in the structure has a penalty equal to the size of its subtree + 1
• Overall penalty is the dot product of all mismatches
• All results are normalized with respect to the worst-case penalty
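The per-node penalty and the normalization step can be sketched as follows. Two points are assumptions: "size of subtree" is read as the number of nodes in the subtree rooted at the mismatched node (including that node), and the worst-case penalty is passed in by the caller. How individual penalties are combined is not spelled out beyond "dot product", so the sketch stops at single-node penalties. Function names are hypothetical.

```python
def subtree_size(node):
    """Number of nodes in the subtree rooted at node, given as (label, [children]).
    Assumption: the count includes the node itself."""
    label, children = node
    return 1 + sum(subtree_size(child) for child in children)


def mismatch_penalty(node):
    """Penalty of one mismatched node: size of its subtree + 1."""
    return subtree_size(node) + 1


def normalized(penalty, worst_case):
    """Scale a penalty by the worst-case penalty (caller-supplied)."""
    return penalty / worst_case if worst_case else 0.0


subtree_c = ("c", [("d", [])])
print(mismatch_penalty(subtree_c))   # subtree of two nodes
print(mismatch_penalty(("b", [])))   # single leaf
```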

Structural Similarity #1: Connectivity

Structural Similarity #2: Gap Similarity

Structural Similarity #3: Frequency Similarity

Rank returned XML patterns
• Similarity(q, q’’) = Semantic_sim(q, q’) * Structure_sim(q’, q’’)
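The ranking formula multiplies the two scores, so a result only ranks high when it is close both semantically and structurally. A minimal sketch follows; the candidate names and score values are made up for illustration, and the function name is hypothetical.

```python
def rank_patterns(candidates):
    """Sort candidates by Semantic_sim * Structure_sim, highest first.

    candidates: list of (pattern, semantic_sim, structure_sim) tuples,
    where semantic_sim compares q with its expansion q' and
    structure_sim compares q' with the returned pattern q''.
    """
    scored = [(pattern, sem * struct) for pattern, sem, struct in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical scores for three returned patterns.
results = rank_patterns([
    ("pattern_1", 1.0, 0.5),
    ("pattern_2", 0.9, 0.8),
    ("pattern_3", 0.75, 1.0),
])
for pattern, score in results:
    print(pattern, round(score, 3))
```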

Advantages of the approach
• PRIX indexing
  - Faster
  - Captures all structural information
• Similarity based
  - Structure similarity
  - Semantic similarity

Limitations and Extensions
• Limitation of PRIX:
  - Ordering of nodes
  - We need to handle it in query extension
  - Example: tree a with children (b, c) gives the sequence baca; tree a with children (c, b) gives the sequence caba

Limitations and Extensions
• More limitations of PRIX:
  - It is difficult to map intuitive structural similarities between trees to similarities between PRIX sequences
  - It is therefore difficult to give accurate definitions of the similarity
• However:
  - Translating tree structures into equivalent sequences and then doing data mining or similarity matching on the sequences is a promising direction

Limitations and Extensions
• Limitations of semantic similarity:
  - Too many similar results
• However:
  - We consider semantic similarity together with structure information
• In a broad sense:
  - Structure similarity
  - Semantic similarity
  - Syntax similarity
  - Similarity information from co-occurrences of keywords
  - Similarity information from user feedback
  - Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)
