A Hybrid Approach for XML Similarity

Joe TEKLI, Richard CHBEIR, Kokou YETONGNON

Overview • Introduction and motivation • Current Solutions • Proposal • Implementation • Conclusion

Introduction and motivation • XML (eXtensible Markup Language) • A major means of efficient data representation and management • An XML document can be modeled as an Ordered Labeled Tree • Example: <Academy> <Department> <Laboratory> <Professor>Martin R.</Professor> <Student>Roberts J.</Student> </Laboratory> </Department> </Academy> Plan : Introduction Current Solutions Proposal Implementation Conclusion [Figure: tree of the example document, nodes numbered in preorder with their depths: Academy (node 1, depth 0), Department (node 2, depth 1), Laboratory (node 3, depth 2), Professor (node 4, depth 3), Student (node 5, depth 3)]
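The mapping from the example document to an ordered labeled tree can be sketched as follows. This is a minimal illustration that assumes Python's standard `xml.etree.ElementTree` parser; the function name `labeled_nodes` and the `(index, label, depth)` tuple layout are illustrative choices, not something taken from the slides.

```python
import xml.etree.ElementTree as ET

DOC = """<Academy><Department><Laboratory>
<Professor>Martin R.</Professor><Student>Roberts J.</Student>
</Laboratory></Department></Academy>"""


def labeled_nodes(xml_text):
    """Return (preorder index, label, depth) for every element, in document order."""
    root = ET.fromstring(xml_text)
    nodes = []

    def visit(elem, depth):
        # Preorder: record the node first, then its children from left to right.
        nodes.append((len(nodes) + 1, elem.tag, depth))
        for child in elem:
            visit(child, depth + 1)

    visit(root, 0)
    return nodes


nodes = labeled_nodes(DOC)
for index, label, depth in nodes:
    print(index, label, depth)
```

Only element labels and their positions are kept here; text values such as "Martin R." are ignored, which matches the structure-only view of the tree used in the rest of the talk.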

Introduction and motivation • XML has become unavoidable • Current applications: • Information description, storage and retrieval • Database information interchange • Web services interaction • Information destined for publication on the web is now commonly represented in XML

Introduction and motivation • Emerging need • Information retrieval: comparison of XML documents

Introduction and motivation • A range of algorithms for comparing semi-structured data, e.g. XML documents, has been proposed. They: • Generally exploit the concept of Edit Distance • Focus on the structure of XML documents • Ignore the semantics involved • However, in the field of information retrieval (IR), estimating semantic similarity between web pages is of key importance to improving search results [1] [1] Maguitman A. G., Menczer F., Roinestad H. and Vespignani A., Algorithmic Detection of Semantic Similarity. In Proceedings of the 14th International World Wide Web Conference, 107-116, Chiba, Japan, 2005

Introduction and motivation • Example: • XML Document A: <Academy> <Department> <Laboratory> <Professor></Professor> <Student></Student> </Laboratory> </Department> </Academy> • XML Document B: <College> <Department> <Laboratory> <Lecturer></Lecturer> </Laboratory> </Department> </College> • XML Document C: <Factory> <Department> <Laboratory> <Supervisor></Supervisor> </Laboratory> </Department> </Factory> • [Figure: the three documents drawn as trees] • Question: Sim(A, B) = Sim(A, C)?

Introduction and motivation • Motivation • How can existing XML comparison approaches be enhanced to take into account both the structural and the semantic characteristics of XML documents?

Introduction and motivation • We consider heterogeneous XML documents lacking predefined grammars (DTDs or XML Schemas) • XML documents published on the web are often found without grammars • Goal: to put forward an improved XML comparison method integrating semantic and structural similarity

Overview • Introduction and motivation • Current solutions • XML structural similarity • Semantic similarity • Proposal • Implementation • Conclusion

Current solutions: XML structural similarity • Most algorithms proposed in the literature use dynamic programming techniques to find the Edit Distance between trees • That is, they find the cheapest sequence of edit operations that transforms one tree into another

Current solutions: XML structural similarity • Algorithms can be distinguished by: • The set of edit operations allowed • Insert node: insertion of inner/leaf nodes • Delete node: deletion of inner/leaf nodes • Update node: relabeling nodes • Insert tree • Delete tree • Move tree • The overall complexity and performance • O(N² D²) • O(N²) • O(N log N) • Optimality • [Figure: example trees with nodes labeled a, b, c, d, e, h, i, j, y, z]

Current solutions: Semantic similarity • Knowledge bases (thesauri, taxonomies, ontologies) provide a framework for organizing words into a semantic space • Semantic similarity between two words: • Similarity between the corresponding concepts in the knowledge base • [Figure: a knowledge base drawn as a hierarchy of concepts c1 to c9, each linked to words/expressions]

Current solutions: Semantic similarity • Several methods have been proposed in the literature: • Edge-based approaches • Node-based approaches • Node-based approaches seem more relevant • Experimental results show a higher correlation with human judgment • Information content of a concept c = −log p(c) • [Figure: a hierarchy of concepts c1 to c9, each linked to words/expressions]
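The information content formula can be illustrated with a small sketch. It assumes the usual node-based convention that p(c) is estimated from corpus counts, with each occurrence of a concept also counting for all of its ancestors. The taxonomy, the counts, and the function names below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy: concept -> parent (None marks the root).
PARENT = {
    "entity": None,
    "institution": "entity",
    "academy": "institution",
    "college": "institution",
    "factory": "entity",
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNT = {"entity": 0, "institution": 10, "academy": 20, "college": 30, "factory": 40}


def cumulative_counts(parent, count):
    """Add every concept's count to all of its ancestors."""
    total = dict(count)
    for concept, n in count.items():
        ancestor = parent[concept]
        while ancestor is not None:
            total[ancestor] += n
            ancestor = parent[ancestor]
    return total


def information_content(concept):
    """IC(c) = -log p(c), with p(c) the share of occurrences subsumed by c."""
    totals = cumulative_counts(PARENT, COUNT)
    p = totals[concept] / totals["entity"]
    return -math.log(p)


for name in PARENT:
    print(name, round(information_content(name), 3))
```

Under these assumptions the root carries no information (p = 1), and information content grows as concepts become more specific.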

Proposal • Hybrid approach • Edit Distance algorithm (Chawathe [2]) • Semantic cost model (Lin [3]) [2] Chawathe S., Comparing Hierarchical Data in External Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999 [3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998

Proposal • We adopt Chawathe's Edit Distance algorithm [2] • It is a direct application of the Wagner-Fischer algorithm [4] • It is among the fastest available • Edit operations used: • Insertion of leaf nodes: Ins(x, i, p, λ(x)) • Deletion of leaf nodes: Del(x, p) • Update of internal/leaf nodes: Upd(x, y) • Complexity: • O(N²) [2] Chawathe S., Comparing Hierarchical Data in External Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999 [4] Wagner R. and Fischer M., The String-to-String Correction Problem. Journal of the Association for Computing Machinery, 21(1):168-173, 1974

Proposal • A central question in most edit distance approaches: how should costs be assigned to edit operations? • Intuitive cost model: • CostIns = 1 • CostDel = 1 • CostUpd = 1 when x.l ≠ y.l, otherwise CostUpd = 0

Proposal • Applying Chawathe's approach [2] • [Figure: trees of XML documents A, B and C with nodes numbered in preorder. A: Academy (1), Department (2), Laboratory (3), Professor (4), Student (5). B: College (1), Department (2), Laboratory (3), Lecturer (4). C: Factory (1), Department (2), Laboratory (3), Supervisor (4)] • Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3]) • Dist(A, B) = Dist(A, C) = 3 • Sim = 1 / (1 + Dist) • Sim(A, B) = Sim(A, C) = 0.25 • How can semantic similarity be taken into account?
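The unit-cost computation on this slide can be reproduced with a simplified sketch. It is not Chawathe's algorithm from [2]: it is a plain Wagner-Fischer style dynamic program over preorder lists of (label, depth) pairs, with the simplifying assumption that an update may only pair nodes of equal depth, and without the leaf-only restrictions on insertions and deletions. The function names are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two preorder lists of (label, depth) pairs."""
    m, n = len(a), len(b)
    # dist[i][j]: cheapest cost of turning the first i nodes of a
    # into the first j nodes of b.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete the first i nodes of a
    for j in range(1, n + 1):
        dist[0][j] = j  # insert the first j nodes of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            (label_a, depth_a), (label_b, depth_b) = a[i - 1], b[j - 1]
            best = min(dist[i - 1][j] + 1,   # CostDel = 1
                       dist[i][j - 1] + 1)   # CostIns = 1
            if depth_a == depth_b:
                # CostUpd = 0 for equal labels, 1 otherwise.
                update = 0 if label_a == label_b else 1
                best = min(best, dist[i - 1][j - 1] + update)
            dist[i][j] = best
    return dist[m][n]


def similarity(distance):
    """Sim = 1 / (1 + Dist)."""
    return 1 / (1 + distance)


A = [("Academy", 0), ("Department", 1), ("Laboratory", 2),
     ("Professor", 3), ("Student", 3)]
B = [("College", 0), ("Department", 1), ("Laboratory", 2), ("Lecturer", 3)]
C = [("Factory", 0), ("Department", 1), ("Laboratory", 2), ("Supervisor", 3)]

print(edit_distance(A, B), similarity(edit_distance(A, B)))
print(edit_distance(A, C), similarity(edit_distance(A, C)))
```

With unit costs the two comparisons are indistinguishable, which is exactly the limitation that motivates a semantic cost model.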

Proposal • Solution: we propose to vary the costs of edit operations according to the semantics of the nodes concerned • Semantic cost model: • Varying operation costs w.r.t. the semantic relatedness of node labels • CostSem_Op(x, y) • Varying operation costs w.r.t. the corresponding node depths • CostDepth_Op(x) • CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]

Proposal: Label semantic similarity cost • Edit operations: • CostSem_Upd(x, y) = 1 – SimSem(x.l, y.l) • CostSem_Ins(x, i, p, λ(x)) = 1 – SimSem(λ(x), p.l) • CostSem_Del(x, p) = 1 – SimSem(x.l, p.l) • CostSem_Op decreases when SimSem increases • CostSem_Op increases when SimSem decreases

Proposal: Label semantic similarity cost • Semantic similarity measure adopted: Lin [3] • SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2)) ∈ [0, 1], with C the lowest common ancestor of C1 and C2 (the one maximizing their pair-wise similarity value) [3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998
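A small numeric sketch of Lin's measure follows. The concept probabilities and the choice of lowest common ancestors are hypothetical values picked for illustration, not figures from the talk; the function name `lin_similarity` is likewise an assumption.

```python
import math

# Hypothetical concept probabilities, e.g. estimated from corpus frequencies.
P = {
    "entity": 1.0,        # root of the taxonomy
    "institution": 0.6,
    "academy": 0.2,
    "college": 0.3,
    "factory": 0.4,
}


def lin_similarity(p_c1, p_c2, p_ancestor):
    """SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2))."""
    denominator = math.log(p_c1) + math.log(p_c2)
    if denominator == 0:
        # Both concepts have probability 1, so they coincide with the root.
        return 1.0
    return 2 * math.log(p_ancestor) / denominator


# Assumed lowest common ancestors: "institution" for academy/college,
# and the root "entity" for academy/factory.
sim_academy_college = lin_similarity(P["academy"], P["college"], P["institution"])
sim_academy_factory = lin_similarity(P["academy"], P["factory"], P["entity"])

print(round(sim_academy_college, 3), sim_academy_factory)
```

When the only shared ancestor is the root, log P(C) is zero and the similarity vanishes; the more specific the shared ancestor, the closer the value gets to 1.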

Proposal: Label semantic similarity cost • Example • Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3]) • CostSem_Upd(A[1], B[1]) = 1 – SimSem(Academy, College) • CostSem_Upd(A[1], C[1]) = 1 – SimSem(Academy, Factory) • SimSem(Academy, College) > SimSem(Academy, Factory) • CostSem_Upd(A[1], B[1]) < CostSem_Upd(A[1], C[1]) • Dist(A, B) < Dist(A, C) • Sim(A, B) > Sim(A, C) • [Figure: trees of XML documents A, B and C with nodes numbered in preorder]
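The effect of label-based costs on this example can be sketched by pricing the edit scripts directly. The SimSem values below are hypothetical, and the script for document C is assumed to mirror the one given for B; depth weighting is left out, so only the label semantic cost is illustrated.

```python
# Hypothetical SimSem values in [0, 1]; a real system would derive them
# from a taxonomy with a measure such as Lin's.
SIM_SEM = {
    ("Academy", "College"): 0.8,
    ("Academy", "Factory"): 0.2,
    ("Professor", "Lecturer"): 0.9,
    ("Professor", "Supervisor"): 0.4,
    ("Student", "Laboratory"): 0.3,  # deleted node label vs. parent label
}


def cost_sem(label_1, label_2):
    """CostSem = 1 - SimSem for an update, or for a deletion (node vs. parent)."""
    return 1 - SIM_SEM[(label_1, label_2)]


# Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3]) as label pairs.
script_ab = [("Academy", "College"), ("Professor", "Lecturer"),
             ("Student", "Laboratory")]
# Assumed analogous script for document C.
script_ac = [("Academy", "Factory"), ("Professor", "Supervisor"),
             ("Student", "Laboratory")]

dist_ab = sum(cost_sem(x, y) for x, y in script_ab)
dist_ac = sum(cost_sem(x, y) for x, y in script_ac)
sim_ab = 1 / (1 + dist_ab)
sim_ac = 1 / (1 + dist_ac)

print(round(dist_ab, 3), round(dist_ac, 3))
print(round(sim_ab, 3), round(sim_ac, 3))
```

Under these assumed values the same three operations are cheaper when A is compared with B than with C, so the two documents are no longer judged equally similar to A.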

Proposal • Semantic cost model: • Varying operation costs w.r.t. the semantic relatedness of node labels • CostSem_Op(x, y) • Varying operation costs w.r.t. the node depths • CostDepth_Op(x) • CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]

Proposal: Node depth cost • CostDepth_Op(x) = 1 / (1 + x.d) ∈ [0, 1] • Information becomes increasingly specific as one descends in the XML tree hierarchy • Its semantic effect on the whole XML document decreases accordingly • Editing the root node of a document tree: • CostDepth_Op(root) = 1 • Operation costs decrease when moving downward in the hierarchy • [Figure: tree of XML document A (Academy, Department, Laboratory, Professor, Student), with the label Hospital shown next to the root Academy]
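The depth factor can be sketched in a few lines. The sketch assumes that the overall operation cost is the product of the semantic cost and the depth cost; the function names and the SimSem value of 0.5 are illustrative assumptions.

```python
def cost_depth(depth):
    """CostDepth_Op(x) = 1 / (1 + x.d), where x.d is the depth of node x."""
    return 1 / (1 + depth)


def cost_op(sim_sem, depth):
    """Assumed combination: (1 - SimSem) scaled by the depth factor."""
    return (1 - sim_sem) * cost_depth(depth)


# The same label change (hypothetical SimSem of 0.5) at the root and at depth 3.
root_update = cost_op(sim_sem=0.5, depth=0)
leaf_update = cost_op(sim_sem=0.5, depth=3)

print(root_update, leaf_update)
```

An identical relabeling costs four times less at depth 3 than at the root, reflecting the idea that changes near the root affect the meaning of the whole document more.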

Proposal • Hybrid XML comparison approach: • Edit Distance computations (Chawathe [2]) • Semantic similarity evaluation (Lin [3]) [2] Chawathe S., Comparing Hierarchical Data in External Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999 [3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998

Implementation • Prototype XS3 (XML Structure and Semantic Similarity) • XML document comparison • 1/1 • 1/∞: ranking documents according to their similarity degrees • ∞/∞: XML document classification/clustering

Implementation • Synthetic XML document generator • Produces sets of XML documents based on given DTDs • Taxonomic analyzer • Computes semantic similarity values between words in a given knowledge base (taxonomy)

Implementation • Experimental results • Higher average similarity values, revealing similarities of a semantic nature that previously went undetected • Clear distinction between documents corresponding to different DTDs • Semantic affinities between document sets are captured • DTD1: <!DOCTYPE DTD1 [ <!ELEMENT Academy (Administrative unit+)> <!ELEMENT Administrative unit (Branch?)> <!ELEMENT Branch (Educator?, Student+)> <!ELEMENT Educator (#PCDATA)> <!ELEMENT Student (#PCDATA)> ]> • DTD2: <!DOCTYPE DTD2 [ <!ELEMENT School (Administrative unit+)> <!ELEMENT Administrative unit (Section?)> <!ELEMENT Section (Educator?, Scholar*)> <!ELEMENT Educator (#PCDATA)> <!ELEMENT Scholar (#PCDATA)> ]> • DTD3: <!DOCTYPE DTD3 [ <!ELEMENT Government (Administrative unit+)> <!ELEMENT Administrative unit (Section?)> <!ELEMENT Section (Professional?, Worker+)> <!ELEMENT Professional (#PCDATA)> <!ELEMENT Worker (#PCDATA)> ]> • [Figure: chart comparing combined structural and semantic similarity with structural similarity alone; axis values range from 0.085 to 0.099]

Implementation • Experimental results • Chawathe's classical Edit Distance process [2] is linear in the number of nodes of each tree: O(|A| |B|) • Our approach is of polynomial complexity • [Figure: timing plots; axes: number of nodes in each taxonomy, Time (m), Time (s)]

Conclusion • Goal: developing an integrated semantic and structure-based XML similarity approach for comparing XML documents, taking into account: • The semantic meaning of XML elements/attributes w.r.t. their labels and depths • The structural characteristics of XML documents • This is the first attempt to combine Edit Distance structural similarity computations with IR semantic similarity assessment in an XML context • Experimental results are satisfactory

Conclusion • Future work • Exploiting semantic similarity to compare not only the structure of XML documents but also their information content (values) • In such a framework, XML Schemas seem indispensable • Example: <Factory> <Department> <Laboratory> <Product> BMW Z3 </Product> <Product> BMW X5 </Product> </Laboratory> </Department> </Factory> • Studying XML similarity in a multimedia context (MPEG7, SVG, ...) • Taking into consideration structural, semantic, as well as multimedia-specific criteria

Thank you Questions …
