
A Hybrid Approach for XML Similarity



Presentation Transcript


  1. A Hybrid Approach for XML Similarity. Joe TEKLI, Richard CHBEIR, Kokou YETONGNON

  2. Overview • Introduction and motivation • Current Solutions • Proposal • Implementation • Conclusion

  3. Introduction and motivation • XML (eXtensible Markup Language) • A major means of efficient data representation and management • An XML document can be modeled as an Ordered Labeled Tree • Example: <Academy> <Department> <Laboratory> <Professor>Martin R.</Professor> <Student>Roberts J.</Student> </Laboratory> </Department> </Academy> [Figure: the corresponding tree, with nodes numbered 1 to 5 (Academy, Department, Laboratory, Professor, Student) and node depths 0 to 3] Plan: Introduction, Current Solutions, Proposal, Implementation, Conclusion
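
The tree view of the example document can be sketched in a few lines of Python. This is only an illustration: the helper name `labeled_tree` and the choice of preorder numbering starting at 1 with the root at depth 0 are assumptions chosen to match the node indices and depths shown on the slide.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<Academy><Department><Laboratory>
<Professor>Martin R.</Professor><Student>Roberts J.</Student>
</Laboratory></Department></Academy>"""


def labeled_tree(xml_text):
    """Return a list of (preorder index, label, depth) for every element."""
    root = ET.fromstring(xml_text)
    nodes = []

    def visit(elem, depth):
        # Preorder: record the node first, then its children in document order.
        nodes.append((len(nodes) + 1, elem.tag, depth))
        for child in elem:
            visit(child, depth + 1)

    visit(root, 0)
    return nodes


for index, label, depth in labeled_tree(XML_DOC):
    print(index, label, depth)
```

Child order is preserved by the traversal, which is what makes the tree "ordered"; element text such as "Martin R." is ignored here because the slide only considers labels.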

  4. Introduction and motivation • XML has become unavoidable • Current applications: • Information description, storage and retrieval • Database information interchange • Web services interaction • Information destined to be published on the web is now represented using XML

  5. Introduction and motivation • Emerging need in information retrieval: comparing XML documents

  6. Introduction and motivation • A range of algorithms for comparing semi-structured data, e.g. XML documents, has been proposed. These algorithms: • Generally exploit the concept of Edit Distance • Focus on the structure of XML documents • Ignore the semantics involved • However, in the field of information retrieval (IR), estimating semantic similarity between web pages is of key importance to improving search results [1] [1] Maguitman A. G., Menczer F., Roinestad H. and Vespignani A., Algorithmic Detection of Semantic Similarity. In Proceedings of the 14th International World Wide Web Conference, 107-116, Chiba, Japan, 2005

  7. Introduction and motivation • Example: • XML document A: <Academy> <Department> <Laboratory> <Professor></Professor> <Student></Student> </Laboratory> </Department> </Academy> • XML document B: <College> <Department> <Laboratory> <Lecturer></Lecturer> </Laboratory> </Department> </College> • XML document C: <Factory> <Department> <Laboratory> <Supervisor></Supervisor> </Laboratory> </Department> </Factory> • Question: is Sim(A, B) = Sim(A, C)? [Figure: the trees of documents A, B and C side by side]

  8. Introduction and motivation • Motivation: how can existing XML comparison approaches be enhanced to take into account both the structural and the semantic characteristics of XML documents?

  9. Introduction and motivation • We consider heterogeneous XML documents lacking predefined grammars (DTDs or XML Schemas) • XML documents published on the web are often found without grammars • Goal: to put forward an improved XML comparison method integrating semantic and structural similarity

  10. Overview • Introduction and motivation • Current solutions • Proposal • Implementation • Conclusion

  11. Overview • Introduction and motivation • Current solutions • XML structural similarity • Semantic similarity • Proposal • Implementation • Conclusion

  12. Current solutions: XML structural similarity • Most algorithms proposed in the literature use dynamic programming techniques to find the Edit Distance between trees • That is, they find the cheapest sequence of edit operations that can transform one tree into another

  13. Current solutions: XML structural similarity • Algorithms can be distinguished by: • The set of edit operations allowed • Insert node: insertion of inner/leaf nodes • Delete node: deletion of inner/leaf nodes • Update node: relabelling nodes • Insert tree • Delete tree • Move tree • The overall complexity and performance • O(N² D²) • O(N²) • O(N log N) • Optimality [Figure: example trees with nodes labeled a, b, c, d, e, h, i, j, y, z]

  14. Overview • Introduction and motivation • Current solutions • XML structural similarity • Semantic similarity • Proposal • Implementation • Conclusion

  15. Current solutions: Semantic similarity • Knowledge bases (thesauri, taxonomies, ontologies) provide a framework for organizing words into a semantic space • Semantic similarity between two words: • Similarity between the corresponding concepts in the knowledge base [Figure: a hierarchy of concepts c1 to c9, with words/expressions attached to concepts]

  16. Current solutions: Semantic similarity • Several methods have been proposed in the literature: • Edge-based approaches • Node-based approaches • Node-based approaches seem more relevant • Experimental results yield higher correlation with human judgment • Information content of a concept c = -log p(c) [Figure: a hierarchy of concepts c1 to c9, with words/expressions attached to concepts]
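
The information content formula can be illustrated with a toy taxonomy. The sketch below assumes that p(c) is estimated as the relative frequency of a concept together with all of its descendants, which is a common convention for node-based measures but is not stated on the slide; the taxonomy, the counts and the function names are all hypothetical.

```python
import math

# Hypothetical toy taxonomy: concept -> parent (None marks the root).
PARENT = {
    "organization": None,
    "institution": "organization",
    "academy": "institution",
    "college": "institution",
    "factory": "organization",
}
# Hypothetical corpus frequencies of each concept's own words.
COUNT = {"organization": 10, "institution": 10, "academy": 20,
         "college": 20, "factory": 40}


def probabilities(parent, count):
    """Estimate p(c) from the counts of c and all of its descendants."""
    total = sum(count.values())
    cumulative = dict.fromkeys(parent, 0)
    for concept, freq in count.items():
        node = concept
        while node is not None:  # credit the concept and every ancestor
            cumulative[node] += freq
            node = parent[node]
    return {c: cumulative[c] / total for c in parent}


def information_content(p):
    """Information content of a concept with probability p: -log p."""
    return -math.log(p)


P = probabilities(PARENT, COUNT)
for concept in PARENT:
    print(concept, P[concept], information_content(P[concept]))
```

With this convention the root has probability 1 and therefore carries no information, while more specific concepts lower in the hierarchy carry more.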

  17. Overview • Introduction and motivation • Current solutions • Proposal • Implementation • Conclusion

  18. Proposal: Hybrid approach • Edit Distance algorithm (Chawathe [2]) • Semantic cost model (Lin [3]) [2] Chawathe S., Comparing Hierarchical Data in External Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999 [3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998

  19. Proposal • We adopt Chawathe’s Edit Distance algorithm [2] • It is a direct application of Wagner-Fischer [4] • It is among the fastest available • Edit operations used: • Insertion of leaf nodes: Ins(x, i, p, λ(x)) • Deletion of leaf nodes: Del(x, p) • Update of internal/leaf nodes: Upd(x, y) • Complexity: O(N²) [4] Wagner R. and Fischer M., The String-to-String Correction Problem. Journal of the Association for Computing Machinery, 21(1):168-173, 1974
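
As a minimal sketch of the Wagner-Fischer recurrence mentioned on this slide, the Python below computes an edit distance between two sequences by dynamic programming. It is the plain sequence algorithm applied to preorder label lists, not Chawathe's tree algorithm, which additionally restricts insertions and deletions using node depths; the function name `edit_distance`, its cost parameters and the label lists are illustrative assumptions.

```python
def edit_distance(source, target,
                  cost_ins=lambda y: 1,
                  cost_del=lambda x: 1,
                  cost_upd=lambda x, y: 0 if x == y else 1):
    """Wagner-Fischer dynamic programming over two sequences."""
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + cost_del(source[i - 1])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + cost_ins(target[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost_upd(source[i - 1], target[j - 1]),
                dist[i - 1][j] + cost_del(source[i - 1]),
                dist[i][j - 1] + cost_ins(target[j - 1]),
            )
    return dist[m][n]


# Preorder label sequences of the three example documents.
DOC_A = ["Academy", "Department", "Laboratory", "Professor", "Student"]
DOC_B = ["College", "Department", "Laboratory", "Lecturer"]
DOC_C = ["Factory", "Department", "Laboratory", "Supervisor"]

print(edit_distance(DOC_A, DOC_B), edit_distance(DOC_A, DOC_C))
```

The table has (m + 1) × (n + 1) cells, each filled in constant time, which is where the quadratic bound comes from. Passing the three costs as parameters leaves room for a non-unit cost model.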

  20. Proposal • A central question in most edit distance approaches: how to assign edit operation costs? • Intuitive cost model: • CostIns = 1 • CostDel = 1 • CostUpd = 1 when x.l ≠ y.l, otherwise CostUpd = 0

  21. Proposal • Applying Chawathe’s approach [2] to XML documents A (Academy), B (College) and C (Factory) • Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3]) • Dist(A, B) = Dist(A, C) = 3 • Sim = 1 / (1 + Dist) • Sim(A, B) = Sim(A, C) = 0.25 • How can semantic similarity be taken into account? [Figure: the trees of documents A, B and C with numbered nodes]
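
The arithmetic on this slide can be checked with a short sketch. It assumes the intuitive unit-cost model from the previous slide and represents the edit script as plain tuples; the names `UNIT_COST`, `EDIT_SCRIPT` and `similarity` are illustrative only.

```python
# Unit costs of the intuitive cost model (every listed operation costs 1).
UNIT_COST = {"Ins": 1, "Del": 1, "Upd": 1}

# The edit script from the slide, as (operation, first argument, second argument).
EDIT_SCRIPT = [
    ("Upd", "A[1]", "B[1]"),
    ("Upd", "A[4]", "B[4]"),
    ("Del", "A[5]", "A[3]"),
]


def similarity(distance):
    """Sim = 1 / (1 + Dist): identical documents (Dist = 0) score 1."""
    return 1 / (1 + distance)


dist = sum(UNIT_COST[op] for op, _, _ in EDIT_SCRIPT)
print(dist, similarity(dist))  # 3 0.25
```

Because every operation costs the same regardless of the labels involved, documents B and C end up equally similar to A under this model.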

  22. Proposal • Solution: we propose to vary edit operation costs according to the semantics of the nodes concerned • Semantic cost model: • Varying operation costs w.r.t. the semantic relatedness of node labels: CostSem_Op(x, y) • Varying operation costs w.r.t. the corresponding node depths: CostDepth_Op(x) • CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]

  23. Proposal: Label semantic similarity cost • Edit operations: • CostSem_Upd(x, y) = 1 – SimSem(x.l, y.l) • CostSem_Ins(x, i, p, λ(x)) = 1 – SimSem(λ(x), p.l) • CostSem_Del(x, p) = 1 – SimSem(x.l, p.l) • CostSem_Op decreases when SimSem increases • CostSem_Op increases when SimSem decreases

  24. Proposal: Label semantic similarity cost • Semantic similarity measure adopted: Lin [3] • SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2)), with C the lowest common ancestor of C1 and C2 (maximizing their pair-wise similarity value) • SimSem(C1, C2) ∈ [0, 1]
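
A minimal sketch of this measure over a toy taxonomy follows. The taxonomy, the concept probabilities and the helper names are hypothetical; only the formula itself comes from the slide.

```python
import math

# Hypothetical toy taxonomy: concept -> parent (None marks the root).
PARENT = {
    "organization": None,
    "institution": "organization",
    "academy": "institution",
    "college": "institution",
    "factory": "organization",
}
# Hypothetical concept probabilities P(C); the root has probability 1.
P = {"organization": 1.0, "institution": 0.5, "academy": 0.2,
     "college": 0.2, "factory": 0.4}


def ancestors(concept):
    """Path from a concept up to the root, the concept itself included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lowest_common_ancestor(c1, c2):
    seen = set(ancestors(c1))
    for candidate in ancestors(c2):  # walks upward, so the first hit is lowest
        if candidate in seen:
            return candidate
    raise ValueError("concepts do not share a root")


def sim_lin(c1, c2):
    """SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2))."""
    denominator = math.log(P[c1]) + math.log(P[c2])
    if denominator == 0:  # both concepts are the root itself
        return 1.0
    common = lowest_common_ancestor(c1, c2)
    return 2 * math.log(P[common]) / denominator


print(sim_lin("academy", "college"), sim_lin("academy", "factory"))
```

With these made-up probabilities, Academy and College meet at a fairly specific concept and score higher than Academy and Factory, which only meet at the root and score zero.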

  25. Proposal: Label semantic similarity cost • Example • Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3]) • CostSem_Upd(A[1], B[1]) = 1 – SimSem(Academy, College) • CostSem_Upd(A[1], C[1]) = 1 – SimSem(Academy, Factory) • SimSem(Academy, College) > SimSem(Academy, Factory) • CostSem_Upd(A[1], B[1]) < CostSem_Upd(A[1], C[1]) • Dist(A, B) < Dist(A, C) • Sim(A, B) > Sim(A, C) [Figure: the trees of documents A, B and C with numbered nodes]
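
The effect of the label-based costs on this example can be sketched as follows. All SimSem values are made up for illustration, and the edit script from A to C is assumed to mirror the one given for A to B; only the cost definitions come from the slides.

```python
# Hypothetical SimSem values in [0, 1]; real values would come from a taxonomy.
SIM_SEM = {
    ("Academy", "College"): 0.8,
    ("Academy", "Factory"): 0.2,
    ("Professor", "Lecturer"): 0.7,
    ("Professor", "Supervisor"): 0.4,
    ("Student", "Laboratory"): 0.3,
}


def cost_sem(label_x, label_y):
    """CostSem_Op = 1 - SimSem of the two labels the operation involves."""
    return 1 - SIM_SEM[(label_x, label_y)]


# Each operation is reduced to the pair of labels its cost depends on:
# Upd(x, y) -> (x.l, y.l) and Del(x, p) -> (x.l, p.l).
SCRIPT_A_TO_B = [("Academy", "College"), ("Professor", "Lecturer"),
                 ("Student", "Laboratory")]
# Assumed analogous script for document C.
SCRIPT_A_TO_C = [("Academy", "Factory"), ("Professor", "Supervisor"),
                 ("Student", "Laboratory")]


def distance(script):
    return sum(cost_sem(x, y) for x, y in script)


dist_ab = distance(SCRIPT_A_TO_B)
dist_ac = distance(SCRIPT_A_TO_C)
print(dist_ab < dist_ac)  # True
```

Both scripts contain the same three operations, so under unit costs the distances would tie; the label similarities are what separate B from C here.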

  26. Proposal • Semantic cost model: • Varying operation costs w.r.t. the semantic relatedness of node labels: CostSem_Op(x, y) • Varying operation costs w.r.t. the node depths: CostDepth_Op(x) • CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]

  27. Proposal: Node depth cost • CostDepth_Op(x) = 1 / (1 + x.d) ∈ [0, 1], where x.d is the depth of node x • Information becomes increasingly specific as one descends in the XML tree hierarchy • Its semantic effect on the whole XML document decreases accordingly • Editing the root node of a document tree: CostDepth_Op(root) = 1 • Operation costs decrease when moving downward in the hierarchy [Figure: tree of XML document A (Academy, Department, Laboratory, Professor, Student), with the label Hospital shown at the root]
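
A short sketch of the depth factor and of the combined operation cost follows. It assumes the two components are multiplied, which is consistent with the combined cost staying in [0, 1]; the function names are illustrative.

```python
def cost_depth(depth):
    """CostDepth_Op(x) = 1 / (1 + x.d): 1 at the root, smaller further down."""
    return 1 / (1 + depth)


def cost_op(sim_sem, depth):
    """Combined cost: (1 - SimSem) scaled by the depth factor of the node."""
    return (1 - sim_sem) * cost_depth(depth)


# The depth factor for the four levels of the example document (depths 0 to 3).
for depth in range(4):
    print(depth, cost_depth(depth))
```

For instance, an update between two unrelated labels (SimSem = 0) costs 1 at the root but only 0.25 at depth 3, so changes near the leaves weigh less in the overall distance.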

  28. Proposal • Hybrid XML comparison approach: • Edit Distance computations (Chawathe [2]) • Semantic similarity evaluation (Lin [3])

  29. Overview • Introduction and motivation • Current solutions • Proposal • Implementation • Conclusion

  30. Implementation • Prototype XS3 (XML Structure and Semantic Similarity) • XML document comparison • 1/1: comparing one document with another • 1/∞: ranking documents according to their similarity degrees • ∞/∞: XML document classification/clustering

  31. Implementation • Synthetic XML documents generator • Producing sets of XML documents based on given DTDs • Taxonomic analyzer • Computing semantic similarity values between words in a given knowledge base (taxonomy)

  32. Implementation • Experimental results • Higher average similarity values, underlining similarities (of semantic nature) that were previously undetected • Clear distinction between documents corresponding to different DTDs • Capturing semantic affinities between document sets • DTD1: <!DOCTYPE DTD1 [ <!ELEMENT Academy (Administrative unit+)> <!ELEMENT Administrative unit (Branch?)> <!ELEMENT Branch (Educator?, Student+)> <!ELEMENT Educator (#PCDATA)> <!ELEMENT Student (#PCDATA)> ]> • DTD2: <!DOCTYPE DTD2 [ <!ELEMENT School (Administrative unit+)> <!ELEMENT Administrative unit (Section?)> <!ELEMENT Section (Educator?, Scholar*)> <!ELEMENT Educator (#PCDATA)> <!ELEMENT Scholar (#PCDATA)> ]> • DTD3: <!DOCTYPE DTD3 [ <!ELEMENT Government (Administrative unit+)> <!ELEMENT Administrative unit (Section?)> <!ELEMENT Section (Professional?, Worker+)> <!ELEMENT Professional (#PCDATA)> <!ELEMENT Worker (#PCDATA)> ]> [Figure: chart comparing combined structural and semantic similarity with structural similarity alone; axis values range from 0.085 to 0.099]

  33. Implementation • Experimental results • Chawathe’s classical Edit Distance process [2] is linear in the number of nodes of each tree: O(|A| |B|) • Our approach is of polynomial complexity [Figure: timing plots; axes: number of nodes in each taxonomy, Time (m) and Time (s)]

  34. Overview • Introduction and motivation • Current solutions • Proposal • Implementation • Conclusion

  35. Conclusion • Goal: developing an integrated semantic and structure-based XML similarity approach for comparing XML documents, taking into account: • The semantic meaning of XML elements/attributes w.r.t. their labels and depths • The structural characteristics of XML documents • This is the first attempt to combine Edit Distance structural similarity computations with IR semantic similarity assessment in an XML context • Experimental results are satisfactory

  36. Conclusion • Future work • Exploiting semantic similarity to compare not only the structure of XML documents but also their information content (values) • In such a framework, XML Schemas seem indispensable • Example: <Factory> <Department> <Laboratory> <Product> BMW Z3 </Product> <Product> BMW X5 </Product> </Laboratory> </Department> </Factory> • Studying XML similarity in a multimedia context (MPEG7, SVG, ...) • Taking into consideration structural, semantic, as well as multimedia-specific criteria

  37. Thank you • Questions?
