A Hybrid Approach for XML Similarity
Joe TEKLI, Richard CHBEIR, Kokou YETONGNON
Overview
• Introduction and motivation
• Current Solutions
• Proposal
• Implementation
• Conclusion
Introduction and motivation
• XML (eXtensible Markup Language)
• Major means for efficient data representation and management
• An XML document comes down to an Ordered Labeled Tree
• Example:
<Academy>
  <Department>
    <Laboratory>
      <Professor>Martin R.</Professor>
      <Student>Roberts J.</Student>
    </Laboratory>
  </Department>
</Academy>
[Figure: the corresponding tree, nodes numbered in preorder: Academy (1), Department (2), Laboratory (3), Professor (4), Student (5). Node depths run from 0 at the root to 3 at the leaves.]
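
The tree view of the example can be sketched in a few lines of Python. This is only an illustration using the standard library parser; the function name `preorder_nodes` is hypothetical, and the numbering convention (preorder rank from 1, root depth 0) is assumed from the figure on this slide.

```python
import xml.etree.ElementTree as ET

XML_DOC = """
<Academy>
  <Department>
    <Laboratory>
      <Professor>Martin R.</Professor>
      <Student>Roberts J.</Student>
    </Laboratory>
  </Department>
</Academy>
"""


def preorder_nodes(xml_text):
    """Return (preorder_rank, label, depth) triples, one per XML element."""
    root = ET.fromstring(xml_text.strip())
    nodes = []

    def visit(element, depth):
        # Rank is assigned on first visit, which yields a preorder numbering.
        nodes.append((len(nodes) + 1, element.tag, depth))
        for child in element:  # children come back in document order
            visit(child, depth + 1)

    visit(root, 0)
    return nodes


for rank, label, depth in preorder_nodes(XML_DOC):
    print(rank, label, depth)
```

Element values such as "Martin R." are ignored here, since the slides compare document structure and labels only.
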
Introduction and motivation
• XML has become indispensable
• Current applications:
  • Information description, storage and retrieval
  • Database information interchange
  • Web services interaction
• Information destined to be broadcast over the web is now represented using XML
Introduction and motivation: emergent need
• Information retrieval: XML document comparison
Introduction and motivation
• A range of algorithms for comparing semi-structured data, e.g. XML documents, has been proposed. They:
  • Generally exploit the concept of Edit Distance
  • Focus on the structure of XML documents
  • Ignore the semantics involved
• However, in the field of information retrieval (IR), estimating semantic similarity between web pages is of key importance to improving search results [1]
[1] Maguitman A. G., Menczer F., Roinestad H. and Vespignani A., Algorithmic Detection of Semantic Similarity. In Proceedings of the 14th International World Wide Web Conference, 107-116, Chiba, Japan, 2005
Introduction and motivation
• Example:
XML document A:
<Academy>
  <Department>
    <Laboratory>
      <Professor></Professor>
      <Student></Student>
    </Laboratory>
  </Department>
</Academy>
XML document B:
<College>
  <Department>
    <Laboratory>
      <Lecturer></Lecturer>
    </Laboratory>
  </Department>
</College>
XML document C:
<Factory>
  <Department>
    <Laboratory>
      <Supervisor></Supervisor>
    </Laboratory>
  </Department>
</Factory>
• Is Sim(A, B) = Sim(A, C)?
Introduction and motivation: motivation
• How can existing XML comparison approaches be enhanced to take into consideration both the structural and the semantic characteristics of XML documents?
Introduction and motivation
• We consider heterogeneous XML documents, lacking predefined grammars (DTDs or XML Schemas)
  • XML documents published on the web are often found without grammars
• Goal: to put forward an improved XML comparison method integrating semantic and structural similarity
Overview
• Introduction and motivation
• Current solutions
  • XML structural similarity
  • Semantic similarity
• Proposal
• Implementation
• Conclusion
Current solutions: XML structural similarity
• Most algorithms proposed in the literature use dynamic programming techniques to find the Edit Distance between trees
• Finding the cheapest sequence of edit operations that can transform one tree into another
Current solutions: XML structural similarity
• Algorithms can be distinguished by:
  • The set of edit operations allowed
    • Insert node: insertion of inner/leaf nodes
    • Delete node: deletion of inner/leaf nodes
    • Update node: relabelling nodes
    • Insert tree
    • Delete tree
    • Move tree
  • The overall complexity and performance
    • O(N² D²)
    • O(N²)
    • O(N log(N))
  • Optimality
Current solutions: semantic similarity
• Knowledge bases (thesauri, taxonomies, ontologies) provide a framework for organizing words into a semantic space
• Semantic similarity between two words:
  • Similarity between the corresponding concepts in the knowledge base
[Figure: a knowledge base drawn as a hierarchy of concepts c1 to c9, with words/expressions attached to concepts.]
Current solutions: semantic similarity
• Several methods are proposed in the literature:
  • Edge-based approaches
  • Node-based approaches
• Node-based approaches seem more relevant
  • Experimental results yield higher correlation with human judgment
• Information content of a concept c = -log p(c)
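
As an illustration of the node-based idea, the sketch below computes the information content -log p(c) over a toy taxonomy. The taxonomy, the counts and the function names are all hypothetical (none come from the slides), and p(c) is assumed to be estimated as the share of occurrences falling under c or any of its descendants.

```python
import math

# Hypothetical taxonomy: concept -> parent concept (None marks the root).
PARENT = {
    "organization": None,
    "school": "organization",
    "academy": "school",
    "college": "school",
    "factory": "organization",
}

# Hypothetical corpus counts of words attached directly to each concept.
COUNT = {"organization": 10, "school": 20, "academy": 5, "college": 15, "factory": 50}


def is_subsumed_by(concept, ancestor):
    """True if `ancestor` is `concept` itself or lies on its path to the root."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT[concept]
    return False


def concept_probability(concept):
    """p(c): fraction of all occurrences that fall under `concept`."""
    total = sum(COUNT.values())
    under = sum(n for c, n in COUNT.items() if is_subsumed_by(c, concept))
    return under / total


def information_content(concept):
    """Information content of a concept: -log p(c)."""
    return -math.log(concept_probability(concept))


for name in ("organization", "school", "academy"):
    print(name, concept_probability(name), information_content(name))
```

With these made-up counts the root carries no information (p = 1), and information content grows as concepts become more specific.
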
Proposal: hybrid approach
• Edit Distance algorithm (Chawathe [2])
• Semantic cost model (Lin [3])
[2] Chawathe S., Comparing Hierarchical Data in External Memory. In Proceedings of the Twenty-fifth International Conference on Very Large Data Bases, Edinburgh, Scotland, U.K., p. 90-101, 1999
[3] Lin D., An Information-Theoretic Definition of Similarity. In Proceedings of the 15th International Conference on Machine Learning, 296-304, Morgan Kaufmann Publishers Inc., 1998
Proposal
• We adopt Chawathe's Edit Distance algorithm [2]
  • It is a direct application of Wagner-Fischer [4]
  • It is among the fastest available
• Edit operations used:
  • Insertion of leaf nodes: Ins(x, i, p, λ(x))
  • Deletion of leaf nodes: Del(x, p)
  • Update of internal/leaf nodes: Upd(x, y)
• Complexity: O(N²)
[4] Wagner R. and Fischer M., The String-to-String Correction Problem. Journal of the Association for Computing Machinery, 21(1):168-173, 1974
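
For readers unfamiliar with the underlying technique, here is a minimal sketch of the classical Wagner-Fischer dynamic program on plain strings. It is the string algorithm cited as [4], not Chawathe's tree algorithm, and it assumes unit costs; the function name is illustrative.

```python
def string_edit_distance(source, target):
    """Minimum number of insertions, deletions and updates turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = distance between the first i items of source and first j of target.
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i  # delete everything
    for j in range(1, cols):
        dist[0][j] = j  # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            update = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + update,  # update (or keep) the item
                dist[i - 1][j] + 1,           # delete from source
                dist[i][j - 1] + 1,           # insert into source
            )
    return dist[-1][-1]


print(string_edit_distance("kitten", "sitting"))
```

The table has (|source| + 1) × (|target| + 1) cells and each is filled in constant time, which is where the quadratic bound comes from.
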
Proposal
• A central question in most edit distance approaches: how to assign edit operation costs?
• Intuitive cost model:
  • CostIns = 1
  • CostDel = 1
  • CostUpd = 1 when x.l ≠ y.l, otherwise CostUpd = 0
Proposal
• Applying Chawathe's approach [2]
[Figure: trees of XML documents A, B and C with nodes numbered in preorder. A: Academy (1), Department (2), Laboratory (3), Professor (4), Student (5). B: College (1), Department (2), Laboratory (3), Lecturer (4). C: Factory (1), Department (2), Laboratory (3), Supervisor (4).]
• Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3])
• Dist(A, B) = Dist(A, C) = 3
• Sim = 1 / (1 + Dist)
• Sim(A, B) = Sim(A, C) = 0.25
• How can semantic similarity be taken into account?
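
The figures on this slide can be reproduced with a short sketch. It is written in the spirit of Chawathe's approach, not taken from it: trees are assumed to be encoded as preorder lists of (label, depth) pairs, costs are the unit costs of the intuitive model, and the depth conditions that restrict insertions and deletions to leaf nodes are an assumption about how the algorithm in [2] works.

```python
import math

# Documents A, B, C from the example, as preorder (label, depth) pairs.
DOC_A = [("Academy", 0), ("Department", 1), ("Laboratory", 2), ("Professor", 3), ("Student", 3)]
DOC_B = [("College", 0), ("Department", 1), ("Laboratory", 2), ("Lecturer", 3)]
DOC_C = [("Factory", 0), ("Department", 1), ("Laboratory", 2), ("Supervisor", 3)]


def tree_edit_distance(a, b):
    """Unit-cost edit distance between two trees given as (label, depth) lists."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + 1  # delete every node of a
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + 1  # insert every node of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            (label_a, depth_a), (label_b, depth_b) = a[i - 1], b[j - 1]
            update = delete = insert = math.inf
            # Update: only nodes at the same depth may be matched.
            if depth_a == depth_b:
                update = dist[i - 1][j - 1] + (0 if label_a == label_b else 1)
            # Delete a[i-1]: allowed if b is exhausted or b's next node is not deeper.
            if j == n or b[j][1] <= depth_a:
                delete = dist[i - 1][j] + 1
            # Insert b[j-1]: allowed if a is exhausted or a's next node is not deeper.
            if i == m or a[i][1] <= depth_b:
                insert = dist[i][j - 1] + 1
            dist[i][j] = min(update, delete, insert)
    return dist[m][n]


def similarity(a, b):
    """Sim = 1 / (1 + Dist)."""
    return 1 / (1 + tree_edit_distance(a, b))


print(tree_edit_distance(DOC_A, DOC_B), similarity(DOC_A, DOC_B))
print(tree_edit_distance(DOC_A, DOC_C), similarity(DOC_A, DOC_C))
```

With unit costs, College and Factory are equally far from Academy, which is exactly the limitation the semantic cost model is meant to address.
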
Proposal
• Solution: we propose to vary edit operation costs according to the semantics of the nodes concerned
• Semantic cost model:
  • Varying operation costs w.r.t. the semantic relatedness of node labels: CostSem_Op(x, y)
  • Varying costs w.r.t. corresponding node depths: CostDepth_Op(x)
• CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
Proposal: label semantic similarity cost
• Edit operations:
  • CostSem_Upd(x, y) = 1 - SimSem(x.l, y.l)
  • CostSem_Ins(x, i, p, λ(x)) = 1 - SimSem(λ(x), p.l)
  • CostSem_Del(x, p) = 1 - SimSem(x.l, p.l)
• CostSem_Op decreases when SimSem increases
• CostSem_Op increases when SimSem decreases
Proposal: label semantic similarity cost
• Semantic similarity measure adopted: Lin [3]
• SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2)) ∈ [0, 1]
  with C the lowest common ancestor of C1 and C2 (maximizing their pair-wise similarity value)
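
A minimal sketch of Lin's measure over a toy taxonomy follows. The taxonomy and the concept probabilities are hypothetical values chosen for illustration, and the taxonomy is assumed to be a tree, so that the lowest common ancestor is unique.

```python
import math

# Hypothetical taxonomy: concept -> parent concept (None marks the root).
PARENT = {
    "organization": None,
    "school": "organization",
    "academy": "school",
    "college": "school",
    "factory": "organization",
}

# Hypothetical concept probabilities P(C); the root subsumes everything.
PROB = {"organization": 1.0, "school": 0.4, "academy": 0.05, "college": 0.15, "factory": 0.5}


def ancestors(concept):
    """The concept itself followed by its ancestors, up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lin_similarity(c1, c2):
    """SimSem(C1, C2) = 2 log P(C) / (log P(C1) + log P(C2))."""
    if c1 == c2:
        return 1.0
    others = set(ancestors(c2))
    # First ancestor of c1 that also subsumes c2: the lowest common ancestor.
    common = next(c for c in ancestors(c1) if c in others)
    return 2 * math.log(PROB[common]) / (math.log(PROB[c1]) + math.log(PROB[c2]))


print(lin_similarity("academy", "college"))
print(lin_similarity("academy", "factory"))
```

Under these made-up probabilities, Academy and College share the specific ancestor "school" and score higher than Academy and Factory, whose only common ancestor is the root.
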
Proposal: label semantic similarity cost
• Example
  • Edit script = Upd(A[1], B[1]), Upd(A[4], B[4]), Del(A[5], A[3])
  • CostSem_Upd(A[1], B[1]) = 1 - SimSem(Academy, College)
  • CostSem_Upd(A[1], C[1]) = 1 - SimSem(Academy, Factory)
  • SimSem(Academy, College) > SimSem(Academy, Factory)
  • CostSem_Upd(A[1], B[1]) < CostSem_Upd(A[1], C[1])
  • Dist(A, B) < Dist(A, C)
  • Sim(A, B) > Sim(A, C)
Proposal
• Semantic cost model:
  • Varying operation costs w.r.t. the semantic relatedness of node labels: CostSem_Op(x, y)
  • Varying operation costs w.r.t. the node depths: CostDepth_Op(x)
• CostOp(x, y) = CostSem_Op(x, y) × CostDepth_Op(x) ∈ [0, 1]
Proposal: node depth cost
• CostDepth_Op(x) = 1 / (1 + x.d) ∈ [0, 1]
• Information becomes increasingly specific as one descends in the XML tree hierarchy
  • Its semantic effect on the whole XML document decreases accordingly
• Editing the root node of a document tree: CostDepth_Op(root) = 1
• Operation costs decrease when moving downward in the hierarchy
[Figure: tree of XML document A, with node labels Academy, Hospital, Department, Laboratory, Professor, Student.]
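
The two cost factors can be combined as in the sketch below. It reads the overall cost as the product of the semantic and depth factors, which is an assumption since the operator is not visible on the slides, and the SimSem values passed in are made-up numbers, not results from the authors.

```python
def depth_cost(depth):
    """CostDepth_Op(x) = 1 / (1 + x.d): 1 at the root, smaller further down."""
    return 1.0 / (1 + depth)


def update_cost(sim_sem, depth):
    """Assumed combined cost of Upd(x, y): (1 - SimSem(x.l, y.l)) * CostDepth_Op(x)."""
    return (1.0 - sim_sem) * depth_cost(depth)


# Hypothetical similarity values, for illustration only.
SIM_ACADEMY_COLLEGE = 0.8
SIM_ACADEMY_FACTORY = 0.2

# Updating the root (depth 0) keeps the full semantic cost.
print(update_cost(SIM_ACADEMY_COLLEGE, 0))
print(update_cost(SIM_ACADEMY_FACTORY, 0))

# The same label change applied at depth 3 costs a quarter as much.
print(update_cost(SIM_ACADEMY_FACTORY, 3))
```

Both factors lie in [0, 1], so their product does too, and an update between closely related labels deep in the tree costs close to nothing.
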
Proposal
• Edit Distance computations (Chawathe [2]) combined with semantic similarity evaluation (Lin [3])
• Result: Hybrid XML Comparison Approach
Implementation
• Prototype XS3 (XML Structure and Semantic Similarity)
• XML document comparison
  • 1/1: pairwise comparison of two documents
  • 1/∞: ranking documents according to their similarity degrees
  • ∞/∞: XML document classification/clustering
Implementation
• Synthetic XML document generator
  • Producing sets of XML documents based on given DTDs
• Taxonomic analyzer
  • Computing semantic similarity values between words in a given knowledge base (taxonomy)
Implementation
• Experimental results
  • Higher average similarity values, underlining similarities (of semantic nature) that were previously undetected
  • Clear distinction between documents corresponding to different DTDs
  • Capturing semantic affinities between document sets
<!DOCTYPE DTD1 [
<!ELEMENT Academy (Administrative unit+)>
<!ELEMENT Administrative unit (Branch?)>
<!ELEMENT Branch (Educator?, Student+)>
<!ELEMENT Educator (#PCDATA)>
<!ELEMENT Student (#PCDATA)>
]>
<!DOCTYPE DTD2 [
<!ELEMENT School (Administrative unit+)>
<!ELEMENT Administrative unit (Section?)>
<!ELEMENT Section (Educator?, Scholar*)>
<!ELEMENT Educator (#PCDATA)>
<!ELEMENT Scholar (#PCDATA)>
]>
<!DOCTYPE DTD3 [
<!ELEMENT Government (Administrative unit+)>
<!ELEMENT Administrative unit (Section?)>
<!ELEMENT Section (Professional?, Worker+)>
<!ELEMENT Professional (#PCDATA)>
<!ELEMENT Worker (#PCDATA)>
]>
[Figure: similarity values (axis from 0.085 to 0.099) comparing combined structural and semantic similarity with structural similarity alone.]
Implementation
• Experimental results
  • Chawathe's classical Edit Distance process [2] is linear in the number of nodes of each tree: O(|A| |B|)
  • Our approach is of polynomial complexity
[Figure: timing results plotted against the number of nodes in each taxonomy, with time axes in minutes and in seconds.]
Conclusion
• Goal: developing an integrated semantic and structure-based XML similarity approach, for comparing XML documents, taking into account:
  • Semantic meaning of XML elements/attributes w.r.t. their labels and depths
  • Structural characteristics of XML documents
• This is the first attempt to combine Edit Distance structural similarity computations with IR semantic similarity assessment, in an XML context
• Experimental results are satisfactory
Conclusion
• Future work
  • Exploiting semantic similarity to compare not only the structure of XML documents, but also their information content (values)
    • In such a framework, XML Schemas seem indispensable
  • Studying XML similarity in a multimedia context (MPEG7, SVG, ...)
    • Taking into consideration structural, semantic, as well as multimedia-specific criteria
• Example of element values:
<Factory>
  <Department>
    <Laboratory>
      <Product> BMW Z3 </Product>
      <Product> BMW X5 </Product>
    </Laboratory>
  </Department>
</Factory>
Thank you Questions …