Schema Matching • Matching Large XML Schemas (Erhard Rahm, Hong-Hai Do, Sabine Maßmann) • Putting Context into Schema Matching (Philip Bohannon, Eiman Elnahrawy, Wenfei Fan, Michael Flaster) • COMA - A System for Flexible Combination of Schema Matching Approaches (Hong-Hai Do, Erhard Rahm) Christiano Santiago
Goals • Introductory concepts in schema matching • Context-sensitive versus context-insensitive matching • Complexity of XSD schemas
Agenda • Terminology • Different Approaches • XML Schema Definition • Context-Insensitive • Context-Sensitive • Q&A
Terminology • Schema matching: the process of identifying that two objects are semantically related (meaning). • Mapping: the transformations between the matched objects (conversion).
Terminology • Student (Name, SSN, Level, Major, Marks) • GradStudent (Name, ID, Major, Grades)
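The difference between a match (meaning) and a mapping (conversion) can be sketched in Python using the two schemas above. The correspondences (SSN with ID, Marks with Grades), the grade boundaries, and the function names are illustrative assumptions, not part of the slides.

```python
# Match result: which elements are semantically related (the "meaning").
# These pairs are assumed for illustration only.
correspondences = {
    "Name": "Name",
    "SSN": "ID",
    "Major": "Major",
    "Marks": "Grades",
}

def marks_to_grade(marks):
    """Hypothetical conversion from numeric marks to a letter grade."""
    if marks >= 80:
        return "A"
    if marks >= 70:
        return "B"
    if marks >= 60:
        return "C"
    return "F"

# Mapping: how matched values are transformed (the "conversion").
# Attributes without an entry here are copied unchanged.
conversions = {"Marks": marks_to_grade}

def map_student(student):
    """Translate a Student record into a GradStudent record."""
    grad = {}
    for src, tgt in correspondences.items():
        convert = conversions.get(src, lambda value: value)
        grad[tgt] = convert(student[src])
    return grad

student = {"Name": "Ann", "SSN": "123", "Level": "Grad", "Major": "CS", "Marks": 85}
grad_student = map_student(student)
```

Level has no counterpart in GradStudent, so it is dropped by the mapping.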
Schema Matching
Context • Context-insensitive • Context-sensitive
Different Approaches • Schema-level matchers • Instance-level matchers • Hybrid matchers • Reusing matching information
Schema-Level Matchers • Only consider schema information • Name • Description • Data type • Relationship • Constraints • Number of nesting levels
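A minimal sketch of a schema-level matcher that looks only at names and data types. The compatibility table, the weight of 0.7, and the element types are assumptions made for illustration; real systems use richer linguistic and structural evidence.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity of two element names, in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical data type compatibility table; unlisted pairs score 0.0.
TYPE_COMPAT = {
    ("string", "string"): 1.0,
    ("int", "int"): 1.0,
    ("int", "string"): 0.5,
    ("string", "int"): 0.5,
}

def type_similarity(t1, t2):
    return TYPE_COMPAT.get((t1, t2), 0.0)

def schema_level_similarity(e1, e2, name_weight=0.7):
    """Weighted sum of name similarity and data type similarity."""
    n = name_similarity(e1["name"], e2["name"])
    t = type_similarity(e1["type"], e2["type"])
    return name_weight * n + (1 - name_weight) * t

# Element types are assumed; the slides list attribute names only.
student_name = {"name": "Name", "type": "string"}
grad_name = {"name": "Name", "type": "string"}
grad_id = {"name": "ID", "type": "int"}

same = schema_level_similarity(student_name, grad_name)
different = schema_level_similarity(student_name, grad_id)
```

No instance data is read at any point, which is what distinguishes this family from instance-level matchers.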
Instance-Level Matchers • Use instance-level data to gather insight into the content and meaning of schema elements • Linguistic: Dept, DeptName, EmpName • Constraints (value patterns): 416-7362100 (e.g. a phone number), M3J1P3 (e.g. a postal code)
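A constraint-based instance matcher can be sketched as pattern recognition over column values. The regular expressions below are assumptions chosen to fit the two sample values on the slide; the labels, the 0.8 threshold, and the function names are hypothetical.

```python
import re

# Hypothetical value patterns fitted to the slide's sample values.
PATTERNS = {
    "phone": re.compile(r"^\d{3}-\d{7}$"),
    "postal_code": re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$"),
}

def classify_column(values, min_fraction=0.8):
    """Return the pattern label that most values satisfy, or None."""
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= min_fraction:
            return label
    return None

def instance_match(column_a, column_b):
    """Two columns are match candidates if their values share a pattern."""
    label_a = classify_column(column_a)
    return label_a is not None and label_a == classify_column(column_b)

phones = ["416-7362100"]
postal_codes = ["M3J1P3"]
```

Columns whose names share nothing can still be paired this way, because only the shape of the stored values is compared.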
Hybrid Matchers • Combine more than one approach
Reusing Matching Information • Use previous matching information for future matching tasks • Structures or substructures often repeat • Caution: reuse depends on the application; Salary and Income may be treated as identical in Payroll but not in Tax Reporting
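One way to reuse earlier results is to compose two stored match results through a shared intermediate schema. This is a sketch only: multiplying the similarity scores is an assumed combination rule, and the element names and scores are hypothetical.

```python
def compose(match_ab, match_bc):
    """Derive A-to-C matches from stored A-to-B and B-to-C matches.

    Each argument maps (source, target) pairs to a similarity in 0..1.
    Scores are multiplied along the chain (an assumption), so confidence
    drops with every reuse step.
    """
    match_ac = {}
    for (a, b), sim_ab in match_ab.items():
        for (b2, c), sim_bc in match_bc.items():
            if b == b2:
                sim = sim_ab * sim_bc
                if sim > match_ac.get((a, c), 0.0):
                    match_ac[(a, c)] = sim
    return match_ac

# Hypothetical stored results from two earlier matching tasks.
payroll_match = {("Salary", "Income"): 0.9}
tax_match = {("Income", "TaxableIncome"): 0.5}

reused = compose(payroll_match, tax_match)
```

The composed score stays low here, which reflects the caution above: a correspondence accepted in one application should be reviewed before it is reused in another.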
XML Schema Definition (XSD) • Data types • 19 built-in primitive data types • 25 built-in derived data types • User-defined complex types
XML Schema Definition (XSD) • Complex type definition:
<complexType name="myNewNameType">
  <complexContent>
    <restriction base="anyType">
      <sequence>
        <element name="name" type="string" />
        <element name="location" type="string" />
      </sequence>
      <attribute name="position" type="string" />
    </restriction>
  </complexContent>
</complexType>
<element name="employee" type="dc:myNewNameType" />
• Instance:
<dc:employee position="trainer">
  <dc:name>Don Smith</dc:name>
  <dc:location>Dallas, TX</dc:location>
</dc:employee>
• name and location are child elements; position is an attribute.
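A matcher first has to read such a definition. The sketch below uses Python's standard xml.etree.ElementTree to list the child elements and attributes of the complex type. The xs prefix and the schema wrapper element are added here so that the fragment parses on its own; they are not on the slide.

```python
import xml.etree.ElementTree as ET

# Standard XML Schema namespace, in ElementTree's {uri}tag notation.
XS = "{http://www.w3.org/2001/XMLSchema}"

# The slide's complex type, wrapped in a schema element with an
# explicit namespace declaration (added for this sketch).
xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="myNewNameType">
    <xs:complexContent>
      <xs:restriction base="xs:anyType">
        <xs:sequence>
          <xs:element name="name" type="xs:string"/>
          <xs:element name="location" type="xs:string"/>
        </xs:sequence>
        <xs:attribute name="position" type="xs:string"/>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
""".strip()

root = ET.fromstring(xsd)
complex_type = root.find(f"{XS}complexType")

# Collect declared child elements and attributes in document order.
child_elements = [e.get("name") for e in complex_type.iter(f"{XS}element")]
attributes = [a.get("name") for a in complex_type.iter(f"{XS}attribute")]
```

Both lists become nodes in the schema representation that the matchers operate on.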
XML Schema Definition (XSD) • Shared schema components
XML Schema Definition (XSD) • Match system approaches to shared components • COMA: path-based • Cupid: materialized • Scalability issue: the xCBL Order schema contains 1451 components, including 91 shared types. After resolving the shared components, 26000+ nodes/paths were identified.
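The growth from components to paths can be reproduced on a toy schema. In this sketch a schema is a dictionary from each component to its children, and a shared component simply has more than one parent. The schema and all names in it are hypothetical.

```python
def count_paths(children, root):
    """Count root-to-node paths in a rooted DAG.

    Every distinct path corresponds to one node of the tree obtained
    by resolving (copying) all shared components.
    """
    memo = {}

    def paths_below(node):
        # One path ends at the node itself, plus all paths through its children.
        if node not in memo:
            memo[node] = 1 + sum(paths_below(c) for c in children.get(node, []))
        return memo[node]

    return paths_below(root)

# Hypothetical schema: Address is shared by BillTo and ShipTo.
schema = {
    "Order": ["BillTo", "ShipTo"],
    "BillTo": ["Address"],
    "ShipTo": ["Address"],
    "Address": ["Street", "City"],
}

components = set(schema) | {c for cs in schema.values() for c in cs}
total_paths = count_paths(schema, "Order")
```

Six components produce nine paths, because the Address subtree is reached twice. With many shared types nested inside each other the same effect compounds.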
XML Schema Definition (XSD) • Distributed schemas • XSD allows a schema to be distributed over several schema documents (.xsd files) and namespaces
XML Schema Definition (XSD) • Determining the similarity between complex types, and matching them, can be as difficult as matching two complete schemas.
Standard Schema Matching Context-Insensitive • Matchers: matching algorithms that compute similarity scores between a pair of attributes • Weights: scores are weighted; confidence scores are identified based on standard statistical techniques • Selection of best matches
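The three steps (matcher scores, weighting, selection) can be sketched as follows. A plain weighted average stands in for the statistical techniques mentioned on the slide, which are not specified here; the matcher names, scores, weights, and threshold are all hypothetical.

```python
def combine(scores_by_matcher, weights):
    """Weighted average of per-matcher similarity scores for each pair."""
    total = sum(weights.values())
    combined = {}
    for matcher, scores in scores_by_matcher.items():
        for pair, score in scores.items():
            combined[pair] = combined.get(pair, 0.0) + weights[matcher] * score / total
    return combined

def select_best(combined, threshold=0.5):
    """Keep, per source attribute, the best target at or above the threshold."""
    best = {}
    for (src, tgt), score in combined.items():
        if score >= threshold and score > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}

# Hypothetical scores from two matchers for Student -> GradStudent.
scores = {
    "name": {("SSN", "ID"): 0.2, ("SSN", "Major"): 0.1, ("Marks", "Grades"): 0.4},
    "instance": {("SSN", "ID"): 0.9, ("SSN", "Major"): 0.0, ("Marks", "Grades"): 0.8},
}
weights = {"name": 1, "instance": 1}

selected = select_best(combine(scores, weights))
```

Neither pair passes the threshold on name evidence alone; the combination with instance evidence is what lifts SSN/ID and Marks/Grades above it.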
Fragment-Based Schema Matching Context-Insensitive • Fragment identification • Identifying fragment-pair candidates • Fragment matching • Result combination
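The four steps can be sketched end to end. This is a simplified illustration, not the prototype's implementation: fragments are assumed to be top-level subtrees, similarity is plain string similarity on names, and the schemas and thresholds are hypothetical.

```python
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def identify_fragments(schema):
    """Step 1: treat each top-level subtree as a fragment (an assumption).

    A schema here maps a fragment root name to its list of leaf names.
    """
    return dict(schema)

def candidate_pairs(frags_a, frags_b, threshold=0.5):
    """Step 2: keep fragment pairs whose root names are similar enough."""
    return [(fa, fb) for fa in frags_a for fb in frags_b if sim(fa, fb) >= threshold]

def match_fragments(leaves_a, leaves_b, threshold=0.6):
    """Step 3: match elements inside a single fragment pair only."""
    result = {}
    if not leaves_b:
        return result
    for a in leaves_a:
        best = max(leaves_b, key=lambda b: sim(a, b))
        if sim(a, best) >= threshold:
            result[a] = best
    return result

def fragment_based_match(schema_a, schema_b):
    """Step 4: combine the per-fragment results into one match result."""
    frags_a = identify_fragments(schema_a)
    frags_b = identify_fragments(schema_b)
    combined = {}
    for fa, fb in candidate_pairs(frags_a, frags_b):
        for a, b in match_fragments(frags_a[fa], frags_b[fb]).items():
            combined[f"{fa}.{a}"] = f"{fb}.{b}"
    return combined

# Hypothetical schemas with one similar fragment each.
schema_a = {"Address": ["street", "city"], "Contact": ["phone"]}
schema_b = {"Address": ["street", "city"], "Item": ["price"]}

result = fragment_based_match(schema_a, schema_b)
```

Only one of the four possible fragment pairs is compared element by element, which is how the approach reduces the search space for large schemas.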
Prototype • Based on COMA (COmbining MAtch algorithms) • Support for schemas distributed over multiple files • Multiple matching strategies • Fragment-based approach • Result combination
COMA • Schema representation • Schemas are represented by rooted DAGs (Directed Acyclic Graphs).
COMA • Directed Acyclic Graphs • Directed graph • With no cycles • Part tree and part graph • Used in critical path analysis, expression tree evaluation, and game evaluation
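Whether a schema graph really is a DAG can be checked with a topological sort. The sketch below uses Kahn's algorithm on a dictionary from each node to its children; the graphs and names are hypothetical.

```python
from collections import deque

def is_dag(children):
    """Return True if the directed graph has no cycles (Kahn's algorithm)."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    indegree = {n: 0 for n in nodes}
    for cs in children.values():
        for c in cs:
            indegree[c] += 1

    # Start from nodes without incoming edges and remove them one by one.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for c in children.get(node, []):
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(nodes)

# Shared component (two parents), but no cycle: part tree, part graph.
shared = {"Order": ["BillTo", "ShipTo"], "BillTo": ["Address"], "ShipTo": ["Address"]}
# A recursive definition introduces a cycle.
recursive = {"Part": ["SubPart"], "SubPart": ["Part"]}
```

A node with several parents, such as Address, is what makes the structure a graph rather than a tree while still remaining acyclic.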
COMA • Match processing reusability
Continuity of this work • 2004: COMA prototype • 2005: COMA++, an extension of the previous COMA prototype • High quality and fast execution times • Default combination of 4 matchers • 2007: MOMA (Mapping-based Object Matching)
Context Schema Matching Context-Sensitive • False negatives: matches that hold only under a condition are missed by context-insensitive matching • RS.price.price → RT.music.price when RS.price.prcode = “reg” • RS.price.price → RT.music.sale when RS.price.prcode = “sale”
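The price example can be sketched by scoring the source column once as a whole and once per value of the context attribute. This is an illustration only: the Jaccard overlap is a stand-in for a real instance matcher, and the rows, threshold, and function names are hypothetical.

```python
def overlap(values_a, values_b):
    """Jaccard overlap of two value sets (stand-in instance matcher)."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def contextual_matches(rows, value_col, context_col, target_cols, threshold=0.6):
    """Score the value column against each target, per context value."""
    matches = []
    for ctx in sorted({r[context_col] for r in rows}):
        subset = [r[value_col] for r in rows if r[context_col] == ctx]
        for tgt, tgt_values in target_cols.items():
            score = overlap(subset, tgt_values)
            if score >= threshold:
                matches.append((f"{context_col} = {ctx!r}", value_col, tgt, score))
    return matches

# Hypothetical source table RS.price and target columns of RT.music.
rows = [
    {"prcode": "reg", "price": 14.99},
    {"prcode": "sale", "price": 9.99},
    {"prcode": "reg", "price": 12.99},
    {"prcode": "sale", "price": 7.99},
]
targets = {"music.price": [14.99, 12.99], "music.sale": [9.99, 7.99]}

# Without context, the whole column overlaps only partly with either target.
unconditional = overlap([r["price"] for r in rows], targets["music.price"])
matches = contextual_matches(rows, "price", "prcode", targets)
```

The unconditional score falls below the threshold, so a context-insensitive matcher would report nothing (a false negative), while both conditional matches are found.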
Context Schema Matching Context-Sensitive • Two techniques for selecting contextual matches: • MultiTable: find the single match with the highest confidence for every target attribute • QualTable: find the best matches on a per-table basis
Context Schema Matching Context-Sensitive • Experimental Results • “Because of its poor performance, MultiTable is not considered further”
Conclusion • Current schema matching approaches still need to improve for large and complex schemas. • The large search space increases both the likelihood of false matches and execution times. • Further difficulties for schema matching are posed by the high expressive power and versatility of modern schema languages like XSD.
Questions