1 / 27

CSE 636 Data Integration

CSE 636 Data Integration. Schema Matching Cupid. Fall 2006. Virtual Integration Architecture. Wrapper. Wrapper. Design-Time. Run-Time. . Schema Matching. Query Reformulation. Query. Result. End User. Mediation Language. Optimization & Execution. Mediator. Global Schema.

goro
Download Presentation

CSE 636 Data Integration

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 636Data Integration Schema Matching Cupid Fall 2006

  2. Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time  Schema Matching Query Reformulation Query Result End User Mediation Language Optimization & Execution Mediator Global Schema Web Services XML Data Source Data Source Local Schema Local Schema

  3. Schema Heterogeneity Independently created schemas… … might be modeling similar information… … in slightly different ways DB1 DB2 DB3 ugrad * course * student * ugradID courseID studentID title name name type enrollment * student * courseID studentID course * ugradID name courseID grade type title ? evaluation type letter

  4. Schema Heterogeneity • Similar entities represented • Dissimilar structures (inverted nesting) • Different element names for similar data values • Similar element names for different data values DB1 DB2 DB3 ugrad * course * student * ugradID courseID studentID title name name type enrollment * student * courseID studentID course * ugradID name courseID grade type title ? evaluation type letter

  5. Schema Matching vs. Schema Mapping • GAV and LAV are schema mapping languages • Mappings: • set of queries • associations + semantics • Match: • set of associations only • Schema Matching: • Identifying associations • First step towards constructing mappings

  6. Schema Matching vs. Schema Mapping DB1 DB3 ugrad * student * ugradID studentID name name type enrollment * courseID course * ugradID courseID grade title ? type letter Semantics for $s1 in DB3/student where $s1/type = ‘UGRAD’ return <DB1> <ugrad> <ugradID>{$s1/studentID}</ugradID> <name>{$s1/name}</name> </ugrad> </DB1> LAV Mapping: DB1  Q(DB3) Associations

  7. The Problem of Schema Matching Input • Schemas S1 and S2 • Possibly data instances for S1 and S2 • Background knowledge • thesauri • validated matches • standard schemas • reference instances • ontologies • constraints (keys, data types etc) Output • Associations between S1 and S2 Goal • Schema matching tools with significant automated support

  8. Schema Matching How is the match result expressed? • Pairs of paths • Lists of paths • Schema names DB2 DB3 course * student * courseID studentID title name type student * studentID course * name courseID type title ? evaluation type letter

  9. Schema Matching What do we match? • Depends on the queries we want to ask • Elements in isolation (leaves in particular) • Substructures • Whole schemas

  10. Motivation • Important component in many applications • Data Integration • Data Migration • E-Commerce • Model Management[Bernstein, Halevy, Pottinger ’00] • Algebra for manipulating models and mappings • Match, Merge, Compose …

  11. Problems • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • Schema matching is a hard problem • Naming and structural differences in schemas • Similar, but non-identical concepts modeled • Multiple data models – SQL DDL, XML, ODMG…

  12. Schema Matching Approaches Combined matchers Individual matchers Schema-based Content-based Hybrid Composite manual composition automatic composition Per-Element Structural Per-Element Constraint-based Constraint-based Constraint-based Linguistic Linguistic • Names • Descriptions • Types • Keys • Graph matching • IR (word frequencies, key terms) • Value pattern and ranges How to match? Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001

  13. Cupid Individual matchers Schema-based Content-based Per-Element Structural Per-Element Constraint-based Constraint-based Constraint-based Linguistic Linguistic • Names • Descriptions • Types • Keys • Graph matching • IR (word frequencies, key terms) • Value pattern and ranges Madhavan, Bernstein and Rahm, VLDB, 2001 Combined matchers Hybrid Composite manual composition automatic composition

  14. Cupid Example Address City Street PO PurchaseOrder POLines Items POShipTo DeliverTo Item Item Name Name City Street Line ItemNumber UnitofMeasure UoM Qty Quantity

  15. Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Generate Mapping Output Mapping LSIM SSIM WSIM

  16. Linguistic Matching • Heuristic name matching • Tokenization of names POOrderNum  PO, Order, Num • Expansion of short-forms, acronyms PO  Purchase, Order; Num Number • Clustering of schema elements based on keywords and data-types Street, City, POAddress  Address • Thesaurus of synonyms, hypernyms, acronyms • Linguistic Similarity coefficient (LSIM)  [0,1]

  17. Structure Matching DeliverTo POShipTo PO PurchaseOrder POLines Items Item Item Name Address Name City Street Line ItemNumber City Street UoM UnitofMeasure Qty Quantity

  18. Structure MatchingMutually Reinforcing Similarity SSIM++ SSIM++ SSIM++ PO PurchaseOrder WSIM > thhigh POLines Items WSIM > thhigh Item Item Line ItemNum UoM UnitofMeasure Qty Quantity

  19. Structure MatchingContext Dependent Disambiguation Address SSIM-- City Street SSIM++ PO PurchaseOrder POBillTo InvoiceTo DeliverTo POShipTo Address City Street City Street SSIM++ City Street

  20. Intuition • Atomic elements are similar • Linguistically and data-type similar • Their ancestors are similar • Compound elements (non-leaf) are similar if • Linguistically similar • Subtrees rooted at the elements are similar • Mutually recursive • Leaves determine internal node similarity • Similarity of internal nodes leads to increase in leaf similarity

  21. Structure Match Details • Subtrees are similar if • Immediate children are similar • Leaf sets are similar • Subtree Similarity (nodes s and t) • Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa • Less sensitive to variation in intermediate structure • Pruning the number of comparisons • Elements must have comparable number of leaves

  22. Referential Integrity Order-Customer-fk Customer-Purchase-Order Order-Customer-fk • Join nodes added to the schema tree for each referential integrity constraint • Views can be similarly used Customer Purchase Order Customer ID Order ID Address Customer ID Name Product Name Schema A Schema B

  23. Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Generate Mapping Output Mapping LSIM SSIM WSIM Linguistic Similarity (LSIM) Structural (SSIM), Weighted (WSIM) Similarity

  24. Mapping Generation • Individual mapping elements computed from WSIM values: • Consider only mapping pairs that have WSIM greater than threshold • For each element of target find most similar source element • Not accepted mappings with high similarity are returned in order to help user modify map

  25. Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Input hint LSIM SSIM WSIM Generate Mapping Output Mapping

  26. Work Needed • A more robust solution • Auto-tuning parameters • Thesaurus Generation and Evolution • Schema matching component architecture • Easily extensible by adding multiple techniques • Data Instances for matching • Look at COMA & ProtoPlasm systems

  27. References • J. Madhavan, P. A. Bernstein, E. RahmGeneric Schema Matching with CupidVLDB, 2001 • H. H. Do, E. Rahm:COMA - A System for Flexible Combination of Schema Matching ApproachesVLDB, 2002 • P. A. Bernstein, S. Melnik, M. Petropoulos, C. QuixIndustrial-Strength Schema MatchingSIGMOD Record 33(4), 2004

More Related