280 likes | 437 Views
CSE 636 Data Integration. Schema Matching Cupid. Fall 2006. Virtual Integration Architecture. Wrapper. Wrapper. Design-Time. Run-Time. . Schema Matching. Query Reformulation. Query. Result. End User. Mediation Language. Optimization & Execution. Mediator. Global Schema.
E N D
CSE 636Data Integration Schema Matching Cupid Fall 2006
Virtual Integration Architecture Wrapper Wrapper Design-Time Run-Time Schema Matching Query Reformulation Query Result End User Mediation Language Optimization & Execution Mediator Global Schema Web Services XML Data Source Data Source Local Schema Local Schema
Schema Heterogeneity Independently created schemas… … might be modeling similar information… … in slightly different ways DB1 DB2 DB3 ugrad * course * student * ugradID courseID studentID title name name type enrollment * student * courseID studentID course * ugradID name courseID grade type title ? evaluation type letter
Schema Heterogeneity • Similar entities represented • Dissimilar structures (inverted nesting) • Different element names for similar data values • Similar element names for different data values DB1 DB2 DB3 ugrad * course * student * ugradID courseID studentID title name name type enrollment * student * courseID studentID course * ugradID name courseID grade type title ? evaluation type letter
Schema Matching vs. Schema Mapping • GAV and LAV are schema mapping languages • Mappings: • set of queries • associations + semantics • Match: • set of associations only • Schema Matching: • Identifying associations • First step towards constructing mappings
Schema Matching vs. Schema Mapping DB1 DB3 ugrad * student * ugradID studentID name name type enrollment * courseID course * ugradID courseID grade title ? type letter Semantics for $s1 in DB3/student where $s1/type = ‘UGRAD’ return <DB1> <ugrad> <ugradID>{$s1/studentID}</ugradID> <name>{$s1/name}</name> </ugrad> </DB1> LAV Mapping: DB1 Q(DB3) Associations
The Problem of Schema Matching Input • Schemas S1 and S2 • Possibly data instances for S1 and S2 • Background knowledge • thesauri • validated matches • standard schemas • reference instances • ontologies • constraints (keys, data types etc) Output • Associations between S1 and S2 Goal • Schema matching tools with significant automated support
Schema Matching How is the match result expressed? • Pairs of paths • Lists of paths • Schema names DB2 DB3 course * student * courseID studentID title name type student * studentID course * name courseID type title ? evaluation type letter
Schema Matching What do we match? • Depends on the queries we want to ask • Elements in isolation (leaves in particular) • Substructures • Whole schemas
Motivation • Important component in many applications • Data Integration • Data Migration • E-Commerce • Model Management[Bernstein, Halevy, Pottinger ’00] • Algebra for manipulating models and mappings • Match, Merge, Compose …
Problems • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • Schema matching is a hard problem • Naming and structural differences in schemas • Similar, but non-identical concepts modeled • Multiple data models – SQL DDL, XML, ODMG…
Schema Matching Approaches Combined matchers Individual matchers Schema-based Content-based Hybrid Composite manual composition automatic composition Per-Element Structural Per-Element Constraint-based Constraint-based Constraint-based Linguistic Linguistic • Names • Descriptions • Types • Keys • Graph matching • IR (word frequencies, key terms) • Value pattern and ranges How to match? Taxonomy based survey: Rahm and Bernstein, VLDB J, 2001
Cupid Individual matchers Schema-based Content-based Per-Element Structural Per-Element Constraint-based Constraint-based Constraint-based Linguistic Linguistic • Names • Descriptions • Types • Keys • Graph matching • IR (word frequencies, key terms) • Value pattern and ranges Madhavan, Bernstein and Rahm, VLDB, 2001 Combined matchers Hybrid Composite manual composition automatic composition
Cupid Example Address City Street PO PurchaseOrder POLines Items POShipTo DeliverTo Item Item Name Name City Street Line ItemNumber UnitofMeasure UoM Qty Quantity
Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Generate Mapping Output Mapping LSIM SSIM WSIM
Linguistic Matching • Heuristic name matching • Tokenization of names POOrderNum PO, Order, Num • Expansion of short-forms, acronyms PO Purchase, Order; Num Number • Clustering of schema elements based on keywords and data-types Street, City, POAddress Address • Thesaurus of synonyms, hypernyms, acronyms • Linguistic Similarity coefficient (LSIM) [0,1]
Structure Matching DeliverTo POShipTo PO PurchaseOrder POLines Items Item Item Name Address Name City Street Line ItemNumber City Street UoM UnitofMeasure Qty Quantity
Structure MatchingMutually Reinforcing Similarity SSIM++ SSIM++ SSIM++ PO PurchaseOrder WSIM > thhigh POLines Items WSIM > thhigh Item Item Line ItemNum UoM UnitofMeasure Qty Quantity
Structure MatchingContext Dependent Disambiguation Address SSIM-- City Street SSIM++ PO PurchaseOrder POBillTo InvoiceTo DeliverTo POShipTo Address City Street City Street SSIM++ City Street
Intuition • Atomic elements are similar • Linguistically and data-type similar • Their ancestors are similar • Compound elements (non-leaf) are similar if • Linguistically similar • Subtrees rooted at the elements are similar • Mutually recursive • Leaves determine internal node similarity • Similarity of internal nodes leads to increase in leaf similarity
Structure Match Details • Subtrees are similar if • Immediate children are similar • Leaf sets are similar • Subtree Similarity (nodes s and t) • Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa • Less sensitive to variation in intermediate structure • Pruning the number of comparisons • Elements must have comparable number of leaves
Referential Integrity Order-Customer-fk Customer-Purchase-Order Order-Customer-fk • Join nodes added to the schema tree for each referential integrity constraint • Views can be similarly used Customer Purchase Order Customer ID Order ID Address Customer ID Name Product Name Schema A Schema B
Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Generate Mapping Output Mapping LSIM SSIM WSIM Linguistic Similarity (LSIM) Structural (SSIM), Weighted (WSIM) Similarity
Mapping Generation • Individual mapping elements computed from WSIM values: • Consider only mapping pairs that have WSIM greater than threshold • For each element of target find most similar source element • Not accepted mappings with high similarity are returned in order to help user modify map
Cupid Architecture Linguistic Matching Schema 1 Schema 2 Thesaurus Structure Matching Input hint LSIM SSIM WSIM Generate Mapping Output Mapping
Work Needed • A more robust solution • Auto-tuning parameters • Thesaurus Generation and Evolution • Schema matching component architecture • Easily extensible by adding multiple techniques • Data Instances for matching • Look at COMA & ProtoPlasm systems
References • J. Madhavan, P. A. Bernstein, E. RahmGeneric Schema Matching with CupidVLDB, 2001 • H. H. Do, E. Rahm:COMA - A System for Flexible Combination of Schema Matching ApproachesVLDB, 2002 • P. A. Bernstein, S. Melnik, M. Petropoulos, C. QuixIndustrial-Strength Schema MatchingSIGMOD Record 33(4), 2004