1 / 30

Generic Schema Matching using Cupid

Generic Schema Matching using Cupid. Jayant Madhavan University of Washington. Philip A. Bernstein Erhard Rahm Microsoft Research University of Leipzig. PO. PurchaseOrder. POLines. Items. DeliverTo. POShipTo. POShipTo. POShipTo.

amora
Download Presentation

Generic Schema Matching using Cupid

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Generic Schema Matching using Cupid Jayant Madhavan University of Washington Philip A. Bernstein Erhard Rahm Microsoft Research University of Leipzig

  2. PO PurchaseOrder POLines Items DeliverTo POShipTo POShipTo POShipTo DeliverTo DeliverTo Item Item Name Address Name City Street Line ItemNumber City Street UnitofMeasure UoM Line ItemNumber Qty Quantity UnitofMeasure UnitofMeasure UoM UoM Qty Qty Quantity Quantity Schema Matching VLDB 2001 Roma Italy

  3. The Problem • Given two schemas obtain a mapping between them that identifies corresponding elements • Minimize user involvement (semi-automatic) • Data model independent matching (generic) • A hard problem • Naming and structural differences in schemas • Similar, but non-identical concepts modeled • Multiple data models – SQL DDL, XML, ODMG… VLDB 2001 Roma Italy

  4. Motivation • Important component in many applications • Data Integration • Data Migration • E-Commerce • Model Management [Bernstein, Halevy, Pottinger ’00] • Algebra for manipulating models and mappings • Match, Merge, Compose … VLDB 2001 Roma Italy

  5. Individual matchers Schema-based Content-based Hybrid Composite manual composition Per-Element Structural Per-Element automatic composition Constraint-based Constraint-based Constraint-based Linguistic Linguistic • Names • Descriptions • Types • Keys • Graph matching • IR (word frequencies, key terms) • Value pattern and ranges Schema Matching Approaches Combined matchers Taxonomy based survey [Rahm,Bernstein’00] VLDB 2001 Roma Italy

  6. Related Work • Hybrid approaches for schema integration • DIKE [Palopoli, Sacca, Ursino, Terracina] • MOMIS [Bergamaschi, Castano, Vincini] • Linguistic and Instance based • SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li] • Instance based Multi-strategy learning • LSD [Doan, Domingos, Halevy] • Others • Hybrid rule based - Transcm [Milo, Zohar] • Query Discovery - CLIO [Haas, Hernandez, Miller] VLDB 2001 Roma Italy

  7. Contributions • Taxonomy of schema matching approaches • Cupid system that exploits linguistic, data-type, structure and referential integrity information • New algorithm that exploits schema structure • Experimental validation and comparison with other systems VLDB 2001 Roma Italy

  8. Schema 1 Linguistic Matching Schema 2 Thesaurus Structure Matching Generate Mapping Output Mapping Cupid architecture LSIM SSIM WSIM VLDB 2001 Roma Italy

  9. Linguistic Matching • Heuristic name matching • Tokenization of names POOrderNum  PO, Order, Num • Expansion of short-forms, acronyms PO Purchase, Order; Num Number • Clustering of schema elements based on keywords and data-types Street, City, POAddress  Address • Thesaurus of synonyms, hypernyms, acronyms • Linguistic Similarity coefficient (lsim)  [0,1] VLDB 2001 Roma Italy

  10. DeliverTo DeliverTo POShipTo POShipTo Item Item Name Name City Street Line ItemNumber City Street UoM UnitofMeasure Qty Quantity Structure Matching PO PurchaseOrder POLines Items Item Item Address Name City Street Line ItemNumber City Street UoM UnitofMeasure Qty Quantity VLDB 2001 Roma Italy

  11. PO PurchaseOrder POLines POLines Items Items Item Item Item Item Ssim ++ Ssim ++ Line Line ItemNum ItemNum Ssim ++ Ssim ++ UoM UoM UnitofMeasure UnitofMeasure Ssim ++ Ssim ++ Qty Qty Quantity Quantity Structure MatchMutually Reinforcing Similarity PO PurchaseOrder Wsim > thhigh POLines Items Wsim > thhigh Item Item Line ItemNum UoM UnitofMeasure Qty Quantity VLDB 2001 Roma Italy

  12. POBillTo InvoiceTo POBillTo InvoiceTo InvoiceTo InvoiceTo POBillTo POBillTo DeliverTo POShipTo POShipTo POShipTo POShipTo POShipTo Address Address Address City City City City City City Ssim-- City Street Ssim++ Structure MatchContext dependent disambiguation PO PurchaseOrder POBillTo InvoiceTo DeliverTo POShipTo Address Address Address City Street City Street Ssim++ City Street VLDB 2001 Roma Italy

  13. Intuition • Atomic elements are similar • Linguistically and data-type similar • Their ancestors are similar • Compound elements (non-leaf) are similar if • Linguistically similar • Subtrees rooted at the elements are similar • Mutually recursive • Leaves determine internal node similarity • Similarity of internal nodes leads to increase in leaf similarity VLDB 2001 Roma Italy

  14. Structure Match details • Subtrees are similar if • Immediate children are similar • Leaf sets are similar • Subtree Similarity (nodes s and t) • Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa • Less sensitive to variation in intermediate structure • Pruning the number of comparisons • Elements must have comparable number of leaves VLDB 2001 Roma Italy

  15. Order-Customer-fk Order-Customer-fk Schema B Customer-Purchase-Order Referential Integrity Customer Purchase Order • Join nodes added to the schema tree for each referential integrity constraint • Views can be similarly used Customer ID Order ID Address Customer ID Name Product Name Schema A VLDB 2001 Roma Italy

  16. Schema 1 Schema 2 Cupid architecture Lsim Linguistic Matching Thesaurus Ssim,Wsim Structure Matching Generate Mapping Output Mapping Linguistic Similarity (Lsim) Structural(Ssim), Weighted(Wsim) similarity VLDB 2001 Roma Italy

  17. Mapping Generation • Individual mapping elements computed from Wsim values • Consider only mapping pairs that have Wsim greater than threshold • For each element of target find most similar source element • Not accepted mappings with high similarity are returned in order to help user modify map VLDB 2001 Roma Italy

  18. Schema 1 Schema 2 Input hint Cupid Architecture Lsim Linguistic Matching Thesaurus Ssim,Wsim Structure Matching Generate Mapping Output Mapping VLDB 2001 Roma Italy

  19. DIKE MOMIS Cupid Canonical Examples Real World Examples Experimental Validation • DIKE • Graph Matching of ER models • No Lsim component (LSPD entries) • MOMIS • Class Level Matching of OO descriptions • Word senses manually chosen from WordNet VLDB 2001 Roma Italy

  20. Evaluation InsightsLinguistic Similarity • Cupid is less sensitive to name variations due to token level manipulations • MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques • MOMIS has a interface to WordNet • Word senses need to be chosen manually • Choosing a single sense is not always possible • Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid) VLDB 2001 Roma Italy

  21. Evaluation InsightsStructural Similarity • DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements • Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match • Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting • Context-dependent matching in Cupid resolves mapping ambiguity • Linguistic similarity with complete path names (and no structural similarity) is insufficient VLDB 2001 Roma Italy

  22. Summary • Taxonomy of schema matching approaches • Cupid system that performs linguistic and structural matching • New algorithm for exploiting schema structure • Comparative evaluation VLDB 2001 Roma Italy

  23. Future Work • Towards a more robust solution • Auto-tuning parameters • Thesaurus Generation and Evolution • More scalability testing • Schema matching component architecture • Easily extensible by adding multiple techniques • Data Instances for matching • Mapping, Expression and Query Discovery • Model Management VLDB 2001 Roma Italy

  24. Model Management • Other recent publications • A Model Theory for Generic Schema Management, DBPL 2001 • Generic Model Management – A Database Infrastructure for Schema Manipulation, CoopIS 2001 • A Vision for Management of Complex Models, Sigmod Record, Dec 2000 • Data Warehouse Scenarios for Model Management, ER 2000 • More information • http://data.cs.washington.edu/model/ • http://www.cs.washington.edu/homes/jayant • MSR Technical Report • Talk to us for a demo VLDB 2001 Roma Italy

  25. End of the talk VLDB 2001 Roma Italy

  26. Quantity=Pounds2Kgs(Qty) ItemNumber=concat(“Itm”,Line) Schema Matching PurchaseOrder PO Items Lines Count Item Item ItemNumber Price Quantity Line Unit Qty • For each Lines create Items • For each Item create Item • ItemNumber = concat(“Itm”, Line) • Price= “Unknown” • Quantity = Pounds2Kgs(Qty) • Count = Number of Item in Lines VLDB 2001 Roma Italy

  27. Tree Match Tree Match (Schema tree S, Schema tree T) For each pair of leaves initialize ssim to be their data-type compatibility For each s in S (post order) For each t in T(post order) Compute ssim(s,t) = structural-similarity(s,t) wsim(s,t) = g(lsim(s,t), ssim(s,t)) If (wsim(s,t) > thhigh) Inc-struct-similarity(leaves(s), leaves(t)) If (wsim(s,t) < thlow) Dec-struct-similarity(leaves(s), leaves(t) VLDB 2001 Roma Italy

  28. PO PurchaseOrder POLines Items POBillTo DeliverTo InvoiceTo Item ItemCount City City Count Address Address Line ItemNumber Street Street UoM UnitofMeasure City Street Qty Quantity City Street Tree Match (example) POShipTo Item VLDB 2001 Roma Italy

  29. Canonical Examples VLDB 2001 Roma Italy

  30. Real world example VLDB 2001 Roma Italy

More Related