170 likes | 191 Views
Schema Matching Algorithms. Phil Bernstein CSE 590sw February 2003. Acknowledgments. Many of these slides are from presentations by Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig). Mapping Schemas.
E N D
Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003
Acknowledgments • Many of these slides are from presentations by • Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig)
Mapping Schemas • Given two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation). • Applications • Web site integration • Catalog integration • Schema evolution • Data translation • Reverse engineering • Data warehouse loading • XML message translation • Ontology integration
Partitioning the Problem • Schema matching (aka mapping discovery) • Given two schemas, return a set of corres-pondences that specify pairs of related terms • Semantic Mapping (aka query discovery) • Given correspondences between two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation).
TheSchema Matching Problem • Types of schemas: • Database schemas, XML schemas, ontologies, …, • Input: • Two (or more) schemas, S1 and S2 • Possibly data instances for S1 and S2 • Background knowledge – thesauri, validated matches, constraints (keys, data types), standard schemas, ontologies, NLP, etc. • Output: • A mapping between S1 and S2 Phils’ mods are in underscored Times Roman Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Tool 3 (Data Warehousing schemas) Tool 2 (E-Business schemas) Tool 4 (Database Design) Tool 1 (Portal schemas) Schema import/ export Global libraries (dictionaries, schemas …) Generic Match Implementation Internal Schema Representation Generic Match Implementation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Match Example 1 Epinions.com Yahoo! Shopping Electronics and Photography Electronics Camcorders Video DV Camcorders Remote Controls Computers and Software Digital Cameras Home PDAs and Handhelds Computers & Internet Computer Hardware PDAs Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Containment link Node Match Example 2 CREATE TABLE PurchaseOrder.Customer ( custNo INT, custName VARCHAR(50), custStreet VARCHAR(50), custCity VARCHAR(50), custZip VARCHAR(10), PRIMARY KEY (custNo) ) ; CREATE TABLE PurchaseOrder.ShipTo ( poNo INT, custNo INT REFERENCES PO1.Customer, shipToStreet VARCHAR(50), shipToCity VARCHAR(50), shipToZip VARCHAR(10), PRIMARY KEY (poNo) ) ; <xsd:schema xmlns:xsd="http://www.w3.org/XMLSchema"> <xsd:complexType name=“POrder" > <xsd:sequence> <xsd:element name=“DeliverTo" type="Address"/> <xsd:element name=“BillTo" type="Address"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="Address" > <xsd:sequence> <xsd:element name=“Street" type="xsd:string"/> <xsd:element name=“City" type="xsd:string"/> <xsd:element name=“Zip" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:schema> a) A relational schema and an XML schema POrder PurchaseOrder Customer ShipTo DeliverTo BillTo custName Address custStreet shipToStreet shipToCity custCity custZip shipToZip City Street Zip Legends: b) Their corresponding graph representation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Tool Example (Biztalk Mapper) Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Current Situation • Finding mappings is now the bottleneck! • largely done by hand • labor intensive, tedious & error prone • Will only get worse • data sharing & XML become pervasive • proliferation of DTDs and XML schemas • translation of legacy data • reconciling ontologies on semantic web • Need semi-automatic approaches to scale up! Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Why Matching is Difficult • Aims to identify same real-world entity • using names, structures, types, data values, etc. • Schemas represent same entity differently • different names => same entity (synonyms): • client & user => customer • same names => different entities (homonyms): • bug=>insect or software error • Schema & data never fully capture semantics! • not adequately documented, not sufficiently expressive • data values suffer from synonyms and homonyms too • Intended semantics is typically subjective! • Cannot be fully automated. Often hard for humans. Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Desiderata for Match Solution • Low degree of manual work • Accuracy, efficiency, ease of use • Extensibility • Exploit additional match techniques • Exploit additional background knowledge • Support for Reuse • Exploit previous manual or automatically generated matchings • Generic approach • Different schema languages • Different application areas Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Special Situations • Match schema S to an incremental modification of S • Can ignore homonyms and possibly synonyms • Little if any reshaping of the structure • Instances probably don’t help • Lightweight integration for the semantic web vs E-commerce or data warehouse loading. • The former can’t afford much human review • The latter needs “perfect” mappings and hence human review
Automatic Match Approaches • Individual approaches • Combining approaches: hybrid vs. composite Schema-based Instance-based Reuse-oriented Element Structure Element Structure Element • Dictionaries • Thesauri • Previous match results Constraint-based Constraint-based Linguistic Constraint-based Linguistic • Types • Keys • Parents • Children • Leaves • IR (word frequencies, key terms) • Value pattern and ranges • Names • Descriptions Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Real matches Suggested matches A: False Negatives B: True Positives C: False Positives D: True Negatives A B C D Match Quality Measures • Comparison of automatically with manually (i.e. real) derived match correspondences • Quality measures: • Overall : post-match effort to add missed (A) and to remove false matches (C): negative Overall no gain SimilarityFlooding [ICDE02]: Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
UW & MSR Contributions • LSD [SIGMOD ] • Learning algorithm based on structures and instances • AnHai Doan, Pedro Domingos, Alon Halvey • Cupid [VLDB 02] • Structure matching • Jayant Madhavan, Phil Bernstein • Glue [WWW 03] • Taxonomy matching. Uses relaxation labeling. • An Hai Doan, Jayant Madhavan, Pedro Domingos,Alon Halevy • Mapping Knowledge Base (MKB) [submitted] • Reuses past mappings to help produce new mappings • Jayant Madhavan, Phil Bernstein, Chuang Chen, Alon Halevy, Pradeep Shenoy
Where’s the Research Action? • There’s always room for new techniques • Compare distribution of data values between elements of two schemas • Create a global schema by clustering elements from different sample schemas • Re-using mappings • Combining techniques • Better user interfaces