Semantic Mappings for Data Mediation. Jayant Madhavan, University of Washington. Joint work with AnHai Doan, Pedro Domingos, and Alon Halevy.
Charlie comes to town
• Query: Find houses with 2 bedrooms priced under 300K
• Sources: realestate.com, homeseekers.com, homes.com
Affiliates Meeting
Data Integration
[Figure: the query "Find houses with 2 bedrooms priced under 300K" is posed against a mediated schema; the mediated schema is linked to source schemas 1, 2, and 3, each reached through a wrapper over one of realestate.com, homeseekers.com, and homes.com.]
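The architecture on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the source attribute names (`beds`, `cost`, `num-beds`, `asking`), and the mediated attribute names (`bedrooms`, `price`) are all invented here, and each "wrapper" is reduced to a renaming table that stands in for the wrapper plus the semantic mapping.

```python
# Hypothetical rows per source; attribute names are invented for illustration.
SOURCES = {
    "realestate.com": [{"beds": 2, "price": 250_000}, {"beds": 3, "price": 320_000}],
    "homeseekers.com": [{"bedrooms": 2, "cost": 310_000}],
    "homes.com": [{"num-beds": 2, "asking": 280_000}],
}

# One renaming table per source: source attribute -> mediated attribute.
# These tables are assumed, not taken from the slides.
WRAPPERS = {
    "realestate.com": {"beds": "bedrooms", "price": "price"},
    "homeseekers.com": {"bedrooms": "bedrooms", "cost": "price"},
    "homes.com": {"num-beds": "bedrooms", "asking": "price"},
}


def query_mediated(predicate):
    """Run a predicate written against the mediated schema over every source."""
    results = []
    for site, rows in SOURCES.items():
        rename = WRAPPERS[site]
        for row in rows:
            # Translate the source row into mediated-schema vocabulary.
            mediated = {rename[attr]: value for attr, value in row.items()}
            if predicate(mediated):
                results.append((site, mediated))
    return results


# "Find houses with 2 bedrooms priced under 300K"
hits = query_mediated(lambda h: h["bedrooms"] == 2 and h["price"] < 300_000)
for site, house in hits:
    print(site, house)
```

The user writes the query once, against the mediated schema; the per-source tables are exactly the piece that schema matching is meant to produce automatically.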
Semantic Mappings between Schemas
• Mediated schema: address, agent-name, agent-city, agent-state
• homes.com: area, contact-name, contact-address
• Sample homes.com rows (area | contact-name | contact-address):
  • Denver, CO | Laura Smith | Boulder, CO
  • Oakland, CA | Jean Brown | Davis, CA
• The figure marks a 1-1 mapping and a complex mapping between the two schemas
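To make the two kinds of mapping concrete, here is a minimal Python sketch over the sample rows from this slide. Which attributes the arrows in the original figure connect is not recoverable from the transcript, so the pairing below is an assumption: `area` to `address` and `contact-name` to `agent-name` are treated as 1-1 mappings, and `contact-address` is treated as a complex mapping that is split into `agent-city` and `agent-state`. The function name `to_mediated` is hypothetical.

```python
# Sample homes.com rows, using the attribute names and values on the slide.
source_rows = [
    {"area": "Denver, CO", "contact-name": "Laura Smith", "contact-address": "Boulder, CO"},
    {"area": "Oakland, CA", "contact-name": "Jean Brown", "contact-address": "Davis, CA"},
]


def to_mediated(row):
    """Translate one homes.com row into the mediated schema (assumed pairing)."""
    # Complex mapping: one source attribute feeds two mediated attributes.
    city, state = [part.strip() for part in row["contact-address"].split(",", 1)]
    return {
        "address": row["area"],             # 1-1 mapping
        "agent-name": row["contact-name"],  # 1-1 mapping
        "agent-city": city,
        "agent-state": state,
    }


mediated_rows = [to_mediated(row) for row in source_rows]
for row in mediated_rows:
    print(row)
```

A 1-1 mapping only renames an attribute, while a complex mapping needs a transformation (here a string split), which is why complex mappings are harder to discover automatically.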
Why Schema Matching is Important
• Application has more than one schema: need for schema matching!
• Application areas: data integration, data translation, data warehousing, e-commerce, ontology matching
• [Figure labels: Enterprise 1, Enterprise 2, World-Wide Web, information agent, home users, Knowledge Base 1, Knowledge Base 2]
Why Schema Matching is Difficult
• No access to exact semantics of concepts
  • Semantics not documented in sufficient detail
  • Schemas not adequately expressive to capture semantics
• Must rely on clues in schema & data
  • Using names, structures, types, data values, etc.
• Such clues can be unreliable
  • Synonyms: different names => same entity
    • area & address => location
  • Homonyms: same name => different entities
    • area => location or square-feet
• Done manually by domain experts
  • Expensive and time consuming
Previous work
• Mostly ad-hoc heuristics
  • Name matchers
  • Data types
  • Sample domain values
• Graph matching
  • Schemas are labeled graphs
• No single heuristic works across scenarios
• Systems are fragile and need a lot of tuning
How do we go about it?
• Make extensive use of data instances
• Incorporate multiple heuristics
  • Base learners that implement individual heuristics
• Machine Learning
  • Multi-strategy learning to combine base learners
• Extensible framework
  • Easy to add new heuristics/learners
  • Generic and domain specific constraints
• Robust solution with high accuracy
Multiple hypotheses
• Mediated schema: address, price, agent-name, agent-phone, office-phone, description
• realestate.com: location, price, contact-name, contact-phone, office, comments
• Sample realestate.com rows:
  • Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
  • Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
• Content matcher: if “fantastic” & “great” occur frequently in data instances => description
• Name matcher: if “office” occurs in the name => office-phone
Base Learners
• Content Learner training examples: (“Miami, FL”, address), (“$250K”, price), (“James Smith”, agent-name), (“(305) 729 0831”, agent-phone), (“(305) 616 1822”, office-phone), (“Fantastic house”, description), (“Boston, MA”, address)
• Name Learner training examples: (“location”, address), (“price”, price), (“contact name”, agent-name), (“contact phone”, agent-phone), (“office”, office-phone), (“comments”, description)
• Mediated schema: address, price, agent-name, agent-phone, office-phone, description
• realestate.com: location, price, contact-name, contact-phone, office, comments
• Sample realestate.com rows:
  • Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
  • Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
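The training pairs on this slide are enough to sketch how a base learner might be trained and queried. The Python below is a toy, not LSD's implementation: the class `TokenLearner`, the helper `best`, and the choice of a Naive Bayes scorer over tokens for both learners are assumptions made for brevity. Only the training pairs come from the slide.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase word and digit tokens, e.g. "$250K" -> ["250", "k"]."""
    return re.findall(r"[a-z]+|\d+", text.lower())


class TokenLearner:
    """Toy Naive Bayes over tokens (an assumed stand-in for a base learner)."""

    def __init__(self, examples):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.token_counts[label].update(tokens(text))
            self.label_counts[label] += 1
        self.vocab = {t for c in self.token_counts.values() for t in c}

    def predict(self, text):
        """Return {label: confidence}, with confidences summing to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for tok in tokens(text):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(unnorm.values())
        return {lab: v / z for lab, v in unnorm.items()}


def best(prediction):
    """Label with the highest confidence."""
    return max(prediction, key=prediction.get)


# Training pairs copied from the slide.
content_learner = TokenLearner([
    ("Miami, FL", "address"), ("$250K", "price"), ("James Smith", "agent-name"),
    ("(305) 729 0831", "agent-phone"), ("(305) 616 1822", "office-phone"),
    ("Fantastic house", "description"), ("Boston, MA", "address"),
])
name_learner = TokenLearner([
    ("location", "address"), ("price", "price"), ("contact name", "agent-name"),
    ("contact phone", "agent-phone"), ("office", "office-phone"),
    ("comments", "description"),
])

print(best(content_learner.predict("Great house")))
print(best(name_learner.predict("office phone")))
```

Each learner sees a different kind of evidence (data values versus element names) and returns a confidence per mediated-schema element, which is what lets a later stage weigh the learners against each other.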
Learning Source Descriptions (LSD) [SIGMOD’01]
[Figure: system architecture in two phases]
• Training Phase: mediated schema and source schemas => training data for base learners => Base-Learner 1 ... Base-Learner k (producing Hypothesis 1 ... Hypothesis k) and Meta-Learner (producing weights for base learners)
• Matching Phase: Base-Learner 1 ... Base-Learner k => Meta-Learner => predictions for data instances => Prediction Combiner => predictions for elements => Constraint Handler (with domain constraints) => mappings
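The role of the "weights for base learners" can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function `combine`, the scores, and the weights are all hypothetical, and a single weight per learner is used for brevity; the slides do not specify how the meta-learner derives or applies its weights.

```python
def combine(predictions, weights):
    """Weighted sum of base-learner scores per label, renormalized to sum to 1."""
    combined = {}
    for learner, scores in predictions.items():
        for label, score in scores.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * score
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}


# Hypothetical confidences from two base learners for one source element.
predictions = {
    "name": {"office-phone": 0.6, "agent-phone": 0.4},
    "content": {"office-phone": 0.3, "agent-phone": 0.7},
}
# Hypothetical weights; in a trained system these would be learned.
weights = {"name": 0.7, "content": 0.3}

result = combine(predictions, weights)
print(result)
```

In this example the two learners disagree, and the higher weight on the name learner tips the combined prediction toward office-phone. A constraint handler would then be free to overrule that choice if it violated a domain constraint.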
LSD’s performance
[Figure: average matching accuracy (%)]
• LSD’s accuracy: 71 - 92%
• Best single base learner: 42 - 72%
• + Meta-learner: + 5 - 22%
• + Constraint handler: + 7 - 13%
• Complete LSD system: + 0.8 - 6%
Matching Ontologies of Concepts
• Each ontology has an inheritance tree (taxonomy) and data instances at the leaves.
• For each concept, find the most similar concept in the other ontology.
[Figure: two taxonomies]
• CS Dept U.S.: Undergrad Courses, Grad Courses, People
  • People: Faculty, Staff
    • Faculty: Assistant Professor, Associate Professor, Professor
• CS Dept Australia: Courses, Staff
  • Staff: Academic Staff, Technical Staff
    • Academic Staff: Senior Lecturer, Lecturer, Professor
The Glue System [WWW’2002]
• No manually performed mappings
  • Automatically collect training data for base learners.
• Similarity measures computed from the joint probability distribution of concepts
  • A random data instance can belong to both concepts, to exactly one of them, or to neither: P(A,B), P(A,B'), P(A',B), P(A',B').
• General framework for incorporating constraints
  • Extension of relaxation labeling.
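As an illustration of computing a similarity from the four joint probabilities, the Python sketch below estimates them by counting and then applies the Jaccard coefficient, P(A,B) / (P(A,B) + P(A,B') + P(A',B)). Choosing Jaccard is an assumption for this example, since the slide does not name a specific similarity function. The instance names and the membership sets are invented, and membership is taken as given, whereas a real system would have to predict it.

```python
def joint_distribution(instances, in_a, in_b):
    """Estimate P(A,B), P(A,B'), P(A',B), P(A',B') by counting instances."""
    counts = {"A,B": 0, "A,B'": 0, "A',B": 0, "A',B'": 0}
    for x in instances:
        key = ("A" if in_a(x) else "A'") + "," + ("B" if in_b(x) else "B'")
        counts[key] += 1
    n = len(instances)
    return {key: count / n for key, count in counts.items()}


def jaccard(p):
    """P(A and B) / P(A or B); defined as 0 when neither concept has instances."""
    union = p["A,B"] + p["A,B'"] + p["A',B"]
    return p["A,B"] / union if union else 0.0


# Hypothetical universe of 10 instances and two concepts.
instances = [f"person{i}" for i in range(10)]
faculty = {"person0", "person1", "person2", "person3"}                   # concept A
academic_staff = {"person1", "person2", "person3", "person4", "person5"}  # concept B

p = joint_distribution(instances, faculty.__contains__, academic_staff.__contains__)
similarity = jaccard(p)
print(p, similarity)
```

Because every quantity is derived from the same joint distribution, a different similarity function can be swapped in without changing how the distribution is estimated.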
The Glue System
[Figure: system architecture, listed bottom to top]
• Input: Taxonomy O1 and Taxonomy O2 (tree structure + data instances)
• Distribution Estimator (base learners combined by a meta learner) => joint probability distribution P(A,B), P(A',B), ...
• Similarity Estimator (applies a similarity function) => similarity matrix
• Relaxation Labeling (uses common knowledge & domain constraints)
• Output: mappings for O1, mappings for O2
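The last stage of the pipeline can be hinted at with a heavily simplified, relaxation-style update in Python. This is not Glue's actual update rule: the function `relax`, the `boost` parameter, the similarity values, and the single constraint used ("a node's label is more plausible if its parent's likely label is the parent of that label") are all assumptions, and the "Technician" concept is invented for the example.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def relax(similarity, parent1, parent2, rounds=10, boost=2.0):
    """Iteratively re-score each node's labels using its parent's current labels.

    similarity: {O1 node: {O2 label: similarity score}}
    parent1 / parent2: child -> parent maps for taxonomies O1 and O2
    """
    probs = {node: normalize(row) for node, row in similarity.items()}
    for _ in range(rounds):
        updated = {}
        for node, row in similarity.items():
            scores = {}
            for label, sim in row.items():
                support = 1.0
                node_parent = parent1.get(node)
                label_parent = parent2.get(label)
                if node_parent in probs and label_parent is not None:
                    # Reward labels whose parent agrees with the current
                    # belief about the label of this node's parent.
                    support += boost * probs[node_parent].get(label_parent, 0.0)
                scores[label] = sim * support
            updated[node] = normalize(scores)
        probs = updated
    return probs


# Hypothetical similarity matrix: "Professor" in O1 is ambiguous on its own.
similarity = {
    "Faculty": {"Academic Staff": 0.7, "Technical Staff": 0.3},
    "Professor": {"Professor": 0.5, "Technician": 0.5},
}
parent1 = {"Professor": "Faculty"}
parent2 = {"Professor": "Academic Staff", "Technician": "Technical Staff"}

probs = relax(similarity, parent1, parent2)
print(probs["Professor"])
```

The similarity matrix alone cannot separate the two candidate labels for "Professor"; the structural constraint breaks the tie because the parent "Faculty" most likely maps to "Academic Staff".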
Glue’s performance
[Figure: results chart; its content is not preserved in this transcript]
Conclusion and Future Work
• LSD and Glue perform well
  • Combine predictions of different base learners
  • Incorporate constraints
  • Robust solution that results in good accuracy
• Future Work
  • Representation mapping system
    • Incorporates various heuristics
    • Can perform complex mappings
    • Can learn with experience
  • Reasoning about mappings
    • Does a mapping enable answering of queries posed on the other schema?
    • Is one mapping implied by another? Is a mapping minimal?
    • Can mappings be composed?