180 likes | 284 Views
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach. AnHai Doan Pedro Domingos Alon Halevy. Problem & Solution. Problem Large-scale Data Integration Systems Bottleneck: Semantic Mappings Solution Multi-strategy Learning Integrity Constraints
E N D
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach AnHai Doan Pedro Domingos Alon Halevy
Problem & Solution • Problem • Large-scale Data Integration Systems • Bottleneck: Semantic Mappings • Solution • Multi-strategy Learning • Integrity Constraints • XML Structure Learner • 1-1 Mappings
Learning Source Descriptions (LSD) • Components • Base learners • Meta-learner • Prediction converter • Constraint handler • Operating Phases • Training phase • Matching phase
Learners • Basic Learners • Name Matcher (Whirl) • Content Matcher (Whirl) • Naïve Bayes Learner • County-Name Recognizer • XML Learner • Meta-Learner (Stacking)
Naïve Bayes Learner Input instance= bags of tokens
XML Learner Input instance= bags of tokens including text tokens and structure tokens
Domain Constraint Handler • Domain Constraints • Impose semantic regularities on schemas and source data in the domain • Can be specified at the beginning • When creating a mediated schema • Independent of any actual source schema • Constraint Handler • Domain constraints + Prediction Converter + Users’ feedback + Output mappings
Training Phase • Manually Specify Mappings for Several Sources • Extract Source Data • Create Training Data for each Base Learner • Train the Base-Learner • Train the Meta-Learner
Example1 (Cont.) Training Data Source Data
Example1 (Cont.) (“location” ,ADDRESS) (“Miami, FL”, ADDRESS) Source Data: (location: Miami, FL)
Matching Phase • Extract and Collect Data • Match each Source-DTD Tag • Apply the Constraint Handler
Experimental Evaluation • Measures • Matching accuracy of a source • Average matching accuracy of a source • Average matching accuracy of a domain • Experiment Results • Average matching accuracy for different domains • Contributions of base learners and domain constraint handler • Contributions of schema information and instance information • Performance sensitivity to the amount data instances
Limitations • Enough Training Data • Domain Dependent Learners • Ambiguities in Sources • Efficiency • Overlapping of Schemas
Conclusion and Future Work • Improve over time • Extensible framework • Multiple types of knowledge • Non 1-1 mapping ?