Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach

Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach AnHai Doan Pedro Domingos Alon Halevy

Problem & Solution • Problem • Large-scale Data Integration Systems • Bottleneck: Semantic Mappings • Solution • Multi-strategy Learning • Integrity Constraints • XML Structure Learner • 1-1 Mappings

Learning Source Descriptions (LSD) • Components • Base learners • Meta-learner • Prediction converter • Constraint handler • Operating Phases • Training phase • Matching phase

Learners • Basic Learners • Name Matcher (Whirl) • Content Matcher (Whirl) • Naïve Bayes Learner • County-Name Recognizer • XML Learner • Meta-Learner (Stacking)

Naïve Bayes Learner Input instance= bags of tokens

XML Learner Input instance= bags of tokens including text tokens and structure tokens

Domain Constraint Handler • Domain Constraints • Impose semantic regularities on schemas and source data in the domain • Can be specified at the beginning • When creating a mediated schema • Independent of any actual source schema • Constraint Handler • Domain constraints + Prediction Converter + Users’ feedback + Output mappings

Training Phase • Manually Specify Mappings for Several Sources • Extract Source Data • Create Training Data for each Base Learner • Train the Base-Learner • Train the Meta-Learner

Example1 (Training Phase)

Example1 (Cont.) Training Data Source Data

Example1 (Cont.) (“location” ，ADDRESS) (“Miami, FL”, ADDRESS) Source Data: (location: Miami, FL)

Matching Phase • Extract and Collect Data • Match each Source-DTD Tag • Apply the Constraint Handler

Example2 (Matching Phase)

Example2 (Cont.)

Experimental Evaluation • Measures • Matching accuracy of a source • Average matching accuracy of a source • Average matching accuracy of a domain • Experiment Results • Average matching accuracy for different domains • Contributions of base learners and domain constraint handler • Contributions of schema information and instance information • Performance sensitivity to the amount data instances

Limitations • Enough Training Data • Domain Dependent Learners • Ambiguities in Sources • Efficiency • Overlapping of Schemas

Conclusion and Future Work • Improve over time • Extensible framework • Multiple types of knowledge • Non 1-1 mapping ?

Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach