Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach

The LSD Project. AnHai Doan, Pedro Domingos, Alon Halevy, University of Washington.


Presentation Transcript


  1. The LSD Project • Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach • AnHai Doan, Pedro Domingos, Alon Halevy, University of Washington

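  2. Data Integration • Example query: find houses with four bathrooms priced under $500,000 • The query is posed against the mediated schema • Each source is accessed through a wrapper and has its own schema: source schema 1, source schema 2, source schema 3 • Sources: realestate.com, homeseekers.com, homes.com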

  3. Semantic Mappings between Schemas • Mediated & source schemas = XML DTDs • First schema tree: house, address, contact-info, num-baths, agent-name, agent-phone • Second schema tree: house, location, contact, full-baths, half-baths, name, phone • The diagram marks both 1-1 mappings and non 1-1 mappings between the two trees

  4. Current State of Affairs • Finding semantic mappings is now the bottleneck! • largely done by hand • labor intensive & error prone • Will only be exacerbated • data sharing & XML become pervasive • proliferation of DTDs • translation of legacy data • reconciling ontologies on the semantic web • Need (semi-)automatic approaches to scale up!

  5. The LSD (Learning Source Descriptions) Approach Suppose user wants to integrate 100 data sources 1. User • manually creates mappings for a few sources, say 3 • shows LSD these mappings 2. LSD learns from the mappings 3. LSD proposes mappings for remaining 97 sources

  6. Example • Mediated schema: address, price, agent-phone, description • Schema of realestate.com: location, listed-price, phone, comments • realestate.com data • location: Miami, FL; Boston, MA; ... • listed-price: $250,000; $110,000; ... • phone: (305) 729 0831; (617) 253 1429; ... • comments: Fantastic house; Great location; ... • Learned hypotheses • If “phone” occurs in the name => agent-phone • If “fantastic” & “great” occur frequently in data values => description • homes.com data • price: $550,000; $320,000; ... • contact-phone: (278) 345 7215; (617) 335 2315; ... • extra-info: Beautiful yard; Great beach; ...

  7. Our Contributions 1. Use of multi-strategy learning • well-suited to exploit multiple types of knowledge • highly modular & extensible 2. Extend learning to incorporate constraints • handle a wide range of domain & user-specified constraints 3. Develop XML learner • exploit hierarchical nature of XML

  8. Multi-Strategy Learning • Use a set of base learners • each exploits certain types of information well • Match schema elements of a new source • apply the base learners • combine their predictions using a meta-learner • Meta-learner • uses training sources to measure base learner accuracy • weighs each learner based on its accuracy
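The combination step on this slide can be sketched as a weighted sum of the base learners' confidence scores. This is a minimal illustration, not the LSD implementation: it assumes a single weight per learner, and the function name `combine` and the weight values are hypothetical.

```python
from collections import defaultdict


def combine(predictions, weights):
    """Weighted sum of base-learner confidences, renormalised to sum to 1."""
    scores = defaultdict(float)
    for learner, distribution in predictions.items():
        for label, confidence in distribution.items():
            scores[label] += weights[learner] * confidence
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical weights, standing in for accuracies measured on training sources.
weights = {"name": 0.4, "naive_bayes": 0.6}

# Hypothetical base-learner predictions for one source element.
predictions = {
    "name": {"address": 0.5, "description": 0.5},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}

combined = combine(predictions, weights)
best_label = max(combined, key=combined.get)
```

A learner with a larger weight pulls the combined prediction toward its own opinion, which is the effect the slide attributes to weighting by accuracy.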

  9. Base Learners • Input • schema information: name, proximity, structure, ... • data information: value, format, ... • Output • prediction weighted by confidence score • Examples • Name learner • agent-name => (name,0.7), (phone,0.3) • Naive Bayes learner • “Kent, WA” => (address,0.8), (name,0.2) • “Great location” => (description,0.9), (address,0.1)
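To make the "prediction weighted by confidence score" output concrete, here is a toy name-based learner. It is a stand-in only: it scores mediated-schema labels by token overlap (Jaccard similarity) with the source element name, which is an assumption for illustration and not how the slides say the LSD Name learner works. The function names are hypothetical.

```python
import re


def name_tokens(name):
    """Split an element name such as 'agent-phone' into lowercase tokens."""
    return set(re.split(r"[^a-z0-9]+", name.lower())) - {""}


def name_scores(source_name, mediated_names):
    """Return {mediated label: confidence} based on shared name tokens."""
    source = name_tokens(source_name)
    raw = {}
    for label in mediated_names:
        target = name_tokens(label)
        raw[label] = len(source & target) / len(source | target)
    total = sum(raw.values())
    if total == 0:
        # No token overlap at all: fall back to a uniform distribution.
        return {label: 1.0 / len(mediated_names) for label in mediated_names}
    return {label: score / total for label, score in raw.items()}


mediated = ["address", "price", "agent-phone", "description"]
scores = name_scores("contact-phone", mediated)
```

Because "contact-phone" shares a token only with "agent-phone", all of the confidence goes to that label in this toy version.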

  10. Training the Learners • Mediated schema: address, price, agent-phone, description • Schema of realestate.com: location, listed-price, phone, comments • realestate.com listings • <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </> • <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </> • Name Learner training examples: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ... • Naive Bayes Learner training examples: (“Miami, FL”, address), (“$ 250,000”, price), (“(305) 729 0831”, agent-phone), (“Fantastic house”, description), ...
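The (data value, label) training pairs on this slide are the kind of input a text classifier expects. The sketch below trains a small multinomial Naive Bayes model with add-one smoothing on the two realestate.com listings. The tokenizer, the class name `NaiveBayesLearner`, and the smoothing choice are assumptions for illustration; the slides do not specify these details.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word and digit-run tokens."""
    return re.findall(r"[a-z]+|\d+", text.lower())


class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data value, mediated-schema label) pairs."""
        for value, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(value):
                self.token_counts[label][token] += 1
                self.vocab.add(token)

    def predict(self, value):
        """Return {label: confidence}, with confidences summing to 1."""
        tokens = tokenize(value)
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            log_p = math.log(count / n_examples)
            for token in tokens:
                seen = self.token_counts[label][token]
                log_p += math.log((seen + 1) / (total + len(self.vocab)))
            log_scores[label] = log_p
        top = max(log_scores.values())
        unnormalised = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnormalised.values())
        return {l: u / z for l, u in unnormalised.items()}


learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("$250,000", "price"),
    ("(305) 729 0831", "agent-phone"), ("Fantastic house", "description"),
    ("Boston, MA", "address"), ("$110,000", "price"),
    ("(617) 253 1429", "agent-phone"), ("Great location", "description"),
])
prediction = learner.predict("Great beach")
```

With this tiny training set, the homes.com value "Great beach" leans toward description because "great" was seen in a comments value.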

  11. Applying the Learners • Mediated schema: address, price, agent-phone, description • Schema of homes.com: area, day-phone, extra-info • area • <area>Seattle, WA</> <area>Kent, WA</> <area>Austin, TX</> • Name Learner + Naive Bayes => Meta-Learner: (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3) • combined: (address,0.7), (description,0.3) • day-phone • <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</> • Name Learner + Naive Bayes => Meta-Learner: (agent-phone,0.9), (description,0.1) • extra-info • <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</> • (address,0.6), (description,0.4)
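The three predictions listed for the area values and the single prediction (address,0.7), (description,0.3) that follows them are consistent with a plain average over data instances. The sketch below assumes that averaging rule; the slide itself does not name the rule, and `average_predictions` is a hypothetical helper.

```python
def average_predictions(instance_predictions):
    """Average per-instance {label: confidence} dicts into one per element."""
    labels = {label for p in instance_predictions for label in p}
    n = len(instance_predictions)
    return {
        label: sum(p.get(label, 0.0) for p in instance_predictions) / n
        for label in labels
    }


# Predictions shown on the slide for the three <area> values.
area_instances = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.6, "description": 0.4},
    {"address": 0.7, "description": 0.3},
]

area_prediction = average_predictions(area_instances)
```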

  12. Domain Constraints • Impose semantic regularities on sources • verified using schema or data • Examples • a = address & b = address => a = b • a = house-id => a is a key • a = agent-info & b = agent-name => b is nested in a • Can be specified up front • when creating mediated schema • independent of any actual source schema

  13. The Constraint Handler • Can specify arbitrary constraints • User feedback = domain constraint • ad-id = house-id • Extended to handle domain heuristics • a = agent-phone & b = agent-name => a & b are usually close to each other • Example • Domain constraint: a = address & b = address => a = b • Predictions from Meta-Learner • area: (address,0.7), (description,0.3) • contact-phone: (agent-phone,0.9), (description,0.1) • extra-info: (address,0.6), (description,0.4) • Candidate mappings and their scores • area: address, contact-phone: agent-phone, extra-info: address: 0.7 × 0.9 × 0.6 = 0.378 (violates the constraint, since two elements map to address) • area: address, contact-phone: agent-phone, extra-info: description: 0.7 × 0.9 × 0.4 = 0.252 • area: description, contact-phone: description, extra-info: description: 0.3 × 0.1 × 0.4 = 0.012
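The numbers on this slide can be reproduced with a brute-force search: score each candidate mapping by the product of its confidences, discard candidates that violate a constraint, and keep the best one. This is a sketch under that assumption; exhaustive enumeration is used here only because the example is tiny, and the slides do not say how LSD searches the space. The helper names are hypothetical.

```python
from itertools import product
from math import prod


def best_mapping(predictions, constraints):
    """Return the highest-scoring mapping that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, 0.0
    # Iterating a dict yields its keys, so this enumerates label choices.
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = prod(predictions[e][mapping[e]] for e in elements)
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


def at_most_one_address(mapping):
    """a = address & b = address => a = b."""
    return list(mapping.values()).count("address") <= 1


# Meta-learner predictions from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

unconstrained, unconstrained_score = best_mapping(predictions, [])
constrained, constrained_score = best_mapping(predictions, [at_most_one_address])
```

Without the constraint the 0.378 mapping wins; with it, extra-info is pushed to description and the score drops to 0.252.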

  14. Putting It All Together: the LSD System • Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner • Meta-learner • uses stacking [Ting&Witten99, Wolpert92] • returns linear weighted combination of base learners’ predictions • Architecture diagram labels • Training Phase: Mediated schema, Source schemas, Data listings, Training data for base learners, L1, L2, ..., Lk • Matching Phase: Mapping Combination, Domain Constraints, User Feedback, Constraint Handler

  15. Empirical Evaluation • Four domains • Real Estate I & II, Course Offerings, Faculty Listings • For each domain • create mediated DTD & domain constraints • choose five sources • extract & convert data listings into XML • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 • Ten runs for each experiment - in each run: • manually provide 1-1 mappings for 3 sources • ask LSD to propose mappings for remaining 2 sources • accuracy = % of 1-1 mappings correctly identified
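The accuracy measure defined on this slide can be written directly as the percentage of correct 1-1 mappings that the system proposes. The function name and the sample mappings below are hypothetical and are not results from the evaluation.

```python
def matching_accuracy(proposed, correct):
    """Percentage of correct 1-1 mappings that appear in the proposal."""
    hits = sum(1 for element, label in correct.items()
               if proposed.get(element) == label)
    return 100.0 * hits / len(correct)


# Hypothetical gold mappings and proposal for one test source.
correct = {
    "price": "price",
    "contact-phone": "agent-phone",
    "extra-info": "description",
}
proposed = {
    "price": "price",
    "contact-phone": "agent-phone",
    "extra-info": "address",
}

accuracy = matching_accuracy(proposed, correct)
```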

  16. High Matching Accuracy • Chart: average matching accuracy (%) • LSD’s accuracy: 71 - 92% • Best single base learner: 42 - 72% • + Meta-learner: + 5 - 22% • + Constraint handler: + 7 - 13% • + XML learner: + 0.8 - 6%

  17. Performance Sensitivity • Chart: average matching accuracy (%) vs. number of data listings per source

  18. Contribution of Schema vs. Data • Chart: average matching accuracy (%) • More experiments in the paper!

  19. Related Work • Rule-based approaches • TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01] • utilize only schema information • Learner-based approaches • SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95] • employ a single learner, limited applicability • Others • DELTA [Clifton et al. 97], CLIO [Miller et al. 00] [Yan et al. 01] • Multi-strategy learning in other domains • series of workshops [91,93,96,98,00] • [Freitag98], Proverb [Keim et al. 99]

  20. Summary • LSD project • applies machine learning to schema matching • Main ideas & contributions • use of multi-strategy learning • extend learning to handle domain & user-specified constraints • develop XML learner • System design: A contribution to generic schema-matching • highly modular & extensible • handle multiple types of knowledge • continuously improve over time

  21. Ongoing & Future Work • Improve accuracy • address current system limitations • Extend LSD to more complex mappings • Apply LSD to other application contexts • data translation • data warehousing • e-commerce • information extraction • semantic web • www.cs.washington.edu/homes/anhai/lsd.html

  22. Contribution of Each Component • Chart: average matching accuracy (%) for each configuration • Without Name Learner • Without Naive Bayes • Without Whirl Learner • Without Constraint Handler • The complete LSD system

  23. Exploiting Hierarchical Structure • Existing learners flatten out all structures • Developed XML learner • similar to the Naive Bayes learner • input instance = bag of tokens • differs in one crucial aspect • consider not only text tokens, but also structure tokens • Example instances • <contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact> • <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
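The two instances on this slide share most of their text tokens, so a learner that flattens structure has little to separate them. The sketch below builds a bag of tokens with and without structure tokens. Using element tag names as the structure tokens is an assumption made for illustration; the slide does not define what the XML learner's structure tokens are.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def bag_of_tokens(xml_fragment, include_structure=True):
    """Bag of text tokens, optionally extended with one token per element tag."""
    root = ET.fromstring(xml_fragment)
    bag = Counter(re.findall(r"[a-z]+", " ".join(root.itertext()).lower()))
    if include_structure:
        # Hypothetical structure tokens: the tag name of every element.
        bag.update("<" + node.tag + ">" for node in root.iter())
    return bag


contact = "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"
description = (
    "<description>Victorian house with a view. Name your price! "
    "To see it, contact Gail Murphy at MAX Realtors.</description>"
)

flat_contact = bag_of_tokens(contact, include_structure=False)
flat_description = bag_of_tokens(description, include_structure=False)
structured_contact = bag_of_tokens(contact)
structured_description = bag_of_tokens(description)
```

In the flattened view every token of the contact instance also occurs in the description instance; once tag tokens such as `<name>` and `<firm>` are added, the contact instance carries evidence the description does not.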
