Learning to Map Between Schemas and Ontologies — Alon Halevy, University of Washington. Joint work with AnHai Doan and Pedro Domingos.
Agenda • Ontology mapping is a key problem in many applications: • Data integration • Semantic web • Knowledge management • E-commerce • LSD: • A solution that uses multi-strategy learning. • We started with schema matching (i.e., very simple ontologies). • Currently extending to more expressive ontologies. • Experiments show the approach is very promising!
The Structure Mapping Problem • Types of structures: • database schemas, XML DTDs, ontologies, … • Input: • two (or more) structures, S1 and S2 • data instances for S1 and S2 • background knowledge • Output: • a mapping between S1 and S2 • should enable translating between data instances • What are the semantics of the mapping?
Motivation • Database schema integration: • a problem as old as databases themselves • database merging, data warehouses, data migration • Data integration / information-gathering agents: • on the WWW, in enterprises, in large science projects • Model management: • model matching is a key operator in an algebra where models and mappings are first-class objects • see [Bernstein et al., 2000] for more • The Semantic Web: • ontology mapping • System interoperability: • e-services, application integration, B2B applications, …
Desiderata for Proposed Solutions • Accuracy, efficiency, ease of use. • Realistic expectations: • unlikely to be fully automated; need the user in the loop • some notion of semantics for mappings • Extensibility: • the solution should exploit additional background knowledge • “Memory”, knowledge reuse: • the system should exploit previous manual or automatically generated matchings • this is the key idea behind LSD.
LSD Overview • L(earning) S(ource) D(escriptions) • Problem: generating semantic mappings between mediated schema and a large set of data source schemas. • Key idea: generate the first mappings manually, and learn from them to generate the rest. • Technique: multi-strategy learning (extensible!) • Step 1: • [SIGMOD, 2001]: 1-1 mappings between XML DTDs. • Current focus: • Complex mappings • Ontology mapping.
Outline • Overview of structure mapping • Data integration and source mappings • LSD architecture and details • Experimental results • Current work.
Data Integration • Example query: find houses with four bathrooms priced under $500,000. • [Figure: the query is posed against a mediated schema, then reformulated and optimized into queries over source schemas 1–3, which sit behind wrappers for realestate.com, homeseekers.com, and homes.com.] • Applications: WWW, enterprises, science projects. • Techniques: virtual data integration, warehousing, custom code.
Semantic Mappings between Schemas • Source schemas = XML DTDs. • [Figure: mediated-schema elements (house, address, contact-info: agent-name, agent-phone; num-baths) map to source-schema elements (house, location, contact: name, phone; full-baths, half-baths). address↔location, agent-name↔name, and agent-phone↔phone are 1-1 mappings; num-baths↔(full-baths, half-baths) is a non 1-1 mapping.]
Semantics (preliminary) • The semantics of mappings has received no attention. • Semantics of 1-1 mappings: • Given: R(A1,…,An) and S(B1,…,Bm), and 1-1 mappings (Ai, Bj). • Then we postulate the existence of a relation W such that π_{C1,…,Ck}(W) = π_{A1,…,Ak}(R) and π_{C1,…,Ck}(W) = π_{B1,…,Bk}(S), where C1,…,Ck are the W-attributes corresponding to the matched pairs. • W also includes the unmatched attributes of R and S. • In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.
Why Matching is Difficult • Matching aims to identify the same real-world entity • using names, structures, types, data values, etc. • Schemas represent the same entity differently: • different names => same entity: area & address => location • same names => different entities: area => location or square-feet • Schema & data never fully capture semantics! • not adequately documented, not sufficiently expressive • Intended semantics is typically subjective! • IBM Almaden Lab = IBM? • Matching cannot be fully automated, and is often hard even for humans. Committees are required!
Current State of Affairs • Finding semantic mappings is now the bottleneck! • largely done by hand • labor intensive & error prone • GTE: 4 hours/element for 27,000 elements [Li & Clifton 00] • The problem will only be exacerbated as: • data sharing & XML become pervasive • DTDs proliferate • legacy data is translated • ontologies are reconciled on the semantic web • Need semi-automatic approaches to scale up!
Outline • Overview of structure mapping • Data integration and source mappings • LSD architecture and details • Experimental results • Current work.
The LSD Approach • The user manually maps a few data sources to the mediated schema. • LSD learns from these mappings, and proposes mappings for the rest of the sources. • Several types of knowledge are used in learning: • schema elements, e.g., attribute names • data elements: ranges, formats, word frequencies, value frequencies, text lengths • proximity of attributes • functional dependencies, number of attribute occurrences • One learner does not fit all: use multiple learners and combine them with a meta-learner.
Example • [Figure: the mediated schema (address, price, agent-phone, description) is matched against realestate.com’s schema (location, listed-price, phone, comments). Learned hypotheses: if “phone” occurs in the element name => agent-phone; if “fantastic” & “great” occur frequently in data values => description. Sample realestate.com data — location: Miami, FL / Boston, MA; listed-price: $250,000 / $110,000; phone: (305) 729 0831 / (617) 253 1429; comments: Fantastic house / Great location. The learned hypotheses then apply to homes.com (price, contact-phone, extra-info — e.g., Beautiful yard, Great beach).]
Multi-Strategy Learning • Use a set of base learners: • Name learner, Naïve Bayes, Whirl, XML learner • And a set of recognizers: • county name, zip code, phone number • Each base learner produces a prediction weighted by a confidence score. • Combine the base learners with a meta-learner, using stacking.
Base Learners • Name Learner: trained on name pairs such as (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price); asked about contact-phone, it predicts (agent-phone, 0.7), (office-address, 0.3). • Naive Bayes Learner [Domingos & Pazzani 97]: “Kent, WA” => (address, 0.8), (name, 0.2). • Whirl Learner [Cohen & Hirsh 98]. • XML Learner: exploits the hierarchical structure of XML data.
Training the Base Learners • [Figure: the manually provided mapping between the mediated schema (address, price, agent-phone, description) and realestate.com’s schema (location, listed-price, phone, comments) yields training examples from instances such as <location>Miami, FL</>, <listed-price>$250,000</>, <phone>(305) 729 0831</>, <comments>Fantastic house</>. • Name Learner training examples: (location, address), (listed-price, price), (phone, agent-phone), … • Naive Bayes training examples: (“Miami, FL”, address), (“$250,000”, price), (“(305) 729 0831”, agent-phone), …]
Entity Recognizers • Use pre-programmed knowledge to identify specific types of entities: • date, time, city, zip code, name, etc. • house-area (30 X 70, 500 sq. ft.) • county-name recognizer • Recognizers often have nice characteristics: • easy to construct • many off-the-shelf research & commercial products • applicable across many domains • help with special cases that are hard to learn
Meta-Learner: Stacking • Training the meta-learner produces a weight for every (base-learner, mediated-schema element) pair: • weight(Name-Learner, address) = 0.1 • weight(Naive-Bayes, address) = 0.9 • To combine predictions, the meta-learner computes the weighted sum of the base-learner confidence scores. • Example: for <area>Seattle, WA</>, the Name Learner predicts (address, 0.6) and Naive Bayes predicts (address, 0.8); the meta-learner outputs (address, 0.6*0.1 + 0.8*0.9 = 0.78).
Training the Meta-Learner • For address, collect each base learner’s score on the extracted XML instances, together with the true 0/1 label:

Extracted XML instance        Name Learner  Naive Bayes  True
<location> Miami, FL </>          0.5           0.8        1
<listed-price> $250,000 </>       0.4           0.3        0
<area> Seattle, WA </>            0.3           0.9        1
<house-addr> Kent, WA </>         0.6           0.8        1
<num-baths> 3 </>                 0.3           0.3        0
...                               ...           ...       ...

• Least-squares linear regression over these rows yields: weight(Name-Learner, address) = 0.1, weight(Naive-Bayes, address) = 0.9.
Applying the Learners • [Figure: matching the schema of homes.com (area, day-phone, extra-info) to the mediated schema (address, price, agent-phone, description). • For <area> instances (Seattle, WA; Kent, WA; Austin, TX), the Name Learner and Naive Bayes predict, e.g., (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4); the meta-learner combines these into (address, 0.7), (description, 0.3). • For <day-phone> instances ((278) 345 7215; (617) 335 2315; (512) 427 1115), the meta-learner outputs (agent-phone, 0.9), (description, 0.1). • For <extra-info> instances (Beautiful yard; Great beach; Close to Seattle), it outputs (description, 0.8), (address, 0.2).]
The Constraint Handler • Extends learning to incorporate constraints. • Hard constraints: • a = address & b = address => a = b • a = house-id => a is a key • a = agent-info & b = agent-name => b is nested in a • Soft constraints: • a = agent-phone & b = agent-name => a & b are usually close to each other • User feedback = hard or soft constraints. • Details in [Doan et al., SIGMOD 2001].
The Current LSD System • [Architecture: in the training phase, the mediated schema, source schemas, data listings, and domain constraints feed the base learners (Base-Learner 1 … Base-Learner k) and the meta-learner; in the matching phase, the trained learners’ predictions pass through the constraint handler, which also takes domain constraints and user feedback, to produce the mappings.]
Outline • Overview of structure mapping • Data integration and source mappings • LSD architecture and details • Experimental results • Current work.
Empirical Evaluation • Four domains • Real Estate I & II, Course Offerings, Faculty Listings • For each domain • create mediated DTD & domain constraints • choose five sources • extract & convert data listings into XML (faithful to schema!) • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 • Ten runs for each experiment - in each run: • manually provide 1-1 mappings for 3 sources • ask LSD to propose mappings for remaining 2 sources • accuracy = % of 1-1 mappings correctly identified
Matching Accuracy • [Chart: average matching accuracy (%) across the four domains.] • LSD’s accuracy: 71 - 92%. • Best single base learner: 42 - 72%. • + Meta-learner: + 5 - 22%. • + Constraint handler: + 7 - 13%. • + XML learner: + 0.8 - 6%.
Sensitivity to Amount of Available Data • [Chart: average matching accuracy (%) vs. number of data listings per source (Real Estate I).]
Contribution of Schema vs. Data • [Chart: average matching accuracy (%) for LSD with only schema info, LSD with only data info, and the complete LSD system.] • More experiments in the paper [Doan et al. 01].
Reasons for Incorrect Matching • Unfamiliarity: • suburb • solution: add a suburb-name recognizer • Insufficient information: • the general type was correctly identified, but the exact type could not be pinpointed: <agent-name>Richard Smith</> <phone>(206) 234 5412</> • solution: add a proximity learner • Subjectivity: • house-style = description?
Outline • Overview of structure mapping • Data integration and source mappings • LSD architecture and details • Experimental results • Current work.
Moving Up the Expressiveness Ladder • Schemas are very simple ontologies. • More expressive power = more domain constraints: mappings become more complex, but the constraints provide more to learn from. • Non 1-1 mappings: • F1(A1,…,Am) = F2(B1,…,Bm) • Ontologies (of various flavors): • class hierarchy (i.e., containment on unary relations) • relationships between objects • constraints on relationships
Finding Non 1-1 Mappings (Current Work) • Given two schemas, find: • 1-many mappings: address = concat(city, state) • many-1 mappings: half-baths + full-baths = num-baths • many-many mappings: concat(addr-line1, addr-line2) = concat(street, city, state) • A 1-many mapping is expressed as a query, with: • a value-correspondence expression: room-rate = rate * (1 + tax-rate) • a relationship: the state of tax-rate = the state of the hotel that has rate • Special case: 1-many mappings between two relational tables. • [Figure: mediated schema (address, description, num-baths) vs. source schema (city, state, comments, half-baths, full-baths).]
Brute-Force Solution • Define a set of operators: • concat, +, -, *, /, etc. • For each set of mediated-schema columns: • enumerate all possible mappings • evaluate & return the best mapping • [Figure: a mediated-schema column m1 is compared against all candidate combinations m1, m2, …, mk of source-schema columns, computing similarity using all base learners.]
Search-Based Solution • States = columns • goal state: mediated-schema column • initial states: all source-schema columns • use 1-1 matching to reduce the set of initial states • Operators: concat, +, -, *, /, etc • Column-similarity: • use all base learners + recognizers
Multi-Strategy Search • Use a set of expert modules: L1, L2, ..., Ln. • Each module: • applies only to certain types of mediated-schema column • searches a small subspace • uses a cheap similarity measure to compare columns • Examples: • L1: text; concat; TF/IDF • L2: numeric; +, -, *, /; [Ho et al. 2000] • L3: address; concat; Naive Bayes • Search techniques: • beam search by default (see the sketch below) • specialized techniques that do not have to materialize columns
Multi-Strategy Search (cont’d) • Combine the modules’ predictions & select the best one. • Apply all applicable expert modules: • [Figure: modules L1, L2, L3 each propose candidate mappings (m11, m12, m13, …; m21, m22, m23, …; m31, m32, m33, …) for a mediated-schema column; the top candidates (m11, m12, m21, m22, m31, m32) are then re-scored by computing similarity using all base learners.]
Related Work • Recognizers + schema + 1-1 matching: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]. • Single learner + 1-1 matching: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95], DELTA [Clifton et al. 97]. • Hybrid + 1-1 matching: DELTA [Clifton et al. 97]. • Sophisticated data-driven user interaction, schema + data, 1-1 + non 1-1 matching: CLIO [Miller et al. 00], [Yan et al. 01]. • Multi-strategy learning, learners + recognizers, schema + data, 1-1 + non 1-1 matching: LSD [Doan et al. 2000, 2001].
Summary • LSD: • uses multi-strategy learning to semi-automatically generate semantic mappings. • LSD is extensible and incorporates domain and user knowledge, and previous techniques. • Experimental results show the approach is very promising. • Future work and issues to ponder: • Accommodating more expressive languages: ontologies • Reuse of learned concepts from related domains. • Semantics? • Data management is a fertile area for Machine Learning research!
Mapping Maintenance • Ten months later … are the mappings still correct? • [Figure: mappings m1, m2, m3 between mediated schema M and source schema S may no longer hold once the schemas have evolved into M’ and S’.]
Information Extraction from Text • Extract data fragments from text documents: • date, location, & victim’s name from a news article • Intensive research has targeted free-text documents, but many documents do have substantial structure: • XML pages, name cards, tables, lists • Each such document = a data source: • its structure forms a schema • only one data value per schema element, whereas a “real” data source has many data values per schema element • Ongoing research in the IE community.
Contribution of Each Component • [Chart: average matching accuracy (%) for LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, without the Constraint Handler, and for the complete LSD system.]
Exploiting Hierarchical Structure • Existing learners flatten out all structure. • We developed an XML learner: • similar to the Naive Bayes learner: an input instance = a bag of tokens • differs in one crucial aspect: it considers not only text tokens, but also structure tokens • Example: <contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact> vs. <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
Domain Constraints • Impose semantic regularities on sources: • verified using schema or data • Examples: • a = address & b = address => a = b • a = house-id => a is a key • a = agent-info & b = agent-name => b is nested in a • Can be specified up front: • when creating the mediated schema • independent of any actual source schema
The Constraint Handler • Can specify arbitrary constraints. • User feedback = a domain constraint: ad-id = house-id. • Extended to handle domain heuristics: a = agent-phone & b = agent-name => a & b are usually close to each other. • Example: predictions from the meta-learner: area: (address, 0.7), (description, 0.3); contact-phone: (agent-phone, 0.9), (description, 0.1); extra-info: (address, 0.6), (description, 0.4). • Domain constraint: a = address & b = address => a = b. • Candidate assignments and their likelihoods (products of confidences):

area: description, contact-phone: description, extra-info: description => 0.3 * 0.1 * 0.4 = 0.012
area: address, contact-phone: agent-phone, extra-info: address => 0.7 * 0.9 * 0.6 = 0.378 (violates the constraint)
area: address, contact-phone: agent-phone, extra-info: description => 0.7 * 0.9 * 0.4 = 0.252 (best valid assignment)