New England Database Society (NEDS) Friday, April 23, 2004 Volen 101, Brandeis University Sponsored by Sun Microsystems
Learning to Reconcile Semantic Heterogeneity Alon Halevy University of Washington, Seattle NEDS, April 23, 2004
Large-Scale Data Sharing • Large-scale data sharing is pervasive: • Big science (bio-medicine, astrophysics, …) • Government agencies • Large corporations • The web (over 100,000 searchable data sources) • “Enterprise Information Integration” industry • The vision: • Content authoring by anyone, anywhere • Powerful database-style querying • Use relevant data from anywhere to answer the query • The Semantic Web • Fundamental problem: reconciling different models of the world.
Large-Scale Scientific Data Sharing [Figure labels: Swiss-Prot, HUGO, OMIM, UW, UW Microbiology, UW Genome Sciences, Harvard Genetics, GeneClinics]
Data Integration (www.biomediator.org; Tarczy-Hornoch, Mork) • Schema entities in the figure: Entity, Sequenceable Entity, Structured Vocabulary, Experiment, Phenotype, Gene, Nucleotide Sequence, Microarray Experiment, Protein • Sources in the figure: OMIM, HUGO, Swiss-Prot, GO, GeneClinics, LocusLink, Entrez, GEO • Query: For the micro-array experiment I just ran, what are the related nucleotide sequences and for what protein do they code?
Peer Data Management Systems • Piazza: [Tatarinov, H., Ives, Suciu, Mork] • Mappings specified locally • Map to most convenient nodes • Queries answered by traversing semantic paths. [Figure: peers CiteSeer, Stanford, UW, DBLP, Brown, M.I.T, Brandeis, with queries Q and Q1 through Q6]
Data Sharing Architectures • Data integration • PDMS • Message passing • Web services • Data warehousing [Figure labels: Mediated Schema over sources R1 through R5; peers CiteSeer, UW, Stanford, DBLP; sites Paris, Roma, Vienna; query Q reformulated as Q’ and Q’’]
Semantic Mappings • Formalism for mappings • Reformulation algorithms • How will we create them? [Figure: query Q posed over the Mediated Schema and reformulated as Q’ over the sources]
Semantic Mappings: Example • Inventory Database A: Books(Title, ISBN, Price, DiscountPrice, Edition); Authors(ISBN, FirstName, LastName); CDs(Album, ASIN, Price, DiscountPrice, Studio); Artists(ASIN, ArtistName, GroupName) • Inventory Database B: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords); BookCategories(ISBN, Category); CDCategories(ASIN, Category) • Differences in: • Names in schema • Attribute grouping • Coverage of databases • Granularity and format of attributes
Why is Schema Matching so Hard? • Because the schemas never fully capture their intended meaning: • Schema elements are just symbols. • We need to leverage any additional information we may have. • ‘Theorem’: Schema matching is AI-Complete. • Hence, a human will always be in the loop. • Goal is to improve the designer’s productivity. • Solution must be extensible.
Dimensions of the Problem (1): Matching vs. Mapping • Schema Matching: discovering correspondences between similar elements • Schema Mapping: BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…) • Inventory Database A: Books(Title, ISBN, Price, DiscountPrice, Edition); Authors(ISBN, FirstName, LastName); CDs(Album, ASIN, Price, DiscountPrice, Studio); Artists(ASIN, ArtistName, GroupName) • Inventory Database B: BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords); BookCategories(ISBN, Category); CDCategories(ASIN, Category)
Dimensions of the Problem (2) • Schema level vs. instance level: • Alon Halevy, A. Halevy, Alon Y. Levy – same guy! • Can’t always separate the two levels. • Crucial for Personal Info Management (See Semex) • What are we mapping? • Schemas • Web service descriptions • Business logic and processes • Ontologies
Important Special Cases • Mapping to a common mediated schema? • Or mapping two arbitrary schemas? • One schema may be a new version of the other. • The two schemas may be evolutions of the same original schema. • Web forms. • Horizontal integration: many sources talking about the same stuff. • Vertical integration: sources cover different parts of the domain and have little overlap.
Problem Definition • Given: • S1 and S2: a pair of schemas/DTDs/ontologies,… • Possibly, accompanying data instances • Additional domain knowledge • Find: • A match between S1 and S2 • A set of correspondences between the terms.
Outline • Motivation and problem definition • Learning to match to a mediated schema • Matching arbitrary schemas using a corpus • Matching web services.
Typical Matching Heuristics [See Rahm & Bernstein, VLDBJ 2001, for a survey] • Build a model for every element from multiple sources of evidence in the schemas • Schema element names • BooksAndCDs/Categories ~ BookCategories/Category • Descriptions and documentation • ItemID: unique identifier for a book or a CD • ISBN: unique identifier for any book • Data types, data instances • DateTime vs. Integer • addresses have similar formats • Schema structure • All books have similar attributes • In isolation, techniques are incomplete or brittle: a principled combination is needed [see the Coma system]. • Models consider only the two schemas.
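A minimal sketch of the name-based heuristic listed above, assuming a plain token-overlap (Jaccard) score. The helpers `tokenize` and `name_similarity` and the crude plural stripping are illustrative assumptions, not taken from any of the surveyed systems.

```python
import re

def tokenize(name):
    """Split a schema element name into a set of lowercase, roughly singular tokens."""
    # Insert a space at lower-to-upper camelCase boundaries ("BookCategories").
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", name)
    tokens = set()
    for word in re.split(r"[^A-Za-z0-9]+", spaced):
        word = word.lower()
        if not word:
            continue
        # Crude plural stripping (an assumption; a real stemmer would do better).
        if word.endswith("ies"):
            word = word[:-3] + "y"
        elif word.endswith("s") and len(word) > 3:
            word = word[:-1]
        tokens.add(word)
    return tokens

def name_similarity(a, b):
    """Jaccard overlap of the two token sets, between 0.0 and 1.0."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

score = name_similarity("BooksAndCDs/Categories", "BookCategories/Category")
print(score)  # 0.5: the names share "book" and "category" out of four distinct tokens
```

A score like this is one source of evidence only; as the slide notes, it is brittle in isolation (for example, ISBN and ItemID share no tokens).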
Matching to a Mediated Schema [Doan et al., SIGMOD 2001, MLJ 2003] • Query: Find houses with four bathrooms priced under $500,000 • The query is posed over the mediated schema, followed by query reformulation and optimization. • Source schemas 1, 2, 3: realestate.com, homeseekers.com, homes.com
Finding Semantic Mappings • Source schemas = XML DTDs • Two house DTDs in the figure: house(address, contact-info(agent-name, agent-phone), num-baths) and house(location, contact(name, phone), full-baths, half-baths) • Correspondences are either 1-1 mappings or non 1-1 mappings
Learning from Previous Matching • Every matching task is a learning opportunity. • Several types of knowledge are used in learning: • Schema elements, e.g., attribute names • Data elements: ranges, formats, word frequencies, value frequencies, length of texts. • Proximity of attributes • Functional dependencies, number of attribute occurrences.
Matching Real-Estate Sources • Mediated schema: address, price, agent-phone, description • Schema of realestate.com: location, listed-price, phone, comments • realestate.com data: • location: Miami, FL; Boston, MA; ... • listed-price: $250,000; $110,000; ... • phone: (305) 729 0831; (617) 253 1429; ... • comments: Fantastic house; Great location; ... • Learned hypotheses: • If “phone” occurs in the name => agent-phone • If “fantastic” & “great” occur frequently in data values => description • homes.com data: • price: $550,000; $320,000; ... • contact-phone: (278) 345 7215; (617) 335 2315; ... • extra-info: Beautiful yard; Great beach; ...
Learning to Match Schemas [Figure: Multi-strategy Learning System with a Training Phase and a Matching Phase. Labels: Mediated schema, Source schemas, Data listings, Domain Constraints, User Feedback, Base-Learner 1 through Base-Learner k, Meta-Learner, Constraint Handler, Mappings]
Multi-Strategy Learning • Use a set of base learners: • Name learner, Naïve Bayes, Whirl, XML learner • And a set of recognizers: • County name, zip code, phone numbers. • Each base learner produces a prediction weighted by a confidence score. • Combine base learners with a meta-learner, using stacking.
Base Learners • Name Learner • Training pairs: (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price) • Query: (contact-phone, ?) • contact-phone => (agent-phone, 0.7), (office-address, 0.3) • Naive Bayes Learner [Domingos & Pazzani 97] • “Kent, WA” => (address, 0.8), (name, 0.2) • Whirl Learner [Cohen & Hirsh 98] • XML Learner • exploits hierarchical structure of XML data
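To make the Naive Bayes base learner concrete, here is a minimal sketch, assuming a multinomial model over word tokens with add-one smoothing. The class name `NaiveBayesLearner`, the tokenizer, and the six training values are illustrative assumptions; this is not the LSD implementation, and the resulting confidences will not reproduce the numbers on the slide.

```python
import math
import re
from collections import Counter, defaultdict

def tokens(value):
    """Lowercase alphanumeric tokens of a data value."""
    return re.findall(r"[a-z0-9]+", value.lower())

class NaiveBayesLearner:
    def __init__(self):
        self.label_counts = Counter()             # training values per label
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (data value, mediated-schema element)."""
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokens(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value):
        """Return {label: confidence}, normalized to sum to 1."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokens(value):
                # Add-one smoothing so unseen tokens do not zero out a label.
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(unnorm.values())
        return {label: v / z for label, v in unnorm.items()}

learner = NaiveBayesLearner()
learner.train([
    ("Miami, FL", "address"), ("Boston, MA", "address"), ("Seattle, WA", "address"),
    ("Fantastic house", "description"), ("Great location", "description"),
    ("Great beach", "description"),
])
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))
```

With this toy data the only evidence for "Kent, WA" is the token "wa", which was seen once under address; a realistic learner would train on many listings per source.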
Meta-Learner: Stacking • Training of the meta-learner produces a weight for every pair of (base-learner, mediated-schema element) • weight(Name-Learner, address) = 0.1 • weight(Naive-Bayes, address) = 0.9 • Combining predictions: the meta-learner computes a weighted sum of base-learner confidence scores • Example for <area>Seattle, WA</>: Name Learner predicts (address, 0.6), Naive Bayes predicts (address, 0.8) • Meta-Learner: (address, 0.6*0.1 + 0.8*0.9 = 0.78)
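The weighted sum on this slide can be written out as a short sketch. The function name `stack` and the dictionary layout are assumptions for illustration; the weights are taken as given here, since the slide does not show how training produces them.

```python
def stack(predictions, weights, element):
    """Weighted sum of base-learner confidence scores for one mediated-schema element.

    predictions: {learner name: {element: confidence}}
    weights:     {(learner name, element): weight learned by the meta-learner}
    """
    return sum(
        weights[(learner, element)] * scores.get(element, 0.0)
        for learner, scores in predictions.items()
    )

# Numbers from the slide, for the instance <area>Seattle, WA</>.
weights = {("Name-Learner", "address"): 0.1, ("Naive-Bayes", "address"): 0.9}
predictions = {
    "Name-Learner": {"address": 0.6},
    "Naive-Bayes": {"address": 0.8},
}
score = stack(predictions, weights, "address")
print(round(score, 2))  # 0.78
```

Keeping one weight per (learner, element) pair lets the meta-learner trust the name learner for some elements and the Naive Bayes learner for others.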
Applying the Learners • Mediated schema: address, price, agent-phone, description • Schema of homes.com: area, day-phone, extra-info • Each instance passes through the Name Learner and Naive Bayes, then the Meta-Learner • <area>Seattle, WA</>, <area>Kent, WA</>, <area>Austin, TX</> => per-instance predictions (address,0.8), (description,0.2); (address,0.6), (description,0.4); (address,0.7), (description,0.3) • Combined prediction for area: (address,0.7), (description,0.3) • <day-phone>(278) 345 7215</>, <day-phone>(617) 335 2315</>, <day-phone>(512) 427 1115</> => (agent-phone,0.9), (description,0.1) • <extra-info>Beautiful yard</>, <extra-info>Great beach</>, <extra-info>Close to Seattle</> => (description,0.8), (address,0.2)
Empirical Evaluation • Four domains • Real Estate I & II, Course Offerings, Faculty Listings • For each domain • create mediated DTD & domain constraints • choose five sources • mediated DTDs: 14 - 66 elements, source DTDs: 13 - 48 • Ten runs for each experiment - in each run: • manually provide 1-1 mappings for 3 sources • ask LSD to propose mappings for remaining 2 sources • accuracy = % of 1-1 mappings correctly identified
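The accuracy measure defined above (the percentage of 1-1 mappings correctly identified) can be sketched as follows. The function name and the example mappings are hypothetical; in particular the sample dictionaries are not results from the experiments.

```python
def matching_accuracy(proposed, gold):
    """Percentage of gold 1-1 mappings that the matcher identified correctly.

    proposed, gold: {source element: mediated-schema element}
    """
    if not gold:
        raise ValueError("gold mapping must not be empty")
    correct = sum(1 for src, target in gold.items() if proposed.get(src) == target)
    return 100.0 * correct / len(gold)

# Hypothetical example: three of four source elements are mapped correctly.
gold = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}
proposed = dict(gold, comments="address")
accuracy = matching_accuracy(proposed, gold)
print(accuracy)  # 75.0
```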
Matching Accuracy • [Chart: average matching accuracy (%)] • LSD’s accuracy: 71 - 92% • Best single base learner: 42 - 72% • + Meta-learner: + 5 - 22% • + Constraint handler: + 7 - 13% • + XML learner: + 0.8 - 6%
Outline • Motivation and problem definition • Learning to match to a mediated schema • Matching arbitrary schemas using a corpus • Matching web services.
Corpus-Based Schema Matching [Madhavan, Doan, Bernstein, Halevy] • Can we use previous experience to match two new schemas? • Learn about a domain, rather than a mediated schema? • Corpus of schemas and matches • Classifier for every corpus element • Learn general purpose knowledge • Reuse extracted knowledge to match new schemas [Figure labels: Music, Books, Authors, Items, Artists, Publisher, Information, Literature, CDs, Categories]
Exploiting The Corpus • Given an element s ∈ S and t ∈ T, how do we determine if s and t are similar? • The PIVOT Method: • Elements are similar if they are similar to the same corpus concepts • The AUGMENT Method: • Enrich the knowledge about an element by exploiting similar elements in the corpus.
Pivot: measuring (dis)agreement • Compute interpretations w.r.t. the corpus: for an element s of schema S, the interpretation I(s) = (P_1, ..., P_n), where P_k = Probability(s ~ c_k) and n = # concepts in corpus • Interpretation captures how similar an element is to each corpus concept • Interpretations are compared using cosine distance: Similarity(I(s), I(t)) for s ∈ S and t ∈ T
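A minimal sketch of the pivot comparison, assuming each interpretation is a plain list of probabilities with one entry per corpus concept and that the comparison is the standard cosine similarity of the two vectors. The three example vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two interpretation vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("interpretations must cover the same corpus concepts")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical interpretations over three corpus concepts c1, c2, c3.
I_s = [0.9, 0.1, 0.0]  # element s of schema S
I_t = [0.8, 0.2, 0.1]  # element t of schema T, resembling the same concept as s
I_u = [0.0, 0.1, 0.9]  # another element of T, resembling a different concept

sim_st = cosine_similarity(I_s, I_t)
sim_su = cosine_similarity(I_s, I_u)
print(sim_st > sim_su)  # True: s and t agree on which corpus concepts they resemble
```

The point of the pivot is that s and t are never compared directly; they are judged similar because they look like the same corpus concepts.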
Augmenting element models • Search similar corpus concepts • Pick the most similar ones from the interpretation • Build augmented models • Robust since more training data to learn from • Compare elements using the augmented models [Figure: element s of schema S is matched against a corpus of known schemas and mappings (elements e, f) to build the augmented model M’s; an element model holds Name, Instances, Type, …]
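The augmentation steps can be sketched as below, assuming an element model is a dictionary with a set of names and a list of instances, and that the interpretation maps corpus concepts to similarity scores. The function `augment`, the `top_k` parameter, and the sample corpus are hypothetical; the slide does not specify how many concepts are picked or how models are merged.

```python
def augment(model, interpretation, corpus, top_k=1):
    """Merge an element model with the models of its most similar corpus concepts.

    model:          {"names": set of str, "instances": list of str}
    interpretation: {corpus concept: similarity score for this element}
    corpus:         {corpus concept: element model}
    """
    # Pick the most similar concepts from the interpretation.
    ranked = sorted(interpretation, key=interpretation.get, reverse=True)[:top_k]
    names = set(model["names"])
    instances = list(model["instances"])
    for concept in ranked:
        names |= corpus[concept]["names"]
        instances += corpus[concept]["instances"]
    return {"names": names, "instances": instances}

# Hypothetical corpus with two concepts.
corpus = {
    "address": {"names": {"address", "location"}, "instances": ["Miami, FL"]},
    "description": {"names": {"comments"}, "instances": ["Great location"]},
}
area = {"names": {"area"}, "instances": ["Seattle, WA"]}
augmented = augment(area, {"address": 0.7, "description": 0.3}, corpus)
print(sorted(augmented["names"]))  # ['address', 'area', 'location']
```

The augmented model carries more names and instances than the original element, which is the sense in which it gives a learner more training data.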
Experimental Results • Five domains: • Auto and real estate: web forms • Invsmall and inventory: relational schemas • Nameaddr: real XML schemas • Performance measure: • F-Measure: precision and recall are measured in terms of the matches predicted. • Results averaged over hundreds of schema matching tasks!
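A short sketch of the performance measure, assuming the balanced F-measure (the harmonic mean of precision and recall) computed over sets of predicted and correct matches. The slide does not state which F variant was used, and the example matches are hypothetical.

```python
def f_measure(predicted, gold):
    """Return (precision, recall, F) over sets of (source element, target element) matches."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical task: 4 matches predicted, 5 correct matches exist, 3 in common.
gold = {
    ("location", "address"), ("listed-price", "price"), ("phone", "agent-phone"),
    ("comments", "description"), ("agent", "agent-name"),
}
predicted = {
    ("location", "address"), ("listed-price", "price"), ("phone", "agent-phone"),
    ("comments", "address"),
}
precision, recall, f = f_measure(predicted, gold)
print(precision, recall, round(f, 3))  # 0.75 0.6 0.667
```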
Comparison over domains • [Chart] Corpus-based techniques perform better in all the domains
“Tough” schema pairs • [Chart] Significant improvement in difficult-to-match schema pairs
Mixed corpus • [Chart] A corpus with schemas from different domains can also be useful
Other Corpus Based Tools • A corpus of schemas can be the basis for many useful tools: • Mirror the success of corpora in IR and NLP? • Auto-complete: • I start creating a schema (or show sample data), and the tool suggests a completion. • Formulating queries on new databases: • I ask a query using my terminology, and it gets reformulated appropriately.
Outline • Motivation and problem definition • Learning to match to a mediated schema • Matching arbitrary schemas using a corpus • Matching web services.
Searching for Web Services [Dong, Madhavan, Nemes, Halevy, Zhang] • Over 1000 web services already on WWW. • Keyword search is not sufficient. • Search involves drill-down; don’t want to repeat it. Hence, • Find similar operations • Find operations that compose with this one.
1) Operations With Similar Functionality • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op1 and Op2 are similar operations.
2) Operations with Similar Inputs/Outputs • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State • These operations have similar inputs.
3) Composable Operations • Op1: GetTemperature • Input: Zip, Authorization • Output: Return • Op2: WeatherFetcher • Input: PostCode • Output: TemperatureF, WindChill, Humidity • Op3: LocalTimeByZipcode • Input: Zipcode • Output: LocalTimeByZipCodeResult • Op4: ZipCodeToCityState • Input: ZipCode • Output: City, State • Op5: CityStateToZipCode • Input: City, State • Output: ZipCode • Composition: the input of Op2 is similar to the output of Op5.
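The composition check on this slide can be sketched as follows. The hand-written `CONCEPT` table stands in for whatever parameter similarity the search engine computes, and requiring every input of the consumer to be covered is a simplification of "similar"; both are assumptions for illustration.

```python
# Hypothetical mapping from lowercase parameter names to concepts.
CONCEPT = {
    "zip": "postal_code",
    "zipcode": "postal_code",
    "postcode": "postal_code",
}

def concepts(params):
    """Normalize parameter names to concepts (unknown names map to themselves)."""
    return {CONCEPT.get(p.lower(), p.lower()) for p in params}

def composes(producer, consumer):
    """True if the producer's outputs cover every input the consumer needs."""
    return concepts(consumer["input"]) <= concepts(producer["output"])

# The five operations from the slide.
ops = [
    {"name": "GetTemperature", "input": ["Zip", "Authorization"], "output": ["Return"]},
    {"name": "WeatherFetcher", "input": ["PostCode"],
     "output": ["TemperatureF", "WindChill", "Humidity"]},
    {"name": "LocalTimeByZipcode", "input": ["Zipcode"],
     "output": ["LocalTimeByZipCodeResult"]},
    {"name": "ZipCodeToCityState", "input": ["ZipCode"], "output": ["City", "State"]},
    {"name": "CityStateToZipCode", "input": ["City", "State"], "output": ["ZipCode"]},
]

pairs = [
    (a["name"], b["name"])
    for a in ops for b in ops
    if a is not b and composes(a, b)
]
for producer, consumer in pairs:
    print(producer, "->", consumer)
```

Under these assumptions CityStateToZipCode composes with WeatherFetcher, as on the slide, but not with GetTemperature, whose Authorization input nothing produces.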
Why is this Hard? • Little to go on: • Input/output parameters (they don’t mean much) • Method name • Text descriptions of operation or web service (typically bad) • Difference from schema matching: • Web service not a coherent schema • Different level of granularity.
Main Ideas • Measure similarity of each of the components of the WS-operation: I, O, description, WS description. • Cluster parameter names into concepts. • Heuristic: Parameters occurring together tend to express the same concepts • When comparing inputs/outputs, compare parameters and concepts separately, and combine the results.
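One way to act on the co-occurrence heuristic is sketched below: terms that frequently appear together in the same parameter list are merged into one cluster. The thresholds, the confidence formula, and the sample parameter lists are assumptions; this is a simplified association-based merge, not the clustering algorithm used in Woogle.

```python
from collections import Counter
from itertools import combinations

def cluster_terms(parameter_sets, min_support=2, min_confidence=0.6):
    """Group terms that co-occur often; returns a sorted list of sorted clusters."""
    term_count = Counter()
    pair_count = Counter()
    for params in parameter_sets:
        terms = sorted(set(params))
        term_count.update(terms)
        pair_count.update(combinations(terms, 2))

    # Union-find over terms.
    parent = {t: t for t in term_count}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), n in pair_count.items():
        # Share of the rarer term's occurrences that also contain the other term.
        confidence = n / min(term_count[a], term_count[b])
        if n >= min_support and confidence >= min_confidence:
            parent[find(a)] = find(b)

    clusters = {}
    for t in term_count:
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical parameter lists drawn from several operations.
parameter_sets = [
    ["city", "state", "zip"],
    ["city", "state"],
    ["zip", "city", "state"],
    ["temperature", "humidity"],
    ["temperature", "humidity", "windchill"],
    ["zip", "temperature"],
]
clusters = cluster_terms(parameter_sets)
print(clusters)
```

Here "zip" and "temperature" co-occur only once, so they stay in separate clusters, and "windchill" is too rare to be merged with anything.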
Woogle • A collection of 790 web services, 431 active web services, 1262 operations • Functions: • Web service similarity search • Keyword search on web service descriptions • Keyword search on inputs/outputs • Web service category browse • Web service on-site try • Web service status report • http://www.cs.washington.edu/woogle
Conclusion • Semantic reconciliation is crucial for data sharing. • Learning from experience: an important ingredient. • See Transformic Inc. • Current challenges: large schemas, GUIs, dealing with other meta-data issues.