Semantic Enrichment of Mappings
Patrick Arnold
WDI-Lab, Abteilung für Datenbanken, Universität Leipzig

Outline
1. Motivation
2. Goals
3. Related Work
4. Determining the Relation Type
5. Implementation
6. First Results
7. Conclusions

1. Motivation
• Classic approaches in schema/ontology matching provide little information about the correspondences
  • Source node
  • Target node
  • Confidence
• Further details are commonly omitted
  • What kind of relation?
    • equal, is-a, part-of, overlap
  • Simple correspondence vs. complex correspondence?
    • (first name, last name) ↔ name
  • Transformation functions?
    • gross price = net price * (1 + sales tax rate)
    • name = first name + " " + last name

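The two transformation examples can be written as small functions. This is only a minimal sketch in Python: the function and parameter names are illustrative assumptions, and the sales tax is assumed to be given as a fraction (for example 0.19 for 19 %).

```python
def gross_price(net_price: float, sales_tax_rate: float) -> float:
    """gross price = net price * (1 + sales tax rate)."""
    return net_price * (1 + sales_tax_rate)


def full_name(first_name: str, last_name: str) -> str:
    """name = first name + " " + last name (a complex 2:1 correspondence)."""
    return first_name + " " + last_name
```

An enriched mapping could attach such a function to a correspondence, so that a later merge or transformation step knows how to compute the target value from the source values.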
1. Motivation
• Our intention: mapping enrichment
  • Enhance a mapping by adding further or more specific information to its correspondences
  • Useful for merging and transforming schemas/ontologies
• Workflow:
  • Input: a mapping
  • Mapping enrichment is carried out in an independent system (black box)
  • Output: an enriched mapping
    • Implies a new, more specific format

1. Motivation
• Typical relation types and their inverse types (relation type → inverse type)
  • Equal → Equal
  • Is-a → Inverse is-a
  • Part-of → Has-a
  • Overlap → Overlap

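The relation types and their inverses can be captured in a small lookup table. The following is a minimal sketch; the names `RelationType`, `INVERSE` and `inverse` are assumptions made for illustration and do not come from the slides.

```python
from enum import Enum


class RelationType(Enum):
    EQUAL = "equal"
    IS_A = "is-a"
    INVERSE_IS_A = "inverse is-a"
    PART_OF = "part-of"
    HAS_A = "has-a"
    OVERLAP = "overlap"


# Each relation type mapped to the type seen from the opposite direction.
INVERSE = {
    RelationType.EQUAL: RelationType.EQUAL,
    RelationType.IS_A: RelationType.INVERSE_IS_A,
    RelationType.INVERSE_IS_A: RelationType.IS_A,
    RelationType.PART_OF: RelationType.HAS_A,
    RelationType.HAS_A: RelationType.PART_OF,
    RelationType.OVERLAP: RelationType.OVERLAP,
}


def inverse(relation: RelationType) -> RelationType:
    """Return the relation type that holds when source and target are swapped."""
    return INVERSE[relation]
```

Equal and overlap are symmetric, so they are their own inverses; swapping source and target of an is-a or part-of correspondence changes the type.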
2. Goals
• First focus: detecting the relation type of a correspondence
  • Investigate linguistic methods on element level
  • Can be extended with existing strategies
  • equal, is-a, inverse is-a
• Later…
  • Relation type detection on instance level
  • Exploiting background knowledge
  • Correspondence type, transformation rules, …

3. Related Work
• Several projects deal with this problem
• Mainly based on the following methods:
  • Using dictionaries, thesauri, corpora
    • WordNet, GermaNet
    • Includes tokenization, normalization of strings etc.
  • Using background knowledge
    • The Open University: using Swoogle to retrieve multiple ontologies referring to a concept
  • Exploiting the structure between ontologies
  • Exploiting reasoning, Bayes nets, feature vectors etc.
  • Search engines (Google)

3. Related Work
• S-Match
  • Complex strategy using WordNet to determine the following relations:
    • Equal, more-general, less-general, overlap, mismatch
    • “Overlap” offers little useful information (concepts are somehow related…)
  • Approach: annotate each word in a label with all meanings of this word found in WordNet
    • Compare/match the meanings of the words
    • Exploit the relations offered by WordNet

3. Related Work
• TaxoMap
  • Focus on geographic ontologies
  • Detects the relations equal, is-a, inverse is-a and is-close
  • Focuses on the correspondence itself rather than on the type
  • Is-a relation if a label in node S appears in node T and is a full word
  • Uses WordNet as an additional source
    • Works on manually pre-defined branches of WordNet instead of the entire thesaurus
    • Useful for domain-specific ontologies
  • Recall: 23 %, Precision: 83 %

3. Related Work
• LogMap
  • Uses reasoning algorithms to repair/discover mappings
  • Based on Horn logic and the Dowling-Gallier algorithm
  • Uses background knowledge (thesauri)
  • Detects full correspondences and weak correspondences
  • No specific relation detection per se

4. Relation Type Determination
4.1 Introduction
• Typically, there is no link between the syntax and semantics of words
  • stool, chair, seat… refer to the same kind of object
  • stool, school, tool, pool, wool… have nothing in common!
• Things change when it comes to compounds…
  • blackbird is a bird
  • high school is a school

4. Relation Type Determination
4.1 Introduction
• Compound: two words A, B of a language form a new word AB
  • apple + tree → apple tree
  • sun + glasses → sunglasses
  • forth + with → forthwith
• A, B can be noun, verb, adjective/adverb, preposition
  • We are normally interested in nouns

4. Relation Type Determination
4.1 Introduction
• The following are not compounds:
  • Compositions AB where A (or B) is not an actual word
    • broom, nausea
  • Derivations
    • discard, unload, increase, compound
  • Compositions AB where A and B are not semantically related
    • door (do + or), wither (wit + her)

4. Relation Type Determination
4.1 Introduction
• Unlike for non-compounds, the semantics can generally be derived from the compound’s syntax
  • Especially in nouns
  • blackboard is a board
  • handbag is a bag
• Germanic languages are left-branching (modifiers precede the head, which comes last)
  • Germanic: school bus, central intelligence agency
  • Romance: rio de las palmas (= palm river)
• In English, no changes are applied to the words:
  • German: Ort + Eingang → Ortseingang, Stadt + Bau → Städtebau
  • English: city + limit → city limit, city + planning → city planning

4. Relation Type Determination
4.2 Classification
From a linguistic point of view…
[Table: classification of compounds; content not preserved in the source]
* C ⊈ A, C ⊈ B, AB ~R B

4. Relation Type Determination
4.2 Classification
• From the English point of view…
  • Closed form
    • database, playground, blackbird
  • Hyphenated form
    • bus-driver, single-minded, small-appliance industry
  • Open form
    • web space, container ship, computer scientist
• From a POS point of view…
  • noun-noun, adjective-noun, verb-verb, …

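The three written forms can be told apart from the surface string alone. A minimal sketch, assuming the input is already known to be a compound and that a hyphen takes precedence over a space (as in "small-appliance industry" above); the function name `compound_form` is hypothetical.

```python
def compound_form(compound: str) -> str:
    """Classify the written form of a compound as closed, hyphenated or open."""
    text = compound.strip()
    if "-" in text:
        return "hyphenated"   # bus-driver, small-appliance industry
    if " " in text:
        return "open"         # web space, container ship
    return "closed"           # database, playground
```

Open and hyphenated compounds can be split into their parts by tokenization, whereas closed compounds need a dictionary lookup to find the boundary between A and B.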
4. Relation Type Determination
4.3 First Conclusions
• From the knowledge gained so far, we can enrich correspondences in schemas in two ways:
  • (1) Set the relation type to is-a instead of equal
  • (2) Remove or at least doubt an existing correspondence
• For (1) we assume that AB ⊂ B
  • (cookbook, book, 0.8, equal) → (cookbook, book, 0.8, is-a)
• For (2) we assume that if A is not a word in AB, the correspondence is likely to be false:
  • (stool, tool, 0.9, equal) → false?
  • (refund, fund, 0.7, equal) → false?
  • (discharge, charge, 0.7, equal) → false?

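Both rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system's implementation: `KNOWN_WORDS` is a toy stand-in for a real dictionary or thesaurus, only closed-form compounds are handled, and the names `Correspondence` and `enrich` are hypothetical.

```python
from typing import NamedTuple, Optional

# Toy word list (assumption); a real system would consult a dictionary.
KNOWN_WORDS = {"cook", "book", "black", "board", "bird", "hand", "bag"}


class Correspondence(NamedTuple):
    source: str
    target: str
    confidence: float
    relation: str


def enrich(corr: Correspondence,
           known_words: set = KNOWN_WORDS) -> Optional[Correspondence]:
    """Apply rules (1) and (2); return None if the correspondence is doubtful."""
    source, target = corr.source.lower(), corr.target.lower()
    if target and source != target and source.endswith(target):
        modifier = source[: -len(target)]   # the part A of the compound AB
        if modifier in known_words:
            # Rule (1): AB is a compound with head B, so AB is-a B.
            return corr._replace(relation="is-a")
        # Rule (2): A is not a word, so AB is probably not a compound of B.
        return None
    return corr                             # no rule applies; keep as is
```

With this word list, `(cookbook, book, 0.8, equal)` becomes an is-a correspondence, while `(stool, tool, 0.9, equal)` is returned as `None` because "s" is not a word.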
4. Relation Type Determination
4.4 Mismatches
• A word changed its spelling over the centuries:
  • butterfly (“flutter-by”, “beat fly”, …)
  • Weiße Elster (from Czech: alstra = water)
• A compound has a non-literal meaning (metaphor):
  • Completely different meaning
    • computer mouse, gravy train, buttercup
  • Obvious origin (related in a broad sense):
    • airport, birdhouse, downtown, snowman

4. Relation Type Determination
4.4 Mismatches
• Inaccuracies in (vernacular) language
  • e.g., in biology: strawberry, blackberry, raspberry etc.
  • None of them is a berry in the biological sense
  • (yet tomato, banana, grape, pumpkin, melon etc. are)

4. Relation Type Determination
4.4 Mismatches
• For detecting the relation type, the mismatch problem has no negative effect on the mapping
  • The correspondence is wrong either way
  • (buttercup, cup, equal) is as wrong as (buttercup, cup, is-a)
  • Enrichment has no negative effect on the mapping per se
• Still, enhanced methods can be used to reduce the mismatches

5. Implementation
5.1 Goals
• Determine the following relation types using linguistic methods: equal (default), is-a, inverse is-a
  • Missing: part-of and overlap
• English and German language
  • Main focus on English
• Possibly apply mapping repair
  • Remove correspondences that seem clearly wrong
• Test & evaluation

5. Implementation
5.1 Goals
• First concentrate on the element level
  • Use linguistic knowledge as presented before
• Different cases to be distinguished
  • Single items vs. itemizations

5. Implementation
5.2 Cases
• Simple case (1:1)
  • Source and target node consist of one item
    • blackboard ↔ board
    • high school ↔ school
    • international database conference ↔ conference

5. Implementation
5.2 Cases
• Complex cases (1:n, n:1, n:m)
  • Source/target node consist of several items
    • blackboard, whiteboard ↔ board
    • wine ↔ white wine, red wine
    • beer, wine ↔ wine
    • computers, laptops ↔ computers

5. Implementation
5.3 Node Level vs. Path Level
• Relation type depends on the perspective…
  • Node level vs. path level
• Relation is often…
  • is-a on node level
  • equal on path level

5. Implementation
5.4 Requirements
• Benchmarks / gold standards (English language)
  • Manually defined
• Dictionaries / thesauri
• More specific data structure
  • Correspondence: source node, target node, confidence, type
  • Node: a list of items
  • Item: a list of words
  • Word: single word vs. compound

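The data structure listed above could be modeled as nested dataclasses. A minimal sketch: the field names, the default type `"equal"` and the helper `node_from_label` (which assumes that items in a label are separated by commas and words by whitespace) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    is_compound: bool = False   # single word vs. compound


@dataclass
class Item:
    words: List[Word]           # e.g. "red wine" -> [red, wine]


@dataclass
class Node:
    items: List[Item]           # e.g. "red wine, white wine" -> two items


@dataclass
class Correspondence:
    source: Node
    target: Node
    confidence: float
    relation_type: str = "equal"


def node_from_label(label: str) -> Node:
    """Split a label into items (by comma) and words (by whitespace)."""
    items = []
    for part in label.split(","):
        words = [Word(token) for token in part.split()]
        if words:
            items.append(Item(words))
    return Node(items)
```

For example, `node_from_label("red wine, white wine")` yields a node with two items of two words each, which is the itemization case distinguished in the implementation.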
5. Implementation
5.5 Generating Benchmarks
• Benchmarks
  • More difficult than in standard mappings
  • In some cases difficult to decide even for humans
    • Is a birdhouse a house?
    • Is an airport a port?
• How to judge correspondences in an evaluation?
  • car = bike → FALSE
  • car = auto → TRUE
  • motorbike ⊂ bike → ?

5. Implementation
5.6 Challenges
• Exocentric compounds
  • airport, buttercup, saw tooth, …
• Compounds in itemizations
  • (French wine, German wine ↔ French wine): inverse is-a
  • (French wine, German wine ↔ European wine): is-a
  • (French wine, German wine ↔ Mosel wine): overlap
  • (French wine, German wine ↔ Italian wine): mismatch

5. Implementation
5.6 Challenges
• Plurals
  • (Christian churches ↔ church)
  • (red wine, white wine ↔ wines)
• Short forms
  • infant colic ↔ colic (equal instead of is-a)
• Node level vs. path level
  • Compound extending/skipping levels in the schema

5. Implementation
5.6 Challenges
• Limited recall
  • Strong dependency on the input (mapping)
  • Some is-a relations cannot be detected with simple linguistic methods
    • (car, vehicle)
    • (wine, beverage)
    • (cell phones, communication devices)

6. First Results
• Web ↔ Yahoo
  • 421 correspondences
  • 68 subset correspondences
  • Found 50 subset relations, 34 of them correct
  • Recall: 50.0 %
  • Precision: 68.0 %
  • F-measure: 57.6 %

6. First Results
• Google Health ↔ Yahoo Health (excerpt)
  • 396 correspondences
  • 31 subset correspondences
  • Found 20 subset relations, 15 of them correct
  • Recall: 48.4 %
  • Precision: 75.0 %
  • F-measure: 58.8 %

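Precision, recall and F-measure follow from the three counts reported for each mapping: expected subset correspondences, found subset relations, and correct ones among those found. The sketch below assumes the F-measure is the harmonic mean of precision and recall (F1), which the slides do not state explicitly; the function name `evaluate` is hypothetical.

```python
from typing import Tuple


def evaluate(expected: int, found: int, correct: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one mapping."""
    precision = correct / found      # share of found relations that are correct
    recall = correct / expected      # share of expected relations that were found
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from the two result slides.
web_yahoo = evaluate(expected=68, found=50, correct=34)
health = evaluate(expected=31, found=20, correct=15)
```

The harmonic mean never exceeds the arithmetic mean of precision and recall, so it penalizes the gap between the two values in both mappings.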
6. First Results
• Main issues observed…
  • Imprecise labels
    • infant colic ↔ colic (equal)
    • Uterine-Fibroids ↔ Uterus.Fibroids (equal)
    • picture frames ↔ frames (equal in field “arts”)
  • Node-path discrepancies
  • “No-compound” subsets
    • vehicle ↔ car (is-a)

7. Conclusions
• Mapping enrichment
  • Relation type
  • Simple vs. complex correspondences
  • Transformation rules
• Relation type determination
  • Linguistic approach on element level
  • Compounds, itemizations
• Advanced methods
  • Instance level, background knowledge etc.
  • Increase recall, keep up precision

Discussion
Thank you!