Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information. Michael Roth, Sabine Schulte im Walde. Universität Stuttgart.


  1. Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information Michael Roth Sabine Schulte im Walde Universität Stuttgart

  2. Overview • Motivation / Introduction • Data-intensive lexical semantics • Corpus-based descriptions • Semantic Associations • Our Work • Evaluation of data-driven models • Cross-comparison between resources • Summary / Conclusions

  3. Data-intensive lexical semantics • Modelling word meaning • Using meaning aspects • Automatically obtainable • Goal: Determine (dis)similarity of words • Applications: • Word sense discrimination • Anaphora resolution • ...

  4. Corpus-based Descriptions • Disadvantage: Corpus co-occurrence does not cover all aspects of word meaning • Especially world knowledge • Our question: Can we find complementary information in other resources? • Dictionaries? • Encyclopaedias?

  5. Dictionary and Encyclopaedia • Consider other resources: • Dictionaries contain detailed information about word senses • Encyclopaedias are written compendiums of knowledge • How to identify meaning aspects? • In our work, we rely on semantic associations

  6. Semantic Associations • Definition: • We define semantic associations as concepts spontaneously called to mind by other concepts (stimuli) • Assumption: • Evoked words reflect highly salient linguistic and conceptual features

  7. Data Collection: Verb Stimuli • Associates to verb stimuli • Web experiment • 330 verb stimuli • 30 seconds per verb

  8. Data Collection: Noun Stimuli • Associates to noun stimuli • Offline experiment • 409 noun stimuli • 3 associates per noun

  9. Knowledge Resources • Corpus data • German newspaper corpus • ~200 million words • Dictionary: WDG (Wörterbuch der deutschen Gegenwartssprache) • Freely available dictionary (130,000 entries) • Average of 840 words/entry • Encyclopaedia: Wikipedia • Free online encyclopaedia (650,000 articles) • Average of 1,164 words/article

  10. Analysis: Procedure • Corpus data • Extract co-occurrence windows of stimuli • Check windows for associations • WDG / Wikipedia • Download entries for the stimuli • Check content for associations • Missing entries: • WDG - 7%/0% • Wikipedia - 2%/54%
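The two checks on this slide can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation: the function names, the window size of 5 tokens, and the plain lower-cased token matching (no lemmatisation) are all assumptions, since the slides do not specify them.

```python
from typing import Iterator, List, Set


def cooccurrence_windows(tokens: List[str], stimulus: str, size: int = 5) -> Iterator[List[str]]:
    """Yield the tokens up to `size` positions left and right of each stimulus occurrence.

    `size` is a hypothetical parameter; the slides do not state the window width.
    """
    for i, tok in enumerate(tokens):
        if tok == stimulus:
            lo = max(0, i - size)
            yield tokens[lo:i] + tokens[i + 1:i + 1 + size]


def associates_in_windows(tokens: List[str], stimulus: str,
                          associates: Set[str], size: int = 5) -> Set[str]:
    """Return the associates that appear in at least one window around the stimulus."""
    found: Set[str] = set()
    for window in cooccurrence_windows(tokens, stimulus, size):
        found.update(a for a in associates if a in window)
    return found


def associates_in_entry(entry_text: str, associates: Set[str]) -> Set[str]:
    """Return the associates that occur as whitespace-separated words in an entry text."""
    words = set(entry_text.lower().split())
    return {a for a in associates if a in words}


# Toy data, invented for illustration only.
tokens = "the rabbit ate an orange carrot in the garden".split()
print(associates_in_windows(tokens, "carrot", {"orange", "rabbit", "tears"}, size=2))
print(associates_in_entry("A carrot is an orange root vegetable", {"orange", "tears"}))
```

With a window of 2 tokens only "orange" is found next to "carrot"; widening the window to 5 also picks up "rabbit", which shows how strongly the result depends on the assumed window size.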

  11. Analysis: Resource Coverage • Noun + associate (all) • Verb + associate (all) [Chart not preserved; surviving value labels: 1.2 2.3 1.8 1.2 2.0 1.7] • Resources differ in ... • coverage per stimulus part-of-speech • token/type ratio • proportions per associate's part-of-speech (next slide)
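The slide does not define how coverage or the token/type ratio were computed. One plausible reading, shown here purely as an assumption, treats every participant response as a token and every distinct stimulus-associate pair as a type. The function name `coverage_stats` and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (stimulus, associate)


def coverage_stats(responses: List[Pair], found: Set[Pair]) -> Dict[str, float]:
    """Compute token coverage, type coverage and token/type ratio of the covered pairs.

    responses: one (stimulus, associate) pair per participant response (tokens).
    found: the distinct pairs attested in a resource.
    This is an assumed definition, not one taken from the slides.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    tokens_found = [pair for pair in responses if pair in found]
    types_found = set(tokens_found)
    return {
        "token_coverage": len(tokens_found) / len(responses),
        "type_coverage": len(types_found) / len(set(responses)),
        "token_type_ratio": len(tokens_found) / len(types_found) if types_found else 0.0,
    }


# Toy data, invented for illustration only.
responses = [("carrot", "orange")] * 3 + [("carrot", "rabbit")] + [("cry", "tears")] * 2
found = {("carrot", "orange"), ("cry", "tears")}
print(coverage_stats(responses, found))
```

Under this reading, a higher token/type ratio means the resource mainly covers associates that many participants produced.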

  12. Analysis: Resource Coverage (2) Proportions per associate's part-of-speech: • Noun stimuli • Corpus – 88% V > 84% N > 83% Adj • WDG – 43% V > 31% Adj > 26% N • Wikipedia – 49% N > 39% Adj > 37% V • Verb stimuli • Corpus – 91% Adv > 79% V > 77% Adj > 76% N • WDG – 29% Adv > 28% V > 25% N > 24% Adj • Wikipedia – 12% N > 9% Adj/Adv > 6% V

  13. Analysis: Cross-Comparison • Noun + associate • World knowledge? • Only in WDG/Wiki: carrot – orange, cry – tears, ... • Only in Corpus: igloo – eskimo, teach – school, ... • Verb + associate
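A cross-comparison of this kind can be expressed with set operations over the stimulus-associate pairs found in each resource. The sketch below is an assumption about how such a comparison might be set up, not the authors' code. The function name `cross_compare` is hypothetical, and the assignment of the slide's example pairs to WDG or Wikipedia individually is invented for illustration (the slide only says "WDG/Wiki").

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (stimulus, associate)


def cross_compare(corpus: Set[Pair], dictionary: Set[Pair],
                  encyclopaedia: Set[Pair]) -> Dict[str, Set[Pair]]:
    """Split pairs into those found only in the corpus, only in dictionary/encyclopaedia, or in both."""
    lexicon = dictionary | encyclopaedia
    return {
        "only_corpus": corpus - lexicon,
        "only_dictionary_or_encyclopaedia": lexicon - corpus,
        "shared": corpus & lexicon,
    }


# Example pairs taken from the slide; which of WDG or Wikipedia
# contains each pair is an assumption made for this illustration.
corpus = {("igloo", "eskimo"), ("teach", "school")}
wdg = {("carrot", "orange")}
wikipedia = {("cry", "tears")}

for name, pairs in cross_compare(corpus, wdg, wikipedia).items():
    print(name, sorted(pairs))
```

Pairs that land in only one of the first two groups are the candidates for complementary information between resources.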

  14. Summary / Conclusions • Analysis of associations across resources • Results: • Different coverage per stimulus part-of-speech (noun vs. verb) • Different (predominant) PoS in word descriptions • Different strength of semantic relatedness • Resources complement each other => A combination of resources should be helpful for modelling word meaning and similarity

  15. Questions?
