Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information Michael Roth Sabine Schulte im Walde Universität Stuttgart
Overview • Motivation / Introduction • Data-intensive lexical semantics • Corpus-based descriptions • Semantic Associations • Our Work • Evaluation of data-driven models • Cross-comparison between resources • Summary / Conclusions
Data-intensive lexical semantics • Modelling word meaning • Using meaning aspects • Automatically obtainable • Goal: Determine (dis)similarity of words • Applications: • Word sense discrimination • Anaphora resolution • ...
Corpus-based Descriptions • Disadvantage: Corpus co-occurrence does not cover all aspects of word meaning • Especially world knowledge • Our question: Can we find complementing information in other resources? • Dictionaries? • Encyclopaedias?
Dictionary and Encyclopaedia • Consider other resources: • Dictionaries contain detailed information about word senses • Encyclopaedias are written compendia of knowledge • How to identify meaning aspects? • In our work, we rely on semantic associations
Semantic Associations • Definition: • We define semantic associations as concepts spontaneously called to mind by other concepts (stimuli) • Assumption: • Evoked words reflect highly salient linguistic and conceptual features
Data Collection: Verb Stimuli • Associates to verb stimuli • Web experiment • 330 verb stimuli • 30 seconds per verb
Data Collection: Noun Stimuli • Associates to noun stimuli • Offline experiment • 409 noun stimuli • 3 associates per noun
Knowledge Resources • Corpus data • German newspaper corpus • ~200 million words • Dictionary: WDG (Wörterbuch der deutschen Gegenwartssprache) • Freely available dictionary (130,000 entries) • Average of 840 words/entry • Encyclopaedia: Wikipedia • Free online encyclopaedia (650,000 articles) • Average of 1,164 words/article
Analysis: Procedure • Corpus data • Extract co-occurrence windows of stimuli • Check windows for associations • WDG / Wikipedia • Download the entries for the stimuli • Check content for associations • Missing entries: • WDG - 7%/0% • Wikipedia - 2%/54%
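The two checks on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the exact-match comparison on whitespace tokens, and the `window` size are all assumptions, since the slides do not state the window width or any preprocessing.

```python
def cooccurrence_windows(tokens, stimulus, window=20):
    """Yield the context tokens around each occurrence of `stimulus`.

    `window` (tokens to the left and to the right) is a hypothetical
    parameter; the slides do not give the size that was used.
    """
    for i, tok in enumerate(tokens):
        if tok == stimulus:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            yield left + right


def associates_in_corpus(tokens, stimulus, associates, window=20):
    """Return the associates found in at least one window of the stimulus."""
    wanted = set(associates)
    found = set()
    for context in cooccurrence_windows(tokens, stimulus, window):
        found |= wanted.intersection(context)
    return found


def associates_in_entry(entry_text, associates):
    """Return the associates that occur in a dictionary or encyclopaedia entry.

    Assumes simple lower-cased whitespace tokenisation, for illustration only.
    """
    words = set(entry_text.lower().split())
    return {a for a in associates if a.lower() in words}
```

In practice, lemmatisation would probably be needed for German, because an associate may appear in an inflected form; the sketch ignores this.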
Analysis: Resource Coverage • Noun + associate (all) • Verb + associate (all) • [Chart not reproduced; surviving values: 1.2, 2.3, 1.8, 1.2, 2.0, 1.7] • Resources differ in ... • coverage per stimulus part of speech • token/type ratio • proportions per associate's part of speech (next slide)
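A token/type ratio over covered associations could be computed along these lines. The sketch assumes that a token is one participant response (a stimulus–associate pair) found in a resource and that a type is a distinct pair; the slides do not spell out what was counted, so this is one plausible reading. The function name is hypothetical.

```python
def token_type_ratio(covered_responses):
    """Number of covered response tokens divided by number of distinct pairs.

    `covered_responses` is a list of (stimulus, associate) pairs, one item
    per participant response found in the resource (assumed data layout).
    """
    if not covered_responses:
        return 0.0
    return len(covered_responses) / len(set(covered_responses))
```

A higher ratio would mean that the pairs a resource covers tend to be those named by many participants.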
Analysis: Resource Coverage (2) Proportions per associate's part-of-speech: • Noun stimuli • Corpus – 88% V > 84% N > 83% Adj • WDG – 43% V > 31% Adj > 26% N • Wikipedia – 49% N > 39% Adj > 37% V • Verb stimuli • Corpus – 91% Adv > 79% V > 77% Adj > 76% N • WDG – 29% Adv > 28% V > 25% N > 24% Adj • Wikipedia – 12% N > 9% Adj/Adv > 6% V
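Proportions of this kind can be derived as in the sketch below. It assumes each associate carries a part-of-speech tag and that coverage is recorded as a set of stimulus–associate pairs; the data layout and the function name are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def coverage_by_pos(associations, covered):
    """Proportion of associations found in a resource, per associate PoS.

    associations: iterable of (stimulus, associate, pos) triples
    covered: set of (stimulus, associate) pairs found in the resource
    """
    total = Counter()
    hits = Counter()
    for stimulus, associate, pos in associations:
        total[pos] += 1
        if (stimulus, associate) in covered:
            hits[pos] += 1
    return {pos: hits[pos] / total[pos] for pos in total}
```

Running it once per resource and per stimulus class (noun or verb) would give one row of the slide each.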
Analysis: Cross-Comparison • Noun + associate • World knowledge? • Only in WDG/Wiki: carrot – orange, cry – tears, ... • Only in Corpus: igloo – eskimo, teach – school, ... • Verb + associate
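The cross-comparison amounts to set operations over the pairs each resource covers, as sketched below. The function and key names are hypothetical, and the toy data only mirrors the slide's examples: the slide does not say whether carrot – orange comes from WDG or from Wikipedia, so its placement here is an assumption.

```python
def cross_compare(corpus_pairs, wdg_pairs, wiki_pairs):
    """Split covered (stimulus, associate) pairs by where they were found."""
    dict_or_wiki = wdg_pairs | wiki_pairs
    return {
        "only_corpus": corpus_pairs - dict_or_wiki,
        "only_wdg_wiki": dict_or_wiki - corpus_pairs,
        "in_all": corpus_pairs & wdg_pairs & wiki_pairs,
    }


# Toy data for illustration only.
corpus = {("igloo", "eskimo"), ("teach", "school")}
wdg = {("cry", "tears")}
wiki = {("carrot", "orange")}
result = cross_compare(corpus, wdg, wiki)
```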
Summary / Conclusions • Analysis of associations across resources • Results: • Different coverage per stimulus part of speech (noun vs. verb) • Different (predominant) PoS in word descriptions • Different strength of semantic relatedness • Resources complement each other => A combination of resources should be helpful for modelling word meaning and similarity