Michael Roth

Comparing Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information • Michael Roth • Sabine Schulte im Walde • Universität Stuttgart

Overview • Motivation / Introduction • Data-intensive lexical semantics • Corpus-based descriptions • Semantic Associations • Our Work • Evaluation of data-driven models • Cross-comparison between resources • Summary / Conclusions

Data-intensive lexical semantics • Modelling word meaning • Using meaning aspects • Automatically obtainable • Goal: Determine (dis)similarity of words • Applications: • Word sense discrimination • Anaphora resolution • ...

Corpus-based Descriptions • Disadvantage: Corpus co-occurrence does not cover all aspects of word meaning • Especially world knowledge • Our question: Can we find complementing information in other resources? • Dictionaries? • Encyclopaedias?

Dictionary and Encyclopaedia • Consider other resources: • Dictionaries contain detailed information about word senses • Encyclopaedias are written compendiums of knowledge • How to identify meaning aspects? • In our work, we rely on semantic associations

Semantic Associations • Definition: • We define semantic associations as concepts spontaneously called to mind by other concepts (stimuli) • Assumption: • Evoked words reflect highly salient linguistic and conceptual features

Data Collection: Verb Stimuli • Associates to verb stimuli • Web experiment • 330 verb stimuli • 30 seconds per verb

Data Collection: Noun Stimuli • Associates to noun stimuli • Offline experiment • 409 noun stimuli • 3 associates per noun

Knowledge Resources • Corpus data • German newspaper corpus • ~200 million words • Dictionary: WDG (Wörterbuch der deutschen Gegenwartssprache) • Freely available dictionary (130,000 entries) • Average of 840 words/entry • Encyclopaedia: Wikipedia • Free online encyclopaedia (650,000 articles) • Average of 1,164 words/article

Analysis: Procedure • Corpus data • Extract co-occurrence windows of stimuli • Check windows for associations • WDG / Wikipedia • Download stimuli entries • Check content for associations • Missing entries: • WDG - 7%/0% • Wikipedia - 2%/54%
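The two checks on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the simple regex tokenizer, the lowercasing, and the window size of 20 tokens are all assumptions, since the slides do not say how windows were defined or whether the texts were lemmatized.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (assumed preprocessing)."""
    return re.findall(r"\w+", text.lower())


def cooccurrence_windows(tokens, stimulus, window=20):
    """Yield the tokens within `window` positions of each stimulus occurrence.

    The window size is a hypothetical default, not a value from the slides.
    """
    for i, tok in enumerate(tokens):
        if tok == stimulus:
            start = max(0, i - window)
            yield tokens[start:i] + tokens[i + 1:i + 1 + window]


def associates_in_corpus(tokens, stimulus, associates, window=20):
    """Return the associates found in at least one window around the stimulus."""
    wanted = set(associates)
    found = set()
    for context in cooccurrence_windows(tokens, stimulus, window):
        found |= wanted.intersection(context)
    return found


def associates_in_entry(entry_text, associates):
    """Return the associates that occur anywhere in a dictionary or Wikipedia entry."""
    return set(associates).intersection(tokenize(entry_text))


# Toy data for illustration only; not taken from the study.
corpus_tokens = tokenize("The rabbit ate a carrot. The orange sun set.")
corpus_hits = associates_in_corpus(
    corpus_tokens, "carrot", ["rabbit", "orange", "tears"], window=3
)
entry_hits = associates_in_entry(
    "A carrot is an orange root vegetable.", ["rabbit", "orange", "tears"]
)
```

On the toy data, `corpus_hits` contains "rabbit" and "orange", while `entry_hits` contains only "orange". A real run on German text would probably need lemmatization so that inflected forms of an associate are matched.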

Analysis: Resource Coverage • Noun + associate (all) • Verb + associate (all) • [Table residue, row and column labels not recoverable: 1.2 2.3 1.8 1.2 2.0 1.7] • Resources differ in ... • coverage per stimulus part-of-speech • token/type ratio • proportions per associate's part-of-speech (next slide)

Analysis: Resource Coverage (2) • Proportions per associate's part-of-speech: • Noun stimuli • Corpus – 88% V > 84% N > 83% Adj • WDG – 43% V > 31% Adj > 26% N • Wikipedia – 49% N > 39% Adj > 37% V • Verb stimuli • Corpus – 91% Adv > 79% V > 77% Adj > 76% N • WDG – 29% Adv > 28% V > 25% N > 24% Adj • Wikipedia – 12% N > 9% Adj/Adv > 6% V

Analysis: Cross-Comparison • Noun + associate • World knowledge? • Only in WDG/Wiki: carrot – orange, cry – tears, ... • Only in Corpus: igloo – eskimo, teach – school, ... • Verb + associate

Summary / Conclusions • Analysis of associations across resources • Results: • Different coverage per stimulus part-of-speech (noun vs. verb) • Different (predominant) PoS in word descriptions • Different strength of semantic relatedness • Resources complement each other => A combination of resources should be helpful for modelling word meaning and similarity

Questions?
