WSD using Optimized Combination of Knowledge Sources
Authors: Yorick Wilks and Mark Stevenson
Presenter: Marian Olteanu
Introduction
• Regular approaches
  • All words
  • Sample (small trial section)
• Problems
  • Ambiguity, especially at fine granularity
  • New senses in text that are not in the dictionary
Approach
• Integrates partial sources of information
  • Part-of-speech
  • Dictionary definitions
  • Pragmatic codes
  • Selectional restrictions
• Integration
  • Filters
  • Partial selectors (taggers)
Dictionary for senses
• Longman Dictionary of Contemporary English (LDOCE)
• Two levels:
  • Homograph
  • Sense
Methodology
• Preprocessing
  • Part-of-speech tagger (Brill)
• Part-of-speech
  • Filter: eliminate all homographs incompatible with the assigned part-of-speech tag
  • If no sense remains, keep all senses
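The part-of-speech filter can be sketched in a few lines. This is only an illustration of the rule on the slide, not the authors' code: the `bank` entries, the field names, and the tag labels `"n"` and `"v"` are made up for the example and are not taken from LDOCE.

```python
def filter_homographs(homographs, pos_tag):
    """Keep only homographs whose part of speech matches the tagger output.

    If nothing is compatible, fall back to the full list so that no word
    is left without candidate senses.
    """
    compatible = [h for h in homographs if h["pos"] == pos_tag]
    return compatible if compatible else list(homographs)


# Hypothetical toy entry: two homographs of "bank", each with its senses.
bank = [
    {"id": "bank_1", "pos": "n", "senses": ["bank_1_1", "bank_1_2"]},
    {"id": "bank_2", "pos": "v", "senses": ["bank_2_1"]},
]

noun_only = filter_homographs(bank, "n")      # only the noun homograph survives
fallback = filter_homographs(bank, "adj")     # no match, so everything is kept
```

The fallback matters because the tagger can be wrong: a mistagged word would otherwise lose every candidate before the later knowledge sources see it.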
Methodology (cont.)
• Dictionary definitions
  • Partial tagger:
    • Count the number of words that appear in both the definition and the context
    • Normalize by the length of the definition
    • Return a list of candidate senses
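A minimal sketch of this overlap score follows. It assumes whitespace tokenization, no stop-word removal, and counts each shared word once per occurrence in the definition; the slide does not specify any of these details. The two definitions are invented for the example, not quoted from LDOCE.

```python
def overlap_score(definition, context):
    """Fraction of definition words that also occur in the context."""
    def_words = definition.lower().split()
    if not def_words:
        return 0.0
    context_words = set(context.lower().split())
    shared = sum(1 for word in def_words if word in context_words)
    return shared / len(def_words)  # normalize by definition length


def rank_senses(definitions, context):
    """Return sense names ordered from highest to lowest overlap score."""
    return sorted(definitions,
                  key=lambda sense: overlap_score(definitions[sense], context),
                  reverse=True)


# Hypothetical definitions for two senses of "bank".
definitions = {
    "bank_river": "land along the side of a river",
    "bank_money": "a place where money is kept",
}
context = "she put her money in a safe place"
ranking = rank_senses(definitions, context)
```

Without the normalization step, long definitions would win simply because they contain more words that can match.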
Methodology (cont.)
• Pragmatic codes
  • Partial tagger: uses the hierarchy of LDOCE pragmatic codes (subject area)
  • Modified simulated annealing
    • Optimize the number of pragmatic codes of the same type in the sentence
    • Whole paragraph; only for nouns?
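To make the optimization concrete, here is a plain (unmodified) simulated annealing sketch. It is not the modified variant the slide refers to. The objective, counting pairs of words that share a subject code, is one assumed reading of "number of pragmatic codes of the same type"; the codes `EC`, `GO`, `HB` and the sense names are placeholders.

```python
import math
import random
from collections import Counter


def agreement(assignment, candidates):
    """Number of word pairs whose chosen senses share a subject code."""
    codes = Counter(candidates[w][i][1] for w, i in assignment.items())
    return sum(n * (n - 1) // 2 for n in codes.values())


def anneal(candidates, steps=500, start_temp=1.0, seed=0):
    rng = random.Random(seed)
    words = list(candidates)
    current = {w: 0 for w in words}          # start from each first sense
    best = dict(current)
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9   # linear cooling
        word = rng.choice(words)
        proposal = dict(current)
        proposal[word] = rng.randrange(len(candidates[word]))
        delta = agreement(proposal, candidates) - agreement(current, candidates)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = proposal
        if agreement(current, candidates) > agreement(best, candidates):
            best = dict(current)
    return {w: candidates[w][i][0] for w, i in best.items()}


# Hypothetical nouns with (sense, subject code) candidates.
candidates = {
    "bank": [("bank_river", "GO"), ("bank_money", "EC")],
    "interest": [("interest_hobby", "HB"), ("interest_money", "EC")],
    "loan": [("loan_money", "EC")],
}
chosen = anneal(candidates)
```

On an example this small, exhaustive search would do just as well; annealing only pays off when the number of sense combinations in a paragraph is too large to enumerate.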
Methodology (cont.)
• Selectional restrictions
  • Filter
  • LDOCE senses: 35 semantic classes (H = human, M = human male, P = plant, etc.)
  • Nouns: their own type
  • Adjectives: the type of the object they modify
  • Adverbs: the type of their modifier
  • Verbs: the types of their subject (S), direct object (DO), and indirect object (IO)
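The filter can be illustrated with a tiny class hierarchy. Only the codes H, M, and P come from the slide; the single parent link (M under H), the sense names, and the idea that a verb sense asks for a class on its subject are assumptions made for this sketch.

```python
# Assumed fragment of the class hierarchy: M (human male) is a kind of H (human).
PARENT = {"M": "H"}


def satisfies(sense_class, required_class):
    """True if sense_class equals required_class or lies below it."""
    while sense_class is not None:
        if sense_class == required_class:
            return True
        sense_class = PARENT.get(sense_class)
    return False


def filter_by_restriction(noun_senses, required_class):
    """Keep noun senses whose semantic class meets the restriction."""
    return [sense for sense, sense_class in noun_senses.items()
            if satisfies(sense_class, required_class)]


# Hypothetical noun with one human-male sense and one plant sense,
# appearing as the subject of a verb sense that requires a human (H).
noun_senses = {"sense_a": "M", "sense_b": "P"}
kept = filter_by_restriction(noun_senses, "H")
```

Walking up the hierarchy is what lets a restriction stated at a general level (H) accept senses tagged with a more specific class (M).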
Methodology (cont.)
• Combine knowledge sources
  • Decision lists
• Can assign sense to unknown words, if there is a definition in LDOCE
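A decision list is an ordered set of rules in which the first rule that matches decides the outcome. The sketch below shows that mechanism applied to one candidate sense. The feature names and the three hand-written rules are hypothetical; the slide does not say which rules the system uses or how they are obtained.

```python
def decide(features, rules, default=False):
    """Return the verdict of the first rule whose test matches."""
    for test, verdict in rules:
        if test(features):
            return verdict
    return default


# Hypothetical rules over the outputs of the individual knowledge sources.
rules = [
    (lambda f: f["definition_overlap"] and f["pragmatic_code"], True),
    (lambda f: not f["passes_selectional"], False),
    (lambda f: f["definition_overlap"], True),
]

# One candidate sense, described by which knowledge sources support it.
candidate = {
    "definition_overlap": True,
    "pragmatic_code": False,
    "passes_selectional": False,
}
accepted = decide(candidate, rules)   # second rule fires first, so rejected
```

Because the rules read only the outputs of the knowledge sources and not the word itself, the same list applies to any word that has an LDOCE definition.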
Evaluation
• Create a corpus based on SemCor (200,000 words; tagged with WordNet senses)
• SENSUS: a merger of LDOCE and WordNet (built for machine translation)
• Some ambiguity still remains
• 36,869 out of 85,747 words (personal opinion: strongly biased)
Results
• Test by voting
• Baseline: 49.8%
• 70% of the 1st sense – correctly tagged
• 83.4% accuracy = 92.8% accuracy on all words (!!!)
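The step from 83.4% to 92.8% can be checked with a short calculation. It assumes that the 36,869 words are the ambiguous ones, that the 83.4% figure applies only to them, and that every remaining word counts as correct because it has a single sense. These assumptions are a reading of the slides, not something they state.

```python
total_words = 85_747
ambiguous_words = 36_869          # assumed: the words that needed disambiguation
accuracy_on_ambiguous = 0.834

# Assumed: words with a single sense are trivially correct.
unambiguous_words = total_words - ambiguous_words
correct = unambiguous_words + accuracy_on_ambiguous * ambiguous_words
overall_accuracy = correct / total_words

print(f"overall accuracy: {overall_accuracy:.4f}")
```

Under those assumptions the result lands between 0.928 and 0.929, which is consistent with the figure quoted for all words.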