Word Sense Disambiguation
Foundations of Statistical NLP, Chapter 7
January 18, 2002. Kyung-Hee Sung
Contents
• Methodological Preliminaries
• Supervised Disambiguation
  • Bayesian classification / An information-theoretic approach
• Disambiguation based on dictionaries, thesauri and bilingual dictionaries
• One sense per discourse, one sense per collocation
• Unsupervised Disambiguation
Introduction
• Word sense disambiguation
  • Many words have several meanings or senses.
  • Many words have different usages. Ex) Butter may be used as a noun or as a verb.
• The task of disambiguation is to determine which sense is intended by looking at the context of the word's use.
Methodological Preliminaries (2/2)
• Pseudowords: artificial ambiguous words, created in order to test the performance of disambiguation algorithms. Ex) banana-door
• Performance estimation
  • Upper bound: human performance
  • Lower bound (baseline): the performance of the simplest possible algorithm, usually the assignment of all contexts to the most frequent sense.
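A minimal sketch of how such a pseudoword corpus can be built; the function and data names are illustrative, not from the chapter:

```python
# Build a pseudoword corpus by conflating two real words. Every
# occurrence of "banana" or "door" is replaced by the artificial
# ambiguous token "banana-door"; the original word is the gold sense.

def make_pseudoword_corpus(sentences, w1="banana", w2="door"):
    labeled = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in (w1, w2):
                context = tokens[:i] + [w1 + "-" + w2] + tokens[i + 1:]
                labeled.append((context, tok))  # tok is the true sense
    return labeled

corpus = [["the", "banana", "was", "ripe"],
          ["she", "closed", "the", "door"]]
print(make_pseudoword_corpus(corpus))
```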
Bayesian classification (1/2)
• Assumption: we have a corpus where each use of the ambiguous word is labeled with its correct sense.
• Bayes decision rule: minimizes the probability of error.
  • Decide s′ if P(s′|c) > P(sk|c) for all sk ≠ s′.
  • Using Bayes' rule: P(sk|c) = P(c|sk) P(sk) / P(c).
  • Since P(c) is constant for all senses, this amounts to choosing s′ = argmax_sk P(c|sk) P(sk).
Bayesian classification (2/2)
• Naive Bayes independence assumption
  • All structure and linear ordering of words within the context is ignored → bag of words model.
  • The presence of one word in the bag is independent of another: P(c|sk) = Π_{vj in c} P(vj|sk).
• Decision rule for Naive Bayes
  • Decide s′ = argmax_sk [ log P(sk) + Σ_{vj in c} log P(vj|sk) ].
  • P(vj|sk) and P(sk) are computed by MLE: P(vj|sk) = C(vj, sk) / C(sk), P(sk) = C(sk) / C(w).
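A minimal sketch of this decision rule, assuming a small sense-labeled training set; the add-one smoothing is a practical addition, not part of the chapter's MLE formulation:

```python
import math
from collections import Counter, defaultdict

# Naive Bayes word sense disambiguator (bag-of-words model).
# train: list of (context_words, sense) pairs with gold sense labels.

def train_nb(train):
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in train:
        sense_counts[sense] += 1
        for w in context:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float("-inf")
    for sense, n in sense_counts.items():
        # log P(sk) + sum over vj of log P(vj|sk); add-one smoothing
        # avoids zero probabilities for unseen words (an assumption
        # made here for robustness).
        score = math.log(n / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

train = [(["river", "water", "flow"], "bank/shore"),
         (["money", "loan", "rate"], "bank/finance")]
model = train_nb(train)
print(disambiguate(["loan", "river", "money"], *model))  # bank/finance
```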
An Information-theoretic approach (1/2)
• The Flip-Flop algorithm is applied to finding indicators for disambiguation between two senses.
• It alternately searches for a 2-way partition {P1, P2} of the translations and a 2-way partition {Q1, Q2} of the indicator values, at each step choosing the partition that increases the mutual information I(P; Q), until no further improvement is possible.
An Information-theoretic approach (2/2)
• Example: translating prendre (French) based on its object
  • Translations {t1, …, tm} = { take, make, rise, speak }
  • Values of the indicator (the object) {x1, …, xn} = { mesure, note, exemple, décision, parole }
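A toy sketch of the Flip-Flop idea on this example, assuming (translation, object) co-occurrence pairs; brute-force search over 2-way partitions stands in for an efficient splitting procedure and only works for small sets:

```python
import itertools
import math
from collections import Counter

# pairs: (translation, indicator_value) observations. We alternately
# pick a 2-way partition of translations (P) and of indicator values
# (Q) so as to increase the mutual information I(P; Q).

def mutual_information(pairs, P1, Q1):
    n = len(pairs)
    joint = Counter((t in P1, x in Q1) for t, x in pairs)
    pt = Counter(t in P1 for t, x in pairs)
    px = Counter(x in Q1 for t, x in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log(p_ab / ((pt[a] / n) * (px[b] / n)))
    return mi

def best_split(items, score):
    items = sorted(items)
    best, best_mi = None, float("-inf")
    for r in range(1, len(items)):          # proper, nonempty subsets
        for subset in itertools.combinations(items, r):
            mi = score(set(subset))
            if mi > best_mi:
                best, best_mi = set(subset), mi
    return best, best_mi

def flip_flop(pairs, iterations=5):
    translations = {t for t, _ in pairs}
    values = {x for _, x in pairs}
    P1 = {next(iter(translations))}         # arbitrary initial partition
    for _ in range(iterations):
        Q1, _ = best_split(values, lambda Q: mutual_information(pairs, P1, Q))
        P1, mi = best_split(translations, lambda P: mutual_information(pairs, P, Q1))
    return P1, Q1, mi

pairs = [("take", "mesure"), ("take", "note"), ("take", "exemple"),
         ("make", "décision"), ("rise", "parole"), ("speak", "parole")]
print(flip_flop(pairs))
```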
Dictionary-based disambiguation (1/2)
• A word's dictionary definitions are good indicators for the senses they define (Lesk, 1986).
• Disambiguate by choosing the sense whose definition overlaps most with the context.
Dictionary-based disambiguation (2/2)
• Simplified example: ash
• The score of a sense is the number of words that are shared by the sense definition and the context.
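A minimal sketch of this overlap scoring; the definitions below are paraphrased for illustration, not quoted from any particular dictionary:

```python
# Lesk-style dictionary overlap scoring for the two senses of "ash".

definitions = {
    "tree":    {"tree", "of", "the", "olive", "family"},
    "residue": {"solid", "residue", "left", "when", "material", "is", "burned"},
}

def lesk_score(context_words, sense_definition):
    # Score = number of words shared by the definition and the context.
    return len(set(context_words) & sense_definition)

context = "the fire left nothing but ash and burned residue".split()
best = max(definitions, key=lambda s: lesk_score(context, definitions[s]))
print(best)  # "residue"
```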
Thesaurus-based disambiguation (1/2)
• The semantic categories of the words in a context determine the semantic category of the context as a whole, and this category in turn determines which word senses are used.
• Each word is assigned one or more subject codes in the dictionary.
  • t(sk): subject code of sense sk.
  • The score of sense sk is the number of context words that are compatible with the subject code t(sk).
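A small sketch of this subject-code scoring; the codes, senses, and word lists are invented for illustration:

```python
# Score each sense of "interest" by how many context words carry a
# subject code compatible with that sense's code.

subject_codes = {
    "interest_money":     "FINANCE",
    "interest_attention": "FEELING",
}
word_codes = {
    "bank": {"FINANCE"}, "rate": {"FINANCE"},
    "hobby": {"FEELING"}, "music": {"FEELING", "ART"},
}

def score(sense, context):
    code = subject_codes[sense]
    return sum(1 for w in context if code in word_codes.get(w, set()))

context = ["bank", "rate", "loan"]
print(max(subject_codes, key=lambda s: score(s, context)))  # interest_money
```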
Thesaurus-based disambiguation (2/2)
• Failure case: the sense 'self-interest (advantage)' is not topic-specific.
• When a sense is spread out over several topics, the topic-based classification algorithm fails.
Disambiguation based on translations in a second-language corpus
• In order to disambiguate an occurrence of interest in English (the first language), we identify the phrase it occurs in and search a German (second language) corpus for instances of that phrase.
• The English phrase showed interest: show (E) → zeigen (G).
• zeigen (G) will occur only with Interesse (G), since 'legal shares' are usually not shown.
• We can conclude that interest in the phrase to show interest belongs to the sense 'attention, concern'.
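A toy sketch of this idea, assuming co-occurrence counts extracted from a German corpus; all counts are invented, and the counting scheme is one plausible way to operationalize the method:

```python
# Possible German translations of "interest", keyed by sense.
# "Beteiligung" is used here as an illustrative translation for the
# legal-share sense.
sense_translations = {
    "attention, concern": ["Interesse"],
    "legal share":        ["Beteiligung"],
}

def disambiguate(verb_translation, german_corpus_counts):
    # Choose the sense whose translation co-occurs most often with the
    # translated verb (here: zeigen for English "show").
    def count(sense):
        return sum(german_corpus_counts.get((verb_translation, t), 0)
                   for t in sense_translations[sense])
    return max(sense_translations, key=count)

counts = {("zeigen", "Interesse"): 20, ("zeigen", "Beteiligung"): 0}
print(disambiguate("zeigen", counts))  # attention, concern
```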
One sense per discourse constraint
• The sense of a target word is highly consistent within any given document.
• Ex) If the first occurrence of plant is a use of the sense 'living being', then later occurrences are likely to refer to living beings too.
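A minimal sketch of applying the constraint, assuming it is operationalized as relabeling every occurrence in a document with the document's majority sense:

```python
from collections import Counter

def apply_one_sense_per_discourse(doc_labels):
    # doc_labels: initially assigned senses for one document.
    majority, _ = Counter(doc_labels).most_common(1)[0]
    return [majority] * len(doc_labels)

print(apply_one_sense_per_discourse(
    ["living being", "living being", "factory"]))
# ['living being', 'living being', 'living being']
```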
One sense per collocation constraint
• Word senses are strongly correlated with certain contextual features, like other words in the same phrasal unit.
• Collocational features are ranked according to the ratio P(sk1|fm) / P(sk2|fm), estimated from C(sk, fm), the number of occurrences of sense sk with collocation fm.
• Relying on only the strongest feature has the advantage that no integration of different sources of evidence is necessary.
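A small sketch of ranking collocational features by this ratio for a two-sense word; the smoothing constant is a practical assumption to avoid division by zero, and all counts are illustrative:

```python
from collections import Counter

# train: (feature, sense) pairs; counts[(f, s)] plays the role of C(sk, fm).
def rank_features(train, s1, s2, alpha=0.1):
    counts = Counter(train)
    features = {f for f, _ in train}
    def ratio(f):
        return (counts[(f, s1)] + alpha) / (counts[(f, s2)] + alpha)
    # The strongest features are those whose ratio is farthest from 1,
    # in either direction.
    return sorted(features, key=lambda f: max(ratio(f), 1 / ratio(f)),
                  reverse=True)

train = [("robbery", "bank/finance"), ("river", "bank/shore"),
         ("loan", "bank/finance"), ("river", "bank/shore"),
         ("water", "bank/shore"), ("loan", "bank/finance")]
print(rank_features(train, "bank/finance", "bank/shore"))
```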
Unsupervised Disambiguation (1/3)
• There are situations in which even such a small amount of information (a dictionary, a thesaurus, or a bilingual corpus) is not available.
• Sense tagging requires that some characterization of the senses be provided. However, sense discrimination can be performed in a completely unsupervised fashion.
• Context-group discrimination: a completely unsupervised algorithm that clusters unlabeled occurrences of an ambiguous word.
An EM algorithm (1/2)
1. Initialize the parameters of the model μ randomly. Compute the log likelihood l(C|μ) of the corpus C.
2. While l(C|μ) is improving, repeat:
   (a) E-step. Estimate hik, the posterior probability that context ci was generated by sense sk:
       hik = P(ci|sk) P(sk) / Σk′ P(ci|sk′) P(sk′),
       where P(ci|sk) = Π_{vj in ci} P(vj|sk) by the Naive Bayes assumption.
An EM algorithm (2/2)
   (b) M-step. Re-estimate the parameters P(vj|sk) and P(sk) by way of MLE, using the hik as soft counts:
       P(vj|sk) = Σ_{i: vj in ci} hik / Zk, where Zk normalizes so that Σj P(vj|sk) = 1;
       P(sk) = Σi hik / I, where I is the number of contexts.
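A minimal sketch of the whole EM loop under the Naive Bayes assumption; log-space arithmetic would be needed for long contexts, and the smoothing constant is an addition for numerical safety, not part of the slides:

```python
import math
import random
from collections import Counter

# Unsupervised sense discrimination with K senses (bag-of-words model).
# contexts: list of word lists, one per occurrence of the ambiguous word.

def em_discriminate(contexts, K=2, iterations=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})

    # 1. Initialize P(vj|sk) and P(sk) randomly.
    p_w = []
    for _ in range(K):
        raw = {w: rng.random() for w in vocab}
        z = sum(raw.values())
        p_w.append({w: p / z for w, p in raw.items()})
    p_s = [1.0 / K] * K

    for _ in range(iterations):
        # (a) E-step: hik = P(sk) P(ci|sk) / sum over k' of P(sk') P(ci|sk').
        h = []
        for c in contexts:
            scores = [p_s[k] * math.prod(p_w[k][w] for w in c)
                      for k in range(K)]
            z = sum(scores)
            h.append([s / z for s in scores])

        # (b) M-step: re-estimate P(sk) and P(vj|sk) from soft counts.
        for k in range(K):
            p_s[k] = sum(row[k] for row in h) / len(contexts)
            counts = Counter()
            for i, c in enumerate(contexts):
                for w in c:
                    counts[w] += h[i][k]
            z = sum(counts.values()) + 1e-6 * len(vocab)
            p_w[k] = {w: (counts[w] + 1e-6) / z for w in vocab}
    return p_s, p_w, h

contexts = [["river", "water"], ["money", "loan"],
            ["water", "flow"], ["loan", "rate"]]
p_s, p_w, h = em_discriminate(contexts)
print(h)  # soft sense assignments per context
```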
Unsupervised Disambiguation (2/3)
• Unsupervised disambiguation can be easily adapted to produce distinctions between usage types.
  • Ex) The distinction between physical banks (in the context of bank robberies) and banks as abstract corporations (in the context of corporate mergers).
• The unsupervised algorithm splits dictionary senses into fine-grained contextual variants.
  • Usually, the induced clusters do not line up well with dictionary senses. Ex) 'lawsuit' → 'civil suit', 'criminal suit'.
Unsupervised Disambiguation (3/3)
• Infrequent senses, and senses that have few collocations, are hard to isolate in unsupervised disambiguation.
• Results of the EM algorithm, reported for ten experiments with different initial conditions, show that the algorithm fails for words whose senses are topic-independent, such as 'to teach' for train.