
Word Sense Disambiguation



Presentation Transcript


  1. Foundation of Statistical Natural Language Processing Word Sense Disambiguation 2014.05.10 Minho Kim (karma@pusan.ac.kr)

  2. Motivation • Computationally determining which sense of a word is activated by its use in a particular context. • E.g. I am going to withdraw money from the bank. • One of the central challenges in NLP. • Needed in: • Machine Translation: For correct lexical choice. • Information Retrieval: Resolving ambiguity in queries. • Information Extraction: For accurate analysis of text.

  3. Senses and ambiguity • Many words have different meanings (senses) in different contexts • E.g. bank → river bank; financial institution • The problem is further complicated by the fact that the “senses” of a particular word are often only subtly different.

  4. Homonymy and Polysemy

  5. POS Tagging • Some words are used in different parts of speech • “They're waiting in line at the ticket office.” → Noun • “You should line a coat with fur” → Verb • The techniques used for tagging and sense disambiguation are a bit different. • For tagging, the local context is heavily used, e.g. looking at the use of determiners and predicates. • For word sense disambiguation, the techniques look at a broader context of the word. Tagging is explored in Chapter 10.

  6. Methodological Preliminaries • Corpus-based approaches • Rely on corpus evidence. • Supervised and unsupervised learning • Train a model using a tagged or untagged corpus. • Probabilistic/statistical models. • Knowledge-based approaches • Rely on knowledge resources like WordNet, thesauri, etc. • May use grammar rules for disambiguation. • May use hand-coded rules for disambiguation. • Hybrid approaches • Use corpus evidence as well as semantic relations from WordNet.

  7. Corpus-Based Approaches • Supervised and unsupervised learning • In supervised learning, we know the actual “sense” of a word, which is labeled • Supervised learning tends to be a classification task • Unsupervised learning tends to be a clustering task • Providing labeled corpora is expensive • Knowledge sources can help with the task • Dictionaries, thesauri, aligned bilingual texts

  8. Pseudowords • When one has difficulty coming up with sufficient training and test data, one technique is to create “pseudowords” from an existing corpus. • E.g. replace banana and door with the pseudoword “banana-door”. • The ambiguous set is the text with pseudowords. • The disambiguated set is the original text.
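A minimal sketch of pseudoword creation as described above; the function name and token handling are illustrative only, not taken from the book.

```python
# Replace every occurrence of the chosen words with an ambiguous pseudoword,
# keeping the original word as the gold "sense" label.
def make_pseudoword_corpus(tokens, words=("banana", "door"), pseudo="banana-door"):
    ambiguous, gold = [], []
    for tok in tokens:
        if tok in words:
            ambiguous.append(pseudo)
            gold.append(tok)          # the original word is the correct sense
        else:
            ambiguous.append(tok)
            gold.append(None)
    return ambiguous, gold

text = "open the door and eat a banana".split()
print(make_pseudoword_corpus(text))
```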

  9. Upper and lower bounds • Upper and lower bounds on performance • Upper bound is usually defined as human performance • Lower bound is given by the simplest possible algorithm • Most Frequent Class • Naïve Bayes • Evaluation measure • Precision, Recall, F-measure
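A small sketch of the evaluation measures listed above, assuming the common WSD convention that a system may abstain on some instances (precision is then computed over attempted instances, recall over all instances):

```python
def evaluate(gold, predicted):
    """gold: list of senses; predicted: list of senses or None (abstain)."""
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(evaluate(["bank/1", "bank/2", "bank/1"], ["bank/1", None, "bank/2"]))
# -> (0.5, 0.333..., 0.4)
```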

  10. Supervised Disambiguation

  11. Classification and Clustering [diagram: a trained model assigns input items A, B, C to Class A, Class B, and Class C]

  12. Sense Tagged Corpus
  <p>
  BSAA0011-00018403 서양에만 서양/NNG + 에/JKB + 만/JX
  BSAA0011-00018404 젤리가 젤리/NNG + 가/JKS
  BSAA0011-00018405 있는 있/VV + 는/ETM
  BSAA0011-00018406 것이 것/NNB + 이/JKC
  BSAA0011-00018407 아니라 아니/VCN + 라/EC
  BSAA0011-00018408 우리 우리/NP
  BSAA0011-00018409 나라에서도 나라/NNG + 에서/JKB + 도/JX
  BSAA0011-00018410 앵두 앵두/NNG
  BSAA0011-00018411 사과 사과__05/NNG
  BSAA0011-00018412 모과 모과__02/NNG
  BSAA0011-00018413 살구 살구/NNG
  BSAA0011-00018414 같은 같/VA + 은/ETM
  BSAA0011-00018415 과일로 과일__01/NNG + 로/JKB
  BSAA0011-00018416 '과편'을 '/SS + 과편/NNG + '/SS + 을/JKO
  BSAA0011-00018417 만들어 만들/VV + 어/EC
  BSAA0011-00018418 먹었지만 먹__02/VV + 었/EP + 지만/EC
  BSAA0011-00018419 수박은 수박__01/NNG + 은/JX
  BSAA0011-00018420 물기가 물기/NNG + 가/JKS
  BSAA0011-00018421 너무 너무/MAG
  BSAA0011-00018422 많고 많/VA + 고/EC
  BSAA0011-00018423 펙틴질이 펙틴질/NNG + 이/JKS
  BSAA0011-00018424 없어 없/VA + 어/EC
  BSAA0011-00018425 가공해 가공__01/NNG + 하/XSV + 아/EC
  BSAA0011-00018426 먹지 먹__02/VV + 지/EC
  BSAA0011-00018427 못했다. 못하/VX + 았/EP + 다/EF + ./SF
  </p>
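A minimal sketch for reading such sense-tagged lines, assuming the format shown above: token ID, surface form, then a morphological analysis in which homograph sense numbers follow a double underscore (e.g. 사과__05 marks the "apple" sense of 사과, which can also mean "apology"). The regular expression and function names are illustrative, not part of any standard tool.

```python
import re

# morph pattern: lemma, optional "__NN" sense number, "/" POS tag
MORPH = re.compile(r'(?P<lemma>[^/+\s]+?)(?:__(?P<sense>\d+))?/(?P<pos>[A-Z]+)')

def parse_line(line):
    token_id, surface, analysis = line.split(None, 2)
    morphs = [(m['lemma'], m['sense'], m['pos']) for m in MORPH.finditer(analysis)]
    return token_id, surface, morphs

print(parse_line("BSAA0011-00018411 사과 사과__05/NNG"))
# -> ('BSAA0011-00018411', '사과', [('사과', '05', 'NNG')])
```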

  13. Notational Conventions

  14. Supervised task • The idea here is that there is a training set of exemplars in which each word that needs to be disambiguated is tagged with its correct “sense”. • The task is to correctly classify the word sense in the test set, using the statistical properties gleaned from the training set for the occurrence of the word in a particular context. • This chapter explores two approaches to this problem: the Bayesian approach and the information-theoretic approach.

  15. Bayesian Classification

  16. Prior Probability • Prior probability: the probability before we consider any additional knowledge

  17. Conditional probability • Sometimes we have partial knowledge about the outcome of an experiment • Conditional (or posterior) probability • Suppose we know that event B is true • The probability that A is true given the knowledge about B is expressed by P(A|B) = P(A, B) / P(B)

  18. http://ai.stanford.edu/~paskin/gm-short-course/lec1.pdf

  19. Conditional probability (cont) • Note: P(A,B) = P(A ∩ B) • Chain Rule • P(A, B) = P(A|B) P(B) = the probability that A and B both happen is the probability that B happens times the probability that A happens given that B has occurred. • P(A, B) = P(B|A) P(A) = the probability that A and B both happen is the probability that A happens times the probability that B happens given that A has occurred. • A joint distribution can be seen as a multi-dimensional table with a value in every cell giving the probability of that specific state occurring

  20. Chain Rule P(A,B) = P(A|B) P(B) = P(B|A) P(A) P(A,B,C,D,…) = P(A) P(B|A) P(C|A,B) P(D|A,B,C) …

  21. Bayes' rule Chain Rule → Bayes' rule P(A,B) = P(A|B) P(B) = P(B|A) P(A), hence P(A|B) = P(B|A) P(A) / P(B). Useful when one of the quantities is easier to calculate; a trivial consequence of the definitions we just saw, but extremely useful.

  22. Bayes' rule Bayes' rule translates causal knowledge into diagnostic knowledge. For example, if A is the event that a patient has a disease, and B is the event that she displays a symptom, then P(B | A) describes a causal relationship, and P(A | B) describes a diagnostic one (that is usually hard to assess). If P(B | A), P(A) and P(B) can be assessed easily, then we get P(A | B) for free.

  23. Example • S: stiff neck, M: meningitis • P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20 • I have a stiff neck; should I worry?
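A worked answer (not spelled out on the slide, but it follows directly from Bayes' rule and the numbers given):

$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50{,}000}{1/20} = 0.0002$$

i.e. about a 1-in-5,000 chance, so a stiff neck alone is not much cause for worry.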

  24. (Conditional) independence • Two events A and B are independent of each other if P(A) = P(A|B) • Two events A and B are conditionally independent of each other given C if P(A|C) = P(A|B,C)

  25. Back to language • Statistical NLP aims to do statistical inference for the field of NLP • Topic classification • P( topic | document ) • Language models • P( word | previous word(s) ) • WSD • P( sense | word ) • Two main problems • Estimation: P is unknown: estimate P • Inference: we have estimated P; now we want to find (infer) the topic of a document, or the sense of a word

  26. Language Models (Estimation) • In general, for language events, P is unknown • We need to estimate P (or a model M of the language) • We’ll do this by looking at evidence about what P must be, based on a sample of data

  27. Estimation of P • Frequentist statistics • Parametric • Non-parametric (distribution free) • Bayesian statistics • Bayesian statistics measures degrees of belief • Degrees are calculated by starting with prior beliefs and updating them in face of the evidence, using Bayes theorem • 2 different approaches, 2 different philosophies

  28. Inference • The central problem of computational probability theory is the inference problem: • Given a set of random variables X1, …, Xk and their joint density P(X1, …, Xk), compute one or more conditional densities given observations. • Compute • P(X1 | X2, …, Xk) • P(X3 | X1) • P(X1, X2 | X3, X4) • Etc. • Many problems can be formulated in these terms.

  29. Bayes decision rule • w: ambiguous word • S = {s1, s2, …, sn}: senses for w • C = {c1, c2, …, cn}: contexts of w in a corpus • V = {v1, v2, …, vj}: words used as contextual features for disambiguation • Bayes decision rule • Decide sj if P(sj | c) > P(sk | c) for all sk ≠ sj • We want to assign w to the sense s' where s' = argmax_sk P(sk | c)

  30. Bayes classification for WSD • We want to assign w to the sense s' where s' = argmax_sk P(sk | c) • We usually do not know P(sk | c), but we can compute it using Bayes' rule: P(sk | c) = P(c | sk) P(sk) / P(c) • Since P(c) is the same for every sense, s' = argmax_sk P(c | sk) P(sk)

  31. Naïve Bayes classifier • Naïve Bayes classifier widely used in machine learning • Estimate P(c | sk) and P(sk)

  32. Naïve Bayes classifier • Estimate P(c | sk) and P(sk) • w: ambiguous word • S = {s1, s2, …, sn}: senses for w • C = {c1, c2, …, cn}: contexts of w in a corpus • V = {v1, v2, …, vj}: words used as contextual features for disambiguation • Naïve Bayes assumption: P(c | sk) = ∏_{vj in c} P(vj | sk)

  33. Naïve Bayes classifier • Naïve Bayes assumption: P(c | sk) = ∏_{vj in c} P(vj | sk) • Two consequences • All the structure and linear ordering of words within the context is ignored → bag-of-words model • The presence of one word in the model is independent of the others • Not true, but the model is “easier” and very “efficient” • “easier” and “efficient” mean something specific in the probabilistic framework • We’ll see later (but: easier to estimate parameters and more efficient inference) • The Naïve Bayes assumption is inappropriate if there are strong dependencies, but it often does very well (partly because the decision may be optimal even if the assumption is not correct)

  34. Naïve Bayes for WSD • Bayes decision rule: decide s' = argmax_sk [ log P(sk) + Σ_{vj in c} log P(vj | sk) ] • Naïve Bayes assumption: P(c | sk) = ∏_{vj in c} P(vj | sk) • Estimation: P(vj | sk) = C(vj, sk) / C(sk), where C(vj, sk) is the count of vj when the sense is sk • Prior probability of sk: P(sk) = C(sk) / C(w)

  35. Naïve Bayes Algorithm for WSD • TRAINING (aka Estimation) • For all senses sk of w do • For all words vj in the vocabulary calculate P(vj | sk) = C(vj, sk) / C(sk) • end • end • For all senses sk of w do • calculate P(sk) = C(sk) / C(w) • end

  36. Naïve Bayes Algorithm for WSD • TESTING (aka Inference or Disambiguation) • For all senses sk of w do • For all words vj in the context window c calculate score(sk) = log P(sk) + Σ_{vj in c} log P(vj | sk) • end • end • Choose s' = argmax_sk score(sk)
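A compact sketch of the training and disambiguation loops from slides 35-36, assuming training data given as (context words, sense) pairs; the add-one smoothing is an extra assumption of this sketch, not something shown on the slides.

```python
import math
from collections import Counter, defaultdict

def train(instances, alpha=1.0):
    """instances: iterable of (list_of_context_words, sense)."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)          # word_counts[sense][word]
    vocab = set()
    for context, sense in instances:
        sense_counts[sense] += 1
        word_counts[sense].update(context)
        vocab.update(context)
    n = sum(sense_counts.values())
    priors = {s: math.log(c / n) for s, c in sense_counts.items()}
    likelihoods = {
        s: {v: math.log((word_counts[s][v] + alpha) /
                        (sum(word_counts[s].values()) + alpha * len(vocab)))
            for v in vocab}
        for s in sense_counts
    }
    return priors, likelihoods

def disambiguate(context, priors, likelihoods):
    """Choose argmax_sk [ log P(sk) + sum_vj log P(vj | sk) ]."""
    def score(s):
        return priors[s] + sum(likelihoods[s][v] for v in context if v in likelihoods[s])
    return max(priors, key=score)

# Toy usage with the running "bank" example
data = [(["money", "withdraw", "account"], "bank/1"),
        (["river", "muddy", "water"], "bank/2")]
priors, likelihoods = train(data)
print(disambiguate(["credit", "account"], priors, likelihoods))   # -> bank/1
```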

  37. An information-theoretic approach

  38. Information theoretic approach • Look for key words (informants) that disambiguate the sense of the word.

  39. Flip-Flop Algorithm • t1, …, tm → translations of the ambiguous word • x1, …, xn → possible values of the indicator • The algorithm works by searching for a partition of the translations (senses) and a partition of the indicator values that maximize their mutual information. • The algorithm stops when the increase in mutual information becomes insignificant.
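A brute-force sketch of the flip-flop search, assuming a small co-occurrence table counts[t][x] between translations t and indicator values x (e.g. taken from an aligned corpus), so that all two-way partitions can simply be enumerated; the real algorithm finds the best split far more efficiently, so this is only meant to make the alternating loop concrete.

```python
from itertools import combinations
from math import log

def mutual_information(counts, P, Q):
    """I(P;Q) for 2-way partitions P of translations and Q of indicator values."""
    total = sum(c for row in counts.values() for c in row.values())
    all_t = set(counts)
    all_x = {x for row in counts.values() for x in row}
    def mass(T, X):
        return sum(counts[t].get(x, 0) for t in T for x in X) / total
    mi = 0.0
    for T in P:
        for X in Q:
            joint, p_t, p_x = mass(T, X), mass(T, all_x), mass(all_t, X)
            if joint > 0:
                mi += joint * log(joint / (p_t * p_x), 2)
    return mi

def best_split(items, score):
    """Enumerate all 2-way partitions of a small set and keep the best one."""
    items = list(items)
    best, best_mi = None, float("-inf")
    for r in range(1, len(items)):
        for left in combinations(items, r):
            part = (set(left), set(items) - set(left))
            mi = score(part)
            if mi > best_mi:
                best, best_mi = part, mi
    return best, best_mi

def flip_flop(counts, tol=1e-6):
    translations = list(counts)
    values = sorted({x for row in counts.values() for x in row})
    P = (set(translations[:1]), set(translations[1:]))     # arbitrary start
    prev = float("-inf")
    while True:
        Q, _ = best_split(values, lambda q: mutual_information(counts, P, q))
        P, mi = best_split(translations, lambda p: mutual_information(counts, p, Q))
        if mi - prev < tol:          # stop when the gain becomes insignificant
            return P, Q, mi
        prev = mi
```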

  40. Stepping through the flip-flop algorithm for the French word: prendre

  41. Disambiguation process • Once the partitions P and Q (i.e. the indicator values) have been determined, disambiguation is simple: • For every occurrence of the ambiguous word, determine the value xi of the indicator. • If xi is in Q1, assign the occurrence to sense 1; otherwise assign it to sense 2.
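Given the partition from the sketch above, the disambiguation step is a single membership test (Q1 being the set of indicator values assigned to sense 1; names are illustrative):

```python
def assign_sense(indicator_value, Q1):
    # sense 1 if the indicator value fell into the Q1 block, sense 2 otherwise
    return "sense 1" if indicator_value in Q1 else "sense 2"
```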

  42. Decision Lists

  43. Decision Lists and Trees • Very widely used in machine learning. • Decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988). • Represent the disambiguation problem as a series of questions (presence of a feature) that reveal the sense of a word. • A list decides between two senses after one positive answer • A tree allows a decision among multiple senses after a series of answers • Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes. • More descriptive and easier to interpret.

  44. Decision List for WSD (Yarowsky, 1994) • Identify collocational features from sense-tagged data. • Word immediately to the left or right of the target: • I have my bank/1 statement. • The river bank/2 is muddy. • Pair of words to the immediate left or right of the target: • The world’s richest bank/1 is here in New York. • The river bank/2 is muddy. • Words found within k positions to the left or right of the target, where k is often 10-50: • My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.

  45. Building the Decision List • Sort the collocation tests by the (absolute) log ratio of the conditional probabilities of the senses given the feature. • Words most indicative of one sense (and not the other) will be ranked highly.

  46. Computing DL score • Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense) • P(S=1) = 1,500/2,000 = .75 • P(S=2) = 500/2,000 = .25 • Given “credit” occurs 200 times with bank/1 and 4 times with bank/2. • P(F1=“credit”) = 204/2,000 = .102 • P(F1=“credit”|S=1) = 200/1,500 = .133 • P(F1=“credit”|S=2) = 4/500 = .008 • From Bayes Rule… • P(S=1|F1=“credit”) = .133*.75/.102 = .978 • P(S=2|F1=“credit”) = .008*.25/.102 = .020 • DL Score = abs (log (.978/.020)) = 3.89
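A sketch of the same score computed directly from the raw counts on the slide; the small difference from 3.89 comes from the slide rounding the intermediate probabilities before taking the log.

```python
import math

def dl_score(count_s1, count_s2, n_s1, n_s2):
    """abs(log(P(S=1|F) / P(S=2|F))), via Bayes' rule from raw counts."""
    n = n_s1 + n_s2
    p_f = (count_s1 + count_s2) / n          # P(F)
    p_s1_f = (count_s1 / n_s1) * (n_s1 / n) / p_f
    p_s2_f = (count_s2 / n_s2) * (n_s2 / n) / p_f
    return abs(math.log(p_s1_f / p_s2_f))

# "credit": 200 times with bank/1 (1,500 instances), 4 times with bank/2 (500 instances)
print(round(dl_score(200, 4, 1500, 500), 2))   # ~3.91 (slide's rounding gives 3.89)
```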

  47. Using the Decision List • Sort by DL score, then go through the test instance looking for a matching feature. The first match reveals the sense…
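A minimal sketch of that lookup, assuming the decision list is already sorted by score in descending order (the data layout and default sense are illustrative assumptions):

```python
def classify(instance_features, decision_list, default="bank/1"):
    """decision_list: [(feature, sense, score), ...], sorted by score descending."""
    for feature, sense, _ in decision_list:
        if feature in instance_features:
            return sense          # first matching test decides the sense
    return default                # fall back, e.g. to the most frequent sense
```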

  48. Using the Decision List

  49. Support Vector Machine (SVM)

  50. Ch. 15 Linear classifiers: Which Hyperplane? • Lots of possible solutions for a, b, c. • Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness] • E.g., perceptron • A Support Vector Machine (SVM) finds an optimal solution. • It maximizes the distance between the hyperplane and the “difficult points” close to the decision boundary • One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions • This line represents the decision boundary: ax + by − c = 0
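A toy sketch of classifying 2-D points with the decision boundary ax + by − c = 0 from the slide; the coefficients and points are made up, and the reported margin is just the geometric distance to the boundary, which the SVM criterion tries to maximize for the closest points.

```python
import math

a, b, c = 1.0, 2.0, 3.0          # illustrative hyperplane coefficients

def classify(x, y):
    score = a * x + b * y - c                         # sign gives the class
    margin = abs(score) / math.sqrt(a * a + b * b)    # distance to the boundary
    return (1 if score >= 0 else -1), margin

print(classify(4.0, 1.0))   # -> (1, ~1.34): comfortably far from the boundary
print(classify(1.0, 1.1))   # -> (1, ~0.09): a "difficult point" near the boundary
```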
