COMP791A: Statistical Language Processing. Word Sense Disambiguation Chap. 7. Overview of the problem. Many words have several meanings or senses (homonyms or polysemous words) Ex: “chair” --> furniture or person Ex: “dishes” --> plates or food
COMP791A: Statistical Language Processing Word Sense Disambiguation Chap. 7
Overview of the problem • Many words have several meanings or senses (homonyms or polysemous words) • Ex: “chair” --> furniture or person • Ex: “dishes” --> plates or food • Need to determine which sense of a word is used in a specific sentence • Note: • often, the different senses of a word are closely related • Ex: “title” --> right of legal ownership, document that is evidence of the legal ownership, name of work,… • often, several senses can be “activated” in a single context (co-activation) • Ex: “This could bring competition to the trade” • Competition --> the act of competing AND the people who are competing
Word Sense Disambiguation (WSD) • To determine which of the senses of an ambiguous word is invoked in a particular use of the word. • Potentially extremely useful problem • Ex: in machine translation… • “chair” --> (person) “directeur” • “chair” --> (furniture) “chaise” • “bureau” --> “desk” • “bureau” --> “office” • Can be done: • with rule-based methods • with statistical methods
WordNet • most widely-used lexical database for English • free! • G. Miller at Princeton www.cogsci.princeton.edu/~wn • used in many applications of NLP • EuroWordNet • Dutch, Italian, Spanish, German, French, Czech and Estonian • includes entries for open-class words only (nouns, verbs, adjectives & adverbs)
WordNet Entries • in WordNet 1.6 (now 2.0): • 118,000 different word forms • organized according to their meanings (senses) • each entry has • a dictionary-style definition (gloss) of each sense • AND a set of domain-independent lexical relations among • WordNet’s entries (words) • senses • sets of synonyms • grouped into synsets (i.e. sets of synonyms)
Rule-based WSD • They served green-lipped mussels from New Zealand. • Which airlines serve Denver? • semantic restrictions on the predicate of an argument • argument mussels: --> needs a predicate with the sense {provide-food} --> sense 6 of WordNet • argument Denver: --> needs a predicate with the sense {attend-to} --> sense 10 of WordNet
Rule-based WSD • In our house, everybody has a career and none of them includes washing dishes. • In her tiny kitchen, Ms. Chen works efficiently, stir-frying several simple dishes, including braised pig’s ears and chicken livers with green peppers. • semantic restrictions on the argument of a predicate • predicate wash: --> needs an argument with the sense {object} --> senses 1, 2 or 6 from WordNet • predicate stir-fry: --> needs an argument with the sense {food} --> sense 2 of WordNet
Problem with rule-based WSD • In some cases, the constraints on the predicate and on the argument are not enough to pinpoint one unique sense • ex: “What kind of dishes do you recommend?” • Figures of speech • meaning of words can be generated dynamically • instead of being fixed and stored in a lexicon or set of selectional restrictions • Ex: metaphor, metonymy
Problem with rule-based WSD (con’t) • Metaphor: • using words / phrases whose meanings are appropriate to different kinds of concepts • suggesting a likeness or analogy between them • This deal does not scare Microsoft. • scare has 2 senses in WordNet: • to cause fear • to cause to lose courage • metaphor: the corporation is viewed as a person • She is drowning in money • metaphor: money is viewed as a liquid
Problem with rule-based WSD (con’t) • Metonymy: • referring to a concept by naming some other concept closely related to it • We await word from the crown. • a monarch is not the same thing as a crown • but we often refer to the monarch as "the crown" because the two are associated • Metonymy : the crown refers to the monarch • The White House had no comment. • Metonymy : The White House refers to the administration
WSD versus POS tagging • “butter” can be a verb or noun • “I should butter my toast.” • “I like butter on my toast.” • 2 different POS --> 2 different usages with 2 different meanings • So WSD can be viewed as POS tagging (classifying using semantic tags rather than POS tags) • But the 2 tasks are considered different… because: • nearby structural cues (ex: is the previous word a determiner?) • are important in POS tagging • are not effective for WSD • distant content words • are very effective for WSD • are not interesting for POS • So: • in POS tagging, we typically only look at the local context • in WSD, we use content words in a larger context
Approaches to Statistical WSD • Supervised Disambiguation • based on a labeled training set • The learning system has: • a training set of feature-encoded inputs AND • their appropriate sense label (category) • Based on Lexical Resources • use of external lexical resources such as dictionaries and thesauri • Discourse properties • Unsupervised Disambiguation • based on unlabeled corpora • The learning system has: • a training set of feature-encoded inputs BUT • NOT their appropriate sense label (category)
Approaches to Statistical WSD • --> Supervised Disambiguation • Naïve Bayes • Decision Trees • Use of Lexical Resources • Dictionary-based • Thesaurus-based • Translation-based • Discourse properties • Unsupervised Disambiguation
Supervised WSD: Overview • A word is assumed to have a finite number of discrete senses. • The sense of a word depends on the sense of surrounding words • ex: bass = fish, musical instrument, ...
Supervised WSD: Overview (con’t) • WSD is viewed as a typical classification problem • use machine learning techniques to train a system • that learns a classifier (a function f) to assign to unseen examples one of a fixed number of senses (categories) • f(input) = correct sense • Input: • Target word: • The word to be disambiguated • Context (feature vector): • a vector of relevant linguistic features that represents its context (ex: a window of words around the target word)
Examples of Feature Vectors • Take a window of n words around the target word • Encode information about the words around the target word • typical features include: words, root forms, POS tags, frequency, … • An electric guitar and bass player stand off to one side, not really part of the scene, just as a sort of nod to gringo expectations perhaps. • with position information • [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ] • no position information, but word frequency • [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band] • [0,0,0,1,0,0,0,0,0,0,1,0] • other features: • [followed by "player", contains "show" in the sentence,…] • [yes, no, … ]
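The two encodings on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the helper names `tokenize`, `window` and `bag_of_words_vector` are hypothetical, the tokenizer is deliberately crude, and the window size of 2 is chosen only because it reproduces the four context words listed above.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def window(tokens, target_index, n):
    """Return the n tokens on each side of the target, target excluded."""
    left = tokens[max(0, target_index - n):target_index]
    right = tokens[target_index + 1:target_index + 1 + n]
    return left + right

def bag_of_words_vector(context, vocabulary):
    """Count how often each vocabulary word occurs in the context."""
    return [context.count(word) for word in vocabulary]

sentence = ("An electric guitar and bass player stand off to one side, "
            "not really part of the scene")
vocabulary = ["fishing", "big", "sound", "player", "fly", "rod",
              "pound", "double", "runs", "playing", "guitar", "band"]

tokens = tokenize(sentence)
context = window(tokens, tokens.index("bass"), n=2)
vector = bag_of_words_vector(context, vocabulary)

print(context)  # the words around "bass"
print(vector)   # one count per vocabulary word
```

The vector keeps no position information: it only records which vocabulary words occur near the target, which is the representation the Naïve Bayes classifier later relies on.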
Supervised WSD • Training corpus • Each occurrence of the ambiguous word w is annotated with a semantic label (its contextually appropriate sense sk). • Several approaches from ML • Bayesian classification • Decision trees • Neural networks • K-nearest neighbor (kNN) • …
Approaches to Statistical WSD • --> Supervised Disambiguation • --> Naïve Bayes • Decision Trees • Use of Lexical Resources • Dictionary-based • Thesaurus-based • Translation-based • Discourse properties • Unsupervised Disambiguation
Naïve Bayes Classification • Goal: choose the most probable sense s* for a word given a vector V of surrounding words • vector contains: • frequency of words • vocabulary: [fishing, big, sound, player, fly, rod, …] • [0, 0, 0, 2, 1, 0, …] • Bayes decision rule: • $s^* = \arg\max_{s_k} P(s_k \mid V)$ • where: • S is the set of possible senses for the target word • sk is a sense in S • V is the feature vector (the representation of the context) • Using Bayes rule: $P(s_k \mid V) = \frac{P(V \mid s_k)\,P(s_k)}{P(V)}$
Decision Rule for Naive Bayes • $s^* = \arg\max_{s_k} \frac{P(V \mid s_k)\,P(s_k)}{P(V)}$ • But: P(V) is the same for all possible senses, so it does not affect the final ranking of the senses, so we can drop it: $s^* = \arg\max_{s_k} P(V \mid s_k)\,P(s_k)$ • To make the computations simpler, we often take the log of probabilities: $s^* = \arg\max_{s_k} \left[ \log P(V \mid s_k) + \log P(s_k) \right]$
Naïve Bayes WSD • Training a Naïve Bayes classifier = estimating P(vj|sk) and P(sk) from a sense-tagged training corpus = finding the maximum-likelihood estimates, perhaps with appropriate smoothing • $P(v_j \mid s_k) = \frac{C(v_j, s_k)}{\sum_t C(v_t, s_k)}$: number of occurrences of feature j over the total number of features appearing in windows of sk • $P(s_k) = \frac{C(s_k)}{C(w)}$: number of occurrences of sense k over the number of all occurrences of the ambiguous word w • C(·) denotes a count in the training corpus
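Naïve Bayes Algorithm
// 1. training
for all senses sk of word w
    for all words vj in the vocabulary
        compute P(vj|sk) = (nb of occurrences of vj in windows of sk) / (total nb of features in windows of sk)
for all senses sk of word w
    compute P(sk) = (nb of occurrences of sense sk) / (nb of all occurrences of w)
// 2. disambiguation
for all senses sk of word w
    score(sk) = log P(sk)
    for all words vj in the context window
        score(sk) = score(sk) + log P(vj|sk)
choose s* = the sense sk with the greatest score(sk)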
Example • Training corpus (context window = 3 words):
…Today the World Bank/BANK1 and partners are calling for greater relief…
…Welcome to the Bank/BANK1 of America the nation's leading financial institution…
…Welcome to America's Job Bank/BANK1 Visit our site and…
…Web site of the European Central Bank/BANK1 located in Frankfurt…
…The Asian Development Bank/BANK1 ADB a multilateral development finance…
…lounging against verdant banks/BANK2 carving out the...
…for swimming, had warned her off the banks/BANK2 of the Potomac. Nobody...
• Training:
• P(the|BANK1) = 5/30 P(the|BANK2) = 3/12
• P(world|BANK1) = 1/30 P(world|BANK2) = 0/12
• P(and|BANK1) = 1/30 P(and|BANK2) = 0/12
• …
• P(off|BANK1) = 0/30 P(off|BANK2) = 1/12
• P(Potomac|BANK1) = 0/30 P(Potomac|BANK2) = 1/12
• P(BANK1) = 5/7 P(BANK2) = 2/7
• Disambiguation: “I lost my left shoe on the banks of the river Nile.”
• Score(BANK1)=log(5/7) + log(P(shoe|BANK1))+log(P(on|BANK1))+log(P(the|BANK1)) …
• Score(BANK2)=log(2/7) + log(P(shoe|BANK2))+log(P(on|BANK2))+log(P(the|BANK2)) …
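The example can be run end to end with a short Python sketch. Several of the raw estimates above are zero (for instance P(off|BANK1) = 0/30), so their logarithms are undefined. The sketch therefore uses add-one smoothing with one extra slot for unseen words. That smoothing choice is an assumption made here for illustration (the slides only say "perhaps with appropriate smoothing"), and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Context windows (3 words on each side of the target) copied from the
# example and lowercased; the target word itself is left out.
TRAINING = [
    ("today the world and partners are", "BANK1"),
    ("welcome to the of america the", "BANK1"),
    ("to america's job visit our site", "BANK1"),
    ("the european central located in frankfurt", "BANK1"),
    ("the asian development adb a multilateral", "BANK1"),
    ("lounging against verdant carving out the", "BANK2"),
    ("her off the of the potomac", "BANK2"),
]

def train(examples):
    """Collect sense counts, per-sense word counts and the vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, sense in examples:
        words = text.split()
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocabulary.update(words)
    return sense_counts, word_counts, vocabulary

def score(sense, context, model, alpha=1.0):
    """log P(sense) plus the sum of smoothed log P(word | sense)."""
    sense_counts, word_counts, vocabulary = model
    total = sum(word_counts[sense].values())
    size = len(vocabulary) + 1  # assumption: one extra slot for unseen words
    result = math.log(sense_counts[sense] / sum(sense_counts.values()))
    for word in context:
        result += math.log((word_counts[sense][word] + alpha)
                           / (total + alpha * size))
    return result

def disambiguate(context, model):
    """Return the sense with the greatest score."""
    return max(model[0], key=lambda sense: score(sense, context, model))

model = train(TRAINING)
context = "shoe on the of the river".split()
best = disambiguate(context, model)
print(best)
```

With so little training data the two scores are close, and the winner depends on the smoothing parameters, so the result should be read as an illustration of the mechanics rather than as evidence about the method.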
Naïve Bayes Assumption • Independence assumption: • The features (contextual words) are conditionally independent: $P(V \mid s_k) = \prod_{v_j \in V} P(v_j \mid s_k)$ • Probability of an entire feature vector given a sense, is the product of the probabilities of its individual features given that sense • Consequences: • Bag of words model: • the structure and linear ordering of words within the context is ignored. • The presence of one word in the bag is independent of another. • The independence assumption is incorrect but is useful in WSD • (Gale, Church & Yarowsky, 1992) report 90% correct disambiguation with 6 ambiguous nouns in the Hansard
Approaches to Statistical WSD • --> Supervised Disambiguation • Naïve Bayes • --> Decision Trees • Use of Lexical Resources • Dictionary-based • Thesaurus-based • Translation-based • Discourse properties • Unsupervised Disambiguation
Decision Tree Classifier • Bayes Classifier uses information from all words in the context window • But some words are more reliable than others to indicate which sense is used…
Decision Tree Classifier (con’t) • Look for features that are very good indicators of the result • Place these features (as questions) in nodes of a decision tree • Split the examples so that those with different values for the chosen feature are in a different set • Repeat the same process with another feature • A sequence of tests is applied to each feature vector • if test succeeds --> return the sense associated with the test • otherwise --> apply the next test • if all features have been tested, then return a default sense (most common one)
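The sequence of tests described above can be sketched as an ordered list that is scanned until one test succeeds. The test words and sense labels below are hypothetical and chosen only for illustration; they are not taken from the slides or from any trained model.

```python
def classify(context_words, tests, default_sense):
    """Apply the tests in order and return the sense of the first that succeeds."""
    for feature_word, sense in tests:
        if feature_word in context_words:  # the test: is this word in the context?
            return sense
    return default_sense  # every test failed: fall back to the most common sense

# Hypothetical tests for "bass", ordered from most to least reliable.
TESTS = [("fish", "FISH"), ("guitar", "MUSIC"), ("player", "MUSIC")]

music = classify(["guitar", "and", "player", "stand"], TESTS, default_sense="FISH")
fallback = classify(["the", "big", "one"], TESTS, default_sense="FISH")
print(music, fallback)
```

The order of the tests matters: a reliable indicator placed early decides the sense before weaker indicators are ever consulted.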
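Example: bass • [Figure: decision tree for “bass”; only its yes/no branch labels survived extraction]
Another Example: The restaurant • Training data: [Table of training examples with input attributes and an output column; contents not recoverable]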
A first decision tree • But is it the best decision tree we can build?
A better decision tree • 4 tests instead of 9 & 11 branches instead of 21
Choosing the best feature • The key problem is choosing which feature to use to split a given set of examples • Most used strategy: information theory • Entropy (or self-information): $H(X) = -\sum_x p(x) \log_2 p(x)$
Choosing the best feature (con't) • The "discriminating power" of an attribute A given a set S • if the training set contains: • p positive examples and • n negative examples
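A minimal sketch of how the discriminating power of an attribute is usually computed: information gain is the entropy of the labels before the split minus the weighted entropy of the subsets after the split. This is the standard textbook definition and is assumed here, since the slide's own formula did not survive extraction. The toy data set and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    # Counter only holds labels that occur, so the 0 log 0 case never arises.
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())

def information_gain(examples, attribute, label_key="label"):
    """Entropy before the split minus weighted entropy after it."""
    labels = [example[label_key] for example in examples]
    groups = defaultdict(list)
    for example in examples:
        groups[example[attribute]].append(example[label_key])
    remainder = sum(len(group) / len(examples) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: color predicts the label perfectly, size not at all.
EXAMPLES = [
    {"color": "red", "size": "big", "label": "+"},
    {"color": "red", "size": "small", "label": "+"},
    {"color": "blue", "size": "big", "label": "-"},
    {"color": "blue", "size": "small", "label": "-"},
]

gain_color = information_gain(EXAMPLES, "color")
gain_size = information_gain(EXAMPLES, "size")
print(gain_color, gain_size)
```

The attribute with the highest gain becomes the root of the tree, and the same computation is repeated on each subset to build the subtrees.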
Some intuition • Size is the least discriminating attribute (i.e. smallest information gain) • Shape and color are the most discriminating attributes (i.e. highest information gain)
A small example • So first separate according to either color or shape (root of the tree) • Note: by definition 0log0 is 0
The restaurant example • With the data on p.27, we have: • So root of the tree should be attribute Patrons (we gain more information) • do recursively for subtrees
Back to WSD • Need to translate the French word: “Prendre” • can be seen as WSD • possible translations/senses={take, make, rise, speak}
Back to WSD (con't) • (Brown et al., 1991) found: • On Canadian Hansard
Training Set • With supervised methods, we need a large sense-tagged training set… where do you get it from? • Using a "real" training set • Main standard hand sense-tagged corpora: • SEMCOR corpus • portion of the Brown corpus • tagged with WordNet senses • SENSEVAL corpus (www.senseval.org/) • Standard WSD “competition” like MUC, TREC & DUC • Open Mind Word Expert (OMWE) • Using pseudowords: • Artificial ambiguous words created by conflating two or more words. • Ex: occurrences of “banana” and “door” can be replaced by “banana-door” • The disambiguation algorithm can now be tested on this data to disambiguate the pseudoword “banana-door” into either “banana” or “door”
Problems… • With supervised (or unsupervised) methods: • need a large amount of work to create a classifier for each ambiguous word! • So most work based on these techniques reports results for only a few words (2 to 12 words) • Scaling up these approaches to deal with all ambiguous words is an immense amount of work! • Solution: • use lexical resources (ex: machine-readable dictionaries) • use distributional properties to improve disambiguation: • Ambiguous words tend to be used in only one sense in any given discourse and with any given collocate.
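Pseudoword data is cheap to generate because the original word serves as the gold label. The sketch below shows one possible way to do it; the function name `make_pseudoword_corpus` and the sample sentences are hypothetical.

```python
import re

def make_pseudoword_corpus(sentences, words):
    """Replace each occurrence of any of `words` by their conflation.

    Returns (conflated_sentence, original_word) pairs; the original word
    is the "sense" that a disambiguation algorithm must recover.
    """
    pseudoword = "-".join(words)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    corpus = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            conflated = (sentence[:match.start()] + pseudoword
                         + sentence[match.end():])
            corpus.append((conflated, match.group(1)))
    return corpus

corpus = make_pseudoword_corpus(
    ["She ate a banana.", "Close the door."],
    ["banana", "door"],
)
for conflated, label in corpus:
    print(conflated, "->", label)
```

Accuracy on such data is easy to measure, but pseudowords conflate unrelated words, so they may be easier to separate than the closely related senses of a real ambiguous word.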
Approaches to Statistical WSD • Supervised Disambiguation • Naïve Bayes • Decision-tree • -->Use of Lexical Resources • --> Dictionary-based • Thesaurus-based • Translation-based • Discourse properties • Unsupervised Disambiguation
WSD based on sense definitions • (Lesk, 1986) • A word’s dictionary definitions are likely to be good indicators for the sense they define. • Method: • Express the dictionary definitions of the ambiguous word as sets of bag-of-words • Express the context of the ambiguous word as a single bag-of-words from the dictionary definitions of the context words. • Choose the definition of the ambiguous word that has the greatest overlap with the words occurring in its context.
Example • "Cone" in dictionary: • DEF-1: “solid body which narrows to a point” • BAG = {body, narrows, point, solid} • DEF-2: “something of this shape whether solid or hollow” • BAG = {hollow, shape, something, solid} • DEF-3: “fruit of certain evergreen tree” • BAG = {evergreen, fruit, tree} • To disambiguate "cone" in "pine cone" • "Pine" in dictionary • DEF-1: “kind of evergreen tree” • DEF-2: “waste away through sorrow or illness” • --> BAG = {evergreen, illness, kind, sorrow, tree, waste} • so "cone" is: • score(DEF-1) = | {body, narrows, point, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste} | = 0 • score(DEF-2) = | {hollow, shape, something, solid} ∩ {evergreen, illness, kind, sorrow, tree, waste} | = 0 • score(DEF-3) = | {evergreen, fruit, tree} ∩ {evergreen, illness, kind, sorrow, tree, waste} | = 2 • Max overlap: DEF-3
The algorithm
for all senses sk of word w
    score(sk) = overlap(
        the words in the dictionary definition of sense sk,
        the union of the words in the dictionary definitions of all words in the context window
    )
pick the sense s* with the highest score(sk)
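The "pine cone" example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the stop-word list is hypothetical and was chosen only so that the bags match the ones shown in the example, and the function names are illustrative.

```python
# Assumed stop-word list, picked so the bags match the slide's example.
STOPWORDS = {"a", "of", "or", "to", "this", "which", "whether",
             "certain", "away", "through"}

def bag(definition):
    """Turn a dictionary definition into a set of content words."""
    return {word for word in definition.lower().split()
            if word not in STOPWORDS}

def lesk(target_definitions, context_definitions):
    """Score each sense by its overlap with the context words' definitions."""
    context_bag = set()
    for definition in context_definitions:
        context_bag |= bag(definition)
    scores = {sense: len(bag(definition) & context_bag)
              for sense, definition in target_definitions.items()}
    return max(scores, key=scores.get), scores

CONE = {
    "DEF-1": "solid body which narrows to a point",
    "DEF-2": "something of this shape whether solid or hollow",
    "DEF-3": "fruit of certain evergreen tree",
}
PINE = [
    "kind of evergreen tree",
    "waste away through sorrow or illness",
]

best, scores = lesk(CONE, PINE)
print(best, scores)
```

Because the score is a raw overlap count, short definitions often produce ties at zero, which is the weakness discussed in the analysis that follows.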
Analysis • Accuracies of 50-70% on short samples of texts • Problem: • dictionary entries for the target words are usually relatively short • and may not provide sufficient material to create adequate classifiers • Because the words in the context and their definitions must have direct overlap • One solution: • expand the feature list with words whose definitions make use of the target word • Example: • if “deposit” does not occur in the definition of “bank” • but “bank” occurs in the definition of “deposit” • We can expand the classifier for “bank” to include “deposit” as a relevant feature • However: • just knowing that “deposit” is related to “bank” does not help much • if we do not know which sense of “bank” it is related to • --> To make use of “deposit” as a feature, we have to know which sense of “bank” was being used in the definition • Solution: • Use a thesaurus…
Approaches to Statistical WSD • Supervised Disambiguation • Naïve Bayes • Decision-tree • -->Use of Lexical Resources • Dictionary-based • --> Thesaurus-based • Translation-based • Discourse properties • Unsupervised Disambiguation
Thesaurus-Based Disambiguation • Thesauri include tags (subject codes) in their entries that correspond to broad semantic categories • Each word is assigned one or more subject codes which correspond to its different meanings • ANIMAL/INSECT (category 414) • TOOLS/MACHINERY (category 348) • The semantic categories of the words in a context determine the semantic category of the whole context • This category determines which word senses are used • For each subject code, count the number of words in the context that have the same subject code • Select the subject code that has the highest count • Accuracy ~50% (but with difficult and highly ambiguous words)
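The counting step described above can be sketched as follows. The tiny lexicon is hypothetical: the two category names come from the slide, but the assignment of words to them is invented here for illustration and is not taken from Roget's thesaurus.

```python
from collections import Counter

def best_subject_code(context_words, subject_codes):
    """Return the subject code shared by the most context words, or None."""
    counts = Counter()
    for word in context_words:
        for code in subject_codes.get(word, ()):
            counts[code] += 1
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical lexicon mapping words to their subject codes.
SUBJECT_CODES = {
    "crane": ["ANIMAL/INSECT", "TOOLS/MACHINERY"],
    "wings": ["ANIMAL/INSECT"],
    "nest": ["ANIMAL/INSECT"],
    "lift": ["TOOLS/MACHINERY"],
}

context = "the crane spread its wings over the nest".split()
code = best_subject_code(context, SUBJECT_CODES)
print(code)
```

The selected code then picks the sense of the ambiguous word that carries the same code. Words missing from the lexicon contribute nothing, which is one reason coverage limits this method.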
Some Results • Roget categories