Knowledge-based Methods for Word Sense Disambiguation • From a tutorial at AAAI by Ted Pedersen and Rada Mihalcea [edited by J. Wiebe]
Our last topic • NLP at a more fine-grained level; so far, we’ve only worked with document-level classification • The question of polysemy (more than one meaning of a term) came up in the last topic; word sense disambiguation addresses that problem • Includes various measures of semantic similarity, which can be used for clustering, search, paraphrase recognition, etc. • Introduces you to resources you can use if you ever work with text • Note: Ted Pedersen’s group created WordNet::Similarity: http://wn-similarity.sourceforge.net/ (very useful!)
Definitions • Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. • The sense inventory usually comes from a dictionary or thesaurus. • Approaches: knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping • Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. • Approaches: unsupervised techniques
Computers versus Humans • Polysemy – most words have many possible meanings. • A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human… • Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases…
Ambiguity for Humans - Newspaper Headlines! • DRUNK GETS NINE YEARS IN VIOLIN CASE • FARMER BILL DIES IN HOUSE • PROSTITUTES APPEAL TO POPE • STOLEN PAINTING FOUND BY TREE • RED TAPE HOLDS UP NEW BRIDGE • RESIDENTS CAN DROP OFF TREES • INCLUDE CHILDREN WHEN BAKING COOKIES • MINERS REFUSE TO WORK AFTER DEATH • [mixtures of part of speech, word sense, and syntactic ambiguities]
Ambiguity for a Computer • The fisherman jumped off the bank and into the water. • The bank down the street was robbed! • Back in the day, we had an entire bank of computers devoted to this problem. • The bank in that road is entirely too steep and is really dangerous. • The plane took a bank to the left, and then headed off towards the mountains.
Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods
Task Definition • Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text • Resources used: • Machine Readable Dictionaries • Raw corpora • Resources not used: • Manually annotated corpora • Though combinations of these techniques with machine learning techniques are possible, of course
Machine Readable Dictionaries • In recent years, most dictionaries have been made available in Machine Readable Dictionary (MRD) format • Oxford English Dictionary • Collins • Longman Dictionary of Contemporary English (LDOCE) • Thesauruses – add synonymy information • Roget’s Thesaurus • Semantic networks – add more semantic relations • WordNet • EuroWordNet
MRD – A Resource for Knowledge-based WSD • For each word in the language vocabulary, an MRD provides: • A list of meanings • Definitions (for all word meanings) • Typical usage examples (for most word meanings) WordNet definitions/examples for the noun “plant” • buildings for carrying on industrial labor; “they built a large plant to manufacture automobiles” • a living organism lacking the power of locomotion • something planted secretly for discovery by another; “the police used a plant to trick the thieves”; “he claimed that the evidence against him was a plant” • an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
MRD – A Resource for Knowledge-based WSD • A thesaurus adds: • An explicit synonymy relation between word meanings • A semantic network adds: • Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc. WordNet synsets for the noun “plant” 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life” {plant, flora, plant life} • hypernym: {organism, being} • hyponym: {house plant}, {fungus}, … • meronym: {plant tissue}, {plant part} • holonym: {Plantae, kingdom Plantae, plant kingdom}
Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods
Lesk Algorithm • (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: • Retrieve from the MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose the senses that lead to the highest overlap Example: disambiguate PINE CONE • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees Overlap counts: Pine#1/Cone#1 = 0; Pine#1/Cone#2 = 1; Pine#1/Cone#3 = 2; Pine#2/Cone#1 = 0; Pine#2/Cone#2 = 0; Pine#2/Cone#3 = 0
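The pairwise overlap computation can be sketched in a few lines of Python. This is a minimal illustration, not Lesk's original implementation: the definitions are copied from the slide, while the stopword list and the crude plural stripping are assumptions made here. Because function words such as "of" are filtered out, Pine#1/Cone#2 scores 0 in this sketch rather than the 1 shown on the slide; the winning pair is unchanged.

```python
from itertools import product

# Definitions copied from the slide. The stopword list and the crude
# plural stripping below are illustrative assumptions.
SENSES = {
    "pine": [
        "kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness",
    ],
    "cone": [
        "solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees",
    ],
}
STOPWORDS = {"a", "of", "or", "to", "with", "this", "which", "through", "whether"}


def content_words(definition):
    """Lowercase, drop stopwords, and strip a trailing plural 's'."""
    words = set()
    for token in definition.lower().split():
        if token in STOPWORDS:
            continue
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        words.add(token)
    return words


def overlap(def_a, def_b):
    """Number of content words shared by two definitions."""
    return len(content_words(def_a) & content_words(def_b))


def lesk_pair(word_a, word_b):
    """Return ((sense_a, sense_b), score) using 1-based sense numbers."""
    scores = {}
    pairs = product(enumerate(SENSES[word_a], 1), enumerate(SENSES[word_b], 1))
    for (i, def_a), (j, def_b) in pairs:
        scores[(i, j)] = overlap(def_a, def_b)
    best = max(scores, key=scores.get)
    return best, scores[best]


print(lesk_pair("pine", "cone"))
```

The shared words for the winning pair are "evergreen" and "tree" (after stripping the plural of "trees").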
Lesk Algorithm for More than Two Words? • I saw a man who is 98 years old and can still walk and tell jokes • Nine open-class words (number of senses in parentheses): see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) • 43,929,600 sense combinations! How to find the optimal sense combination? • Simulated annealing (Cowie, Guthrie, Guthrie 1992) • Define a function E of the combination of word senses in a given text • Find the combination of senses that leads to the highest definition overlap (redundancy) 1. Start with the configuration that assigns each word its most frequent sense 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E 3. Stop iterating when there is no change in the configuration of senses
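As a rough sketch of why exhaustive search is impractical and how a randomized search proceeds, the Python below first multiplies out the sense counts from the slide and then runs a generic simulated-annealing loop. The `score` function, the temperature schedule, and the fixed step budget are assumptions made for illustration; the slide only says to stop when the configuration no longer changes, and the details of Cowie et al.'s procedure are not reproduced here.

```python
import math
import random

# Sense counts as listed on the slide.
SENSE_COUNTS = {"see": 26, "man": 11, "year": 4, "old": 8, "can": 5,
                "still": 4, "walk": 10, "tell": 8, "joke": 3}
N_COMBINATIONS = math.prod(SENSE_COUNTS.values())  # size of the search space


def anneal(n_senses, score, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Search for a high-scoring sense assignment.

    n_senses: list with the number of senses of each word.
    score:    hypothetical function mapping a tuple of sense indices to a
              number (higher = more definition overlap).
    """
    rng = random.Random(seed)
    current = tuple(0 for _ in n_senses)  # index 0 = most frequent sense
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        pos = rng.randrange(len(n_senses))
        if n_senses[pos] < 2:
            continue  # a monosemous word has no alternative sense
        new_sense = rng.choice(
            [s for s in range(n_senses[pos]) if s != current[pos]])
        candidate = current[:pos] + (new_sense,) + current[pos + 1:]
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
        temp = max(temp * cooling, 1e-6)
    return best, best_score


print(f"{N_COMBINATIONS:,} sense combinations")
```

In a real system `score` would sum the definition overlaps among the chosen senses; here it is left as a parameter so the search loop stays independent of any particular dictionary.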
Lesk Algorithm: A Simplified Version • Original Lesk definition: measure overlap between sense definitions for all words in context • Identify simultaneously the correct senses for all words in context • Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between the sense definitions of a word and the current context • Identify the correct sense for one word at a time • Search space significantly reduced
Lesk Algorithm: A Simplified Version • Algorithm for simplified Lesk: • Retrieve from the MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense that leads to the highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness Overlap counts: Pine#1/Sentence = 1; Pine#2/Sentence = 0 [Actually, would a WSD system be choosing between these? Typically, no: they are different parts of speech. While POS taggers do make mistakes, they make fewer than WSD systems. Combined with an ML approach, one could assign the best overall interpretation, considering both POS and sense.]
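The simplified variant compares each definition with the sentence itself, so it needs only one pass over the senses of the target word. A minimal Python sketch follows; the definitions and the sentence come from the slide, and the small stopword list is an assumption made here.

```python
# Definitions and context sentence are from the slide; the stopword
# list is an illustrative assumption.
PINE_SENSES = [
    "kinds of evergreen tree with needle-shaped leaves",
    "waste away through sorrow or illness",
]
STOPWORDS = {"a", "in", "of", "or", "with", "through"}


def tokens(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(sense_definitions, context):
    """Return (best 1-based sense number, list of overlap counts).

    Ties go to the lowest-numbered sense, because max() keeps the first
    maximum it encounters.
    """
    context_words = tokens(context)
    overlaps = [len(tokens(d) & context_words) for d in sense_definitions]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return best + 1, overlaps


print(simplified_lesk(PINE_SENSES, "Pine cones hanging in a tree"))
```

The only shared word is "tree", which gives sense 1 an overlap of 1 and sense 2 an overlap of 0, matching the counts on the slide.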
Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Preferences • Measures of Semantic Similarity • Heuristic-based Methods
Selectional Preferences • A way to constrain the possible meanings of words in a given context • E.g. “Wash a dish” vs. “Cook a dish” • WASH-OBJECT vs. COOK-FOOD • Capture information about possible relations between semantic classes • Common sense knowledge • Alternative terminology • Selectional Restrictions • Selectional Preferences • Selectional Constraints
Acquiring Selectional Preferences • From annotated corpora • But sense-annotated data are not plentiful • From raw corpora • Frequency counts • Information theory measures • Class-to-class relations
Preliminaries: Learning Word-to-Word Relations • An indication of the semantic fit between two words 1. Frequency counts • Pairs of words connected by a syntactic relation 2. Conditional probabilities • Condition on one of the words
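Both word-to-word measures reduce to counting pairs. The sketch below uses a handful of made-up verb-object pairs (an assumption standing in for parser output over a raw corpus) and conditions on the verb.

```python
from collections import Counter

# Hypothetical verb-object pairs, standing in for what a parser might
# extract from raw text. They are not data from the tutorial.
PAIRS = [("drink", "coffee"), ("drink", "tea"), ("drink", "coffee"),
         ("drink", "water"), ("eat", "chicken"), ("eat", "bread")]

pair_counts = Counter(PAIRS)                        # 1. frequency counts
verb_counts = Counter(verb for verb, _ in PAIRS)


def p_object_given_verb(obj, verb):
    """2. Conditional probability: Count(verb, object) / Count(verb)."""
    if verb_counts[verb] == 0:
        return 0.0
    return pair_counts[(verb, obj)] / verb_counts[verb]


print(pair_counts[("drink", "coffee")], p_object_given_verb("coffee", "drink"))
```

Conditioning on the object instead of the verb only requires swapping which element of the pair is counted in the denominator.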
From Resnik 1993 • The alternative view of selectional constraints I am proposing can be phrased as follows: rather than restrictions or hard constraints on applicability, a predicate preferentially associates with certain kinds of arguments, and these preferences constitute the effect that the predicate has on what appears in an argument position. For example, the predicate blue does not restrict itself to arguments having a tangible surface — the sky is blue, and so is ocean water even deep below any apparent surface — but its arguments are still far from arbitrary. The effect of the predicate is that its arguments tend to be physical entities and to have surfaces. Similarly, the verb admire, interpreted in the particular sense “to have a high opinion of,” has an effect on what appears as its subject; these tend to be physical, animate, human, capable of the higher psychological functions, and so forth. In some cases the effect a predicate has on its argument is quite strong: one is unlikely to find the (numerical) predicate even applied to anything but positive integers. In other cases — e.g. the predicate smooth — the effect is less dramatic.
Bringing in Information Theory • Entropy – how uncertain the outcome is (on average) • “The cook basted the ___” (which noun?): the entropy is low, since the word is likely to be one of a small set of words, such as “turkey” or “roast”. • But the entropy is much higher for “The cook enjoyed the ___”, since a much wider range of words is likely (the opera, the company of the butler, a certain book, a particular food, …)
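To make the contrast concrete, the sketch below computes Shannon entropy for two toy distributions over the missing noun. The probabilities are invented for illustration (two equally likely nouns after "basted", four after "enjoyed"); only the direction of the comparison reflects the slide.

```python
import math


def entropy(distribution):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Invented probabilities, for illustration only.
after_basted = {"turkey": 0.5, "roast": 0.5}
after_enjoyed = {"opera": 0.25, "company": 0.25, "book": 0.25, "food": 0.25}

print(entropy(after_basted), entropy(after_enjoyed))
```

With uniform distributions the entropy is simply the base-2 logarithm of the number of candidates, so doubling the candidate set adds one bit.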
Learning Selectional Preferences • Word-to-class relations (Resnik 1993) • Quantify the contribution of a semantic class using all the concepts subsumed by that class
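For reference, Resnik's selectional association between a word W1 and a semantic class C2 under a syntactic relation R is commonly written as follows. This is a reconstruction from the general literature rather than the slide's own rendering, and the symbol names (W1, C2, R, and the Count function) are assumptions chosen to match the notation used elsewhere in these slides.

```latex
A(W_1, C_2, R) =
  \frac{P(C_2 \mid W_1, R)\,\log \dfrac{P(C_2 \mid W_1, R)}{P(C_2)}}
       {\sum_{C} P(C \mid W_1, R)\,\log \dfrac{P(C \mid W_1, R)}{P(C)}},
\qquad
P(C_2 \mid W_1, R) = \frac{\mathrm{Count}(W_1, C_2, R)}{\mathrm{Count}(W_1, R)}
```

The denominator is the overall selectional preference strength of W1, so the association scores of all classes for a given word sum to 1.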
Learning Selectional Preferences • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses: 50% of occurrences are counted as sense 1, 50% as sense 2 • That is, when you count co-occurrences in a corpus, count a word with 3 senses as 1/3 for each sense, and a word with 5 senses as 1/5 • Example: learning restrictions for the verb “to drink” • Find high-scoring verb-object pairs • Find “prototypical” object classes (high association score) • These classes are WordNet synsets, i.e., lists of words together with a sense; they are hypernyms of the object words in the high-scoring pairs. [Look these up in WordNet in class.]
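The equal-sense-distribution assumption amounts to fractional counting. In the sketch below, the sense inventory and the list of observed objects of "drink" are both made up for illustration; a real system would read the classes from WordNet and also credit each sense's hypernyms.

```python
from collections import defaultdict

# Hypothetical inventory: each noun maps to the classes of its senses.
NOUN_CLASSES = {
    "coffee": ["beverage", "tree", "color"],
    "tea": ["beverage", "plant"],
    "wine": ["beverage", "color"],
}
# Hypothetical objects observed with the verb "drink" in a raw corpus.
OBJECTS_OF_DRINK = ["coffee", "coffee", "tea", "wine"]


def class_counts(observed_nouns, inventory):
    """Credit each class with 1/n per occurrence of a noun with n senses."""
    counts = defaultdict(float)
    for noun in observed_nouns:
        classes = inventory[noun]
        for cls in classes:
            counts[cls] += 1.0 / len(classes)
    return dict(counts)


counts = class_counts(OBJECTS_OF_DRINK, NOUN_CLASSES)
print(max(counts, key=counts.get), counts)
```

Even though no single occurrence is disambiguated, the class shared by all the objects accumulates the most mass, which is what lets a "prototypical" object class emerge from unannotated text.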
Learning Selectional Preferences (3) • Other algorithms • Learn class-to-class relations (Agirre and Martinez, 2002) • E.g.: “ingest food” is a class-to-class relation for “eat chicken” • Bayesian networks (Ciaramita and Johnson, 2000) • Tree cut model (Li and Abe, 1998)
Using Selectional Preferences for WSD Algorithm: 1. Learn a large set of selectional preferences for a given syntactic relation R 2. Given a pair of words W1–W2 connected by a relation R 3. Find all selectional preferences W1–C (word-to-class) or C1–C2 (class-to-class) that apply 4. Select the meanings of W1 and W2 based on the selected semantic class • Example: disambiguate “coffee” in “drink coffee” 1. (beverage) a beverage consisting of an infusion of ground coffee beans 2. (tree) any of several small trees native to the tropical Old World 3. (color) a medium to dark brown color • Given the selectional preference “DRINK BEVERAGE”, the selected sense is coffee#1
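Steps 2 to 4 can be sketched as a table lookup. The sense list is taken from the slide; the preference table, its weight, and the relation label "object" are hypothetical stand-ins for preferences learned in step 1.

```python
# Senses of "coffee" from the slide, as (semantic class, gloss) pairs.
COFFEE_SENSES = [
    ("beverage", "a beverage consisting of an infusion of ground coffee beans"),
    ("tree", "any of several small trees native to the tropical Old World"),
    ("color", "a medium to dark brown color"),
]

# Hypothetical learned word-to-class preferences, keyed by (word, relation).
# The weight 1.0 is a placeholder, not a value from the tutorial.
PREFERENCES = {("drink", "object"): {"beverage": 1.0}}


def disambiguate(head_word, relation, senses):
    """Return the 1-based sense whose class is most preferred, or None."""
    prefs = PREFERENCES.get((head_word, relation), {})
    scores = [prefs.get(cls, 0.0) for cls, _ in senses]
    best = max(range(len(senses)), key=scores.__getitem__)
    return best + 1 if scores[best] > 0 else None


print(disambiguate("drink", "object", COFFEE_SENSES))
```

Returning `None` when no preference applies leaves room for a fallback such as the most frequent sense.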