1. Word sense disambiguation (1)
Instructor: Rada Mihalcea
Note: Some of the material in this slide set was adapted from a tutorial given by Rada Mihalcea & Ted Pedersen at ACL 2005
2. Definitions Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
Sense Inventory usually comes from a dictionary or thesaurus.
Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches
Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
Unsupervised techniques
3. Computers versus Humans Polysemy: most words have many possible meanings.
A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human
Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases
4. Ambiguity for Humans - Newspaper Headlines! DRUNK GETS NINE YEARS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
PROSTITUTES APPEAL TO POPE
STOLEN PAINTING FOUND BY TREE
RED TAPE HOLDS UP NEW BRIDGE
DEER KILL 300,000
RESIDENTS CAN DROP OFF TREES
INCLUDE CHILDREN WHEN BAKING COOKIES
MINERS REFUSE TO WORK AFTER DEATH
5. Ambiguity for a Computer The fisherman jumped off the bank and into the water.
The bank down the street was robbed!
Back in the day, we had an entire bank of computers devoted to this problem.
The bank in that road is entirely too steep and is really dangerous.
The plane took a bank to the left, and then headed off towards the mountains.
6. Early Days of WSD Noted as a problem for Machine Translation (Weaver, 1949)
A word can often only be translated if you know the specific sense intended (A bill in English could be a pico or a cuenta in Spanish)
Bar-Hillel (1960) posed the following:
Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
Is pen a writing instrument or an enclosure where children play?
Bar-Hillel declared the problem unsolvable and left the field of MT!
7. Since then
1970s - 1980s
Rule based systems
Rely on hand crafted knowledge sources
1990s
Corpus based approaches
Dependence on sense tagged text
(Ide and Véronis, 1998) survey the history of WSD from the early days up to 1998
2000s
Hybrid Systems
Minimizing or eliminating use of sense tagged text
Taking advantage of the Web
8. Practical Applications Machine Translation
Translate bill from English to Spanish
Is it a pico or a cuenta?
Is it a bird's beak or an invoice?
Information Retrieval
Find all Web Pages about cricket
The sport or the insect?
Question Answering
What is George Miller's position on gun control?
The psychologist or US congressman?
Knowledge Acquisition
Add to KB: Herb Bergson is the mayor of Duluth.
Minnesota or Georgia?
9. Knowledge-based WSD Task definition
Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
Resources
Used:
Machine Readable Dictionaries
Raw corpora
Not used:
Manually annotated corpora
Scope
All open-class words
10. Machine Readable Dictionaries In recent years, most dictionaries have been made available in Machine Readable format (MRD)
Oxford English Dictionary
Collins
Longman Dictionary of Contemporary English (LDOCE)
Thesauruses add synonymy information
Roget's Thesaurus
Semantic networks add more semantic relations
WordNet
EuroWordNet
11. MRD: A Resource for Knowledge-based WSD For each word in the language vocabulary, an MRD provides:
A list of meanings
Definitions (for all word meanings)
Typical usage examples (for most word meanings)
12. MRD: A Resource for Knowledge-based WSD A thesaurus adds:
An explicit synonymy relation between word meanings
A semantic network adds:
Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc.
13. Lesk Algorithm (Michael Lesk 1986): Identify senses of words in context using definition overlap
Algorithm:
Retrieve from MRD all sense definitions of the words to be disambiguated
Determine the definition overlap for all possible sense combinations
Choose senses that lead to highest overlap
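A minimal Python sketch of this procedure for a two-word context, assuming NLTK's WordNet glosses stand in for the MRD definitions (Lesk's original experiments used a dictionary such as the Oxford Advanced Learner's Dictionary):

```python
import re
from itertools import product
from nltk.corpus import wordnet as wn

def tokens(text):
    # Crude bag-of-words tokenization of a definition.
    return set(re.findall(r"[a-z]+", text.lower()))

def lesk_pair(word1, word2):
    # Score every sense combination and keep the pair with the highest gloss overlap.
    return max(
        product(wn.synsets(word1), wn.synsets(word2)),
        key=lambda pair: len(tokens(pair[0].definition()) & tokens(pair[1].definition())),
    )

# Lesk's classic example: which senses of "pine" and "cone" go together?
print(lesk_pair("pine", "cone"))
```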
14. Lesk Algorithm for More than Two Words? I saw a man who is 98 years old and can still walk and tell jokes
nine open-class words, with the number of senses in parentheses: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations! How to find the optimal sense combination?
Simulated annealing (Cowie, Guthrie, Guthrie 1992)
Define an energy function E over a configuration (combination) of word senses in a given text.
Find the combination of senses that leads to highest definition overlap (redundancy)
1. Start with E = the most frequent sense for each word
2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
3. Stop iterating when there is no change in the configuration of senses
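A rough sketch of this search, assuming the same gloss-overlap energy; for brevity it accepts only non-worsening moves (greedy hill climbing, with no annealing temperature schedule) and runs a fixed number of iterations rather than testing for convergence:

```python
import random
import re
from nltk.corpus import wordnet as wn

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def energy(senses):
    # Total pairwise gloss overlap of a sense assignment (higher = more redundancy).
    glosses = [tokens(s.definition()) for s in senses]
    return sum(len(glosses[i] & glosses[j])
               for i in range(len(glosses))
               for j in range(i + 1, len(glosses)))

def search(words, iterations=2000):
    candidates = [wn.synsets(w) for w in words]
    candidates = [c for c in candidates if c]       # skip words not in WordNet
    current = [c[0] for c in candidates]            # step 1: most frequent senses
    best = energy(current)
    for _ in range(iterations):
        i = random.randrange(len(current))          # step 2: pick a random word...
        proposal = list(current)
        proposal[i] = random.choice(candidates[i])  # ...and a random sense for it
        e = energy(proposal)
        if e >= best:                               # keep the move if E does not drop
            current, best = proposal, e
    return current

for s in search("see man year old can still walk tell joke".split()):
    print(s.name(), "-", s.definition())
```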
15. Lesk Algorithm: A Simplified Version Original Lesk definition: measure overlap between sense definitions for all words in context
Identify simultaneously the correct senses for all words in context
Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between the sense definitions of a word and its current context
Identify the correct sense for one word at a time
Search space significantly reduced
16. Lesk Algorithm: A Simplified Version
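A minimal sketch of the simplified variant, again assuming WordNet glosses as the sense inventory; each word is disambiguated independently against its surrounding context (the small stopword list is illustrative):

```python
import re
from nltk.corpus import wordnet as wn

STOP = {"the", "a", "an", "and", "of", "in", "into", "to", "that", "is", "or"}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower())) - STOP

def simplified_lesk(word, context):
    # Pick the sense of `word` whose gloss shares the most words with the context.
    context_words = tokens(context)
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        score = len(tokens(sense.definition()) & context_words)
        if score > best_overlap:
            best_sense, best_overlap = sense, score
    return best_sense

sense = simplified_lesk("bank", "the fisherman jumped off the bank and into the water")
print(sense.name(), "-", sense.definition())
```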
17. Evaluations of Lesk Algorithm Initial evaluation by M. Lesk
50-70% accuracy on short samples of manually annotated text, with respect to the Oxford Advanced Learner's Dictionary
Simulated annealing
47% on 50 manually annotated sentences
Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004)
Original Lesk: 35%
Simplified Lesk: 47%
Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004)
Original Lesk: 42%
Simplified Lesk: 58%
18. Selectional Preferences A way to constrain the possible meanings of words in a given context
E.g. Wash a dish vs. Cook a dish
WASH-OBJECT vs. COOK-FOOD
Capture information about possible relations between semantic classes
Common sense knowledge
Alternative terminology
Selectional Restrictions
Selectional Preferences
Selectional Constraints
19. Acquiring Selectional Preferences From annotated corpora
Circular relationship with the WSD problem
Need WSD to build the annotated corpus
Need selectional preferences to derive WSD
From raw corpora
Frequency counts
Information theory measures
Class-to-class relations
20. Preliminaries: Learning Word-to-Word Relations An indication of the semantic fit between two words
1. Frequency counts
Pairs of words connected by a syntactic relation
2. Conditional probabilities
Condition on one of the words
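A toy illustration of both statistics, with a hand-made list of (verb, object) pairs standing in for pairs extracted from a parsed raw corpus:

```python
from collections import Counter

# Hypothetical (verb, direct-object) pairs, as a dependency parser might extract.
pairs = [("drink", "coffee"), ("drink", "water"), ("drink", "wine"),
         ("eat", "bread"), ("drink", "water"), ("eat", "soup")]

# 1. Frequency counts of syntactically related word pairs.
pair_counts = Counter(pairs)
verb_counts = Counter(verb for verb, _ in pairs)

# 2. Conditional probability of the object word given the verb.
def p_object_given_verb(obj, verb):
    return pair_counts[(verb, obj)] / verb_counts[verb]

print(p_object_given_verb("water", "drink"))  # 2 of 4 "drink" pairs -> 0.5
```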
21. Learning Selectional Preferences (1) Word-to-class relations (Resnik 1993)
Quantify the contribution of a semantic class c as an argument of predicate v, using all the concepts subsumed by that class:
$A(v, c) = \frac{1}{S(v)} \, P(c \mid v) \log \frac{P(c \mid v)}{P(c)}$
where the selectional preference strength $S(v) = \sum_{c} P(c \mid v) \log \frac{P(c \mid v)}{P(c)}$ measures how strongly the predicate constrains the semantic class of its argument
22. Learning Selectional Preferences (2) Determine the contribution of a word sense based on the assumption of equal sense distributions:
e.g. plant has two senses → 50% of occurrences are sense 1, 50% are sense 2
Example: learning restrictions for the verb to drink
Find high-scoring verb-object pairs
Find prototypical object classes (high association score)
23. Using Selectional Preferences for WSD Algorithm:
1. Learn a large set of selectional preferences for a given syntactic relation R
2. Given a pair of words W1–W2 connected by a relation R
3. Find all selectional preferences W1 → C (word-to-class) or C1 → C2 (class-to-class) that apply
4. Select the meanings of W1 and W2 based on the selected semantic class
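A sketch of steps 2-4 for a verb-object pair, given word-to-class preferences learned beforehand; the assoc table below is illustrative, not a real learned model:

```python
from nltk.corpus import wordnet as wn

# Hypothetical association scores A(verb, class), keyed by WordNet synset names.
assoc = {("drink", "beverage.n.01"): 2.1,
         ("drink", "liquid.n.01"): 1.4,
         ("drink", "container.n.01"): 0.1}

def best_object_sense(verb, noun):
    # Pick the noun sense whose ancestor class fits the verb best (steps 3-4).
    best, best_score = None, float("-inf")
    for sense in wn.synsets(noun, pos=wn.NOUN):
        ancestors = {h.name() for path in sense.hypernym_paths() for h in path}
        score = max((assoc.get((verb, a), 0.0) for a in ancestors), default=0.0)
        if score > best_score:
            best, best_score = sense, score
    return best

# "port" as the object of "drink" should resolve to the fortified-wine sense,
# since that sense (and not the harbor sense) is subsumed by beverage.n.01.
print(best_object_sense("drink", "port"))
```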
24. Evaluation of Selectional Preferences for WSD Data set
mainly on verb-object, subject-verb relations extracted from SemCor
Compare against random baseline
Results (Agirre and Martinez, 2000)
Average results on 8 nouns
Similar figures reported in (Resnik 1997)
25. Semantic Similarity Words in a discourse must be related in meaning for the discourse to be coherent (Halliday and Hasan, 1976)
Use this property for WSD: identify related meanings for words that share a common context
Context span:
1. Local context: semantic similarity between pairs of words
2. Global context: lexical chains
26. Semantic Similarity in a Local Context Similarity determined between pairs of concepts, or between a word and its surrounding context
Relies on similarity metrics on semantic networks
(Rada et al. 1989)
27. Semantic Similarity Metrics for WSD Disambiguate target words based on similarity with one word to the left and one word to the right
(Patwardhan, Banerjee, Pedersen 2002)
Evaluation:
1,723 ambiguous nouns from Senseval-2
Among 5 similarity metrics, the (Jiang and Conrath 1997) measure provides the best precision (39%)
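A sketch of this neighbor-based scheme; it uses WordNet path similarity for brevity, although the study above found the Jiang & Conrath measure (available in NLTK as jcn_similarity, given an information-content file) to work best:

```python
from nltk.corpus import wordnet as wn

def disambiguate(target, left, right):
    # Choose the target sense most similar to its two immediate neighbours.
    def best_sim(sense, neighbour):
        sims = [sense.path_similarity(s) or 0.0
                for s in wn.synsets(neighbour, pos=wn.NOUN)]
        return max(sims, default=0.0)
    return max(wn.synsets(target, pos=wn.NOUN),
               key=lambda s: best_sim(s, left) + best_sim(s, right))

sense = disambiguate("bass", "fishing", "boat")
print(sense.name(), "-", sense.definition())
```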
28. Semantic Similarity in a Global Context Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)
A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse
Algorithm for finding lexical chains:
Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech.
For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning.
If such a chain is found, insert the word in this chain; otherwise, create a new chain.
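A greedy sketch of this chaining procedure, using WordNet path similarity with an illustrative threshold as the relatedness measure (Hirst and St-Onge's original algorithm used typed WordNet relations rather than a numeric similarity):

```python
from nltk.corpus import wordnet as wn

THRESHOLD = 0.15  # illustrative relatedness cut-off

def build_chains(nouns):
    chains = []  # each chain is a list of synsets
    for noun in nouns:
        senses = wn.synsets(noun, pos=wn.NOUN)
        for sense in senses:
            # Relatedness to a chain = similarity to any concept already in it.
            home = next((chain for chain in chains
                         if any((sense.path_similarity(m) or 0.0) >= THRESHOLD
                                for m in chain)), None)
            if home is not None:
                home.append(sense)   # inserting the sense disambiguates the noun
                break
        else:
            if senses:               # no chain accepts any sense: start a new one
                chains.append([senses[0]])
    return chains

for chain in build_chains(["car", "wheel", "engine", "banana"]):
    print([s.name() for s in chain])
```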
29. Semantic Similarity in a Global Context
30. Lexical Chains for WSD Identify lexical chains in a text
Usually target one part of speech at a time
Identify the meaning of words based on their membership in a lexical chain
Evaluation:
(Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09%
(Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall
lexical chains anchored on monosemous words
(Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4%
31. Heuristics: Most Frequent Sense Identify the most often used meaning and use this meaning by default
Word meanings exhibit a Zipfian distribution
E.g. distribution of word senses in SemCor
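This baseline is easy to implement against WordNet, which orders each word's synsets by their frequency in the sense-tagged SemCor corpus, so the first synset serves as the most-frequent-sense default:

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    # WordNet lists senses in (approximate) SemCor frequency order.
    senses = wn.synsets(word, pos=pos)
    return senses[0] if senses else None

sense = most_frequent_sense("plant")
print(sense.name(), "-", sense.definition())  # the industrial-plant sense ranks first
```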
32. Heuristics: One Sense Per Discourse A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
What does this mean?
Evaluation:
8 words with two-way ambiguity, e.g. plant, crane, etc.
98% of pairs of occurrences of the same word within a discourse carry the same meaning
A grain of salt: performance depends on sense granularity
(Krovetz 1998) experiments with words with more than two senses
Performance of one sense per discourse measured on SemCor is approx. 70%
33. Heuristics: One Sense per Collocation A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
Strong for adjacent collocations
Weaker as the distance between words increases
Evaluation:
97% precision on words with two-way ambiguity
Finer granularity:
(Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses
70% precision on SemCor words