
Natural Language Processing

Natural Language Processing. Chapter 19 Computational Lexical Semantics Part 2 [Includes slides from an AAAI-2005 tutorial by Rada Mihalcea and Ted Pedersen]. Word Senses. The meaning of a word in a given context Word sense representations With respect to a dictionary

sinjin




Presentation Transcript


  1. Natural Language Processing Chapter 19 Computational Lexical Semantics Part 2 [Includes slides from an AAAI-2005 tutorial by Rada Mihalcea and Ted Pedersen]

  2. Word Senses • The meaning of a word in a given context • Word sense representations • With respect to a dictionary: chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"; chair = the position of professor; "he was awarded an endowed chair in economics" • With respect to the translation in a second language: chair = chaise; chair = directeur • With respect to the context where it occurs (discrimination): “Sit on a chair”; “Take a seat on this chair”; “The chair of the Math Department”; “The chair of the meeting”

  3. Approaches to Word Sense Disambiguation • Knowledge-Based Disambiguation • use of external lexical resources such as dictionaries and thesauri • discourse properties • Supervised Disambiguation • based on a labeled training set • the learning system has: • a training set of feature-encoded inputs AND • their appropriate sense label (category) • Unsupervised Disambiguation • based on unlabeled corpora • The learning system has: • a training set of feature-encoded inputs BUT • NOT their appropriate sense label (category)

  4. All Words Word Sense Disambiguation • Minimally supervised approaches • Learn to disambiguate words using small annotated corpora • E.g. SemCor – corpus where all open class words are disambiguated • 200,000 running words • Most frequent sense

  5. Targeted Word Sense Disambiguation (we saw this earlier) • Disambiguate one target word “Take a seat on this chair” “The chair of the Math Department” • WSD is viewed as a typical classification problem • use machine learning techniques to train a system • Training: • Corpus of occurrences of the target word, each occurrence annotated with appropriate sense • Build feature vectors: • a vector of relevant linguistic features that represents the context (ex: a window of words around the target word) • Disambiguation: • Disambiguate the target word in new unseen text

  6. Knowledge-based Methods for Word Sense Disambiguation

  7. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  8. Task Definition • Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text • Resources • Used: Machine Readable Dictionaries, raw corpora • Not used: manually annotated corpora • Scope • All open-class words

  9. Machine Readable Dictionaries • In recent years, most dictionaries have been made available in Machine Readable format (MRD) • Oxford English Dictionary • Collins • Longman Dictionary of Contemporary English (LDOCE) • Thesauruses – add synonymy information • Roget Thesaurus • Semantic networks – add more semantic relations • WordNet • EuroWordNet

  10. MRD – A Resource for Knowledge-based WSD • For each word in the language vocabulary, an MRD provides: • A list of meanings • Definitions (for all word meanings) • Typical usage examples (for most word meanings) WordNet definitions/examples for the noun “plant” 1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles" 2. a living organism lacking the power of locomotion 3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant" 4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience

  11. MRD – A Resource for Knowledge-based WSD • A thesaurus adds: • An explicit synonymy relation between word meanings • A semantic network adds: • Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF), antonymy, entailment, etc. WordNet synsets for the noun “plant” 1. plant, works, industrial plant 2. plant, flora, plant life WordNet related concepts for the meaning “plant life” {plant, flora, plant life} hypernym: {organism, being} hyponym: {house plant}, {fungus}, … meronym: {plant tissue}, {plant part} holonym: {Plantae, kingdom Plantae, plant kingdom}

  12. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  13. Lesk Algorithm • (Michael Lesk 1986): Identify senses of words in context using definition overlap Algorithm: • Retrieve from MRD all sense definitions of the words to be disambiguated • Determine the definition overlap for all possible sense combinations • Choose senses that lead to highest overlap Example: disambiguate PINE CONE • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • CONE 1. solid body which narrows to a point 2. something of this shape whether solid or hollow 3. fruit of certain evergreen trees • Overlaps: Pine#1 ∩ Cone#1 = 0; Pine#2 ∩ Cone#1 = 0; Pine#1 ∩ Cone#2 = 1; Pine#2 ∩ Cone#2 = 0; Pine#1 ∩ Cone#3 = 2; Pine#2 ∩ Cone#3 = 0
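The overlap computation on this slide can be sketched in a few lines. This is a minimal illustration, not Lesk's original implementation: the stopword list, the plural-stripping rule, and the function names `content_words` and `lesk_pair` are assumptions made for the example. Because the normalization is crude, "shaped" and "shape" do not match, so Pine#1 ∩ Cone#2 comes out as 0 here rather than the 1 shown on the slide; the winning pair is unaffected.

```python
import re
from itertools import product

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"a", "an", "the", "of", "or", "to", "with", "which", "this", "in"}

def content_words(text):
    """Lowercase, split on non-letters, drop stopwords, strip a plural 's'."""
    stems = set()
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in STOPWORDS:
            continue
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        stems.add(w)
    return stems

def lesk_pair(senses_a, senses_b):
    """Return (overlap, sense_a, sense_b), 1-indexed, for the best sense pair."""
    best = (-1, None, None)
    pairs = product(enumerate(senses_a, 1), enumerate(senses_b, 1))
    for (i, def_a), (j, def_b) in pairs:
        overlap = len(content_words(def_a) & content_words(def_b))
        if overlap > best[0]:
            best = (overlap, i, j)
    return best

PINE = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]
CONE = ["solid body which narrows to a point",
        "something of this shape whether solid or hollow",
        "fruit of certain evergreen trees"]

print(lesk_pair(PINE, CONE))  # (2, 1, 3): Pine#1 and Cone#3 share "evergreen" and "tree"
```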

  14. Lesk Algorithm for More than Two Words? • I saw a man who is 98 years old and can still walk and tell jokes • nine open class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) • 43,929,600 sense combinations! How to find the optimal sense combination? • Simulated annealing (Cowie, Guthrie, Guthrie 1992) • Define a function E = combination of word senses in a given text. • Find the combination of senses that leads to highest definition overlap (redundancy) 1. Start with E = the most frequent sense for each word 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E 3. Stop iterating when there is no change in the configuration of senses
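The size of the search space quoted above is simply the product of the per-word sense counts, which a short check confirms (the dictionary name `sense_counts` is just for illustration):

```python
from math import prod

# Sense counts for the nine open-class words, as listed on the slide.
sense_counts = {"see": 26, "man": 11, "year": 4, "old": 8, "can": 5,
                "still": 4, "walk": 10, "tell": 8, "joke": 3}

total = prod(sense_counts.values())
print(f"{total:,}")  # 43,929,600
```

Exhaustively scoring every combination is impractical at this scale, which is why a search heuristic such as simulated annealing is used instead.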

  15. Lesk Algorithm: A Simplified Version • Original Lesk definition: measure overlap between sense definitions for all words in context • Identify simultaneously the correct senses for all words in context • Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and current context • Identify the correct sense for one word at a time • Search space significantly reduced

  16. Lesk Algorithm: A Simplified Version • Algorithm for simplified Lesk: • Retrieve from MRD all sense definitions of the word to be disambiguated • Determine the overlap between each sense definition and the current context • Choose the sense that leads to highest overlap Example: disambiguate PINE in “Pine cones hanging in a tree” • PINE 1. kinds of evergreen tree with needle-shaped leaves 2. waste away through sorrow or illness • Overlaps: Pine#1 ∩ Sentence = 1; Pine#2 ∩ Sentence = 0
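A minimal sketch of the simplified variant, assuming a toy stopword list and a crude plural-stripping normalizer (both hypothetical, as are the function names). Each sense definition is compared with the sentence rather than with the definitions of the other words:

```python
import re

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"a", "an", "the", "of", "or", "to", "with", "which", "this", "in"}

def content_words(text):
    """Lowercase, split on non-letters, drop stopwords, strip a plural 's'."""
    stems = set()
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in STOPWORDS:
            continue
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        stems.add(w)
    return stems

def simplified_lesk(definitions, context, target):
    """Return (best sense number, per-sense overlaps with the context)."""
    ctx = content_words(context) - content_words(target)
    scores = [len(content_words(d) & ctx) for d in definitions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores

PINE = ["kinds of evergreen tree with needle-shaped leaves",
        "waste away through sorrow or illness"]

print(simplified_lesk(PINE, "Pine cones hanging in a tree", "pine"))  # (1, [1, 0])
```

Only one pass over the senses of the target word is needed, so the cost grows linearly with the number of senses instead of with the product over all words.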

  17. Evaluations of Lesk Algorithm • Initial evaluation by M. Lesk • 50-70% on short samples of manually annotated text, with respect to Oxford Advanced Learner’s Dictionary • Simulated annealing • 47% on 50 manually annotated sentences • Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004) • Original Lesk: 35% • Simplified Lesk: 47% • Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004) • Original Lesk: 42% • Simplified Lesk: 58%

  18. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Preferences • Measures of Semantic Similarity • Heuristic-based Methods

  19. Unsupervised Disambiguation • Disambiguate word senses: • without supporting tools such as dictionaries and thesauri • without a labeled training text • Without such resources, word senses are not labeled • We cannot say “chair/furniture” or “chair/person” • We can: • Cluster/group the contexts of an ambiguous word into a number of groups • Discriminate between these groups without actually labeling them

  20. Unsupervised Disambiguation • Hypothesis: same senses of words will have similar neighboring words • Disambiguation algorithm • Identify context vectors corresponding to all occurrences of a particular word • Partition them into regions of high density • Assign a sense to each such region “Sit on a chair” “Take a seat on this chair” “The chair of the Math Department” “The chair of the meeting”

  21. Evaluating Word Sense Disambiguation • Metrics: • Precision = percentage of words that are tagged correctly, out of the words addressed by the system • Recall = percentage of words that are tagged correctly, out of all words in the test set • Example • Test set of 100 words • System attempts 75 words • Words correctly disambiguated: 50 • Precision = 50 / 75 = 0.67 • Recall = 50 / 100 = 0.50 • Special tags are possible: • Unknown • Proper noun • Multiple senses • Compare to a gold standard • SEMCOR corpus, SENSEVAL corpus, …
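The two metrics differ only in their denominator, as this small sketch of the slide's example shows (the function and parameter names are illustrative):

```python
def precision_recall(correct, attempted, total):
    """Precision is measured over attempted words, recall over the whole test set."""
    precision = correct / attempted
    recall = correct / total
    return precision, recall

p, r = precision_recall(correct=50, attempted=75, total=100)
print(round(p, 2), round(r, 2))  # 0.67 0.5
```

A system that attempts every word has equal precision and recall; abstaining on hard cases raises precision at the cost of recall.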

  22. Evaluating Word Sense Disambiguation • Difficulty in evaluation: • Nature of the senses to distinguish has a huge impact on results • Coarse versus fine-grained sense distinction: chair = a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"; chair = the position of professor; "he was awarded an endowed chair in economics"; bank = a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"; bank = a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon" • Sense maps • Cluster similar senses • Allow for both fine-grained and coarse-grained evaluation

  23. Bounds on Performance • Upper and Lower Bounds on Performance: • Measure of how well an algorithm performs relative to the difficulty of the task. • Upper Bound: • Human performance • Around 97%-99% with few and clearly distinct senses • Inter-judge agreement: • With words with clear & distinct senses – 95% and up • With polysemous words with related senses – 65% – 70% • Lower Bound (or baseline): • The assignment of a random sense / the most frequent sense • 90% is excellent for a word with 2 equiprobable senses • 90% is trivial for a word with 2 senses with probability ratios of 9 to 1

  24. References • (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992. • (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. • (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995. • (Senseval) Senseval evaluation exercises http://www.senseval.org

  25. Selectional Preferences • A way to constrain the possible meanings of words in a given context • E.g. “Wash a dish” vs. “Cook a dish” • WASH-OBJECT vs. COOK-FOOD • Capture information about possible relations between semantic classes • Common sense knowledge • Alternative terminology • Selectional Restrictions • Selectional Preferences • Selectional Constraints

  26. Acquiring Selectional Preferences • From annotated corpora • Circular relationship with the WSD problem • Need WSD to build the annotated corpus • Need selectional preferences to derive WSD • From raw corpora • Frequency counts • Information theory measures • Class-to-class relations

  27. Preliminaries: Learning Word-to-Word Relations • An indication of the semantic fit between two words 1. Frequency counts • Pairs of words connected by a syntactic relation 2. Conditional probabilities • Condition on one of the words
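Both quantities can be estimated directly from a list of (word, relation, word) triples. The sketch below uses a tiny hand-made list; the triples, the relation label "obj", and the function name `cond_prob` are assumptions for illustration, not data from the tutorial:

```python
from collections import Counter

# Hypothetical verb-object triples, as they might come out of a parsed corpus.
triples = [("drink", "obj", "coffee"), ("drink", "obj", "tea"),
           ("drink", "obj", "coffee"), ("drink", "obj", "water"),
           ("wash", "obj", "dish"), ("cook", "obj", "dish")]

pair_counts = Counter(triples)                           # Count(w1, R, w2)
head_counts = Counter((w1, r) for w1, r, _ in triples)   # Count(w1, R)

def cond_prob(w2, w1, rel):
    """P(w2 | w1, R) = Count(w1, R, w2) / Count(w1, R); 0.0 if w1 was never seen with R."""
    seen = head_counts[(w1, rel)]
    return pair_counts[(w1, rel, w2)] / seen if seen else 0.0

print(pair_counts[("drink", "obj", "coffee")])   # 2
print(cond_prob("coffee", "drink", "obj"))       # 0.5
```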

  28. Learning Selectional Preferences (1) • Word-to-class relations (Resnik 1993) • Quantify the contribution of a semantic class using all the concepts subsumed by that class [formulas not preserved in this transcript]
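The formulas for this slide did not survive in the transcript. As a point of reference, the formulation usually attributed to Resnik's work on selectional preferences is given below; the notation (a word $w$, a semantic class $c$, a syntactic relation $R$) is chosen here for illustration and may differ from the slide's. The selectional preference strength $S_R(w)$ measures how much $w$ constrains its arguments overall, and the selectional association $A_R(w, c)$ is the share of that strength contributed by class $c$:

```latex
S_R(w) = \sum_{c} P(c \mid w, R)\,\log\frac{P(c \mid w, R)}{P(c)}

A_R(w, c) = \frac{1}{S_R(w)}\; P(c \mid w, R)\,\log\frac{P(c \mid w, R)}{P(c)}
```

The class probabilities are estimated by crediting each observed word to all the classes that subsume it.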

  29. Learning Selectional Preferences (2) • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2 • Example: learning restrictions for the verb “to drink” • Find high-scoring verb-object pairs • Find “prototypical” object classes (high association score)

  30. Learning Selectional Preferences (3) • Other algorithms • Learn class-to-class relations (Agirre and Martinez, 2001) • E.g.: “ingest food” is a class-to-class relation for “eat chicken” • Bayesian networks (Ciaramita and Johnson, 2000) • Tree cut model (Li and Abe, 1998)

  31. Using Selectional Preferences for WSD Algorithm: 1. Learn a large set of selectional preferences for a given syntactic relation R 2. Given a pair of words W1–W2 connected by a relation R 3. Find all selectional preferences W1–C (word-to-class) or C1–C2 (class-to-class) that apply 4. Select the meanings of W1 and W2 based on the selected semantic class • Example: disambiguate “coffee” in “drink coffee” 1. (beverage) a beverage consisting of an infusion of ground coffee beans 2. (tree) any of several small trees native to the tropical Old World 3. (color) a medium to dark brown color Given the selectional preference “DRINK BEVERAGE” → coffee#1
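Steps 2 to 4 amount to a lookup followed by an argmax. A minimal sketch, assuming the preferences have already been learned; the association scores and the names `SENSES`, `PREFERENCES`, and `disambiguate` are hypothetical:

```python
# Semantic class of each sense of "coffee", as listed on the slide.
SENSES = {"coffee": {1: "beverage", 2: "tree", 3: "color"}}

# Hypothetical word-to-class association scores for the verb-object relation.
PREFERENCES = {("drink", "obj"): {"beverage": 0.9, "tree": 0.01}}

def disambiguate(verb, rel, noun):
    """Pick the noun sense whose class the verb prefers most for this relation."""
    prefs = PREFERENCES.get((verb, rel), {})
    senses = SENSES[noun]
    return max(senses, key=lambda s: prefs.get(senses[s], 0.0))

print(disambiguate("drink", "obj", "coffee"))  # 1
```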

  32. Evaluation of Selectional Preferences for WSD • Data set • mainly on verb-object, subject-verb relations extracted from SemCor • Compare against random baseline • Results (Agirre and Martinez, 2001) • Average results on 8 nouns • Similar figures reported in (Resnik 1997)

  33. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  34. Semantic Similarity • Words in a discourse must be related in meaning, for the discourse to be coherent (Halliday and Hasan, 1976) • Use this property for WSD – Identify related meanings for words that share a common context • Context span: 1. Local context: semantic similarity between pairs of words 2. Global context: lexical chains

  35. Semantic Similarity in a Local Context • Similarity determined between pairs of concepts, or between a word and its surrounding context • Relies on similarity metrics on semantic networks • (Rada et al. 1989) [Figure: fragment of the WordNet taxonomy under “carnivore”, with nodes: fissiped mammal, fissiped; canine, canid; feline, felid; bear; wolf; wild dog; dog; hyena; dingo; hyena dog; hunting dog; dachshund; terrier]

  36. Semantic Similarity Metrics (1) • Input: two concepts (same part of speech) • Output: similarity measure • (Leacock and Chodorow 1998): Similarity(C1, C2) = -log(Path(C1, C2) / 2D), where Path(C1, C2) is the length of the shortest path between the two concepts and D is the taxonomy depth • E.g. Similarity(wolf,dog) = 0.60 Similarity(wolf,bear) = 0.42 • (Resnik 1995) • Define information content, IC(C) = -log(P(C)), where P(C) = probability of seeing a concept of type C in a large corpus • Probability of seeing a concept = probability of seeing instances of that concept • Determine the contribution of a word sense based on the assumption of equal sense distributions: • e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2
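A common statement of the Leacock and Chodorow measure is -log(Path(C1, C2) / 2D), with the path length counted in nodes and D the maximum depth of the taxonomy. The sketch below applies it to a hand-made three-level taxonomy; the tree shape, the use of the natural logarithm, and all names are assumptions, so the values will not reproduce the 0.60 and 0.42 quoted on the slide, only their ordering.

```python
import math

# Hypothetical toy taxonomy: child -> parent.
PARENT = {"carnivore": None, "canine": "carnivore", "bear": "carnivore",
          "wolf": "canine", "dog": "canine"}

def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path

def path_length(a, b):
    """Number of nodes on the shortest path between a and b."""
    pa, pb = ancestors(a), ancestors(b)
    common = next(n for n in pa if n in pb)   # lowest common subsumer
    return pa.index(common) + pb.index(common) + 1

DEPTH = max(len(ancestors(n)) for n in PARENT)   # D, counted in nodes

def lch_similarity(a, b):
    return -math.log(path_length(a, b) / (2 * DEPTH))

print(path_length("wolf", "dog"), path_length("wolf", "bear"))  # 3 4
print(lch_similarity("wolf", "dog") > lch_similarity("wolf", "bear"))  # True
```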

  37. Semantic Similarity Metrics (2) • Similarity using information content • (Resnik 1995) Define similarity between two concepts (LCS = Least Common Subsumer) • Alternatives (Jiang and Conrath 1997) • Other metrics: • Similarity using information content (Lin 1998) • Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999) • Conceptual density measure between noun semantic hierarchies and current context (Agirre and Rigau 1995) • Adapted Lesk algorithm (Banerjee and Pedersen 2002)
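The information-content measures named above are usually written as follows: Resnik similarity is the information content of the least common subsumer, the Jiang and Conrath measure is a distance IC(C1) + IC(C2) - 2·IC(LCS), and Lin similarity is the ratio 2·IC(LCS) / (IC(C1) + IC(C2)). A minimal sketch, in which the concept probabilities are made up and the LCS is passed in by hand rather than looked up in a taxonomy:

```python
import math

# Hypothetical concept probabilities; a real system estimates them from a corpus.
P = {"carnivore": 0.10, "canine": 0.04, "dog": 0.02, "wolf": 0.005}

def ic(concept):
    """Information content: IC(C) = -log P(C)."""
    return -math.log(P[concept])

def resnik(c1, c2, lcs):
    return ic(lcs)

def jiang_conrath_distance(c1, c2, lcs):
    return ic(c1) + ic(c2) - 2 * ic(lcs)

def lin(c1, c2, lcs):
    return 2 * ic(lcs) / (ic(c1) + ic(c2))

print(round(resnik("dog", "wolf", "canine"), 3))
print(round(jiang_conrath_distance("dog", "wolf", "canine"), 3))
print(round(lin("dog", "wolf", "canine"), 3))
```

Note that Jiang and Conrath define a distance, so smaller values mean more similar concepts, whereas the Resnik and Lin scores grow with similarity.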

  38. Semantic Similarity Metrics for WSD • Disambiguate target words based on similarity with one word to the left and one word to the right • (Patwardhan, Banerjee, Pedersen 2003) • Evaluation: • 1,723 ambiguous nouns from Senseval-2 • Among 5 similarity metrics, (Jiang and Conrath 1997) provide the best precision (39%) Example: disambiguate PLANT in “plant with flowers” • PLANT 1. plant, works, industrial plant 2. plant, flora, plant life • Similarity (plant#1, flower) = 0.2; Similarity (plant#2, flower) = 1.5 → plant#2

  39. Semantic Similarity in a Global Context • Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976) • “A lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse” Algorithm for finding lexical chains: • Select the candidate words from the text. These are words for which we can compute similarity measures, and therefore most of the time they have the same part of speech. • For each such candidate word, and for each meaning for this word, find a chain to receive the candidate word sense, based on a semantic relatedness measure between the concepts that are already in the chain, and the candidate word meaning. • If such a chain is found, insert the word in this chain; otherwise, create a new chain.

  40. Semantic Similarity of a Global Context • Example: “A very long train traveling along the rails with a constant velocity v in a certain direction…” • train #1: public transport #2: ordered set of things #3: piece of cloth • travel #1: change location #2: undergo transportation • rail #1: a barrier #2: a bar of steel for trains #3: a small bird
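The chaining procedure can be sketched as a greedy loop over candidate words. This is a simplified illustration rather than any published implementation: relatedness is reduced to a hypothetical domain tag prefixed to each sense, and a word that fits no existing chain is assumed to open a new chain with its first sense.

```python
def build_chains(candidates, related):
    """candidates: list of (word, [senses]); related(s1, s2) -> bool.
    Returns a list of chains, each a list of (word, chosen sense)."""
    chains = []
    for word, senses in candidates:
        placed = False
        for sense in senses:
            for chain in chains:
                if any(related(sense, member) for _, member in chain):
                    chain.append((word, sense))
                    placed = True
                    break
            if placed:
                break
        if not placed:
            # Assumption: no related chain found, so start one with the first sense.
            chains.append([(word, senses[0])])
    return chains

def same_domain(s1, s2):
    """Toy relatedness: two senses are related if their domain tags match."""
    return s1.split(":")[0] == s2.split(":")[0]

# Senses from the slide, each prefixed with a hypothetical domain tag.
CANDIDATES = [
    ("train", ["transport:public transport", "sequence:ordered set of things",
               "cloth:piece of cloth"]),
    ("travel", ["motion:change location", "transport:undergo transportation"]),
    ("rail", ["barrier:a barrier", "transport:a bar of steel for trains",
              "bird:a small bird"]),
]

chains = build_chains(CANDIDATES, same_domain)
for chain in chains:
    print(chain)
```

In this toy run all three words land in a single chain, which selects travel#2 and rail#2. A real system would replace `same_domain` with a relatedness measure over a semantic network.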

  41. Lexical Chains for WSD • Identify lexical chains in a text • Usually target one part of speech at a time • Identify the meaning of words based on their membership to a lexical chain • Evaluation: • (Galley and McKeown 2003) lexical chains on 74 SemCor texts give 62.09% • (Mihalcea and Moldovan 2000) on five SemCor texts give 90% with 60% recall • lexical chains “anchored” on monosemous words • (Okumura and Honda 1994) lexical chains on five Japanese texts give 63.4%

  42. Outline • Task definition • Machine Readable Dictionaries • Algorithms based on Machine Readable Dictionaries • Selectional Restrictions • Measures of Semantic Similarity • Heuristic-based Methods

  43. Most Frequent Sense (1) • Identify the most often used meaning and use this meaning by default • Word meanings exhibit a Zipfian distribution • E.g. distribution of word senses in SemCor • Example: “plant/flora” is used more often than “plant/factory” → annotate any instance of PLANT as “plant/flora”

  44. Most Frequent Sense (2) • Method 1: Find the most frequent sense in an annotated corpus • Method 2: Find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004) 1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, …, nk}, with associated similarity scores {dss(w,n1), dss(w,n2), … dss(w,nk)} 2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score 3. Rank senses wsi of w based on the total similarity score

  45. Most Frequent Sense (3) • Word senses • pipe #1 = tobacco pipe • pipe #2 = tube of metal or plastic • Distributionally similar words • N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …} • For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity) • pipe#1 – tube (#3) = 0.3 • pipe#2 – tube (#1) = 0.6 • Compute score for each sense pipe#i • score (pipe#1) = 0.25 • score (pipe#2) = 0.73 Note: results depend on the corpus used to find distributionally similar words => can find domain specific predominant senses
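The ranking step can be sketched as a weighted sum over the neighbours. In this illustration each neighbour's distributional score is split across the senses of the target word in proportion to their sense-level similarity; that per-neighbour normalization is an assumption rather than something stated on the slide. Only the two "tube" similarities (0.3 and 0.6) come from the slide; every other number is made up, so the scores differ from the 0.25 and 0.73 quoted above.

```python
def rank_senses(neighbours, sense_sim):
    """neighbours: {neighbour: dss(w, neighbour)}
    sense_sim: {sense: {neighbour: best similarity between that sense and the neighbour}}
    Returns senses sorted by total score, highest first."""
    scores = {}
    for sense in sense_sim:
        total = 0.0
        for n, dss in neighbours.items():
            norm = sum(sense_sim[s][n] for s in sense_sim)
            if norm > 0:
                total += dss * sense_sim[sense][n] / norm
        scores[sense] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical distributional similarity scores for two neighbours of "pipe".
neighbours = {"tube": 0.5, "cable": 0.4}
sense_sim = {"pipe#1": {"tube": 0.3, "cable": 0.1},
             "pipe#2": {"tube": 0.6, "cable": 0.5}}

ranking = rank_senses(neighbours, sense_sim)
print(ranking[0][0])  # pipe#2
```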

  46. One Sense Per Discourse • A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992) • What does this mean? E.g. if the ambiguous word PLANT occurs 10 times in a discourse, all instances of “plant” carry the same meaning • Evaluation: • 8 words with two-way ambiguity, e.g. plant, crane, etc. • 98% of the two-word occurrences in the same discourse carry the same meaning • The grain of salt: performance depends on granularity • (Krovetz 1998) experiments with words with more than two senses • Performance of “one sense per discourse” measured on SemCor is approx. 70%
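One way to exploit this heuristic is as a post-processing step: after some other method has tagged every occurrence, relabel each word with its majority sense within the discourse. The sketch below shows that use under this assumption; the function name and the sense labels are illustrative.

```python
from collections import Counter

def one_sense_per_discourse(tagged):
    """tagged: list of (word, predicted sense) for one discourse.
    Relabels every occurrence of a word with that word's majority sense."""
    votes = {}
    for word, sense in tagged:
        votes.setdefault(word, Counter())[sense] += 1
    majority = {w: c.most_common(1)[0][0] for w, c in votes.items()}
    return [(w, majority[w]) for w, _ in tagged]

tagged = [("plant", "flora"), ("plant", "factory"), ("plant", "flora")]
print(one_sense_per_discourse(tagged))
# [('plant', 'flora'), ('plant', 'flora'), ('plant', 'flora')]
```

As the slide notes, the heuristic is less reliable with fine-grained sense inventories, so forcing a single sense can also overwrite correct minority tags.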

  47. One Sense per Collocation • A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993) • Strong for adjacent collocations • Weaker as the distance between words increases • An example: the ambiguous word PLANT preserves its meaning in all its occurrences within the collocation “industrial plant”, regardless of the context where this collocation occurs • Evaluation: • 97% precision on words with two-way ambiguity • Finer granularity: • (Martinez and Agirre 2000) tested the “one sense per collocation” hypothesis on text annotated with WordNet senses • 70% precision on SemCor words

  48. References • (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995. • (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001. • (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002. • (Cowie, Guthrie and Guthrie 1992) Cowie, J., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992. • (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA workshop 1992. • (Halliday and Hasan 1976) Halliday, M. and Hasan, R. Cohesion in English. Longman 1976. • (Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003. • (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press. • (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997. • (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998. • (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986. • (Lin 1998) Lin, D. An information theoretic definition of similarity. ICML 1998.

  49. References • (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000. • (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994. • (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11) 1995. • (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999. • (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000. • (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on Semantic Networks with Application to Word Sense Disambiguation. COLING 2004. • (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003. • (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1) 1989. • (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania 1993. • (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995. • (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004. • (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.

  50. Part 4: Supervised Methods of Word Sense Disambiguation
