LING / C SC 439/539 Statistical Natural Language Processing • Lecture 20 • 4/1/2013
Recommended reading • Word Sense Disambiguation • Jurafsky & Martin 20.0-20.4 • David Yarowsky. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proc. of ACL. • David Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proc. of ACL. • Discuss again next week • Coreference resolution • Jurafsky & Martin 21.3 • Aria Haghighi & Dan Klein. 2009. Simple Coreference Resolution with Rich Syntactic and Semantic Features. Proc. of EMNLP. • Vincent Ng. 2010. Supervised Noun Phrase Coreference Research: The First Fifteen Years. Proc. of ACL. • Information extraction • Jurafsky & Martin Chapter 22
Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction
Generative probabilistic models • Assign a probability distribution over all possible outcomes of all variables • Make independence and conditional independence assumptions about the data • Otherwise parameter estimation runs into the sparse data problem • Such assumptions are made according to one’s theory about the structure of the data
Generative probabilistic models • Example: language model • Observed variables only • Sequence of words W • Generative model: Nth-order Markov model p(W)
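As a concrete illustration (not from the slides), here is a minimal bigram (1st-order Markov) language model; the toy training sentences are invented, and the unsmoothed MLE estimates show how quickly sparse data drives p(W) to zero:

```python
# A minimal sketch of a 1st-order Markov (bigram) language model:
# p(W) = product over i of p(w_i | w_{i-1}).
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return unigrams, bigrams

def prob(words, unigrams, bigrams):
    words = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # MLE; no smoothing
    return p

unigrams, bigrams = train_bigram_lm([["the", "dog", "barks"], ["the", "cat", "sleeps"]])
# An unseen bigram ("dog sleeps") zeroes the whole product: the sparse data problem.
print(prob(["the", "dog", "sleeps"], unigrams, bigrams))
```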
Generative models with hidden variables • Example: Naïve Bayes • Observed: vector of features X • Hidden: class variable C • Generative model: p(C, X)
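A minimal sketch of the Naïve Bayes joint model, with made-up priors and per-feature likelihood tables; it classifies by argmaxC p(C, X), which picks the same class as argmaxC p(C | X):

```python
# Naive Bayes generative model: p(C, X) = p(C) * prod_i p(x_i | C).
# All probabilities below are hypothetical, for illustration only.
def naive_bayes_joint(c, x, prior, likelihood):
    p = prior[c]
    for i, xi in enumerate(x):
        p *= likelihood[c][i].get(xi, 1e-6)  # tiny floor for unseen values
    return p

prior = {"sports": 0.5, "politics": 0.5}
likelihood = {
    "sports":   [{"ball": 0.3,  "vote": 0.01}, {"win": 0.2,  "law": 0.02}],
    "politics": [{"ball": 0.01, "vote": 0.3},  {"win": 0.05, "law": 0.2}],
}
x = ("ball", "win")
print(max(prior, key=lambda c: naive_bayes_joint(c, x, prior, likelihood)))  # -> sports
```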
Generative models with hidden variables • Example: HMM (for POS tagging) • Observed: sequence of words W • Hidden: sequence of POS tags T • Generative model: p(W, T)
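A toy illustration with hypothetical transition and emission parameters, computing the HMM joint probability p(W, T) for one candidate tag sequence:

```python
# HMM joint probability: p(W, T) = prod_i p(t_i | t_{i-1}) * p(w_i | t_i).
# Parameters are invented for illustration.
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.4}
emit  = {("DT", "the"): 0.5, ("NN", "dog"): 0.01, ("VB", "runs"): 0.02}

def joint(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

print(joint(["the", "dog", "runs"], ["DT", "NN", "VB"]))
```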
Generative models with hidden variables • Example: PCFG • Observed: sentence S • Hidden: parse tree T • Generative model: p(S, T) • p(S, T) = product of rules to derive sentence S with phrase structure tree T from the start symbol
Common problems for generative models • Parameter estimation • Estimate probabilities from a corpus • Calculate probability of an observation • E.g., probability of a sentence • Marginalize over hidden variables (an ambiguous observation may have multiple hidden structures) • Classification (also called decoding) • Find most likely hidden structure for observations
Classification in generative models • Classification • Want to find most likely values for hidden variables given observations • Compute argmaxH p(H|O) • Use Bayes rule • Generative model defines p(O, H) • Use Bayes to obtain p(H|O) • argmaxH p(H|O) = argmaxH p(O|H) * p(H)
Bayes rule and classification • Bayes rule: p(B | A) = p(A, B) / p(A) • Product rule: p(A, B) = p(A) * p(B | A) = p(B) * p(A | B) • argmaxB p(B | A) = argmaxB p(A | B) * p(B) • Compute the posterior p(B | A) from the likelihood p(A | B) and the prior p(B) • Ignore the denominator p(A), since it is constant across values of B
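A worked toy example of this argmax (numbers invented): since p(A) is the same for every value of B, comparing p(A | B) * p(B) suffices:

```python
# scores[b] = p(A | b) * p(b)  ∝  p(b | A), so the argmax is unchanged
# when the shared denominator p(A) is dropped.
prior      = {"spam": 0.3, "ham": 0.7}
likelihood = {"spam": 0.05, "ham": 0.001}   # hypothetical p("free money" | class)

scores = {b: likelihood[b] * prior[b] for b in prior}
print(max(scores, key=scores.get))  # -> "spam" (0.015 > 0.0007)
```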
Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction
Classification in generative models • Data described by a set of random variables • C ∈ { c1, …, cn }: the class to be predicted • F = f1, f2, …, fm: feature variables • Classification: choose the most likely class given the features • Generative model defines a joint distribution: p( C, F ) • Use Bayes rule to recover the conditional prob.: p( C | F )
Discriminative models of classification • Find the class C that maximizes p( C | F ) • C ∈ { c1, …, cn }: the class to be predicted • F = f1, f2, …, fm: feature variables • Discriminative: directly model p( C | F ) • “Discriminative”: find out what distinguishes the classes • Compare to generative • First model p(C, F), then use Bayes rule to compute p(F|C) * p(C), which yields the same argmax as p( C | F ) • “Generative”: a probability distribution over all of the data
Popular discriminative classifiers • Decision List • Binary classification, single feature • Logistic Regression • Binary classification, vector of features • Maximum Entropy • Multiclass classification, vector of features • Conditional Random Field • Multiclass classification, sequential classifier, vectors of features • (SVM, Perceptron) • Discriminative, though not probabilistic
Generative vs. discriminative classifiers • By Bayes rule, the generative and discriminative decision rules are equivalent (given the true probabilities): argmaxC p( C | F ) = argmaxC p( F | C ) * p( C ) = argmaxC p( F, C ) • Discriminative: argmaxC p( C | F ) • Generative: argmaxC p( F | C ) * p( C )
Generative vs. discriminative: independence assumptions • Generative: model the joint prob. of classes and features: p(C, F) • Often have to make severe independence assumptions • e.g. Naïve Bayes: p(C, F) = p(C) * Πi p(fi | C) • In discriminative classifiers where we model p(C|F), we don’t need to make such independence assumptions • We can use non-independent features without specifying their probabilistic dependencies • Example of dependent features: • “Current word begins with capital letter” • “Current word is all-caps”
Generative vs. discriminative classifiers • Vapnik 1998: “one should solve the [classification] problem directly and never solve a more general problem as an intermediate step” • Model p( C | F ) directly • Don’t first define p(C, F), then use it to obtain p( C | F ) • Problem: additional difficulties with parameter estimation in the joint model
Summary of probabilistic classifiers (joint = generative, conditional = discriminative)
Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction
Disambiguation problems • For these problems, the instance to be classified is ambiguous: • Accent restoration • Word sense disambiguation • Capitalization restoration • Binary classification problems
Problem 1: accent restoration • Some languages, such as French and Spanish, are written with accents on characters, and these accents determine word identity • Text is sometimes typed without accents • Need to perform accent restoration to recover the intended words • Example: … une famille des pecheurs • pêcheurs (fishermen) • pécheurs (sinners)
Problem 2: capitalization restoration • Text is sometimes written in all capitals or all lower case, and needs disambiguation • “AIDS …” • disease or helpful tools? • Words at the beginning of a sentence are capitalized • “Bush …” • president or shrub?
Problem 3: word sense disambiguation (WSD) • The bank on State Street • Possible meanings of “bank” • Sense 1: river bank • Sense 2: place for $$$ • Need word sense disambiguation • Given an ambiguous word, decide on its sense
WSD is important in translation • Translation into Korean: • Iraq lost the battle. Ilakuka centwey ciessta. [Iraq] [battle] [lost] • John lost his computer. John-i computer-lul ilepelyessta. [John] [computer] [misplaced] • Semantic constraints: • lose1(Agent, Patient: competition) <=> ciessta • lose2(Agent, Patient: physobj) <=> ilepelyessta
WSD is needed in speech synthesis (convert text to sound) • … slightly elevated lead levels • Sense 1: lead role (rhymes with seed) • Sense 2: lead mines (rhymes with bed) • The speaker produces too little bass • Sense 1: string bass (rhymes with vase) • Sense 2: sea bass (rhymes with pass)
Word sense disambiguation • For a particular word, can its senses be distinguished? • First, need a set of senses to be predicted • WordNet: • Hierarchically organized database of senses for open-class words in English • http://www.cogsci.princeton.edu/~wn/
Word senses in WordNet • Meanings of nouns, verbs, and adjectives are specified using a catalog of possible senses
• aim: 1. Point or direct (an object, weapon) at something 2. Wish, purpose or intend to achieve something
• register: 1. Enter into an official record 2. Be aware of, enter into someone’s consciousness 3. Indicate a measurement 4. Show in one’s face
• Example: “Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons, as well as major reductions and realignments of troops in central Europe -- also are being registered at the Pentagon.” • Here aimed = “wish, purpose or intend to achieve something” and registered = “enter into an official record”
Words can have many senses in WordNet; for WSD, let’s assume each word has 2 senses
The noun bass has 8 senses:
1. bass -- (the lowest part of the musical range)
2. bass, bass part -- (the lowest part in polyphonic music)
3. bass, basso -- (an adult male singer with the lowest voice)
4. sea bass, bass -- (the lean flesh of a saltwater fish of the family Serranidae)
5. freshwater bass, bass -- (any of various North American freshwater fish with lean flesh (especially of the genus Micropterus))
6. bass, bass voice, basso -- (the lowest adult male singing voice)
7. bass -- (the member with the lowest range of a family of musical instruments)
8. bass -- (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
The adj bass has 1 sense:
1. bass, deep -- (having or denoting a low vocal or instrumental range; "a deep voice"; "a bass clarinet")
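For reference, a word’s WordNet senses can be listed programmatically, e.g. with NLTK’s WordNet interface (assuming nltk is installed and the wordnet data has been downloaded):

```python
# Print every WordNet synset for "bass" with its gloss.
# Setup: pip install nltk; then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bass"):
    print(synset.name(), "-", synset.definition())
```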
Senseval-1 (1998): English, French, Italian WSD http://www.senseval.org/ • 35 different words were tested • [Table: # of test instances per word]
How can we do WSD? • Disambiguate a word by looking at its context • Warren Weaver, 1949: “If one examines the words in a book, one at a time as through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of the words […] But if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then if N is large enough one can unambiguously decide the meaning of the central word.”
Example: WSD through context • What does this word mean in each case? • The human hand consists of a broad palm with 5 digits, attached to the forearm by a joint called the wrist (carpus). • Neither the anatomy of the palm tree stems nor the conformation of their flowers, however, entitles them to any such high position in the vegetable hierarchy.
Can’t build a system by hand • Fernand Marty (1986, 1992) • French text-to-speech synthesis • Hand-formulated rules and heuristics • Rule: presence of deposit in the vicinity of bank indicates $$$ • Problem: lots and lots and lots and lots of rules would be needed
Outline • Generative models of classification • Generative vs. discriminative classifiers • Disambiguation problems • Decision List • Coreference resolution • Information extraction
Decision List • David Yarowsky (1994, 1995) • A simple discriminative classifier • Compute argmaxC p(C|F) • Compare: p(C1|f1), p(C2|f1), … p(C1|fn), p(C2|fn) • Choose class based on largest difference in p( Ci | fj ) for a feature fj in the data to be classified
Decision List for WSD: p(sense|feature) • The decision list compares the conditional probabilities of senses given various features to determine the probabilistically most likely sense for a word • Example: disambiguate ‘bank’ in this sentence: • I checked my boat at the marina next to the bank of the river near the shore. • p( money-sense | ‘check’ ) • p( river-sense | ‘check’ ) • p( money-sense | ‘shore’ ) • p( river-sense | ‘shore’ ) ← let’s say this has the highest prob
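A minimal sketch of this comparison with hypothetical probabilities; the decision is driven by the single most lopsided p(sense | feature):

```python
# Pick the (feature, sense) pair with the largest conditional probability;
# that one strongest feature decides the sense. Numbers are invented.
p = {
    ("check", "money-sense"): 0.70, ("check", "river-sense"): 0.30,
    ("shore", "money-sense"): 0.05, ("shore", "river-sense"): 0.95,
}
feature, sense = max(p, key=p.get)
print(f"'{feature}' -> {sense} (p = {p[(feature, sense)]})")  # shore -> river-sense
```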
Automatically build disambiguation system • Yarowsky’s method: • Get corpus with words annotated for different categories • Formulate templates for generation of disambiguating rules • Algorithm constructs all such rules from a corpus • Algorithm selects relevant rules through statistics of usage for each category • Methodology can be applied to any binary disambiguation problem
[Diagram: rule templates + annotated corpus → possible rules; possible rules + statistics of usage → ranked rules]
Decision list algorithm: step 1, identify ambiguities • Example problem: accent restoration
Step 2: Collect training contexts • Begin with an annotated corpus • (In this context, a corpus with accents indicated)
Step 3: Specify rule templates • Given a particular training context, collect: • Word immediately to the right (+1 W) or left (-1 W) • Word found in ±k word window • Pair of words at fixed offsets • Other evidence can be used: • Lemma (morphological root) • Part of speech category • Other types of word classes (e.g. set of days of week)
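A sketch of what these templates might look like as feature extractors; the template names and the k = 3 window are illustrative choices, not Yarowsky’s exact configuration:

```python
# Emit candidate features for the ambiguous word at index i.
def extract_features(tokens, i, k=3):
    feats = []
    if i + 1 < len(tokens):
        feats.append(("+1 W", tokens[i + 1]))        # word immediately to the right
    if i - 1 >= 0:
        feats.append(("-1 W", tokens[i - 1]))        # word immediately to the left
    for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
        if j != i:
            feats.append(("±k W", tokens[j]))        # word found in ±k window
    if i - 2 >= 0:
        feats.append(("-2 W, -1 W", (tokens[i - 2], tokens[i - 1])))  # pair at fixed offsets
    return feats

print(extract_features("the boat by the bank of the river".split(), 4))
```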
Which rules are indicative of a category? • Two categories c1 and c2; p(c1|rule) + p(c2|rule) = 1 • Log-likelihood ratio: log( p(c1|rule) / p(c2|rule) ) • If p(c1|rule) = 0.5 and p(c2|rule) = 0.5, doesn’t distinguish log( p(c1 | rule) / p(c2 | rule) ) = 0 • If p(c1|rule) > 0.5 and p(c2|rule) < 0.5, c1 is more likely log( p(c1 | rule) / p(c2 | rule) ) > 0 • If p(c1|rule) < 0.5 and p(c2|rule) > 0.5, c2 is more likely log( p(c1 | rule) / p(c2 | rule) ) < 0
Which rules are best for disambiguating between categories? • Use absolute value of log-likelihood ratio: abs(log( p(sense1 | rule) / p(sense2 | rule) )) • Rank rules by abs. value of log-likelihood ratio • Rules that best distinguish between the two categories are ranked highest
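Putting the last two slides together, a sketch (with invented counts and a small smoothing constant to avoid log(0)) of computing each rule’s log-likelihood ratio and ranking by its absolute value:

```python
# Rank rules by |log( p(c1|rule) / p(c2|rule) )|, estimated from counts.
import math

def llr(count_c1, count_c2, alpha=0.1):
    p1 = (count_c1 + alpha) / (count_c1 + count_c2 + 2 * alpha)  # smoothed p(c1|rule)
    return math.log(p1 / (1 - p1))

counts = {"+1 W = 'of'": (80, 5), "-1 W = 'my'": (2, 60), "±k W = 'water'": (30, 28)}
ranked = sorted(counts, key=lambda r: abs(llr(*counts[r])), reverse=True)
for rule in ranked:
    print(rule, round(llr(*counts[rule]), 2))   # near-zero LLR ranks last
```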
Step 5: Choose rules that are indicative of categories: sort by abs(LogL) • This is the final decision list
Step 6: classify new data with decision list • For a sentence with a word to be disambiguated: • Go down the ranked list of rules in the decision list • Find the first rule with a matching context • Assign a sense according to that rule • Finished. • Ignore other lower-ranked rules, even if they have matching contexts as well
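A minimal sketch of this procedure, with a hypothetical pre-ranked decision list for ‘bank’; the first matching rule wins and everything below it is ignored:

```python
# Rules are already sorted by |log-likelihood ratio|, best first.
decision_list = [                     # (feature word, sense)
    ("water", "river-sense"),
    ("deposit", "money-sense"),
    ("loan", "money-sense"),
]

def classify(context_words, default="money-sense"):
    for word, sense in decision_list:
        if word in context_words:
            return sense              # first matching rule decides
    return default                    # fallback if nothing matches

print(classify("he sat by the water near the bank".split()))  # -> river-sense
```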
Example: disambiguate “plant” • Radiation from the crippled nuclear plant in Japan is showing up in rain in the United States.
How well does it work? • Simple statistical model • Easy to implement • Test performance on several different disambiguation problems
Performance: accent restoration • On ambiguous cases, 98% correct • Examples: • côte / côté: 98% • décidé / décide: 97% • hacia / hacía: 97%