
Statistical NLP Spring 2011


Presentation Transcript


  1. Statistical NLP Spring 2011. Lecture 6: POS / Phrase MT. Dan Klein – UC Berkeley

  2. Parts-of-Speech (English) • One basic kind of linguistic structure: syntactic word classes • Open class (lexical) words: Nouns (proper: IBM, Italy; common: cat / cats, snow), Verbs (main: see, registered), Adjectives (yellow), Adverbs (slowly), Numbers (122,312, one), … more • Closed class (functional) words: Determiners (the, some), Prepositions (to, with), Modals (can, had), Conjunctions (and, or), Particles (off, up), Pronouns (he, its), … more

  3. Part-of-Speech Ambiguity • Words can have multiple parts of speech • Example: Fed raises interest rates 0.5 percent, with candidate tags Fed: NNP/VBN/VBD, raises: VBZ/NNS, interest: NN/VB/VBP, rates: NNS/VBZ, 0.5: CD, percent: NN • Two basic sources of constraint: • Grammatical environment • Identity of the current word • Many more possible features: • Suffixes, capitalization, name databases (gazetteers), etc…

  4. Why POS Tagging? • Useful in and of itself (more than you’d think) • Text-to-speech: record, lead • Lemmatization: saw[v] → see, saw[n] → saw • Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS} • Useful as a pre-processing step for parsing • Less tag ambiguity means fewer parses • However, some tag choices are better decided by parsers • Examples: The Georgia branch had taken on loan commitments … (DT NNP NN VBD VBN RP NN NNS, where "on" could be RP or IN); The average of interbank offered rates plummeted … (DT NN IN NN VBN NNS VBD, where "offered" could be VBN or VBD)

  5. Classic Solution: HMMs • (Diagram: hidden states s0, s1, s2, …, sn, each emitting a word w1, w2, …, wn) • We want a model of sequences s and observations w • Assumptions: • States are tag n-grams • Usually a dedicated start and end state / word • Tag/state sequence is generated by a Markov model • Words are chosen independently, conditioned only on the tag/state • These are totally broken assumptions: why?
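As a concrete illustration of these assumptions, here is a minimal sketch (mine, not from the lecture) that scores one tag sequence under a bigram HMM; the probability tables trans and emit are assumed to be given as dictionaries.

    import math

    def hmm_log_score(tags, words, trans, emit, start="<s>", stop="</s>"):
        """Log P(tags, words) under a bigram HMM: product of transition
        probabilities P(t_i | t_{i-1}) and emission probabilities P(w_i | t_i)."""
        score = 0.0
        prev = start
        for t, w in zip(tags, words):
            score += math.log(trans[prev][t]) + math.log(emit[t][w])
            prev = t
        score += math.log(trans[prev][stop])  # dedicated end state
        return score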

  6. States • (Diagram: two lattices of states s0 … sn emitting w1 … wn; bigram states <>, <t1>, <t2>, …, <tn> vs. trigram states <,>, <,t1>, <t1,t2>, …, <tn-1,tn>) • States encode what is relevant about the past • Transitions P(s|s’) encode well-formed tag sequences • In a bigram tagger, states = tags • In a trigram tagger, states = tag pairs

  7. Estimating Transitions • Use standard smoothing methods to estimate transitions: • Can get a lot fancier (e.g. KN smoothing) or use higher orders, but in this case it doesn’t buy much • One option: encode more into the state, e.g. whether the previous word was capitalized (Brants 00) • BIG IDEA: The basic approach of state-splitting turns out to be very important in a range of tasks
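The transition estimate itself is not in the transcript (it was a figure); a hedged reconstruction of the standard interpolated form used by taggers like TnT:

    \hat{P}(t_i \mid t_{i-1}, t_{i-2}) = \lambda_3 \tilde{P}(t_i \mid t_{i-1}, t_{i-2}) + \lambda_2 \tilde{P}(t_i \mid t_{i-1}) + \lambda_1 \tilde{P}(t_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1

where the \tilde{P} are relative-frequency estimates from the training corpus and the \lambda are tuned on held-out data.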

  8. Estimating Emissions • Emissions are trickier: • Words we’ve never seen before • Words which occur with tags we’ve never seen them with • One option: break out the Good-Turing smoothing • Issue: unknown words aren’t black boxes: • Basic solution: unknown word classes (affixes or shapes), e.g. 343,127.23 → D+,D+.D+, 11-year → D+-x+, Minteria → Xx+, reintroducibly → x+-“ly” • [Brants 00] used a suffix trie as its emission model
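A minimal sketch (my code, not the lecture's) of the kind of shape classes just listed; the exact class inventory and the special-case "-ly" class are assumptions for illustration.

    import re

    def word_shape(word):
        """Map a word to a coarse shape class, e.g. 343,127.23 -> 'D+,D+.D+',
        11-year -> 'D+-x+', Minteria -> 'Xx+' (hypothetical inventory)."""
        if word.lower().endswith("ly"):
            return 'x+-"ly"'
        shape = re.sub(r"[A-Z]+", "X", word)   # collapse runs of uppercase letters
        shape = re.sub(r"[a-z]+", "x", shape)  # collapse runs of lowercase letters
        shape = re.sub(r"[0-9]+", "D", shape)  # collapse runs of digits
        return re.sub(r"([xD])", r"\1+", shape)  # mark repeatable runs with '+'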

  9. Disambiguation (Inference) • Problem: find the most likely (Viterbi) sequence under the model • Given model parameters, we can score any tag sequence: e.g. for Fed raises interest rates 0.5 percent . , the trigram-state sequence <,> <,NNP> <NNP,VBZ> <VBZ,NN> <NN,NNS> <NNS,CD> <CD,NN> <STOP> is scored as P(NNP|<,>) P(Fed|NNP) P(VBZ|<,NNP>) P(raises|VBZ) P(NN|<NNP,VBZ>) … • Example candidates: NNP VBZ NN NNS CD NN (log P = -23), NNP VBZ VB NNS CD NN (log P = -27), NNP NNS NN NNS CD NN (log P = -29) • In principle, we’re done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence)

  10. Finding the Best Trajectory • Too many trajectories (state sequences) to list • Option 1: Beam Search • A beam is a set of partial hypotheses • Start with just the single empty trajectory <> • At each derivation step: • Consider all continuations of previous hypotheses (e.g. <> extends to Fed:NNP, Fed:VBN, Fed:VBD; Fed:NNP extends to Fed:NNP raises:NNS, Fed:NNP raises:VBZ; …) • Discard most, keep top k, or those within a factor of the best • Beam search works OK in practice • … but sometimes you want the optimal answer • … and you need optimal answers to validate your beam search • … and there’s usually a better option than naïve beams
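A minimal beam-search sketch over tag hypotheses (my illustration; the tag set and the incremental scoring function score_fn are assumed inputs):

    def beam_search(words, tags, score_fn, k=5):
        """Keep only the top-k partial tag sequences at each position.
        score_fn(prev_tags, tag, i, words) returns an incremental log score."""
        beam = [((), 0.0)]  # (partial tag tuple, log score)
        for i, _ in enumerate(words):
            candidates = []
            for prev_tags, prev_score in beam:
                for tag in tags:
                    candidates.append((prev_tags + (tag,),
                                       prev_score + score_fn(prev_tags, tag, i, words)))
            beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        return beam[0]  # best surviving hypothesis (no optimality guarantee)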

  11. The State Lattice / Trellis • (Diagram: trellis with candidate states ^, N, V, J, D, $ at each position, from START over Fed raises interest rates to END)

  12. The State Lattice / Trellis • (Same trellis diagram as the previous slide)

  13. The Viterbi Algorithm • Dynamic program for computing • The score of a best path up to position i ending in state s • Also can store a backtrace (but no one does) • Memoized solution • Iterative solution
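The recursion the slide refers to, as a minimal bigram-HMM sketch (mine, not the lecture's code); log_trans and log_emit are assumed log-probability tables indexed by tag and word.

    def viterbi(words, tags, log_trans, log_emit, start="<s>", stop="</s>"):
        """delta[i][s] = score of the best tag path for words[:i+1] ending in state s."""
        delta = [{t: log_trans[start][t] + log_emit[t][words[0]] for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            delta.append({})
            back.append({})
            for t in tags:
                best_prev = max(tags, key=lambda p: delta[i - 1][p] + log_trans[p][t])
                delta[i][t] = (delta[i - 1][best_prev] + log_trans[best_prev][t]
                               + log_emit[t][words[i]])
                back[i][t] = best_prev
        # finish with the transition into the stop state, then follow backpointers
        last = max(tags, key=lambda t: delta[-1][t] + log_trans[t][stop])
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))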

  14. So How Well Does It Work? • Choose the most common tag: • 90.3% with a bad unknown word model • 93.7% with a good one • TnT (Brants, 2000): • A carefully smoothed trigram tagger • Suffix trees for emissions • 96.7% on WSJ text (state of the art is ~97.5%) • Noise in the data • Many errors in the training and test corpora, e.g. chief executive officer annotated variously as JJ JJ NN, NN JJ NN, JJ NN NN, and NN NN NN; The average of interbank offered rates plummeted … tagged DT NN IN NN VBD NNS VBD • Probably about 2% guaranteed error from noise (on this data)

  15. Overview: Accuracies • Roadmap of (known / unknown) accuracies: • Most freq tag: ~90% / ~50% • Trigram HMM: ~95% / ~55% • TnT (HMM++): 96.2% / 86.0% • Maxent P(t|w): 93.7% / 82.6% • MEMM tagger: 96.9% / 86.9% • Cyclic tagger: 97.2% / 89.0% • Upper bound: ~98% Most errors on unknown words

  16. Common Errors • Common errors [from Toutanova & Manning 00]: • official knowledge (NN/JJ NN) • made up the story (VBD RP/IN DT NN) • recently sold shares (RB VBD/VBN NNS)

  17. Corpus-Based MT • Modeling correspondences between languages • Sentence-aligned parallel corpus, e.g.: Yo lo haré mañana / I will do it tomorrow; Hasta pronto / See you soon • Model of translation • Machine translation system: given the novel sentence Yo lo haré pronto, candidate outputs include I will do it soon, I will do it around, See you tomorrow, See you around

  18. Phrase-Based Systems • Pipeline: sentence-aligned corpus → word alignments → phrase table (translation model) • Example phrase table entries: cat ||| chat ||| 0.9; the cat ||| le chat ||| 0.8; dog ||| chien ||| 0.8; house ||| maison ||| 0.6; my house ||| ma maison ||| 0.9; language ||| langue ||| 0.9; … • Many slides and examples from Philipp Koehn or John DeNero

  19. Phrase-Based Decoding • Example input: 这 7人 中包括 来自 法国 和 俄罗斯 的 宇航 员 . (“These 7 people include astronauts coming from France and Russia.”) • Decoder design is important: [Koehn et al. 03]

  20. The Pharaoh “Model” [Koehn et al, 2003] Segmentation Translation Distortion
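The model equation is missing from the transcript; a hedged reconstruction of the phrase-based decomposition of Koehn et al. (2003) into segmentation, translation, and distortion:

    P(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i) \, d(a_i - b_{i-1})

where the source sentence is segmented into phrases \bar{f}_i, each phrase is translated with probability \phi(\bar{f}_i \mid \bar{e}_i), and d penalizes distortion between the start a_i of the current source phrase and the end b_{i-1} of the previous one; a language model over the English output is multiplied in as well.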

  21. The Pharaoh “Model” Where do we get these counts?

  22. Phrase Weights
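The slide content did not survive extraction; the usual answer to the previous slide's question ("Where do we get these counts?"), offered here as an assumption, is relative frequency over extracted phrase pairs:

    \phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}'} \mathrm{count}(\bar{e}, \bar{f}')}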

  23. Phrase-Based Decoding

  24. Monotonic Word Translation • Cost is LM * TM • It’s an HMM? • Transitions: P(e|e-1,e-2) • Emissions: P(f|e) • State includes: exposed English, position in foreign • Example: hypothesis […. a slap, 5] with score 0.00001 extends to […. slap to, 6] with score 0.00000016 (translation a ↔ to, 0.8; LM “a slap to”, 0.02) or to […. slap by, 6] with score 0.00000001 (translation a ↔ by, 0.1; LM “a slap by”, 0.01) • Dynamic program loop?
    for (fPosition in 1…|f|)
      for (eContext in allEContexts)
        for (eOption in translations[fPosition])
          score = scores[fPosition-1][eContext] * LM(eOption | eContext) * TM(eOption, fWord[fPosition])
          scores[fPosition][eContext[2]+eOption] = max(scores[fPosition][eContext[2]+eOption], score)
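A runnable toy version of that loop (my sketch; lm, tm, and translations are assumed to be small dictionaries), keeping the last two English words as the exposed context:

    def monotone_decode(f_words, translations, lm, tm):
        """Monotonic word-by-word decoding.
        State = exposed English bigram; lm[(e2, e1, e)] and tm[(f, e)] are
        assumed probability tables. Backpointers are omitted for brevity."""
        scores = {("<s>", "<s>"): 1.0}  # best score for each exposed context
        for f in f_words:
            new_scores = {}
            for (e2, e1), prev in scores.items():
                for e in translations[f]:
                    p = prev * lm.get((e2, e1, e), 1e-9) * tm.get((f, e), 1e-9)
                    key = (e1, e)  # keep only the exposed English bigram
                    if p > new_scores.get(key, 0.0):
                        new_scores[key] = p
            scores = new_scores
        return max(scores.items(), key=lambda kv: kv[1])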

  25. Beam Decoding • For real MT models, this kind of dynamic program is a disaster (why?) • Standard solution is beam search: for each position, keep track of only the best k hypotheses
    for (fPosition in 1…|f|)
      for (eContext in bestEContexts[fPosition-1])
        for (eOption in translations[fPosition])
          score = scores[fPosition-1][eContext] * LM(eOption | eContext) * TM(eOption, fWord[fPosition])
          bestEContexts[fPosition].maybeAdd(eContext[2]+eOption, score)
• Still pretty slow… why? • Useful trick: cube pruning (Chiang 2005) • Example from David Chiang

  26. Phrase Translation • If monotonic, almost an HMM; technically a semi-HMM:
    for (fPosition in 1…|f|)
      for (lastPosition < fPosition)
        for (eContext in eContexts)
          for (eOption in translations[fPosition])
            … combine hypothesis for (lastPosition ending in eContext) with eOption
• If distortion… now what?

  27. Non-Monotonic Phrasal MT

  28. Pruning: Beams + Forward Costs • Problem: easy partial analyses are cheaper • Solution 1: use beams per foreign subset • Solution 2: estimate forward costs (A*-like)

  29. The Pharaoh Decoder

  30. Hypothesis Lattices

  31. Better Features • Can do surprisingly well just looking at a word by itself: • Word: the → DT • Lowercased word: Importantly → importantly → RB • Prefixes: unfathomable → un- → JJ • Suffixes: Surprisingly → -ly → RB • Capitalization: Meridian → CAP → NNP • Word shapes: 35-year → d-x → JJ • Then build a maxent (or whatever) model to predict the tag • Maxent P(t|w): 93.7% / 82.6%
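A minimal sketch of this kind of word-level feature extraction (my code; the feature names and exact inventory are assumptions, not the lecture's):

    def word_features(word):
        """Features of the word in isolation, in the spirit of the list above."""
        return {
            "word=" + word: 1,
            "lower=" + word.lower(): 1,
            "prefix3=" + word[:3]: 1,
            "suffix3=" + word[-3:]: 1,
            "is_capitalized": int(word[:1].isupper()),
            "has_digit": int(any(c.isdigit() for c in word)),
            "has_hyphen": int("-" in word),
        }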

  32. Why Linear Context is Useful • Lots of rich local information! • Example: They left as soon as he arrived . (the first “as” gets IN vs. RB: PRP VBD IN/RB RB IN PRP VBD .); we could fix this with a feature that looked at the next word • Example: Intrinsic flaws remained undetected . (“Intrinsic” gets NNP vs. JJ: NNP/JJ NNS VBD VBN .); we could fix this by linking capitalized words to their lowercase versions • Solution: discriminative sequence models (MEMMs, CRFs) • Reality check: • Taggers are already pretty good on WSJ journal text… • What the world needs is taggers that work on other text! • Though: other tasks like IE have used the same methods to good effect

  33. Sequence-Free Tagging? • What about looking at a word and its environment, but no sequence information? • Add in previous / next word: the __ • Previous / next word shapes: X __ X • Occurrence pattern features: [X: x X occurs] • Crude entity detection: __ ….. (Inc.|Co.) • Phrasal verb in sentence?: put …… __ • Conjunctions of these things • All features except sequence: 96.6% / 86.8% • Uses lots of features: > 200K • Why isn’t this the standard approach?

  34. Feature-Rich Sequence Models • Problem: HMMs make it hard to work with arbitrary features of a sentence • Example: named entity recognition (NER): Tim/PER Boon/PER has/O signed/O a/O contract/O extension/O with/O Leicestershire/ORG which/O will/O keep/O him/O at/O Grace/LOC Road/LOC ./O

  35. MEMM Taggers • Idea: left-to-right local decisions, condition on previous tags and also entire input • Train up P(ti|w,ti-1,ti-2) as a normal maxent model, then use to score sequences • This is referred to as an MEMM tagger [Ratnaparkhi 96] • Beam search effective! (Why?) • What about beam size 1?
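A greedy (beam size 1) left-to-right sketch of this idea (my code); p_local stands in for a trained maxent classifier P(t_i | w, t_{i-1}, t_{i-2}) and is an assumed input.

    def memm_tag(words, tags, p_local):
        """Greedy left-to-right MEMM decoding: pick the best tag at each position,
        conditioning on the whole input and the two previous predicted tags."""
        history = []
        for i in range(len(words)):
            prev2 = tuple(history[-2:])  # (t_{i-2}, t_{i-1}), shorter at the start
            best = max(tags, key=lambda t: p_local(t, words, i, prev2))
            history.append(best)
        return history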

  36. Decoding • Decoding MEMM taggers: • Just like decoding HMMs, different local scores • Viterbi, beam search, posterior decoding • Viterbi algorithm (HMMs): • Viterbi algorithm (MEMMs): • General:
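The recurrences referred to by the last three bullets are missing from the transcript (they were images); a hedged reconstruction of the standard forms, with delta_i(s) the best score of a path ending in state s at position i:

    HMM:     \delta_i(s) = \max_{s'} \delta_{i-1}(s') \, P(s \mid s') \, P(w_i \mid s)
    MEMM:    \delta_i(s) = \max_{s'} \delta_{i-1}(s') \, P(s \mid s', \mathbf{w}, i)
    General: \delta_i(s) = \max_{s'} \delta_{i-1}(s') \, \phi_i(s', s, \mathbf{w})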

  37. Maximum Entropy II • Remember: maximum entropy objective • Problem: lots of features allow perfect fit to training set • Regularization (compare to smoothing)
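The objective and regularizer themselves are not in the transcript; my reconstruction of the standard L2-regularized maxent form they presumably refer to:

    \max_{\lambda} \; \sum_{i} \log P(y_i \mid x_i; \lambda) - \sum_{n} \frac{\lambda_n^2}{2\sigma^2},
    \qquad
    P(y \mid x; \lambda) = \frac{\exp \sum_n \lambda_n f_n(x, y)}{\sum_{y'} \exp \sum_n \lambda_n f_n(x, y')}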

  38. Derivative for Maximum Entropy • (Annotated formula: the total count of feature n in correct candidates, minus the expected count of feature n in predicted candidates, minus a penalty term because big weights are bad)
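Putting the three annotations together, the derivative being labeled is presumably the standard one (my reconstruction):

    \frac{\partial}{\partial \lambda_n}
    = \underbrace{\sum_i f_n(x_i, y_i)}_{\text{total count in correct candidates}}
    - \underbrace{\sum_i \sum_{y} P(y \mid x_i; \lambda) \, f_n(x_i, y)}_{\text{expected count in predicted candidates}}
    - \underbrace{\frac{\lambda_n}{\sigma^2}}_{\text{big weights are bad}}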

  39. Example: NER Regularization • (Table of local context and feature weights omitted) • Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.

  40. Perceptron Taggers [Collins 01] • Linear models: • … that decompose along the sequence • … allow us to predict with the Viterbi algorithm • … which means we can train with the perceptron algorithm (or related updates, like MIRA)
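A minimal sketch of the structured-perceptron training step described here (my code; viterbi_decode and the sequence-level feature extractor features(words, tags) are assumed to exist):

    def perceptron_update(weights, words, gold_tags, viterbi_decode, features):
        """One structured-perceptron step: decode with current weights, then
        add gold features and subtract predicted features on a mistake."""
        predicted = viterbi_decode(words, weights)
        if predicted != gold_tags:
            for feat, count in features(words, gold_tags).items():
                weights[feat] = weights.get(feat, 0.0) + count
            for feat, count in features(words, predicted).items():
                weights[feat] = weights.get(feat, 0.0) - count
        return weights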

  41. Conditional Random Fields • Make a maxent model over entire taggings • MEMM • CRF
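The two formulas on this slide did not survive extraction; a hedged reconstruction of the usual contrast, with features f over adjacent tag pairs and the input:

    MEMM: P(\mathbf{t} \mid \mathbf{w}) = \prod_i \frac{\exp(\lambda \cdot f(t_{i-1}, t_i, \mathbf{w}, i))}{\sum_{t'} \exp(\lambda \cdot f(t_{i-1}, t', \mathbf{w}, i))}
    CRF:  P(\mathbf{t} \mid \mathbf{w}) = \frac{\exp\left(\sum_i \lambda \cdot f(t_{i-1}, t_i, \mathbf{w}, i)\right)}{\sum_{\mathbf{t}'} \exp\left(\sum_i \lambda \cdot f(t'_{i-1}, t'_i, \mathbf{w}, i)\right)}

The MEMM normalizes locally at each position, while the CRF normalizes once over all complete taggings.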

  42. CRFs • Like any maxent model, derivative is: • So all we need is to be able to compute the expectation of each feature (for example the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs) • Critical quantity: counts of posterior marginals:

  43. Computing Posterior Marginals • How many (expected) times is word w tagged with s? • How to compute that marginal? • (Diagram: trellis with candidate states ^, N, V, J, D, $ at each position, from START over Fed raises interest rates to END)
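The answer is the forward-backward recursion; a hedged reconstruction with edge potentials phi (transition times emission for an HMM, exponentiated feature scores for a CRF):

    P(s_i = s \mid \mathbf{w}) = \frac{\alpha_i(s)\, \beta_i(s)}{\sum_{s'} \alpha_i(s')\, \beta_i(s')},
    \qquad
    \alpha_i(s) = \sum_{s'} \alpha_{i-1}(s')\, \phi_i(s', s), \quad
    \beta_i(s) = \sum_{s'} \phi_{i+1}(s, s')\, \beta_{i+1}(s')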

  44. TBL Tagger • [Brill 95] presents a transformation-based tagger • Label the training set with most frequent tags: The/DT can/MD was/VBD rusted/VBD ./. • Add transformation rules which reduce training mistakes • MD → NN : DT __ • VBD → VBN : VBD __ . • Stop when no transformations do sufficient good • Does this remind anyone of anything? • Probably the most widely used tagger (esp. outside NLP) • … but definitely not the most accurate: 96.6% / 82.0%

  45. TBL Tagger II • What gets learned? [from Brill 95]

  46. EngCG Tagger • English constraint grammar tagger • [Tapanainen and Voutilainen 94] • Something else you should know about • Hand-written and knowledge driven • “Don’t guess if you know” (general point about modeling more structure!) • Tag set doesn’t make all of the hard distinctions that the standard tag set does (e.g. JJ/NN) • They get stellar accuracies: 99% on their tag set • Linguistic representation matters… • … but it’s easier to win when you make up the rules

  47. Domain Effects • Accuracies degrade outside of domain • Up to triple error rate • Usually make the most errors on the things you care about in the domain (e.g. protein names) • Open questions • How to effectively exploit unlabeled data from a new domain (what could we gain?) • How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)
