
LING/C SC 581: Advanced Computational Linguistics

Dive into the advanced realm of computational linguistics by exploring N-grams, language models, and text analysis techniques. Learn about corpus frequency information, minimum edit distance algorithms, and the importance of smoothing for accurate language modeling.

laskowski

Presentation Transcript


  1. LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 22nd

  2. Today's Topics • Minimum Edit Distance Homework • Corpora: frequency information • tregex

  3. Minimum Edit Distance Homework • Background: • … about 20% of the time “Britney Spears” is misspelled when people search for it on Google • Software for generating misspellings • If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings. • http://www.geneffects.com/typopositive/

  4. Minimum Edit Distance Homework • http://www.google.com/jobs/archive/britney.html • Top six misspellings • Design a minimum edit algorithm that ranks these misspellings (as accurately as possible): • e.g. ED(brittany) < ED(britany)
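A minimal sketch of a weighted minimum edit distance, as one possible starting point for the ranking task above. The function names and the default weights (insertion 1, deletion 1, substitution 2) are assumptions for illustration, not the weights the homework expects.

```python
def weighted_edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=2.0):
    """Minimum edit distance from source to target by dynamic programming."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    # First column: delete every source character.
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    # First row: insert every target character.
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                        # deletion
                dist[i][j - 1] + ins_cost,                        # insertion
                dist[i - 1][j - 1] + (0.0 if same else sub_cost)  # match or substitution
            )
    return dist[-1][-1]


def rank_by_distance(correct, candidates, **weights):
    """Sort candidate misspellings from closest to farthest (hypothetical helper)."""
    return sorted(candidates, key=lambda w: weighted_edit_distance(correct, w, **weights))


ranking = rank_by_distance("britney", ["brittany", "britany"])
```

With these uniform default weights, britany comes out closer to britney than brittany does, which is the opposite of the target ordering ED(brittany) < ED(britany). Reproducing the observed ranking would need different criteria, for example character-specific weights.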

  5. Minimum Edit Distance Homework • Submit your homework in PDF • how many you got right • explain the criteria you chose, e.g. weights • you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well • due by email to me before next Thursday class… • put your name and 581 at the top of your submission

  6. Part 2 • Corpora: frequency information • Unlabeled corpus: just words • Labeled corpus: various kinds … • POS information • Information about phrases • Word sense or Semantic role labeling • (ordered from easy to find to progressively harder to create or obtain)

  7. Language Models and N-grams • given a word sequence • w1 w2 w3 ... wn • chain rule • how to compute the probability of a sequence of words • p(w1 w2) = p(w1) p(w2|w1) • p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2) • ... • p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2)... p(wn|w1...wn-2 wn-1) • note • It’s not easy to collect (meaningful) statistics on p(wn|wn-1wn-2...w1) for all possible word sequences

  8. Language Models and N-grams • Given a word sequence • w1 w2 w3 ... wn • Bigram approximation • just look at the previous word only (not all the preceding words) • Markov Assumption: finite length history • 1st order Markov Model • p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ...p(wn|w1...wn-3wn-2wn-1) • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1) • note • p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2 wn-1)

  9. Language Models and N-grams • Trigram approximation • 2nd order Markov Model • just look at the preceding two words only • p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3)...p(wn|w1...wn-3wn-2wn-1) • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3)...p(wn|wn-2 wn-1) • note • p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1) but harder than p(wn|wn-1)

  10. Language Models and N-grams • estimating from corpora • how to compute bigram probabilities • p(wn|wn-1) = f(wn-1wn)/Σw f(wn-1w), where w ranges over all words • Since Σw f(wn-1w) = f(wn-1), where f(wn-1) = unigram frequency for wn-1 • p(wn|wn-1) = f(wn-1wn)/f(wn-1) relative frequency • Note: • The technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
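The relative-frequency estimate above can be sketched in a few lines of Python. The three-sentence toy corpus, the `<s>` and `</s>` boundary markers, and all function names are assumptions made for illustration; they are not part of the lecture.

```python
from collections import Counter


def count_ngrams(sentences):
    """Collect unigram and bigram frequencies, padding each sentence with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """MLE estimate p(word | prev) = f(prev word) / f(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def sentence_prob(sentence, unigrams, bigrams):
    """Bigram (1st order Markov) approximation of the probability of a whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word, unigrams, bigrams)
    return prob


# Toy corpus (an assumption for this sketch).
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
unigrams, bigrams = count_ngrams(corpus)
p_seen = sentence_prob("I am Sam", unigrams, bigrams)    # every bigram was observed
p_unseen = sentence_prob("Sam am I", unigrams, bigrams)  # contains an unobserved bigram
```

Here `p_unseen` is exactly zero because the bigram "Sam am" never occurs in the toy corpus, even though every individual word does.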

  11. Motivation for smoothing • Smoothing: avoid zero probability estimates • Consider • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1) • what happens when any individual probability component is zero? • Arithmetic multiplication law: 0×X = 0 • very brittle! • even in a very large corpus, many possible n-grams over vocabulary space will have zero frequency • particularly so for larger n-grams

  12. Language Models and N-grams • Example: • bigram frequencies f(wn-1wn) (table indexed by wn-1 and wn) • unigram frequencies • bigram probabilities • sparse matrix: zeros render probabilities unusable (we’ll need to add fudge factors, i.e. do smoothing)

  13. Smoothing and N-grams • sparse dataset means zeros are a problem • Zero probabilities are a problem • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1) (bigram model) • one zero and the whole product is zero • Zero frequencies are a problem • p(wn|wn-1) = f(wn-1wn)/f(wn-1) relative frequency • bigram f(wn-1wn) doesn’t exist in dataset • smoothing • refers to ways of assigning zero probability n-grams a non-zero value

  14. Smoothing and N-grams • Add-One Smoothing (4.5.1 Laplace Smoothing) • add 1 to all frequency counts • simple and no more zeros (but there are better methods) • unigram • p(w) = f(w)/N (before Add-One) • N = size of corpus • p(w) = (f(w)+1)/(N+V) (with Add-One) • f*(w) = (f(w)+1)*N/(N+V) (with Add-One) • V = number of distinct words in corpus • N/(N+V) normalization factor adjusting for the effective increase in the corpus size caused by Add-One • bigram • p(wn|wn-1) = f(wn-1wn)/f(wn-1) (before Add-One) • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) (after Add-One) • f*(wn-1 wn) = (f(wn-1 wn)+1)* f(wn-1) /(f(wn-1)+V) (after Add-One) must rescale so that total probability mass stays at 1
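The Add-One bigram formulas can be checked on a small example. This is a sketch under stated assumptions: the ten-word toy text and the helper names are invented for illustration, and f(wn-1) is counted as the number of times wn-1 occurs as a bigram history so that the smoothed probabilities sum to exactly 1.

```python
from collections import Counter

# Toy text (an assumption for this sketch).
tokens = "the cat sat on the mat and the cat ran".split()
pairs = list(zip(tokens, tokens[1:]))
bigrams = Counter(pairs)                          # f(wn-1 wn)
history = Counter(prev for prev, _ in pairs)      # f(wn-1), counted as a bigram history
vocab = sorted(set(tokens))
V = len(vocab)                                    # number of distinct words


def mle_prob(prev, word):
    """p(wn|wn-1) before Add-One."""
    return bigrams[(prev, word)] / history[prev]


def add_one_prob(prev, word):
    """p(wn|wn-1) = (f(wn-1 wn) + 1) / (f(wn-1) + V)."""
    return (bigrams[(prev, word)] + 1) / (history[prev] + V)


def adjusted_count(prev, word):
    """f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)."""
    return (bigrams[(prev, word)] + 1) * history[prev] / (history[prev] + V)


before = mle_prob("the", "cat")            # seen twice out of three histories
after = add_one_prob("the", "cat")
unseen = add_one_prob("the", "ran")        # zero frequency, now non-zero
total = sum(add_one_prob("the", w) for w in vocab)
```

Even in this tiny example the adjusted count for "the cat" drops from 2 to below 1, because V is large relative to f(the). That is the same effect the perturbation remarks on the following slides describe.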

  15. Smoothing and N-grams • Add-One Smoothing • add 1 to all frequency counts • bigram • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • f*(wn-1 wn) = (f(wn-1 wn)+1)* f(wn-1) /(f(wn-1)+V) • frequencies (tables: figure 6.4, figure 6.8) • Remarks: perturbation problem • add-one causes large changes in some frequencies due to relative size of V (1616) • want to: 786 → 338

  16. Smoothing and N-grams • Add-One Smoothing • add 1 to all frequency counts • bigram • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • f*(wn-1 wn) = (f(wn-1 wn)+1)* f(wn-1) /(f(wn-1)+V) • probabilities (tables: figure 6.5, figure 6.7) • Remarks: perturbation problem • similar changes in probabilities

  17. Smoothing and N-grams • let’s illustrate the problem • take the bigram case: • wn-1wn • p(wn|wn-1) = f(wn-1wn)/f(wn-1) • suppose there are cases • wn-1wzero1 that don’t occur in the corpus • [Diagram: probability mass f(wn-1), with seen bigrams f(wn-1wn) and unseen bigrams f(wn-1wzero1)=0 ... f(wn-1wzerom)=0]

  18. Smoothing and N-grams • add-one • “give everyone 1” • [Diagram: probability mass f(wn-1), with seen bigrams f(wn-1wn)+1 and previously unseen bigrams f(wn-1w01)=1 ... f(wn-1w0m)=1]

  19. Smoothing and N-grams • add-one • “give everyone 1” • redistribution of probability mass • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • V = |{wi}| • [Diagram: probability mass f(wn-1), with seen bigrams f(wn-1wn)+1 and previously unseen bigrams f(wn-1w01)=1 ... f(wn-1w0m)=1]

  20. Smoothing and N-grams • Good-Turing Discounting (4.5.2) • Nc = number of things (= n-grams) that occur c times in the corpus • N = total number of things seen • Formula: smoothed c for Nc given by c* = (c+1) × Nc+1 / Nc, where Nc+1 = number of things that occur c+1 times • Idea: use frequency of things seen once to estimate frequency of things we haven’t seen yet • estimate N0 in terms of N1… • and so on but if Nc=0, smooth that first using something like log(Nc) = a + b log(c) • Formula: P*(things with zero freq) = N1/N • smaller impact than Add-One • Textbook Example: • Fishing in lake with 8 species • bass, carp, catfish, eel, perch, salmon, trout, whitefish • Sample data (6 out of 8 species): • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel • P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17 • P(next fish=trout) = 1/18 • (but, we have reassigned probability mass, so need to recalculate this from the smoothing formula…) • revised count for trout: c*(trout) = 2*N2/N1 = 2(1/3) = 0.67 (discounted from 1) • revised P(next fish=trout) = 0.67/18 = 0.037
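The arithmetic of the fishing example can be reproduced directly from the sample counts. This is a sketch only: the variable and function names are assumptions, and it applies the raw Good-Turing formula without first smoothing the Nc values.

```python
from collections import Counter

# Sample data from the example: 18 fish, 6 of the 8 species observed.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}

N = sum(catch.values())                  # total number of things seen
freq_of_freq = Counter(catch.values())   # Nc: how many species were seen exactly c times


def good_turing_count(c):
    """Revised count c* = (c + 1) * N(c+1) / N(c); assumes N(c) > 0."""
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]


p_unseen = freq_of_freq[1] / N                   # N1/N, mass reserved for unseen species
c_star_trout = good_turing_count(catch["trout"])
p_trout = c_star_trout / N                       # revised P(next fish = trout)
```

Applying `good_turing_count` to perch (c = 3) gives 0 because no species was caught exactly four times, which is why the Nc values need smoothing before the formula is used for larger counts.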

  21. Language Models and N-grams • N-gram models + smoothing • one consequence of smoothing is that • every possible concatenation or sequence of words has a non-zero probability • N-gram models can also incorporate word classes, e.g. POS labels when available

  22. Language Models and N-grams • N-gram models • data is easy to obtain • any unlabeled corpus will do • they’re technically easy to compute • count frequencies and apply the smoothing formula • but just how good are these n-gram language models? • and what can they show us about language?

  23. Language Models and N-grams • Approximating Shakespeare • generate random sentences using n-grams • Corpus: Complete Works of Shakespeare • Unigram (pick random, unconnected words) • Bigram

  24. Language Models and N-grams • Approximating Shakespeare • generate random sentences using n-grams • Corpus: Complete Works of Shakespeare • Trigram • Quadrigram • Remarks: dataset size problem • training set is small • 884,647 words • 29,066 different words • 29,066² = 844,832,356 possible bigrams • for the random sentence generator, this means very limited choices for possible continuations, which means program can’t be very innovative for higher n
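A random sentence generator of the kind described above can be sketched for the bigram case. The two training sentences are invented placeholders (not Shakespeare), and the function names and the `<s>` and `</s>` markers are assumptions for illustration.

```python
import random
from collections import defaultdict


def build_bigram_table(sentences):
    """Map each word to the list of words that followed it (repeats preserve frequency)."""
    table = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            table[prev].append(word)
    return table


def generate(table, rng, max_words=20):
    """Sample one sentence: repeatedly pick a continuation of the previous word."""
    word, output = "<s>", []
    while len(output) < max_words:
        # Choosing uniformly from the list samples in proportion to bigram frequency.
        word = rng.choice(table[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


# Placeholder training text (an assumption for this sketch).
table = build_bigram_table(["the king is dead", "the queen is here"])
sentence = generate(table, random.Random(0))
```

With so little training data, each word has only one or two possible continuations, so the generator can do little more than recombine the training sentences. The same limitation shows up with higher-order n-grams on a corpus the size of Shakespeare.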

  25. Language Models and N-grams • A limitation: • produces ungrammatical sequences • Treebank: • potential to be a better language model • Structural information: • contains frequency information about syntactic rules • we should be able to generate sequences that are closer to English…

  26. Language Models and N-grams • Aside: http://hemispheresmagazine.com/contests/2004/intro.htm

  27. Part 3 tregex • I assume everyone has: • Installed Penn Treebank v3 • Downloaded and installed tregex

  28. Trees in the Penn Treebank • Directory: TREEBANK_3/parsed/mrg/ • Notation: LISP S-expression
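To make the S-expression notation concrete, here is a minimal sketch of reading a bracketed tree into nested Python lists. The example tree is invented for illustration and is not taken from the Treebank. The function names are assumptions, and the extra unlabeled outer parentheses that wrap trees in the .mrg files are not handled.

```python
def parse_sexp(text):
    """Parse one bracketed tree into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token                  # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                          # consume the closing ")"
        return node

    return read()


def leaves(tree):
    """Return the words at the leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:                # tree[0] is the node label
        words.extend(leaves(child))
    return words


# Invented example tree (an assumption for this sketch).
tree = parse_sexp("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
words = leaves(tree)
```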

  29. tregex • Search Example: << dominates, < immediately dominates

  30. tregex Help

  31. tregex Help

  32. tregex • Help: tregex expression syntax is non-standard wrt bracketing • S < VP • S < NP

  33. tregex • Help: tregex boolean syntax is also non-standard

  34. tregex • Help

  35. tregex • Help

  36. tregex • Pattern: • (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma) • both occurrences of =comma refer to the same node • Key: • <, first child • $+ immediate left sister • <- last child

  37. tregex • Help

  38. tregex

  39. tregex • Different results from: • @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))

  40. tregex Example: WHADVP also possible (not just WHNP)
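The two label regular expressions in the pattern on slide 39 can be tried out in isolation with Python's `re` module. This sketch only mimics the label matching and the shared index. It does not implement tregex's tree relations, and the function name and the sample labels are assumptions for illustration.

```python
import re

# The same regular expressions that appear between slashes in the tregex pattern.
WH_LABEL = re.compile(r"^WH.*-([0-9]+)$")     # e.g. WHNP-1, WHADVP-2
TRACE = re.compile(r"^\*T\*-([0-9]+)$")       # e.g. *T*-1


def same_index(wh_label, trace_leaf):
    """True if both strings match and their captured numeric indices are equal."""
    wh = WH_LABEL.match(wh_label)
    trace = TRACE.match(trace_leaf)
    return bool(wh and trace and wh.group(1) == trace.group(1))


matches_whnp = same_index("WHNP-1", "*T*-1")
matches_whadvp = same_index("WHADVP-2", "*T*-2")
mismatch = same_index("WHNP-1", "*T*-2")
```

Because the first expression only requires the label to start with WH, it accepts WHADVP as well as WHNP, which is consistent with the example on this slide.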

  41. Treebank Guides • Tagging Guide • Arpa94 paper • Parse Guide

  42. Treebank Guides • Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages) • tagguid2.pdf: addendum, see POS tag ‘TO’

  43. Treebank Guides • Parsing guide 1, prsguid1.pdf (318 pages) • prsguid2.pdf: addendum for the Switchboard corpus
