LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 22nd
Today's Topics • Minimum Edit Distance Homework • Corpora: frequency information • tregex
Minimum Edit Distance Homework • Background: • … about 20% of the time “Britney Spears” is misspelled when people search for it on Google • Software for generating misspellings • If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings. • http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework • http://www.google.com/jobs/archive/britney.html lists the top six misspellings • Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible): • e.g. ED(brittany) < ED(britany)
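A minimal sketch of a weighted minimum edit distance (Wagner–Fischer dynamic programming), as a starting point for the homework. The per-operation costs are placeholder assumptions; choosing weights that produce the desired ranking is the actual exercise, and only the two misspellings named on the slide are used in the demo.

```python
# Weighted minimum edit distance sketch (Wagner-Fischer dynamic programming).
# The cost values are illustrative placeholders; tuning them so that, e.g.,
# ED("brittany") < ED("britany") is the point of the homework.

def edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    n, m = len(source), len(target)
    # d[i][j] = cost of converting source[:i] into target[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,    # deletion
                          d[i][j - 1] + ins_cost,    # insertion
                          d[i - 1][j - 1] + sub)     # substitution (or match)
    return d[n][m]

if __name__ == "__main__":
    for misspelling in ["brittany", "britany"]:
        print(misspelling, edit_distance(misspelling, "britney"))
```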
Minimum Edit Distance Homework • Submit your homework in PDF • say how many you got right • explain your criteria, e.g. the weights chosen • you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well • due by email to me before next Thursday's class… • put your name and 581 at the top of your submission
Part 2 • Corpora: frequency information • Unlabeled corpus: just words (easy to find) • Labeled corpus: various kinds, progressively harder to create or obtain … • POS information • Information about phrases • Word sense or semantic role labeling
Language Models and N-grams • given a word sequence • w1 w2 w3 ... wn • chain rule • how to compute the probability of a sequence of words • p(w1 w2) = p(w1) p(w2|w1) • p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2) • ... • p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2)... p(wn|w1...wn-2 wn-1) • note • It’s not easy to collect (meaningful) statistics on p(wn|wn-1wn-2...w1) for all possible word sequences
Language Models and N-grams • Given a word sequence • w1 w2 w3 ... wn • Bigram approximation • just look at the previous word only (not all the preceding words) • Markov Assumption: finite length history • 1st order Markov Model • p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-3wn-2wn-1) • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1) • note • p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2wn-1)
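A toy sketch of the bigram approximation in code: the probability of a sequence is the product of per-word conditional probabilities. The probability table and the use of a `<s>` start marker for the first word are illustrative assumptions, not estimates from real data.

```python
# Bigram (1st-order Markov) approximation: p(w1...wn) ~ product of p(wi | wi-1).
# The probability table is made up for illustration only.

bigram_p = {("<s>", "the"): 0.2, ("the", "cat"): 0.01, ("cat", "sat"): 0.05}

def bigram_sequence_prob(words, bigram_p):
    prob = 1.0
    for prev, cur in zip(["<s>"] + words[:-1], words):
        prob *= bigram_p.get((prev, cur), 0.0)  # unseen bigram -> 0 (hence smoothing, below)
    return prob

print(bigram_sequence_prob(["the", "cat", "sat"], bigram_p))  # 0.2 * 0.01 * 0.05 = 1e-4
```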
Language Models and N-grams • Trigram approximation • 2nd order Markov Model • just look at the preceding two words only • p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3)...p(wn|w1...wn-3wn-2wn-1) • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3)...p(wn|wn-2wn-1) • note • p(wn|wn-2wn-1) is a lot easier to estimate well than p(wn|w1...wn-2wn-1) but harder than p(wn|wn-1)
Language Models and N-grams • estimating from corpora • how to compute bigram probabilities • p(wn|wn-1) = f(wn-1wn) / Σw f(wn-1w), where w ranges over all words • Since Σw f(wn-1w) = f(wn-1), the unigram frequency of wn-1 • p(wn|wn-1) = f(wn-1wn)/f(wn-1), a relative frequency • Note: • The technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
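A minimal sketch of MLE bigram estimation, using a made-up toy corpus (an assumption for illustration; any unlabeled corpus would do):

```python
from collections import Counter

# MLE bigram estimates over a toy corpus: p(wn|wn-1) = f(wn-1 wn) / f(wn-1).

corpus = "the cat sat on the mat the cat slept".split()

unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))

def p_mle(w, prev):
    # relative frequency; 0.0 if the bigram (or the context) was never seen
    return bigram_f[(prev, w)] / unigram_f[prev] if unigram_f[prev] else 0.0

print(p_mle("cat", "the"))  # f(the cat)/f(the) = 2/3
print(p_mle("dog", "the"))  # 0.0 -- the sparse-data problem the next slides address
```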
Motivation for smoothing • Smoothing: avoid zero probability estimates • Consider • what happens when any individual probability component is zero? • Arithmetic multiplication law: 0×X = 0 • very brittle! • even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency • particularly so for larger n-grams • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1)
Language Models and N-grams • Example (tables): wn-1wn bigram frequencies, wn-1 unigram frequencies, and the resulting bigram probabilities • the bigram table is a sparse matrix: the zeros render the probabilities unusable (we’ll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams • a sparse dataset means zeros are a problem • Zero probabilities are a problem • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2)...p(wn|wn-1) bigram model • one zero and the whole product is zero • Zero frequencies are a problem • p(wn|wn-1) = f(wn-1wn)/f(wn-1) relative frequency • bigram f(wn-1wn) doesn’t exist in the dataset • smoothing • refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams • Add-One Smoothing (4.5.1 Laplace Smoothing) • add 1 to all frequency counts • simple and no more zeros (but there are better methods) • unigram • p(w) = f(w)/N (before Add-One) • N = size of corpus • p(w) = (f(w)+1)/(N+V) (with Add-One) • f*(w) = (f(w)+1)*N/(N+V) (with Add-One) • V = number of distinct words in corpus • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One • bigram • p(wn|wn-1) = f(wn-1wn)/f(wn-1) (before Add-One) • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) (after Add-One) • f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V) (after Add-One) • must rescale so that total probability mass stays at 1
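A sketch of the Add-One formulas above applied to the same made-up toy corpus (an assumption for illustration); it also computes the reconstituted count f*, which previews the perturbation problem discussed on the next slides.

```python
from collections import Counter

# Add-One (Laplace) smoothed bigram estimates:
#   p(wn|wn-1)  = (f(wn-1 wn) + 1) / (f(wn-1) + V)
#   f*(wn-1 wn) = (f(wn-1 wn) + 1) * f(wn-1) / (f(wn-1) + V)

corpus = "the cat sat on the mat the cat slept".split()
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))
V = len(unigram_f)  # number of distinct word types

def p_addone(w, prev):
    return (bigram_f[(prev, w)] + 1) / (unigram_f[prev] + V)

def reconstituted_count(w, prev):
    return (bigram_f[(prev, w)] + 1) * unigram_f[prev] / (unigram_f[prev] + V)

print(p_addone("cat", "the"))            # (2+1)/(3+6) = 0.33 instead of the MLE 0.67
print(p_addone("dog", "the"))            # unseen bigram now gets a small non-zero probability
print(reconstituted_count("cat", "the")) # 1.0 -- far from the raw count 2 (perturbation)
```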
Smoothing and N-grams • Add-One Smoothing • add 1 to all frequency counts • bigram • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V) • frequencies (figures 6.4 and 6.8: bigram counts before and after Add-One) • Remarks: the perturbation problem: add-one causes large changes in some frequencies due to the relative size of V (1616), e.g. the count for “want to” goes from 786 to 338
Smoothing and N-grams • Add-One Smoothing • add 1 to all frequency counts • bigram • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • f*(wn-1wn) = (f(wn-1wn)+1)*f(wn-1)/(f(wn-1)+V) • probabilities (figures 6.5 and 6.7: bigram probabilities before and after Add-One) • Remarks: the perturbation problem again: similarly large changes in the probabilities
Smoothing and N-grams • let’s illustrate the problem • take the bigram case: • wn-1wn • p(wn|wn-1) = f(wn-1wn)/f(wn-1) • suppose there are cases wn-1wzero1 ... wn-1wzerom that don’t occur in the corpus, i.e. f(wn-1wzero1) = 0 ... f(wn-1wzerom) = 0 • (diagram: all of the probability mass f(wn-1wn)/f(wn-1) goes to the seen bigrams, and none to the unseen ones)
Smoothing and N-grams • add-one • “give everyone 1” • (diagram: every bigram count becomes f(wn-1wn)+1, so the previously unseen bigrams now have f(wn-1w01) = 1 ... f(wn-1w0m) = 1)
Smoothing and N-grams • add-one • “give everyone 1” • V = |{wi}|, the vocabulary size • redistribution of probability mass • p(wn|wn-1) = (f(wn-1wn)+1)/(f(wn-1)+V) • (diagram: probability mass is redistributed from the seen bigrams f(wn-1wn)+1 to the previously unseen ones f(wn-1w01) = 1 ... f(wn-1w0m) = 1)
Smoothing and N-grams • Good-Turing Discounting (4.5.2) • Nc = number of things (= n-grams) that occur c times in the corpus • N = total number of things seen • Formula: smoothed count c* given by c* = (c+1)Nc+1/Nc • Idea: use the frequency of things seen once to estimate the frequency of things we haven’t seen yet • estimate N0 in terms of N1… • and so on; but if Nc = 0 for some c, smooth the Nc values first using something like log(Nc) = a + b log(c) • Formula: P*(things with zero frequency) = N1/N • smaller impact than Add-One • Textbook Example: • Fishing in a lake with 8 species • bass, carp, catfish, eel, perch, salmon, trout, whitefish • Sample data (6 out of 8 species seen): • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel • P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17 • P(next fish = trout) = 1/18 • (but we have reassigned probability mass, so this needs to be recalculated from the smoothing formula…) • revised count for trout: c*(trout) = 2·N2/N1 = 2(1/3) = 0.67 (discounted from 1) • revised P(next fish = trout) = 0.67/18 = 0.037
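A short sketch reproducing the fishing example's Good-Turing arithmetic, just to check the numbers on the slide:

```python
from collections import Counter

# Good-Turing: c* = (c+1) * N_{c+1} / N_c, and P*(unseen) = N1 / N.

sample = ["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2 + ["trout", "salmon", "eel"]
N = len(sample)                         # 18 fish caught
species_counts = Counter(sample)
Nc = Counter(species_counts.values())   # N1 = 3, N2 = 1, N3 = 1, N10 = 1

p_unseen = Nc[1] / N                    # 3/18 ~ 0.17, shared by bass and catfish
c_star_trout = (1 + 1) * Nc[2] / Nc[1]  # 2 * (1/3) ~ 0.67, discounted from 1
p_trout = c_star_trout / N              # ~ 0.037

print(p_unseen, c_star_trout, p_trout)
# Note: for larger c, N_{c+1} is often 0 here (e.g. N4), which is why the slide
# suggests smoothing the Nc values themselves, e.g. log(Nc) = a + b*log(c).
```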
Language Models and N-grams • N-gram models + smoothing • one consequence of smoothing is that • every possible concatenation or sequence of words has a non-zero probability • N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams • N-gram models • data is easy to obtain • any unlabeled corpus will do • they’re technically easy to compute • count frequencies and apply the smoothing formula • but just how good are these n-gram language models? • and what can they show us about language?
Language Models and N-grams • Approximating Shakespeare • generate random sentences using n-grams • Corpus: Complete Works of Shakespeare • Unigram (pick random, unconnected words) • Bigram
Language Models and N-grams • Approximating Shakespeare • generate random sentences using n-grams • Corpus: Complete Works of Shakespeare • Trigram • Quadrigram • Remarks: dataset size problem: the training set is small (884,647 words, 29,066 different words, so 29,066² = 844,832,356 possible bigrams) • for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
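A minimal sketch of the random-sentence-generator idea: sample each next word in proportion to bigram counts. The `tokens` list is a tiny stand-in assumption; the lecture's version was trained on the complete works of Shakespeare.

```python
import random
from collections import Counter, defaultdict

# Sample a "sentence" from bigram counts built over any tokenized corpus.

tokens = "the cat sat on the mat and the cat slept on the mat".split()  # stand-in corpus

successors = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    successors[prev][cur] += 1

def generate(start, length=10):
    out, word = [start], start
    for _ in range(length - 1):
        nxt = successors.get(word)
        if not nxt:                # dead end: no observed continuation
            break
        words, weights = zip(*nxt.items())
        word = random.choices(words, weights=weights)[0]  # sample proportional to counts
        out.append(word)
    return " ".join(out)

print(generate("the"))
```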
Language Models and N-grams • A limitation: • produces ungrammatical sequences • Treebank: • potential to be a better language model • Structural information: • contains frequency information about syntactic rules • we should be able to generate sequences that are closer to English…
Language Models and N-grams • Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3 tregex • I assume everyone has: • Installed Penn Treebank v3 • Downloaded and installed tregex
Trees in the Penn Treebank • Directory: TREEBANK_3/parsed/mrg/ • Notation: LISP S-expression
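A small sketch of what the S-expression notation looks like in practice, assuming NLTK is available (an assumption; the slides do not mention it). The tree string is a made-up example, not actual Treebank data, since TREEBANK_3 is licensed separately.

```python
from nltk import Tree

# .mrg files under TREEBANK_3/parsed/mrg/ are Lisp S-expressions;
# NLTK's Tree class can parse that notation.  Example tree is made up.

s_expr = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
tree = Tree.fromstring(s_expr)

tree.pretty_print()                           # ASCII rendering of the tree
print(tree.leaves())                          # the terminal words
print([t.label() for t in tree.subtrees()])   # nonterminal labels (S, NP, VP, ...)
```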
tregex • Search Example: << dominates, < immediately dominates
tregex Help
tregex • Help: tregex expression syntax is non-standard with respect to bracketing: multiple relations all constrain the first node, so S < VP < NP is read as (S < VP) and (S < NP), not S < (VP < NP)
tregex • Help: tregex boolean syntax is also non-standard
tregex • Help
tregex • same node: =comma names a node so that the later occurrence of =comma refers back to that same node • Pattern: • (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma) • Key: <, first child; $+ immediate left sister; <- last child
tregex • Help
tregex • Different results from: • @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
tregex • Example: WHADVP is also possible (not just WHNP)
Treebank Guides • Tagging Guide • Arpa94 paper • Parsing Guide
Treebank Guides • Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages): tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides • Parsing guide 1, prsguid1.pdf (318 pages): prsguid2.pdf: addendum for the Switchboard corpus