CS 595-052 Machine Learning and Statistical Natural Language Processing Prof. Shlomo Argamon, argamon@iit.edu Room: 237C Office Hours: Mon 3-4 PM Book: Statistical Natural Language Processing, C. D. Manning and H. Schütze Requirements: • Several programming projects • Research Proposal
Machine Learning (flow diagram): Training Examples → Learning Algorithm → Learned Model; Test Examples + Learned Model → Classification/Labeling Results
Modeling • Decide how to represent learned models: • Decision rules • Linear functions • Markov models • … • Type chosen affects generalization accuracy (on new data)
Example Representation • Set of Features: • Continuous • Discrete (ordered and unordered) • Binary • Sets vs. Sequences • Classes: • Continuous vs. discrete • Binary vs. multivalued • Disjoint vs. overlapping
Learning Algorithms • Find a “good” hypothesis “consistent” with the training data • Many hypotheses may be consistent, so we may need a “preference bias” • No hypothesis may be consistent, so we may need to find a “nearly” consistent one • May rule out some hypotheses to start with: • Feature reduction
Estimating Generalization Accuracy • Accuracy on the training data says nothing about new examples! • Must train and test on different example sets • Estimate generalization accuracy over multiple train/test divisions • Sources of estimation error: • Bias: Systematic error in the estimate • Variance: How much the estimate changes between different runs
Cross-validation • Divide the training data into k sets • For each set: • Train on the remaining k-1 sets • Test on the held-out set • Average the k accuracies (and compute statistics)
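A minimal sketch of k-fold cross-validation (not from the slides; train_fn and accuracy_fn stand in for whatever learner and evaluation are used):

import random

def cross_validate(examples, train_fn, accuracy_fn, k=10):
    """Estimate generalization accuracy by k-fold cross-validation."""
    examples = list(examples)
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]          # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train)                         # train on the other k-1 folds
        accuracies.append(accuracy_fn(model, test))     # test on the held-out fold
    return sum(accuracies) / k, accuracies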
Bootstrapping For a corpus of n examples: • Choose n examples randomly (with replacement) Note: We expect ~0.632·n distinct examples • Train the model, and evaluate: • acc0 = accuracy of the model on the non-chosen examples • accS = accuracy of the model on the n training examples • Estimate accuracy as 0.632·acc0 + 0.368·accS • Average the accuracies over b different runs Also note: there are other similar bootstrapping techniques
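A sketch of the 0.632 bootstrap estimate described above (again, train_fn and accuracy_fn are hypothetical placeholders):

import random

def bootstrap_632(examples, train_fn, accuracy_fn, b=50):
    """Average the 0.632 bootstrap accuracy estimate over b resamples."""
    n = len(examples)
    estimates = []
    for _ in range(b):
        sample = [random.choice(examples) for _ in range(n)]   # n draws with replacement
        chosen = {id(ex) for ex in sample}
        unseen = [ex for ex in examples if id(ex) not in chosen]
        model = train_fn(sample)
        acc0 = accuracy_fn(model, unseen)    # accuracy on the non-chosen examples
        accS = accuracy_fn(model, sample)    # accuracy on the n training examples
        estimates.append(0.632 * acc0 + 0.368 * accS)
    return sum(estimates) / b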
Bootstrapping vs. Cross-validation • Cross-validation: • Equal participation of all examples • Class distribution in each test set depends on the distribution in the corresponding training sets • Stratified cross-validation: equalize class distributions • Bootstrap: • Often has higher bias (fewer distinct examples) • Best for small datasets
Natural Language Processing • Extract useful information from natural language texts (articles, books, web pages, queries, etc.) • Traditional method: Handcrafted lexicons, grammars, parsers • Statistical approach: Learn how to process language from a corpus of real usage
Some Statistical NLP Tasks • Part of speech tagging - How to distinguish between book the noun, and book the verb. • Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going • Word sense disambiguation - How to distinguish between river bank and bank as a financial institution. • Alignment – Find the correspondence between words, sentences and paragraphs of a source text and its translation.
A Paradigmatic Task • Language Modeling: Predict the next word of a text (probabilistically): P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1) • To do this perfectly, we must capture true notions of grammaticality • So: a better approximation of the probability of “the next word” ⇒ a better language model
Measuring “Surprise” • The lower the probability of the actual word, the more the model is “surprised”: H(wn | w1…wn-1) = -log2 m(wn | w1…wn-1) (The conditional entropy of wn given w1,n-1) Cross-entropy: Suppose the actual distribution of the language is p(wn | w1…wn-1); then our model is on average surprised by: Ep[H(wn | w1,n-1)] = Σw p(wn=w | w1,n-1) H(wn=w | w1,n-1) = Ep[-log2 m(wn | w1,n-1)]
Estimating the Cross-Entropy How can we estimate Ep[H(wn | w1,n-1)] when we don’t (by definition) know p? Assume: • Stationarity: The language doesn’t change • Ergodicity: The language never gets “stuck” Then: Ep[H(wn | w1,n-1)] = lim n→∞ (1/n) Σi=1..n H(wi | w1,i-1)
Perplexity Commonly used measure of “model fit”: perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(-1/n) How many “choices” for the next word, on average? • Lower perplexity = better model
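A small sketch of per-word cross-entropy and perplexity, assuming a model given as a function cond_prob(w, history) that returns m(w | history):

import math

def perplexity(words, cond_prob):
    """Perplexity = 2 ** (average surprise, in bits, per word)."""
    total_surprise = 0.0
    for i, w in enumerate(words):
        p = cond_prob(w, words[:i])            # m(w_i | w_1 ... w_{i-1})
        total_surprise += -math.log2(p)        # surprise at the actual word
    return 2 ** (total_surprise / len(words))

# A uniform model over a 1000-word vocabulary has perplexity 1000
print(perplexity("the cat sat on the mat".split(), lambda w, h: 1 / 1000))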
N-gram Models • Assume a “limited horizon”: P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1) • Each word depends only on the last n-1 words • Specific cases: • Unigram model: P(wk) – words independent • Bigram model: P(wk | wk-1) • Learning task: estimate these probabilities from a given corpus
Using Bigrams • Compute probability of a sentence: W = The cat sat on the mat P(W) = P(The|START)P(cat|The)P(sat|cat) P(on|sat)P(the|on)P(mat|the)P(END|mat) • Generate a random text and examine for “reasonableness”
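As an illustration, the chain of bigram factors can be multiplied directly; the probabilities below are made up, not estimated from any corpus:

# Hypothetical bigram probabilities P(w2 | w1); missing pairs default to 0
bigram_prob = {
    ("START", "The"): 0.2, ("The", "cat"): 0.1, ("cat", "sat"): 0.3,
    ("sat", "on"): 0.4, ("on", "the"): 0.5, ("the", "mat"): 0.05,
    ("mat", "END"): 0.3,
}

def sentence_prob(words):
    """P(W) as a product of bigram probabilities, with START/END markers."""
    tokens = ["START"] + words + ["END"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= bigram_prob.get((w1, w2), 0.0)
    return p

print(sentence_prob("The cat sat on the mat".split()))   # 1.8e-05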
Maximum Likelihood Estimation • PMLE(w1…wn) = C(w1…wn) / N • PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1) • Problem: Data Sparseness!! • For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus • The larger the context, the greater the problem • But there are always new cases not seen before!
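A sketch of maximum likelihood bigram estimation from a toy corpus, showing the sparseness problem: any unseen bigram gets probability 0:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_mle(w, prev):
    """P_MLE(w | prev) = C(prev, w) / C(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_mle("cat", "the"))     # 0.25 (1 of the 4 occurrences of "the")
print(p_mle("mouse", "the"))   # 0.0 -- never seen, so judged impossible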
Smoothing • Idea: Take some probability away from seen events and assign it to unseen events Simple method (Laplace): Give every event an a priori count of 1 PLap(X) = (C(X)+1) / (N+B) where X is any entity, B is the number of entity types • Problem: Assigns too much probability to new events The more event types there are, the worse this becomes
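A sketch of add-one smoothing for the same kind of toy bigram counts; here B is taken to be the vocabulary size (the number of possible next words):

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = set(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_laplace(w, prev):
    """P_Lap(w | prev) = (C(prev, w) + 1) / (C(prev) + B)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + len(vocab))

print(p_laplace("cat", "the"))     # (1 + 1) / (4 + 8) ≈ 0.167
print(p_laplace("mouse", "the"))   # (0 + 1) / (4 + 8) ≈ 0.083 -- unseen, but no longer zero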
Interpolation Lidstone: PLid(X) = (C(X) + d) / (N + dB) [ d < 1 ] Johnson: equivalently, PLid(X) = μ PMLE(X) + (1 – μ)(1/B) where μ = N/(N+dB) • How to choose d? • Doesn’t match low-frequency events well
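A quick numeric check, with made-up counts, that the Lidstone estimate and Johnson's interpolation form give the same value:

def p_lidstone(count, N, B, d=0.5):
    return (count + d) / (N + d * B)

def p_johnson(count, N, B, d=0.5):
    mu = N / (N + d * B)
    return mu * (count / N) + (1 - mu) * (1 / B)

# Hypothetical event seen 3 times in N=100 observations, with B=50 event types
print(p_lidstone(3, 100, 50))   # 0.028
print(p_johnson(3, 100, 50))    # 0.028 -- identical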
Held-out Estimation Idea: Estimate the frequency of events in unseen data from separate “held out” data • Divide the data: “training” & “held out” subsets C1(X) = freq of X in the training data C2(X) = freq of X in the held out data Nr = number of entities X with C1(X) = r Tr = Σ{X: C1(X)=r} C2(X) Pho(X) = Tr / (Nr·N) where C1(X) = r
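A sketch of held-out estimation for bigrams over two hypothetical halves of a corpus; counting unseen types as vocabulary-squared minus seen types is one simple choice for N0:

from collections import Counter

def held_out_probs(train_tokens, heldout_tokens):
    """P_ho(X) = T_r / (N_r * N), where r is the training count of bigram X."""
    c1 = Counter(zip(train_tokens, train_tokens[1:]))        # C1: counts in training half
    c2 = Counter(zip(heldout_tokens, heldout_tokens[1:]))    # C2: counts in held-out half
    N = sum(c2.values())                                      # bigram tokens held out
    V = len(set(train_tokens) | set(heldout_tokens))
    T = Counter()                          # T_r = sum of C2(X) over X with C1(X) = r
    for x, c in c2.items():
        T[c1[x]] += c
    N_r = Counter(c1.values())             # number of bigram types seen r times in training
    N_r[0] = V * V - len(c1)               # unseen bigram types: all possible minus seen
    def prob(x):
        r = c1[x]
        return T[r] / (N_r[r] * N)
    return prob

p = held_out_probs("the cat sat on the mat .".split(), "the dog sat on the rug .".split())
print(p(("the", "cat")))   # ≈ 0.056: shared mass for bigrams seen once in training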
Deleted Estimation Generalize to use all the data: • Divide the data into 2 subsets (0 and 1): Nr^a = number of entities X s.t. Ca(X) = r Tr^ab = Σ{X: Ca(X)=r} Cb(X) Pdel(X) = (Tr^01 + Tr^10) / (N(Nr^0 + Nr^1)) [C(X) = r] • Needs a large data set • Overestimates unseen data, underestimates infrequent data
Good-Turing For observed items, discount the item count: r* = (r+1) E[Nr+1] / E[Nr] • The idea is that the chance of seeing the item one more time is about E[Nr+1] / E[Nr] For unobserved items, the total probability is: E[N1] / N • So, if we assume a uniform distribution over unknown items, we have: P(X) = E[N1] / (N0·N)
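A sketch of simple Good-Turing adjusted counts, using raw Nr values (in practice E[Nr] would be smoothed, as the next slide notes):

from collections import Counter

def good_turing(counts, vocab_size):
    """Adjusted counts r* = (r+1) N_{r+1} / N_r, plus the per-item unseen probability."""
    N = sum(counts.values())
    N_r = Counter(counts.values())        # N_r = number of item types seen r times
    # note: the most frequent item gets r* = 0 here (see the issues on the next slide)
    adjusted = {x: (r + 1) * N_r[r + 1] / N_r[r] for x, r in counts.items()}
    p_unseen_total = N_r[1] / N           # total mass reserved for unseen items
    n_unseen = vocab_size - len(counts)   # N_0: item types never observed
    return adjusted, (p_unseen_total / n_unseen if n_unseen else 0.0)

counts = Counter("the cat sat on the mat . the dog sat .".split())
adj, p0 = good_turing(counts, vocab_size=1000)
print(adj["cat"], p0)   # adjusted count for "cat" and the probability of each unseen word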
Good-Turing Issues • Has problems with high-frequency items (consider rmax* = E[Nrmax+1]/E[Nrmax] = 0, since no item occurs more than rmax times) Usual answers: • Use only for low-frequency items (r < k) • Smooth E[Nr] by a function S(r) • How to divide probability among unseen items? • Uniform distribution • Estimate which seem more likely than others…
Back-off Models • If the high-order n-gram has insufficient data, use a lower-order n-gram: Pbo(wi | wi-n+1,i-1) = { (1 - d(wi-n+1,i-1)) P(wi | wi-n+1,i-1) if enough data; α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1) otherwise } • Note the recursive formulation
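A much-simplified bigram-to-unigram back-off sketch; the fixed discount D and the way α is computed here are placeholder choices, not Katz's actual estimates:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
N = len(corpus)
D = 0.5   # fixed absolute discount (placeholder)

def p_backoff(w, prev):
    """Discounted bigram estimate if seen, otherwise back off to the unigram model."""
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    # alpha spreads the discounted (left-over) mass over words unseen after prev
    leftover = 1.0 - sum((bigrams[(prev, v)] - D) / unigrams[prev]
                         for v in vocab if bigrams[(prev, v)] > 0)
    unseen_mass = sum(unigrams[v] for v in vocab if bigrams[(prev, v)] == 0) / N
    alpha = leftover / unseen_mass
    return alpha * unigrams[w] / N

print(p_backoff("cat", "the"))   # seen bigram: (1 - 0.5) / 4 = 0.125
print(p_backoff("dog", "cat"))   # unseen bigram: alpha * P(dog) ≈ 0.042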
Linear Interpolation More generally, we can interpolate: Pint(wi | h) = Σk λk(h) Pk(wi | h) • Interpolation between different orders • Usually set weights by iterative training (gradient descent – EM algorithm) • Partition histories h into equivalence classes • Need to be responsive to the amount of data!
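A minimal sketch of interpolating trigram, bigram, and unigram estimates with fixed weights; in practice the λk would be trained (e.g. by EM on held-out data), as noted above:

from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
N = len(corpus)
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
L3, L2, L1 = 0.5, 0.3, 0.2   # placeholder weights (must sum to 1)

def p_interp(w, h):
    """P_int(w | h) = L3*P(w | h[-2:]) + L2*P(w | h[-1]) + L1*P(w)."""
    p3 = tri[(h[-2], h[-1], w)] / bi[(h[-2], h[-1])] if bi[(h[-2], h[-1])] else 0.0
    p2 = bi[(h[-1], w)] / uni[h[-1]] if uni[h[-1]] else 0.0
    p1 = uni[w] / N
    return L3 * p3 + L2 * p2 + L1 * p1

print(p_interp("sat", ["the", "cat"]))   # ≈ 0.829: mixes all three orders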