
CS 595-052 Machine Learning and Statistical Natural Language Processing


Presentation Transcript


  1. CS 595-052 Machine Learning and Statistical Natural Language Processing Prof. Shlomo Argamon, argamon@iit.edu Room: 237C Office Hours: Mon 3-4 PM Book: Foundations of Statistical Natural Language Processing, C. D. Manning and H. Schütze Requirements: • Several programming projects • Research proposal

  2. Machine Learning (diagram): Training Examples → Learning Algorithm → Learned Model; Test Examples + Learned Model → Classification/Labeling Results

  3. Modeling • Decide how to represent learned models: • Decision rules • Linear functions • Markov models • … • Type chosen affects generalization accuracy (on new data)

  4. Generalization

  5. Example Representation • Set of Features: • Continuous • Discrete (ordered and unordered) • Binary • Sets vs. Sequences • Classes: • Continuous vs. discrete • Binary vs. multivalued • Disjoint vs. overlapping

  6. Learning Algorithms • Find a “good” hypothesis “consistent” with the training data • Many hypotheses may be consistent, so may need a “preference bias” • No hypothesis may be consistent, so need to find “nearly” consistent • May rule out some hypotheses to start with: • Feature reduction

  7. Estimating Generalization Accuracy • Accuracy on the training data says nothing about performance on new examples! • Must train and test on different example sets • Estimate generalization accuracy over multiple train/test divisions • Sources of estimation error: • Bias: Systematic error in the estimate • Variance: How much the estimate changes between different runs

  8. Cross-validation • Divide the training data into k sets • Repeat for each set: • Train on the other k-1 sets • Test on the held-out set • Average the k accuracies (and compute statistics)
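A minimal sketch of this procedure in plain Python; `train` and `evaluate` are placeholder callables (not from the slides) standing in for whatever learning algorithm and accuracy measure are being evaluated.

```python
import random

def k_fold_cross_validation(examples, k, train, evaluate):
    """Estimate accuracy by averaging over k train/test splits."""
    shuffled = list(examples)
    random.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]            # k roughly equal subsets
    accuracies = []
    for i in range(k):
        test_set = folds[i]                               # test on the i-th fold
        train_set = [ex for j in range(k) if j != i for ex in folds[j]]
        model = train(train_set)                          # train on the other k-1 folds
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / k, accuracies                # mean plus per-fold accuracies
```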

  9. Bootstrapping For a corpus of n examples: • Choose n examples randomly (with replacement) Note: We expect ~0.632·n distinct examples • Train a model, and evaluate: • acc0 = accuracy of the model on the non-chosen examples • accS = accuracy of the model on the n training examples • Estimate accuracy as 0.632·acc0 + 0.368·accS • Average the accuracies over b different runs Also note: there are other, similar bootstrapping techniques
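A sketch of the 0.632 bootstrap described above; again `train` and `evaluate` are placeholders, and resampling is done by index so the non-chosen examples can be identified.

```python
import random

def bootstrap_632(examples, b, train, evaluate):
    """Average the 0.632 bootstrap accuracy estimate over b resamples."""
    n = len(examples)
    estimates = []
    for _ in range(b):
        idx = [random.randrange(n) for _ in range(n)]     # n draws with replacement
        sample = [examples[i] for i in idx]
        chosen = set(idx)
        unseen = [examples[i] for i in range(n) if i not in chosen]  # ~0.368n examples
        model = train(sample)
        acc0 = evaluate(model, unseen)                    # accuracy on non-chosen examples
        acc_s = evaluate(model, sample)                   # accuracy on the training sample
        estimates.append(0.632 * acc0 + 0.368 * acc_s)
    return sum(estimates) / b
```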

  10. Bootstrapping vs. Cross-validation • Cross-validation: • Equal participation of all examples • Class distribution in each test set depends on the distribution in the training sets • Stratified cross-validation: equalizes class distributions • Bootstrap: • Often has higher bias (fewer distinct training examples) • Best for small datasets

  11. Natural Language Processing • Extract useful information from natural language texts (articles, books, web pages, queries, etc.) • Traditional method: Handcrafted lexicons, grammars, parsers • Statistical approach: Learn how to process language from a corpus of real usage

  12. Some Statistical NLP Tasks • Part of speech tagging - How to distinguish between book the noun, and book the verb. • Shallow parsing – Pick out phrases of different types from a text, such as the purple people eater or would have been going • Word sense disambiguation - How to distinguish between river bank and bank as a financial institution. • Alignment – Find the correspondence between words, sentences and paragraphs of a source text and its translation.

  13. A Paradigmatic Task • Language Modeling: Predict the next word of a text (probabilistically): P(wn | w1w2…wn-1) = m(wn | w1w2…wn-1) • To do this perfectly, we must capture true notions of grammaticality • So: Better approximation of the prob. of “the next word” ⇒ Better language model

  14. Measuring “Surprise” • The lower the probability of the actual word, the more the model is “surprised”:
    H(wn | w1…wn-1) = -log2 m(wn | w1…wn-1)
    (The conditional entropy of wn given w1,n-1)
  • Cross-entropy: Suppose the actual distribution of the language is p(wn | w1…wn-1); then our model is on average surprised by:
    Ep[H(wn | w1,n-1)] = Σw p(wn = w | w1,n-1) H(wn = w | w1,n-1) = Ep[-log2 m(wn | w1,n-1)]
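A small illustrative sketch of per-word surprise and its average over a text; `model_prob(word, history)` is an assumed placeholder for the model m, and the model is assumed never to assign probability zero.

```python
import math

def surprise(prob):
    """-log2 of the probability the model gave to the word that actually occurred."""
    return -math.log2(prob)

def average_surprise(words, model_prob):
    """Empirical cross-entropy: mean of -log2 m(w_i | w_1..w_{i-1}), in bits per word."""
    total = 0.0
    for i, w in enumerate(words):
        total += surprise(model_prob(w, words[:i]))       # history = preceding words
    return total / len(words)
```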

  15. Estimating the Cross-Entropy • How can we estimate Ep[H(wn | w1,n-1)] when we don’t (by definition) know p? • Assume: • Stationarity: The language doesn’t change • Ergodicity: The language never gets “stuck” • Then:
    Ep[H(wn | w1,n-1)] = lim n→∞ (1/n) Σn H(wn | w1,n-1)

  16. Perplexity • Commonly used measure of “model fit”:
    perplexity(w1,n, m) = 2^H(w1,n, m) = m(w1,n)^(-1/n)
  • How many “choices” for the next word, on average? • Lower perplexity = better model
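Continuing the sketch above, perplexity is then just 2 raised to the average surprise (names are illustrative, reusing `average_surprise` from the previous sketch).

```python
def perplexity(words, model_prob):
    """2 raised to the per-word cross-entropy in bits (see average_surprise above)."""
    return 2 ** average_surprise(words, model_prob)
```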

  17. N-gram Models • Assume a “limited horizon”: P(wk | w1w2…wk-1) = P(wk | wk-n+1…wk-1) • Each word depends only on the last n-1 words • Specific cases: • Unigram model: P(wk) – words independent • Bigram model: P(wk | wk-1) • Learning task: estimate these probabilities from a given corpus
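A toy sketch of the learning task for the bigram case: count bigrams in a made-up corpus and turn the counts into conditional probabilities. The corpus, the `<s>`/`</s>` boundary markers, and the function names are illustrative assumptions, not from the slides.

```python
from collections import Counter

toy_corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]

bigram_counts = Counter()
context_counts = Counter()
for sentence in toy_corpus:
    tokens = ["<s>"] + sentence + ["</s>"]                # sentence-boundary markers
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    """Estimated P(cur | prev) = C(prev cur) / C(prev); 0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]
```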

  18. Using Bigrams • Compute the probability of a sentence: W = The cat sat on the mat
    P(W) = P(The|START) P(cat|The) P(sat|cat) P(on|sat) P(the|on) P(mat|the) P(END|mat)
  • Generate a random text and examine it for “reasonableness”
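Using the `p_bigram` estimates from the previous sketch, the sentence probability is just the product of its bigram probabilities with START/END markers.

```python
def sentence_prob(sentence, p=p_bigram):
    """P(<s> w1 ... wn </s>) under a bigram model given as p(cur, prev)."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p(cur, prev)
    return prob

print(sentence_prob(["the", "cat", "sat", "on", "the", "mat"]))   # 0.0625 with the toy counts
```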

  19. Maximum Likelihood Estimation • PMLE(w1…wn) = C(w1…wn) / N • PMLE(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1) • Problem: Data sparseness!! • For the vast majority of possible n-grams, we get 0 probability, even in a very large corpus • The larger the context, the greater the problem • But there are always new cases not seen before!
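Continuing the toy example: a bigram that never occurs in the corpus gets MLE probability zero, so any sentence containing it does too.

```python
# "sat the" never occurs in the toy corpus above, so its MLE probability is zero,
# and any sentence containing that bigram gets probability zero as well.
print(p_bigram("the", "sat"))                                      # 0.0
print(sentence_prob(["the", "cat", "sat", "the", "mat"]))          # 0.0
```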

  20. Smoothing • Idea: Take some probability away from seen events and assign it to unseen events • Simple method (Laplace): Give every event an a priori count of 1:
    PLap(X) = (C(X) + 1) / (N + B)
    where X is any entity and B is the number of entity types
  • Problem: Assigns too much probability to new events • The more event types there are, the worse this becomes
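A minimal sketch of the Laplace formula above applied to bigram events, taking B to be the number of possible bigram types (assumed here to be V squared, an illustrative choice).

```python
def p_laplace(bigram, counts, N, B):
    """Add-one estimate (C(X) + 1) / (N + B) for a bigram event X."""
    return (counts[bigram] + 1) / (N + B)

# Example with the toy bigram counts above, taking B = V^2 possible bigram types.
vocab = {w for sent in toy_corpus for w in sent} | {"<s>", "</s>"}
N_tokens = sum(bigram_counts.values())
B_types = len(vocab) ** 2
print(p_laplace(("sat", "the"), bigram_counts, N_tokens, B_types))  # small but non-zero
```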

  21. Interpolation • Lidstone: PLid(X) = (C(X) + d) / (N + dB) [d < 1] • Johnson: PLid(X) = μ PMLE(X) + (1 – μ)(1/B), where μ = N/(N + dB) • How to choose d? • Doesn’t match low-frequency events well
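A sketch of the Lidstone estimate and the equivalent interpolated (Johnson) form; both functions return the same value for the same d, which is the point of the slide. The function names and arguments are illustrative.

```python
def p_lidstone(bigram, counts, N, B, d=0.5):
    """Lidstone estimate (C(X) + d) / (N + d*B)."""
    return (counts[bigram] + d) / (N + d * B)

def p_johnson(bigram, counts, N, B, d=0.5):
    """Equivalent interpolation form: mu * P_MLE(X) + (1 - mu) * (1 / B)."""
    mu = N / (N + d * B)
    return mu * (counts[bigram] / N) + (1 - mu) * (1 / B)
```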

  22. Held-out Estimation • Idea: Estimate the frequency of unseen events from separate “held out” data • Divide the data into “training” and “held out” subsets:
    C1(X) = frequency of X in the training data
    C2(X) = frequency of X in the held-out data
    Tr = ΣX:C1(X)=r C2(X)
    Nr = number of entities X with C1(X) = r
    Pho(X) = Tr / (Nr N), where C1(X) = r
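A sketch of held-out estimation over two halves of the data. It only handles items actually seen in training (r ≥ 1), and it takes N in the slide's formula to be the total number of item tokens in the held-out half; both are simplifying assumptions.

```python
from collections import Counter, defaultdict

def held_out_probs(c_train, c_held, n_held):
    """Build P_ho(X) = T_r / (N_r * n_held) from two halves of the data.

    c_train, c_held: Counters of item frequencies in the two halves;
    n_held: total number of item tokens in the held-out half."""
    T = defaultdict(int)    # T_r: total held-out count of items with training count r
    Nr = defaultdict(int)   # N_r: number of distinct items with training count r
    for item, r in c_train.items():
        T[r] += c_held[item]
        Nr[r] += 1
    def p_ho(item):
        r = c_train[item]
        return T[r] / (Nr[r] * n_held) if Nr[r] else 0.0
    return p_ho
```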

  23. Deleted Estimation • Generalize to use all the data: • Divide the data into 2 subsets (a, b = 0, 1):
    Nra = number of entities X such that Ca(X) = r
    Trab = ΣX:Ca(X)=r Cb(X)
    Pdel(X) = (Tr01 + Tr10) / (N (Nr0 + Nr1)), where C(X) = r
  • Needs a large data set • Overestimates unseen data, underestimates infrequent data

  24. Good-Turing • For observed items, discount the item count:
    r* = (r+1) E[Nr+1] / E[Nr]
  • The idea is that the chance of seeing the item one more time is about E[Nr+1] / E[Nr] • For unobserved items, the total probability is:
    E[N1] / N
  • So, if we assume a uniform distribution over unknown items, we have:
    P(X) = E[N1] / (N0 N)
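A minimal Good-Turing sketch that simply uses the observed Nr as the estimate of E[Nr] (so, as the next slide notes, it is only sensible for low r); the function names are illustrative.

```python
from collections import Counter

def good_turing(counts):
    """Return (r_star, p_unseen_total) using observed N_r as the estimate of E[N_r].

    r_star(r) = (r + 1) * N_{r+1} / N_r is the discounted count for items seen
    r times; p_unseen_total = N_1 / N is the probability mass left for unseen items."""
    Nr = Counter(counts.values())          # N_r = number of item types seen exactly r times
    N = sum(counts.values())               # total number of tokens
    def r_star(r):
        return (r + 1) * Nr[r + 1] / Nr[r] if Nr[r] else 0.0
    return r_star, Nr[1] / N
```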

  25. Good-Turing Issues • Has problems with high-frequency items (consider rmax* = (rmax+1) E[Nrmax+1] / E[Nrmax] = 0, since Nrmax+1 = 0) • Usual answers: • Use only for low-frequency items (r < k) • Smooth E[Nr] with a function S(r) • How to divide the probability among unseen items? • Uniform distribution • Estimate which items seem more likely than others…

  26. Back-off Models • If the high-order n-gram has insufficient data, use a lower-order n-gram:
    Pbo(wi | wi-n+1,i-1) = (1 - d(wi-n+1,i-1)) P(wi | wi-n+1,i-1)   if enough data
    Pbo(wi | wi-n+1,i-1) = α(wi-n+1,i-1) Pbo(wi | wi-n+2,i-1)   otherwise
  • Note the recursive formulation
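A much-simplified sketch of the recursion above, not Katz's exact method: `discount` and `alpha` stand in for the d(·) and α(·) terms, `counts` maps n-gram tuples to frequencies, and `context_counts[()]` is assumed to hold the total token count for the unigram base case.

```python
def p_backoff(word, history, counts, context_counts, discount, alpha, min_count=1):
    """Recursive back-off: use the discounted higher-order estimate when the
    n-gram has enough data, otherwise back off to a shorter history."""
    ngram = tuple(history) + (word,)
    if counts.get(ngram, 0) >= min_count:
        p = counts[ngram] / context_counts[tuple(history)]
        return (1 - discount(tuple(history))) * p
    if not history:                        # base case: no shorter history to back off to
        return 0.0
    return alpha(tuple(history)) * p_backoff(word, history[1:], counts,
                                             context_counts, discount, alpha, min_count)
```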

  27. Linear Interpolation • More generally, we can interpolate: Pint(wi | h) = Σk λk(h) Pk(wi | h) • Interpolation between different orders • Usually set the weights by iterative training (gradient descent or the EM algorithm) • Partition the histories h into equivalence classes • Need to be responsive to the amount of data!
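A sketch of the interpolation formula with fixed weights; in practice the λk(h) are trained (e.g. with EM) per equivalence class of histories, which this placeholder version does not do.

```python
def p_interpolated(word, history, models, lambdas):
    """Weighted mixture of n-gram models of different orders.

    models: list of functions p_k(word, history); lambdas: weights summing to 1."""
    return sum(lam * p(word, history) for lam, p in zip(lambdas, models))

# Usage sketch with placeholder unigram/bigram/trigram estimators (not defined here):
# p_mix = p_interpolated(w, h, [p_uni, p_bi, p_tri], [0.1, 0.3, 0.6])
```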
