SI 760 / EECS 597 / Ling 702: Language and Information • Handout #1 • Winter 2004
Course Information • Instructor: Dragomir R. Radev (radev@umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: TBA • Course page: http://www.si.umich.edu/~radev/LNI-winter2004/ • Class meets on Mondays, 1-4 PM in 412 WH
Readings • Two introductions to statistical NLP • Collocations paper • Joshua Goodman’s language modeling tutorial • Documentation for the CMU LM toolkit
Word Prediction • Example: “I’d like to make a collect …” • Examples of errors a language model can help catch: • “I have a gub” • “He is trying to fine out” • “Hopefully, all with continue smoothly in my absence” • “They are leaving in about fifteen minuets to go to her house” • “I need to notified the bank of [this problem]” • Language model: a statistical model of word sequences
Counting Words • Brown corpus (1 million words from 500 texts) • Example: “He stepped out into the hall, was delighted to encounter a water brother” - how many words? • Word forms vs. lemmas: “cat” and “cats” are distinct word forms but share one lemma (compare tokens vs. types) • Shakespeare’s complete works: 884,647 word tokens and 29,066 word types • Brown corpus: 61,805 types and 37,851 lemmas • American Heritage 3rd edition has 200,000 “boldface forms” (including some multiword phrases)
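A quick sketch of the token/type distinction in Python (whitespace tokenization only; a real tokenizer would also have to decide about punctuation, case, and contractions):

```python
# Count word tokens (running words) and word types (distinct words).
# Splitting on whitespace is a simplification: "hall," and "hall"
# would be counted as different types.
text = "He stepped out into the hall , was delighted to encounter a water brother"
tokens = text.split()
types = set(tokens)
print(len(tokens), "tokens,", len(types), "types")
```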
Unsmoothed N-grams • First approximation: each word is equally likely to follow any other. E.g., with 100,000 words, the probability of each of them at any given point is 0.00001 • “the” appears 69,971 times in the Brown corpus, while “rabbit” appears 11 times • “Just then, the white …” • Chain rule: P(w_1, w_2, …, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_n|w_1 w_2 … w_{n-1})
A language model • The sum of the probabilities of all strings has to be 1 • Bigram and trigram models • How do you estimate the probabilities? • Bigram model: replace P(w_n|w_1 w_2 … w_{n-1}) with P(w_n|w_{n-1})
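A minimal sketch of maximum-likelihood bigram estimation, assuming a toy corpus (the sentences and counts below are illustrative, not from the slides):

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
sentences = [["<s>", "i", "like", "snakes", "</s>"],
             ["<s>", "i", "like", "tea", "</s>"]]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_bigram(w_prev, w):
    # Unsmoothed ML estimate: P(w | w_prev) = count(w_prev w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("i", "like"))    # 1.0: "like" always follows "i" here
print(p_bigram("like", "tea"))  # 0.5
```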
Perplexity of a language model • What is the perplexity of guessing a digit if all digits are equally likely? (Answer: 10) • How about a letter? (26) • How about guessing A with a probability of 1/4, B with a probability of 1/2, and 10,000 other cases sharing the remaining 1/4 (example modified from Joshua Goodman)?
Perplexity across distributions • What if the actual distribution is very different from the expected one? • Example: all of the 10,000 other cases are equally likely but P(A) = P(B) = 0 • Cross-entropy = log2(perplexity), measured in bits
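A small sketch computing perplexity for the distributions above, assuming the A/B example is normalized as P(A) = 1/4, P(B) = 1/2, with the remaining 1/4 shared equally by the 10,000 other cases:

```python
import math

def perplexity(probs):
    # Perplexity = 2^H, where H = -sum p*log2(p) is the entropy in bits.
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h

print(perplexity([0.1] * 10))     # uniform digits: 10.0
print(perplexity([1 / 26] * 26))  # uniform letters: 26.0

# A = 1/4, B = 1/2, 10,000 other cases sharing the remaining 1/4:
# far fewer "effective choices" than 10,002 equally likely outcomes.
print(perplexity([0.25, 0.5] + [0.25 / 10000] * 10000))
```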
Smoothing techniques • Imagine the following situation: you are looking at Web pages and have to guess how many different languages they are written in • The first one is in English, then French, then English again, then Korean, then Chinese, etc. Total so far: 5 F, 3 E, 1 K, 1 C • Can you predict what the next language will be? • What is the problem with the simplest approach (relative frequencies assign probability zero to any language not yet seen)?
Smoothing • Why smoothing? • How many parameters are there in the model (given 100,000 possible words)? • What are the unsmoothed (ML) estimates for unigrams, bigrams, trigrams? • Linear interpolation (mixture with weights λ_i) • How to estimate the λ_i?
Example • Consider the problem of estimating bigram probabilities from a training corpus. • Probability mass must be 1. • How to account for unseen events?
Common methods • Add-1 smoothing (add one to each count in the numerator, add the vocabulary size V to the denominator) • Good-Turing smoothing • Best in practice: Kneser-Ney smoothing
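A minimal sketch of add-1 (Laplace) smoothing for bigrams; the counts are toy values, not corpus statistics:

```python
from collections import Counter

# Toy counts; a real model would collect these from a corpus.
unigrams = Counter({"i": 2, "like": 2, "snakes": 1, "tea": 1})
vocab_size = len(unigrams)
bigrams = Counter({("i", "like"): 2, ("like", "snakes"): 1, ("like", "tea"): 1})

def p_add1(w_prev, w):
    # Add-1: (count(w_prev w) + 1) / (count(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

print(p_add1("like", "tea"))  # seen bigram: (1+1)/(2+4) = 0.333...
print(p_add1("like", "i"))    # unseen bigram gets non-zero mass: 1/6
```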
Markov Models • Assumption: the probability of some future item can be predicted on the basis of a short history • Bigrams: first-order Markov models • A bigram grammar can be represented as an N-by-N matrix of probabilities, where N is the size of the vocabulary being modeled
The Noisy Channel Model • Source-channel model of communication • Parametric probabilistic models of language and translation • Training such models
Statistics • Given f, guess e • Source-channel setup: e → encoder → f → decoder → e’ • Decoding rule: e’ = argmax_e P(e|f) = argmax_e P(f|e) P(e), where P(f|e) is the translation model and P(e) is the language model
Parametric probabilistic models • Language model (LM): P(e) = P(e_1, e_2, …, e_L) = P(e_1) P(e_2|e_1) … P(e_L|e_1 … e_{L-1}) • Deleted interpolation; e.g., a trigram LM approximates P(e_L|e_1 … e_{L-1}) by P(e_L|e_{L-2}, e_{L-1}) • Translation model (TM): alignment model P(f,a|e)
IBM’s EM trained models • Word translation • Local alignment • Fertilities • Class-based alignment • Non-deficient algorithm (avoid overlaps, overflow)
Bayesian formulas • argmax_e P(e|f) = ? • P(e|f) = P(e) * P(f|e) / P(f) • Since P(f) is constant for a given f: argmax_e P(e|f) = argmax_e P(e) * P(f|e) • The rest of the slides in this section are based on “A Statistical MT Tutorial Workbook” by Kevin Knight
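A minimal sketch of the noisy-channel decision rule over a tiny candidate list; the scores and candidates below are hypothetical stand-ins, not Knight's actual models:

```python
# Hypothetical toy scores; in a real system P(e) comes from an n-gram
# LM and P(f|e) from a trained translation model.
lm = {"the house is blue": 0.02, "blue the house is": 0.0001}
tm = {"the house is blue": 0.4, "blue the house is": 0.5}

def decode(candidates):
    # argmax_e P(e) * P(f|e); P(f) is constant and can be ignored.
    return max(candidates, key=lambda e: lm[e] * tm[e])

print(decode(list(lm)))  # "the house is blue": the LM rescues word order
```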
N-gram model • P(e) = ? • P(how’s it going?) = 76,413/1,000,000,000 = 0.000076413 • Bigrams: b(y|x) = count(xy)/count(x) • P(“I like snakes that are not poisonous”) = P(“I”|start) * P(“like”|“I”) * … • Trigrams: b(z|xy) = count(xyz)/count(xy)
Smoothing • b(z|xy) = 0.95 * count(“xyz”) / count(“xy”) + 0.04 * count(“yz”) / count(“y”) + 0.008 * count(“z”) / total-words-seen + 0.002
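A direct sketch of this interpolated estimate in Python; the n-gram counts are toy values, while the weights are the ones on the slide:

```python
from collections import Counter

# Toy n-gram counts standing in for real corpus statistics.
tri = Counter({("i", "like", "snakes"): 3})
bi = Counter({("i", "like"): 5, ("like", "snakes"): 4})
uni = Counter({"i": 20, "like": 10, "snakes": 4})
total_words = 1000

def b(z, x, y):
    # 0.95 trigram + 0.04 bigram + 0.008 unigram + 0.002 floor, with
    # each ML estimate defaulting to 0 when its history is unseen.
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
    p1 = uni[z] / total_words
    return 0.95 * p3 + 0.04 * p2 + 0.008 * p1 + 0.002

print(b("snakes", "i", "like"))  # never zero, even for unseen trigrams
```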
Ordering words • Exercise: reconstruct a fluent sentence from each bag of words; a good language model prefers the correct order • (1) a a centre earthquake evacuation forced has historic Italian of of second southern strong the the village • (2) met Paul Wendy
Translation models • Mary did not slap the green witch. • Mary not slap slap slap the green witch • Mary no daba una bofetada a la verde bruja • Mary no daba una bofetada a la bruja verde
Translation models • Fertility • Permutation
IBM model 3 • Fertility • Spurious words (e0) • Pick words • Pick positions
Translation models • Mary did not slap the green witch. • Mary not slap slap slap the green witch • Mary not slap slap slap NULL the green witch • Mary no daba una bofetada a la verde bruja • Mary no daba una bofetada a la bruja verde
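A toy walk-through of the Model 3 generative story for this sentence; every table below is hand-set for illustration (a real system learns these parameters with EM):

```python
from collections import defaultdict

e = "Mary did not slap the green witch".split()

# Step 1: fertility -- each English word generates 0..n slots.
fertility = {"Mary": 1, "did": 0, "not": 1, "slap": 3,
             "the": 1, "green": 1, "witch": 1}
slots = [w for w in e for _ in range(fertility[w])]
# -> Mary not slap slap slap the green witch

# Step 2: spurious words -- NULL inserts words with no English source.
slots.insert(5, "NULL")
# -> Mary not slap slap slap NULL the green witch

# Step 3: translation -- each slot picks a Spanish word (toy table;
# the k-th copy of a word takes the k-th entry).
table = {"Mary": ["Mary"], "not": ["no"],
         "slap": ["daba", "una", "bofetada"], "NULL": ["a"],
         "the": ["la"], "green": ["verde"], "witch": ["bruja"]}
seen = defaultdict(int)
spanish = []
for w in slots:
    spanish.append(table[w][seen[w]])
    seen[w] += 1
# -> Mary no daba una bofetada a la verde bruja

# Step 4: distortion -- reorder into Spanish word positions.
spanish[-2], spanish[-1] = spanish[-1], spanish[-2]
print(" ".join(spanish))  # Mary no daba una bofetada a la bruja verde
```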
Parameters • n - fertility probabilities n(φ|e) • t - word translation probabilities t(f|e) • d - distortion (position) probabilities d(j|i, l, m) • p1 - probability of inserting a spurious word
Example • English: NULL And the program has been implemented • French: Le programme a été mis en application
Alignments • Example: two alternative alignments between “The blue house” and “La maison bleue” • Needed: P(a|f,e)
Motivation • Sequence of random variables that aren’t independent • Example: weather reports
Properties • Limited horizon: P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t) • Time invariant (stationary): P(X_{t+1} = s_k | X_t) = P(X_2 = s_k | X_1) • Definition: in terms of a transition matrix A and initial state probabilities π
Example • [Figure: state diagram of a Markov chain over the letters h, a, p, e, t, i with a start state; the arcs used below include start → t (1.0), t → i (0.3), and i → p (0.6)]
Visible MM • P(X_1, …, X_T) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) … P(X_T|X_1, …, X_{T-1}) = P(X_1) P(X_2|X_1) P(X_3|X_2) … P(X_T|X_{T-1}) • P(t, i, p) = P(X_1 = t) P(X_2 = i | X_1 = t) P(X_3 = p | X_2 = i) = 1.0 × 0.3 × 0.6 = 0.18
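A minimal sketch of scoring a state sequence under a visible Markov model; the transition table only encodes the arcs recoverable from the example above, and everything else is an assumption:

```python
# Transition probabilities for the letter-chain example; only the
# arcs used below are filled in.
start = {"t": 1.0}
trans = {("t", "i"): 0.3, ("i", "p"): 0.6}

def p_sequence(states):
    # P(X_1..X_T) = P(X_1) * prod_t P(X_t | X_{t-1})  (Markov property)
    p = start.get(states[0], 0.0)
    for prev, cur in zip(states, states[1:]):
        p *= trans.get((prev, cur), 0.0)
    return p

print(p_sequence(["t", "i", "p"]))  # 1.0 * 0.3 * 0.6 = 0.18
```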
Hidden MM • [Figure: state diagram of a hidden Markov model with two states, COLA and ICE TEA, plus a start state; transition probabilities 0.7/0.3 out of COLA and 0.5/0.5 out of ICE TEA]
Hidden MM • Arc-emission model: P(O_t = k | X_t = s_i, X_{t+1} = s_j) = b_{ijk}
Example • P(lemonade, iced tea | COLA) = ? • P = 0.7 × 0.3 × 0.7 × 0.1 + 0.7 × 0.3 × 0.3 × 0.1 + 0.3 × 0.3 × 0.5 × 0.7 + 0.3 × 0.3 × 0.5 × 0.7 = 0.084
Hidden MM • Applications: part-of-speech tagging, speech recognition, gene sequencing • Notation: A = state transition probabilities, B = symbol emission probabilities, π = initial state probabilities • Three tasks: • Given μ = (A, B, π), find P(O|μ) • Given O and μ, find the most likely state sequence (X_1, …, X_{T+1}) • Given O and a space of possible models, find the μ that best describes the observations
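A sketch of task 1 (computing P(O|μ)) with the forward algorithm on the soda-machine example; emission here depends on the state being left, and the cola emission values (0.6, 0.1) are assumed textbook numbers, not shown on the slide:

```python
# Two-state HMM: C = COLA, I = ICE TEA. Emission depends on the
# origin state of each transition (arc emission).
trans = {("C", "C"): 0.7, ("C", "I"): 0.3,
         ("I", "C"): 0.5, ("I", "I"): 0.5}
emit = {"C": {"cola": 0.6, "iced_tea": 0.1, "lemonade": 0.3},
        "I": {"cola": 0.1, "iced_tea": 0.7, "lemonade": 0.2}}

def forward(obs, start_state):
    # alpha[s] = P(o_1..o_t, next state = s); sums over all paths in
    # O(T * S^2) instead of enumerating S^T state sequences.
    alpha = {s: 1.0 if s == start_state else 0.0 for s in "CI"}
    for o in obs:
        alpha = {j: sum(alpha[i] * trans[(i, j)] * emit[i][o]
                        for i in "CI") for j in "CI"}
    return sum(alpha.values())

print(forward(["lemonade", "iced_tea"], "C"))  # 0.084, as on the slide
```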
Collocations • Idioms • Free word combinations • Know a word by the company that it keeps (Firth) • Common use • No general syntactic or semantic rules • Important for non-native speakers
Examples • Idioms: to kick the bucket, dead end, to catch up • Collocations: to trade actively, table of contents, orthogonal projection • Free word combinations: to take the bus, the end of the road, to buy a house
Uses • Disambiguation (e.g., “bank”: a collocate like “loan” vs. “river” selects the sense) • Translation • Generation
Properties • Arbitrariness • Language- and dialect-specific • Common in technical language • Recurrent in context • (see Smadja 1993)
Arbitrariness • Make an effort vs. *make an exertion • Running commentary vs. *running discussion • Commit treason vs. *commit treachery
Cross-lingual properties • Régler la circulation = direct traffic • Russian, German, Serbo-Croatian: the direct translation is used • American English (AE): set the table, make a decision • British English (BE): lay the table, take a decision • “semer le désarroi” (literally “to sow disarray”) is better rendered as “to wreak havoc”
Types of collocations • Grammatical: come to, put on; afraid that, fond of, by accident, witness to • Semantic (only certain synonyms) • Flexible: find/discover/notice by chance
Base/collocator pairs • base noun + collocator verb: set the table • base noun + collocator adjective: warm greetings • base verb + collocator adverb: struggle desperately • base adjective + collocator adverb: sound asleep • base verb + collocator preposition: put on
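A minimal sketch of one standard way to surface collocation candidates, scoring bigrams by pointwise mutual information; the corpus below is illustrative, and the course readings cover stronger association tests:

```python
import math
from collections import Counter

# Tiny stand-in corpus; a real run needs millions of words.
words = ("orthogonal projection requires an orthogonal basis "
         "and the projection is linear").split()
uni = Counter(words)
bi = Counter(zip(words, words[1:]))
n = len(words)

def pmi(x, y):
    # PMI = log2[ P(x,y) / (P(x) P(y)) ]; a high value means the pair
    # co-occurs far more often than chance, a collocation signal.
    return math.log2((bi[(x, y)] / (n - 1)) / ((uni[x] / n) * (uni[y] / n)))

print(pmi("orthogonal", "projection"))
```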