6. N-GRAMs • 최성자, Artificial Intelligence Laboratory, Pusan National University
Word prediction • “I’d like to make a collect …” • Likely next words: call, telephone, or person-to-person • Spelling error detection • Augmentative communication • Context-sensitive spelling error correction
Language Model • Language Model (LM): a statistical model of word sequences • n-gram model: uses the previous n-1 words to predict the next word
Applications • Context-sensitive spelling error detection and correction • “He is trying to fine out.” (“fine” should be “find”) • “The design an construction will take a year.” (“an” should be “and”) • Machine translation
Counting Words in Corpora • Corpora (on-line text collections) • Which words to count • What we are going to count • Where we are going to find the things to count
Brown Corpus • 1 million words • 500 texts • Varied genres (newspaper, novels, non-fiction, academic, etc.) • Assembled at Brown University in 1963-64 • The first large on-line text collection used in corpus-based NLP research
Issues in Word Counting • Punctuation symbols (. , ? !) • Capitalization (“He” vs. “he”, “Bush” vs. “bush”) • Inflected forms (“cat” vs. “cats”) • Wordform: cat, cats, eat, eats, ate, eating, eaten • Lemma (Stem): cat, eat
Types vs. Tokens • Tokens (N): Total number of running words • Types (B): Number of distinct words in a corpus (size of the vocabulary) • Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.” has 16 word tokens and 14 word types (not counting punctuation) ※ “Types” will mean wordform types, not lemma types, and punctuation marks will generally be counted as words
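The token and type counts for the example sentence can be checked with a short Python sketch. The tokenization rule (runs of letters or apostrophes, case folded, punctuation dropped) is an assumption made for this illustration; the function name `count_tokens_and_types` is hypothetical.

```python
import re

def count_tokens_and_types(text):
    """Return (word tokens, wordform types), ignoring punctuation."""
    # Assumption: a word is a maximal run of letters or apostrophes,
    # and case is folded before counting.
    words = re.findall(r"[A-Za-z']+", text.lower())
    return len(words), len(set(words))

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
n_tokens, n_types = count_tokens_and_types(sentence)
print(n_tokens, n_types)  # 16 14
```

"the" occurs three times, which is why the 16 tokens collapse to 14 types.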
How Many Words in English? • Shakespeare’s complete works • 884,647 wordform tokens • 29,066 wordform types • Brown Corpus • 1 million wordform tokens • 61,805 wordform types • 37,851 lemma types
Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • First attempt: • Suppose there is no corpus available • Use a uniform distribution • Assume: • number of word types = V (e.g., 100,000) • Then P(w) = 1/V for every word w
Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Second attempt: • Suppose there is a corpus • Assume: • number of word tokens = N • number of times w appears in the corpus = C(w) • Then P(w) = C(w) / N (relative frequency)
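As a minimal sketch of this second attempt, the relative-frequency estimate C(w) / N can be computed with a counter. The six-word corpus and the name `unigram_mle` are made up for illustration.

```python
from collections import Counter

def unigram_mle(tokens):
    """Estimate P(w) = C(w) / N from a list of word tokens."""
    counts = Counter(tokens)   # C(w) for every word w
    n = len(tokens)            # N, the number of word tokens
    return {w: c / n for w, c in counts.items()}

# Toy corpus, invented for this example.
tokens = "the cat sat on the mat".split()
probs = unigram_mle(tokens)
print(probs["the"])  # C(the) / N = 2 / 6
```

Any word that never occurs in the corpus is simply missing from `probs`, that is, it gets probability zero.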
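Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Third attempt: • Suppose there is a corpus • Assume a word depends on its n-1 previous words • Estimate P(w_n | w_1 … w_{n-1}) • Chain rule: P(w_1 … w_n) = P(w_1) P(w_2 | w_1) … P(w_n | w_1 … w_{n-1})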
Simple (Unsmoothed) N-grams • n-gram approximation: • w_k depends only on its previous n-1 words • P(w_k | w_1 … w_{k-1}) ≈ P(w_k | w_{k-n+1} … w_{k-1})
Bigram Approximation • Example: P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) • <s>: a special word meaning “start of sentence”
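The product above can be sketched in Python as follows. The bigram probabilities in `BIGRAM_PROB` are illustrative placeholder values, not numbers given on the slide, and `sentence_probability` is a hypothetical helper name.

```python
# Illustrative bigram probabilities P(word | previous word).
# These values are placeholders chosen for the example.
BIGRAM_PROB = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.002,
    ("British", "food"): 0.60,
}

def sentence_probability(words, bigram_prob):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <s>."""
    prob = 1.0
    prev = "<s>"
    for word in words:
        # An unseen bigram gets probability 0 in an unsmoothed model.
        prob *= bigram_prob.get((prev, word), 0.0)
        prev = word
    return prob

p = sentence_probability("I want to eat British food".split(), BIGRAM_PROB)
print(p)  # about 1.6e-05
```

A single unseen bigram drives the whole product to zero, which is the problem smoothing addresses.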
Note on Practical Problem • Multiplying many probabilities results in a very small number and can cause numerical underflow • Use log probabilities (logprobs) in the actual computation: add the logs instead of multiplying the probabilities
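A small Python sketch of the underflow problem, assuming 64-bit floats and a made-up sequence of 100 probabilities of 1e-5 each: the direct product collapses to zero, while the sum of logs stays representable.

```python
import math

def product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_prob(probs):
    """Sum natural-log probabilities instead of multiplying."""
    return sum(math.log(p) for p in probs)

# Made-up input: 100 events, each with probability 1e-5.
probs = [1e-5] * 100
direct = product(probs)    # true value 1e-500 is below the float64 range
logged = log_prob(probs)   # 100 * ln(1e-5), a perfectly ordinary number

print(direct)  # 0.0
print(logged)
```

Two sequences can still be compared through their log probabilities, because the logarithm preserves ordering.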
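Estimating N-gram Probability • Maximum Likelihood Estimate (MLE): the relative frequency of the n-gram given its history • P(w_k | w_{k-n+1} … w_{k-1}) = C(w_{k-n+1} … w_{k-1} w_k) / C(w_{k-n+1} … w_{k-1})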
Estimating Bigram Probability • Example: • C(to eat) = 860 • C(to) = 3256 • P(eat | to) = C(to eat) / C(to) = 860 / 3256 ≈ 0.26
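The bigram estimate can be sketched in Python as a ratio of two counts. The three training sentences, the `</s>` end marker, and the name `train_bigram_mle` are assumptions made for this example; only the counts 860 and 3256 come from the slide.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Return a function giving the MLE P(word | prev) = C(prev word) / C(prev)."""
    bigram_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        # Assumption: pad with <s> and </s> so every word has a history
        # and every real word is followed by something.
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words, words[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0  # history never seen; the MLE is undefined, report 0
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob

# Toy training data, invented for illustration.
prob = train_bigram_mle(["I want to eat", "I want to sleep", "they want food"])
print(prob("want", "to"))  # C(want to) / C(want) = 2 / 3

# The slide's own counts give the same kind of ratio:
p_eat_given_to = 860 / 3256
print(round(p_eat_given_to, 2))  # 0.26
```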
Two Important Facts • N-gram models become more accurate as we increase the value of N • N-gram models depend very strongly on their training corpus (in particular its genre and its size in words)
Smoothing • Any particular training corpus is finite • Sparse data problem: many acceptable n-grams never occur in the corpus • Smoothing deals with the resulting zero probabilities
Smoothing • Smoothing: re-evaluating zero-probability n-grams and assigning them non-zero probability • Also called discounting: lowering non-zero n-gram counts in order to assign some probability mass to the zero-count n-grams
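The slide does not name a particular method. As one simple instance of the idea, add-one (Laplace) smoothing adds 1 to every bigram count, so P(word | prev) = (C(prev word) + 1) / (C(prev) + V), where V is the vocabulary size. The counts and the function name below are invented for illustration.

```python
from collections import Counter

def add_one_bigram_prob(prev, word, bigram_counts, history_counts, vocab_size):
    """Add-one estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)

# Toy counts, invented for this example.
vocab = ["to", "eat", "sleep", "food"]
bigram_counts = Counter({("to", "eat"): 2, ("to", "sleep"): 1})
history_counts = Counter({"to": 3})

smoothed = {
    w: add_one_bigram_prob("to", w, bigram_counts, history_counts, len(vocab))
    for w in vocab
}
print(smoothed["eat"])   # (2 + 1) / (3 + 4)
print(smoothed["food"])  # (0 + 1) / (3 + 4), no longer zero
```

The seen bigram "to eat" drops from 2/3 to 3/7; that lost mass is what the unseen bigrams receive, which is the discounting described above.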
Things Seen Once • Use the count of things seen once to help estimate the count of things never seen
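One way to make this concrete, offered as an assumption since the slide does not say which method it has in mind, is the Good-Turing estimate of the total probability of unseen events: N_1 / N, where N_1 is the number of types seen exactly once and N is the number of tokens. A minimal sketch with made-up data:

```python
from collections import Counter

def unseen_mass_good_turing(tokens):
    """Good-Turing estimate of the probability mass of unseen events: N_1 / N."""
    counts = Counter(tokens)
    n1 = sum(1 for c in counts.values() if c == 1)  # types seen exactly once
    return n1 / len(tokens)

# Toy data, invented for illustration: c, d and e are each seen once.
tokens = "a a a b b c d e".split()
mass = unseen_mass_good_turing(tokens)
print(mass)  # 3 singletons / 8 tokens = 0.375
```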
Entropy • Measure of uncertainty • Used to evaluate quality of n-gram models (how well a language model matches a given language) • Entropy H(X) of a random variable X: H(X) = -Σ_x p(x) log2 p(x) • Measured in bits (log base 2) • Number of bits needed to encode information in the optimal coding scheme
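A short Python sketch of the entropy of a discrete distribution, using log base 2 so the result is in bits. The example distributions are made up; the function name `entropy` is a hypothetical choice.

```python
import math

def entropy(probs):
    """H(X) = -sum over x of p(x) * log2 p(x), in bits.

    Outcomes with p(x) = 0 contribute nothing and are skipped.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([0.125] * 8))  # 3.0 bits: eight equally likely outcomes
print(entropy([0.9, 0.1]))   # less than 1 bit: a biased coin is more predictable
```

Eight equally likely outcomes need 3 bits each, matching the reading of entropy as the code length of the optimal coding scheme.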
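Cross Entropy • Used for comparing two language models • p: Actual probability distribution that generated some data • m: A model of p (approximation to p) • Cross entropy of m on p: H(p, m) = lim_{n→∞} -(1/n) Σ_W p(w_1 … w_n) log m(w_1 … w_n), where the sum runs over all word sequences W = w_1 … w_n of length n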
Cross Entropy • By the Shannon-McMillan-Breiman theorem (for a stationary ergodic process): H(p, m) = lim_{n→∞} -(1/n) log m(w_1 … w_n) • Property of cross entropy: H(p) ≤ H(p, m) • Difference between H(p, m) and H(p) is a measure of how accurate model m is • The more accurate a model, the lower its cross-entropy
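In practice the limit is approximated on a finite test sample: the per-word cross entropy is -(1/n) log2 m(w_1 … w_n). The sketch below assumes unigram models for simplicity (so the model probability of the sample is a product of word probabilities); the two models and the test sequence are made up for illustration.

```python
import math

def cross_entropy_per_word(test_words, model):
    """Approximate H(p, m) by -(1/n) * log2 m(w_1 ... w_n) on a test sample.

    Assumption: `model` is a unigram model (dict word -> probability), so
    log2 m(w_1 ... w_n) is the sum of the per-word log probabilities.
    """
    log_prob = sum(math.log2(model[w]) for w in test_words)
    return -log_prob / len(test_words)

# Made-up test sample and two competing models.
test_words = "a a b c".split()
model_good = {"a": 0.5, "b": 0.25, "c": 0.25}   # matches the sample frequencies
model_uniform = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

h_good = cross_entropy_per_word(test_words, model_good)
h_uniform = cross_entropy_per_word(test_words, model_uniform)
print(h_good)     # 1.5 bits per word
print(h_uniform)  # log2(3) bits per word, about 1.585
```

The model that better matches the data gets the lower cross entropy, which is how the two models are compared.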