6. N-GRAMs (Pusan National University Artificial Intelligence Lab, Seongja Choi)
Word prediction • “I’d like to make a collect …” (likely next words: call, telephone, or person-to-person) • Spelling error detection • Augmentative communication • Context-sensitive spelling error correction
Language Model • Language Model (LM) • statistical model of word sequences • n-gram: Use the previous n-1 words to predict the next word
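A minimal Python sketch of this idea (illustrative only; the toy corpus and names such as `bigram_counts` are assumptions, not from the slides): count word pairs in a corpus and predict the next word from the previous one.

```python
from collections import defaultdict

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> I want to eat British food </s>",
    "<s> I want to eat Chinese food </s>",
]

bigram_counts = defaultdict(int)   # counts of (previous word, current word)
context_counts = defaultdict(int)  # counts of the previous word alone

for sentence in corpus:
    words = sentence.split()
    for prev, curr in zip(words, words[1:]):
        bigram_counts[(prev, curr)] += 1
        context_counts[prev] += 1

# Predict the most likely word following "want" from the counts:
candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == "want"}
print(max(candidates, key=candidates.get))  # -> "to"
```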
Applications • Context-sensitive spelling error detection and correction: “He is trying to fine out.” (fine → find), “The design an construction will take a year.” (an → and) • Machine translation
Counting Words in Corpora • Corpora (on-line text collections) • Which words to count • What we are going to count • Where we are going to find the things to count
Brown Corpus • 1 million words • 500 texts • Varied genres (newspaper, novels, non-fiction, academic, etc.) • Assembled at Brown University in 1963-64 • The first large on-line text collection used in corpus-based NLP research
Issues in Word Counting • Punctuation symbols (. , ? !) • Capitalization (“He” vs. “he”, “Bush” vs. “bush”) • Inflected forms (“cat” vs. “cats”) • Wordform: cat, cats, eat, eats, ate, eating, eaten • Lemma (Stem): cat, eat
Types vs. Tokens • Tokens (N): Total number of running words • Types (B): Number of distinct words in a corpus (size of the vocabulary) Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.” – 16 word tokens, 14 word types (not counting punctuation) ※ “Types” here means wordform types, not lemma types, and punctuation marks will generally be counted as words
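A short Python check of the example above (illustrative; punctuation is stripped here so the counts match the 16/14 figures):

```python
import string

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Drop punctuation so only wordforms are counted in this example.
words = sentence.translate(str.maketrans("", "", string.punctuation)).split()

tokens = len(words)       # N = 16 running words
types = len(set(words))   # B = 14 distinct wordforms ("the" occurs three times)
print(tokens, types)      # -> 16 14
```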
How Many Words in English? • Shakespeare’s complete works • 884,647 wordform tokens • 29,066 wordform types • Brown Corpus • 1 million wordform tokens • 61,805 wordform types • 37,851 lemma types
Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • First attempt: • Suppose there is no corpus available • Use the uniform distribution: P(w) = 1/V • Assume: word types = V (e.g., 100,000), so each word gets probability 1/100,000
Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Second attempt: • Suppose there is a corpus • Assume: • word tokens = N • # times w appears in corpus = C(w) • Relative frequency estimate: P(w) = C(w)/N
Simple (Unsmoothed) N-grams • Task: Estimating the probability of a word • Third attempt: • Suppose there is a corpus • Assume a word depends only on its n-1 previous words (the Markov assumption)
Simple (Unsmoothed) N-grams • n-gram approximation: P(w_k | w_1 … w_(k-1)) ≈ P(w_k | w_(k-n+1) … w_(k-1)) • w_k only depends on its previous n-1 words
Bigram Approximation • Example: P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) <s>: a special word meaning “start of sentence”
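A small Python sketch of this product of bigram factors; the probability values below are hypothetical placeholders, not estimates from any corpus:

```python
# Hypothetical bigram probabilities (illustrative values only).
bigram_prob = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

sentence = ["<s>", "I", "want", "to", "eat", "British", "food"]

p = 1.0
for prev, curr in zip(sentence, sentence[1:]):
    p *= bigram_prob[(prev, curr)]

print(p)  # product of the six bigram factors above
```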
Note on Practical Problem • Multiplying many probabilities results in a very small number and can cause numerical underflow • Use logprob instead in the actual computation
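A brief Python illustration of the same computation done in log space (the probability values are the hypothetical ones from the previous sketch):

```python
import math

# The six bigram factors from the example above.
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]

# Summing log probabilities avoids the underflow that repeated
# multiplication of very small numbers can cause.
logprob = sum(math.log(p) for p in probs)
print(logprob)            # log of the sentence probability
print(math.exp(logprob))  # exponentiate only at the end, if needed
```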
Estimating N-gram Probability • Maximum Likelihood Estimate (MLE): relative frequency in a training corpus • For bigrams: P(w_n | w_(n-1)) = C(w_(n-1) w_n) / C(w_(n-1))
Estimating Bigram Probability • Example: • C(to eat) = 860 • C(to) = 3256 • P(eat | to) = 860 / 3256 ≈ 0.26
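The same arithmetic as a short Python check (counts taken from this slide):

```python
# MLE bigram estimate from the counts on this slide:
#   P(eat | to) = C(to eat) / C(to)
c_to_eat = 860
c_to = 3256

p_eat_given_to = c_to_eat / c_to
print(round(p_eat_given_to, 2))  # -> 0.26
```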
Two Important Facts • N-gram models become increasingly accurate as we increase the value of N • They depend very strongly on their training corpus (in particular its genre and its size in words)
Smoothing • Any particular training corpus is finite • Sparse data problem • Deal with zero probability
Smoothing • Smoothing • Reevaluating zero probability n-grams and assigning them non-zero probability • Also called Discounting • Lowering non-zero n-gram counts in order to assign some probability mass to the zero n-grams
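As one concrete, deliberately simple smoothing scheme, the sketch below shows add-one (Laplace) smoothing in Python. The slides do not prescribe this particular method, so treat it purely as an illustration of discounting seen counts to free probability mass for unseen n-grams.

```python
def laplace_bigram_prob(bigram_counts, context_counts, vocab_size, prev, curr):
    """Add-one (Laplace) smoothed bigram probability.

    Every bigram count is incremented by 1, so unseen bigrams receive a
    small non-zero probability while seen bigrams are discounted slightly.
    """
    numerator = bigram_counts.get((prev, curr), 0) + 1
    denominator = context_counts.get(prev, 0) + vocab_size
    return numerator / denominator

# An unseen bigram no longer has zero probability (toy counts, hypothetical vocabulary size):
print(laplace_bigram_prob({("to", "eat"): 860}, {"to": 3256}, 10_000, "to", "sleep"))
```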
Things Seen Once • Use the count of things seen once to help estimate the count of things never seen
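This is the intuition behind Good-Turing discounting. A minimal Python sketch of just the "mass reserved for unseen events" estimate (the toy counts are hypothetical):

```python
from collections import Counter

def unseen_mass(counts):
    """Estimate the total probability mass of never-seen events as N1 / N,
    where N1 is the number of types seen exactly once and N is the total
    number of tokens."""
    freq_of_freqs = Counter(counts.values())
    n1 = freq_of_freqs[1]
    n = sum(counts.values())
    return n1 / n

# Toy bigram counts: two types were seen exactly once out of five tokens.
print(unseen_mass({("to", "eat"): 3, ("eat", "lunch"): 1, ("eat", "fish"): 1}))  # -> 0.4
```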
Entropy • Measure of uncertainty • Used to evaluate the quality of n-gram models (how well a language model matches a given language) • Entropy of a random variable X: H(X) = -Σ_x p(x) log2 p(x) • Measured in bits • The number of bits needed to encode the information in the optimal coding scheme
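A small Python illustration of the definition (the coin distributions are arbitrary examples):

```python
import math

def entropy(dist):
    """H(X) = -sum_x p(x) * log2 p(x), measured in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A fair coin carries 1 bit of uncertainty; a biased coin carries less.
print(entropy({"heads": 0.5, "tails": 0.5}))  # -> 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))  # -> ~0.47
```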
Cross Entropy • Used for comparing two language models • p: the actual probability distribution that generated some data • m: a model of p (an approximation to p) • Cross entropy of m on p: H(p, m) = lim_(n→∞) -(1/n) Σ p(w_1 … w_n) log m(w_1 … w_n)
Cross Entropy • By the Shannon-McMillan-Breiman theorem, this can be estimated from a single long sample: H(p, m) ≈ -(1/n) log m(w_1 … w_n) • Property of cross entropy: H(p) ≤ H(p, m) • The difference between H(p, m) and H(p) is a measure of how accurate model m is • The more accurate the model, the lower its cross entropy
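A hedged Python sketch of how cross entropy is measured in practice on held-out text, assuming a hypothetical `log2_prob(history, word)` model interface (not defined in the slides):

```python
import math

def cross_entropy(log2_prob, words):
    """Approximate cross entropy in bits per word on held-out text:
        H(p, m) ~ -(1/N) * log2 m(w_1 ... w_N)
    following the Shannon-McMillan-Breiman approximation above.
    `log2_prob(history, word)` is assumed to return log2 m(word | history)."""
    total = sum(log2_prob(words[:i], w) for i, w in enumerate(words))
    return -total / len(words)

# A hypothetical uniform model over a 10,000-word vocabulary:
uniform = lambda history, word: math.log2(1 / 10_000)
print(cross_entropy(uniform, "I want to eat British food".split()))  # ~13.29 bits/word
```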