This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, section 1: Natural Language Processing Lectures #13: Language Model Smoothing: Discounting Thanks to Dan Klein of UC Berkeley for many of the materials used in this lecture.
Announcements • Project #1, Part 1 • Early: today • Due: Friday • The "under construction" bar on the schedule will be moved forward today.
Objectives • Get comfortable with the process of factoring and smoothing a joint model of a familiar object: text! • Look closely at discounting techniques for smoothing language models.
Review: Language Models • Is a Language Model (i.e., Markov Model) a joint or conditional distribution? Over what? • Is a local model a joint or conditional distribution? Over what? • How are the Markov model and the local model related?
Review: Smoothing • What is smoothing? • Name two ways to accomplish smoothing.
Discounting • Starting point: the Maximum Likelihood Estimator (MLE), a.k.a. the Empirical Estimator • Key discounting problem: What count c*(w,h) should we use for an event that occurred c(w,h) times in N samples? • Call c*(w,h) "r*" for short. • Corollary problem: What probability p* = r*/N* should we use for an event that occurred with empirical probability p = r/N?
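To make the starting point concrete, here is a minimal Python sketch of the MLE (empirical) bigram estimator that the discounting methods below modify; the corpus and function name are illustrative, not from the slides.

```python
from collections import Counter

def mle_bigram_model(tokens):
    """Maximum-likelihood (empirical) bigram estimates: p(w | h) = c(h, w) / c(h)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])
    return {(h, w): c / histories[h] for (h, w), c in bigrams.items()}

# Tiny illustrative corpus (not from the lecture).
tokens = "the interest rates the the interest rates the federal".split()
p = mle_bigram_model(tokens)
print(p[("interest", "rates")])  # 1.0 -- and every unseen bigram gets probability 0
```

The zero probabilities the MLE assigns to unseen events are exactly what discounting redistributes mass to fix.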
Discounting • [Figure: bar chart comparing observed counts r with discounted counts r* for words such as allegations, reports, claims, request, attack, man, outcome]
Zipf • Rank word types by token frequency • Zipf’s Law • Frequency is inversely proportional to rank • Most word types are rare
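A quick empirical check of Zipf's law, as a sketch: count word types, sort by frequency, and inspect rank versus frequency; the example tokens are illustrative.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

tokens = "the the the interest interest rates federal".split()
print(rank_frequencies(tokens))  # [(1, 3), (2, 2), (3, 1), (4, 1)]
# On real corpora, rank * frequency stays roughly constant, and most types
# sit in a long tail of counts 1 and 2 -- which is why smoothing matters.
```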
Discounting: Add-One Estimation • Idea: pretend we saw every n-gram once more than we actually did [Laplace]: p*(w|h) = (c(w,h) + 1) / (c(h) + V) • And for unigrams: p*(w) = (c(w) + 1) / (N + V) • Observed count r > 0 → re-estimated count 0 < r* < r • V = N1+ + N0 (word types seen at least once plus types never seen) • Reserves V/(N+V) of the probability mass for the "extra" events • Observed: N1+/(N+V) is distributed back to seen n-grams • "Unobserved": N0/(N+V) is reserved for unseen words (nearly all of the vocabulary!) • Actually tells us not only how much mass to hold out, but where to put it (evenly) • Works astonishingly poorly in practice
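A minimal sketch of the add-one estimate above, assuming the counts are stored in collections.Counter objects and V is the vocabulary size; the function name is illustrative.

```python
from collections import Counter

def add_one_prob(w, h, bigram_counts, history_counts, V):
    """Add-one (Laplace) estimate: p*(w | h) = (c(h, w) + 1) / (c(h) + V)."""
    return (bigram_counts[(h, w)] + 1) / (history_counts[h] + V)

# Every unseen (h, w) pair now gets 1 / (c(h) + V) instead of 0, so with a
# large vocabulary nearly all of the reserved mass goes to unseen events --
# which is why add-one performs so poorly in practice.
```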
Discounting: Add-Epsilon • Quick fix: add some small ε instead of 1 [Lidstone, Jeffreys] • Slightly better, holds out less mass • Still a bad idea
Discounting: Add-Epsilon • Adding ε to every count amounts to assuming a uniform prior of 1/V over the vocabulary, from a Bayesian p.o.v.
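A sketch of the add-ε variant, which simply replaces the added 1 with a small constant ε; the default value here is arbitrary and chosen only for illustration.

```python
def add_epsilon_prob(w, h, bigram_counts, history_counts, V, eps=0.01):
    """Add-epsilon (Lidstone) estimate: p*(w | h) = (c(h, w) + eps) / (c(h) + eps * V).

    Writing eps as strength * (1 / V) gives the Bayesian reading: a uniform
    prior of 1/V over the vocabulary, weighted by a prior strength.
    """
    return (bigram_counts[(h, w)] + eps) / (history_counts[h] + eps * V)
```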
Discounting: Unigram Prior • Better to assume a unigram prior, where the unigram is empirical or smoothed itself: p*(w|h) = (c(w,h) + λ p̂(w)) / (c(h) + λ)
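A sketch of the unigram-prior estimate, assuming unigram_prob is a function returning the (empirical or smoothed) unigram probability and lam is the prior strength; both names are illustrative.

```python
def unigram_prior_prob(w, h, bigram_counts, history_counts, unigram_prob, lam=1.0):
    """Unigram-prior estimate: p*(w | h) = (c(h, w) + lam * p_uni(w)) / (c(h) + lam)."""
    return (bigram_counts[(h, w)] + lam * unigram_prob(w)) / (history_counts[h] + lam)

# Unlike add-one or add-epsilon, the reserved mass is spread according to how
# common each word is overall, not evenly across the vocabulary.
```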
How Much Mass to Withhold? • Remember the key discounting problem: • What count r* should we use for an event that occurred r times in N samples? • For the rich, r is too big • For the poor, r is too small (0)
Idea: Use Held-out Data • [Figure: two equal-sized samples of N words (or n-grams) each, containing tokens such as "the", "interest rates", "the Federal", with counts compared across the two samples] [Jelinek and Mercer]
Discounting: Held-out Estimation • Summary: • Get N samples (sample #1) • Collect the set of types occurring r times in sample #1 • Get another N samples (sample #2) • What is the average frequency of those types in sample #2? • e.g., in our example, doubletons occurred on average 1.5 times • Use that average as r*: r = 2 → r* = 1.5 • In practice, much better than add-one, etc. • Main issue: requires a large (and equal-sized) amount of held-out data
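A sketch of the held-out computation summarized above: group types by their count r in sample #1 and average their counts in sample #2 to get r*; the function name is illustrative.

```python
from collections import Counter, defaultdict

def held_out_counts(sample1, sample2):
    """Map each count r seen in sample1 to r* = the average count in sample2
    of the types that occurred exactly r times in sample1."""
    c1, c2 = Counter(sample1), Counter(sample2)
    by_r = defaultdict(list)
    for w, r in c1.items():
        by_r[r].append(c2[w])  # c2[w] may well be 0 -- that is the point
    return {r: sum(counts) / len(counts) for r, counts in by_r.items()}

# e.g. if types occurring twice in sample #1 occur on average 1.5 times in
# sample #2, then r = 2 is discounted to r* = 1.5.
```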
What’s Next • Upcoming lectures: • More sophisticated discounting strategies • Overview of Speech Recognition • The value of language models