N-gram Language Models and Smoothing Algorithms 2005.10.17
Today's topics • N-gram language models • Smoothing algorithms • A brief introduction to statistical LM toolkits • Borrows heavily from slides on the Internet, including but not limited to those by Joshua Goodman, Jonathan Henke, Dragomir R. Radev, 刘挺, Jim Martin
What is a Language Model? • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0 • Every word string is assigned a probability: high for well-formed sentences, low for ill-formed ones • The sum of the probabilities of all word sequences has to be 1
Language models are indispensable in many applications • Speech recognition • Handwriting recognition • Spelling correction • Remember the last assignment? You can do better! • Optical character recognition • Machine translation
For example, speech recognition: very useful for distinguishing “nothing but the truth” from “nuts sing on de roof”
How do we compute the probability? The chain rule • P(“And nothing but the truth”) = P(“And”) P(“nothing | and”) P(“but | and nothing”) P(“the | and nothing but”) P(“truth | and nothing but the”) • P(w1, w2, …, wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 w2 … wn-1)
Markov approximation • Assume each word depends only on a limited local context, e.g. on the previous two words; this is called a trigram model • P(“the | … whole truth and nothing but”) ≈ P(“the | nothing but”) • P(“truth | … whole truth and nothing but the”) ≈ P(“truth | but the”)
With the Markov assumption • An (N-1)th-order Markov model: P(wi | w1 … wi-1) ≈ P(wi | wi-N+1 … wi-1)
Caveat • The formulation P(Word| Some fixed prefix) is not really appropriate in many applications. • It is if we’re dealing with real time speech where we only have access to prefixes. • But if we’re dealing with text we already have the right and left contexts. There’s no a priori reason to stick to left contexts.
N-gram language models • The (n-1)th-order Markov approximation is called an n-gram language model (LM) • p(W) = ∏i=1..d p(wi | wi-n+1, …, wi-1), d = |W| • The larger n is, the more parameters must be estimated. Assuming a vocabulary of 20,000 words: • Model — number of parameters • 0th order (unigram): 19,999 • 1st order (bigram): 20,000 × 19,999 ≈ 400 million • 2nd order (trigram): 20,000² × 19,999 ≈ 8 trillion • 3rd order (four-gram): 20,000³ × 19,999 ≈ 1.6 × 10¹⁷
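To make this concrete, here is a minimal Python sketch (not from the original slides) of scoring a sentence with a trigram model under the second-order Markov approximation; the table trigram_prob is a hypothetical stand-in for probabilities estimated from a corpus.

```python
# Minimal sketch: score a sentence with a trigram model under the
# 2nd-order Markov approximation. The probabilities below are made up.
trigram_prob = {
    ("<s>", "<s>", "and"): 0.02,
    ("<s>", "and", "nothing"): 0.01,
    ("and", "nothing", "but"): 0.30,
    ("nothing", "but", "the"): 0.40,
    ("but", "the", "truth"): 0.05,
}

def trigram_sentence_prob(words, probs):
    """P(w1..wn) ~ prod_i P(wi | wi-2, wi-1), with <s> padding on the left."""
    padded = ["<s>", "<s>"] + [w.lower() for w in words]
    p = 1.0
    for i in range(2, len(padded)):
        p *= probs.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return p

print(trigram_sentence_prob("And nothing but the truth".split(), trigram_prob))
```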
N-gram • N=1: unigram • N=2: bigram • N=3: trigram • “Unigram” and “trigram” are words but “bigram” is not? (http://shuan.justrockandroll.com/arc.php?topic=14)
An aside • Monogram? • Digram? • Learn something! • http://phrontistery.info/numbers.html
Discussion of language models • How large should n be? • In theory, the larger the better • In practice: 3; trigrams are used most often. Four-gram models require too many parameters to estimate reliably.
n=3: • “large green ___________” • tree? mountain? frog? car?... • n=5 • “swallowed the large green ________” • pill? broccoli?
Reliability vs. Discrimination • Larger n: more information about the context of the specific instance • greater discrimination power • but the data is very sparse • Smaller n: more instances in the training data, better statistical estimates • more reliability • but each context leaves more word choices open (less discrimination)
trigrams • How do we find probabilities? • Get real text, and start counting!
Counting Words • Example: “He stepped out into the hall, was delighted to encounter a water brother” - how many words? • Word forms and lemmas. “cat” and “cats” share the same lemma (also tokens and types) • Shakespeare’s complete works: 884,647 word tokens and 29,066 word types • Brown corpus: 61,805 types and 37,851 lemmas (1 million words from 500 texts) • American Heritage 3rd edition has 200,000 “boldface forms” (including some multiword phrases)
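As a quick illustration of the token/type distinction, a tiny Python sketch (the numbers here are illustrative, not the Shakespeare or Brown figures above):

```python
# Tokens vs. types on the example sentence above (very crude whitespace split).
tokens = "he stepped out into the hall was delighted to encounter a water brother".split()
types = set(tokens)
print(len(tokens), "word tokens,", len(types), "word types")
```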
Data preparation • Remove formatting symbols • Define word boundaries • Define sentence boundaries (insert markers such as <s> and </s>) • Letter case (keep, ignore, or detect intelligently) • Numbers (keep, replace with <num>, etc.)
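A rough Python sketch of such a preprocessing pass; the specific rules (lowercasing, <num> replacement, the token pattern) are illustrative assumptions, not a prescribed pipeline:

```python
# Sketch of the preparation steps listed above: strip markup, normalize case,
# replace numbers with <num>, split into tokens, add sentence boundary markers.
import re

def prepare(line):
    line = re.sub(r"<[^>]+>", " ", line)              # remove formatting/markup symbols
    line = line.lower()                                # case: here we simply lowercase
    line = re.sub(r"\d+", "<num>", line)               # numbers -> <num>
    tokens = re.findall(r"<num>|[a-z]+|[.?!]", line)   # crude word/punctuation boundaries
    return ["<s>"] + tokens + ["</s>"]                 # sentence boundary markers

print(prepare("He stepped out into the hall at 3 o'clock."))
```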
N-gram data (example bigram counts):
' s 3550   of the 2507   to be 2235   in the 1917   I am 1366
of her 1268   to the 1142   it was 1010   had been 995   she had 978
to her 965   could not 945   I have 898   of his 880   and the 862
she was 843   have been 837   of a 745   for the 712   in a 707
Maximum likelihood estimation • The maximum likelihood estimate (MLE) • is the best fit to the training data • Extract trigrams from the training data T • Count how often each sequence of three consecutive words occurs in T: C3(wi-2, wi-1, wi) • Count how often each sequence of two consecutive words occurs in T: C2(wi-2, wi-1) • pMLE(wi | wi-2, wi-1) = C3(wi-2, wi-1, wi) / C2(wi-2, wi-1) • P(“the” | “nothing but”) = C3(“nothing but the”) / C2(“nothing but”)
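A minimal Python sketch of trigram MLE from raw counts, following the formula above:

```python
# pMLE(w3 | w1, w2) = C3(w1, w2, w3) / C2(w1, w2), estimated from a token list.
from collections import Counter

def train_mle_trigrams(tokens):
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # counts of 3 consecutive words
    c2 = Counter(zip(tokens, tokens[1:]))               # counts of 2 consecutive words
    return {(w1, w2, w3): c / c2[(w1, w2)] for (w1, w2, w3), c in c3.items()}

tokens = "and nothing but the truth and nothing but lies".split()
probs = train_mle_trigrams(tokens)
print(probs[("nothing", "but", "the")])   # C3("nothing but the") / C2("nothing but") = 1/2
```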
An Aside on Logs • When computing sentence probabilities, you don’t really do all those multiplies. The numbers are too small and lead to underflows • Convert the probabilities to logs and then do additions. • To get the real probability (if you need it) go back to the antilog.
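For example, a small sketch of working in log space:

```python
# Sum log probabilities instead of multiplying raw ones to avoid underflow;
# the antilog (exp) recovers the actual probability if it is ever needed.
import math

word_probs = [0.25, 0.125, 0.001, 0.0004]        # hypothetical per-word probabilities
log_p = sum(math.log(p) for p in word_probs)     # additions in log space
print(log_p, math.exp(log_p))                    # log probability and its antilog
```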
Some Observations • The following numbers are very informative. Think about what they capture. • P(want|I) = .32 • P(to|want) = .65 • P(eat|to) = .26 • P(food|Chinese) = .56 • P(lunch|eat) = .055
Some More Observations • P(I | I) • P(want | I) • P(I | food) • Examples: “I I I want”, “I want I want to”, “The food I want is”
MLE is not suitable for NLP • MLE chooses the parameters that give the training corpus the highest probability; it wastes no probability mass on events that never occur in the training corpus • But an MLE model is usually unsuitable as a statistical language model for NLP: it assigns zero probability to unseen events, which is not acceptable. (And if a probability is 0, what happens to the cross-entropy?)
Example 1 • p(z | xy) = ? • Suppose the training corpus contains: … xya …; … xyd …; … xyd … • xyz never occurs • Can we conclude p(a | xy) = 1/3, p(d | xy) = 2/3, p(z | xy) = 0/3? • Not necessarily: xyz might be a common combination that simply did not appear in this training set; on the other hand, perhaps xyz really is ungrammatical
Analysis • The smaller the numerator, the less reliable the estimate • 1/3 may be too high; 100/300 is probably about right • The smaller the denominator, the less reliable the estimate • 1/300 may be too high; 100/30000 is probably about right • In short, estimates whose numerator and denominator are both small are unreliable
“Smoothing” • Develop a model which decreases the probability of seen events and allows the occurrence of previously unseen n-grams • a.k.a. “discounting methods” • Rob the rich to help the poor!
Smoothing in brief • From the MLE estimates p(w) > 0, derive p'(w) • p'(w) < p(w) • What is taken from the rich: Σ (p(w) − p'(w)) = D • Distribute D over the events with p(w) = 0 • Ensure that Σ p'(w) = 1 • Question: who counts as “rich”?
Add-one smoothing • The simplest method, but not really usable in practice • T: training data, V: vocabulary, w: word • Smoothing formula: p'(w | h) = (c(h, w) + 1) / (c(h) + |V|) • In particular, for the unconditional distribution: p'(w) = (c(w) + 1) / (|T| + |V|) • Problem: often |V| > c(h), or even |V| >> c(h) • Example: T: <s> what is it what is small ?  |T| = 8 • V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12 • p(it) = 0.125, p(what) = 0.25, p(.) = 0 • p(what is it?) = 0.25² × 0.125² ≈ 0.001 • p(it is flying.) = 0.125 × 0.25 × 0 × 0 = 0 • p'(it) = 0.1, p'(what) = 0.15, p'(.) = 0.05 • p'(what is it?) = 0.15² × 0.1² ≈ 0.0002 • p'(it is flying.) = 0.1 × 0.15 × 0.05² ≈ 0.00004
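A short Python sketch of add-one smoothing for the unconditional (unigram) case, reproducing the toy numbers above:

```python
# p'(w) = (c(w) + 1) / (|T| + |V|) on the toy training data and vocabulary.
from collections import Counter

T = "<s> what is it what is small ?".split()                       # |T| = 8
V = {"what", "is", "it", "small", "?", "<s>",
     "flying", "birds", "are", "a", "bird", "."}                   # |V| = 12
counts = Counter(T)

def p_add_one(w):
    return (counts[w] + 1) / (len(T) + len(V))

print(p_add_one("it"), p_add_one("what"), p_add_one("."))          # 0.1  0.15  0.05
```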
Add-one smoothing is also called Laplace's Law • Laplace's Law actually gives far too much of the probability space to unseen events • So can we give them a little less?
ELE • Since the adding-one process may be adding too much, we can instead add a smaller value λ • PLid(w1, …, wn) = (C(w1, …, wn) + λ) / (|T| + λ|V|), with λ > 0 ⟹ Lidstone's Law • If λ = 1/2, Lidstone's Law corresponds to the expectation of the likelihood and is called the Expected Likelihood Estimation (ELE) or the Jeffreys-Perks Law
Example • T: <s> what is it what is small ?  |T| = 8 • V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12 • p(it) = 0.125, p(what) = 0.25, p(.) = 0 • p(what is it?) = 0.25² × 0.125² ≈ 0.001 • p(it is flying.) = 0.125 × 0.25 × 0² = 0 • Take λ = 0.1 • p'(it) = 0.12, p'(what) = 0.23, p'(.) = 0.01 • p'(what is it?) = 0.23² × 0.12² ≈ 0.0007 • p'(it is flying.) = 0.12 × 0.23 × 0.01² ≈ 0.000003
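The same toy data as a Lidstone sketch in Python; setting λ = 1 recovers add-one smoothing:

```python
# p'(w) = (c(w) + lam) / (|T| + lam * |V|) with lam = 0.1, matching the example above.
from collections import Counter

T = "<s> what is it what is small ?".split()
V = {"what", "is", "it", "small", "?", "<s>",
     "flying", "birds", "are", "a", "bird", "."}
counts, lam = Counter(T), 0.1

def p_lidstone(w):
    return (counts[w] + lam) / (len(T) + lam * len(V))

print(round(p_lidstone("it"), 2), round(p_lidstone("what"), 2), round(p_lidstone("."), 2))
# -> 0.12 0.23 0.01
```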
Held-Out Estimator • How much of the probability distribution should be “held out” to allow for previously unseen events? • Validate by holding out part of the training data • How often do events unseen in the training data occur in the validation data? • (e.g., to choose λ for the Lidstone model)
• For each n-gram w1, …, wn, we compute C1(w1, …, wn) and C2(w1, …, wn), the frequencies of w1, …, wn in the training and held-out data, respectively • Let Nr be the number of n-grams with frequency r in the training text • Let Hr be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data • Hr / Nr is the average held-out frequency of one of these n-grams • An estimate for the probability of one of these n-grams is: Pho(w1, …, wn) = Hr / (Nr · H), where C1(w1, …, wn) = r and H is the number of n-grams in the held-out data
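A minimal Python sketch of this held-out estimator on made-up bigram data (it only covers n-grams seen in training; handling r = 0 would additionally require counting the unseen types):

```python
# Pho(w1..wn) = Hr / (Nr * H), where r is the training frequency of the n-gram,
# Nr the number of training n-grams with that frequency, Hr their total count
# in the held-out data, and H the number of held-out n-grams.
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def held_out_probs(train_ngrams, heldout_ngrams):
    c1, c2 = Counter(train_ngrams), Counter(heldout_ngrams)
    n_r = Counter(c1.values())                   # Nr
    h_r = Counter()                              # Hr
    for gram, r in c1.items():
        h_r[r] += c2[gram]
    H = len(heldout_ngrams)
    return {gram: h_r[r] / (n_r[r] * H) for gram, r in c1.items()}

train = bigrams("a b a b a c a b".split())
heldout = bigrams("a b a c a b b a".split())
print(held_out_probs(train, heldout))
```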
Pots of Data for Developing and Testing Models • Training data (80% of total data) • Held-out data (10% of total data) • Test data (5-10% of total data) • Write an algorithm, train it, test it, note things it does wrong, revise it, and repeat many times • Keep development test data and final test data separate, since development data is “seen” by the system during repeated testing • Give final results by testing on n smaller samples of the test data and averaging
Cross-Validation (a.k.a. deleted estimation) • Held-out data is used to validate the model • Divide the data so that each part serves for both training and validation • Divide the data into two parts, A and B • Model 1: train on A, validate on B • Model 2: train on B, validate on A • Combine the two models into the final model
Discussion • How do we combine the models? • Combine by weighting according to size: P = λ·Pt + (1−λ)·Pho merges the training and held-out sets • Divide the data into parts 0 and 1. In one model use 0 as the training data and 1 as the held-out data; in the other, use 1 as training and 0 as held-out. Take a weighted average of the two: Pdel(w1, …, wn) = (Hr01 + Hr10) / ((Nr0 + Nr1) · N)
Jelinek's derivation of the formula • Requires data sets 0 and 1 to have the same size N • Pdel(w1, …, wn) = Hr01 / (Nr0 · N) × Nr0 / (Nr0 + Nr1) + Hr10 / (Nr1 · N) × Nr1 / (Nr0 + Nr1) = (Hr01 + Hr10) / ((Nr0 + Nr1) · N)
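A sketch of deleted estimation in Python, under the assumption above that the two halves have the same size N; the example data is made up, and the r = 0 case (which needs the number of unseen types) is left out:

```python
# Pdel = (Hr01 + Hr10) / ((Nr0 + Nr1) * N) for any n-gram seen r >= 1 times.
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def deleted_estimate(part0, part1, r):
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0)                                            # size of each half
    nr0, nr1 = Counter(c0.values()), Counter(c1.values())
    hr01 = sum(c1[g] for g, cnt in c0.items() if cnt == r)    # train on 0, validate on 1
    hr10 = sum(c0[g] for g, cnt in c1.items() if cnt == r)    # train on 1, validate on 0
    return (hr01 + hr10) / ((nr0[r] + nr1[r]) * N)

part0 = bigrams("a b a b a c a b".split())
part1 = bigrams("a b a c a b b a".split())
print(deleted_estimate(part0, part1, 1))
```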
Assignment 1 • Run the following experiments: download the Jane Austen novels from the FSNLP site and test • Build an ELE language model directly on the training set and evaluate it (compute the cross-entropy on the test data) • Split the training set into two equal parts, use them as training and held-out data, and evaluate deleted estimation; or split the training set 2/3 vs. 1/3 and run the same test. How well does it work? What if, instead, you merge the training and held-out probability models by weighting according to size and then evaluate? How does that compare?
Notes on the assignment. Evaluating a language model: perplexity • PP = 2^(−(1/N) Σi log2 P(si)), where N is the total number of words and i ranges over the sentences
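A small Python sketch of computing perplexity over a test set; sentence_log2prob stands in for whatever trained model you use to score a sentence in log2 space:

```python
# PP = 2 ** (-(1/N) * sum_i log2 P(s_i)), N = total number of words in the test set.
import math

def perplexity(sentences, sentence_log2prob):
    N = sum(len(s) for s in sentences)
    total_log2 = sum(sentence_log2prob(s) for s in sentences)
    return 2 ** (-total_log2 / N)

# Toy usage: if every word got probability 0.1, perplexity should be exactly 10.
toy_model = lambda sent: sum(math.log2(0.1) for _ in sent)
print(perplexity([["what", "is", "it"], ["it", "is", "small"]], toy_model))
```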
A problem with held-out estimation • Suppose H = N. For n-grams that occurred once, Nr is generally very large. Pho = Hr / (H · Nr) • Merging the training and held-out sets and doing MLE gives P = (r·Nr + Hr) / (2H) • So Pho < P, and therefore deleted estimation underestimates the expected frequency of objects that were seen once in the training data
Leaving-one-out (Ney et al., 1997) • The data is divided into K sets and the held-out method is repeated K times
Witten-Bell smoothing • First compute the probability of an unseen event occurring • Then distribute that probability mass among the as yet unseen types (the ones with zero counts)
Probability of an unseen event: the simple case of unigrams • T is the number of events that are seen for the first time in the corpus • This is just the number of types, since each type had to occur for the first time exactly once • N is just the number of observations (tokens) • The total probability mass reserved for unseen events is T / (N + T)
Distributing • The amount to be distributed is T / (N + T) • Z is the number of events (types) with count zero • So distributing it evenly gives each unseen type probability T / (Z · (N + T))
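Putting the unigram case together, a minimal Python sketch of Witten-Bell smoothing as just described (the toy vocabulary is made up):

```python
# Seen type:   p(w) = c(w) / (N + T)
# Unseen type: p(w) = T / (Z * (N + T))
# with N = tokens, T = observed types, Z = vocabulary types with zero count.
from collections import Counter

def witten_bell_unigram(tokens, vocab):
    counts = Counter(tokens)
    N, T = len(tokens), len(counts)
    Z = len(vocab) - T                                   # zero-count types
    def p(w):
        c = counts[w]
        return c / (N + T) if c > 0 else T / (Z * (N + T))
    return p

p = witten_bell_unigram("what is it what is small ?".split(),
                        {"what", "is", "it", "small", "?", "flying", "bird", "."})
print(p("what"), p("flying"))                            # seen vs. unseen word
```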