
n-gram Language Models and Smoothing Algorithms



Presentation Transcript


  1. n-gram Language Models and Smoothing Algorithms 2005.10.17

  2. Today's topics • n-gram language models • Smoothing algorithms • A brief introduction to the Statistical LM Toolkit • Borrows heavily from slides on the Internet, including but not limited to those by Joshua Goodman, Jonathan Henke, Dragomir R. Radev, 刘挺 (Liu Ting), Jim Martin

  3. A bad language model

  4. What is a Language Model? • A language model is a probability distribution over word sequences • P(“And nothing but the truth”) ≈ 0.001 • P(“And nuts sing on the roof”) ≈ 0 • Every word string is assigned a probability: well-formed sentences get high probability, ill-formed ones get low probability • The sum of the probabilities of all word sequences has to be 1.

  5. Language models are indispensable in many applications • Speech recognition • Handwriting recognition • Spelling correction • Remember the last assignment? You can do better! • Optical character recognition • Machine translation

  6. For example, speech recognition: very useful for distinguishing “nothing but the truth” from “nuts sing on de roof”

  7. How to compute the probability: the chain rule • P(“And nothing but the truth”) = P(“And”) × P(“nothing|and”) × P(“but|and nothing”) × P(“the|and nothing but”) × P(“truth|and nothing but the”) • P(w1, w2, …, wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1w2…wn-1)

  8. Markov approximation • Assume each word depends only on a limited local context, e.g. on the previous two words. This is called a trigram model • P(“the|… whole truth and nothing but”) ≈ P(“the|nothing but”) • P(“truth|… whole truth and nothing but the”) ≈ P(“truth|but the”)

  9. With the Markov assumption • An (n−1)-th order Markov model: p(wi | w1 … wi−1) ≈ p(wi | wi−n+1 … wi−1)

  10. Caveat • The formulation P(Word | Some fixed prefix) is not really appropriate in many applications. • It is appropriate if we’re dealing with real-time speech, where we only have access to prefixes. • But if we’re dealing with text, we already have both the right and left contexts. There’s no a priori reason to stick to left contexts.

  11. n-gram language models • The (n−1)-th order Markov approximation is called an n-gram language model (LM) • p(W) = ∏i=1…d p(wi | wi−n+1, …, wi−1), d = |W| • The larger n is, the more parameters must be estimated; assume a vocabulary of 20,000 words • Model / number of parameters: • 0th order (unigram): 19,999 • 1st order (bigram): 20,000 × 19,999 ≈ 400 million • 2nd order (trigram): 20,000² × 19,999 ≈ 8 trillion • 3rd order (four-gram): 20,000³ × 19,999 ≈ 1.6 × 10^17
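
A minimal sketch of the trigram factorization above, assuming a hypothetical `trigram_prob(w, u, v)` lookup that returns p(w | u v); the `<s>` / `</s>` padding follows the sentence-boundary markers mentioned in the data-preparation slide below.

```python
from typing import Callable, List

def sentence_prob(words: List[str],
                  trigram_prob: Callable[[str, str, str], float]) -> float:
    """Trigram approximation: P(w1 ... wd) is approximated by prod_i p(wi | wi-2, wi-1)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= trigram_prob(w, u, v)   # hypothetical lookup: p(w | u v)
    return prob

# Toy check with a uniform "model" that gives every word probability 0.1
print(sentence_prob("and nothing but the truth".split(), lambda w, u, v: 0.1))
```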

  12. N-gram • N=1: unigram • N=2: bigram • N=3: trigram • Unigram and trigram are words but bigram is not? (http://shuan.justrockandroll.com/arc.php?topic=14)

  13. An aside • Monogram? • Digram? • Learn something! • http://phrontistery.info/numbers.html

  14. Discussion of language models • How large should n be? • In theory, the larger the better • Empirically: 3; trigrams are used most often. A 4-gram model needs too many parameters and is very hard to estimate

  15. n=3: • “large green ______” • tree? mountain? frog? car? … • n=5: • “swallowed the large green ______” • pill? broccoli?

  16. Reliability vs. discrimination • larger n: more information about the context of the specific instance • greater discrimination power • but the counts are very sparse • smaller n: more instances in the training data, better statistical estimates • more reliability • more choice

  17. trigrams • How do we find probabilities? • Get real text, and start counting!

  18. Counting Words • Example: “He stepped out into the hall, was delighted to encounter a water brother” - how many words? • Word forms and lemmas. “cat” and “cats” share the same lemma (also tokens and types) • Shakespeare’s complete works: 884,647 word tokens and 29,066 word types • Brown corpus: 61,805 types and 37,851 lemmas (1 million words from 500 texts) • American Heritage 3rd edition has 200,000 “boldface forms” (including some multiword phrases)
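
A quick way to see the token/type distinction on this slide's example sentence, a minimal sketch using only the standard library (the naive tokenizer is an assumption, not the slide's):

```python
from collections import Counter

text = "He stepped out into the hall, was delighted to encounter a water brother"
tokens = [t.strip(",.").lower() for t in text.split()]   # naive tokenization
types = Counter(tokens)

print(len(tokens), "tokens")   # 13: every running word counts
print(len(types), "types")     # 13 here, since no word repeats; over Shakespeare
                               # the gap is huge (884,647 tokens vs 29,066 types)
```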

  19. Data preparation • Remove formatting symbols • Define word boundaries • Define sentence boundaries (insert markers such as <s> and </s>) • Letter case (keep, ignore, or handle intelligently) • Numbers (keep, or replace with <num>, etc.)

  20. n-gram data (example bigram counts): ' s (3550), of the (2507), to be (2235), in the (1917), I am (1366), of her (1268), to the (1142), it was (1010), had been (995), she had (978), to her (965), could not (945), I have (898), of his (880), and the (862), she was (843), have been (837), of a (745), for the (712), in a (707)

  21. Maximum likelihood estimation • The maximum likelihood estimate (MLE) • is the best fit to the training data • Collect trigrams from the training data T • Count C3(wi−2, wi−1, wi), the number of times the three words occur consecutively in T • Count C2(wi−2, wi−1), the number of times the two words occur consecutively in T • pMLE(wi | wi−2, wi−1) = C3(wi−2, wi−1, wi) / C2(wi−2, wi−1) • P(“the | nothing but”) ≈ C3(“nothing but the”) / C2(“nothing but”)
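
A minimal sketch of the MLE trigram estimate from counts, assuming a whitespace-tokenized training corpus; the names C3 and C2 mirror the notation on the slide.

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Count trigrams and their bigram histories, as on the slide:
    pMLE(w | u, v) = C3(u, v, w) / C2(u, v)."""
    C3, C2 = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for u, v, w in zip(words, words[1:], words[2:]):
            C3[(u, v, w)] += 1
            C2[(u, v)] += 1
    return C3, C2

def p_mle(w, u, v, C3, C2):
    return C3[(u, v, w)] / C2[(u, v)] if C2[(u, v)] else 0.0

C3, C2 = train_trigram_mle(["nothing but the truth", "nothing but the best"])
print(p_mle("the", "nothing", "but", C3, C2))   # 2/2 = 1.0
```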

  22. Bigram Probabilities (from martin)

  23. An Aside on Logs • When computing sentence probabilities, you don’t really do all those multiplies. The numbers are too small and lead to underflows • Convert the probabilities to logs and then do additions. • To get the real probability (if you need it) go back to the antilog.
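
A small illustration of the log trick, assuming per-word probabilities are already available from some model:

```python
import math

word_probs = [0.25, 0.15, 0.1, 0.05]             # hypothetical per-word probabilities
log_prob = sum(math.log(p) for p in word_probs)  # add logs instead of multiplying
print(log_prob)            # about -8.58; no underflow even for very long sentences
print(math.exp(log_prob))  # 1.875e-4: take the antilog only if you need the raw value
```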

  24. Some Observations • The following numbers are very informative. Think about what they capture. • P(want|I) = .32 • P(to|want) = .65 • P(eat|to) = .26 • P(food|Chinese) = .56 • P(lunch|eat) = .055

  25. Some more observations • P(I | I): “I I I want” • P(want | I): “I want I want to” • P(I | food): “The food I want is”

  26. MLE is not suitable for NLP • MLE chooses the parameters that give the training corpus the highest probability; it wastes no probability mass on events that never occur in the training corpus • But an MLE model is usually a poor statistical language model for NLP: it assigns zero probability to unseen events, which is not acceptable. (If a probability is 0, what happens to the cross-entropy?)

  27. Example 1 • p(z|xy) = ? • Suppose the training corpus contains: … xya …; … xyd …; … xyd … • xyz never occurs • Can we say p(a|xy) = 1/3, p(d|xy) = 2/3, p(z|xy) = 0/3? • Not necessarily: xyz might be a common combination that just happens not to appear in this training set; on the other hand, perhaps xyz really is ungrammatical and cannot occur

  28. Analysis • The smaller the numerator, the less reliable the estimate • 1/3 may be too high; 100/300 is probably about right • The smaller the denominator, the less reliable the estimate • 1/300 may be too high; 100/30000 is probably about right • In short, estimates whose numerator and denominator are both small are unreliable

  29. “Smoothing” • Develop a model that decreases the probability of seen events and allows previously unseen n-grams to occur • a.k.a. “discounting methods” • Rob the rich to feed the poor!

  30. Smoothing in a nutshell • From each MLE estimate p(w) > 0, derive p’(w) • p’(w) < p(w) • The mass taken from the rich: Σ(p(w) − p’(w)) = D • Distribute D among the events with p(w) = 0 • Ensure Σ p’(w) = 1 • Question: who counts as rich?

  31. Add-one smoothing • The simplest method, but not really usable in practice • T: training data, V: vocabulary, w: word • Smoothing formula: p’(w|h) = (c(h,w) + 1) / (c(h) + |V|) • In particular, for the unconditional distribution: p’(w) = (c(w) + 1) / (|T| + |V|) • Problem: often |V| > c(h), or even |V| >> c(h) • Example: T: <s> what is it what is small? |T| = 8 • V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12 • p(it) = 0.125, p(what) = 0.25, p(.) = 0, p(what is it?) = 0.25² × 0.125² ≈ 0.001, p(it is flying.) = 0.125 × 0.25 × 0 × 0 = 0 • p’(it) = 0.1, p’(what) = 0.15, p’(.) = 0.05, p’(what is it?) = 0.15² × 0.1² ≈ 0.0002, p’(it is flying.) = 0.1 × 0.15 × 0.05² ≈ 0.00004
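
A minimal sketch that reproduces the unconditional add-one numbers on this slide; the tokenization of the toy corpus is the one implied by |T| = 8.

```python
from collections import Counter

tokens = "<s> what is it what is small ?".split()   # |T| = 8
V = {"what", "is", "it", "small", "?", "<s>",
     "flying", "birds", "are", "a", "bird", "."}    # |V| = 12
counts = Counter(tokens)

def p_add_one(w):
    # p'(w) = (c(w) + 1) / (|T| + |V|)
    return (counts[w] + 1) / (len(tokens) + len(V))

print(p_add_one("it"), p_add_one("what"), p_add_one("."))  # 0.1, 0.15, 0.05
```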

  32. Add-one smoothing is also called Laplace’s Law • Laplace’s Law actually gives far too much of the probability space to unseen events. • So can we give a bit less?

  33. ELE • Since adding one may be adding too much, we can add a smaller value λ. • PLid(w1,…,wn) = (C(w1,…,wn) + λ) / (|T| + λ|V|), with λ > 0 ==> Lidstone’s Law • If λ = 1/2, Lidstone’s Law corresponds to the expectation of the likelihood and is called the Expected Likelihood Estimation (ELE) or the Jeffreys-Perks Law.

  34. Example • T: <s> what is it what is small? |T| = 8 • V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12 • p(it) = 0.125, p(what) = 0.25, p(.) = 0, p(what is it?) = 0.25² × 0.125² ≈ 0.001, p(it is flying.) = 0.125 × 0.25 × 0² = 0 • With λ = 0.1: p’(it) ≈ 0.12, p’(what) ≈ 0.23, p’(.) ≈ 0.01, p’(what is it?) ≈ 0.23² × 0.12² ≈ 0.0007, p’(it is flying.) ≈ 0.12 × 0.23 × 0.01² ≈ 0.000003
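
The same toy corpus with the Lidstone estimate; a minimal, self-contained sketch in which λ = 0.1 reproduces the numbers above:

```python
from collections import Counter

tokens = "<s> what is it what is small ?".split()   # |T| = 8
V = {"what", "is", "it", "small", "?", "<s>",
     "flying", "birds", "are", "a", "bird", "."}    # |V| = 12
counts = Counter(tokens)

def p_lidstone(w, lam=0.1):
    # P_Lid(w) = (c(w) + lam) / (|T| + lam * |V|)
    return (counts[w] + lam) / (len(tokens) + lam * len(V))

print(round(p_lidstone("it"), 2), round(p_lidstone("what"), 2),
      round(p_lidstone("."), 2))   # 0.12, 0.23, 0.01  (cf. add-one: 0.1, 0.15, 0.05)
```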

  35. Held-Out Estimator • How much of the probability distribution should be “held out” to allow for previously unseen events? • Validate by holding out part of the training data. • How often do events unseen in the training data occur in the validation data? • (e.g., to choose λ for the Lidstone model)

  36. For each n-gram w1,…,wn, we compute C1(w1,…,wn) and C2(w1,…,wn), the frequencies of w1,…,wn in the training and held-out data, respectively. • Let Nr be the number of n-grams with frequency r in the training text. • Let Hr be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data. • Hr/Nr is the average held-out frequency of one of these n-grams • An estimate for the probability of one of these n-grams is: Pho(w1,…,wn) = Hr / (Nr · H), where C1(w1,…,wn) = r and H is the number of n-grams in the held-out data
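
A minimal sketch of the held-out estimate, assuming the training and held-out data have already been reduced to n-gram count dictionaries; the toy counts at the bottom are hypothetical. (n-grams unseen in training, r = 0, would additionally need N0, the number of unseen types, and are not handled here.)

```python
from collections import Counter, defaultdict

def held_out_probs(train_counts, heldout_counts):
    """Return r -> P_ho for one n-gram that occurred r times in training:
    P_ho = H_r / (N_r * H)."""
    H = sum(heldout_counts.values())          # total n-gram tokens in held-out data
    N_r = Counter(train_counts.values())      # N_r: number of types with training count r
    H_r = defaultdict(int)                    # H_r: held-out occurrences of those types
    for ngram, r in train_counts.items():
        H_r[r] += heldout_counts.get(ngram, 0)
    return {r: H_r[r] / (N_r[r] * H) for r in N_r}

# Toy bigram counts (hypothetical):
train = {("of", "the"): 3, ("to", "be"): 1, ("in", "a"): 1}
heldout = {("of", "the"): 2, ("to", "be"): 1, ("on", "a"): 1}
print(held_out_probs(train, heldout))   # e.g. each r=1 type gets (1+0)/(2*4) = 0.125
```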

  37. Pots of Data for Developing and Testing Models • Training data (80% of total data) • Held Out data (10% of total data). • Test Data (5-10% of total data). • Write an algorithm, train it, test it, note things it does wrong, revise it and repeat many times. • Keep development test data and final test data as development data is “seen” by the system during repeated testing. • Give final results by testing on n smaller samples of the test data and averaging.

  38. Cross-Validation (a.k.a. deleted estimation) • held-out data is used to validate the model • the same data serves for both training and validation • Divide the training data into 2 parts • Train on A, validate on B • Train on B, validate on A • Combine the two models • (Diagram: split the data into A and B; train on A, validate on B → Model 1; train on B, validate on A → Model 2; Model 1 + Model 2 → final model)

  39. Discussion • How do we combine the models? • Combine according to the sizes of the data sets: P = λ·Pt + (1−λ)·Pho combines the training and held-out models • Divide the data into parts 0 and 1. In one model use 0 as the training data and 1 as the held-out data; in the other use 1 as training and 0 as held-out. Take a weighted average of the two: Pdel(w1,…,wn) = (Hr^01 + Hr^10) / ((Nr^0 + Nr^1) · N)

  40. Derivation of the Jelinek formula • Requires data sets 0 and 1 to be the same size, both of size N • Pdel(w1,…,wn) = [Hr^01 / (Nr^0 · N)] · [Nr^0 / (Nr^0 + Nr^1)] + [Hr^10 / (Nr^1 · N)] · [Nr^1 / (Nr^0 + Nr^1)] = (Hr^01 + Hr^10) / ((Nr^0 + Nr^1) · N)
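
A minimal sketch of deleted estimation under the slide's assumption that parts 0 and 1 each contain N n-gram tokens; the toy counts are hypothetical:

```python
from collections import Counter, defaultdict

def cross_counts(counts_a, counts_b):
    """N_r and H_r when part A plays 'training' and part B plays 'held out'."""
    N_r = Counter(counts_a.values())
    H_r = defaultdict(int)
    for ngram, r in counts_a.items():
        H_r[r] += counts_b.get(ngram, 0)
    return N_r, H_r

def deleted_estimate(r, counts0, counts1, N):
    # P_del = (H_r^01 + H_r^10) / ((N_r^0 + N_r^1) * N)
    N_r0, H_r01 = cross_counts(counts0, counts1)
    N_r1, H_r10 = cross_counts(counts1, counts0)
    return (H_r01[r] + H_r10[r]) / ((N_r0[r] + N_r1[r]) * N)

# Toy halves of equal size N = 4 bigram tokens each (hypothetical counts):
part0 = {("of", "the"): 2, ("to", "be"): 1, ("in", "a"): 1}
part1 = {("of", "the"): 2, ("to", "be"): 2}
print(deleted_estimate(1, part0, part1, N=4))   # probability of a once-seen bigram
```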

  41. Assignment 1 • Run the following experiments: download the Jane Austen novels from fsnlp and test: • Build an ELE language model directly on the training set and evaluate it (compute the cross-entropy on the test data) • Split the training set into two equal parts, use them in turn as training and held-out data, and evaluate with deleted estimation; or split the training set 2/3 vs. 1/3 and run the same deleted-estimation test. How well does it work? What if you instead combine the training and held-out probability models according to their sizes and evaluate again?

  42. Notes on the assignment: evaluating a language model with perplexity • PP = 2^H, where H = −(1/N) Σi log2 p(si) • N is the total number of words and i ranges over the sentences
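
A minimal sketch of the perplexity computation for the assignment, assuming a hypothetical `sentence_logprob(s)` that returns log2 p(s) under the model being evaluated:

```python
import math

def perplexity(sentences, sentence_logprob):
    """PP = 2 ** (-(1/N) * sum_i log2 p(s_i)), with N the total number of words."""
    N = sum(len(s.split()) for s in sentences)
    total_log2 = sum(sentence_logprob(s) for s in sentences)
    cross_entropy = -total_log2 / N
    return 2 ** cross_entropy

# Example with a hypothetical uniform model: every word has probability 1/12
uniform = lambda s: len(s.split()) * math.log2(1 / 12)
print(perplexity(["what is it ?", "it is small ."], uniform))   # about 12, as expected
```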

  43. A problem with held-out estimation • Assume H = N. For n-grams that occur once, Nr is usually large. Pho = Hr / (H·Nr) • Merging the training and held-out sets and doing MLE gives P = (r·Nr + Hr) / (2H) • So Pho < P, and therefore: deleted estimation underestimates the expected frequency of objects that were seen once in the training data.

  44. Leaving-one-out (Ney et al. 1997) • The data is divided into K sets and the held-out method is repeated K times.

  45. Witten-Bell smoothing • First compute the probability of an unseen event occurring • Then distribute that probability mass among the as yet unseen types (the ones with zero counts)

  46. Probability of an unseen event: the simple case of unigrams • T is the number of events that are seen for the first time in the corpus; this is just the number of types, since each type had to occur a first time exactly once • N is just the number of observations (tokens) • The total probability mass given to unseen events is T / (N + T)

  47. Distributing • The amount to be distributed is T / (N + T) • The number of events with count zero is Z • So distributing evenly gives each unseen event probability T / (Z (N + T))
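
A minimal sketch of the unigram case described above, assuming a fixed vocabulary so that Z, the number of zero-count types, is well defined; the toy corpus is the one from the add-one example.

```python
from collections import Counter

def witten_bell_unigram(tokens, vocab):
    """Witten-Bell for unigrams: unseen mass T/(N+T), split evenly over the
    Z zero-count types; seen types keep the discounted estimate c(w)/(N+T)."""
    counts = Counter(tokens)
    N = len(tokens)                      # number of observations
    T = len(counts)                      # number of types seen at least once
    Z = len(vocab) - T                   # types never seen
    def prob(w):
        c = counts[w]
        return c / (N + T) if c > 0 else T / (Z * (N + T))
    return prob

vocab = {"what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."}
p = witten_bell_unigram("<s> what is it what is small ?".split(), vocab)
print(p("what"), p("flying"))   # seen: 2/14 = 0.143...; unseen: 6/(6*14) = 0.071...
```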
