Document Representation • Bag-of-words • Bag-of-facts • Bag-of-sentences • Bag-of-nets
Language Modeling in IR 2008-03-06
Document = Bag of Words • Document = Bag of Sentences, Sentence = word sequence • Example (Chinese word segmentation): p(南京市长)p(江大桥|南京市长) << p(南京市)p(长江大桥|南京市) (the segmentation "Nanjing mayor / river bridge" is far less likely than "Nanjing City / Yangtze River Bridge") • p(中国人民大学) >> p(中国大学人民) (the correct word order, "Renmin University of China", is far more likely than the scrambled one)
Agenda • Introduction to Language Models • What is an LM? • How can we use LMs? • What are the major issues in LMs?
What is an LM? • A "language" is a probability distribution over its alphabet; the distribution gives the likelihood that any sequence of symbols is a sentence (or any other linguistic unit) of that language. This probability distribution is called a language model. • Given a language, we can estimate the probability that any "sentence" (symbol string) occurs in it. • Example: for English, we expect p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊) ("棕熊" is Chinese for "brown bear") • If p1 = p2 the model is a first-order (unigram) language model; otherwise it is a higher-order model
Basic Notation • M: the language we are trying to model; it can be thought of as a source • s: an observation (a string of tokens) • P(s|M): the probability of observation "s" in M, that is, the probability of getting "s" when sampling randomly from M
Basic Notation • Let S = s1s2…sn be any sentence • P(S) = P(s1)P(s2|s1)…P(sn|s1,s2,…,sn-1) • Under an n-gram model, P(si|s1,s2,…,si-1) = P(si|si-n+1,…,si-1) • For n = 1 (unigram), P(si|s1,s2,…,si-1) = P(si)
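A minimal sketch of the chain-rule factorization and its n-gram truncation, assuming a toy table of conditional probabilities; all names and numbers below are illustrative, not from the slides:

```python
# Toy sketch: P(S) = prod_i P(s_i | history), where the history is truncated to n-1 words.
def ngram_sentence_prob(sentence, cond_prob, n=2):
    """P(S) under an n-gram model: each word is conditioned on the previous n-1 words."""
    prob = 1.0
    for i, word in enumerate(sentence):
        history = tuple(sentence[max(0, i - (n - 1)):i])  # last n-1 words (empty for unigram)
        prob *= cond_prob.get((history, word), 1e-9)      # tiny floor for unseen events (assumption)
    return prob

# Hypothetical bigram table: {((previous_word,), word): probability}
cond_prob = {
    ((), "a"): 0.1, (("a",), "quick"): 0.2,
    (("quick",), "brown"): 0.3, (("brown",), "dog"): 0.4,
}
print(ngram_sentence_prob(["a", "quick", "brown", "dog"], cond_prob, n=2))
```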
How can we use LMs in IR • Use the LM to model the process of query generation: • Every document in a collection defines a "language" • P(s|MD) is the probability that the document's author would write down the string "s" • Now suppose "q" is the user's query • P(q|MD), the probability of "q" under random sampling from document D, can be used as the ranking score of document D in the collection
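A hedged sketch of query-likelihood ranking: each document defines a unigram model and documents are ordered by log P(q|MD). The tiny corpus and the additive floor used to avoid zero probabilities are assumptions for illustration only:

```python
import math
from collections import Counter

def score(query, doc_tokens, vocab_size, eps=1e-6):
    """log P(q | M_D) under a unigram document model, with a small additive floor (assumption)."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return sum(math.log((counts[w] + eps) / (length + eps * vocab_size)) for w in query)

docs = {"d1": "a quick brown dog".split(), "d2": "the dog sleeps".split()}
vocab = {w for d in docs.values() for w in d}
query = "brown dog".split()
ranking = sorted(docs, key=lambda d: score(query, docs[d], len(vocab)), reverse=True)
print(ranking)  # documents ordered by query likelihood
```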
Other ways to Rank • Query likelihood: rank by P(Q|MD), i.e., by how well the document model can generate the query. • Document likelihood: rank by P(D|MQ), i.e., by how well the query model can generate the document. • Model comparison: rank by comparing the query model with the document model (e.g., via the divergence KL(MQ || MD)), i.e., by the similarity between the two models.
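The model-comparison option is commonly realized as a negative KL divergence between the query model and the document model; a rough sketch, assuming both models are already smoothed distributions over a toy vocabulary:

```python
import math

def neg_kl(query_model, doc_model):
    """Rank score: -KL(M_Q || M_D) reduces to sum_w P(w|M_Q) * log P(w|M_D) plus a constant,
    because the entropy of M_Q is the same for every document and can be dropped for ranking."""
    return sum(p_q * math.log(doc_model[w]) for w, p_q in query_model.items() if p_q > 0)

# Illustrative, already-smoothed distributions over a tiny vocabulary.
m_q = {"brown": 0.5, "dog": 0.5, "cat": 0.0}
m_d = {"brown": 0.3, "dog": 0.4, "cat": 0.3}
print(neg_kl(m_q, m_d))
```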
Major issues in applying LMs • What kind of language model should we use? • Unigram or high-order models • How can we estimate model parameters? • Basic model or advanced model • Data smoothing approaches
What kind of model is better? • Unigram model • Bigram model • Higher-order model
Unigram LMs • Words are "sampled" independently of each other • The joint probability decomposes into a product of marginals • P(xyz) = p(x)p(y)p(z) • P("brown dog") = P("dog")P("brown") • Probability estimation: simple counting
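A small sketch of the unigram view: the model is estimated by simple counting, and a joint probability is computed as a product of marginals, so word order makes no difference; the example document is made up:

```python
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w) = count(w) / |S| (simple counting)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def joint_prob(words, model):
    """Under the independence assumption the joint decomposes into a product of marginals."""
    prob = 1.0
    for w in words:
        prob *= model.get(w, 0.0)  # unseen words get probability 0 here (the zero-frequency problem)
    return prob

doc = "the quick brown dog chased the brown dog".split()
m = unigram_model(doc)
print(joint_prob(["brown", "dog"], m), joint_prob(["dog", "brown"], m))  # identical: order is ignored
```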
Higher-order Models • n-gram: condition on preceding words • Cache: condition on a window • Grammar: condition on a parse tree • Are they useful? • Parameter estimation is very expensive!
Comparison • Song and Croft reported that a mixture of unigram and bigram language models performs about 8% better than the unigram model alone. However, Victor Lavrenko pointed out that the multi-gram models used by Song and Croft do not consistently outperform the unigram model. • David R. H. Miller also reported that a mixture of unigram and bigram models outperforms the unigram model alone. • Some studies also suggest that word order has little effect on retrieval results.
Major issues in applying LMs • What kind of language model should we use? • Unigram or high-order models • How can we estimate model parameters? • Basic model or advanced model • Data smoothing approaches
Estimation of parameters • Given a string of text S (= Q or D), estimate its LM: Ms • Basic LMs • Maximum-likelihood estimation • The zero-frequency problem • Discounting techniques • Interpolation methods
Maximum-likelihood estimation • Let V be the vocabulary of M, Q = q1q2…qm be a query with qi ∈ V, and S = d1d2…dn be a document • Let Ms be the language model of S • P(Q|Ms) = ?, called the query likelihood • P(Ms|Q) = P(Q|Ms)P(Ms)/P(Q) ∝ P(Q|Ms)P(Ms) can be used as the ranking score of document S • We need to estimate P(Q|Ms) and P(Ms)
Maximum-likelihood estimation • Two ways to estimate P(Q|Ms): • Multivariate Bernoulli model • Multinomial model • Bernoulli model • Considers only whether a word occurs in the query, not how many times; the query is viewed as the outcome of |V| independent Bernoulli trials • P(Q|Ms) = ∏w∈Q P(w|Ms) · ∏w∉Q (1 − P(w|Ms))
Maximum-likelihood estimation • Multinomial model • The query is viewed as the outcome of a sequence of multinomial trials, so the number of times each word occurs in the query is taken into account • P(Q|Ms) = ∏qi∈Q P(qi|Ms) = ∏w∈Q P(w|Ms)^#(w,Q) • Both approaches reduce to estimating P(w|Ms); that is, the IR problem is reduced to estimating the document language model, so results from LM research can be reused
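A sketch contrasting the two estimates, assuming a toy smoothed document model p_w; note that the Bernoulli product runs over the whole vocabulary, while the multinomial product runs over query positions:

```python
def bernoulli_likelihood(query_terms, p_w, vocab):
    """Multivariate Bernoulli: product over the vocabulary of P(w) if w occurs in the query, else 1 - P(w)."""
    q = set(query_terms)
    prob = 1.0
    for w in vocab:
        prob *= p_w[w] if w in q else (1.0 - p_w[w])
    return prob

def multinomial_likelihood(query_terms, p_w):
    """Multinomial: product over query positions, so repeated terms count multiple times."""
    prob = 1.0
    for w in query_terms:
        prob *= p_w[w]
    return prob

# Hypothetical smoothed document model over a toy vocabulary.
vocab = ["brown", "dog", "cat"]
p_w = {"brown": 0.4, "dog": 0.5, "cat": 0.1}
print(bernoulli_likelihood(["brown", "dog"], p_w, vocab))
print(multinomial_likelihood(["brown", "dog", "dog"], p_w))
```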
Maximum-likelihood estimation • The simplest approach is maximum-likelihood estimation: count relative frequencies of words in S, P(w|Ms) = #(w,S)/|S| • The zero-frequency problem (caused by data sparseness) • If some event does not occur in S, its estimated probability is 0! • This is not correct, and we need to avoid it
Discounting Methods • Laplace correction (add-one approach): • Add 1 to every count (and normalize) • P(w|Ms) = (#(w,S)+1)/(|S|+|V|) • Problematic for large vocabularies (when |V| is very large) • Lidstone correction (generalized add-one) • Add a small constant to every count • Leave-one-out discounting • Remove one word, compute P(S|Ms), repeat for every word in the document, and maximize the overall likelihood • Ref. Chen SF, Goodman JT. An Empirical Study of Smoothing Techniques for Language Modeling. Proc. 34th Annual Meeting of the ACL, 1996
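A minimal sketch of the Lidstone correction (Laplace when c = 1), with an invented document and vocabulary size:

```python
from collections import Counter

def lidstone_prob(word, doc_tokens, vocab_size, c=0.5):
    """Lidstone correction: add a small constant c to every count; c = 1 gives the Laplace (add-one) estimate."""
    counts = Counter(doc_tokens)
    return (counts[word] + c) / (len(doc_tokens) + c * vocab_size)

doc = "the quick brown dog".split()
print(lidstone_prob("dog", doc, vocab_size=10, c=1.0))    # Laplace estimate for a seen word
print(lidstone_prob("zebra", doc, vocab_size=10, c=1.0))  # unseen word now gets non-zero probability
```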
Smoothing methods • Discounting methods treat all unseen words identically, but in reality they differ; background knowledge can be used, e.g., knowledge of the English language. • P(w|Ms) = λ·PML(w|Ms) + (1−λ)·P(w|REF) • PML(w|Ms) is the conditional probability estimated from the document • P(w) = P(w|REF) is the prior probability from a reference (background) model
Additive smoothing methods • P(w|Ms) = [#(w,S)+c] / [|S|+c|V|] • Background model: P(w) = 1/|V| • This is the interpolated form above with λ = |S|/(|S|+c|V|), so the added mass per word is (1−λ)·P(w) = c/[|S|+c|V|]
Jelinek-Mercer method • Set λ to a constant, independent of document and query • Tune it to optimize retrieval performance on different databases, query sets, etc.
Dirichlet method • λ = N/(N+μ), 1−λ = μ/(N+μ) • N: the sample size, i.e., the length of S; μ is a parameter
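A sketch of the two smoothing schemes side by side; the weight λ = 0.7 and the prior μ = 2000 are only typical illustrative values, and the background model is invented:

```python
def jelinek_mercer(word, doc_counts, doc_len, p_bg, lam=0.7):
    """Jelinek-Mercer: fixed interpolation weight lam, independent of the document."""
    p_ml = doc_counts.get(word, 0) / doc_len
    return lam * p_ml + (1 - lam) * p_bg[word]

def dirichlet(word, doc_counts, doc_len, p_bg, mu=2000):
    """Dirichlet prior: document-dependent weight lam = N / (N + mu), where N is the document length."""
    lam = doc_len / (doc_len + mu)
    p_ml = doc_counts.get(word, 0) / doc_len
    return lam * p_ml + (1 - lam) * p_bg[word]

doc_counts = {"brown": 2, "dog": 3}
p_bg = {"brown": 0.01, "dog": 0.02, "cat": 0.005}  # background (collection) model, illustrative
print(jelinek_mercer("cat", doc_counts, 5, p_bg))  # unseen word, smoothed by the background
print(dirichlet("cat", doc_counts, 5, p_bg))
```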
The effect of smoothing on retrieval performance • Zhai CX, Lafferty J. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. ACM SIGIR 2001 • Zhai CX, Lafferty J. A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM TOIS 22(2):179-214, 2004 • Smoothing plays two roles: (1) estimation, solving the zero-probability problem; (2) query modeling, removing or reducing the influence of noise
Translation Models • Basic LMs do not address word synonymy. • P(q|M) = ∑w P(w|M) P(q|w) • P(q|w) captures the relationship between q and w; if q and w are near-synonyms, this value is relatively large. • P(q|w) can be computed from word co-occurrence, shared stems, dictionaries, etc.; this is the key to the approach. • P(w|M) is the probability of w under the language model.
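A rough sketch of translation-model scoring, with an invented document model and translation table; the point is that a query term absent from the document can still receive non-zero probability through related words:

```python
def translation_prob(q_term, doc_model, translation):
    """P(q|M) = sum_w P(w|M) * P(q|w): the document can 'generate' q through related words w."""
    return sum(p_w_m * translation.get((q_term, w), 0.0) for w, p_w_m in doc_model.items())

# Illustrative document model and translation table (e.g., learned from co-occurrence or a thesaurus).
doc_model = {"car": 0.5, "engine": 0.3, "road": 0.2}
translation = {("automobile", "car"): 0.8, ("automobile", "engine"): 0.1}
print(translation_prob("automobile", doc_model, translation))  # non-zero although 'automobile' is absent
```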
LM Tools • LEMUR • www.cs.cmu.edu/~lemur • CMU/UMass joint project • C++, good documentation, forum-based support • Ad-hoc IR, clustering, Q/A systems • ML + smoothing, … • YARI • lavrenko@cs.umass.edu • Ad-hoc IR, cross-language, classification • ML + smoothing, …
Other applications of LM • Topic detection and tracking • Treat “q” as a topic description • Classification/ filtering • Cross-language retrieval • Multi-media retrieval
References • Ponte JM, Croft WB. A Language Modeling Approach to Information Retrieval. ACM SIGIR 1998, pp. 275-281 • Ponte JM. A Language Modeling Approach to Information Retrieval. PhD Dissertation, UMass Amherst, 1998
Bag-of-nets • What if the concepts in a text are represented with an ontology, i.e., the concepts extracted from the text are placed in the context of a domain ontology, forming a network of concepts? • Could a Bayesian network be used? The key question is how to interpret the relations between words: do they have a causal character? • For example, hypernym/hyponym relations? Association relations?