Language Modeling Approaches for Information Retrieval Rong Jin
A Probabilistic Framework for Information Retrieval [Diagram: query q: 'bush Kerry' posed against documents d1 … d1000; for each document, estimate some statistics θ, then estimate the likelihood p(q|θ)]
A Probabilistic Framework for Information Retrieval • Three fundamental questions • What statistics should be chosen to describe the characteristics of documents? • How do we estimate these statistics? • How do we compute the likelihood of generating queries given the statistics?
Unigram Language Model • Probabilities for single words p(w) • θ = {p(w) for every word w in vocabulary V} • Estimate a unigram language model by simple counting: given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|
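A minimal sketch of simple counting (tokenization and the function name are illustrative, not from the slides):

```python
from collections import Counter

def unigram_mle(doc_tokens):
    """Maximum likelihood unigram model: p(w) = c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

# p('the') = 2/6, p('cat') = 1/6, ...
print(unigram_mle("the cat sat on the mat".split()))
```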
Statistical Inference • C1: h, h, h, h, t, h → bias b1 = 5/6 • C2: t, t, h, t, h, h → bias b2 = 1/2 • C3: t, h, t, t, t, h → bias b3 = 1/3 • Why does counting provide a good estimate of the coin bias?
Maximum Likelihood Estimation (MLE) • Observation o = {o1, o2, …, on} • Maximum likelihood estimation: b* = argmax_b Pr(o|b) • E.g.: o = {h, h, h, t, h, h}, so Pr(o|b) = b^5(1−b)
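Carrying the example through (setting the derivative of the likelihood to zero):

```latex
\frac{d}{db}\, b^5(1-b) = 5b^4 - 6b^5 = b^4(5 - 6b) = 0
\quad\Rightarrow\quad b^* = \tfrac{5}{6}
```

which is exactly the counting estimate: 5 heads out of 6 tosses.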
Unigram Language Model • Observation: d = {tf1, tf2, …, tfn} • Unigram language model θ = {p(w1), p(w2), …, p(wn)} • Maximum likelihood estimation: θ* = argmax_θ p(d|θ) = argmax_θ Π_i p(wi)^tf_i, whose solution is p(wi) = tfi/|d|
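The maximization can be reconstructed in one line (a Lagrange multiplier enforces the normalization constraint):

```latex
\max_{\theta} \sum_i tf_i \log p(w_i)
\;\;\text{s.t.}\;\; \sum_i p(w_i) = 1
\quad\Rightarrow\quad
p(w_i) = \frac{tf_i}{\sum_j tf_j} = \frac{tf_i}{|d|}
```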
Maximum A Posterior Estimation • Consider a special case: we only toss each coin twice • C1: h, t → b1 = 1/2 • C2: h, h → b2 = 1 • C3: t, t → b3 = 0 ? • MLE estimation is poor when the number of observations is small. This is called the "sparse data" problem!
Solution to Sparse Data Problems • Shrinkage • Maximum a posterior (MAP) estimation • Bayesian approach
Shrinkage: Jelinek-Mercer Smoothing • Linearly interpolate between the document language model (estimated from the individual document) and the collection language model (estimated from the corpus): p(w|d) = λ·c(w,d)/|d| + (1−λ)·p(w|c) • 0 < λ < 1 is a smoothing parameter
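A minimal sketch of the interpolation (the function name and default λ are illustrative):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document MLE
    c(w,d)/|d| with the collection model p(w|c); 0 < lam < 1."""
    return lam * (c_wd / doc_len) + (1 - lam) * p_wc
```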
Smoothing & TF-IDF Weighting • Are they totally irrelevant? • With JM smoothing, the query log-likelihood can be rewritten as log p(q|d) = Σ_{w ∈ q∩d} c(w,q)·log(1 + λ·c(w,d)/((1−λ)·|d|·p(w|c))) + Σ_{w ∈ q} c(w,q)·log((1−λ)·p(w|c)) • The first sum is similar to TF.IDF weighting: it grows with the term frequency in d and shrinks with the collection frequency of w • The second sum is the same for every document, hence irrelevant to documents (it does not affect ranking)
Maximum A Posterior Estimation • Introduce a prior on b • Most coins are more or less unbiased • A Dirichlet prior on b (a Beta distribution, in the two-outcome case)
Maximum A Posterior Estimation • Observation o = {o1, o2, …, on} • Maximum a posterior estimation: b* = argmax_b p(b|o) = argmax_b Pr(o|b)·p(b) • With the prior p(b) ∝ b^(α1−1)·(1−b)^(α2−1), the solution is b* = (nh + α1 − 1)/(n + α1 + α2 − 2), where nh is the number of observed heads • The hyper-parameters act as pseudo counts (or pseudo experiments) added to the real observations
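A worked example under an assumed prior strength (α1 = α2 = 6 is illustrative, encoding the belief that coins are roughly fair):

```latex
o = \{h, h\}:\qquad
b^* = \frac{2 + 6 - 1}{2 + 6 + 6 - 2} = \frac{7}{12} \approx 0.58
\qquad (\text{vs. MLE } b^* = 1)
```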
Dirichlet Prior • Given a distribution p = {p1, p2, …, pn} • A Dirichlet distribution for p is defined as Dir(p; α1, …, αn) ∝ Π_i pi^(αi−1) • The αi are called hyper-parameters
Dirichlet Prior • Example: for the coin (n = 2), the Dirichlet reduces to a Beta distribution ∝ b^(α1−1)·(1−b)^(α2−1) • Full Dirichlet distribution: Dir(p; α1, …, αn) = [Γ(α1 + … + αn) / (Γ(α1)⋯Γ(αn))]·Π_i pi^(αi−1) • Γ(x) is the gamma function
Dirichlet Prior • Dirichlet is a distribution over distributions • The prior knowledge about the distribution p is encoded in the hyper-parameters • The maximum point (mode) of the Dirichlet distribution is at pi = (αi − 1)/(α1 + α2 + … + αn − n), so pi ∝ αi − 1, and setting αi = c·pi + 1 places the mode exactly at p • Example: prior knowledge that most coins are fair, b = 1 − b = 1/2, gives α1 = α2 = c/2 + 1 (equal hyper-parameters put the mode at 1/2)
Unigram Language Model • Simple counting → zero probabilities for unseen words • Introduce Dirichlet priors to smooth the language model • How to construct the Dirichlet prior?
Dirichlet Prior for Unigram LM • Prior for what distribution? θd = {p(w1|d), p(w2|d), …, p(wn|d)} • How to determine the appropriate values for the hyper-parameters αi?
Determine Hyper-parameters • The most likely language model under the Dirichlet distribution has p(wi|d) ∝ αi − 1 • What is the most likely p(wi|d) without looking into the content of the document d?
Determine Hyper-parameters • The most likely p(wi|d) without looking into the content of the document d is the unigram probability of the collection: θc = {p(w1|c), p(w2|c), …, p(wn|c)} • So the appropriate value is αi = s·p(wi|c) + 1, where the constant s sets the strength of the prior
Dirichlet Prior for Unigram LM • MAP estimation for the best unigram language model: θd* = argmax_θ p(d|θ)·Dir(θ; α1, …, αn) • Solution: p(w|d) = (c(w,d) + s·p(w|c)) / (|d| + s) • s·p(w|c) acts as a pseudo term frequency for w, and s acts as a pseudo document length
Dirichlet Smoothed Unigram LM • What does p(w|d) look like if s is small? It stays close to the counting estimate c(w,d)/|d| • What does p(w|d) look like if s is large? It is pulled toward the collection model p(w|c)
Dirichlet Smoothed Unigram LM • No longer zero probabilities: even a word unseen in d receives p(w|d) = s·p(w|c)/(|d| + s) > 0
Dirichlet Smoothed Unigram LM • Step 1: compute the collection-based unigram language model by simple counting: p(w|c) = Σ_k c(w,dk) / Σ_k |dk| • Step 2: for each document dk, compute its smoothed unigram language model as p(w|dk) = (c(w,dk) + s·p(w|c)) / (|dk| + s)
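A minimal sketch of the two steps, assuming whitespace-tokenized documents (the function names and the default prior strength s = 1000 are illustrative):

```python
from collections import Counter

def collection_model(docs):
    """Step 1: collection unigram LM, p(w|c) = sum_k c(w,dk) / sum_k |dk|."""
    counts = Counter()
    for d in docs:
        counts.update(d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def dirichlet_smooth(doc, p_c, s=1000.0):
    """Step 2: smoothed unigram LM for one document,
    p(w|d) = (c(w,d) + s * p(w|c)) / (|d| + s)."""
    counts = Counter(doc)
    n = len(doc)
    return lambda w: (counts[w] + s * p_c.get(w, 0.0)) / (n + s)
```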
Dirichlet Smoothed Unigram LM • For a given query q = {tf1(q), tf2(q), …, tfn(q)} • For each document d, compute the likelihood p(q|d) = Π_i p(wi|d)^tf_i(q) • The larger the likelihood, the more relevant the document is to the query
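Ranking then follows directly; this sketch builds on the functions above and works with log-likelihoods to avoid numerical underflow (the toy documents are illustrative, and query terms are assumed to occur somewhere in the collection so that p(w|d) > 0):

```python
import math
from collections import Counter

def query_log_likelihood(query, p_wd):
    """log p(q|d) = sum_i tf_i(q) * log p(w_i|d)."""
    q_counts = Counter(query)
    return sum(tf * math.log(p_wd(w)) for w, tf in q_counts.items())

docs = [d.split() for d in ["bush kerry debate", "weather report today"]]
p_c = collection_model(docs)
models = [dirichlet_smooth(d, p_c) for d in docs]
query = "bush kerry".split()
ranking = sorted(range(len(docs)),
                 key=lambda k: query_log_likelihood(query, models[k]),
                 reverse=True)  # documents ordered most to least relevant
```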
Smoothing & TF-IDF Weighting • Are they totally irrelevant? • With Dirichlet smoothing, log p(q|d) = Σ_{w ∈ q∩d} c(w,q)·log(1 + c(w,d)/(s·p(w|c))) + |q|·log(s/(|d| + s)) + Σ_{w ∈ q} c(w,q)·log p(w|c) • The first sum behaves like TF.IDF: it grows with c(w,d) and shrinks with the collection frequency p(w|c) • The |q|·log(s/(|d| + s)) term performs document normalization (it penalizes longer documents) • The last sum is the same for all documents and does not affect ranking
Shrinkage vs. Dirichlet Smoothing • Both linearly interpolate between the document language model and the collection language model • JM smoothing: p(w|d) = λ·c(w,d)/|d| + (1−λ)·p(w|c); the linear weight λ is a constant • Dirichlet smoothing: p(w|d) = (c(w,d) + s·p(w|c))/(|d| + s), which is the same interpolation with weight λ = |d|/(|d| + s); the weight is document dependent
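To see the correspondence concretely (a self-contained sketch with illustrative numbers):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam):
    """JM: lam * c(w,d)/|d| + (1 - lam) * p(w|c)."""
    return lam * (c_wd / doc_len) + (1 - lam) * p_wc

def dirichlet_smooth_prob(c_wd, doc_len, p_wc, s):
    """Dirichlet: (c(w,d) + s * p(w|c)) / (|d| + s)."""
    return (c_wd + s * p_wc) / (doc_len + s)

# Dirichlet smoothing is JM smoothing with lam = |d| / (|d| + s):
c_wd, doc_len, p_wc, s = 3, 100, 0.01, 1000
lam = doc_len / (doc_len + s)
assert abs(jm_smooth(c_wd, doc_len, p_wc, lam)
           - dirichlet_smooth_prob(c_wd, doc_len, p_wc, s)) < 1e-12
```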
Current Probabilistic Framework for Information Retrieval [Diagram: query q: 'bush Kerry' posed against documents d1 … d1000; a single model θk is estimated for each document dk, and documents are ranked by the likelihood p(q|θk)]
Bayesian Approach • We need to consider the uncertainty in model inference [Diagram: query q: 'bush Kerry' against documents d1 … d1000, with several candidate models θ per document rather than a single point estimate]
Bayesian Approach [Diagram: candidate models θ1, θ2, …, θn, each connected to the document d by p(d|θi) and to the query q by p(q|θi)] • Score a document by averaging the query likelihood over all candidate models, weighting each model by how well it explains the document: p(q|d) ∝ Σ_i p(q|θi)·p(d|θi) • Assume that p(d) and p(θi) follow uniform distributions
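A minimal sketch of the idea under the stated uniform-prior assumption (all names are illustrative; a full treatment integrates over a continuous model space rather than enumerating a finite set of models):

```python
import math

def bayesian_score(query_ll, doc_ll, models):
    """p(q|d) proportional to sum_i p(q|theta_i) * p(d|theta_i):
    average the query likelihood over candidate models, weighting
    each model by how well it explains the document (uniform p(d)
    and p(theta_i) assumed).  query_ll and doc_ll map a model to
    the log-likelihood of the query / document under that model."""
    doc_weights = [math.exp(doc_ll(m)) for m in models]
    z = sum(doc_weights)  # normalizer, so weights form p(theta_i | d)
    return sum(math.exp(query_ll(m)) * w / z
               for m, w in zip(models, doc_weights))
```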