Language Modeling Approaches for Information Retrieval Rong Jin
A Probabilistic Framework for Information Retrieval [Diagram: query q: 'bush Kerry' posed against documents d1 … d1000; for each document, estimate some statistics θ, then estimate the likelihood p(q|θ)]
A Probabilistic Framework for Information Retrieval • Three fundamental questions • What statistics should be chosen to describe the characteristics of documents? • How do we estimate these statistics? • How do we compute the likelihood of generating queries given the statistics?
Unigram Language Model • Probabilities for single words p(w) • θ = {p(w) for every word w in vocabulary V} • Estimate a unigram language model by simple counting: given a document d, count the term frequency c(w,d) for each word w; then p(w) = c(w,d)/|d|
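A minimal sketch of simple counting (tokenization and the function name are illustrative, not from the slides):

```python
from collections import Counter

def unigram_mle(doc_tokens):
    """Maximum likelihood unigram model: p(w) = c(w, d) / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

# p('the') = 2/6, p('cat') = 1/6, ...
print(unigram_mle("the cat sat on the mat".split()))
```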
Statistical Inference • C1: h, h, h, h, t, h → bias b1 = 5/6 • C2: t, t, h, t, h, h → bias b2 = 1/2 • C3: t, h, t, t, t, h → bias b3 = 1/3 • Why does counting provide a good estimate of the coin bias?
Maximum Likelihood Estimation (MLE) • Observation o = {o1, o2, …, on} • Maximum likelihood estimation: b* = argmax_b Pr(o|b) • E.g.: o = {h, h, h, t, h, h}, so Pr(o|b) = b^5(1−b)
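Carrying the example through (setting the derivative of the likelihood to zero):

```latex
\frac{d}{db}\, b^5(1-b) = 5b^4 - 6b^5 = b^4(5 - 6b) = 0
\quad\Rightarrow\quad b^* = \tfrac{5}{6}
```

which is exactly the counting estimate: 5 heads out of 6 tosses.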
Unigram Language Model • Observation: d = {tf1, tf2, …, tfn} • Unigram language model θ = {p(w1), p(w2), …, p(wn)} • Maximum likelihood estimation: θ* = argmax_θ p(d|θ) = argmax_θ Π_i p(wi)^tf_i, whose solution is p(wi) = tfi/|d|
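The maximization can be reconstructed in one line (a Lagrange multiplier enforces the normalization constraint):

```latex
\max_{\theta} \sum_i tf_i \log p(w_i)
\;\;\text{s.t.}\;\; \sum_i p(w_i) = 1
\quad\Rightarrow\quad
p(w_i) = \frac{tf_i}{\sum_j tf_j} = \frac{tf_i}{|d|}
```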
Maximum A Posterior Estimation • Consider a special case: we only toss each coin twice • C1: h, t → b1 = 1/2 • C2: h, h → b2 = 1 • C3: t, t → b3 = 0 ? • MLE estimation is poor when the number of observations is small. This is called the "sparse data" problem!
Solution to Sparse Data Problems • Shrinkage • Maximum a posterior (MAP) estimation • Bayesian approach
Shrinkage: Jelinek-Mercer Smoothing • Linearly interpolate between the document language model (estimated from the individual document) and the collection language model (estimated from the corpus): p(w|d) = λ·c(w,d)/|d| + (1−λ)·p(w|c) • 0 < λ < 1 is a smoothing parameter
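A minimal sketch of the interpolation (the function name and default λ are illustrative):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document MLE
    c(w,d)/|d| with the collection model p(w|c); 0 < lam < 1."""
    return lam * (c_wd / doc_len) + (1 - lam) * p_wc
```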
Smoothing & TF-IDF Weighting • Are they totally irrelevant? • With JM smoothing, the query log-likelihood can be rewritten as log p(q|d) = Σ_{w ∈ q∩d} c(w,q)·log(1 + λ·c(w,d)/((1−λ)·|d|·p(w|c))) + Σ_{w ∈ q} c(w,q)·log((1−λ)·p(w|c)) • The first sum is similar to TF.IDF weighting: it grows with the term frequency in d and shrinks with the collection frequency of w • The second sum is the same for every document, hence irrelevant to documents (it does not affect ranking)
Maximum A Posterior Estimation • Introduce a prior on b • Most coins are more or less unbiased • A Dirichlet prior on b (a Beta distribution, in the two-outcome case)
Maximum A Posterior Estimation • Observation o = {o1, o2, …, on} • Maximum a posterior estimation: b* = argmax_b p(b|o) = argmax_b Pr(o|b)·p(b) • With the prior p(b) ∝ b^(α1−1)·(1−b)^(α2−1), the solution is b* = (nh + α1 − 1)/(n + α1 + α2 − 2), where nh is the number of observed heads • The hyper-parameters act as pseudo counts (or pseudo experiments) added to the real observations
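A worked example under an assumed prior strength (α1 = α2 = 6 is illustrative, encoding the belief that coins are roughly fair):

```latex
o = \{h, h\}:\qquad
b^* = \frac{2 + 6 - 1}{2 + 6 + 6 - 2} = \frac{7}{12} \approx 0.58
\qquad (\text{vs. MLE } b^* = 1)
```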
Dirichlet Prior • Given a distribution p = {p1, p2, …, pn} • A Dirichlet distribution for p is defined as Dir(p; α1, …, αn) ∝ Π_i pi^(αi−1) • The αi are called hyper-parameters
Dirichlet Prior • Example: for the coin (n = 2), the Dirichlet reduces to a Beta distribution ∝ b^(α1−1)·(1−b)^(α2−1) • Full Dirichlet distribution: Dir(p; α1, …, αn) = [Γ(α1 + … + αn) / (Γ(α1)⋯Γ(αn))]·Π_i pi^(αi−1) • Γ(x) is the gamma function
Dirichlet Prior • Dirichlet is a distribution over distributions • The prior knowledge about the distribution p is encoded in the hyper-parameters • The maximum point (mode) of the Dirichlet distribution is at pi = (αi − 1)/(α1 + α2 + … + αn − n), so pi ∝ αi − 1, and setting αi = c·pi + 1 places the mode exactly at p • Example: prior knowledge that most coins are fair, b = 1 − b = 1/2, gives α1 = α2 = c/2 + 1 (equal hyper-parameters put the mode at 1/2)
Unigram Language Model • Simple counting → zero probabilities for unseen words • Introduce Dirichlet priors to smooth the language model • How to construct the Dirichlet prior?
Dirichlet Prior for Unigram LM • Prior for what distribution? θd = {p(w1|d), p(w2|d), …, p(wn|d)} • How to determine the appropriate values for the hyper-parameters αi?
Determine Hyper-parameters • The most likely language model under the Dirichlet distribution has p(wi|d) ∝ αi − 1 • What is the most likely p(wi|d) without looking into the content of the document d?
Determine Hyper-parameters • The most likely p(wi|d) without looking into the content of the document d is the unigram probability of the collection: θc = {p(w1|c), p(w2|c), …, p(wn|c)} • So the appropriate value is αi = s·p(wi|c) + 1, where the constant s sets the strength of the prior
Dirichlet Prior for Unigram LM • MAP estimation for the best unigram language model: θd* = argmax_θ p(d|θ)·Dir(θ; α1, …, αn) • Solution: p(w|d) = (c(w,d) + s·p(w|c)) / (|d| + s) • s·p(w|c) acts as a pseudo term frequency for w, and s acts as a pseudo document length
Dirichlet Smoothed Unigram LM • What does p(w|d) look like if s is small? It stays close to the counting estimate c(w,d)/|d| • What does p(w|d) look like if s is large? It is pulled toward the collection model p(w|c)
Dirichlet Smoothed Unigram LM • No longer zero probabilities: even a word unseen in d receives p(w|d) = s·p(w|c)/(|d| + s) > 0
Dirichlet Smoothed Unigram LM • Step 1: compute the collection-based unigram language model by simple counting: p(w|c) = Σ_k c(w,dk) / Σ_k |dk| • Step 2: for each document dk, compute its smoothed unigram language model as p(w|dk) = (c(w,dk) + s·p(w|c)) / (|dk| + s)
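A minimal sketch of the two steps, assuming whitespace-tokenized documents (the function names and the default prior strength s = 1000 are illustrative):

```python
from collections import Counter

def collection_model(docs):
    """Step 1: collection unigram LM, p(w|c) = sum_k c(w,dk) / sum_k |dk|."""
    counts = Counter()
    for d in docs:
        counts.update(d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def dirichlet_smooth(doc, p_c, s=1000.0):
    """Step 2: smoothed unigram LM for one document,
    p(w|d) = (c(w,d) + s * p(w|c)) / (|d| + s)."""
    counts = Counter(doc)
    n = len(doc)
    return lambda w: (counts[w] + s * p_c.get(w, 0.0)) / (n + s)
```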
Dirichlet Smoothed Unigram LM • For a given query q = {tf1(q), tf2(q), …, tfn(q)} • For each document d, compute the likelihood p(q|d) = Π_i p(wi|d)^tf_i(q) • The larger the likelihood, the more relevant the document is to the query
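Ranking then follows directly; this sketch builds on the functions above and works with log-likelihoods to avoid numerical underflow (the toy documents are illustrative, and query terms are assumed to occur somewhere in the collection so that p(w|d) > 0):

```python
import math
from collections import Counter

def query_log_likelihood(query, p_wd):
    """log p(q|d) = sum_i tf_i(q) * log p(w_i|d)."""
    q_counts = Counter(query)
    return sum(tf * math.log(p_wd(w)) for w, tf in q_counts.items())

docs = [d.split() for d in ["bush kerry debate", "weather report today"]]
p_c = collection_model(docs)
models = [dirichlet_smooth(d, p_c) for d in docs]
query = "bush kerry".split()
ranking = sorted(range(len(docs)),
                 key=lambda k: query_log_likelihood(query, models[k]),
                 reverse=True)  # documents ordered most to least relevant
```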
Smoothing & TF-IDF Weighting • Are they totally irrelevant? • With Dirichlet smoothing, log p(q|d) = Σ_{w ∈ q∩d} c(w,q)·log(1 + c(w,d)/(s·p(w|c))) + |q|·log(s/(|d| + s)) + Σ_{w ∈ q} c(w,q)·log p(w|c) • The first sum behaves like TF.IDF: it grows with c(w,d) and shrinks with the collection frequency p(w|c) • The |q|·log(s/(|d| + s)) term performs document normalization (it penalizes longer documents) • The last sum is the same for all documents and does not affect ranking
Shrinkage vs. Dirichlet Smoothing • Both linearly interpolate between the document language model and the collection language model • JM smoothing: p(w|d) = λ·c(w,d)/|d| + (1−λ)·p(w|c); the linear weight λ is a constant • Dirichlet smoothing: p(w|d) = (c(w,d) + s·p(w|c))/(|d| + s), which is the same interpolation with weight λ = |d|/(|d| + s); the weight is document dependent
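To see the correspondence concretely (a self-contained sketch with illustrative numbers):

```python
def jm_smooth(c_wd, doc_len, p_wc, lam):
    """JM: lam * c(w,d)/|d| + (1 - lam) * p(w|c)."""
    return lam * (c_wd / doc_len) + (1 - lam) * p_wc

def dirichlet_smooth_prob(c_wd, doc_len, p_wc, s):
    """Dirichlet: (c(w,d) + s * p(w|c)) / (|d| + s)."""
    return (c_wd + s * p_wc) / (doc_len + s)

# Dirichlet smoothing is JM smoothing with lam = |d| / (|d| + s):
c_wd, doc_len, p_wc, s = 3, 100, 0.01, 1000
lam = doc_len / (doc_len + s)
assert abs(jm_smooth(c_wd, doc_len, p_wc, lam)
           - dirichlet_smooth_prob(c_wd, doc_len, p_wc, s)) < 1e-12
```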
Current Probabilistic Framework for Information Retrieval [Diagram: query q: 'bush Kerry' posed against documents d1 … d1000; a single model θk is estimated for each document dk, and documents are ranked by the likelihood p(q|θk)]
Bayesian Approach • We need to consider the uncertainty in model inference [Diagram: query q: 'bush Kerry' against documents d1 … d1000, with several candidate models θ per document rather than a single point estimate]
Bayesian Approach [Diagram: candidate models θ1, θ2, …, θn, each connected to the document d by p(d|θi) and to the query q by p(q|θi)] • Score a document by averaging the query likelihood over all candidate models, weighting each model by how well it explains the document: p(q|d) ∝ Σ_i p(q|θi)·p(d|θi) • Assume that p(d) and p(θi) follow uniform distributions
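A minimal sketch of the idea under the stated uniform-prior assumption (all names are illustrative; a full treatment integrates over a continuous model space rather than enumerating a finite set of models):

```python
import math

def bayesian_score(query_ll, doc_ll, models):
    """p(q|d) proportional to sum_i p(q|theta_i) * p(d|theta_i):
    average the query likelihood over candidate models, weighting
    each model by how well it explains the document (uniform p(d)
    and p(theta_i) assumed).  query_ll and doc_ll map a model to
    the log-likelihood of the query / document under that model."""
    doc_weights = [math.exp(doc_ll(m)) for m in models]
    z = sum(doc_weights)  # normalizer, so weights form p(theta_i | d)
    return sum(math.exp(query_ll(m)) * w / z
               for m, w in zip(models, doc_weights))
```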