Language Modeling Approaches for Information Retrieval

Language Modeling Approaches for Information Retrieval. Rong Jin. A probabilistic framework for information retrieval: estimate some statistics θ for each document d1, …, d1000, then estimate the likelihood p(q|θ) of a query such as q: ‘bush Kerry’.


Presentation Transcript


  1. Language Modeling Approaches for Information Retrieval Rong Jin

  2. A Probabilistic Framework for Information Retrieval • Documents d1, …, d1000 and a query q: ‘bush Kerry’ • Estimate some statistics θ for each document • Estimate the likelihood p(q|θ)

  3. A Probabilistic Framework for Information Retrieval • Three fundamental questions • What statistics θ should be chosen to describe the characteristics of documents? • How do we estimate these statistics? • How do we compute the likelihood of generating queries given the statistics?

  4. Unigram Language Model • Probabilities for single words, p(w) • θ = {p(w) for any word w in vocabulary V} • Estimating a unigram language model • Simple counting: given a document d, count the term frequency c(w,d) for each word w. Then p(w) = c(w,d)/|d|
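The counting estimate above can be sketched in a few lines of Python. This is only an illustration: the function name `unigram_mle` and the toy document are assumptions, not part of the slides.

```python
from collections import Counter

def unigram_mle(tokens):
    """Estimate p(w) = c(w, d) / |d| by simple counting (hypothetical helper)."""
    counts = Counter(tokens)   # c(w, d): term frequency of each word w in d
    length = len(tokens)       # |d|: document length in tokens
    return {w: c / length for w, c in counts.items()}

# Toy document, made up for illustration.
doc = "bush kerry bush election".split()
model = unigram_mle(doc)
```

Words that never occur in the document simply have no entry here, which corresponds to a probability of zero under simple counting.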

  5. Statistical Inference • C1: h, h, h, h, t, h → bias b1 = 5/6 • C2: t, t, h, t, h, h → bias b2 = 1/2 • C3: t, h, t, t, t, h → bias b3 = 1/3 • Why does counting provide a good estimate of coin bias?

  6. Maximum Likelihood Estimation (MLE) • Observation o = {o1, o2, …, on} • Maximum likelihood estimation: b* = argmax_b Pr(o|b) • E.g.: o = {h, h, h, t, h, h} • Pr(o|b) = b^5 (1−b)
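As a worked example of why counting coincides with the maximum likelihood estimate, the likelihood b^5 (1−b) can be maximized directly. The symbols n_h (number of heads) and n (number of tosses) are introduced here for illustration.

```latex
\begin{align*}
\log \Pr(o \mid b) &= 5 \log b + \log(1 - b) \\
\frac{d}{db}\,\log \Pr(o \mid b) &= \frac{5}{b} - \frac{1}{1 - b} = 0
  \quad\Longrightarrow\quad b^{*} = \frac{5}{6} \\
\text{In general:}\quad \Pr(o \mid b) &= b^{\,n_h}(1 - b)^{\,n - n_h}
  \quad\Longrightarrow\quad b^{*} = \frac{n_h}{n}
\end{align*}
```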

  7. Unigram Language Model • Observation: d = {tf1, tf2, …, tfn} • Unigram language model θ = {p(w1), p(w2), …, p(wn)} • Maximum likelihood estimation: p(wi) = tfi / (tf1 + tf2 + … + tfn)

  8. Maximum A Posteriori Estimation • Consider a special case: we only toss each coin twice • C1: h, t → b1 = 1/2 • C2: h, h → b2 = 1 • C3: t, t → b3 = 0 ? • The MLE is poor when the number of observations is small. This is called the “sparse data” problem.

  9. Solution to Sparse Data Problems • Shrinkage • Maximum a posteriori (MAP) estimation • Bayesian approach

  10. Shrinkage: Jelinek-Mercer Smoothing • Linearly interpolate between the document language model (estimation based on the individual document) and the collection language model (estimation based on the corpus) • 0 < λ < 1 is a smoothing parameter
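A minimal sketch of Jelinek-Mercer smoothing follows. The transcript does not preserve the slide's formula, so the convention used here, with λ weighting the document model and (1 − λ) weighting the collection model, is an assumption; the function names and toy corpus are also illustrative.

```python
from collections import Counter

def mle(tokens):
    """Unsmoothed unigram model by simple counting."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def jm_smooth(doc_tokens, collection_model, lam=0.5):
    """p(w|d) = lam * p_ml(w|d) + (1 - lam) * p(w|c).

    Assumed convention: lam weights the document model.
    """
    doc_model = mle(doc_tokens)
    return {w: lam * doc_model.get(w, 0.0) + (1 - lam) * p_c
            for w, p_c in collection_model.items()}

# Toy corpus, made up for illustration.
docs = [["bush", "kerry", "bush"], ["kerry", "tax", "tax"]]
collection_model = mle([w for d in docs for w in d])
smoothed = jm_smooth(docs[0], collection_model, lam=0.5)
```

The word "tax" never occurs in the first document, yet it receives a non-zero probability because part of the mass comes from the collection model.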

  11. Smoothing & TF-IDF Weighting • Are they totally unrelated?

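  12. Smoothing & TF-IDF Weighting • One term of the smoothed query likelihood is similar to TF.IDF weighting • The other term is irrelevant to documents (it does not depend on the document)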
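The slide's equation is not preserved in the transcript. One standard way to see the connection, sketched here under the assumption that λ weights the document model, is to factor the collection probability out of the Jelinek-Mercer smoothed query log-likelihood. The notation tf(w,q) for query term frequency and p_ml(w|d) for the unsmoothed document model is introduced for illustration.

```latex
\begin{align*}
\log p(q \mid d)
 &= \sum_{w} tf(w,q)\,\log\bigl[\lambda\,p_{ml}(w \mid d) + (1-\lambda)\,p(w \mid c)\bigr] \\
 &= \underbrace{\sum_{w} tf(w,q)\,\log\Bigl[1 + \frac{\lambda\,p_{ml}(w \mid d)}{(1-\lambda)\,p(w \mid c)}\Bigr]}_{\text{similar to TF.IDF weighting}}
  \;+\; \underbrace{\sum_{w} tf(w,q)\,\log\bigl[(1-\lambda)\,p(w \mid c)\bigr]}_{\text{irrelevant to documents}}
\end{align*}
```

The first sum grows with the term frequency in the document and shrinks with the collection frequency of the word, which is the TF.IDF pattern. The second sum is the same for every document and so does not affect the ranking.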

  13. Maximum A Posteriori Estimation • Introduce a prior on b • Most coins are more or less unbiased • A Dirichlet prior on b

  14. Maximum A Posteriori Estimation • Observation o = {o1, o2, …, on} • Maximum a posteriori estimation: b* = argmax_b Pr(o|b) Pr(b)

  17. Maximum A Posteriori Estimation • Observation o = {o1, o2, …, on} • Maximum a posteriori estimation • The prior contributes pseudo counts (or pseudo experiments)
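The slide's derivation is not preserved. As a hedged sketch, assume a symmetric prior p(b) ∝ b^(α−1) (1−b)^(α−1) with α > 1, and let n_h and n_t be the observed numbers of heads and tails with n = n_h + n_t (these symbols are introduced for illustration).

```latex
\begin{align*}
\Pr(b \mid o) &\propto \Pr(o \mid b)\,p(b)
  \;\propto\; b^{\,n_h + \alpha - 1}\,(1 - b)^{\,n_t + \alpha - 1} \\
b_{MAP} &= \arg\max_{b}\,\Pr(b \mid o)
  = \frac{n_h + (\alpha - 1)}{n + 2(\alpha - 1)}
\end{align*}
```

The prior behaves like α − 1 extra heads and α − 1 extra tails observed before the real experiment. For example, with α = 2 the coin C2 (h, h) gets b = (2 + 1)/(2 + 2) = 3/4 instead of the MLE value 1.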

  18. Dirichlet Prior • Given a distribution p = {p1, p2, …, pn} • A Dirichlet distribution for p is defined as Dir(p|α) ∝ p1^(α1−1) · p2^(α2−1) · … · pn^(αn−1) • αi are called hyper-parameters

  19. Dirichlet Prior • Example: • Full Dirichlet distribution: Dir(p|α) = [Γ(α1+α2+…+αn) / (Γ(α1)·Γ(α2)·…·Γ(αn))] · p1^(α1−1) · p2^(α2−1) · … · pn^(αn−1) • Γ(x) is the gamma function

  20. Dirichlet Prior • A Dirichlet is a distribution over distributions • The prior knowledge about the distribution p is encoded in the hyper-parameters α • The maximum point of the Dirichlet distribution is at pi = (αi−1)/(α1+α2+…+αn−n), so pi ∝ αi−1 and we can set αi = c·pi+1 • Example: • Prior knowledge: most coins are fair → b = 1−b = 1/2 • α1 = α2 = c

  21. Unigram Language Model • Simple counting → zero probabilities • Introduce Dirichlet priors to smooth the language model • How to construct the Dirichlet prior?

  22. Dirichlet Prior for Unigram LM • Prior for what distribution? θd = {p(w1|d), p(w2|d), …, p(wn|d)} • How to determine the appropriate value for the hyper-parameters αi?

  23. Determine Hyper-parameters • The most likely language model under the Dirichlet distribution satisfies p(wi|θd) ∝ αi−1 • What is the most likely p(wi|θd) without looking into the content of the document d?

  24. Determine Hyper-parameters • The most likely p(wi|θd) without looking into the content of the document d is the unigram probability of the collection: • θc = {p(w1|c), p(w2|c), …, p(wn|c)} • So what is an appropriate value for αi?

  27. Dirichlet Prior for Unigram LM • MAP estimation of the best unigram language model • Solution: p(wi|d) = (c(wi,d) + s·p(wi|c)) / (|d| + s)

  30. Dirichlet Prior for Unigram LM • MAP estimation of the best unigram language model • Solution: p(wi|d) = (c(wi,d) + s·p(wi|c)) / (|d| + s) • s·p(wi|c) is a pseudo term frequency

  31. Dirichlet Prior for Unigram LM • MAP estimation of the best unigram language model • Solution: p(wi|d) = (c(wi,d) + s·p(wi|c)) / (|d| + s) • s is a pseudo document length
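The derivation behind the solution is not preserved in the transcript. The following is a sketch of the standard argument, assuming the hyper-parameters are set to α_i = s·p(w_i|c) + 1, where s is the strength of the prior and c(w_i,d) is the term frequency of w_i in d.

```latex
\begin{align*}
p(\theta_d \mid d) &\propto
  \underbrace{\prod_{i} p(w_i \mid d)^{\,c(w_i,d)}}_{\text{likelihood of } d}\;
  \underbrace{\prod_{i} p(w_i \mid d)^{\,\alpha_i - 1}}_{\text{Dirichlet prior}}
  \;=\; \prod_{i} p(w_i \mid d)^{\,c(w_i,d) + s\,p(w_i \mid c)} \\
\text{maximizing subject to } \sum_i p(w_i \mid d) = 1:\quad
p(w_i \mid d) &= \frac{c(w_i,d) + s\,p(w_i \mid c)}{|d| + s}
\end{align*}
```

The numerator adds a pseudo term frequency s·p(w_i|c) to the real count, and the denominator adds a pseudo document length s to the real length, since the pseudo term frequencies sum to s over the vocabulary.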

  32. Dirichlet Smoothed Unigram LM • What does p(w|d) look like if s is small? • What does p(w|d) look like if s is large?

  34. Dirichlet Smoothed Unigram LM No longer zero probabilities

  35. Dirichlet Smoothed Unigram LM • Step 1: compute the collection-based unigram language model by simple counting • Step 2: for each document dk, compute its smoothed unigram language model as p(w|dk) = (c(w,dk) + s·p(w|c)) / (|dk| + s)
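The two steps above can be sketched as follows. The function names, the toy corpus, and the value s = 3.0 are assumptions chosen for illustration.

```python
from collections import Counter

def collection_model(docs):
    """Step 1: collection-based unigram model p(w|c) by simple counting."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def dirichlet_smooth(doc_tokens, p_c, s=3.0):
    """Step 2: p(w|d) = (c(w, d) + s * p(w|c)) / (|d| + s)."""
    counts = Counter(doc_tokens)      # c(w, d); missing words count as 0
    denom = len(doc_tokens) + s       # real length plus pseudo document length
    return {w: (counts[w] + s * p) / denom for w, p in p_c.items()}

# Toy corpus, made up for illustration.
docs = [["bush", "kerry", "bush"], ["kerry", "tax", "tax"]]
p_c = collection_model(docs)
models = [dirichlet_smooth(d, p_c, s=3.0) for d in docs]
```

Every word in the collection vocabulary gets a non-zero probability in every document model, and each smoothed model still sums to one.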

  37. Dirichlet Smoothed Unigram LM • For a given query q = {tf1(q), tf2(q), …, tfn(q)} • For each document d, compute the likelihood p(q|d) = p(w1|d)^tf1(q) · p(w2|d)^tf2(q) · … · p(wn|d)^tfn(q) • The larger the likelihood, the more relevant the document is to the query
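A minimal sketch of ranking by query likelihood is shown below. It works in log space to avoid underflow, and it skips query words outside the collection vocabulary; both choices, along with the helper names and the toy corpus, are assumptions rather than something the slides specify.

```python
import math
from collections import Counter

def dirichlet_model(doc_tokens, p_c, s):
    """Dirichlet-smoothed model: (c(w, d) + s * p(w|c)) / (|d| + s)."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + s
    return {w: (counts[w] + s * p) / denom for w, p in p_c.items()}

def query_log_likelihood(query_tokens, model):
    """log p(q|d) = sum over query words of tf(w, q) * log p(w|d).

    Query words missing from the vocabulary are skipped (an assumption).
    """
    return sum(tf * math.log(model[w])
               for w, tf in Counter(query_tokens).items() if w in model)

# Toy corpus and query, made up for illustration.
docs = {"d1": ["bush", "kerry", "bush"], "d2": ["kerry", "tax", "tax"]}
all_counts = Counter(w for d in docs.values() for w in d)
total = sum(all_counts.values())
p_c = {w: c / total for w, c in all_counts.items()}

query = ["bush", "kerry"]
scores = {name: query_log_likelihood(query, dirichlet_model(d, p_c, s=3.0))
          for name, d in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Document d1 mentions both query words and ranks first; d2 never mentions "bush" but still receives a finite score thanks to smoothing.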

  38. Smoothing & TF-IDF Weighting • Are they totally unrelated?

  39. Smoothing & TF-IDF Weighting

  41. Smoothing & TF-IDF Weighting • One term of the smoothed query likelihood acts as document normalization

  42. Smoothing & TF-IDF Weighting • Another term acts as TF.IDF weighting
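The equations on these slides are not preserved. A sketch of the standard decomposition for the Dirichlet smoothed model p(w|d) = (c(w,d) + s·p(w|c)) / (|d| + s) follows; the notation tf(w,q) for query term frequency and |q| for query length is introduced for illustration.

```latex
\begin{align*}
\log p(q \mid d)
 &= \sum_{w} tf(w,q)\,\log\frac{c(w,d) + s\,p(w \mid c)}{|d| + s} \\
 &= \underbrace{\sum_{w} tf(w,q)\,\log\Bigl[1 + \frac{c(w,d)}{s\,p(w \mid c)}\Bigr]}_{\text{TF.IDF}}
  \;+\; \underbrace{|q|\,\log\frac{s}{|d| + s}}_{\text{document normalization}}
  \;+\; \underbrace{\sum_{w} tf(w,q)\,\log p(w \mid c)}_{\text{independent of } d}
\end{align*}
```

The first term rewards a high term frequency in the document and a low collection probability, the second penalizes long documents, and the third is the same for every document.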

  43. Shrinkage vs. Dirichlet Smoothing • Both linearly interpolate between the document language model and the collection language model • JM smoothing: the linear weight is a constant • Dirichlet smoothing: the linear weight is document dependent
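To make the comparison concrete, the Dirichlet smoothed estimate can be rewritten as a linear interpolation. The symbol λ_d for the document-dependent weight and p_ml(w|d) = c(w,d)/|d| for the unsmoothed document model are introduced here for illustration.

```latex
\begin{align*}
p(w \mid d) &= \frac{c(w,d) + s\,p(w \mid c)}{|d| + s}
 = \frac{|d|}{|d| + s}\cdot\frac{c(w,d)}{|d|} + \frac{s}{|d| + s}\cdot p(w \mid c) \\
 &= \lambda_d\,p_{ml}(w \mid d) + (1 - \lambda_d)\,p(w \mid c),
 \qquad \lambda_d = \frac{|d|}{|d| + s}
\end{align*}
```

Long documents get a weight λ_d close to 1 and rely mostly on their own counts, while short documents lean more on the collection model. In JM smoothing the weight is the same for all documents.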

  44. Current Probabilistic Framework for Information Retrieval • Documents d1, …, d1000 and a query q: ‘bush Kerry’ • Estimate some statistics θ for each document • Estimate the likelihood p(q|θ)

  45. Current Probabilistic Framework for Information Retrieval • Documents d1, …, d1000 and a query q: ‘bush Kerry’ • Estimate some statistics θ for each document: θ1, θ2, …, θ1000 • Estimate the likelihood p(q|θ)

  46. Current Probabilistic Framework for Information Retrieval • Query q: ‘bush Kerry’ • Estimate the likelihood p(q|θ) for each of θ1, θ2, …, θ1000

  47. Bayesian Approach • We need to consider the uncertainty in model inference • Documents d1, …, d1000 and a query q: ‘bush Kerry’ • Estimate some statistics θ for each document: θ1, θ2, …, θ1000 • Estimate the likelihood p(q|θ)

  48. Bayesian Approach • Candidate models θ1, θ2, …, θn link the document d and the query q through p(d|θi) and p(q|θi)

  50. Bayesian Approach • Candidate models θ1, θ2, …, θn link the document d and the query q through p(d|θi) and p(q|θi) • Assume that p(d) and p(θi) follow uniform distributions
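The slide's equation is not preserved. One plausible reading of the diagram, offered here as an assumption, is that the query likelihood is averaged over all candidate models instead of being computed from a single point estimate.

```latex
\begin{align*}
p(q \mid d) &= \sum_{i=1}^{n} p(q \mid \theta_i)\,p(\theta_i \mid d),
 \qquad p(\theta_i \mid d) = \frac{p(d \mid \theta_i)\,p(\theta_i)}{p(d)} \\
\text{with uniform } p(d) \text{ and } p(\theta_i):\quad
p(q \mid d) &\propto \sum_{i=1}^{n} p(q \mid \theta_i)\,p(d \mid \theta_i)
\end{align*}
```

Under the uniform assumptions the prior and the evidence are constants, so each model contributes in proportion to how well it explains both the document and the query.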
