Language Models for TR Rong Jin Department of Computer Science and Engineering Michigan State University
What is a Statistical LM? • A probability distribution over word sequences • p(“Today is Wednesday”) ≈ 0.001 • p(“Today Wednesday is”) ≈ 0.0000000000001 • p(“The eigenvalue is positive”) ≈ 0.00001 • Context-dependent! • Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
Why is a LM Useful? • Provides a principled way to quantify the uncertainties associated with natural language • Allows us to answer questions like: • Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition) • Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) • Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)
The Simplest Language Model (Unigram Model) • Generate a piece of text by generating each word INDEPENDENTLY • Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn) • Parameters: {p(wi)}, with p(w1)+…+p(wN) = 1 (N is the vocabulary size) • Essentially a multinomial distribution over words • A piece of text can be regarded as a sample drawn according to this word distribution
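The independence assumption can be sketched in a few lines of Python. The vocabulary and probabilities below are hypothetical, chosen only so that they sum to 1. Because the words are generated independently, a unigram model assigns the same probability to any reordering of a text.

```python
import math

# Hypothetical toy unigram model (illustrative values that sum to 1).
unigram = {"today": 0.2, "is": 0.3, "wednesday": 0.1, "the": 0.4}

def sequence_log_prob(words, model):
    """log p(w1 ... wn) = sum_i log p(wi) under the unigram assumption."""
    return sum(math.log(model[w]) for w in words)

in_order = sequence_log_prob(["today", "is", "wednesday"], unigram)
shuffled = sequence_log_prob(["today", "wednesday", "is"], unigram)

# Word order does not matter to a unigram model.
print(math.isclose(in_order, shuffled))
```

Working in log space avoids numerical underflow when the text is long.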
Text Generation with Unigram LM • (Unigram) language model p(w|θ) → sampling → document • Topic 1, text mining: … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 … → text mining paper • Topic 2, health: … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 … → food nutrition paper
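Sampling a document from a unigram model might look like the sketch below. The slide lists only a few words per topic (the rest are elided), so this example assumes the listed probabilities can be used as relative weights over the listed words only; `sample_document` is a hypothetical helper name.

```python
import random

# Probabilities as listed on the slide; the elided words are not filled in,
# so these values are treated as relative weights (an assumption).
topic_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01,
                     "clustering": 0.02, "food": 0.00001}
topic_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(model, length, seed=0):
    """Draw `length` words independently according to the model's weights."""
    rng = random.Random(seed)
    words = list(model)
    weights = list(model.values())
    return rng.choices(words, weights=weights, k=length)

mining_doc = sample_document(topic_text_mining, 20)
health_doc = sample_document(topic_health, 20)
print(" ".join(mining_doc))
print(" ".join(health_doc))
```

A document sampled from Topic 1 is dominated by "text" and "mining", while "food" almost never appears; the reverse holds for Topic 2.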
Estimation of Unigram LM • Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, … query 1, efficient 1 • Estimation: (unigram) language model p(w|θ) = ? • Estimates (count / total #words): text 10/100, mining 5/100, association 3/100, database 3/100, … query 1/100
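The count-over-total estimate can be reproduced as follows. This is a minimal sketch: the slide shows only some of the 100 words, so a hypothetical `<other>` token stands in for the elided ones.

```python
from collections import Counter

# Counts shown on the slide. The remaining words are elided there, so a
# hypothetical "<other>" token pads the document to its stated 100 words.
slide_counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
                "algorithm": 2, "query": 1, "efficient": 1}
tokens = [w for w, c in slide_counts.items() for _ in range(c)]
tokens += ["<other>"] * (100 - len(tokens))

def estimate_unigram(tokens):
    """Relative-frequency estimate: p(w) = count(w) / total number of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

model = estimate_unigram(tokens)
print(model["text"], model["mining"], model["query"])
```

Any word that does not occur in the document gets probability zero under this estimate, which is the problem smoothing addresses.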
Language Models for Retrieval (Ponte & Croft 98) • Each document has its own language model • Text mining paper: … text ?, mining ?, association ?, clustering ?, … food ? … • Food nutrition paper: … food ?, nutrition ?, healthy ?, diet ? … • Query = “data mining algorithms” • Which model would most likely have generated this query?
Ranking Docs by Query Likelihood • Each document d1, d2, …, dN has a doc LM θd1, θd2, …, θdN • Query likelihoods for query q: p(q|θd1), p(q|θd2), …, p(q|θdN) • Rank d1, d2, …, dN by query likelihood
But, where is the relevance? And, what’s good about this approach?
The Notion of Relevance • Relevance as similarity between (Rep(q), Rep(d)); models differ in representation and similarity: vector space model (Salton et al., 75), prob. distr. model (Wong & Yao, 89) • Relevance as probability of relevance P(r=1|q,d), r ∈ {0,1}: regression model (Fox 83); generative models, by doc generation (classical prob. model, Robertson & Sparck Jones, 76) or by query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a) • Relevance as probabilistic inference P(d→q) or P(q→d); models differ in inference system: prob. concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91)
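Refining P(R=1|Q,D) Method 2: generative models • Basic idea • Define P(Q,D|R) • Compute P(R|Q,D) using Bayes’ rule: $O(R=1|Q,D) = \frac{P(R=1|Q,D)}{P(R=0|Q,D)} = \frac{P(Q,D|R=1)}{P(Q,D|R=0)} \cdot \frac{P(R=1)}{P(R=0)}$, where the prior odds $P(R=1)/P(R=0)$ do not depend on D and are ignored for ranking D • Special cases • Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R) • Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R)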
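Query Generation • The ranking score factors into a query likelihood $p(q|\theta_d)$ and a document prior • Assuming a uniform prior, documents are ranked by the query likelihood $p(q|\theta_d)$ alone • Now, the question is how to compute $p(q|\theta_d)$. This generally involves two steps: (1) estimate a language model based on D (2) compute the query likelihood according to the estimated model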
Retrieval as Language Model Estimation • Document ranking based on query likelihood: $\log p(q|d) = \sum_{i=1}^{n} \log p(w_i|d)$, where $q = w_1 w_2 \ldots w_n$ and $p(w_i|d)$ is the document language model • Retrieval problem ≈ Estimation of p(wi|d) • Smoothing is an important issue, and distinguishes different approaches
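The two steps (estimate a document model, then score the query) can be sketched as below. The documents are hypothetical toy examples, and the model is the unsmoothed relative-frequency estimate, used here deliberately to show why smoothing matters: a single unseen query word drives the score to minus infinity.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Unsmoothed document model: p(w|d) = c(w,d) / |d|."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def query_log_likelihood(query, model):
    """log p(q|d) = sum_i log p(w_i|d); -inf if any query word is unseen."""
    total = 0.0
    for w in query:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical toy collection.
docs = {
    "d1": "text mining algorithms for data mining and text clustering".split(),
    "d2": "food nutrition and healthy diet data".split(),
}
query = "data mining algorithms".split()

scores = {name: query_log_likelihood(query, ml_model(toks))
          for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Here d2 contains "data" but not "mining" or "algorithms", so it receives a score of minus infinity no matter how well the rest of the query matches.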
A General Smoothing Scheme • All smoothing methods try to • discount the probability of words seen in a doc • re-allocate the extra probability so that unseen words will have a non-zero probability • Most use a reference model (collection language model) to discriminate unseen words • $p(w|d) = p_{seen}(w|d)$ (discounted ML estimate) if w is seen in d, and $p(w|d) = \alpha_d\, p(w|C)$ (collection language model) otherwise
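Smoothing & TF-IDF Weighting • Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain $\log p(q|d) = \sum_{w_i \in d,\ w_i \in q} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i|C)$, where $p_{seen}(w|d)$ is the discounted estimate for words seen in d and $\alpha_d$ is the weight given to the collection model $p(w|C)$ for unseen words • $p_{seen}(w_i|d)$ in the numerator: TF weighting • $p(w_i|C)$ in the denominator: IDF weighting • $n \log \alpha_d$: doc length normalization (a long doc is expected to have a smaller $\alpha_d$) • $\sum_i \log p(w_i|C)$: ignore for ranking • Smoothing with p(w|C) ≈ TF-IDF + length norm.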
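The intermediate step behind this rewriting can be written out as follows. It is a sketch in the notation of the general smoothing scheme: $p_{seen}(w|d)$ for words seen in d, $\alpha_d\, p(w|C)$ for unseen words, and a query $q = w_1 \ldots w_n$. The only trick is to add and subtract $\log \alpha_d\, p(w_i|C)$ for the query words that are seen in d.

```latex
\begin{align*}
\log p(q|d) &= \sum_{i=1}^{n} \log p(w_i|d) \\
  &= \sum_{i:\, w_i \in d} \log p_{seen}(w_i|d)
   + \sum_{i:\, w_i \notin d} \log \alpha_d\, p(w_i|C) \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)}
   + \sum_{i=1}^{n} \log \alpha_d\, p(w_i|C) \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_{seen}(w_i|d)}{\alpha_d\, p(w_i|C)}
   + n \log \alpha_d + \sum_{i=1}^{n} \log p(w_i|C)
\end{align*}
```

The last sum is the same for every document, which is why it can be dropped when ranking.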
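Three Smoothing Methods (Zhai & Lafferty 01) • Simplified Jelinek-Mercer: shrink uniformly toward p(w|C): $p(w|d) = (1-\lambda)\, p_{ml}(w|d) + \lambda\, p(w|C)$ • Dirichlet prior (Bayesian): assume pseudo counts $\mu\, p(w|C)$: $p(w|d) = \frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu}$ • Absolute discounting: subtract a constant $\delta$: $p(w|d) = \frac{\max(c(w,d) - \delta,\, 0) + \delta\, |d|_u\, p(w|C)}{|d|}$, where $|d|_u$ is the number of unique words in d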
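A minimal sketch of the three methods follows, using their standard forms from the smoothing literature. The function names, the default parameter values, and the toy document and collection model are all assumptions made for illustration.

```python
from collections import Counter

def jelinek_mercer(w, counts, doc_len, p_coll, lam=0.5):
    """Interpolate the ML estimate with the collection model."""
    return (1 - lam) * counts.get(w, 0) / doc_len + lam * p_coll[w]

def dirichlet(w, counts, doc_len, p_coll, mu=2000.0):
    """Add mu * p(w|C) pseudo counts to every word."""
    return (counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)

def absolute_discount(w, counts, doc_len, p_coll, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to p(w|C)."""
    unique = len(counts)  # number of distinct words in the document
    discounted = max(counts.get(w, 0) - delta, 0)
    return (discounted + delta * unique * p_coll[w]) / doc_len

# Hypothetical toy document and collection model (sums to 1).
doc = "text mining text data".split()
counts, doc_len = Counter(doc), len(doc)
p_coll = {"text": 0.3, "mining": 0.2, "data": 0.2, "food": 0.3}

for method in (jelinek_mercer, dirichlet, absolute_discount):
    probs = {w: method(w, counts, doc_len, p_coll) for w in p_coll}
    print(method.__name__, round(sum(probs.values()), 6), probs["food"])
```

Each method yields a proper distribution over the vocabulary and gives the unseen word "food" a non-zero probability.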
The Need of Query-Modeling (Dual-Role of Smoothing) [Figure panels: keyword queries; verbose queries]
Another Reason for Smoothing • Query = “the algorithms for data mining” • Word probabilities in query order (the, algorithms, for, data, mining): d1: 0.04, 0.001, 0.02, 0.002, 0.003; d2: 0.02, 0.001, 0.01, 0.003, 0.004 • p(“algorithms”|d1) = p(“algorithms”|d2) • p(“data”|d1) < p(“data”|d2) • p(“mining”|d1) < p(“mining”|d2) • But p(q|d1) > p(q|d2)! • We should make p(“the”) and p(“for”) less different for all docs.
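The arithmetic behind this example can be checked directly. The document probabilities are the ones on the slide; the background model `p_bg` and the mixing weight `lam` are assumptions introduced here to illustrate one way of making p(“the”) and p(“for”) less different across documents.

```python
import math

query = ["the", "algorithms", "for", "data", "mining"]
p_d1 = dict(zip(query, [0.04, 0.001, 0.02, 0.002, 0.003]))
p_d2 = dict(zip(query, [0.02, 0.001, 0.01, 0.003, 0.004]))

def likelihood(model):
    """p(q|d) as a product of per-word probabilities."""
    return math.prod(model[w] for w in query)

# Before: d1 wins only because of the common words "the" and "for".
before = (likelihood(p_d1), likelihood(p_d2))

# Assumed background model that gives common words a high probability,
# and an assumed mixing weight.
p_bg = {"the": 0.2, "algorithms": 0.00001, "for": 0.2,
        "data": 0.00001, "mining": 0.00001}
lam = 0.9

def mix(model):
    """Interpolate each word probability with the background model."""
    return {w: (1 - lam) * model[w] + lam * p_bg[w] for w in query}

after = (likelihood(mix(p_d1)), likelihood(mix(p_d2)))
print(before[0] > before[1], after[0] < after[1])
```

After mixing, “the” and “for” contribute almost the same factor for both documents, so the content words decide the ranking and d2 comes out ahead.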
Two-stage Smoothing • Stage-1: explain unseen words, with a Dirichlet prior (Bayesian), parameter μ • Stage-2: explain noise in the query, with a 2-component mixture, parameter λ • $P(w|d) = (1-\lambda)\, \frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)$
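The two-stage formula translates almost directly into code. In this sketch the function name, the default values of `mu` and `lam`, and the toy models are assumptions; `p_user` stands for the query background model p(w|U).

```python
from collections import Counter

def two_stage(w, counts, doc_len, p_coll, p_user, mu=1000.0, lam=0.5):
    """Stage 1: Dirichlet prior smoothing with the collection model p(w|C).
    Stage 2: mix the result with the query background model p(w|U)."""
    stage1 = (counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
    return (1 - lam) * stage1 + lam * p_user[w]

# Hypothetical toy data; both reference models sum to 1.
doc = "text mining text data".split()
counts, doc_len = Counter(doc), len(doc)
p_coll = {"text": 0.3, "mining": 0.2, "data": 0.2, "the": 0.3}
p_user = {"text": 0.1, "mining": 0.1, "data": 0.1, "the": 0.7}

probs = {w: two_stage(w, counts, doc_len, p_coll, p_user, mu=10.0, lam=0.3)
         for w in p_coll}
print(probs)
```

Setting `lam=0` recovers plain Dirichlet prior smoothing, so the second stage is a strict extension of the first.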
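Estimating μ using leave-one-out • Leave-one-out: for each word occurrence w1, w2, ..., wn in a document d, compute P(wi|d − wi), the probability of the word under the model estimated from the document with that occurrence removed • Sum the logs to obtain the leave-one-out log-likelihood • Maximum likelihood estimator: choose the μ that maximizes this log-likelihood, found with Newton’s method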
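A sketch of the leave-one-out objective follows, assuming Dirichlet prior smoothing so that removing one occurrence of w gives P(w|d − w) = (c(w,d) − 1 + μ p(w|C)) / (|d| − 1 + μ). The slide uses Newton's method to maximize it; for simplicity this example swaps in a plain grid search over candidate values. The documents, the grid, and the helper names are hypothetical.

```python
import math
from collections import Counter

def collection_model(docs):
    """Relative-frequency collection model p(w|C) over all documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def loo_log_likelihood(mu, docs, p_coll):
    """Sum over all word occurrences of log P(w | d with that occurrence removed)."""
    total = 0.0
    for tokens in docs:
        counts, n = Counter(tokens), len(tokens)
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_coll[w]) / (n - 1 + mu))
    return total

def estimate_mu(docs, p_coll, grid):
    """Grid search stand-in for Newton's method (an assumption of this sketch)."""
    return max(grid, key=lambda mu: loo_log_likelihood(mu, docs, p_coll))

docs = [
    "text mining text data mining".split(),
    "food nutrition healthy food diet".split(),
    "data mining algorithms for text data".split(),
]
p_coll = collection_model(docs)
grid = [0.1, 0.5, 1, 2, 5, 10, 50, 100, 1000]
best_mu = estimate_mu(docs, p_coll, grid)
print(best_mu, loo_log_likelihood(best_mu, docs, p_coll))
```

The objective needs no queries or relevance judgments, only the documents themselves.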
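Estimating λ using Mixture Model • Stage-1: each document d1, ..., dN has a smoothed model P(w|d1), ..., P(w|dN) • Stage-2: the query is generated from a mixture over documents with weights π1, ..., πN, where document di contributes the component (1−λ)p(w|di) + λp(w|U) • Maximum likelihood estimator of λ, computed with the Expectation-Maximization (EM) algorithm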
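One way to run EM for this mixture on a single query is sketched below. It assumes the document models are already smoothed (every query word has non-zero probability) and treats both the document weights and λ as parameters. The function name `em_lambda`, the starting values, and the toy probabilities are assumptions.

```python
import math

def em_lambda(query, doc_models, p_user, iters=50, lam=0.5):
    """EM for p(q) = sum_i pi_i * prod_j ((1-lam) p(q_j|d_i) + lam p(q_j|U))."""
    n_docs = len(doc_models)
    pi = [1.0 / n_docs] * n_docs
    history = []  # log-likelihood at the start of each iteration
    for _ in range(iters):
        # E-step: per-word mixture values and posterior over documents.
        mix = [[(1 - lam) * m[w] + lam * p_user[w] for w in query]
               for m in doc_models]
        doc_lik = [pi[i] * math.prod(mix[i]) for i in range(n_docs)]
        total = sum(doc_lik)
        history.append(math.log(total))
        resp = [x / total for x in doc_lik]
        # Posterior that word j came from p(w|U), given document i.
        noise = [[lam * p_user[w] / mix[i][j] for j, w in enumerate(query)]
                 for i in range(n_docs)]
        # M-step: re-estimate document weights and lambda.
        pi = resp
        lam = sum(resp[i] * sum(noise[i]) for i in range(n_docs)) / len(query)
    return lam, pi, history

# Hypothetical smoothed document models and query background model.
query = ["the", "data", "mining"]
doc_models = [
    {"the": 0.05, "data": 0.02, "mining": 0.03},
    {"the": 0.04, "data": 0.001, "mining": 0.001},
]
p_user = {"the": 0.3, "data": 0.001, "mining": 0.001}

lam_hat, pi_hat, history = em_lambda(query, doc_models, p_user)
print(lam_hat, pi_hat)
```

Each EM iteration cannot decrease the query likelihood, which gives a simple sanity check on the implementation.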
Automatic 2-stage results ≈ Optimal 1-stage results • Average precision (3 DBs + 4 query types, 150 topics) [results table not recovered]
Acknowledgement • Many thanks to Chengxiang Zhai, who generously shared his slides on the language modeling approach to information retrieval