Language Models for TR

Rong Jin, Department of Computer Science and Engineering, Michigan State University

What is a Statistical LM?
• A probability distribution over word sequences
  • p(“Today is Wednesday”) ≈ 0.001
  • p(“Today Wednesday is”) ≈ 0.0000000000001
  • p(“The eigenvalue is positive”) ≈ 0.00001
• Context-dependent!
• Can also be regarded as a probabilistic mechanism for “generating” text, and is thus also called a “generative” model

Why is a LM Useful?
• Provides a principled way to quantify the uncertainties associated with natural language
• Allows us to answer questions like:
  • Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition)
  • Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
  • Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

The Simplest Language Model (Unigram Model)
• Generate a piece of text by generating each word INDEPENDENTLY
• Thus, p(w1 w2 ... wn) = p(w1) p(w2) … p(wn)
• Parameters: {p(wi)}, with p(w1) + … + p(wN) = 1 (N is the vocabulary size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample drawn according to this word distribution
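The independence assumption can be checked on a toy example. The sketch below is illustrative only: the three-word vocabulary and its probabilities are made up, and `unigram_log_prob` is a hypothetical helper name. It shows that a unigram model assigns the same probability to any reordering of the same words.

```python
import math

def unigram_log_prob(words, model):
    """Log-probability of a word sequence: the sum of per-word log-probabilities."""
    return sum(math.log(model[w]) for w in words)

# Hypothetical toy model; the probabilities sum to 1 over a 3-word vocabulary.
model = {"today": 0.5, "is": 0.3, "wednesday": 0.2}

lp_a = unigram_log_prob(["today", "is", "wednesday"], model)
lp_b = unigram_log_prob(["today", "wednesday", "is"], model)

# Word order does not matter under the unigram assumption.
print(math.exp(lp_a), math.exp(lp_b))
```

Working in log space avoids numerical underflow when the text is long.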

Text Generation with Unigram LM
• A (unigram) language model θ defines p(w|θ); a document is generated by sampling words from it
• Topic 1 (text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food 0.00001 … → sampling yields a text mining paper
• Topic 2 (health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 … → sampling yields a food nutrition paper
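A minimal sampling sketch, assuming only the word probabilities listed on the slide are available. The elided entries are not filled in, so `random.choices` implicitly renormalizes the listed weights. The function name `sample_text` and the seed are illustrative choices.

```python
import random

def sample_text(model, length, seed=0):
    """Draw `length` words independently from a unigram model (dict word -> weight)."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(words, weights=weights, k=length)

# Only the entries shown on the slide; the "…" entries are unknown and left out.
topic_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01,
                     "clustering": 0.02, "food": 0.00001}
topic_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

doc1 = sample_text(topic_text_mining, 10)
doc2 = sample_text(topic_health, 10)
print(doc1)
print(doc2)
```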

Estimation of Unigram LM
• Given a document, estimate the (unigram) language model θ: p(w|θ) = ?
• Document: a “text mining paper” (total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, … query 1, efficient 1
• Estimates (count / document length): text 10/100, mining 5/100, association 3/100, database 3/100, … query 1/100
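The estimates on this slide are relative frequencies (the maximum likelihood estimate). A small sketch using the counts shown on the slide; the helper name `mle_from_counts` is hypothetical, and words elided on the slide are simply absent from the dictionary.

```python
def mle_from_counts(counts, doc_length):
    """Maximum likelihood unigram estimate: p(w|θ) = c(w, d) / |d|."""
    return {w: c / doc_length for w, c in counts.items()}

# Counts listed on the slide; the document has 100 words in total.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
model = mle_from_counts(counts, doc_length=100)

print(model["text"], model["mining"], model["query"])
```

Any word that does not occur in the document gets probability zero under this estimate.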

Language Models for Retrieval (Ponte & Croft 98)
• Query = “data mining algorithms”
• Each document has its own language model, with unknown word probabilities:
  • Text mining paper: … text ?, mining ?, association ?, clustering ?, … food ? …
  • Food nutrition paper: … food ?, nutrition ?, healthy ?, diet ? …
• Which model would most likely have generated this query?

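Ranking Docs by Query Likelihood
• Each document d1, d2, …, dN has its own doc LM θ_d1, θ_d2, …, θ_dN
• For a query q, compute the query likelihoods p(q|θ_d1), p(q|θ_d2), …, p(q|θ_dN)
• Rank the documents by these query likelihoods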

But, where is the relevance? And, what’s good about this approach?

The Notion of Relevance
• Relevance as similarity between (Rep(q), Rep(d)), with different representations & similarity measures
  • Vector space model (Salton et al., 75)
  • Prob. distr. model (Wong & Yao, 89)
• Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}
  • Regression model (Fox 83)
  • Generative model
    • Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    • Query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
• Relevance as probabilistic inference, P(d→q) or P(q→d), with different inference systems
  • Inference network model (Turtle & Croft, 91)
  • Prob. concept space model (Wong & Yao, 95)

Query Generation
$$p(d \mid q) \propto p(q \mid \theta_d)\, p(d)$$
• p(q|θ_d): query likelihood; p(d): document prior
• Assuming a uniform prior, we have p(d|q) ∝ p(q|θ_d)
• Now, the question is how to compute p(q|θ_d)
• Generally involves two steps: (1) estimate a language model θ_d based on d (2) compute the query likelihood according to the estimated model

Retrieval as Language Model Estimation
• Document ranking based on query likelihood, for a query q = w1 w2 … wn:
$$\log p(q \mid d) = \sum_{i=1}^{n} \log p(w_i \mid d)$$
• p(wi|d) is the document language model
• Retrieval problem ≈ estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
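A rough sketch of query-likelihood ranking with an unsmoothed maximum likelihood document model. The three toy documents and the helper names are made up for illustration. It also shows why smoothing matters: a document missing a single query word scores negative infinity.

```python
import math
from collections import Counter

def mle(tokens):
    """Unsmoothed document language model: relative word frequencies."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

def query_log_likelihood(query, doc_model):
    """log p(q|d) = sum_i log p(w_i|d); -inf if any query word is unseen."""
    total = 0.0
    for w in query:
        p = doc_model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

# Hypothetical toy collection.
docs = {
    "d1": "text mining algorithms for text data".split(),
    "d2": "food nutrition and healthy diet".split(),
    "d3": "data mining".split(),
}
query = "data mining algorithms".split()

scores = {name: query_log_likelihood(query, mle(toks)) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Here d3 matches two of the three query words but is scored no better than the unrelated d2, which is the behavior smoothing is meant to fix.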

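A General Smoothing Scheme
• All smoothing methods try to
  • discount the probability of words seen in a doc
  • re-allocate the extra probability so that unseen words will have a non-zero probability
• Most use a reference model (collection language model) to discriminate unseen words
$$p(w \mid d) = \begin{cases} p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid C) & \text{otherwise} \end{cases}$$
• p_seen(w|d): discounted ML estimate; p(w|C): collection language model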

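Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme (p_seen(w|d) for words seen in d, α_d p(w|C) otherwise) into the query likelihood retrieval formula for a query q = q1 … qn, we obtain
$$\log p(q \mid d) = \sum_{i:\, c(q_i, d) > 0} \log \frac{p_{seen}(q_i \mid d)}{\alpha_d\, p(q_i \mid C)} + n \log \alpha_d + \sum_{i=1}^{n} \log p(q_i \mid C)$$
• First term: the numerator p_seen(qi|d) gives TF weighting, the denominator p(qi|C) gives IDF weighting
• Second term: doc length normalization (a long doc is expected to have a smaller α_d)
• Third term: the same for every document, so ignore it for ranking
• Smoothing with p(w|C) ≈ TF-IDF + length norm.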
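The rewriting of the query log-likelihood into a matched-word term, a length-normalization term n·log α_d, and a document-independent background term can be checked numerically. The sketch below assumes Dirichlet prior smoothing, for which p_seen(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ) and α_d = μ / (|d| + μ). The toy data, the value of μ, and the function names are illustrative.

```python
import math
from collections import Counter

def direct_score(query, doc, p_coll, mu):
    """log p(q|d) computed word by word with Dirichlet smoothing."""
    dlen = sum(doc.values())
    return sum(math.log((doc[w] + mu * p_coll[w]) / (dlen + mu)) for w in query)

def decomposed_score(query, doc, p_coll, mu):
    """Same quantity split into matched-word, length-norm and background terms."""
    dlen = sum(doc.values())
    alpha_d = mu / (dlen + mu)
    matched = sum(
        math.log(((doc[w] + mu * p_coll[w]) / (dlen + mu)) / (alpha_d * p_coll[w]))
        for w in query if doc[w] > 0          # only query words seen in the doc
    )
    length_norm = len(query) * math.log(alpha_d)
    background = sum(math.log(p_coll[w]) for w in query)  # same for every doc
    return matched + length_norm + background

# Hypothetical toy data.
doc = Counter("text mining text data".split())
collection = Counter("text mining text data food diet algorithms data".split())
total = sum(collection.values())
p_coll = {w: c / total for w, c in collection.items()}
query = "data mining algorithms".split()

a = direct_score(query, doc, p_coll, mu=10.0)
b = decomposed_score(query, doc, p_coll, mu=10.0)
print(a, b)
```

Only the matched-word sum touches words that occur in both query and document, which is what makes the formula cheap to evaluate with an inverted index.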

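Three Smoothing Methods (Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)
$$p(w \mid d) = (1-\lambda)\, \frac{c(w,d)}{|d|} + \lambda\, p(w \mid C)$$
• Dirichlet prior (Bayesian): assume pseudo counts μ p(w|C)
$$p(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu}$$
• Absolute discounting: subtract a constant δ
$$p(w \mid d) = \frac{\max(c(w,d) - \delta,\, 0) + \delta\, |d|_u\, p(w \mid C)}{|d|}$$
  where |d|_u is the number of unique words in d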
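A compact sketch of the three methods, written from the standard formulas for Jelinek-Mercer, Dirichlet prior, and absolute discounting. The parameter defaults (`lam`, `mu`, `delta`), the function names, and the toy data are assumptions for illustration, not values from the slides.

```python
from collections import Counter

def jelinek_mercer(w, doc, p_coll, lam=0.5):
    """Linear interpolation of the ML estimate with the collection model."""
    dlen = sum(doc.values())
    return (1 - lam) * doc[w] / dlen + lam * p_coll[w]

def dirichlet(w, doc, p_coll, mu=2000.0):
    """Add mu * p(w|C) pseudo counts to every word."""
    dlen = sum(doc.values())
    return (doc[w] + mu * p_coll[w]) / (dlen + mu)

def absolute_discount(w, doc, p_coll, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to p(w|C)."""
    dlen = sum(doc.values())
    unique = sum(1 for c in doc.values() if c > 0)
    return (max(doc[w] - delta, 0.0) + delta * unique * p_coll[w]) / dlen

# Hypothetical toy document and collection.
doc = Counter("text mining text".split())
collection = Counter("text mining text food diet data".split())
total = sum(collection.values())
p_coll = {w: c / total for w, c in collection.items()}

for method in (jelinek_mercer, dirichlet, absolute_discount):
    probs = {w: method(w, doc, p_coll) for w in p_coll}
    print(method.__name__, round(sum(probs.values()), 6), probs["food"])
```

Each method still defines a proper distribution over the vocabulary, and the unseen word "food" now gets a non-zero probability.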

Comparison of Three Methods

The Need of Query Modeling (Dual Role of Smoothing)
[Figure: two panels, one for keyword queries and one for verbose queries]

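Two-stage Smoothing
$$P(w \mid d) = (1-\lambda)\, \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)$$
• Stage 1: explain unseen words, using a Dirichlet prior (Bayesian) with parameter μ
• Stage 2: explain noise in the query, using a 2-component mixture with p(w|U) and parameter λ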
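A minimal sketch of the two-stage formula: Dirichlet smoothing with the collection model first, then interpolation with a user background model p(w|U). Using the collection model as a stand-in for p(w|U) is an assumption made here for illustration, as are the parameter values and the function name.

```python
from collections import Counter

def two_stage(w, doc, p_coll, p_user, mu=2000.0, lam=0.3):
    """P(w|d) = (1 - lam) * (c(w,d) + mu*p(w|C)) / (|d| + mu) + lam * p(w|U)."""
    dlen = sum(doc.values())
    stage1 = (doc[w] + mu * p_coll[w]) / (dlen + mu)   # explains unseen words
    return (1 - lam) * stage1 + lam * p_user[w]        # explains query noise

# Hypothetical toy data.
doc = Counter("text mining text".split())
collection = Counter("text mining text food diet data".split())
total = sum(collection.values())
p_coll = {w: c / total for w, c in collection.items()}
p_user = p_coll  # assumption: collection model used in place of p(w|U)

probs = {w: two_stage(w, doc, p_coll, p_user, mu=10.0, lam=0.3) for w in p_coll}
print(probs)
```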

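Estimating μ using leave-one-out
• For each word wi in a document d (w1, w2, ..., wn), leave it out and compute P(wi|d − wi) from the rest of the document
• Sum the logs of P(w1|d − w1), P(w2|d − w2), ..., P(wn|d − wn) to get the leave-one-out log-likelihood
• Choose μ as the maximum likelihood estimator of this log-likelihood, computed with Newton’s method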
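A rough sketch of the leave-one-out objective under Dirichlet smoothing, where leaving out one occurrence of w gives P(w|d − w) = (c(w,d) − 1 + μ p(w|C)) / (|d| − 1 + μ). For simplicity this sketch picks μ by a plain grid search rather than Newton’s method; the grid, the toy documents, and the function name are illustrative assumptions.

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, p_coll):
    """Sum over all word occurrences of log P(w | d with that occurrence removed)."""
    total = 0.0
    for doc in docs:
        dlen = sum(doc.values())
        for w, c in doc.items():
            # Each of the c occurrences of w contributes the same term.
            total += c * math.log((c - 1 + mu * p_coll[w]) / (dlen - 1 + mu))
    return total

# Hypothetical toy collection.
docs = [
    Counter("text mining text data".split()),
    Counter("food diet food nutrition".split()),
    Counter("data mining algorithms text".split()),
]
collection = sum(docs, Counter())
total = sum(collection.values())
p_coll = {w: c / total for w, c in collection.items()}

# Grid search stands in for Newton's method here.
grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_mu = max(grid, key=lambda mu: loo_log_likelihood(mu, docs, p_coll))
print(best_mu, loo_log_likelihood(best_mu, docs, p_coll))
```

The objective is only defined for μ > 0, since a word that occurs once in a document has a leave-one-out count of zero.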

Automatic 2-stage results ≈ Optimal 1-stage results
Average precision (3 DB’s + 4 query types, 150 topics)

Acknowledgement
• Many thanks to Chengxiang Zhai, who generously shared his slides on the language modeling approach to information retrieval
