Two-stage Language Models for Information Retrieval
ChengXiang Zhai*, John Lafferty
School of Computer Science, Carnegie Mellon University
*New address: Department of Computer Science, University of Illinois, Urbana-Champaign
Motivation • Retrieval parameters are needed to • model different user preferences • customize a retrieval model according to different queries and documents • So far, parameters have been set through empirical experimentation • Can we set parameters automatically?
Parameters in Traditional Models • EXTERNAL to the model, hard to interpret • Most parameters are introduced heuristically to implement our “intuition” • As a result, there is no principled way to quantify them • Set through empirical experiments • Requires lots of experimentation • Optimality for new queries is not guaranteed
Example of Parameter Tuning (Okapi) “k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).” (Robertson et al. 1999)
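To make the quoted parameters concrete, here is a minimal sketch of the Okapi BM25 scoring function in which k1, b, and k3 appear. The function name and input layout are illustrative, not from the talk; the formula itself is the standard BM25 term-weighting scheme.

import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=7.0):
    """Okapi BM25 score of one document for one query.

    query_tf: {term: frequency in the query}
    doc_tf:   {term: frequency in the document}
    N, df:    collection size and {term: document frequency}
    k1, b, k3 are the tuning parameters quoted above.
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        # b controls how strongly term frequency is normalized by document length
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        # k3 dampens the weight of repeated query terms (matters for long queries)
        qtf_part = qtf * (k3 + 1) / (qtf + k3)
        score += idf * tf_part * qtf_part
    return score

Note how every parameter sits outside any probabilistic model: nothing in the formula itself says how k1, b, or k3 should be set, which is exactly the problem the talk addresses.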
The Way to Automatic Tuning ... • Parameters must be PART of the model! • Query modeling (explain differences among queries) • Document modeling (explain differences among documents) • Decouple the influence of the query on parameter setting from that of the documents • To achieve stable parameter settings • To pre-compute query-independent parameters
The Rest of the Talk • Risk Minimization Retrieval Framework • Two-stage Language Models • Two-stage Dirichlet-Mixture Smoothing • Parameter Estimation
The Risk Minimization Framework (Lafferty & Zhai 01, Zhai 02)
[Diagram: the query is summarized by a query language model (QUERY MODELING), each document by a document language model (DOC MODELING), and the user’s preferences by a loss function (USER MODELING); retrieval is cast as a decision problem of minimizing the expected loss.]
Parameter Setting in Risk Minimization
[Diagram: the same framework annotated with where parameters live — query model parameters are estimated from the query, document model parameters are estimated from the documents, and user model parameters are set in the loss function.]
Two-stage Language Models
[Diagram: the risk ranking formula splits into two stages — stage 1 estimates the document language model θ_d (smoothing!), and stage 2 scores the query against it through the query language model θ_q and the loss function.]
Sensitivity in Traditional (“one-stage”) Smoothing
[Plot: retrieval performance as a function of the smoothing parameter, for keyword queries and for verbose (sentence-like) queries — the sensitivity patterns and optimal settings differ across the two query types.]
The Need of Two-stage Smoothing (I): Accurate Estimation of the Doc Model
Query = “data mining algorithms”; document d = a 500-word text-mining paper
Maximum-likelihood estimates: p(“text”|d) = 10/500 = 0.02, p(“mining”|d) = 3/500 = 0.006, p(“association”|d) = 1/500 = 0.002, p(“algorithm”|d) = 2/500 = 0.004, p(“data”|d) = 0/500 = 0, …
Then p(q|d) = p(“data”|d) · p(“mining”|d) · p(“algorithm”|d) = 0 · 0.006 · 0.004 = 0!
What should p(“data”|d) be? And p(“unicorn”|d)?
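A minimal sketch of the zero-probability problem, using the example counts from this slide: under maximum likelihood, a single unseen query word drives the whole query likelihood to zero.

from math import prod

# Example counts from the 500-word "text mining paper" on this slide
doc_counts = {"text": 10, "mining": 3, "association": 1, "algorithm": 2, "data": 0}
doc_len = 500

def p_ml(word):
    # Maximum-likelihood estimate: relative frequency in the document
    return doc_counts.get(word, 0) / doc_len

query = ["data", "mining", "algorithm"]
print(prod(p_ml(w) for w in query))  # 0.0 -- one unseen word zeroes out p(q|d)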
The Need of Two-stage Smoothing (II): Explanation of Noise in the Query
Query = “the algorithms for data mining”
      “the”   “algorithms”   “for”   “data”   “mining”
d1:   0.04    0.001          0.02    0.002    0.003
d2:   0.02    0.001          0.01    0.003    0.004
p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2), but p(q|d1) > p(q|d2)!
We should make p(“the”) and p(“for”) less different across documents.
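A quick check of the arithmetic on this slide (probabilities copied from the table above): the common function words dominate the product, so d1 scores higher even though it matches the content words less well.

from math import prod

# Per-word probabilities in query order: "the", "algorithms", "for", "data", "mining"
d1 = [0.04, 0.001, 0.02, 0.002, 0.003]
d2 = [0.02, 0.001, 0.01, 0.003, 0.004]

print(prod(d1))  # 4.8e-12
print(prod(d2))  # 2.4e-12 -> p(q|d1) > p(q|d2), driven by "the" and "for"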
Two-stage Dirichlet-Mixture Smoothing
Stage-1 smoothing (Dirichlet prior): explain unseen words by adding μ pseudo counts from the collection model p(w|C)
Stage-2 smoothing (two-component mixture): explain noise in the query by linear interpolation with a user background model p(w|U)
p(w|d) = (1 − λ) · (c(w,d) + μ · p(w|C)) / (|d| + μ) + λ · p(w|U)
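A minimal sketch of the smoothing formula as code; the function name and argument layout are assumptions, but the computation is exactly the two-stage estimate above.

def two_stage_smooth(c_wd, doc_len, p_w_C, p_w_U, mu, lam):
    """Two-stage Dirichlet-mixture estimate of p(w|d)."""
    # Stage 1 (Dirichlet prior): add mu pseudo counts from the collection model
    stage1 = (c_wd + mu * p_w_C) / (doc_len + mu)
    # Stage 2 (mixture): interpolate with the user background model
    return (1 - lam) * stage1 + lam * p_w_U

# For the earlier example (with made-up collection/background probabilities),
# p("data"|d) is no longer zero:
# two_stage_smooth(c_wd=0, doc_len=500, p_w_C=0.001, p_w_U=0.001, mu=1000, lam=0.3) > 0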
Estimating μ Using Leave-One-Out
Leave each word occurrence out in turn and predict it with the model built from the rest of the document, p(w|d − w)
Leave-one-out log-likelihood: ℓ₋₁(μ|C) = Σ_{d∈C} Σ_w c(w,d) · log[ (c(w,d) − 1 + μ · p(w|C)) / (|d| − 1 + μ) ]
Maximum likelihood estimator: μ̂ = argmax_μ ℓ₋₁(μ|C), computed with Newton’s method
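A sketch of this estimator under stated assumptions: documents arrive as word-count dictionaries, p_C is the collection model, and the Newton iteration uses the first and second derivatives of ℓ₋₁ written out by hand. Convergence safeguards are minimal; this is not the paper’s exact implementation.

def estimate_mu(docs, p_C, mu=1.0, iters=20):
    """Leave-one-out estimate of the Dirichlet prior mu via Newton's method.

    docs: list of {word: count} dicts; p_C: {word: p(w|C)}
    """
    for _ in range(iters):
        g = h = 0.0  # first and second derivatives of the LOO log-likelihood
        for doc in docs:
            dl = sum(doc.values())
            for w, c in doc.items():
                num = c - 1 + mu * p_C[w]
                den = dl - 1 + mu
                g += c * (p_C[w] / num - 1.0 / den)
                h += c * (1.0 / den ** 2 - p_C[w] ** 2 / num ** 2)
        step = mu - g / h              # Newton update
        mu = step if step > 0 else mu / 2  # keep the pseudo-count mass positive
    return mu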
Estimating λ Using a Mixture Model
Stage 1 yields a smoothed model p(w|d_i) for each document d_1, …, d_N
Stage 2 treats each query word as drawn from the mixture (1 − λ) · p(w|d_i) + λ · p(w|U)
Simultaneously adjust λ and the document weights α_1, …, α_N to maximize the query likelihood
Maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm
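A compact EM loop in the spirit of this slide; the exact updates in the paper differ in detail, and the names (doc_models, p_U) and the uniform initialization are assumptions. Each query word is attributed either to the noise component or to one of the documents, and λ and the α’s are re-estimated from those posteriors.

def estimate_lambda(query, doc_models, p_U, lam=0.5, iters=50):
    """EM sketch for the mixture sum_i alpha_i [(1-lam) p(w|d_i) + lam p(w|U)].

    query: list of words; doc_models: list of smoothed {word: p(w|d_i)};
    p_U: background model, assumed to cover every query word.
    """
    N = len(doc_models)
    alpha = [1.0 / N] * N  # uniform document weights to start (sums to 1)
    for _ in range(iters):
        noise_sum, doc_sum = 0.0, [0.0] * N
        for w in query:
            # E-step: joint weight of each document for this word
            comp = [alpha[i] * ((1 - lam) * doc_models[i].get(w, 0.0)
                                + lam * p_U[w]) for i in range(N)]
            total = sum(comp)
            noise_sum += lam * p_U[w] / total   # posterior that w is query noise
            for i in range(N):
                doc_sum[i] += comp[i] / total   # posterior that w came via d_i
        # M-step: re-estimate lam and the document weights (alpha stays normalized)
        lam = noise_sum / len(query)
        alpha = [s / len(query) for s in doc_sum]
    return lam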
Effectiveness of Parameter Estimation • Five databases • News articles (AP, WSJ, ZIFF, FBIS, FT, LA) • Government documents (Federal Register) • Web pages • Four types of queries • Long vs. short • Verbose (sentence-like) vs. keyword • Result: automatic 2-stage is comparable to optimal 1-stage
[Table: average precision, automatic 2-stage results vs. optimal 1-stage results (3 DB’s + 4 query types, 150 topics)]
[Table: average precision, automatic 2-stage results vs. optimal 1-stage results (2 large DB’s + 2 query types, 50 topics)]
Conclusions • Two-stage language models • Direct modeling of both queries and documents • Parameters are part of a probabilistic model • Parameters can be estimated using standard estimation techniques • Two-stage Dirichlet-mixture smoothing • Involves two meaningful parameters (i.e., document sample size μ and query noise λ) • Achieves very good performance by setting the smoothing parameters automatically • It is possible to set parameters automatically!
Future Work • Optimality analysis in the two-stage parameter space • Offline vs. online estimation • Alternative estimation methods • Parameter estimation for more sophisticated language models (e.g., with feedback)