Risk Minimization and Language Modeling in Text Retrieval
ChengXiang Zhai
Thesis Committee: John Lafferty (Chair), Jamie Callan, Jaime Carbonell, David A. Evans, W. Bruce Croft (Univ. of Massachusetts, Amherst)
Information Overload
[figure: Web site growth]
Text Retrieval (TR)
[diagram: a User issues a query (e.g., “Tips on thesis defense”) to a Retrieval System over a database/collection of text docs; the system returns relevant docs]
Challenges in TR
• Ad hoc parameter tuning
• Relevance (independent, topical) vs. utility
Sophisticated Parameter Tuning in the Okapi System
“k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000 (effectively infinite).” (Robertson et al. 1999)
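The quoted Okapi weighting can be made concrete. Below is a minimal sketch of the BM25 scoring function using the k1, b, and k3 parameters from the quote; the function names and the bag-of-words document/query representation are illustrative, not from the slides:

```python
import math

def bm25_term_weight(tf, qtf, df, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """One term's contribution to the Okapi BM25 score of a document.

    tf: term frequency in the document; qtf: term frequency in the query;
    df: document frequency of the term; N: number of documents;
    dl/avdl: document length and average document length.
    k1, b, k3 are the tuning parameters from the quote (slide defaults).
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))   # Robertson-Sparck Jones IDF
    K = k1 * ((1 - b) + b * dl / avdl)            # length-normalized k1
    doc_part = tf * (k1 + 1) / (K + tf)           # document-side saturation
    query_part = qtf * (k3 + 1) / (k3 + qtf)      # query-side saturation
    return idf * doc_part * query_part

def bm25_score(query, doc, df, N, avdl, **params):
    """Sum term weights over the query's distinct terms (lists of tokens)."""
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf > 0:
            score += bm25_term_weight(tf, query.count(t), df[t], N, dl, avdl, **params)
    return score
```

This makes the tuning problem visible: three free parameters whose good values depend on query style and collection, which is exactly the ad hoc tuning the thesis aims to replace with automatic estimation.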
More Than “Relevance”
[diagram: a plain relevance ranking vs. the desired ranking, which also accounts for redundancy and readability]
Meeting the Challenges
• Bayesian decision theory → utility-based retrieval
• Statistical language models → parameter estimation
• Together, these yield the Risk Minimization Framework
Map of Thesis
New TR Framework: Risk Minimization Framework
New TR Models (features):
• Two-stage Language Model (automatic parameter setting)
• KL-divergence Retrieval Model (natural incorporation of feedback)
• Aspect Retrieval Model (non-traditional ranking)
Retrieval as Decision-Making
Given a query,
- Which documents should be selected? (D)
- How should these docs be presented to the user? (π)
Choose: (D, π)
[diagram: candidate presentation strategies — a ranked list (1, 2, 3, 4), an unordered subset, clustering, …]
Generative Model of Document & Query
[diagram: a Source S generates each document d; a User U generates the query q. d and q are observed; S is inferred and U is only partially observed]
Bayesian Decision Theory
[diagram: the hidden user U (behind the observed query q) and the hidden source S (behind the observed doc set C) are unobserved; each possible choice (D1, π1), (D2, π2), …, (Dn, πn) incurs a loss L]
RISK MINIMIZATION: choose the (D, π) that minimizes the Bayes risk for choice (D, π)
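The Bayes risk on the slide can be written out explicitly. This is a sketch in the framework's notation, with θ standing for the hidden user and source parameters and the conditioning set abbreviated:

```latex
R(D,\pi \mid q, C) \;=\; \int_{\Theta} L\bigl((D,\pi),\,\theta\bigr)\, p(\theta \mid q, C)\, d\theta,
\qquad
(D^{*},\pi^{*}) \;=\; \arg\min_{(D,\pi)} R(D,\pi \mid q, C)
```

The loss L encodes the utility of a presentation choice, and the posterior p(θ | q, C) is where the statistical language models enter.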
Special Cases
• Set-based models (choose D) → Boolean model
• Ranking models (choose π)
  - Independent loss (→ PRP)
    · Relevance-based loss → Probabilistic relevance model, Two-stage LM
    · Distance-based loss → Vector-space model, KL-divergence model
  - Dependent loss
    · MMR loss
    · MDR loss → Aspect retrieval model
Map of Existing TR Models
Relevance
• Similarity, (R(q), R(d)) — different rep & similarity: Vector space model (Salton et al., 75); Prob. distr. model (Wong & Yao, 89)
• Probability of relevance, P(r=1|q,d), r ∈ {0,1}
  - Regression model (Fox, 83)
  - Generative model
    · Doc generation: Classical prob. model (Robertson & Sparck Jones, 76)
    · Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
• Probabilistic inference, P(d→q) or P(q→d) — different inference systems: Inference network model (Turtle & Croft, 91); Prob. concept space model (Wong & Yao, 95)
Where Are We?
Risk Minimization Framework → Two-stage Language Model (up next) · KL-divergence Retrieval Model · Aspect Retrieval Model
Two-stage Language Models
[diagram: Stage 1 computes the document model p(w|d) from the source S and document d, using Dirichlet prior smoothing; Stage 2 computes the query likelihood for user U and query q with a mixture model (two-stage smoothing); together with the loss function these yield the risk ranking formula]
The Need for Query Modeling (Dual Role of Smoothing)
[figure: smoothing behaves differently for keyword queries vs. verbose queries]
Two-stage Smoothing
P(w|d) = (1 − λ) · [c(w,d) + μ·p(w|C)] / (|d| + μ) + λ·p(w|U)
• Stage 1 (Dirichlet prior, Bayesian): explains unseen words
• Stage 2 (2-component mixture): explains noise in the query
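The two-stage smoothing formula can be sketched directly in code. The default values of mu and lam below are placeholders; the thesis sets both automatically:

```python
def two_stage_prob(w, doc_counts, doc_len, p_C, p_U, mu=2000.0, lam=0.5):
    """Two-stage smoothed document model P(w|d).

    Stage 1 (Dirichlet prior): (c(w,d) + mu*p(w|C)) / (|d| + mu)
    Stage 2 (mixture with the user's background model):
             (1 - lam) * stage1 + lam * p(w|U)
    p_C, p_U: callables giving collection and user background probabilities.
    """
    stage1 = (doc_counts.get(w, 0) + mu * p_C(w)) / (doc_len + mu)
    return (1.0 - lam) * stage1 + lam * p_U(w)
```

Note that every word, even one unseen in the document, receives nonzero probability from the two background models.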
Estimating μ Using Leave-One-Out
[diagram: for each word wi of each document d, compute P(wi | d − wi) with that occurrence left out; the leave-one-out log-likelihood is maximized over μ (maximum likelihood estimator, solved with Newton's method)]
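The leave-one-out estimate of the Dirichlet prior parameter can be sketched as follows. The slide maximizes with Newton's method; this illustrative version uses a coarse grid search instead, and the docs/p_C interfaces are assumptions:

```python
import math

def loo_loglik(mu, docs, p_C):
    """Leave-one-out log-likelihood of the Dirichlet smoothing parameter mu.

    Each occurrence of w in d is predicted by the model estimated from
    d minus that one occurrence:
        P(w | d - w) = (c(w,d) - 1 + mu*p(w|C)) / (|d| - 1 + mu)
    docs: list of {word: count} dicts; p_C: collection model callable.
    """
    ll = 0.0
    for counts in docs:
        dlen = sum(counts.values())
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_C(w)) / (dlen - 1 + mu))
    return ll

def estimate_mu(docs, p_C, grid=None):
    """Pick the mu that maximizes the leave-one-out log-likelihood.

    The slide uses Newton's method; a grid search is used here for simplicity.
    """
    grid = grid or [10.0 * 2 ** i for i in range(12)]   # 10 .. ~20k
    return max(grid, key=lambda mu: loo_loglik(mu, docs, p_C))
```

The key point is that μ is fit per collection from the documents alone, with no relevance judgments needed.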
Estimating λ Using a Mixture Model
[diagram: the query is assumed generated, for each top-ranked document di (i = 1, …, N), by the two-component mixture (1 − λ)·p(w|di) + λ·p(w|U); λ is fit by maximum likelihood with the Expectation-Maximization (EM) algorithm]
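The EM procedure for the stage-2 mixture weight can be sketched as follows; the doc_models/p_U interfaces are assumptions, and each iteration alternates a closed-form E-step and M-step:

```python
def estimate_lambda(query_counts, doc_models, p_U, n_iter=50):
    """EM estimate of the stage-2 noise weight lambda.

    Each query word w is assumed drawn from the two-component mixture
    (1-lam)*p(w|d_i) + lam*p(w|U) for each top-ranked document d_i.
    doc_models: list of callables w -> p(w|d_i), already stage-1 smoothed;
    the hidden variable is whether w came from the user background p(w|U).
    """
    lam = 0.5
    for _ in range(n_iter):
        num = den = 0.0
        for p_d in doc_models:
            for w, c in query_counts.items():
                pu = lam * p_U(w)
                pd = (1.0 - lam) * p_d(w)
                z = pu / (pu + pd)      # E-step: P(background | w, d_i)
                num += c * z            # M-step accumulators
                den += c
        lam = num / den
    return lam
```

Together with the leave-one-out estimate of μ, this gives the fully automatic parameter setting behind the "automatic 2-stage" results on the next slide.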
Average Precision: Automatic Two-stage vs. Optimal Single-stage
[table: automatic 2-stage results vs. optimal 1-stage results, over 3 databases × 4 query types, 150 topics]
Where Are We?
Risk Minimization Framework → Two-stage Language Model · KL-divergence Retrieval Model (up next) · Aspect Retrieval Model
KL-divergence Retrieval Models
[diagram: the user U and query q yield a query model; the source S and document d yield a document model; the loss function gives a risk ranking formula based on the KL divergence between the two models]
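The KL-divergence ranking criterion can be sketched as follows. Since the query-model entropy is constant across documents, ranking by −D(θQ‖θD) reduces to ranking by cross entropy; the function interfaces are assumptions:

```python
import math

def kl_score(query_model, doc_model, vocab):
    """Rank documents by negative KL divergence -D(theta_Q || theta_D).

    Rank-equivalently, this returns the cross entropy
        sum_w p(w|theta_Q) * log p(w|theta_D).
    doc_model must be smoothed so it is nonzero on every query word;
    both models are callables w -> probability.
    """
    return sum(query_model(w) * math.log(doc_model(w))
               for w in vocab if query_model(w) > 0)
```

With a query model concentrated on the literal query terms this reduces to query likelihood; richer query models are what make feedback natural in this family, as the next slides show.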
Expansion-based vs. Model-based Feedback
• Expansion-based feedback: the feedback docs modify the query Q itself; each document D is scored by query likelihood under its doc model
• Model-based feedback: the feedback docs modify the query model; each document D is scored by the KL-divergence between the query model and its doc model
Feedback as Model Interpolation
θQ′ = (1 − α)·θQ + α·θF
• α = 0: no feedback; α = 1: full feedback
• θF is estimated from the feedback docs F = {d1, d2, …, dn} by a generative model or by divergence minimization
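The interpolation itself is a one-liner; a sketch with dict-based models (names are illustrative):

```python
def interpolate_feedback(theta_Q, theta_F, alpha):
    """Feedback as model interpolation: theta_Q' = (1-alpha)*theta_Q + alpha*theta_F.

    alpha = 0 leaves the original query model untouched (no feedback);
    alpha = 1 replaces it entirely with the feedback model theta_F.
    Both models are {word: probability} dicts over possibly different vocabularies.
    """
    vocab = set(theta_Q) | set(theta_F)
    return {w: (1 - alpha) * theta_Q.get(w, 0.0) + alpha * theta_F.get(w, 0.0)
            for w in vocab}
```

Because both inputs sum to one, the interpolated model is again a proper distribution for any α in [0, 1].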
θF Estimation Method I: Generative Mixture Model
[diagram: each word w in F = {d1, …, dn} is drawn either from the background model P(w|C), with probability λ, or from the topic model P(w|θF), with probability 1 − λ; θF is fit by maximum likelihood]
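Method I can be sketched with EM. This is a minimal version assuming pooled word counts over F and a fixed background weight λ; the interface is an assumption:

```python
def estimate_feedback_model(fb_counts, p_C, lam=0.5, n_iter=50):
    """EM for the generative mixture feedback model.

    Each word occurrence in the feedback documents F is generated either by
    the background model p(w|C) (probability lam) or by the unknown topic
    model p(w|theta_F) (probability 1-lam); theta_F is fit by maximum
    likelihood.  fb_counts: pooled {word: count} over F; returns p(w|theta_F).
    """
    words = list(fb_counts)
    total = sum(fb_counts.values())
    theta = {w: fb_counts[w] / total for w in words}   # init from raw counts
    for _ in range(n_iter):
        # E-step: probability each occurrence of w came from the topic model
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * p_C(w)) for w in words}
        # M-step: re-estimate the topic model from topic-attributed counts
        norm = sum(fb_counts[w] * z[w] for w in words)
        theta = {w: fb_counts[w] * z[w] / norm for w in words}
    return theta
```

The background component soaks up common words, so the learned θF concentrates on the discriminative topical vocabulary of the feedback set.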
θF Estimation Method II: Empirical Divergence Minimization
[diagram: choose θF by divergence minimization over the empirical divergence — close to the feedback documents d1, …, dn, far (weighted by λ) from the collection background model C]
Example of Feedback Query Model
TREC topic 412: “airport security”; mixture model approach, Web database, top 10 docs
[table: learned feedback query models for mixture weight 0.9 vs. 0.7]
Where Are We?
Risk Minimization Framework → Two-stage Language Model · KL-divergence Retrieval Model · Aspect Retrieval Model (up next)
Aspect Retrieval
Query: What are the applications of robotics in the world today? Find as many DIFFERENT applications as possible.

Aspect judgments:
       A1 A2 A3 …  Ak
  d1    1  1  0  0 …  0  0
  d2    0  1  1  1 …  0  0
  d3    0  0  0  0 …  1  0
  …
  dk    1  0  1  0 …  0  1

Example aspects: A1: spot-welding robotics; A2: controlling inventory; A3: pipe-laying robots; A4: talking robot; A5: robots for loading & unloading memory tapes; A6: robot [telephone] operators; A7: robot cranes; …
Evaluation Measures
• Aspect Coverage (AC): measures per-doc coverage
  - #distinct-aspects / #docs
  - equivalent to the “set cover” problem, NP-hard
• Aspect Uniqueness (AU): measures redundancy
  - #distinct-aspects / #aspects
  - equivalent to the “volume cover” problem, NP-hard
• Example (aspect vectors of a ranking d1, d2, d3, with accumulated counts):
    d1: 0 0 0 1 0 0 1    d2: 0 1 0 1 1 0 0    d3: 1 0 0 0 1 0 1
    #docs:      1    2    3
    #aspects:   2    5    8
    #uniq-asp:  2    4    5
    AC: 2/1 = 2.0   4/2 = 2.0   5/3 = 1.67
    AU: 2/2 = 1.0   4/5 = 0.8   5/8 = 0.625
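Both measures can be computed directly from the judgment rows; a sketch (the binary-vector encoding is an assumption):

```python
def aspect_measures(ranked_aspect_vectors):
    """Accumulated Aspect Coverage (AC) and Aspect Uniqueness (AU) per rank.

    ranked_aspect_vectors: per-document binary aspect-judgment rows, in rank
    order.  At rank k:
        AC = #distinct aspects covered so far / k        (per-doc coverage)
        AU = #distinct aspects / #aspect occurrences     (non-redundancy)
    """
    seen = set()
    n_occurrences = 0
    ac, au = [], []
    for k, row in enumerate(ranked_aspect_vectors, start=1):
        hits = {i for i, bit in enumerate(row) if bit}
        n_occurrences += len(hits)
        seen |= hits
        ac.append(len(seen) / k)
        au.append(len(seen) / n_occurrences if n_occurrences else 0.0)
    return ac, au
```

On the example ranking above this reproduces AC = 2.0, 2.0, 1.67 and AU = 1.0, 0.8, 0.625.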
Maximal Marginal Relevance (MMR) vs. Maximal Diverse Relevance (MDR)
Choosing the next document dk+1, given the known d1 … dk:
• MMR: the best dk+1 is novel & relevant — combines relevance Rel(θk+1) with novelty/redundancy Nov(θk+1 | θ1 … θk)
• MDR: the best dk+1 is complementary in coverage — a loss function L(θk+1 | θ1 … θk) over aspect coverage distributions p(a|θi)
Maximal Marginal Relevance (MMR) Models
• Maximizing aspect coverage indirectly, through redundancy elimination
• Elements
  - Redundancy/novelty measure
  - Combination of novelty and relevance
• Proposed & studied six novelty measures
• Proposed & studied four combination strategies
A Mixture Model for Redundancy
[diagram: a new document is generated by mixing the reference document model P(w|Old), with weight λ, and the collection background model P(w|Background), with weight 1 − λ; the redundancy weight λ is estimated by maximum likelihood with Expectation-Maximization]
Cost-based Combination of Relevance and Novelty
[formula: the relevance score and the novelty score are combined through a cost function]
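The slide does not spell out the cost formula, so as an illustration here is the classic MMR-style greedy selection with a linear relevance/novelty trade-off (Carbonell & Goldstein's form); β, the interfaces, and the greedy loop are assumptions, not the thesis's exact cost-based combination:

```python
def mmr_select(candidates, relevance, novelty, beta=0.5, k=5):
    """Greedy MMR-style ranking combining relevance and novelty.

    score(d) = beta * relevance(d) + (1 - beta) * novelty(d, selected);
    the best remaining document is picked at each step.
    relevance: d -> float; novelty: (d, selected_list) -> float.
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: beta * relevance(d)
                                       + (1 - beta) * novelty(d, selected))
        selected.append(best)
        pool.remove(best)
    return selected
```

The novelty term is what breaks ties against documents that merely repeat aspects already covered by earlier picks.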
Maximal Diverse Relevance (MDR) Models
• Maximizing aspect coverage directly, through aspect modeling
• Elements
  - Aspect loss function
  - Generative aspect model
• Proposed & studied the KL-divergence aspect loss function
• Explored two aspect models (PLSI, LDA)
Aspect Generative Model of Document & Query
[diagram: the source S generates document d and the user U generates query q through an aspect model with parameters θ = (θ1, …, θk); instantiated with PLSI or with LDA]
Aspect Loss Function
[diagram: the loss compares the aspect coverage the user U wants (from q) against the coverage offered by the document d from source S]
Aspect Loss Function: Illustration
[diagram: the desired coverage p(a|Q) is compared with the combined coverage of the “already covered” distributions p(a|θ1) … p(a|θk−1) plus a new candidate p(a|θk); candidates range from perfect (fills the gaps) to redundant or non-relevant]
Preliminary Evaluation: MMR vs. MDR
• On the relevant data set, both MMR and MDR are effective, but they complement each other
  - MMR improves AU more than AC
  - MDR improves AC more than AU
• On the mixed data set, however,
  - MMR is only effective when relevance ranking is accurate
  - MDR improves AC, even though relevance ranking is degraded
Further Work Is Needed
• Controlled experiments with synthetic data
  - Level of redundancy
  - Density of relevant documents
  - Per-document aspect counts
• Alternative loss functions
• Aspect language models, especially along the lines of LDA
• Aspect-based feedback
Summary of Contributions
New TR Framework — Risk Minimization Framework:
• Unifies existing models
• Incorporates LMs
• Serves as a map for exploring new models
New TR Models — specific contributions:
• Two-stage Language Model: empirical study of smoothing (dual role of smoothing); new smoothing method (two-stage smoothing); automatic parameter setting (leave-one-out, mixture)
• KL-divergence Retrieval Model: query/document distillation; feedback with LMs (mixture model & divergence minimization)
• Aspect Retrieval Model: evaluation criteria (AC, AU); redundancy/novelty measures (mixture weight); MMR with LMs (cost combination); aspect-based loss function (“collective KL-div”)
Future Research Directions
• Better approximation of the risk integral
• More effective LMs for “traditional” retrieval — can we beat TF-IDF without increasing computational complexity?
• Automatic parameter setting, especially for feedback models
• Flexible passage retrieval, especially with HMMs
• Beyond unigrams (more linguistics)
More Future Research Directions
• Aspect retrieval models
  - Document structure/sub-topic modeling
  - Aspect-based feedback
• Interactive information retrieval models
• Risk minimization for information filtering
• Personalized & context-sensitive retrieval