Enhancing Information Retrieval with Model-Based Feedback Approach

Model-based Feedback in the Language Modeling Approach to Information Retrieval
Chengxiang Zhai and John Lafferty
School of Computer Science, Carnegie Mellon University

Outline
• The Language Modeling Approach to IR
• Feedback: Expansion-based vs. Model-based
• Two Model-based Feedback Algorithms
• Evaluation
• Conclusions & Future Work

Text Retrieval (TR)
• Given a query, find relevant documents in a document collection (ranking documents)
• Many applications (Web pages, news, email, …)
• Many models developed (vector space, probabilistic)
• The “language modeling approach” is a new and promising model …

Retrieval as Language Model Estimation
• Document ranking based on query likelihood (Ponte & Croft 98, Miller et al. 99, Berger & Lafferty 99, Hiemstra 2000, etc.)
• Retrieval problem ≈ estimation of the document language model p(wi|d)
• Many advantages: good statistical foundation, reuse of existing LM methods, ...
• But feedback is awkward …
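A minimal sketch of query-likelihood ranking, assuming Jelinek-Mercer smoothing against the collection model (the slide does not name a smoothing method). The toy documents and the names `doc_model` and `query_log_likelihood` are illustrative and not taken from the talk.

```python
import math
from collections import Counter

def doc_model(doc_tokens, coll_counts, coll_len, lam=0.5):
    """Return p(w|d) as a function. Jelinek-Mercer smoothing is an assumption."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)

    def p(w):
        p_ml = counts[w] / n if n else 0.0      # maximum-likelihood estimate
        p_c = coll_counts[w] / coll_len         # collection (background) model
        return (1 - lam) * p_ml + lam * p_c

    return p

def query_log_likelihood(query_tokens, p_w_given_d):
    """log p(q|d) = sum_i log p(w_i|d) under a unigram model."""
    return sum(math.log(p_w_given_d(w)) for w in query_tokens)

# Toy collection (hypothetical data).
docs = {
    "d1": "airport security screening at the airport".split(),
    "d2": "the stock market fell".split(),
}
coll = Counter(w for toks in docs.values() for w in toks)
coll_len = sum(coll.values())

query = "airport security".split()
scores = {
    name: query_log_likelihood(query, doc_model(toks, coll, coll_len))
    for name, toks in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Smoothing keeps p(w|d) above zero for query words a document lacks, so the log is defined as long as every query word occurs somewhere in the collection.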

Feedback in Text Retrieval
• Learning from examples
• In effect, new, related terms are extracted to enhance the original query
• Generally leads to a performance increase (both average precision and recall)

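Relevance Feedback
[Diagram: the user sends a query to the retrieval engine, which ranks the document collection (results: d1 3.5, d2 2.4, …, dk 0.5, ...). The user judges the results (d1 +, d2 -, d3 +, …, dk -, ...), and the feedback step turns these judgments into an updated query.]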

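Pseudo/Blind/Automatic Feedback
[Diagram: the same loop without a user. The retrieval engine ranks the document collection (results: d1 3.5, d2 2.4, …, dk 0.5, ...), the top 10 results are assumed relevant (judgments: d1 +, d2 +, d3 +, …, dk -, ...), and the feedback step produces an updated query.]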
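The pseudo-feedback loop in the diagram can be sketched as follows. This is only an illustration under stated assumptions: the scorer is a toy term-count function, the update is simple term expansion, and the names `score`, `retrieve`, and `pseudo_feedback` are hypothetical.

```python
from collections import Counter

def score(query_terms, doc_tokens):
    """Toy scorer (an assumption): number of query-term occurrences in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def retrieve(query_terms, docs):
    """Rank document ids by descending toy score (ties keep insertion order)."""
    return sorted(docs, key=lambda d: score(query_terms, docs[d]), reverse=True)

def pseudo_feedback(query_terms, docs, k=10, n_new_terms=5):
    """Assume the top-k results are relevant and add their most frequent new terms."""
    top = retrieve(query_terms, docs)[:k]
    pooled = Counter(w for d in top for w in docs[d] if w not in query_terms)
    new_terms = [w for w, _ in pooled.most_common(n_new_terms)]
    return list(query_terms) + new_terms

# Hypothetical toy collection.
docs = {
    "d1": "airport security screening screening passengers".split(),
    "d2": "airport security screening baggage".split(),
    "d3": "stock market prices".split(),
}
updated = pseudo_feedback(["airport", "security"], docs, k=2, n_new_terms=1)
second_pass = retrieve(updated, docs)
```

No user judgments are involved: the "+" labels come from the first-pass ranking, which is why noisy top results can hurt.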

Feedback in the Language Modeling Approach
• Mostly expansion-based: adding new terms to the query (Ponte 1998, Miller et al. 1999, Ng 1999)
• Query term reweighting, no expansion (Hiemstra 2001)
• Implicit feedback (Berger & Lafferty 99)
• Conceptual inconsistency in expansion-based approaches
  • Original query: as text
  • Expanded query: as text + {terms}

Question: How to exploit language modeling to perform natural and effective feedback?
Answer: Introduce a query model and treat feedback as query model updating
• Retrieval function: Query-likelihood => KL-Divergence
• Feedback: Expansion-based => Model-based

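A KL-Divergence Unigram Retrieval Model
• A special case of the general risk minimization retrieval framework (Lafferty & Zhai 2001)
• Retrieval formula: $-D(\hat\theta_Q \,\|\, \hat\theta_D) = \sum_w p(w|\hat\theta_Q) \log p(w|\hat\theta_D) + H(\hat\theta_Q)$, where $H(\hat\theta_Q) = -\sum_w p(w|\hat\theta_Q) \log p(w|\hat\theta_Q)$ is the query entropy (ignored for ranking)
• Retrieval ≈ estimation of $\theta_Q$ and $\theta_D$
• Special case: $\hat\theta_Q$ = empirical distribution of q recovers “query-likelihood”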
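A small sketch of the KL-divergence score, written to check the special case numerically: with the empirical query distribution, the rank-relevant part of the score equals the query log-likelihood divided by the query length. The document model values and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def cross_entropy_score(query_model, doc_model):
    """Rank-relevant part of -D(theta_Q || theta_D): sum_w p(w|Q) log p(w|D).
    The query entropy term is dropped because it is the same for every document."""
    return sum(pq * math.log(doc_model[w]) for w, pq in query_model.items() if pq > 0)

def empirical_model(tokens):
    """Maximum-likelihood (empirical) distribution over the tokens."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in counts.items()}

query = ["airport", "security", "airport"]
theta_q = empirical_model(query)

# Hypothetical smoothed document model (values chosen for illustration).
theta_d = {"airport": 0.3, "security": 0.2, "the": 0.5}

kl_part = cross_entropy_score(theta_q, theta_d)
log_likelihood = sum(math.log(theta_d[w]) for w in query)
```

Since `kl_part * len(query)` equals `log_likelihood`, both scores order documents identically for a fixed query.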

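Expansion-based vs. Model-based
[Diagram, two panels. Expansion-based feedback: the feedback docs modify the query Q itself; document D is mapped to a doc model and results are scored by query likelihood. Model-based feedback: query Q is mapped to a query model and the feedback docs modify that query model; document D is mapped to a doc model and results are scored by KL-divergence.]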

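Feedback as Model Interpolation
• Document D → $\hat\theta_D$ (ML + smoothing); query Q → $\hat\theta_Q$ (ML); feedback docs F = {d1, d2, …, dn} → $\hat\theta_F$
• Updated query model: $\hat\theta_{Q'} = (1-\alpha)\,\hat\theta_Q + \alpha\,\hat\theta_F$
• α = 0: no feedback; α = 1: full feedback
• $\hat\theta_F$ is estimated with a generative model or by divergence minimization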
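The interpolation step is small enough to sketch directly. The function name `interpolate` and the toy models are assumptions; only the formula (1 - α)·θ_Q + α·θ_F comes from the slide.

```python
def interpolate(theta_q, theta_f, alpha):
    """Updated query model: (1 - alpha) * theta_Q + alpha * theta_F.
    Words missing from one model are treated as having probability 0 there."""
    vocab = set(theta_q) | set(theta_f)
    return {
        w: (1 - alpha) * theta_q.get(w, 0.0) + alpha * theta_f.get(w, 0.0)
        for w in vocab
    }

# Hypothetical models for the query "airport security".
theta_q = {"airport": 0.5, "security": 0.5}
theta_f = {"airport": 0.4, "security": 0.3, "screening": 0.2, "passenger": 0.1}

no_feedback = interpolate(theta_q, theta_f, alpha=0.0)    # original query model
half = interpolate(theta_q, theta_f, alpha=0.5)
full_feedback = interpolate(theta_q, theta_f, alpha=1.0)  # feedback model only
```

Because both inputs are probability distributions, the result is one as well for any α in [0, 1].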

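Estimation Method I: Generative Mixture Model
• Each word w in F = {d1, …, dn} is drawn from one of two sources: background words from P(w|C) with P(source) = λ, or topic words from P(w|θ) with P(source) = 1 - λ
• Log-likelihood: $\log p(F|\theta) = \sum_{i=1}^{n} \sum_w c(w; d_i) \log\big((1-\lambda)\, p(w|\theta) + \lambda\, p(w|C)\big)$
• Maximum likelihood: use EM to find $\hat\theta_F$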
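A hedged sketch of the EM updates for this two-component mixture, with λ held fixed. The function name, the initialization at the maximum-likelihood estimate, the fixed iteration count, and the toy data are all assumptions; the collection model must give every feedback word a nonzero probability.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_coll, lam=0.5, iters=50):
    """EM for theta_F in the mixture (1 - lam) * p(w|theta) + lam * p(w|C)."""
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start from the ML estimate

    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the topic model.
        t = {
            w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * p_coll[w])
            for w in counts
        }
        # M-step: re-estimate theta from the expected topic counts.
        norm = sum(counts[w] * t[w] for w in counts)
        theta = {w: counts[w] * t[w] / norm for w in counts}
    return theta

# Hypothetical feedback documents and collection model.
feedback_docs = [
    ["the", "airport", "security", "the"],
    ["the", "security", "screening", "the"],
]
p_coll = {"the": 0.7, "airport": 0.1, "security": 0.1, "screening": 0.1}

theta_f = estimate_feedback_model(feedback_docs, p_coll, lam=0.5)
```

In this toy run the common word "the" starts at 0.5 under the maximum-likelihood estimate and loses mass to the topical words, because the background model already explains it well.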

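Estimation Method II: Empirical Divergence Minimization
• $\theta_F$ should be close to the feedback docs d1, …, dn and far (weighted by λ) from the collection C
• Empirical divergence: $D_e(\theta; F, C) = \frac{1}{|F|} \sum_{i=1}^{n} D(\theta \,\|\, \hat\theta_{d_i}) - \lambda\, D(\theta \,\|\, p(\cdot|C))$
• Divergence minimization: $\hat\theta_F = \arg\min_\theta D_e(\theta; F, C)$
• Given F, C, λ, the solution is $p(w|\hat\theta_F) \propto \exp\Big(\frac{1}{1-\lambda} \cdot \frac{1}{|F|} \sum_{i=1}^{n} \log p(w|\hat\theta_{d_i}) - \frac{\lambda}{1-\lambda} \log p(w|C)\Big)$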
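A minimal sketch of the closed-form divergence-minimization estimate, assuming λ in [0, 1) and smoothed document models that assign nonzero probability to every vocabulary word. The function name and the toy models are illustrative.

```python
import math

def divergence_min_model(doc_models, p_coll, lam=0.3):
    """p(w|theta_F) proportional to
    exp( (mean_i log p(w|theta_di) - lam * log p(w|C)) / (1 - lam) )."""
    n = len(doc_models)
    log_unnorm = {
        w: (sum(math.log(dm[w]) for dm in doc_models) / n
            - lam * math.log(p_coll[w])) / (1 - lam)
        for w in p_coll
    }
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_unnorm.values())
    z = sum(math.exp(v - m) for v in log_unnorm.values())
    return {w: math.exp(v - m) / z for w, v in log_unnorm.items()}

# Hypothetical smoothed feedback document models and collection model.
doc_models = [
    {"the": 0.5, "airport": 0.3, "security": 0.2},
    {"the": 0.6, "airport": 0.1, "security": 0.3},
]
p_coll = {"the": 0.8, "airport": 0.1, "security": 0.1}

theta_plain = divergence_min_model(doc_models, p_coll, lam=0.0)  # geometric mean only
theta_f = divergence_min_model(doc_models, p_coll, lam=0.3)
```

With λ = 0 the estimate is the normalized geometric mean of the document models; raising λ shifts mass away from words that are frequent in the collection.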

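Example of Feedback Query Model
• TREC topic 412: “airport security”
• Mixture model approach, Web database, top 10 docs
[Table: top terms of the estimated feedback query model, one column for λ=0.9 and one for λ=0.7; the term lists are not preserved in this transcript.]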

Model-based feedback vs. Simple LM

Sensitivity of Precision to α
[Plot: precision as α varies from α=0 (original query model) to α=1 (feedback model only). Divergence minimization is less sensitive; the mixture model is more sensitive.]

Sensitivity of Precision to λ (Mixture Model & Divergence Min., α=0.5)
[Plot: precision as λ varies, with a no-feedback baseline. The mixture model is less sensitive; divergence minimization is more sensitive. As λ grows, more common words are “ignored”.]
• Over-discrimination can be harmful

The Lemur Toolkit
• Language Modeling and Information Retrieval Toolkit
• Under development at CMU and UMass
• All experiments reported here were run using Lemur
• http://www.cs.cmu.edu/~lemur
• Contact us if you are interested in using it

Conclusions
• Model-based feedback is natural and effective
• Performance is sensitive to both α and λ
  • Mixture model: more sensitive to α, but less to λ (0.5)
  • Divergence min.: more sensitive to λ, but less to α (0.3)
• The sensitivity suggests more robust models are needed, e.g., use the query to focus the model
  • Markov chain query model (Lafferty & Zhai, 2001)
  • Relevance language model (Lavrenko & Croft, 2001)

Future Work
• Evaluating methods for relevance feedback
  • Examples in pseudo feedback can be quite noisy
  • Relevance feedback better reflects “learning ability”
• More robust feedback models, e.g.,
  • Query-focused feedback (e.g., query translation model)
  • Passage-based feedback (e.g., hidden Markov model)
