Cumulative Progress in Language Models for Information Retrieval Antti Puurula 6/12/2013 Australasian Language Technology Workshop University of Waikato
Ad-hoc Information Retrieval • Ad-hoc Information Retrieval (IR) forms the basic task in IR: • Given a query, retrieve and rank documents in a collection • Origins: • Cranfield 1 (1958-1960), Cranfield 2 (1962-1966), SMART (1961-1999) • Major evaluations: • TREC Ad-hoc (1990-1999), TREC Robust (2003-2005), CLEF (2000-2009), INEX (2009-2010), NTCIR (1999-2013), FIRE (2008-2013)
Illusionary Progress in Ad-hoc IR • TREC ad-hoc evaluations stopped in 1999, as progress plateaued • More diverse tasks became the foci of research • “There is little evidence of improvement in ad-hoc retrieval technology over the past decade” (Armstrong et al. 2009) • Weak baselines, non-cumulative improvements • ⟶“no way of using LSI achieves a worthwhile improvement in retrieval accuracy over BM25” (Atreya & Elkan, 2010) • ⟶ “there remains very little room for improvement in ad hoc search” (Trotman & Keeler, 2011)
Progress in Language Models for IR? • Language Models (LMs) form one of the main approaches to IR • Many improvements to LMs have not been generally adopted or systematically evaluated: • TF-IDF feature weighting • Pitman-Yor Process smoothing • Feedback models • Are these improvements consistent across standard datasets, are they cumulative, and do they improve on a strong baseline?
Query Likelihood Language Models • Query Likelihood (QL) (Kalt 1996, Hiemstra 1998, Ponte & Croft 1998) is the basic application of LMs for IR • Unigram case: using count vectors q and d to represent queries and documents, rank documents given a query according to P(d|q) ∝ P(q|d) P(d) • Assuming a generative model θ_d for each document, and uniform priors over documents, ranking reduces to the query likelihood P(q|θ_d)
Query Likelihood Language Models 2 • The unigram QL-score for each document becomes: P(q|θ_d) = (|q|! / ∏_w q_w!) ∏_w P(w|θ_d)^{q_w} • where |q|! / ∏_w q_w! is the Multinomial coefficient (constant across documents, so it can be dropped for ranking), and document models are given by the Maximum Likelihood estimates: P_ml(w|θ_d) = d_w / |d|
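The unigram QL score above can be sketched in a few lines. This is an illustrative implementation, not the toolkit used in the paper: documents and queries are plain token lists, the multinomial coefficient is dropped since it does not affect ranking, and a small uniform interpolation (hypothetical parameter `alpha`, `vocab_size`) stands in for the smoothing methods introduced later, so that unseen query words do not zero out the score.

```python
import math
from collections import Counter

def query_likelihood(query, doc, alpha=0.1, vocab_size=50000):
    """Unigram query log-likelihood score for one document.

    Sketch: log P(q | theta_d) up to the rank-invariant multinomial
    coefficient, with the ML estimate interpolated against a uniform
    background (alpha, vocab_size are illustrative assumptions).
    """
    d = Counter(doc)
    d_len = sum(d.values())
    score = 0.0
    for w, q_w in Counter(query).items():
        p_ml = d[w] / d_len                          # ML estimate d_w / |d|
        p = (1 - alpha) * p_ml + alpha / vocab_size  # avoid log(0) for unseen words
        score += q_w * math.log(p)                   # each query word contributes q_w * log P(w|theta_d)
    return score
```

Scores are negative log-probabilities, so higher (closer to zero) means a better match.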
Pitman-Yor Process Smoothing • Standard methods for smoothing in IR LMs are Dirichlet Prior (DP) and 2-Stage Smoothing (2SS) (Zhai & Lafferty 2004, Smucker & Allan 2007) • A recently suggested improvement is Pitman-Yor Process smoothing (PYP), an approximation to inference on a Pitman-Yor Process (Momtazi & Klakow 2010, Huang & Renals 2010) • All methods interpolate unsmoothed parameters with a background distribution; PYP additionally discounts the unsmoothed counts
Pitman-Yor Process Smoothing 2 • All methods share the form: P(w|θ_d) = (1 − λ_d) P′_ml(w|θ_d) + λ_d P(w|C) • DP: λ_d = μ / (|d| + μ), with P′_ml = P_ml • 2SS: λ_d = λ + (1 − λ) μ / (|d| + μ), with P′_ml = P_ml • PYP: λ_d = (μ + δ |d|_u) / (μ + |d|), and P′_ml(w|θ_d) = max(d_w − δ, 0) / (|d| − δ |d|_u), where δ is the discount parameter and |d|_u the number of unique words in d
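The three smoothing methods fit into one function, since they share the interpolated form. A minimal sketch, assuming a document represented as a `Counter` of word counts and a background distribution `bg` given as a dict; the default parameter values (`mu`, `lam`, `delta`) are illustrative, not the tuned values from the experiments:

```python
from collections import Counter

def smoothed_prob(w, doc_counts, bg, method="dp",
                  mu=2000.0, lam=0.7, delta=0.8):
    """P(w | theta_d) under DP, 2SS, or PYP smoothing.

    All methods interpolate document estimates with the background
    model P(w|C); PYP additionally discounts the raw counts.
    """
    d_w = doc_counts.get(w, 0)
    d_len = sum(doc_counts.values())
    p_c = bg.get(w, 0.0)
    if method == "dp":     # Dirichlet Prior
        return (d_w + mu * p_c) / (d_len + mu)
    if method == "2ss":    # 2-Stage Smoothing: query-level mix on top of DP
        return (1 - lam) * (d_w + mu * p_c) / (d_len + mu) + lam * p_c
    if method == "pyp":    # Pitman-Yor Process approximation
        uniq = len(doc_counts)            # unique words |d|_u
        disc = max(d_w - delta, 0.0)      # discounted count
        l_d = (mu + delta * uniq) / (mu + d_len)
        return disc / (mu + d_len) + l_d * p_c
    raise ValueError(method)
```

Each variant defines a proper distribution: summed over the vocabulary, the discounted mass removed by PYP is exactly returned through the larger interpolation weight λ_d.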
Pitman-Yor Process Smoothing 3 • The background model is most commonly estimated by concatenating all collection documents into a single document: P(w|C) = Σ_d d_w / Σ_d |d| • Less commonly, a uniform background model is used: P(w|C) = 1 / |V|, where |V| is the vocabulary size
TF-IDF Feature Weighting • Multinomial modelling assumptions of text can be corrected with TF-IDF weighting (Rennie et al. 2003, Frank & Bouckaert 2006) • Traditional view: IDF-weighting is unnecessary with IR LMs (Zhai & Lafferty 2004) • Recent view: the combination is complementary (Smucker & Allan 2007, Momtazi et al. 2010)
TF-IDF Feature Weighting 2 • Dataset documents can be weighted by TF-IDF: d̂_w = log(1 + d_w) · log(1 + N / N_w), where d is the unweighted count vector, N the number of documents, and N_w the number of documents where word w occurs • First factor is the TF log transform using unique length normalization (Singhal et al. 1996) • Second factor is the Robertson-Walker IDF (Robertson & Zaragoza 2009)
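As a sketch of the transform: the exact TF normalization and IDF variant named on the slide have several published formulations, so the code below uses the common smoothed log-TF times smoothed-IDF form, d̂_w = log(1 + d_w) · log(1 + N / N_w), as an assumption rather than the paper's exact formula:

```python
import math
from collections import Counter

def tfidf_weight(doc_counts, doc_freq, n_docs):
    """Re-weight a document count vector with log-TF * smoothed IDF.

    doc_counts: Counter of raw counts d_w.
    doc_freq:   dict word -> N_w (documents containing the word).
    n_docs:     N, total number of documents in the collection.
    """
    return {w: math.log(1 + c) * math.log(1 + n_docs / doc_freq[w])
            for w, c in doc_counts.items()}
```

The log transform damps repeated occurrences (correcting the multinomial burstiness assumption), while the IDF factor down-weights words that occur in most documents.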
TF-IDF Feature Weighting 3 • IDF has an overlapping function with collection smoothing (Hiemstra & Kraaij 1998) • The interaction is taken into account by replacing the collection model with a uniform model in smoothing: P(w|θ_d) = (1 − λ_d) P′_ml(w|θ̂_d) + λ_d / |V|, where θ̂_d is the TF-IDF weighted document model
Model-based Feedback • Pseudo-feedback is a traditional method in Ad-hoc IR: • Using the documents retrieved for the original query q, construct and rank with a new query q̂ • With LMs two different formalizations enable model-based feedback: • KL-Divergence Retrieval (Zhai & Lafferty 2001) • Relevance Models (Lavrenko & Croft 2001) • Both enable replacing the original query counts with a feedback model
Model-based Feedback 2 • Many modeling choices exist for the feedback models, such as: • Using the top K retrieved documents • Truncating the word vector to words present in the original query • Weighting the feedback documents by their retrieval scores • Interpolating the feedback model with the original query • These modeling choices are combined here
Model-based Feedback 3 • The interpolated query model is estimated for the query words w ∈ q from the top-K document models θ_k: P(w|θ_q) = (1 − α) P(w|q) + α (1/Z) Σ_{k=1}^{K} P(q|θ_k) P(w|θ_k) • where α is the interpolation weight and Z is the normalizer: Z = Σ_{k=1}^{K} P(q|θ_k)
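The combined feedback choices above can be sketched as follows. This is an assumed implementation, not the toolkit's: the top-ranked documents arrive as `Counter`s with their log retrieval scores, the scores are exponentiated (with a max-shift for numerical stability) to act as the mixture weights P(q|θ_k), the feedback model is truncated to the original query words, and `beta` is a hypothetical name for the interpolation weight α.

```python
import math
from collections import Counter

def feedback_query(query, top_docs, top_scores, beta=0.5):
    """Interpolate the original query model with a pseudo-feedback model.

    query:      token list for the original query q.
    top_docs:   list of Counters for the K top-ranked documents.
    top_scores: their log retrieval scores, used as mixture weights.
    """
    q = Counter(query)
    q_len = sum(q.values())
    shift = max(top_scores)                       # max-shift before exp for stability
    weights = [math.exp(s - shift) for s in top_scores]
    z = sum(weights)                              # normalizer Z over the top documents
    model = {}
    for w, q_w in q.items():                      # truncate to original query words
        fb = sum(wt * d.get(w, 0) / max(sum(d.values()), 1)
                 for wt, d in zip(weights, top_docs)) / z
        model[w] = (1 - beta) * q_w / q_len + beta * fb
    return model
```

A query word well supported by the feedback documents ends up with a larger weight than one that only occurs in the query itself.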
Experimental Setup • Ad-hoc IR experiments conducted on 13 standard datasets • TREC 1-5, split according to data source • OHSU-TREC • FIRE 2008-2011 English • Preprocessing: stopword and short-word removal, Porter stemming • Each dataset split into development and evaluation subsets
Experimental Setup 2 • Software used for experiments was the SGMWeka 1.44 toolkit: • http://sourceforge.net/projects/sgmweka/ • Smoothing parameters optimized on the development sets using Gaussian Random Searches (Luke 2009) • Evaluation performed on the evaluation sets, using Mean Average Precision over the top 50 documents (MAP@50) • Significance tested with paired one-tailed t-tests between the datasets
Results • Significant differences: • PYP > DP • PYP+TI > 2SS • PYP+TI+FB > PYP+TI • PYP+TI+FB improves on 2SS by 4.07 MAP@50 absolute, a 17.1% relative improvement
Discussion • The 3 evaluated improvements in language models for IR: • require little additional computation • can be implemented with small modifications to existing IR systems • are substantial, significant and cumulative across 13 standard datasets, compared to DP and 2SS baselines (4.07 MAP@50 absolute, 17.1% relative) • Improvements requiring more computation are also possible: • document neighbourhood smoothing, word correlation models, passage-based LMs, bigram LMs, … • More extensive evaluations are needed to confirm progress