Cumulative Progress in Language Models for Information Retrieval Antti Puurula 6/12/2013 Australasian Language Technology Workshop University of Waikato
Ad-hoc Information Retrieval • Ad-hoc Information Retrieval (IR) forms the basic task in IR: • Given a query, retrieve and rank documents in a collection • Origins: • Cranfield 1 (1958-1960), Cranfield 2 (1962-1966), SMART (1961-1999) • Major evaluations: • TREC Ad-hoc (1990-1999), TREC Robust (2003-2005), CLEF (2000-2009), INEX (2009-2010), NTCIR (1999-2013), FIRE (2008-2013)
Illusionary Progress in Ad-hoc IR • TREC ad-hoc evaluations stopped in 1999, as progress plateaued • More diverse tasks became the foci of research • “There is little evidence of improvement in ad-hoc retrieval technology over the past decade” (Armstrong et al. 2009) • Weak baselines, non-cumulative improvements • ⟶“no way of using LSI achieves a worthwhile improvement in retrieval accuracy over BM25” (Atreya & Elkan, 2010) • ⟶ “there remains very little room for improvement in ad hoc search” (Trotman & Keeler, 2011)
Progress in Language Models for IR? • Language Models (LMs) form one of the main approaches to IR • Many improvements to LMs have not been generally adopted or systematically evaluated: • TF-IDF feature weighting • Pitman-Yor Process smoothing • Feedback models • Are these improvements consistent across standard datasets, are they cumulative, and do they improve on a strong baseline?
Query Likelihood Language Models • Query Likelihood (QL) (Kalt 1996, Hiemstra 1998, Ponte & Croft 1998) is the basic application of LMs for IR • Unigram case: using count vectors q and d to represent queries and documents, rank documents given a query according to P(d|q) ∝ P(q|d) P(d) • Assuming a generative model θ_d for each document, and uniform priors over documents, ranking reduces to the query likelihood P(q|θ_d)
Query Likelihood Language Models 2 • The unigram QL-score for each document becomes: P(q|θ_d) = (|q|! / ∏_w q_w!) ∏_w P(w|θ_d)^{q_w} • where |q|! / ∏_w q_w! is the Multinomial coefficient (constant across documents, so it can be dropped for ranking), and document models are given by the Maximum Likelihood estimates: P_ml(w|θ_d) = d_w / |d|
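The unigram QL score above can be sketched in a few lines. This is an illustrative implementation, not the toolkit used in the paper: documents and queries are plain token lists, the multinomial coefficient is dropped since it does not affect ranking, and a small uniform interpolation (hypothetical parameter `alpha`, `vocab_size`) stands in for the smoothing methods introduced later, so that unseen query words do not zero out the score.

```python
import math
from collections import Counter

def query_likelihood(query, doc, alpha=0.1, vocab_size=50000):
    """Unigram query log-likelihood score for one document.

    Sketch: log P(q | theta_d) up to the rank-invariant multinomial
    coefficient, with the ML estimate interpolated against a uniform
    background (alpha, vocab_size are illustrative assumptions).
    """
    d = Counter(doc)
    d_len = sum(d.values())
    score = 0.0
    for w, q_w in Counter(query).items():
        p_ml = d[w] / d_len                          # ML estimate d_w / |d|
        p = (1 - alpha) * p_ml + alpha / vocab_size  # avoid log(0) for unseen words
        score += q_w * math.log(p)                   # each query word contributes q_w * log P(w|theta_d)
    return score
```

Scores are negative log-probabilities, so higher (closer to zero) means a better match.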
Pitman-Yor Process Smoothing • Standard methods for smoothing in IR LMs are Dirichlet Prior (DP) and 2-Stage Smoothing (2SS) (Zhai & Lafferty 2004, Smucker & Allan 2007) • A recently suggested improvement is Pitman-Yor Process smoothing (PYP), an approximation to inference on a Pitman-Yor Process (Momtazi & Klakow 2010, Huang & Renals 2010) • All methods interpolate unsmoothed parameters with a background distribution; PYP additionally discounts the unsmoothed counts
Pitman-Yor Process Smoothing 2 • All methods share the form: P(w|θ_d) = (1 − λ_d) P′_ml(w|θ_d) + λ_d P(w|C) • DP: λ_d = μ / (|d| + μ), with P′_ml = P_ml • 2SS: λ_d = λ + (1 − λ) μ / (|d| + μ), with P′_ml = P_ml • PYP: λ_d = (μ + δ |d|_u) / (μ + |d|), and P′_ml(w|θ_d) = max(d_w − δ, 0) / (|d| − δ |d|_u), where δ is the discount parameter and |d|_u the number of unique words in d
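The three smoothing methods fit into one function, since they share the interpolated form. A minimal sketch, assuming a document represented as a `Counter` of word counts and a background distribution `bg` given as a dict; the default parameter values (`mu`, `lam`, `delta`) are illustrative, not the tuned values from the experiments:

```python
from collections import Counter

def smoothed_prob(w, doc_counts, bg, method="dp",
                  mu=2000.0, lam=0.7, delta=0.8):
    """P(w | theta_d) under DP, 2SS, or PYP smoothing.

    All methods interpolate document estimates with the background
    model P(w|C); PYP additionally discounts the raw counts.
    """
    d_w = doc_counts.get(w, 0)
    d_len = sum(doc_counts.values())
    p_c = bg.get(w, 0.0)
    if method == "dp":     # Dirichlet Prior
        return (d_w + mu * p_c) / (d_len + mu)
    if method == "2ss":    # 2-Stage Smoothing: query-level mix on top of DP
        return (1 - lam) * (d_w + mu * p_c) / (d_len + mu) + lam * p_c
    if method == "pyp":    # Pitman-Yor Process approximation
        uniq = len(doc_counts)            # unique words |d|_u
        disc = max(d_w - delta, 0.0)      # discounted count
        l_d = (mu + delta * uniq) / (mu + d_len)
        return disc / (mu + d_len) + l_d * p_c
    raise ValueError(method)
```

Each variant defines a proper distribution: summed over the vocabulary, the discounted mass removed by PYP is exactly returned through the larger interpolation weight λ_d.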
Pitman-Yor Process Smoothing 3 • The background model is most commonly estimated by concatenating all collection documents into a single document: P(w|C) = Σ_d d_w / Σ_d |d| • Less commonly, a uniform background model is used: P(w|C) = 1 / |V|, where |V| is the vocabulary size
TF-IDF Feature Weighting • Multinomial modelling assumptions of text can be corrected with TF-IDF weighting (Rennie et al. 2003, Frank & Bouckaert 2006) • Traditional view: IDF-weighting is unnecessary with IR LMs (Zhai & Lafferty 2004) • Recent view: the combination is complementary (Smucker & Allan 2007, Momtazi et al. 2010)
TF-IDF Feature Weighting 2 • Dataset documents can be weighted by TF-IDF: d̂_w = log(1 + d_w) · log(1 + N / N_w), where d is the unweighted count vector, N the number of documents, and N_w the number of documents where word w occurs • First factor is the TF log transform using unique length normalization (Singhal et al. 1996) • Second factor is the Robertson-Walker IDF (Robertson & Zaragoza 2009)
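As a sketch of the transform: the exact TF normalization and IDF variant named on the slide have several published formulations, so the code below uses the common smoothed log-TF times smoothed-IDF form, d̂_w = log(1 + d_w) · log(1 + N / N_w), as an assumption rather than the paper's exact formula:

```python
import math
from collections import Counter

def tfidf_weight(doc_counts, doc_freq, n_docs):
    """Re-weight a document count vector with log-TF * smoothed IDF.

    doc_counts: Counter of raw counts d_w.
    doc_freq:   dict word -> N_w (documents containing the word).
    n_docs:     N, total number of documents in the collection.
    """
    return {w: math.log(1 + c) * math.log(1 + n_docs / doc_freq[w])
            for w, c in doc_counts.items()}
```

The log transform damps repeated occurrences (correcting the multinomial burstiness assumption), while the IDF factor down-weights words that occur in most documents.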
TF-IDF Feature Weighting 3 • IDF has an overlapping function with collection smoothing (Hiemstra & Kraaij 1998) • The interaction is taken into account by replacing the collection model with a uniform model in smoothing: P(w|θ_d) = (1 − λ_d) P′_ml(w|θ̂_d) + λ_d / |V|, where θ̂_d is the TF-IDF weighted document model
Model-based Feedback • Pseudo-feedback is a traditional method in Ad-hoc IR: • Using the documents retrieved for the original query q, construct and rank with a new query q̂ • With LMs two different formalizations enable model-based feedback: • KL-Divergence Retrieval (Zhai & Lafferty 2001) • Relevance Models (Lavrenko & Croft 2001) • Both enable replacing the original query counts with a feedback model
Model-based Feedback 2 • Many modeling choices exist for the feedback models, such as: • Using the top K retrieved documents • Truncating the word vector to words present in the original query • Weighting the feedback documents by their retrieval scores • Interpolating the feedback model with the original query • These modeling choices are combined here
Model-based Feedback 3 • The interpolated query model is estimated for the query words w ∈ q from the top-K document models θ_k: P(w|θ_q) = (1 − α) P(w|q) + α (1/Z) Σ_{k=1}^{K} P(q|θ_k) P(w|θ_k) • where α is the interpolation weight and Z is the normalizer: Z = Σ_{k=1}^{K} P(q|θ_k)
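The combined feedback choices above can be sketched as follows. This is an assumed implementation, not the toolkit's: the top-ranked documents arrive as `Counter`s with their log retrieval scores, the scores are exponentiated (with a max-shift for numerical stability) to act as the mixture weights P(q|θ_k), the feedback model is truncated to the original query words, and `beta` is a hypothetical name for the interpolation weight α.

```python
import math
from collections import Counter

def feedback_query(query, top_docs, top_scores, beta=0.5):
    """Interpolate the original query model with a pseudo-feedback model.

    query:      token list for the original query q.
    top_docs:   list of Counters for the K top-ranked documents.
    top_scores: their log retrieval scores, used as mixture weights.
    """
    q = Counter(query)
    q_len = sum(q.values())
    shift = max(top_scores)                       # max-shift before exp for stability
    weights = [math.exp(s - shift) for s in top_scores]
    z = sum(weights)                              # normalizer Z over the top documents
    model = {}
    for w, q_w in q.items():                      # truncate to original query words
        fb = sum(wt * d.get(w, 0) / max(sum(d.values()), 1)
                 for wt, d in zip(weights, top_docs)) / z
        model[w] = (1 - beta) * q_w / q_len + beta * fb
    return model
```

A query word well supported by the feedback documents ends up with a larger weight than one that only occurs in the query itself.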
Experimental Setup • Ad-hoc IR experiments conducted on 13 standard datasets • TREC 1-5, split according to data source • OHSU-TREC • FIRE 2008-2011 English • Preprocessing: stopword and short-word removal, Porter stemming • Each dataset split into development and evaluation subsets
Experimental Setup 2 • Software used for experiments was the SGMWeka 1.44 toolkit: • http://sourceforge.net/projects/sgmweka/ • Smoothing parameters optimized on the development sets using Gaussian Random Searches (Luke 2009) • Evaluation performed on the evaluation sets, using Mean Average Precision over the top 50 documents (MAP@50) • Significance tested with paired one-tailed t-tests between the datasets
Results • Significant differences: • PYP > DP • PYP+TI > 2SS • PYP+TI+FB > PYP+TI • PYP+TI+FB improves on 2SS by 4.07 MAP@50 absolute, a 17.1% relative improvement
Discussion • The 3 evaluated improvements in language models for IR: • require little additional computation • can be implemented with small modifications to existing IR systems • are substantial, significant and cumulative across 13 standard datasets, compared to DP and 2SS baselines (4.07 MAP@50 absolute, 17.1% relative) • Improvements requiring more computation are also possible: • document neighbourhood smoothing, word correlation models, passage-based LMs, bigram LMs, … • More extensive evaluations are needed to confirm progress