A Study of Poisson Query Generation Model for Information Retrieval Qiaozhu Mei, Hui Fang, and ChengXiang Zhai University of Illinois at Urbana-Champaign
Outline • Background of query generation in IR • Query generation with Poisson language model • Smoothing in Poisson query generation model • Poisson vs. multinomial in query generation IR • Analytical comparison • Empirical experiments • Summary
Query Generation IR Model [Ponte & Croft 98] [Diagram: each document d1, d2, …, dN has a document language model; query q is scored by its likelihood under each model] • Scoring documents with query likelihood • Known as the language modeling (LM) approach to IR • Different from document generation
Interpretation of LM • θd: a model for queries posed by users who like document d [Lafferty and Zhai 01] • Estimate θd using document d; use θd to approximate the queries posed by users who like d • Existing methods differ mainly in the choice of θd and how θd is estimated (smoothing) • Multi-Bernoulli: e.g., [Ponte and Croft 98, Metzler et al. 04] • Multinomial (most popular): e.g., [Hiemstra et al. 99, Miller et al. 99, Zhai and Lafferty 01]
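The multinomial query-likelihood scoring described above can be sketched on toy data. This is an illustrative implementation with Jelinek-Mercer smoothing, not the authors' code; the function name and the default mixing weight are assumptions.

```python
import math
from collections import Counter

def multinomial_jm_score(query, doc, collection, lam=0.5):
    """Query likelihood log p(q|d) under a multinomial document model,
    Jelinek-Mercer smoothed against the collection (illustrative sketch)."""
    cd, cc = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_d = cd[w] / dlen   # ML estimate p(w|d)
        p_c = cc[w] / clen   # background p(w|C)
        score += math.log((1 - lam) * p_d + lam * p_c)
    return score
```

A document containing the query terms scores higher than one that matches only through the background, which is the ranking behavior the LM approach relies on.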
Multi-Bernoulli vs. Multinomial [Illustration: doc d = "text mining model clustering text model text …", query q = "text mining"] • Multi-Bernoulli: flip a coin for each vocabulary word (present/absent in the query) • Multinomial: toss a die for each query position to choose a word
Problems of Multinomial • Does not model term absence • Sum-to-one constraint over all terms • Reality is harder than expected: • Empirical estimates: mean(tf) < variance(tf) (Church & Gale 95) • Estimates on AP88-89: • All terms: μ = 0.0013; σ² = 0.0044 • Query terms: μ = 0.1289; σ² = 0.3918 • Multinomial/Bernoulli: mean > variance
Poisson? • Poisson models frequency directly (including zero frequency) • No sum-to-one constraint on the rates of different words w • Mean = variance • Poisson has been explored in document generation models, but not in query generation models
Related Work • Poisson has been explored in document generation models, e.g., • 2-Poisson Okapi/BM25 (Robertson and Walker 94) • Parallel derivation of probabilistic models (Roelleke and Wang 06) • Our work adds to this body of exploration of Poisson • Within the query generation framework • Explores the specific features Poisson brings to LM
Research Questions • How can we model query generation with Poisson language model? • How can we smooth such a Poisson query generation model? • How is a Poisson model different from a multinomial model in the context of query generation retrieval?
Query Generation with Poisson • Poisson: each term w is an emitter with arrival rate λw, scaled by query length |q|; the query is the receiver [Illustration: from doc "text mining model clustering text model text …", estimated rates: text 3/7, mining 2/7, model 1/7, clustering 1/7, …; query "mining text mining systems" has counts: text 1, mining 2, model 0, clustering 0, systems 1]
Query Generation with Poisson (II) • Represent the query as a frequency vector: q = ‹c(w1, q), c(w2, q), …, c(wN, q)› • Each term wi is emitted by an independent Poisson process with rate λi scaled by |q|: p(q|d) = ∏i Poisson(c(wi, q); λi|q|) • MLE from the document: λi = c(wi, d)/|d|
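The Poisson query likelihood with MLE rates can be sketched as below. This is an illustrative reading of the slides, not the paper's exact code; with unsmoothed MLE rates, any query term absent from the document drives the log-likelihood to negative infinity, which is what motivates the smoothing slides that follow.

```python
import math
from collections import Counter

def poisson_query_loglik(query, doc):
    """log p(q|d) under a Poisson query generation model:
    each term w has MLE rate lambda_w = c(w,d)/|d|, scaled by |q|."""
    cq, cd = Counter(query), Counter(doc)
    qlen, dlen = len(query), len(doc)
    loglik = 0.0
    for w in set(cd) | set(cq):
        mean = (cd[w] / dlen) * qlen   # lambda_w * |q|
        k = cq[w]                      # c(w, q)
        if mean == 0.0:
            if k > 0:                  # query term absent from d
                return float("-inf")
            continue
        # Poisson log pmf: k*log(mean) - mean - log(k!)
        loglik += k * math.log(mean) - mean - math.lgamma(k + 1)
    return loglik
```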
Smoothing Poisson LM [Illustration: document rates (text 0.02, mining 0.01, model 0.02, …, system 0) interpolated with background collection rates (text 0.0001, mining 0.0002, model 0.0001, …, system 0.0001); query: "text mining systems"] • e.g., text: α·0.02 + (1−α)·0.0001; system: α·0 + (1−α)·0.0001 • Different smoothing methods lead to different retrieval formulae
Smoothing Poisson LM • Interpolation (JM): λw = α·c(w,d)/|d| + (1−α)·c(w,C)/|C| • Bayesian smoothing with Gamma prior: λw = (c(w,d) + μ·c(w,C)/|C|) / (|d| + μ) • Two-stage smoothing: Gamma-prior smoothing first, then interpolation with a query background model
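The two basic smoothing estimates can be sketched as follows. This is an illustrative implementation under the assumption that the Poisson forms mirror multinomial JM and Dirichlet-prior smoothing; the function name and defaults are hypothetical.

```python
from collections import Counter

def smoothed_rates(doc, collection, method="gamma", alpha=0.5, mu=1000.0):
    """Smoothed per-term Poisson rates lambda_w (illustrative sketch).
    JM:    lambda_w = alpha*c(w,d)/|d| + (1-alpha)*c(w,C)/|C|
    Gamma: lambda_w = (c(w,d) + mu*c(w,C)/|C|) / (|d| + mu)"""
    cd, cc = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    rates = {}
    for w in cc:
        if method == "jm":
            rates[w] = alpha * cd[w] / dlen + (1 - alpha) * cc[w] / clen
        else:  # "gamma": Bayesian smoothing with a Gamma prior
            rates[w] = (cd[w] + mu * cc[w] / clen) / (dlen + mu)
    return rates
```

With these particular estimates every collection term gets a nonzero rate, so query terms absent from the document no longer zero out the likelihood.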
Smoothing Poisson LM (II) • Two-stage smoothing: interpolate a smoothed version of the document model (from the document and the collection, via the Gamma prior) with a background model of user query preference • Similar to multinomial 2-stage (Zhai and Lafferty 02) • Use the collection background when no user prior is known • Verbose queries need to be smoothed more
Analytical: Equivalence of basic models • With the basic model and MLE: Poisson + Gamma smoothing = multinomial + Dirichlet smoothing (same document ranking) • Basic model + JM smoothing behaves similarly (with a variant component of document length normalization)
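The equivalence claim can be checked numerically on toy data. The sketch below (hypothetical helper names, μ chosen arbitrarily) scores two documents under both models; because the Gamma-smoothed rates sum to one over the vocabulary, the document-dependent part of the Poisson score reduces to the Dirichlet-smoothed multinomial score, so the score difference between any two documents is identical.

```python
import math
from collections import Counter

def poisson_gamma_score(query, doc, collection, mu=100.0):
    """log p(q|d): Poisson model with Gamma-prior-smoothed rates (sketch)."""
    cq, cd, cc = Counter(query), Counter(doc), Counter(collection)
    qlen, dlen, clen = len(query), len(doc), len(collection)
    score = 0.0
    for w in cc:
        lam = (cd[w] + mu * cc[w] / clen) / (dlen + mu)
        k = cq[w]
        score += k * math.log(lam * qlen) - lam * qlen - math.lgamma(k + 1)
    return score

def dirichlet_score(query, doc, collection, mu=100.0):
    """log p(q|d): multinomial model with Dirichlet-prior smoothing."""
    cd, cc = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    return sum(math.log((cd[w] + mu * cc[w] / clen) / (dlen + mu))
               for w in query)
```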
Benefits: Per-term Smoothing • Poisson doesn't require "sum-to-one" over different terms (different event space) • Thus the coefficient αw in JM smoothing and 2-stage smoothing can be made term dependent (per-term) • Multinomial cannot achieve per-term smoothing • Can use an EM algorithm to estimate the αw's
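Per-term smoothing only changes the scalar coefficient into a per-word one, which a sketch makes concrete. The function and the default fallback weight are assumptions; in practice the coefficients would be estimated, e.g., by EM as the slide notes.

```python
from collections import Counter

def per_term_jm_rates(doc, collection, alpha_w, default=0.5):
    """JM-smoothed Poisson rates with a *per-term* coefficient alpha_w[w],
    possible because Poisson rates need not sum to one (sketch)."""
    cd, cc = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    return {w: alpha_w.get(w, default) * cd[w] / dlen
               + (1 - alpha_w.get(w, default)) * cc[w] / clen
            for w in cc}
```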
Benefits: Modeling Background • Traditional: the background as a single model, which does not match the reality • Background as a mixture model: increases variance • Multinomial mixture (e.g., clusters, PLSA, LDA): inefficient (no closed form, iterative estimation) • Poisson mixture (e.g., Katz's K-mixture, 2-Poisson, negative binomial) (Church & Gale 95): closed forms, efficient computation
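Why a mixture helps can be verified with the closed-form moments of a Poisson mixture: any non-degenerate mixture has variance strictly above its mean, matching the over-dispersed term-frequency estimates reported earlier (mean(tf) < variance(tf)). The helper below is an illustrative sketch, not code from the paper.

```python
def mixture_mean_var(weights, rates):
    """Analytic mean and variance of a mixture of Poisson distributions.
    For Poisson(r): mean = r, E[X^2] = r + r^2; mix by the given weights."""
    mean = sum(w * r for w, r in zip(weights, rates))
    ex2 = sum(w * (r + r * r) for w, r in zip(weights, rates))
    return mean, ex2 - mean * mean
```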
Hypotheses • H1: With basic query generation retrieval models (JM smoothing and Gamma smoothing), Poisson behaves similarly to multinomial • H2: Per-term smoothing with Poisson may outperform term-independent smoothing, especially on verbose queries • H3: A background efficiently modeled as a Poisson mixture may perform better than a single Poisson
Experiment Setup • Data: TREC collections and topics • AP88-89, TREC7, TREC8, WT2g • Query types: • Short keyword (title keywords) • Short verbose (one sentence) • Long verbose (multiple sentences) • Measurement: • Mean average precision (MAP)
H1: Basic models behave similarly (MAP) • JM + Poisson ≈ JM + multinomial • Gamma/Dirichlet > JM (for both Poisson and multinomial)
H2: Per-term outperforms term-independent smoothing Per-term > Non-per-term
Improvement Comes from Per-term Smoothing • JM + per-term > JM • 2-stage + per-term > 2-stage • Significant improvement on verbose queries
H3: Poisson Mixture Background Improves Performance • Katz's K-mixture > single Poisson
Poisson Opens Other Potential Flexibilities • Document length penalization? • JM introduces a variant component of document length normalization • Requires more expensive computation • Pseudo-feedback? • Use feedback documents to estimate the background model in 2-stage smoothing and term-dependent smoothing coefficients • Leads to future research directions
Summary • Poisson: Another family of retrieval models based on query generation • Basic models behave similarly to multinomial • Benefits: per-term smoothing and efficient mixture background model • Many other potential flexibilities • Future work: • explore document length normalization and pseudo-feedback • better estimation of per-term smoothing coefficients