A Field Relevance Model for Structured Document Retrieval. JIN YOUNG KIM @ ECIR 2012. Three Themes: The Concept of Field Relevance, Using Field Relevance for Retrieval, The Estimation of Field Relevance.
A Field Relevance Model for Structured Document Retrieval JIN YOUNG KIM @ ECIR 2012
Three Themes
• The Concept of Field Relevance
• Using Field Relevance for Retrieval
• The Estimation of Field Relevance
[Diagram labels: Relevance, Field Weighting, Field Relevance]
IR: The Quest for Relevance
• The Role of Relevance
  • Core Component of Retrieval Models
  • Basis of (Pseudo) Relevance Feedback
• Retrieval Models based on the Relevance
  • Binary Independence Model (BM25) [Robertson76]
  • Relevance-based Language Model [Lavrenko01]
Term-level relevance: P(w|R), V = (w1 w2 ... wm)
Structured Document Retrieval
• Documents have multiple fields
  • Emails, products (entities), and so on.
• Retrieval models exploit the structure
  • Field weighting is common
[Diagram: each query term q1 ... qm is scored against fields f1 ... fn; the field scores are weighted by w1 ... wn and summed per term, then multiplied across terms]
Relevance for Structured Document Retrieval
• Term-level Relevance
  • Which term is important for user’s information need?
  • P(w|R), V = (w1 w2 ... wm)
• Field-level Relevance
  • Which field is important for user’s information need?
  • P(F|R), F = (F1 F2 … Fn)
Defining the Field Relevance
Field Relevance: P(F|w,R) (per-term)
• The distribution of per-term relevance over document fields
Query: m words, Q = (q1 q2 ... qm)
Collection: n fields for each document, F = (F1 F2 … Fn)
[Diagram: one distribution over the fields F1 … Fj … Fn for each query term: P(F|q1,R), …, P(F|qi,R), …, P(F|qm,R)]
Why P(F|w,R) instead of P(F|R)?
• Different fields are relevant for different query terms
Query: ‘james registration’
• ‘james’ is relevant when it occurs in <to>
• ‘registration’ is relevant when it occurs in <subject>
More Evidence for the Field Relevance
• Field Operator / Advanced Search Interface
• User’s search terms are found in multiple fields
  • Understanding Re-finding Behavior in Naturalistic Email Interaction Logs. Elsweiler, D., Harvey, M., Hacker, M. [SIGIR'11]
  • Evaluating Search in Personal Social Media Collections. Chia-Jung L., Croft, W.B., Kim, J. [WSDM'12]
Retrieval over Structured Documents
• Field-based Retrieval Models
  • Score each field against each query term
  • Combine field-level scores using field weights
Fixed field weights wj can be too restrictive.
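The two steps listed on this slide can be sketched as follows. This is an illustrative sketch rather than the implementation behind the talk: it assumes Dirichlet-smoothed field language models and a sum-over-fields, multiply-over-terms combination. The function names, the `mu` value, and the `1e-6` floor for unseen terms are hypothetical choices.

```python
import math

def field_lm_prob(term, field_tokens, collection_prob, mu=100.0):
    """Dirichlet-smoothed P(term | language model of one document field)."""
    tf = field_tokens.count(term)
    return (tf + mu * collection_prob) / (len(field_tokens) + mu)

def fixed_weight_log_score(query_terms, doc_fields, field_weights,
                           collection_probs, mu=100.0):
    """Log score of one document under fixed field weights.

    doc_fields:       field name -> list of tokens in that field of the document
    field_weights:    field name -> fixed weight w_j (same for every query term)
    collection_probs: field name -> {term: P(term | field in the collection)}
    """
    log_score = 0.0
    for q in query_terms:
        mixed = 0.0
        for field, tokens in doc_fields.items():
            # Assumed floor for terms never seen in this field of the collection.
            p_coll = collection_probs[field].get(q, 1e-6)
            mixed += field_weights[field] * field_lm_prob(q, tokens, p_coll, mu)
        # Product over query terms, accumulated in log space.
        log_score += math.log(mixed)
    return log_score
```

Because `field_weights` does not depend on the query term, a term matching in any field is rewarded in the same proportion, which is the restriction the slide points out.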
Using the Field Relevance for Retrieval
• Field Relevance Model
• Comparison with Mixture of Field Language Model
[Diagram: the Mixture of Field Language Model applies the same fixed weights w1 ... wn to fields f1 ... fn for every query term q1 ... qm; the Field Relevance Model uses per-term field weights P(F1|qi) ... P(Fn|qi) instead. In both, per-term field scores are weighted and summed over fields, then multiplied across query terms.]
Structured Document Retrieval: PRM-S [Kim, Xue, Croft 09]
• Probabilistic Retrieval Model for Semi-structured data
  • Estimate the mapping between query terms and doc. fields
  • Use the mapping probability as per-term field weights
Estimation is based on limited sources.
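A minimal sketch of the mapping step, assuming the mapping probability is obtained by normalizing the collection-level field language model probabilities of a term across fields, with a uniform field prior by default. The function name, the data layout, and the fallback for unseen terms are hypothetical.

```python
def mapping_probs(term, collection_field_lms, field_priors=None):
    """Estimate P(F_j | term) from collection-level field language models.

    collection_field_lms: field name -> {term: P(term | field in the collection)}
    field_priors:         field name -> P(F_j); uniform if omitted (an assumption)
    """
    fields = list(collection_field_lms)
    if field_priors is None:
        field_priors = {f: 1.0 / len(fields) for f in fields}
    joint = {f: collection_field_lms[f].get(term, 0.0) * field_priors[f]
             for f in fields}
    total = sum(joint.values())
    if total == 0.0:
        # Term unseen in every field: fall back to the prior (assumed behavior).
        return dict(field_priors)
    return {f: v / total for f, v in joint.items()}
```

The resulting distribution over fields is what gets used as the per-term field weights; since it relies only on collection statistics, it can miss fields that matter for a particular query.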
Using the Field Relevance for Retrieval
• Field Relevance Model
• Comparison with the PRM-S
  • FRM has the same functional form as PRM-S
  • FRM differs in how per-term field weights are estimated
[Equation labels: Per-term Field Score, Per-term Field Weight]
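A hedged sketch of scoring with per-term field weights: each query term gets its own distribution over fields, and the weighted per-field scores are summed per term and multiplied across terms. The function name and the nested-dictionary layout are assumptions made for illustration; the per-field probabilities are assumed to be smoothed already, so they are strictly positive.

```python
import math

def per_term_weight_log_score(query_terms, term_field_probs, field_relevance):
    """Log score of one document using per-term field weights.

    term_field_probs[q][f]: P(q | language model of field f of the document)
    field_relevance[q][f]:  per-term field weight, e.g. an estimate of P(F_f | q, R)
    """
    log_score = 0.0
    for q in query_terms:
        mixed = sum(field_relevance[q][f] * p
                    for f, p in term_field_probs[q].items())
        log_score += math.log(mixed)
    return log_score
```

With weights such as 'james' → <to> and 'registration' → <subject>, a document matching each term in its relevant field outranks one matching the same terms in the wrong fields, which a single set of fixed weights cannot express.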
Estimating Field Relevance: in a Nutshell
• If User Provides Feedback
  • Relevant document provides sufficient information
• If No Feedback is Available
  • Combine field-level term statistics from multiple sources
[Diagram: field-level distributions over from/to, title, and content from the Collection and the Top-k Docs are combined (+) to approximate (≅) the distribution of the Relevant Docs]
Estimating Field Relevance using Feedback
• Assume a user who marked DR as relevant
  • Estimate field relevance from the field-level term dist. of DR
• We can personalize the results accordingly
  • Rank higher docs with similar field-level term distribution
Field Relevance (from DR):
- To is relevant for ‘james’
- Content is relevant for ‘registration’
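One way to read the estimation step, sketched under assumptions: take the relative frequency of a query term in each field of the marked document DR and normalize across fields. The function name and the additive `eps` smoothing are hypothetical and not taken from the talk.

```python
def field_relevance_from_doc(term, relevant_doc_fields, eps=1e-3):
    """Estimate a per-term field distribution from one relevant document.

    relevant_doc_fields: field name -> list of tokens in that field of D_R
    eps: small constant so that no field gets exactly zero weight (assumed)
    """
    raw = {}
    for field, tokens in relevant_doc_fields.items():
        rel_freq = tokens.count(term) / len(tokens) if tokens else 0.0
        raw[field] = rel_freq + eps
    total = sum(raw.values())
    return {field: value / total for field, value in raw.items()}
```

For a document with 'james' only in the To field and 'registration' only in the Content field, this yields the pattern shown on the slide: To dominates for 'james' and Content for 'registration'.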
Estimating Field Relevance without Feedback
• Method
  • Linear Combination of Multiple Sources
  • Weights estimated using training queries
• Features
  • Field-level term distribution of the collection
    • Unigram and Bigram LM (the unigram version is the same as PRM-S)
  • Field-level term distribution of top-k docs
    • Unigram and Bigram LM (pseudo-relevance feedback)
  • A priori importance of each field (wj)
    • Estimated using held-out training queries (similar to MFLM and BM25F)
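The linear combination can be sketched as below. The source names and mixing weights in the usage note are made-up placeholders; in the talk the weights are learned from training queries, which this sketch does not attempt. The function name is hypothetical.

```python
def combine_sources(source_estimates, source_weights):
    """Linearly combine several per-term field distributions and renormalize.

    source_estimates: source name -> {field: estimated P(field | term)}
    source_weights:   source name -> non-negative mixing weight
    """
    fields = list(next(iter(source_estimates.values())))
    combined = {
        f: sum(source_weights[s] * est[f] for s, est in source_estimates.items())
        for f in fields
    }
    total = sum(combined.values())
    return {f: v / total for f, v in combined.items()}
```

For example, mixing a collection-based estimate, a top-k estimate, and a field prior with weights 0.3, 0.5, and 0.2 gives a single distribution over fields for the query term, which can then serve as its per-term field weights.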
Experimental Setup
• Collections
  • TREC Emails
  • IMDB Movies
  • Monster Resumes
• Distribution of the Most Relevant Field
Query Examples (Indri)
• Oracle Estimates of Field Relevance
[Panels: TREC, IMDB, Monster]
Retrieval Methods Compared
• Baselines
  • DQL / BM25F
  • MFLM: field weights fixed regardless of terms
  • PRM-S: field weights estimated using the collection
• Field Relevance Models
  • FRM-C: field weights estimated using the combination
  • FRM-O: field weights estimated using relevant documents
The methods differ only in terms of the field weighting!
Retrieval Effectiveness (Metric: Mean Reciprocal Rank)
[Table groups: Fixed Field Weights, Per-term Field Weights]
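For reference, Mean Reciprocal Rank averages, over queries, the reciprocal of the rank at which the first relevant document appears. A small sketch with a hypothetical function name and data layout follows; queries with no relevant document retrieved contribute 0.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Compute MRR over a set of queries.

    rankings: query id -> list of document ids, best first
    relevant: query id -> set of relevant document ids
    """
    total = 0.0
    for qid, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(rankings)
```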
Quality of Field Relevance Estimation
• Aggregated KL-Divergence from Oracle Estimates
• Aggregated Cosine Similarity with Oracle Estimates
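Both measures compare an estimated per-term field distribution with the oracle one. The sketch below makes two assumptions not stated on the slide: the KL-divergence is taken in the direction KL(oracle || estimate), and aggregation means averaging over query terms. Function names and the `eps` constant are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) over fields; eps guards against zero probabilities in q."""
    return sum(p[f] * math.log((p[f] + eps) / (q[f] + eps))
               for f in p if p[f] > 0)

def cosine_similarity(p, q):
    """Cosine similarity between two field distributions seen as vectors."""
    dot = sum(p[f] * q[f] for f in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def aggregate(metric, oracle, estimate):
    """Average a per-term metric over all query terms (assumed aggregation)."""
    return sum(metric(oracle[t], estimate[t]) for t in oracle) / len(oracle)
```

Lower aggregated KL-divergence and higher aggregated cosine similarity both indicate estimates closer to the oracle.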
Feature Ablation Results
• Features Revisited
  • Field-level term distribution of the collection (PRM-S)
  • Field-level term distribution of top-k documents
  • A priori relevance of term (prior)
• Results for TREC Collection
Summary
• Field relevance as a generalization of field weighting
  • Relevance modeling for structured document retrieval
• Field relevance model for structured doc. retrieval
  • Using field relevance to combine per-field LM scores
• Estimating the field relevance using relevant docs
  • Providing a natural way to incorporate relevance feedback
• Estimating the field relevance by combining sources
  • Improved performance over MFLM and PRM-S
Ongoing Work
• Large-scale batch evaluation on a book collection
  • Test collections built using OpenLibrary.org query logs
• Evaluation of the relevance feedback on FRM
  • Does relevance feedback improve subsequent results?
• Integrating the term relevance and field relevance
  • Further improvement is expected when combined
[Diagram labels: Term Relevance, Field Relevance]
I’m on the job market! More at @jin4ir, or cs.umass.edu/~jykim
• Structured Document Retrieval
  • A Probabilistic Retrieval Model for Semi-structured Data [ECIR09]
  • A Field Relevance Model for Structured Document Retrieval [ECIR12]
• Personal Search
  • Retrieval Experiments using Pseudo-Desktop Collections [CIKM09]
  • Ranking using Multiple Document Types in Desktop Search [SIGIR10]
  • Evaluating an Associative Browsing Model for Personal Info. [CIKM11]
  • Evaluating Search in Personal Social Media Collections [WSDM12]
• Web Search
  • An Analysis of Instability for Web Search Results [ECIR10]
  • Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic [WSDM12]
Optimality of Field Relevance Estimation
• Estimating the field relevance from the relevant document DR results in the optimal field weighting
  • Scores DR as highly as possible against other docs
  • Under the language modeling framework for IR
Proof in the extended version.
[Equation labels: Per-term Field Score, Per-term Field Weight]
Features based on Field-level Term Dists.
• Summary
• Estimation
  • Unigram LM (= PRM-S)
  • Bigram LM