Formal Retrieval Frameworks
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
Outline
• Risk Minimization Framework [Lafferty & Zhai 01, Zhai & Lafferty 06]
• Axiomatic Retrieval Framework [Fang et al. 04, Fang & Zhai 05, Fang & Zhai 06]
Risk Minimization: Motivation
• Long-standing IR challenges
  • Improve IR theory: develop theoretically sound and empirically effective models; go beyond the limited traditional notion of relevance (independent, topical relevance)
  • Improve IR practice: optimize retrieval parameters automatically
• Statistical language models (SLMs) are very promising tools…
  • How can we systematically exploit SLMs in IR?
  • Can SLMs offer anything hard or impossible to achieve in traditional IR?
Long-Standing IR Challenges
• Limitations of traditional IR models
  • Strong assumptions on "relevance": independent relevance, topical relevance
  • Can we go beyond this traditional notion of relevance?
• Difficulty in IR practice
  • Ad hoc parameter tuning
  • Can't go beyond "retrieval" to support information access in general
More Than "Relevance"
[Diagram: a pure relevance ranking contrasted with the desired ranking, which also accounts for redundancy and readability]
Retrieval Parameters
• Retrieval parameters are needed to
  • model different user preferences
  • customize a retrieval model according to different queries and documents
• So far, parameters have been set through empirical experimentation
• Can we set parameters automatically?
Systematic Applications of Language Models to IR
• Many different variants of language models have been developed, but are there many more models to be studied?
• Can we establish a road map for exploring language models in IR?
Two Main Ideas of the Risk Minimization Framework
• Retrieval as a decision process
• Systematic language modeling
Idea 1: Retrieval as Decision-Making (a more general notion of relevance)
Given a query, the system must decide:
• Which documents should be selected? (D)
• How should these documents be presented to the user? (π) — e.g., as a ranked list, an unordered subset, or a clustering
Choose: (D, π)
Idea 2: Systematic Language Modeling
[Diagram: the retrieval decision is driven by three modeling components — query modeling (a query language model estimated from the query), document modeling (document language models estimated from the documents), and user modeling (a loss function reflecting the user)]
Generative Model of Document & Query [Lafferty & Zhai 01b]
[Diagram: a user U generates a query q, and a document source S generates a document d; q and d are observed, U and S are partially observed, and relevance R is inferred]
Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]
[Diagram: given the observed query q (from user U) and document set C (from source S), each possible choice (D1, π1), (D2, π2), …, (Dn, πn) incurs a loss L that depends on hidden model parameters; risk minimization selects the choice with the smallest Bayes risk for (D, π)]
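As a sketch (following the notation of [Lafferty & Zhai 01b]; the slide's original equation did not survive extraction), the Bayes risk of a choice (D, π) and the resulting decision rule can be written as:

$$
r(D, \pi \mid q, U, C, S) = \int_{\Theta} L(D, \pi, \theta)\, p(\theta \mid q, U, C, S)\, d\theta,
\qquad
(D^{*}, \pi^{*}) = \arg\min_{(D,\pi)} r(D, \pi \mid q, U, C, S)
$$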
Benefits of the Framework
• Systematic exploration of retrieval models (covering almost all existing retrieval models as special cases)
• Derivation of general retrieval principles (the risk ranking principle)
• Automatic parameter setting
• Going beyond independent relevance (subtopic retrieval)
Special Cases of Risk Minimization
• Set-based models (choose D) → Boolean model
• Ranking models (choose π)
  • Independent loss
    • Relevance-based loss → Probabilistic relevance model, Generative Relevance Theory, Two-stage LM
    • Distance-based loss → Vector-space model, KL-divergence model
  • Dependent loss
    • MMR loss → Subtopic retrieval model
    • MDR loss → Subtopic retrieval model
Case 1: Two-Stage Language Models
• Loss function: relevance-based, leading to a query-likelihood risk ranking formula
• Two-stage smoothing of the document model (sketched below):
  • Stage 1: smooth the document language model with the collection model (Dirichlet prior smoothing)
  • Stage 2: mix the smoothed document model with a query background (user) model (mixture model)
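As a concrete sketch, the two-stage smoothing formula from [Zhai & Lafferty 06] can be implemented as follows; the collection model p(w|C) and the query background (user) model p(w|U) are assumed to be given as word-to-probability dictionaries:

```python
import math

def two_stage_score(query, doc, p_collection, p_background, mu=2000.0, lam=0.5):
    """Query log-likelihood under two-stage smoothing.

    Stage 1: Dirichlet-prior smoothing of the document model with
             the collection model p(w|C), controlled by mu.
    Stage 2: linear interpolation with a query background (user)
             model p(w|U), controlled by lam.
    """
    doc_len = len(doc)
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    score = 0.0
    for w in query:
        # Stage 1: Dirichlet prior smoothing
        p_stage1 = (counts.get(w, 0) + mu * p_collection.get(w, 1e-9)) / (doc_len + mu)
        # Stage 2: mixture with the query background model
        p_w = (1 - lam) * p_stage1 + lam * p_background.get(w, 1e-9)
        score += math.log(p_w)
    return score
```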
Case 2: KL-Divergence Retrieval Models
• Loss function: distance between the estimated query model and the estimated document model
• Risk ranking formula: rank documents by the (negative) KL divergence between the query model θ_Q and the document model θ_D
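The slide's original equation did not survive extraction; the standard KL-divergence ranking formula from the language-modeling literature is:

$$
\mathrm{score}(q, d) \;\propto\; -D\big(\hat\theta_Q \,\|\, \hat\theta_D\big) \;\stackrel{\mathrm{rank}}{=}\; \sum_{w} p(w \mid \hat\theta_Q)\, \log p(w \mid \hat\theta_D)
$$

where the dropped entropy term depends only on the query and does not affect the ranking.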
Case 3: Aspect Generative Model of Document & Query
[Diagram: user U generates query q, and source S generates document d, both through a shared set of aspect models θ = (θ1, …, θk)]
• PLSI: documents and queries are mixtures of k aspect (topic) models with document-specific mixing weights
• LDA: the mixing weights are themselves drawn from a Dirichlet prior
Optimal Ranking for Independent Loss: the "risk ranking principle" [Zhai 02]
• Decision space = {rankings}
• Assume sequential browsing and an independent loss: the loss of a ranking decomposes over individual documents
• Then independent risk = independent scoring: the ranking risk decomposes into per-document risks, so the optimal ranking sorts documents by their individual risk
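A sketch of the decomposition behind this principle (the exact form of the browsing weights is not recoverable from the slide): if the loss of a ranking π is a weighted sum of per-document losses, the Bayes risk decomposes accordingly, and with decreasing weights it is minimized by sorting documents by individual risk:

$$
r(\pi \mid q, U, C, S) \;=\; \sum_{i=1}^{n} w_i \, r\big(d_{\pi(i)} \mid q, U, S\big)
\;\;\Rightarrow\;\;
\pi^{*} = \text{rank documents in ascending order of } r(d \mid q, U, S)
$$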
Automatic Parameter Tuning
• Retrieval parameters are needed to model different user preferences and to customize a retrieval model to specific queries and documents
• In traditional models, retrieval parameters are EXTERNAL to the model and hard to interpret
  • Parameters are introduced heuristically to implement "intuition"
  • There are no principles to quantify them; they must be set empirically through many experiments
  • Even then, there is no guarantee for new queries/documents
• Language models make it possible to estimate parameters…
The Way to Automatic Tuning…
• Parameters must be PART of the model!
  • Query modeling (explains differences across queries)
  • Document modeling (explains differences across documents)
• Decouple the influence of the query on parameter setting from that of the documents
  • To achieve stable parameter settings
  • To pre-compute query-independent parameters
Parameter Setting in Risk Minimization
[Diagram: query model parameters are estimated from the query (query language model), document model parameters are estimated from the documents (document language models), and user model parameters in the loss function are set to reflect the user]
Generative Relevance Hypothesis [Lavrenko 04]
• Generative Relevance Hypothesis: for a given information need, queries expressing that need and documents relevant to that need can be viewed as independent random samples from the same underlying generative model
• A special case of risk minimization in which document models and query models are in the same space
• Implications for retrieval models: "the same underlying generative model" makes it possible to
  • match queries and documents even if they are in different languages or media
  • estimate/improve a relevant document model based on example queries, or vice versa
Risk minimization can easily go beyond independent relevance…
Aspect Retrieval
Query: What are the applications of robotics in the world today? Find as many DIFFERENT applications as possible.

Aspect judgments (1 = document covers the aspect):

| | A1 | A2 | A3 | A4 | … | A(k-1) | Ak |
|---|---|---|---|---|---|---|---|
| d1 | 1 | 1 | 0 | 0 | … | 0 | 0 |
| d2 | 0 | 1 | 1 | 1 | … | 0 | 0 |
| d3 | 0 | 0 | 0 | 0 | … | 1 | 0 |
| … | | | | | | | |
| dk | 1 | 0 | 1 | 0 | … | 0 | 1 |

Example aspects — A1: spot-welding robotics; A2: controlling inventory; A3: pipe-laying robots; A4: talking robot; A5: robots for loading & unloading memory tapes; A6: robot [telephone] operators; A7: robot cranes; …

Must go beyond independent relevance!
Evaluation Measures
• Aspect Coverage (AC): measures per-document coverage = #distinct-aspects / #docs; maximizing it is equivalent to the "set cover" problem, NP-hard
• Aspect Uniqueness (AU): measures redundancy = #distinct-aspects / #aspects; maximizing it is equivalent to the "volume cover" problem, NP-hard
• Example (counts accumulated down the ranked list d1, d2, d3):

| | d1 | d2 | d3 |
|---|---|---|---|
| #docs | 1 | 2 | 3 |
| #aspects | 2 | 5 | 8 |
| #unique aspects | 2 | 4 | 5 |
| AC | 2/1 = 2.0 | 4/2 = 2.0 | 5/3 = 1.67 |
| AU | 2/2 = 1.0 | 4/5 = 0.8 | 5/8 = 0.625 |
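A small sketch of how AC and AU accumulate down a ranking, using binary aspect judgments as in the table above (helper names are hypothetical):

```python
def aspect_coverage_curves(ranking, aspect_judgments):
    """Accumulate Aspect Coverage (AC) and Aspect Uniqueness (AU)
    down a ranked list.

    ranking: list of doc ids, e.g. ["d1", "d2", "d3"]
    aspect_judgments: dict mapping doc id -> set of covered aspects
    Returns (AC, AU) lists, one value per rank cutoff.
    """
    seen = set()   # distinct aspects covered so far
    total = 0      # total aspect occurrences so far (with repeats)
    ac, au = [], []
    for k, doc in enumerate(ranking, start=1):
        aspects = aspect_judgments[doc]
        total += len(aspects)
        seen |= aspects
        ac.append(len(seen) / k)       # distinct aspects per document
        au.append(len(seen) / total)   # distinct aspects per occurrence
    return ac, au

# Reproduces the example above:
judgments = {"d1": {"a1", "a2"},
             "d2": {"a2", "a3", "a4"},
             "d3": {"a3", "a4", "a5"}}
ac, au = aspect_coverage_curves(["d1", "d2", "d3"], judgments)
# ac == [2.0, 2.0, 1.666...]   au == [1.0, 0.8, 0.625]
```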
Dependent Relevance Ranking
• In general, computing the optimal ranking is NP-hard
• A general greedy algorithm (sketched below):
  • Pick the first document according to INDEPENDENT relevance
  • Given the k documents picked so far, evaluate the CONDITIONAL relevance of each candidate document
  • Choose the document with the highest conditional relevance value, and repeat
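A minimal sketch of this greedy procedure; `relevance` and `conditional_relevance` are hypothetical scoring functions (e.g., the MMR- or MDR-style losses discussed next):

```python
def greedy_rank(candidates, relevance, conditional_relevance, n):
    """Greedy approximation to dependent-relevance ranking.

    candidates: list of documents
    relevance(d): independent relevance score of d
    conditional_relevance(d, selected): score of d given the
        documents already selected
    Returns the top-n ranking.
    """
    remaining = list(candidates)
    # Step 1: first document by independent relevance
    first = max(remaining, key=relevance)
    selected = [first]
    remaining.remove(first)
    # Steps 2-3: repeatedly add the best conditional candidate
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda d: conditional_relevance(d, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```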
Loss Function L(θ(k+1) | θ1 … θk)
• Maximal Marginal Relevance (MMR): combine novelty Nov(θ(k+1) | θ1 … θk) with relevance Rel(θ(k+1)); the best d(k+1) is both novel and relevant
• Maximal Diverse Relevance (MDR): model each known document d1 … dk by its aspect coverage distribution p(a|θi); the best d(k+1) is complementary in coverage
Maximal Marginal Relevance (MMR) Models
• Maximize aspect coverage indirectly, through redundancy elimination
• Conditional relevance = novel + relevant
• Two elements: a redundancy/novelty measure, and a combination of novelty and relevance
A Mixture Model for Redundancy
• Model a new document as a two-component mixture of a reference ("old") document model and the collection background model: p(w) = λ · P(w|Old) + (1 − λ) · P(w|Background)
• Estimate the mixing weight λ by maximum likelihood, using Expectation-Maximization; a large λ indicates the new document is largely redundant with respect to the reference document
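A sketch of the maximum-likelihood EM estimate of λ under this mixture; `p_old` and `p_bg` are assumed to be given unigram models (word-to-probability dictionaries):

```python
def estimate_redundancy(doc, p_old, p_bg, iters=50):
    """EM estimate of lam in p(w) = lam*p_old(w) + (1-lam)*p_bg(w).

    doc: list of words in the new document
    p_old, p_bg: dicts mapping word -> probability
    A large lam suggests the new document is redundant with
    respect to the reference document.
    """
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    n = sum(counts.values())
    lam = 0.5  # initial guess
    for _ in range(iters):
        # E-step: expected number of occurrences generated
        # by the "old" (reference document) component
        expected_old = 0.0
        for w, c in counts.items():
            po = lam * p_old.get(w, 1e-9)
            pb = (1 - lam) * p_bg.get(w, 1e-9)
            expected_old += c * po / (po + pb)
        # M-step: re-estimate the mixing weight
        lam = expected_old / n
    return lam
```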
Cost-Based Combination of Relevance and Novelty
• Rank candidate documents by a combination of the relevance score and the novelty score, with costs reflecting how strongly the user penalizes non-relevant vs. redundant documents
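The exact cost-based formula on the slide did not survive extraction; as an illustration only, a simple weighted combination with a hypothetical trade-off parameter:

```python
def combined_score(rel_score, nov_score, alpha=0.7):
    """Illustrative only (not the slide's exact formula):
    trade off relevance against novelty with weight alpha."""
    return alpha * rel_score + (1 - alpha) * nov_score
```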
Maximal Diverse Relevance (MDR) Models
• Maximize aspect coverage directly, through aspect modeling
• Conditional relevance = complementary coverage
• Two elements: an aspect loss function, and a generative aspect model
Aspect Generative Model of Document & Query (as introduced earlier)
[Diagram: user U generates query q and source S generates document d through shared aspect models θ = (θ1, …, θk), instantiated with PLSI or LDA]
Aspect Loss Function
[Diagram: the aspect loss compares the aspect coverage of the candidate document (estimated from d and source S) against the desired coverage implied by the query q and user U]
Aspect Loss Function: Illustration
[Diagram: the aspects "already covered" by the selected documents give p(a|θ1) … p(a|θ(k−1)); a new candidate contributes p(a|θk); their combined coverage is compared against the desired coverage p(a|θQ). A candidate can be perfect (fills the remaining gap), redundant (re-covers what is already covered), or non-relevant (covers aspects outside the desired coverage)]
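One natural instantiation consistent with the illustration (a sketch, not necessarily the exact loss on the slide) compares desired and combined coverage via KL divergence:

$$
L(\theta_{k} \mid \theta_1, \ldots, \theta_{k-1}) \;=\; D\!\left( p(a \mid \theta_Q) \,\Big\|\, \tfrac{1}{k}\textstyle\sum_{i=1}^{k} p(a \mid \theta_i) \right)
$$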
Risk Minimization: Summary
• Risk minimization is a general probabilistic retrieval framework
  • Retrieval as a decision problem (= risk minimization)
  • Separate, flexible language models for queries and documents
• Advantages
  • A unified framework for existing models
  • Automatic parameter tuning, thanks to language models
  • Allows for modeling complex retrieval tasks
• Lots of potential for exploring language models…
• For more information, see [Zhai 02]
Future Research Directions
• Modeling latent structures of documents: introduce source structures (which naturally suggest structure-based smoothing methods)
• Modeling multiple queries and clickthroughs of the same user: let the observation include multiple queries and clickthroughs
• Collaborative search: introduce latent interest variables to tie similar users together
• Modeling interactive search
Axiomatic Retrieval Framework
(Most of the following slides are from Hui Fang's presentation.)
Traditional Way of Modeling Relevance
[Diagram: a query is represented as QRep and a document as DRep. Vector space models [Salton et al. 75, Salton et al. 83, Salton et al. 89, Singhal 96] estimate Rel ≈ Sim(DRep, QRep); probabilistic models [Robertson et al. 76, van Rijsbergen 77, Fuhr et al. 92, Turtle et al. 91, Ponte et al. 98, Lafferty et al. 03] estimate Rel ≈ P(R=1|DRep, QRep); parameters are tuned against a test collection]
• No way to predict the performance of a function or identify its weaknesses analytically
• Sophisticated parameter tuning is required
Sophisticated Parameter Tuning “k1, b and k3 are parameters which depend on the nature of the queries and possibly on the database; k1 and b default to 1.2 and 0.75 respectively, but smaller values of b are sometimes advantageous; in long queries k3 is often set to 7 or 1000.” [Robertson et al. 1999]
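For concreteness, here is the standard Okapi BM25 scoring function in which k1, b, and k3 appear (a sketch; corpus statistics such as document frequencies are assumed precomputed):

```python
import math

def bm25_score(query_counts, doc_counts, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=1000.0):
    """Okapi BM25 (standard form). k1 controls term-frequency
    saturation, b controls document-length normalization, and k3
    controls query-term-frequency saturation (mainly relevant
    for long queries)."""
    score = 0.0
    for w, qtf in query_counts.items():
        if w not in doc_counts:
            continue
        tf = doc_counts[w]
        idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score
```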
Hui Fang's Thesis Work [Fang 07]
Propose a novel axiomatic framework in which relevance is directly modeled with term-based constraints:
• Predict the performance of a retrieval function analytically [Fang et al., SIGIR04]
• Derive more robust and effective retrieval functions [Fang & Zhai, SIGIR05; Fang & Zhai, SIGIR06]
• Diagnose weaknesses and strengths of retrieval functions [Fang & Zhai, under review]
Axiomatic Approach to Relevance Modeling
[Diagram ("we are here"): relevance Rel(Q,D) is modeled directly through a set of formal constraints (constraint 1, constraint 2, …, constraint m), each testable on a collection. The constraints are used to (1) predict performance, (2) develop more robust functions, and (3) diagnose weaknesses]
Part 1: Define Retrieval Constraints [Fang et al., SIGIR 2004]
Empirical Observations in IR (Cont.)
• Retrieval heuristics shared by major retrieval functions: term frequency (with alternative TF transformations, e.g., 1+ln(c(w,d))), inverse document frequency, document length normalization, and parameter sensitivity
• These heuristics appear, in different forms, in the pivoted normalization method, the Dirichlet prior method, and the Okapi method (reconstructed below)
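The three retrieval functions referenced on the slide, in their standard forms from the literature (the slide's formula images did not survive extraction; these are reconstructions):

Pivoted normalization:
$$
S(q,d) = \sum_{w \in q \cap d} \frac{1 + \ln\big(1 + \ln c(w,d)\big)}{1 - s + s\,\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}
$$

Dirichlet prior:
$$
S(q,d) = \sum_{w \in q \cap d} c(w,q) \ln\!\Big(1 + \frac{c(w,d)}{\mu\, p(w\mid C)}\Big) + |q| \ln\frac{\mu}{|d| + \mu}
$$

Okapi (BM25):
$$
S(q,d) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1+1)\,c(w,d)}{k_1\big((1-b) + b\frac{|d|}{avdl}\big) + c(w,d)} \cdot \frac{(k_3+1)\,c(w,q)}{k_3 + c(w,q)}
$$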
Research Questions
• How can we formally characterize these necessary retrieval heuristics?
• Can we predict the empirical behavior of a method without experimentation?
Term Frequency Constraints (TFC1)
TF weighting heuristic I: give a higher score to a document with more occurrences of a query term.
• TFC1: let q be a query with only one term w. If |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
Term Frequency Constraints (TFC2)
TF weighting heuristic II: favor a document with more distinct query terms.
• TFC2: let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and that w1 and w2 are equally discriminative (e.g., idf(w1) = idf(w2)). If c(w1,d2) = c(w1,d1) + c(w2,d1), c(w2,d2) = 0, and c(w1,d1), c(w2,d1) > 0, then f(d1,q) > f(d2,q).
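Such constraints can also be checked empirically against any scoring function. A minimal sketch for TFC1 (helper names are hypothetical; TFC2 can be checked analogously by constructing document pairs that satisfy its preconditions):

```python
def check_tfc1(f, w, filler_words):
    """Empirical check of TFC1.

    f(doc, query) -> score is the retrieval function under test.
    filler_words: non-empty list of words not containing w, used
    to build two equal-length documents differing only in c(w, d).
    TFC1 requires the document with more occurrences of w to
    score higher.
    """
    query = [w]
    d1 = filler_words + [w, w]                 # c(w, d1) = 2
    d2 = filler_words + [w, filler_words[0]]   # c(w, d2) = 1, same length
    assert len(d1) == len(d2)
    return f(d1, query) > f(d2, query)
```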