A Formal Study of Information Retrieval Heuristics



  1. A Formal Study of Information Retrieval Heuristics Hui Fang, Tao Tao and ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign USA

  2. Empirical Observations in IR
  • Retrieval heuristics are necessary for good retrieval performance.
  • E.g., TF-IDF weighting, document length normalization
  • Similar formulas may perform quite differently.
  • Performance is sensitive to parameter settings.

  3. Empirical Observations in IR (Cont.)
  Three representative retrieval formulas share the same heuristic components:
  • Pivoted Normalization Method
  • Dirichlet Prior Method
  • Okapi Method
  Each combines a term frequency part (e.g., the alternative TF transformation 1+ln(c(w,d))), an inverse document frequency part, document length normalization, and one or more sensitive parameters.
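As a concrete reference, the three formulas can be sketched in Python (standard forms from the IR literature; the variable names such as avdl, df, p_wC and the default parameter values are illustrative, not taken from the slides):

```python
import math

def pivoted(c_wd, c_wq, dlen, avdl, N, df, s=0.2):
    """Pivoted normalization: contribution of one matched query term w."""
    tf = 1 + math.log(1 + math.log(c_wd))        # sub-linear TF transformation
    norm = 1 - s + s * dlen / avdl               # pivoted length normalization
    return tf / norm * c_wq * math.log((N + 1) / df)

def dirichlet(c_wd, c_wq, dlen, p_wC, mu=2000):
    """Dirichlet prior: per-term part; a |q|*ln(mu/(dlen+mu)) term is added once per document."""
    return c_wq * math.log(1 + c_wd / (mu * p_wC))

def okapi(c_wd, c_wq, dlen, avdl, N, df, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25: contribution of one matched query term w."""
    idf = math.log((N - df + 0.5) / (df + 0.5))  # can turn negative when df(w) > N/2
    tf = (k1 + 1) * c_wd / (k1 * ((1 - b) + b * dlen / avdl) + c_wd)
    qtf = (k3 + 1) * c_wq / (k3 + c_wq)
    return idf * tf * qtf
```

Keeping the per-term contributions separate makes the later constraint analysis easier: each heuristic (TF, IDF, length normalization) sits in an identifiable factor.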

  4. Research Questions • How can we formally characterize these necessary retrieval heuristics? • Can we predict the empirical behavior of a method without experimentation?

  5. Outline • Formalized heuristic retrieval constraints • Analytical evaluation of the current retrieval formulas • Benefits of constraint analysis • Better understanding of parameter optimization • Explanation of performance difference • Improvement of existing retrieval formulas

  6. Term Frequency Constraints (TFC1)
  TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.
  • TFC1: Let q be a query with only one term w. If |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
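TFC1 can be spot-checked numerically for a given formula; a small sketch using a pivoted-normalization-style score (all collection statistics here are made-up illustrative numbers):

```python
import math

def score(c_wd, dlen, s=0.2, avdl=100, N=1000, df=10):
    # pivoted-normalization-style score for a one-term query
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * dlen / avdl) * math.log((N + 1) / df)

# TFC1: same document length, more occurrences of w => strictly higher score
assert score(5, 100) > score(3, 100)
```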

  7. Term Frequency Constraints (TFC2)
  TF weighting heuristic II: Favor a document with more distinct query terms.
  • TFC2: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w2,d1) = 0, c(w1,d2) > 0, c(w2,d2) > 0, and c(w1,d1) = c(w1,d2) + c(w2,d2), then f(d1,q) < f(d2,q).

  8. Term Discrimination Constraint (TDC)
  IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.
  Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
  Doc 1: … SVM SVM … Tutorial …
  Doc 2: … SVM … Tutorial Tutorial …
  With the same total number of query-term occurrences, the document with more occurrences of the more discriminative term (SVM) should score higher.

  9. Term Discrimination Constraint (Cont.)
  • TDC: Let q be a query and w1, w2 be two query terms. Assume |d1| = |d2|, c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2), and c(w,d1) = c(w,d2) for all other words w. If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q).

  10. Length Normalization Constraints (LNCs)
  Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).
  • LNC1: Let q be a query. If, for some word w' not in q, c(w',d2) = c(w',d1) + 1, but c(w,d2) = c(w,d1) for all other words w, then f(d1,q) ≥ f(d2,q).
  • LNC2: Let q be a query. For any k > 1, if |d1| = k·|d2| and c(w,d1) = k·c(w,d2) for all words w, then f(d1,q) ≥ f(d2,q).
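LNC1 can likewise be spot-checked: appending one non-query word leaves every query-term count unchanged and only grows |d| by one, so the score must not increase (illustrative statistics again):

```python
import math

def score(c_wd, dlen, s=0.2, avdl=100, N=1000, df=10):
    # pivoted-normalization-style score for a one-term query
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * dlen / avdl) * math.log((N + 1) / df)

# LNC1: d2 is d1 plus one extra non-query word, so only |d| grows by 1
assert score(3, 100) >= score(3, 101)
```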

  11. TF-LENGTH Constraint (TF-LNC)
  TF-LN heuristic: Regularize the interaction of TF and document length.
  • TF-LNC: Let q be a query with only one term w. If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) − c(w,d2), then f(d1,q) > f(d2,q).

  12. Analytical Evaluation

  13. Term Discrimination Constraint (TDC)
  IDF weighting heuristic: Penalize the words popular in the collection; give higher weights to discriminative terms.
  Query: SVM Tutorial. Assume IDF(SVM) > IDF(Tutorial).
  Doc 1: … SVM SVM SVM Tutorial Tutorial …
  Doc 2: … Tutorial SVM SVM Tutorial Tutorial …
  Both documents contain five query-term occurrences; Doc 1 has more of the more discriminative term SVM, so TDC requires f(d1,q) ≥ f(d2,q).

  14. Benefits of Constraint Analysis
  • Provide an approximate bound for the parameters
  • A constraint may be satisfied only if the parameter is within a particular interval.
  • Compare different formulas analytically without experimentation
  • When a formula does not satisfy a constraint, it often indicates non-optimality of the formula.
  • Suggest how to improve current retrieval models
  • Violation of constraints may pinpoint where a formula needs to be improved.

  15. Benefit 1: Bounding Parameters
  • Pivoted Normalization Method: LNC2 is satisfied only if s < 0.4.
  (Figure: average precision as a function of s, showing the parameter sensitivity of s; the empirically optimal s falls below 0.4, consistent with the bound.)
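The bound can be illustrated numerically: under LNC2, take d1 to be d2 concatenated with itself k times and ask when the pivoted formula still scores d1 at least as high. The statistics below are hypothetical, and the exact breaking point depends on the documents and collection, so this is an illustration of the analysis style rather than a derivation of the 0.4 bound:

```python
import math

def pivoted_score(c_wd, dlen, s, avdl=100, N=10000, df=50):
    tf = 1 + math.log(1 + math.log(c_wd))
    return tf / (1 - s + s * dlen / avdl) * math.log((N + 1) / df)

def lnc2_holds(s, c=1, dlen=100, k=3):
    # d1 = d2 concatenated with itself k times: counts and length both scale by k
    return pivoted_score(k * c, k * dlen, s) >= pivoted_score(c, dlen, s)

# sweeping s shows the constraint breaking once s grows too large
satisfied = [s for s in (0.1, 0.2, 0.3, 0.5, 0.7) if lnc2_holds(s)]
# -> [0.1, 0.2, 0.3] with these illustrative numbers
```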

  16. Benefit 2: Analytical Comparison
  • Okapi Method: its IDF part becomes negative when df(w) is large, causing many constraints to be violated.
  (Figures: average precision vs. s or b for the Okapi and Pivoted methods, on keyword queries and on verbose queries; Okapi is noticeably worse on verbose queries.)

  17. Benefit 3: Improving Retrieval Formulas
  • Modified Okapi Method: replace Okapi's IDF part with a non-negative IDF, making Okapi satisfy more constraints; expected to help verbose queries.
  (Figures: average precision vs. s or b for the Pivoted, Okapi, and Modified Okapi methods on keyword and verbose queries.)
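One natural non-negative replacement, in the style of the pivoted formula's IDF, is ln((N+1)/df(w)); a one-line sketch (assuming this is the substitution intended, with illustrative N and df values):

```python
import math

def modified_okapi_idf(N, df):
    # pivoted-style IDF, ln((N+1)/df(w)): positive for every df(w) <= N
    return math.log((N + 1) / df)

# unlike the original Okapi IDF, it never assigns a matched term a negative weight
assert all(modified_okapi_idf(1000, df) > 0 for df in (1, 500, 999, 1000))
```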

  18. Conclusions and Future Work • Conclusions • Retrieval heuristics can be captured through formally defined constraints. • It is possible to evaluate a retrieval formula analytically through constraint analysis. • Future Work • Explore additional necessary heuristics • Apply these constraints to many other retrieval methods • Develop new retrieval formulas through constraint analysis

  19. The End Thank you!
