Modeling and Solving Term Mismatch for Full-Text Retrieval Dissertation Presentation Le Zhao Language Technologies Institute School of Computer Science Carnegie Mellon University July 26, 2012 Committee: Jamie Callan (Chair), Jaime Carbonell, Yiming Yang, Bruce Croft (UMass)
What is Full-Text Retrieval? • The task • The Cranfield evaluation [Cleverdon 1960] • abstracts away the user, • allows objective & automatic evaluations • [diagram: the User issues a Query to the Retrieval Engine, which searches the Document Collection and returns Results to the User]
Where are We (Going)? • Current retrieval models • formal models from the 1970s, best ones from the 1990s • based on simple collection statistics (tf.idf), no deep understanding of natural language texts • Perfect retrieval • needs to recognize that a document saying “… text search …” implies relevance to the query “information retrieval” • textual entailment (a difficult natural language task) • Searcher frustration [Feild, Allan and Jones 2010] • Still far away; what has been holding us back?
Two Long-Standing Problems in Retrieval • Term mismatch • [Furnas, Landauer, Gomez and Dumais 1987] • No clear definition in retrieval • Relevance (query-dependent term importance – P(t | R)) • Traditionally, idf (rareness) • P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998] • Few clues about estimation • This work • connects the two problems, • shows they can result in huge gains in retrieval, • and uses a predictive approach toward solving both problems.
What is Term Mismatch & Why Care? • Job search • You look for information retrieval jobs on the market. They want text search skills. • costs you job opportunities (50%, even if you are careful) • Legal discovery • You look for bribery or foul play in corporate documents. They say grease, pay off. • costs you cases • Patent/Publication search • costs businesses • Medical record retrieval • costs lives
Prior Approaches • Document: • Full-text indexing • instead of only indexing keywords • Stemming • include morphological variants • Document expansion • inlink anchor text, user tags • Query: • Query expansion, reformulation • Both: • Latent Semantic Indexing • Translation-based models
Main Questions Answered • Definition • Significance (theory & practice) • Mechanism (what causes the problem) • Model and solution
Definition of Mismatch • mismatch P(t̄ | Rq) == 1 – term recall P(t | Rq) • directly calculated given relevance judgments for q • [Venn diagram: within the Collection, the relevant documents Relevant(q) (“all relevant jobs”) overlap the documents that contain t (“retrieval”); the relevant documents outside the overlap are the mismatched jobs] • [CIKM 2010]
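Spelled out (this follows directly from the slide's definition; the notation is mine), term recall is the fraction of query q's relevant documents that contain t, and mismatch is its complement:

```latex
P(t \mid R_q) = \frac{|\{\, d \in R_q : t \in d \,\}|}{|R_q|},
\qquad
\text{mismatch: } P(\bar{t} \mid R_q) = 1 - P(t \mid R_q)
```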
How Often do Terms Match? (Example TREC-3 topics)
Main Questions • Definition • P(t | R) or P(t̄ | R), simple, • estimated from relevant documents, • analyze mismatch • Significance (theory & practice) • Mechanism (what causes the problem) • Model and solution
Term Mismatch & Probabilistic Retrieval Models Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • Term weight for Okapi BM25 • Other advanced models behave similarly • Used as effective features in Web search engines • [formula image: the term weight splits into an idf (rareness) part and a term recall part]
Term Mismatch & Probabilistic Retrieval Models Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • “Relevance Weight”, “Term Relevance” • P(t | R): the only part about the query & relevance • [formula image: the term weight splits into an idf (rareness) part and a term recall part]
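For reference, the Binary Independence Model score referenced on these slides is, I believe, the standard Robertson-Spärck Jones form; writing it out (notation mine) makes the split into a term recall part and an idf-like part explicit:

```latex
\mathrm{score}(d) \;=\; \sum_{t \in q \cap d}
  \log \frac{P(t \mid R)\,\bigl(1 - P(t \mid \bar{R})\bigr)}
            {P(t \mid \bar{R})\,\bigl(1 - P(t \mid R)\bigr)}
\;=\; \sum_{t \in q \cap d}
  \Bigl[\,\log \frac{P(t \mid R)}{1 - P(t \mid R)}
        \;+\; \log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}\,\Bigr],
\qquad P(t \mid \bar{R}) \approx \frac{df_t}{N}
```

The first summand depends only on term recall P(t | R); the second is essentially idf.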
Main Questions • Definition • Significance • Theory (on a par with idf & the only part about relevance) • Practice? • Mechanism (what causes the problem) • Model and solution
Term Mismatch & Probabilistic Retrieval Models Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • “Relevance Weight”, “Term Relevance” • P(t | R): the only part about the query & relevance • [formula image: the term weight splits into an idf (rareness) part and a term recall part]
Without Term Recall • The emphasis problem for tf.idf retrieval models • Emphasize high idf (rare) terms in query • “prognosis/viability of a political third party in U.S.” (Topic 206)
Ground Truth (Term Recall) Query: prognosis/viability of a political third party • [chart: true term recall per query term, annotated with which terms deserve emphasis and where tf.idf puts the wrong emphasis]
Top Results (Language model) Query: prognosis/viability of a political third party 1. … discouraging prognosis for 1991 … 2. … Politics … party … Robertson's viability as a candidate … 3. … political parties … 4. … there is no viable opposition … 5. … A third of the votes … 6. … politics … party … two thirds … 7. … third ranking political movement … 8. … political parties … 9. … prognosis for the Sunday school … 10. … third party provider … All are false positives: an emphasis/mismatch problem, not a precision problem. (Large web search engines do better, but still have false positives in their top 10; emphasis/mismatch is also a problem for large search engines!)
Without Term Recall • The emphasis problem for tf.idf retrieval models • Emphasize high idf (rare) terms in the query • “prognosis/viability of a political third party in U.S.” (Topic 206) • False positives throughout the rank list • especially detrimental at top ranks • Ignoring term recall hurts precision at all recall levels • How significant is the emphasis problem?
Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks) • [chart: failure categories; mismatch: 27%; fixes: recall term weighting and mismatch-guided expansion, both based on term mismatch prediction] • Failure analyses of retrieval models & techniques are still standard today
Main Questions • Definition • Significance • Theory: on a par with idf & the only part about relevance • Practice: explains common failures and other behavior (personalization, WSD, structured) • Mechanism (what causes the problem) • Model and solution
Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks) • [chart: failure categories; mismatch: 27%; fixes: recall term weighting and mismatch-guided expansion, both based on term mismatch prediction]
True Term Recall Effectiveness • +100% over BIM (in precision at all recall levels) • [Robertson and Spärck Jones 1976] • +30-80% over Language Model, BM25 (in MAP) • This work • For a new query w/o relevance judgments, we need to predict term recall • Predictions don't need to be very accurate to show performance gains
Main Questions • Definition • Significance • Theory: on a par with idf & the only part about relevance • Practice: explains common failures and other behavior; +30 to 80% potential gain from term weighting • Mechanism (what causes the problem) • Model and solution
How Often do Terms Match? • Same term, different recall (examples from TREC 3 topics) • [table: term recall varies from 0 to 1 and differs from idf]
Statistics • Term recall across all query terms (average ~55-60%) • [histograms: TREC 3 titles, 4.9 terms/query, average 55% term recall; TREC 9 descriptions, 6.3 terms/query, average 59% term recall]
Statistics • Term recall on shorter queries (average ~70%) • [histograms: TREC 9 titles, 2.5 terms/query, average 70% term recall; TREC 13 titles, 3.1 terms/query, average 66% term recall]
Statistics • Term recall is query dependent (but for many terms, variance is small) • [chart: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics]
P(t | R) vs. idf • [scatter plot: P(t | R) vs. df/N, after Greiff (1998); TREC 4 desc query terms]
Prior Prediction Approaches • Croft/Harper combination match (1979) • treats P(t | R) as a tuned constant, or estimates it from PRF • when >0.5, rewards docs that match more query terms • Greiff's (1998) exploratory data analysis • used idf to predict the overall term weight • improved over basic BIM • Metzler's (2008) generalized idf • used idf to predict P(t | R) • improved over basic BIM • Simple feature (idf), limited success • Missing piece: P(t | R) = term recall = 1 – term mismatch
What Factors can Cause Mismatch? • Topic centrality (Is the concept central to the topic?) • “Laser research related or potentially related to defense” • “Welfare laws propounded as reforms” • Synonyms (How often do they replace the original term?) • “retrieval” == “search” == … • Abstractness • “Laser research … defense”, “Welfare laws” • “Prognosis/viability” (rare & abstract)
Main Questions • Definition • Significance • Mechanism • Causes of mismatch: concepts that are not central to the topic, terms replaced by synonyms, or abstract terms replaced by more specific wording • Model and solution
Designing Features to Model the Factors • We need to • identify synonyms/searchonyms of a query term • in a query-dependent way • External resource? (WordNet, wiki, or query log) • biased (coverage problems, collection independent) • static (not query dependent) • not easy, not used here • Term-term similarity in a concept space! • Local LSI (Latent Semantic Indexing) • [diagram: the Query is run through the Retrieval Engine over the Document Collection; the top (500) results are used to build a concept space (150 dimensions)]
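A minimal Python sketch of this local-LSI step, assuming scikit-learn and that `top_docs` holds the raw text of the top 500 results for the query; the exact term weighting, dimensionality handling, and similarity measure in the dissertation may differ:

```python
# Sketch of local LSI: build a ~150-dim concept space from the top-ranked documents
# only, then find "searchonyms" of a query term by cosine similarity in that space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

def local_lsi_term_vectors(top_docs, n_dims=150):
    """Return (vocabulary, term vectors) from the top-ranked documents only."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(top_docs)                 # docs x terms
    svd = TruncatedSVD(n_components=min(n_dims, X.shape[1] - 1), random_state=0)
    svd.fit(X)
    # Term vectors in the concept space: rows of V * Sigma, one per vocabulary term.
    term_vecs = svd.components_.T * svd.singular_values_
    return vectorizer.vocabulary_, term_vecs

def searchonyms(term, vocab, term_vecs, k=10):
    """The k terms most similar to `term` (lowercased) in the local concept space."""
    if term not in vocab:
        return []
    unit = normalize(term_vecs)                            # cosine similarity via dot product
    sims = unit @ unit[vocab[term]]
    inv_vocab = {i: w for w, i in vocab.items()}
    top = np.argsort(-sims)[: k + 1]
    return [(inv_vocab[i], float(sims[i])) for i in top if i != vocab[term]][:k]
```

These neighbors and their similarities are, roughly, the raw material for the centrality and replaceability features on the next slides.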
Synonyms from Local LSI • [figure: local-LSI similarities between a query term and its candidate synonyms, related to P(t | Rq)]
Synonyms from Local LSI • [figure: synonyms of a query term in the local concept space and their relation to P(t | Rq)] • (1) Magnitude of self-similarity – term centrality • (2) Avg similarity of supporting terms – concept centrality • (3) How likely synonyms are to replace term t in the collection – replaceability
Features that Model the Factors (correlation with P(t | R)) • idf: –0.1339 • Term centrality: 0.3719 – self-similarity (length of t) after dimension reduction • Concept centrality: 0.3758 – avg similarity of supporting terms (top synonyms) • Replaceability: –0.1872 – how frequently synonyms appear in place of the original query term in collection documents • Abstractness: –0.1278 – users modify abstract terms with concrete terms, e.g. “effects on the US educational program”, “prognosis of a political third party”
Prediction Model: regression modeling • Model M: <f1, f2, .., f5> -> P(t | R) • Train on one set of queries (known relevance), test on another set of queries (unknown relevance) • RBF-kernel support vector regression (SVR)
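A sketch of this setup with scikit-learn; the feature values, hyperparameters, and train/test data below are placeholders, and only the model choice (RBF-kernel SVR over the five features) comes from the slide:

```python
# RBF-kernel support vector regression mapping the five features to P(t | R).
import numpy as np
from sklearn.svm import SVR

# X_train: one row per (training-query, term) pair, columns = <f1 .. f5> (toy values).
# y_train: true term recall P(t | R) computed from relevance judgments.
X_train = np.array([[0.12, 0.80, 0.55, 0.10, 0.02],
                    [0.45, 0.30, 0.20, 0.60, 0.15]])
y_train = np.array([0.85, 0.35])

model = SVR(kernel="rbf", C=1.0, epsilon=0.05)
model.fit(X_train, y_train)

# Predict term recall for the terms of an unseen test query, then clip to [0, 1].
X_test = np.array([[0.20, 0.70, 0.40, 0.25, 0.05]])
pred_recall = np.clip(model.predict(X_test), 0.0, 1.0)
print(pred_recall)
```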
A General View of Retrieval Modeling as Transfer Learning • The traditional, restricted view sees a retrieval model as a document classifier for a given query. • The more general view: a retrieval model really is a meta-classifier, responsible for many queries, mapping a query to a document classifier. • Learning a retrieval model == transfer learning: using knowledge from related tasks (training queries) to classify documents for a new task (test query). • Our features and model facilitate the transfer. • More general view -> more principled investigations and more advanced techniques
Experiments • Term recall prediction error • L1 loss (absolute prediction error) • Term recall based term weighting retrieval • Mean Average Precision (overall retrieval success) • Precision at top 10 (precision at top of rank list)
Term Recall Prediction Example • Query: prognosis/viability of a political third party (trained on TREC 3) • [chart: predicted term recall per query term, showing which terms receive emphasis]
Term Recall Prediction Error (L1 loss; the lower, the better) • [chart: L1 loss of the predictions on each train -> test dataset pair]
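The L1 loss reported here is, as I read it, the mean absolute difference between predicted and true term recall over the evaluated (term, query) pairs T (whether the average is taken per query or over all terms is not shown on the slide):

```latex
L_1 \;=\; \frac{1}{|T|} \sum_{(t,\,q) \in T} \bigl|\, \hat{P}(t \mid R_q) - P(t \mid R_q) \,\bigr|
```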
Main Questions • Definition • Significance • Mechanism • Model and solution • Can be predicted; framework to design and evaluate features
Using P(t | R) in Retrieval Models • In BM25 • Binary Independence Model • In Language Modeling (LM) • Relevance Model [Lavrenko and Croft 2001] • Only term weighting, no expansion.
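A hedged sketch of how the predicted recall can be folded into these models as pure term weighting; the exact formulas in the dissertation may differ. In the BM25/BIM relevance weight, the predicted P̂(t | R) takes the place of the usual constant; in LM, it can act as a relevance-model-style weight on the query terms:

```latex
% BM25 / BIM relevance weight with predicted term recall
% (P(t \mid C): collection / df-based estimate of the non-relevant term probability)
w_t \;=\; \log \frac{\hat{P}(t \mid R)\,\bigl(1 - P(t \mid C)\bigr)}{P(t \mid C)\,\bigl(1 - \hat{P}(t \mid R)\bigr)}

% Language modeling: weighted query likelihood, weights normalized over the query terms
\mathrm{score}(d) \;=\; \sum_{t \in q} \frac{\hat{P}(t \mid R)}{\sum_{t' \in q} \hat{P}(t' \mid R)}\;\log P(t \mid d)
```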
Predicted Recall Weighting • 10-25% gain (MAP) • [table: recall-weighted LM vs. baseline LM on desc queries; datasets listed as train -> test; “*”: significantly better by sign & randomization tests]
Predicted Recall Weighting • 10-20% gain (top precision) • [table: recall-weighted LM vs. baseline LM on desc queries; datasets listed as train -> test; “*”: Prec@10 significantly better; “!”: Prec@20 significantly better]
vs. Relevance Model • Relevance Model [Lavrenko and Croft 2001]: unsupervised; weights come from term occurrence in the top documents of a query-likelihood run • [scatter plot: RM weight (x) vs. term recall (y); the model's Pm(t | R) roughly tracks the true P(t | R)] • Supervised prediction is 5-10% better than the unsupervised estimate
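For reference, the unsupervised relevance-model weight being compared against here is commonly written as (RM1, Lavrenko and Croft 2001; notation mine), summing over the top-retrieved documents:

```latex
P_m(t \mid R) \;\propto\; \sum_{d \in D_{\mathrm{top}}} P(t \mid d)\,\prod_{q_i \in q} P(q_i \mid d)
```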
Main Questions • Definition • Significance • Mechanism • Model and solution • Term weighting solves emphasis problem for long queries • Mismatch problem?
Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks) • [chart: failure categories; mismatch: 27%; fixes: recall term weighting and mismatch-guided expansion, both based on term mismatch prediction]
Recap: Term Mismatch • Term mismatch ranges 30%-50% on average • Relevance matching can degrade quickly for multi-word queries • Solution: Fix every query term [SIGIR 2012]
Conjunctive Normal Form (CNF) Expansion Example • keyword query: placement of cigarette signs on television watched by children • Manual CNF: (placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND (cigarette OR cigar OR tobacco) AND (television OR TV OR cable OR network) AND (watch OR view) AND (children OR teen OR juvenile OR kid OR adolescent) • Expressive & compact (1 CNF == 100s of keyword alternatives) • Highly effective (this work: 50-300% over the keyword baseline) • Used by lawyers, librarians and other expert searchers • But tedious & difficult to create, and little researched
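An illustrative Python sketch (not from the dissertation) of what a CNF query is operationally: a list of OR-groups that are ANDed together, with the groups taken from the slide's example; tokenization and stemming are deliberately simplified:

```python
# Represent the manual CNF from the slide as a list of synonym sets (OR-groups),
# and test whether a document's term set satisfies every AND-group.
cnf_query = [
    {"placement", "place", "promotion", "logo", "sign", "signage", "merchandise"},
    {"cigarette", "cigar", "tobacco"},
    {"television", "tv", "cable", "network"},
    {"watch", "view"},
    {"children", "teen", "juvenile", "kid", "adolescent"},
]

def matches_cnf(doc_text, cnf):
    """A document matches iff every AND-group contributes at least one OR-term."""
    doc_terms = set(doc_text.lower().split())
    return all(group & doc_terms for group in cnf)

print(matches_cnf("tobacco signage shown on network television, watch out for kid viewers",
                  cnf_query))   # True: every group is covered
```

In practice such a query would be handed to the retrieval engine's boolean or structured query operators rather than evaluated document by document like this.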
Diagnostic Intervention Query: placement of cigarette signs on television watched by children • Goal: least amount of user effort -> near-optimal performance • E.g. expand 2 terms -> 90% of total improvement • Diagnosis (which 2 terms to fix): low term-recall terms vs. high-idf (rare) terms; each picks a different pair of query terms • Expansion (CNF), low-recall diagnosis: (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent) • Expansion (CNF), high-idf diagnosis: (placement OR place OR promotion OR sign OR signage OR merchandise) AND cigar AND (television OR tv OR cable OR network) AND watch AND children
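A small Python sketch of the diagnosis step, under the assumption that predicted term recall drives it: flag the k query terms with the lowest predicted P(t | R) and ask the user to expand only those. The prediction values below are invented for illustration:

```python
# Pick the k query terms most likely to mismatch (lowest predicted term recall)
# as the ones worth the user's CNF-expansion effort.
def terms_to_expand(predicted_recall, k=2):
    """predicted_recall: dict term -> predicted P(t | R); return the k lowest-recall terms."""
    return sorted(predicted_recall, key=predicted_recall.get)[:k]

predicted_recall = {          # hypothetical predictions for the example query
    "placement": 0.30, "cigarette": 0.45, "signs": 0.55,
    "television": 0.70, "watched": 0.80, "children": 0.50,
}
print(terms_to_expand(predicted_recall, k=2))   # ['placement', 'cigarette']
```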