Information Retrieval CSE 8337 Spring 2003 Query Operations • Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto (http://www.sims.berkeley.edu/~hearst/irbook/); Prof. Raymond J. Mooney, CS378, University of Texas; Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, McGraw-Hill, 1983; Automatic Text Processing by Gerard Salton, Addison-Wesley, 1989.
Operations TOC • Introduction • Relevance Feedback • Query Expansion • Term Reweighting • Automatic Local Analysis • Query Expansion using Clustering • Automatic Global Analysis • Query Expansion using Thesaurus • Similarity Thesaurus • Statistical Thesaurus • Complete Link Algorithm
Query Operations Introduction • IR queries as stated by the user may not be precise or effective. • There are many techniques to improve a stated query and then process that query instead.
Relevance Feedback • Use user judgments of the relevance of previously returned documents to create new queries (or modify old ones). • Technique: • Increase the weights of terms from relevant documents. • Decrease the weights of terms from non-relevant documents. • Figure 10.4 in Automatic Text Processing • Figure 6-10 in Introduction to Modern Information Retrieval
Relevance Feedback • After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents. • Use this feedback information to reformulate the query. • Produce new results based on reformulated query. • Allows more interactive, multi-pass process.
Relevance Feedback Architecture [Figure: the query string is submitted to the IR system, which ranks documents from the document corpus (1. Doc1, 2. Doc2, 3. Doc3, …); the user's relevance feedback on these rankings drives query reformulation, producing a revised query whose re-ranked results (1. Doc2, 2. Doc4, 3. Doc5, …) are returned to the user]
Query Reformulation • Revise query to account for feedback: • Query Expansion: Add new terms to query from relevant documents. • Term Reweighting: Increase weight of terms in relevant documents and decrease weight of terms in irrelevant documents. • Several algorithms for query reformulation.
Query Reformulation for VSR • Change the query vector using vector algebra. • Add the vectors for the relevant documents to the query vector. • Subtract the vectors for the irrelevant docs from the query vector. • This both adds positively and negatively weighted terms to the query and reweights the initial terms.
Optimal Query • Assume that the set of relevant documents Cr is known. • Then the best query, the one that ranks all and only the relevant documents at the top, is:
q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj
where N is the total number of documents.
Standard Rocchio Method • Since the full set of relevant documents is unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of retrieved documents and include the initial query q:
q_m = α·q + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj
• α: tunable weight for the initial query. • β: tunable weight for relevant documents. • γ: tunable weight for irrelevant documents.
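A minimal NumPy sketch of the Rocchio update, assuming the query and documents are already weight vectors over a shared vocabulary (all names here are illustrative, and clipping negative weights to zero is a common but not universal choice):

```python
import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=1.0, gamma=1.0):
    """Rocchio update: alpha*q + beta*centroid(Dr) - gamma*centroid(Dn)."""
    q_new = alpha * q
    if len(rel_docs):
        q_new = q_new + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q_new = q_new - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q_new, 0.0)  # clip negative term weights to zero

# Toy 4-term vocabulary: one relevant, one non-relevant document.
q = np.array([1.0, 0.0, 0.0, 0.0])
rel = np.array([[0.8, 0.6, 0.0, 0.0]])
nonrel = np.array([[0.0, 0.0, 0.9, 0.4]])
print(rocchio(q, rel, nonrel))  # term 2 gains weight; terms 3-4 stay at 0
```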
Ide Regular Method • Since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback:
q_m = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj
• α: tunable weight for the initial query. • β: tunable weight for relevant documents. • γ: tunable weight for irrelevant documents.
Ide “Dec Hi” Method • Bias towards rejecting just the highest ranked of the irrelevant documents:
q_m = α·q + β Σ_{dj ∈ Dr} dj − γ·d_max, where d_max is the highest-ranked irrelevant document
• α: tunable weight for the initial query. • β: tunable weight for relevant documents. • γ: tunable weight for the irrelevant document.
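The Ide variants are one-line changes to the Rocchio sketch above; a hedged sketch in the same setting (vectors over a shared vocabulary):

```python
import numpy as np

def ide_regular(q, rel_docs, nonrel_docs, alpha=1.0, beta=1.0, gamma=1.0):
    # Unnormalized sums: more feedback documents move the query further.
    return alpha * q + beta * rel_docs.sum(axis=0) - gamma * nonrel_docs.sum(axis=0)

def ide_dec_hi(q, rel_docs, nonrel_by_rank, alpha=1.0, beta=1.0, gamma=1.0):
    # "Dec Hi": subtract only the highest-ranked non-relevant document.
    return alpha * q + beta * rel_docs.sum(axis=0) - gamma * nonrel_by_rank[0]
```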
Comparison of Methods • Overall, experimental results indicate no clear preference for any one of the specific methods. • All methods generally improve retrieval performance (recall & precision) with feedback. • Generally just let tunable constants equal 1.
Fair Evaluation of Relevance Feedback • Remove from the corpus any documents for which feedback was provided. • Measure recall/precision performance on the remaining residual collection. • Compared to complete corpus, specific recall/precision numbers may decrease since relevant documents were removed. • However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback. • Fig 10.5 in Automatic Text Processing
Evaluating Relevance Feedback • Test-and-control Collection • Divide the document collection into two parts • Use the test portion to perform relevance feedback and to modify the query • Run tests on the control portion using both the original and modified query • Compare results
Why is Feedback Not Widely Used? • Users sometimes reluctant to provide explicit feedback. • Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one. • Makes it harder to understand why a particular document was retrieved.
Pseudo Feedback • Use relevance feedback methods without explicit user input. • Just assume the top m retrieved documents are relevant, and use them to reformulate the query. • Allows for query expansion that includes terms that are correlated with the query terms.
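A sketch of pseudo feedback under these assumptions: retrieval returns ranked document vectors, the top m are blindly treated as relevant, and a positive-only Rocchio step (no non-relevant term) is applied:

```python
import numpy as np

def pseudo_feedback(q, ranked_docs, m=10, alpha=1.0, beta=0.75):
    """Blind feedback: treat the top-m retrieved documents as relevant.
    ranked_docs: array of document vectors, best-ranked first."""
    return alpha * q + beta * np.mean(ranked_docs[:m], axis=0)
```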
Pseudo Feedback Results • Found to improve performance on the TREC ad hoc retrieval task. • Works even better if the top documents must also satisfy additional boolean constraints in order to be used in feedback.
Term Reweighting for PM • Use statistics found in the retrieved documents to re-estimate the probabilistic model's term weights: • Dr – set of documents that are relevant and retrieved • Dr,i – subset of Dr whose documents contain ki • The standard feedback estimates are P(ki|R) = |Dr,i| / |Dr| and P(ki|¬R) = (ni − |Dr,i|) / (N − |Dr|), where ni is the number of documents containing ki and N the collection size.
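A sketch of the resulting term-relevance weight, using the common +0.5 adjustment so the logs stay defined when counts are zero (variable names are illustrative):

```python
import math

def prob_feedback_weight(n_i, N, dr, dri):
    """Term weight for k_i after feedback.
    n_i: docs containing k_i; N: total docs;
    dr: |Dr| (relevant retrieved); dri: |Dr,i| (those containing k_i)."""
    p_r = (dri + 0.5) / (dr + 1.0)             # estimate of P(k_i | R)
    p_nr = (n_i - dri + 0.5) / (N - dr + 1.0)  # estimate of P(k_i | not R)
    return math.log(p_r / (1.0 - p_r)) + math.log((1.0 - p_nr) / p_nr)

print(prob_feedback_weight(n_i=20, N=1000, dr=10, dri=8))
```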
Term Reweighting • No query expansion • Document term weights not used • Query term weights not used • Therefore, not usually as effective as the previous vector approach.
Local vs. Global Automatic Analysis • Local – Documents retrieved are examined to automatically determine query expansion. No relevance feedback needed. • Global – Thesaurus used to help select terms for expansion.
Automatic Local Analysis • At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents. • Base correlation analysis only on the “local” set of documents retrieved for a specific query. • Avoids ambiguity by determining similar (correlated) terms only within relevant documents. • “Apple computer” → “Apple computer Powerbook laptop”
Automatic Local Analysis • Expand the query with terms found in local clusters. • Dl – set of documents retrieved for query q. • Vl – set of words used in Dl. • Sl – set of distinct stems in Vl. • fsi,j – frequency of stem si in document dj of Dl. • Construct a stem-stem association matrix.
Association Matrix • A stem-stem matrix with rows and columns w1, w2, w3, …, wn and entries cij:
c_ij = Σ_{dk ∈ Dl} f_{i,k} × f_{j,k}
• cij: correlation factor between stem si and stem sj • fik: frequency of term i in document k
Normalized Association Matrix • The frequency-based correlation factor favors more frequent terms. • Normalize the association scores:
s_ij = c_ij / (c_ii + c_jj − c_ij)
• The normalized score is 1 if two stems have the same frequency in all documents.
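A small sketch computing both matrices, assuming F is a stem-by-document frequency matrix over the local set Dl and every stem occurs at least once (so the normalizing denominator is non-zero):

```python
import numpy as np

def association_matrices(F):
    """F[i, k] = frequency of stem s_i in document d_k of D_l.
    Raw:        c_ij = sum_k F[i,k] * F[j,k]
    Normalized: s_ij = c_ij / (c_ii + c_jj - c_ij)"""
    C = F @ F.T
    diag = np.diag(C)
    S = C / (diag[:, None] + diag[None, :] - C)
    return C, S

F = np.array([[2.0, 1.0, 0.0],   # stem 1
              [2.0, 1.0, 0.0],   # stem 2: same frequencies as stem 1
              [0.0, 3.0, 1.0]])  # stem 3
C, S = association_matrices(F)
print(S[0, 1])  # 1.0 -- identical frequency profiles normalize to 1
```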
Metric Correlation Matrix • Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents. • Metric correlations account for term proximity:
c_ij = Σ_{ku ∈ Vi} Σ_{kv ∈ Vj} 1 / r(ku, kv)
• Vi: set of all occurrences of term i in any document. • r(ku, kv): distance in words between occurrences ku and kv (taken to be ∞ if ku and kv are occurrences in different documents).
Normalized Metric Correlation Matrix • Normalize scores to account for the number of term occurrences:
s_ij = c_ij / (|Vi| × |Vj|)
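A brute-force sketch of both metric matrices, assuming each stem maps to a list of (doc_id, word_position) occurrences; occurrence pairs from different documents contribute nothing, matching r = ∞:

```python
def metric_correlations(positions):
    """positions[stem] = list of (doc_id, word_position) occurrences.
    Raw:        c_ij = sum over occurrence pairs of 1 / r(k_u, k_v)
    Normalized: s_ij = c_ij / (|V_i| * |V_j|)"""
    raw, norm = {}, {}
    for si, occ_i in positions.items():
        for sj, occ_j in positions.items():
            c = 0.0
            for d1, p1 in occ_i:
                for d2, p2 in occ_j:
                    if d1 == d2 and (d1, p1) != (d2, p2):
                        c += 1.0 / abs(p1 - p2)
            raw[si, sj] = c
            norm[si, sj] = c / (len(occ_i) * len(occ_j))
    return raw, norm

raw, norm = metric_correlations({
    "apple":  [(0, 3), (1, 7)],
    "laptop": [(0, 5), (1, 2)],
})
print(raw["apple", "laptop"])  # 1/2 + 1/5 = 0.7
```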
Query Expansion with Correlation Matrix • For each term i in the query, expand the query with the n terms j having the highest value of cij (or the normalized sij). • This adds semantically related terms in the “neighborhood” of the query terms.
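A sketch of the expansion step, assuming `S` is any {(si, sj): score} mapping such as the `norm` dict from the metric-correlation sketch above:

```python
def expand_query(query_stems, all_stems, S, n=3):
    """Add, for each query stem, the n most-correlated other stems."""
    expanded = set(query_stems)
    for qi in query_stems:
        ranked = sorted((s for s in all_stems if s != qi),
                        key=lambda s: S[qi, s], reverse=True)
        expanded.update(ranked[:n])
    return expanded

print(expand_query({"apple"}, {"apple", "laptop"}, norm, n=1))
```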
Problems with Local Analysis • Term ambiguity may introduce irrelevant statistically correlated terms. • “Apple computer” → “Apple red fruit computer” • Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
Automatic Global Analysis • Determine term similarity through a pre-computed statistical analysis of the complete corpus. • Compute association matrices which quantify term correlations in terms of how frequently they co-occur. • Expand queries with statistically most similar terms.
Automatic Global Analysis • There are two modern variants based on a thesaurus-like structure built using all documents in the collection: • Query Expansion based on a Similarity Thesaurus • Query Expansion based on a Statistical Thesaurus
Thesaurus • A thesaurus provides information on synonyms and semantically related words and phrases. • Example: physician • syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones • rel: medic, general practitioner, surgeon
Thesaurus-based Query Expansion • For each term t in a query, expand the query with synonyms and related words of t from the thesaurus. • May weight added terms less than original query terms. • Generally increases recall. • May significantly decrease precision, particularly with ambiguous terms. • “interest rate” → “interest rate fascinate evaluate”
Similarity Thesaurus • The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. • These relationships are not derived directly from co-occurrence of terms inside documents. • They are obtained by considering the terms as concepts in a concept space. • In this concept space, each term is indexed by the documents in which it appears. • Terms assume the original role of documents, while documents are interpreted as indexing elements.
Similarity Thesaurus • The following definitions establish the proper framework • t: number of terms in the collection • N: number of documents in the collection • fi,j: frequency of occurrence of the term ki in the document dj • tj: vocabulary of document dj • itfj: inverse term frequency for document dj
Similarity Thesaurus • The inverse term frequency for document dj:
itf_j = log( t / |tj| )
• With each term ki is associated a vector
k_i = (w_{i,1}, w_{i,2}, …, w_{i,N})
Similarity Thesaurus • where wi,j is a weight associated with the index-document pair [ki, dj]. These weights are computed as follows:
w_{i,j} = ( (0.5 + 0.5 · f_{i,j} / max_j(f_{i,j})) · itf_j ) / sqrt( Σ_{l=1..N} (0.5 + 0.5 · f_{i,l} / max_j(f_{i,j}))² · itf_l² )
• where max_j(f_{i,j}) is the maximum frequency of term ki over all documents.
Similarity Thesaurus • The relationship between two terms ku and kv is computed as a correlation factor cu,v given by
c_{u,v} = k_u · k_v = Σ_{dj} w_{u,j} × w_{v,j}
• The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of index terms [ku, kv] in the collection
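A NumPy sketch of the whole construction, assuming F is a t × N raw-frequency matrix and every term occurs in at least one document (names are illustrative):

```python
import numpy as np

def thesaurus_correlations(F):
    """F[i, j] = f_{i,j}, frequency of term k_i in document d_j.
    Returns the t x t matrix of correlation factors c_{u,v}."""
    t, N = F.shape
    itf = np.log(t / (F > 0).sum(axis=0))  # itf_j = log(t / |t_j|)
    tf = np.where(F > 0, 0.5 + 0.5 * F / F.max(axis=1, keepdims=True), 0.0)
    W = tf * itf                           # w_{i,j} before normalization
    W = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-length term vectors
    return W @ W.T                         # c_{u,v} = k_u . k_v

F = np.array([[2.0, 0.0, 1.0],   # term A
              [0.0, 1.0, 0.0],   # term B
              [1.0, 2.0, 0.0],   # term C
              [0.0, 0.0, 3.0]])  # term D
print(np.round(thesaurus_correlations(F), 3))
```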
Similarity Thesaurus • This computation is expensive. • However, the global similarity thesaurus only has to be computed once and can be updated incrementally.
Query Expansion based on a Similarity Thesaurus • Query expansion is done in three steps as follows: • Represent the query in the concept space used for representation of the index terms • Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q. • Expand the query with the top r ranked terms according to sim(q,kv)
Query Expansion - step one • With the query q is associated a vector q in the term-concept space, given by
q = Σ_{ki ∈ q} w_{i,q} · k_i
• where wi,q is a weight associated with the index-query pair [ki, q]
Query Expansion - step two • Compute the similarity sim(q, kv) between each term kv and the user query q:
sim(q, kv) = q · k_v = Σ_{ku ∈ q} w_{u,q} × c_{u,v}
• where cu,v is the correlation factor
Query Expansion - step three • Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q’ • Each expansion term kv in the query q’ is assigned a weight wv,q’ given by
w_{v,q’} = sim(q, kv) / Σ_{ku ∈ q} w_{u,q}
• The expanded query q’ is then used to retrieve new documents for the user
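A sketch tying the three steps together, assuming `C` comes from the `thesaurus_correlations` sketch above and `w_q` holds the wi,q weights (zero for terms outside q):

```python
import numpy as np

def expand_query_thesaurus(w_q, C, r=2):
    """sim(q, k_v) = sum_u w_{u,q} * c_{u,v}; add the top-r new terms,
    each weighted sim(q, k_v) / sum_u w_{u,q}."""
    sim = w_q @ C                                         # step two
    new_terms = [v for v in np.argsort(-sim) if w_q[v] == 0][:r]
    w_expanded = w_q.copy()                               # step three
    for v in new_terms:
        w_expanded[v] = sim[v] / w_q.sum()
    return w_expanded
```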
Query Expansion Sample • Doc1 = D, D, A, B, C, A, B, C • Doc2 = E, C, E, A, A, D • Doc3 = D, C, B, B, D, A, B, C, A • Doc4 = A • c(A,A) = 10.991 • c(A,C) = 10.781 • c(A,D) = 10.781 • ... • c(D,E) = 10.398 • c(B,E) = 10.396 • c(E,E) = 10.224
Query Expansion Sample • Query: q = A E E • sim(q,A) = 24.298 • sim(q,C) = 23.833 • sim(q,D) = 23.833 • sim(q,B) = 23.830 • sim(q,E) = 23.435 • New query: q’ = A C D E E • w(A,q')= 6.88 • w(C,q')= 6.75 • w(D,q')= 6.75 • w(E,q')= 6.64
WordNet • A more detailed database of semantic relationships between English words. • Developed by famous cognitive psychologist George Miller and a team at Princeton University. • About 144,000 English words. • Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
WordNet Synset Relationships • Antonym: front → back • Attribute: benevolence → good (noun to adjective) • Pertainym: alphabetical → alphabet (adjective to noun) • Similar: unquestioning → absolute • Cause: kill → die • Entailment: breathe → inhale • Holonym: chapter → text (part-of) • Meronym: computer → cpu (whole-of) • Hyponym: tree → plant (specialization) • Hypernym: fruit → apple (generalization)
WordNet Query Expansion • Add synonyms in the same synset. • Add hyponyms to add specialized terms. • Add hypernyms to generalize a query. • Add other related terms to expand query.
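A sketch of these expansions using NLTK's WordNet interface (assumes the `nltk` package is installed and the `wordnet` corpus has been downloaded):

```python
from nltk.corpus import wordnet as wn

def wordnet_expansions(term):
    """Synonyms (same synset), hyponyms (specialize), hypernyms (generalize)."""
    syns, hypo, hyper = set(), set(), set()
    for synset in wn.synsets(term):
        syns.update(lemma.name() for lemma in synset.lemmas())
        for h in synset.hyponyms():
            hypo.update(lemma.name() for lemma in h.lemmas())
        for h in synset.hypernyms():
            hyper.update(lemma.name() for lemma in h.lemmas())
    return syns, hypo, hyper

syns, hypo, hyper = wordnet_expansions("physician")
print(sorted(syns))  # includes e.g. 'doc', 'doctor', 'medico'
```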
Statistical Thesaurus • Existing human-developed thesauri are not easily available in all languages. • Human thesauri are limited in the type and range of synonymy and semantic relations they represent. • Semantically related terms can be discovered from statistical analysis of corpora.
Query Expansion Based on a Statistical Thesaurus • The global thesaurus is composed of classes which group correlated terms in the context of the whole collection • Such correlated terms can then be used to expand the original user query • These terms must be low-frequency terms • However, it is difficult to cluster low-frequency terms • To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes. • This algorithm must produce small and tight clusters.