Relevance Feedback and other Query Modification Techniques

Course: Information Retrieval and Recommendation Techniques
Advisor: Prof. 黃三益
Presenters: 楊錦生 (d9142801) and 曾繁絹 (d9142803), first-year PhD students

Introduction
• Precision vs. recall
  • When high recall is critical, users need to retrieve more of the relevant documents.
• Ways to retrieve more:
  • “Expand” the search by broadening a narrow Boolean query or by looking further down the ranked list of retrieved documents.
  • Modify the original query.

Introduction (cont’d)
• The “word mismatch” problem:
  • Some relevant documents go unretrieved because they are indexed by a different set of terms than those in the query or in most of the other relevant documents.
• Approaches for improving the initial query:
  • Relevance feedback
  • Automatic query modification

Conceptual Model of Relevance Feedback
[Diagram: Query → Result Set → User Relevance Feedback → New Query Based on Result Set → Result Set]

Basic Ideas about Relevance Feedback
• Two components of relevance feedback:
  • Reweighting query terms based on their distribution in the relevant and nonrelevant documents retrieved for the query
  • Changing the actual terms in the query

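Basic Ideas about Relevance Feedback (cont’d)
• Evaluation of relevance feedback
  • Comparing results after one iteration of feedback with results using no feedback generally shows a large improvement, partly because documents the user has already judged relevant are simply ranked higher.
  • A fairer evaluation compares only the residual collections, i.e., what remains after the documents already shown to the user are removed.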
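The residual-collection idea can be sketched as follows. This is a minimal illustration, not the evaluation procedure of any particular study; the function name `residual_precision` and the document IDs are hypothetical.

```python
def residual_precision(ranking, relevant, seen, k):
    """Precision at k over the residual collection.

    ranking:  document IDs in ranked order
    relevant: set of IDs judged relevant for the query
    seen:     set of IDs already shown to the user during feedback
    """
    # Drop everything the user has already seen, keeping the original order.
    residual = [doc_id for doc_id in ranking if doc_id not in seen]
    top = residual[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


# Hypothetical run: d1 and d2 were shown to the user in the first iteration.
after_feedback = ["d1", "d5", "d2", "d6"]
print(residual_precision(after_feedback, {"d1", "d5"}, {"d1", "d2"}, k=2))
```

Applying the same function to the no-feedback ranking, with the same `seen` set, gives the baseline to compare against.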

Basic Approach to Relevance Feedback
• Rocchio’s approach used the vector space model to rank documents.
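For reference, Rocchio's feedback formula is usually written in the form below. This is the commonly cited version rather than a transcription of this slide, and the symbol names are the conventional ones.

```latex
Q_1 = Q_0 + \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \frac{1}{n_2} \sum_{i=1}^{n_2} S_i
```

Here Q_0 is the initial query vector, Q_1 the modified query vector, R_i the vector of relevant document i, S_i the vector of nonrelevant document i, and n_1 and n_2 the numbers of relevant and nonrelevant documents judged.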

Ide developed three strategies extending Rocchio’s approach
• The basic Rocchio formula, minus the normalization for the number of relevant and nonrelevant documents
• Feedback from relevant documents only
• Limited negative feedback, from only the highest-ranked nonrelevant document

Term Reweighting without Query Expansion
• A probabilistic model proposed by Robertson and Sparck Jones (1976), defined over the following quantities:
  W_ij = the term weight for term i in query j
  r = the number of relevant documents for query j having term i
  R = the total number of relevant documents for query j
  n = the number of documents in the collection having term i
  N = the number of documents in the collection
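A small sketch of how such a weight could be computed from the four counts defined above. It assumes the commonly cited form W_ij = log[(r / (R - r)) / ((n - r) / (N - n - R + r))]. The smoothing constant `k` is an addition that avoids division by zero and log of zero when a count is 0, and the natural log is an arbitrary choice of base.

```python
import math


def rsj_weight(r, R, n, N, k=0.5):
    """Relevance weight of a term from relevance and collection counts.

    r: relevant documents containing the term
    R: all relevant documents for the query
    n: documents in the collection containing the term
    N: documents in the collection
    k: smoothing added to every cell (assumption; use k=0 for the raw form)
    """
    odds_in_relevant = (r + k) / (R - r + k)
    odds_in_nonrelevant = (n - r + k) / (N - n - R + r + k)
    return math.log(odds_in_relevant / odds_in_nonrelevant)


# Hypothetical counts: the term occurs in 8 of 10 relevant documents
# but in only 20 of the 1000 documents in the collection.
print(rsj_weight(r=8, R=10, n=20, N=1000))
```

A term that is no more frequent in the relevant set than in the rest of the collection gets a weight near zero, so it contributes little to the ranking.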

Term Reweighting without Query Expansion (cont’d)
• Croft (1983) extended this weighting scheme, with one formula for the initial search and one for feedback, defined over the following quantities:
  W_ijk = the term weight for term i in query j and document k
  IDF_i = the IDF weight for term i in the entire collection
  p_ij = the probability that term i is assigned within the set of relevant documents for query j
  q_ij = the probability that term i is assigned within the set of nonrelevant documents for query j
  f_ik = K + (1 - K)(freq_ik / maxfreq_k)
  freq_ik = the frequency of term i in document k
  maxfreq_k = the maximum frequency of any term in document k
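The sketch below shows how these quantities could fit together. It assumes the forms usually attributed to Croft (1983): W_ijk = (C + IDF_i) * f_ik for the initial search, and W_ijk = (C + log[p_ij (1 - q_ij) / ((1 - p_ij) q_ij)]) * f_ik for feedback. C and K are tuning constants; the defaults here are placeholders, not recommended settings.

```python
import math


def f_ik(freq, maxfreq, K=0.3):
    """Within-document frequency factor: K + (1 - K) * (freq / maxfreq)."""
    return K + (1 - K) * (freq / maxfreq)


def initial_weight(idf, freq, maxfreq, C=1.0, K=0.3):
    """Weight for the initial search, before any relevance information exists."""
    return (C + idf) * f_ik(freq, maxfreq, K)


def feedback_weight(p, q, freq, maxfreq, C=1.0, K=0.3):
    """Weight after feedback; p and q must lie strictly between 0 and 1."""
    relevance_part = math.log(p * (1 - q) / ((1 - p) * q))
    return (C + relevance_part) * f_ik(freq, maxfreq, K)


# Hypothetical term: frequent in relevant documents (p=0.8), rare elsewhere (q=0.1).
print(initial_weight(idf=2.0, freq=3, maxfreq=10))
print(feedback_weight(p=0.8, q=0.1, freq=3, maxfreq=10))
```

Setting K close to 1 makes within-document frequency matter less, which is the role the slide's definition of f_ik suggests for that constant.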

Query Expansion
• The query can be expanded by:
  • offering users a selection of the terms most closely related to the initial query terms (a thesaurus)
  • presenting users with a sorted list of terms from the relevant documents or from all retrieved documents

Query Expansion (cont’d)
• Terms from relevant/nonrelevant documents are ranked and proposed as a list; the top N terms are then either:
  • selected by the user, or
  • added to the query automatically
• The early SMART experiments both expanded the query and reweighted the query terms by combining the query vector with the vectors of the relevant and nonrelevant documents.
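A minimal sketch of ranking candidate terms and taking the top N. Ranking by the number of relevant documents that contain a term is only a stand-in for whatever ranking method is used; the function names and sample documents are hypothetical.

```python
from collections import Counter


def candidate_terms(relevant_docs, query_terms, top_n=3):
    """Rank non-query terms by how many relevant documents contain them."""
    doc_freq = Counter()
    for doc in relevant_docs:       # each document is a list of tokens
        doc_freq.update(set(doc))   # count each term once per document
    ranked = sorted(
        (item for item in doc_freq.items() if item[0] not in query_terms),
        key=lambda item: (-item[1], item[0]),  # most frequent first, ties alphabetical
    )
    return [term for term, _ in ranked[:top_n]]


def expand(query_terms, chosen_terms):
    """Append the chosen terms to the query, skipping duplicates."""
    return list(query_terms) + [t for t in chosen_terms if t not in query_terms]


relevant_docs = [
    ["relevance", "feedback", "rocchio", "vector"],
    ["feedback", "rocchio", "query"],
    ["feedback", "probabilistic", "query"],
]
query = ["feedback", "query"]
suggestions = candidate_terms(relevant_docs, query, top_n=2)
print(expand(query, suggestions))
```

In the interactive variant, `suggestions` would be shown to the user and only the terms they pick would be passed to `expand`.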

Query Expansion (cont’d)
• Modification using terms from relevant/nonrelevant documents:
  • Any relevant document(s) can be used as a “new query” (Noreault, 1979).
  • If no relevant documents are indicated, the term list shown to the user is the list of related terms based on those previously sorted in the inverted file.

Query Expansion with Term Reweighting
• The vast majority of relevance feedback and query expansion research has combined query expansion with term reweighting.
• Three of the most used feedback methods:
  • Ide regular

Query Expansion with Term Reweighting (cont’d)
  • Ide dec-hi, where S_i = the top-ranked nonrelevant document
  • Standard Rocchio
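The three methods can be compared side by side in a short sketch. It assumes the forms commonly given for them: Ide regular Q1 = Q0 + sum(R_k) - sum(S_k); Ide dec-hi Q1 = Q0 + sum(R_k) - S_i; standard Rocchio Q1 = Q0 + beta * sum(R_k) / n1 - gamma * sum(S_k) / n2. Vectors are plain lists of term weights, and the `beta` and `gamma` defaults are placeholders.

```python
def add(u, v, scale=1.0):
    """Return u + scale * v for two equal-length vectors."""
    return [a + scale * b for a, b in zip(u, v)]


def vec_sum(vectors, dim):
    """Sum a list of vectors; an empty list gives the zero vector."""
    total = [0.0] * dim
    for v in vectors:
        total = add(total, v)
    return total


def ide_regular(q0, relevant, nonrelevant):
    """Add all relevant vectors and subtract all nonrelevant ones, unnormalized."""
    q1 = add(q0, vec_sum(relevant, len(q0)))
    return add(q1, vec_sum(nonrelevant, len(q0)), scale=-1.0)


def ide_dec_hi(q0, relevant, nonrelevant_ranked):
    """Subtract only the top-ranked nonrelevant vector (first in the list)."""
    q1 = add(q0, vec_sum(relevant, len(q0)))
    if nonrelevant_ranked:
        q1 = add(q1, nonrelevant_ranked[0], scale=-1.0)
    return q1


def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Normalize each sum by the number of documents it covers."""
    q1 = list(q0)
    if relevant:
        q1 = add(q1, vec_sum(relevant, len(q0)), scale=beta / len(relevant))
    if nonrelevant:
        q1 = add(q1, vec_sum(nonrelevant, len(q0)), scale=-gamma / len(nonrelevant))
    return q1


q0 = [1, 0, 0]
relevant = [[0, 1, 0], [0, 1, 1]]
nonrelevant = [[1, 0, 1], [0, 0, 1]]  # ordered by rank, best first
print(ide_regular(q0, relevant, nonrelevant))
print(ide_dec_hi(q0, relevant, nonrelevant))
print(rocchio(q0, relevant, nonrelevant))
```

Terms that were absent from the original query pick up nonzero weights here, which is the query expansion part; existing terms change weight, which is the reweighting part. Negative weights are left as they are in this sketch, although a system might clip them to zero.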

Automatic Query Modification
• The major disadvantage of relevance feedback is that it increases the burden on users [X97].
• Approaches to automatic query modification:
  • Local feedback
  • Automatic query expansion
    • Dictionary-based
    • Global analysis
    • Local analysis

Local Feedback
• Local feedback is similar to relevance feedback.
• Difference: the top-ranked documents are assumed to be relevant, without human judgment.
• This saves the cost of relevance judgments, but retrieval can degrade if the top-ranked documents are in fact nonrelevant.
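A minimal sketch of the local feedback loop described above. Documents and queries are term-weight dictionaries, scoring is a plain inner product, and the update simply adds the averaged term counts of the assumed-relevant documents; all of these are simplifying assumptions, and the names are hypothetical.

```python
from collections import Counter


def score(query, doc):
    """Inner product between query weights and document term counts."""
    return sum(weight * doc.get(term, 0) for term, weight in query.items())


def local_feedback(query, docs, top_k=2, n_terms=3):
    """Treat the top_k ranked documents as relevant and expand the query from them."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    pseudo_relevant = ranked[:top_k]  # no human judgment involved

    totals = Counter()
    for doc in pseudo_relevant:
        totals.update(doc)            # sum term counts over the assumed-relevant set

    new_query = dict(query)
    for term, count in totals.most_common(n_terms):
        new_query[term] = new_query.get(term, 0.0) + count / top_k
    return new_query


docs = [
    {"feedback": 2, "relevance": 2},
    {"feedback": 1, "rocchio": 1},
    {"cooking": 3},
]
print(local_feedback({"feedback": 1.0}, docs))
```

If the first retrieval had ranked the unrelated document highest, its terms would have been added to the query instead, which is the failure mode the slide warns about.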

Automatic Query Expansion
• Basic idea:
  • A user query is expanded by adding semantically similar and/or statistically associated terms, each with a corresponding weight.
  • Thesauri are needed to judge similarity.
• Two approaches to thesaurus construction:
  • Manual thesauri
  • Automatic thesauri

Dictionary-based Query Expansion
• Based on manual thesauri (e.g., WordNet [M95]).
• During expansion, synonyms of the initial query terms (or words linked to them by other semantic relations) are selected, and each is assigned a weight.
• Disadvantages:
  • Constructing a manual thesaurus requires a lot of human labor.
  • A general-purpose manual thesaurus does not consistently improve retrieval performance.
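The expansion step can be sketched with a tiny hand-built thesaurus. `TOY_THESAURUS` is a hypothetical stand-in for a resource such as WordNet, and the fixed `synonym_weight` is an assumed weighting policy, not one taken from the slides.

```python
# Hypothetical hand-built thesaurus: term -> list of synonyms.
TOY_THESAURUS = {
    "car": ["automobile", "auto"],
    "fast": ["quick"],
}


def expand_query(terms, thesaurus, synonym_weight=0.5):
    """Return term -> weight, giving original terms 1.0 and synonyms a lower weight."""
    weights = {term: 1.0 for term in terms}
    for term in terms:
        for synonym in thesaurus.get(term, []):
            # setdefault keeps the weight of a term that is already in the query
            weights.setdefault(synonym, synonym_weight)
    return weights


print(expand_query(["car", "rental"], TOY_THESAURUS))
```

Giving added terms a lower weight than the original ones limits the damage when a synonym carries a sense the user did not intend.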

Example - WordNet

Automatic Thesaurus Construction Approach
• Thesauri are constructed from the whole data corpus, or from a part of it.
• Basic idea of automatic thesaurus construction:
  • Term co-occurrence
• Methods of automatic thesaurus construction:
  • Traditional TF×IDF [Y02]
  • A variant of TF×IDF (i.e., the similarity thesaurus [QF93])
  • Association rule mining [WBO00]

Example of Thesaurus Construction
• Each term t_i is associated with a vector.
• The relationship between two terms t_u and t_v is then computed from their vectors, according to [QF93].
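The sketch below illustrates the general idea of representing terms as vectors over documents and relating two terms by the dot product of their vectors. The weighting is a simplified assumption (term count times the log of vocabulary size over document length, then unit-length normalization) and is not claimed to reproduce the exact formulas of [QF93].

```python
import math


def term_vectors(docs):
    """Map each term to a unit-length vector with one component per document.

    docs: list of dicts mapping term -> count; every document must be non-empty.
    """
    vocab = sorted({term for doc in docs for term in doc})
    # Documents with few distinct terms say more about each term they contain.
    doc_weight = [math.log(len(vocab) / len(doc)) for doc in docs]
    vectors = {}
    for term in vocab:
        raw = [doc.get(term, 0) * w for doc, w in zip(docs, doc_weight)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm for x in raw] if norm else raw
    return vectors


def similarity(vectors, u, v):
    """Dot product of two term vectors; 1.0 means identical usage patterns."""
    return sum(a * b for a, b in zip(vectors[u], vectors[v]))


docs = [
    {"data": 2, "mining": 2},
    {"data": 1, "mining": 1},
    {"crm": 1},
]
vectors = term_vectors(docs)
print(similarity(vectors, "data", "mining"))
print(similarity(vectors, "data", "crm"))
```

Computing the similarity for every pair of terms yields a term-by-term matrix, which is one way to represent an automatically built thesaurus.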

Example of Thesaurus Construction (cont’d)
[Figure: weighted term-relationship graph over the terms CRM, Knowledge Discovery, Text Mining, Data Warehouse, Data Mining, Decision Tree, Clustering Analysis, Classification Analysis, C4.5, and Prediction, with edge weights ranging from 0.12 to 0.90]

Global Analysis
• The whole document collection is used to create the thesaurus.
• Approaches:
  • Similarity thesaurus [QF93]
  • Statistical thesaurus [CY92]

Global Analysis (cont’d)
[Diagram: Data Corpus → Thesaurus Construction → Thesaurus; Initial User Query + Thesaurus → Query Expansion → Expanded Query → Retrieve → Relevant Documents]

Local Analysis
• Unlike global analysis, only the top-ranked documents are used to construct the thesaurus.
• Approaches:
  • Local clustering [AF77]
  • Local context analysis [X97, XC96, XC00]
• According to [XC96, X97, XC00], local analysis is more effective than global analysis.

Local Analysis (cont’d)
[Diagram: Initial User Query → 1st Retrieve → Top Ranked Documents → Thesaurus Construction → Query Expansion → Expanded Query → 2nd Retrieve → Relevant Documents]
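The local analysis pipeline can be sketched by scoring candidate terms on how often they co-occur with query terms in the top-ranked documents only. This is a deliberately simple co-occurrence count, not the local clustering of [AF77] or the local context analysis of [XC96]; the names and sample documents are hypothetical.

```python
from collections import Counter


def local_expansion_terms(query_terms, top_docs, n_terms=2):
    """Pick expansion terms that co-occur with query terms in the top-ranked documents.

    top_docs: token lists for the documents returned by the first retrieval.
    """
    scores = Counter()
    for doc in top_docs:
        tokens = set(doc)
        hits = sum(1 for q in query_terms if q in tokens)
        if hits == 0:
            continue  # a document without any query term contributes nothing
        for term in tokens:
            if term not in query_terms:
                scores[term] += hits
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n_terms]]


top_docs = [
    ["query", "expansion", "thesaurus"],
    ["query", "thesaurus", "feedback"],
    ["cooking", "recipe"],
]
query_terms = ["query", "expansion"]
expanded = query_terms + local_expansion_terms(query_terms, top_docs)
print(expanded)
```

The expanded query would then be used for the second retrieval; nothing outside the top-ranked set is ever scanned, which is what distinguishes this from global analysis.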

References
• [AF77] Attar, R. and Fraenkel, A. S., “Local Feedback in Full-Text Retrieval Systems,” Journal of the ACM, Vol. 24, No. 3, 1977, pp. 397-417.
• [BR99] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval, Addison Wesley/ACM Press, Harlow, England, 1999.
• [CY92] Crouch, C. J. and Yang, B., “Experiments in Automatic Statistical Thesaurus Construction,” Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1992, pp. 77-88.
• [M95] Miller, G. A., “WordNet: A Lexical Database for English,” Communications of the ACM, Vol. 38, No. 11, November 1995, pp. 39-41.
• [QF93] Qiu, Y. and Frei, H. P., “Concept Based Query Expansion,” Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993, pp. 160-169.
• [WBO00] Wei, J., Bressan, S., and Ooi, B. C., “Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results,” Proceedings of the First International Conference on Web Information Systems Engineering, Vol. 1, 2000, pp. 366-373.

References (cont’d)
• [X97] Xu, J., “Solving the Word Mismatch Problem Through Automatic Text Analysis,” PhD Thesis, University of Massachusetts at Amherst, 1997.
• [XC96] Xu, J. and Croft, W. B., “Query Expansion Using Local and Global Document Analysis,” Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996, pp. 4-11.
• [XC00] Xu, J. and Croft, W. B., “Improving the Effectiveness of Information Retrieval with Local Context Analysis,” ACM Transactions on Information Systems, Vol. 18, No. 1, 2000, pp. 79-112.
• [Y02] Yang, C., “Investigation of Term Expansion on Text Mining Techniques,” Master’s Thesis, National Sun Yat-Sen University, Taiwan, 2002.
