Introduction to Information Retrieval (Part 2) By Evren Ermis
Introduction to Information Retrieval • Retrieval models • Vector-space-model • Probabilistic model • Relevance feedback • Evaluation • Performance evaluation • Retrieval Performance evaluation • Reference Collections • Evaluation measures
Vector-space-model • Binary weights are too limiting • Non-binary weights assigned to index terms • In queries • In documents • Compute the degree of similarity • Sorting in order of similarity allows considering documents which match the query only partially
Vector-space-model • Consider every document and query as a vector of index term weights • Similarity given by the correlation between the vectors (a standard formulation is sketched below)
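A sketch of the usual correlation measure, following the textbook formulation in Baeza-Yates (the original slide formula is not reproduced here), is the cosine of the angle between the document vector d_j and the query vector q:

sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}

where w_{i,j} and w_{i,q} are the weights of index term k_i in document d_j and in query q, and t is the number of index terms.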
Vector-space-model • Does not predict whether a document is relevant or not • But ranks documents according to similarity • A document can be retrieved although it matches the query only partially • Use a threshold d to filter out documents with similarity < d
Vector-space-model Index term weights • Features that better describe the sought documents: intra-cluster similarity • Features that distinguish the sought documents from the rest: inter-cluster dissimilarity
Vector-space-model Index term weights • Intra-cluster similarity: term frequency (tf) factor • Inter-cluster dissimilarity: inverse document frequency (idf) factor
Vector-space-model Index term weights • The weight of a term in a document is then calculated as the product of the tf factor and the idf factor • Analogously for the query (see the formulas below)
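The slide's own formulas are not reproduced here; the standard tf-idf weights from Baeza-Yates are:

f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}, \qquad idf_i = \log \frac{N}{n_i}

w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}

and for the query (Salton and Buckley):

w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}

where freq_{i,j} is the raw frequency of term k_i in document d_j, N is the total number of documents, and n_i is the number of documents containing k_i.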
Vector-space-model • Advantages • Improves retrieval performance • Partial matching allowed • Sort according to similarity • Disadvantages • Assumes that index terms are independent
Probabilistic model • Assume that there is a set of documents containing exactly the relevant documents and no others (the ideal answer set) • Problem: we don't know that set's properties • Index terms are used to characterize these properties • Use an initial guess at query time to obtain a probabilistic description of the ideal answer set • Use this to retrieve a first set of documents • Interaction with the user to improve the probabilistic description of the ideal answer set
Probabilistic model • Interaction with the user to improve the probabilistic description of the ideal answer set • The probabilistic approach models this description in probabilistic terms without user interaction • Problem: we don't know how to compute the probabilities of relevance
Probabilistic model • How to compute the probabilities of relevance • As measure of similarity: P(dj relevant-to q) / P(dj non-relevant-to q) • The odds of document dj being relevant to query q • This leads to the similarity function:
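The similarity function is not shown on the slide; assuming binary index term weights and term independence, the standard formulation (Baeza-Yates) is:

sim(d_j, q) \;\sim\; \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

where R is the (ideal) set of relevant documents and \bar{R} is its complement.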
Probabilistic model • Problem: we don‘t have the set R at the beginning • Necessary to find initial probabilities • Make two assumptions: • P(kj|R) is constant for all index terms • Distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all documents
Probabilistic model • So we get: • Now we can retrieve documents containing query terms and provide an initial probabilistic ranking for them
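Under the two assumptions above, the usual initial estimates (as given in Baeza-Yates; the slide's formula is not reproduced here) are:

P(k_i \mid R) = 0.5, \qquad P(k_i \mid \bar{R}) = \frac{n_i}{N}

where n_i is the number of documents containing k_i and N is the total number of documents.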
Probabilistic model • Now we can use these retrieved documents to improve our assumed probabilities • Let V be a subset of the retrieved documents (e.g., the top-ranked ones) and Vi the subset of V containing the i-th index term; then:
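The standard re-estimation (following Baeza-Yates) is:

P(k_i \mid R) = \frac{|V_i|}{|V|}, \qquad P(k_i \mid \bar{R}) = \frac{n_i - |V_i|}{N - |V|}

(in practice these are often smoothed with a small adjustment factor, e.g. adding 0.5 to the numerator and 1 to the denominator, to avoid problems for small |V| and |V_i|).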
Probabilistic model • Advantages: • Documents are ranked in decreasing order of their probability of being relevant • Disadvantages: • Needs a guess for the initial separation of relevant and non-relevant documents • Does not consider the frequency of occurrences of an index term within a document
Relevance feedback • Query reformulation strategy • The user marks relevant documents in the retrieved set • The method selects important terms attached to the user-identified documents • The newly gained information is incorporated into a new query formulation and a reweighting of the query terms
Relevance feedback for vector model • Vectors of relevant documents have similarity among themselves • Non-relevant documents have vectors that are dissimilar to the relevant ones • Reformulate the query such that it moves closer to the term-weight vectors of the relevant documents (Rocchio's formulation, sketched below)
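A sketch of the classical reformulation for the vector model is Rocchio's formula (cited in the references), where D_r and D_n are the sets of relevant and non-relevant documents identified by the user and \alpha, \beta, \gamma are tuning constants:

\vec{q}_m = \alpha \, \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j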
Relevance feedback for probabilistic model • Replace V by Dr and Vi by Dr,i, where Dr is the set of documents the user identified as relevant and Dr,i is the subset of Dr containing the index term ki.
Relevance feedback for probabilistic model • Using this replacement and rewriting the similarity function for the probabilistic model, we get: • Reweighting of the index terms already in the query • No expansion of the query by new index terms
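The rewritten similarity function is not shown on the slide; substituting P(k_i|R) = |D_{r,i}|/|D_r| and P(k_i|\bar{R}) = (n_i - |D_{r,i}|)/(N - |D_r|) into the probabilistic similarity (adjustment factors omitted) gives the standard form:

sim(d_j, q) \;\sim\; \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} + \log \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)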
Relevance feedback for probabilistic model • Advantages: • Feedback is directly related to the derivation of new weights • Reweighting is optimal under the assumptions of • term independence • binary document indexing • Disadvantages: • Document term weights are not regarded in the feedback loop • Previous term weights in the query are disregarded • No query expansion • Not as effective as the vector modification method
Evaluation Types of evaluation: • Performance of the system (time and space) • Functional analysis in which the specified system functionalities are tested • Retrieval performance: how precise is the answer set, assessed with • a reference collection • an evaluation measure
Performance Evaluation • Performance of the indexing structures • Interaction with the operating system • Delays in communication channels • Overheads introduced by the many software layers
Retrieval performance evaluation • Reference collection consists of • collection of documents • Set of example information requests • Set of relevant documents for each request • Evaluation measure • Uses reference collection • Quantifies the similarity between the documents retrieved by a retrieval strategy and the provided set of relevant documents
Reference collection • Several different reference collections exist • TIPSTER/TREC • CACM • CISI • Cystic Fibrosis • etc. • TIPSTER/TREC is chosen for further discussion
TIPSTER/TREC • Collection used at the "Text REtrieval Conference" (TREC) • Built under the TIPSTER program • Large test collection (over 1 million documents) • For each conference a set of reference experiments is designed • Research groups use these experiments to compare their retrieval systems
Evaluation measure • Several different evaluation measures exist • Recall and precision • Average precision • Interpolated precision • Harmonic mean (F-measure) • E-measure • Satisfaction, frustration, etc. • Recall and precision, as the most widely used ones, are chosen for further discussion
Recall and precision • Definition of recall: • Recall is the fraction of the relevant documents which has been retrieved. • Definition of precision: • Precision is the fraction of the retrieved documents which is relevant.
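In symbols (following Baeza-Yates), let R be the set of documents relevant to a query, A the answer set retrieved by the system, and R_a = R \cap A the relevant documents in the answer set:

Recall = \frac{|R_a|}{|R|}, \qquad Precision = \frac{|R_a|}{|A|}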
Precision vs. Recall • Assume that all documents in A have been examined • But the user is not confronted with all documents at once • Instead they are sorted according to relevance • Recall and precision vary as the user proceeds with the examination of the documents • Proper evaluation requires a precision vs. recall curve
Average precision • Example figure for one query • To evaluate a retrieval algorithm, several distinct queries have to be run • This yields distinct precision vs. recall curves • Average the precision figures at each recall level
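The averaging step, in the usual notation (Baeza-Yates), is:

\bar{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}

where N_q is the number of queries and P_i(r) is the precision of the i-th query at recall level r.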
Interpolated precision • The recall levels for each query are usually distinct from the 11 standard recall levels (0%, 10%, …, 100%) • An interpolation procedure is necessary • Let rj be the j-th standard recall level with j = 0, 1, …, 10. Then,
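The interpolation rule is not shown on the slide; the standard rule (Baeza-Yates) is:

P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r)

i.e., the interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between r_j and r_{j+1}.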
Harmonic mean (F-measure) • The harmonic mean is defined as: • F is high only if both recall and precision are high • Therefore, the maximum of F is interpreted as the best compromise between recall and precision
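The definition, in the usual ranked-list notation (Baeza-Yates), is:

F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}

where r(j) and P(j) are the recall and precision at the j-th document in the ranking.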
E-measure • The user specifies whether recall or precision is of more interest • The E-measure is defined as: • b is user-specified and reflects the relative importance of recall and precision
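The definition is not shown on the slide; the standard form (Baeza-Yates) is:

E(j) = 1 - \frac{1 + b^2}{\frac{b^2}{r(j)} + \frac{1}{P(j)}}

where r(j) and P(j) are the recall and precision at the j-th document in the ranking; for b = 1, E(j) reduces to the complement of the harmonic mean F(j).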
Conclusion • Introduced the two most popular models for information retrieval: • Vector space model • Probabilistic model • Introduced evaluation methods to quantify the performance of information retrieval systems (recall and precision, …)
References • R. Baeza-Yates, B. Ribeiro-Neto: "Modern Information Retrieval" (1999) • G. Salton: "The SMART Retrieval System – Experiments in Automatic Document Processing" (1971) • S.E. Robertson, K. Sparck Jones: "Relevance weighting of search terms", Journal of the American Society for Information Science (1976) • N. Fuhr: "Probabilistic models in information retrieval" (1992) • TREC NIST website: http://trec.nist.gov • J.J. Rocchio: "Relevance feedback in information retrieval" (1971)