The Principle of Information Retrieval

The Principle of Information Retrieval • Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics • 2011

II Course content

6 Query expansion and relevance feedback

Query refining • Query refining • Query expansion • Query reformulation • Why use query refining? • Synonymy • Personalization • …

Two types • Global methods • Expanding or reformulating query terms independent of the query • Local methods • Adjusting a query relative to the documents that initially appear to match the query

Types of global methods • Query expansion/reformulation with a thesaurus • WordNet • Automatic thesaurus generation • Techniques like spelling correction

Types of local methods • Relevance feedback • Pseudo-relevance feedback, also known as blind relevance feedback • (Global) indirect relevance feedback

6.1 Global methods for query reformulation

6.1.1 Query reformulation • Users give additional input on query words or phrases, possibly suggested by the IR system • The key is building a thesaurus for query reformulation • Use of a controlled vocabulary that is maintained by human editors • Library of Congress Subject Headings • The Dewey Decimal system • An automatically derived thesaurus built from word co-occurrence statistics • Query reformulations based on query log mining

6.1.2 Methods of query reformulation • Vocabulary tools • Automatic thesaurus generation

6.1.2.1 Vocabulary tools for query reformulation • By means of a thesaurus or a controlled vocabulary • This includes information about • Words that were omitted from the query • Words that were stemmed to • The number of hits on each term or phrase • Whether words were dynamically turned into phrases

WordNet

Sogou vocabulary

Advantages • Requires no user input • Some systems perform automatic query expansion with a thesaurus • In PubMed, neoplasm is automatically added to a search for cancer • Increases recall • Widely used in many science and engineering fields
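A minimal sketch of thesaurus-based query expansion, as in the cancer/neoplasm example above. The THESAURUS dictionary and the expand_query function are hypothetical names made up for illustration; a real system would draw on a curated controlled vocabulary rather than a hand-written table.

```python
# Hypothetical hand-built thesaurus (illustration only, not a real vocabulary).
THESAURUS = {
    "cancer": ["neoplasm"],
    "laptop": ["notebook"],
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms, each followed by its thesaurus synonyms, without duplicates."""
    expanded = []
    for term in terms:
        # Look up synonyms case-insensitively; unknown terms expand to themselves only.
        for candidate in [term] + thesaurus.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cancer", "treatment"]))
# prints ['cancer', 'neoplasm', 'treatment']
```

The expanded query matches documents that use either word, which is where the gain in recall comes from.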

6.1.2.2 Automatic thesaurus generation

Automatically generated thesaurus

Methods • Exploit word co-occurrence statistics in the text to find the most similar words • Feasible and common • Use a grammatical analysis of the text to exploit grammatical relations or grammatical dependencies • More advanced but more complicated

Computation of co-occurrence thesaurus • We begin with a term-document matrix $A$, where each cell $A_{t,d}$ is a weighted count $w_{t,d}$ for term $t$ and document $d$ • If we then calculate $C = AA^{T}$, then $C_{u,v}$ is a similarity score between terms $u$ and $v$, with a larger number being better
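The computation above can be sketched in plain Python. The toy matrix, the term list, and the helper names cooccurrence_matrix and most_similar are assumptions for illustration; the weights here are raw counts, whereas a real system would probably use normalized weights and a sparse matrix library.

```python
def cooccurrence_matrix(A):
    """Compute C = A A^T for a term-document matrix A (one row per term)."""
    n = len(A)
    # C[u][v] is the dot product of the document vectors of terms u and v.
    return [[sum(a * b for a, b in zip(A[u], A[v])) for v in range(n)]
            for u in range(n)]


def most_similar(terms, C, term, k=2):
    """Return the k terms with the highest similarity score to `term`."""
    u = terms.index(term)
    scored = [(C[u][v], terms[v]) for v in range(len(terms)) if v != u]
    scored.sort(key=lambda pair: -pair[0])  # larger score is better
    return [name for _, name in scored[:k]]


# Toy data: 3 terms, 4 documents (weights are raw counts).
terms = ["car", "automobile", "banana"]
A = [
    [1, 1, 0, 0],  # car
    [1, 2, 0, 0],  # automobile
    [0, 0, 1, 1],  # banana
]

C = cooccurrence_matrix(A)
print(C[0][1])                             # prints 3
print(most_similar(terms, C, "car", k=1))  # prints ['automobile']
```

Terms that appear in the same documents get a high score, so "automobile" becomes an expansion candidate for "car" while "banana" does not.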

Computation of co-occurrence thesaurus

Disadvantages • Computationally very expensive • Requires dimensionality reduction via Latent Semantic Indexing • Requires a domain-specific thesaurus • The quality of the associations • Term ambiguity easily introduces irrelevant, statistically correlated terms • Apple computer may expand to Apple red fruit computer • Does not retrieve many additional documents • Since the terms in the automatic thesaurus are highly correlated in documents anyway

6.2 Relevance feedback • Relevance feedback is one of the most widely used and most successful approaches

The basic idea • It may be difficult to formulate a good query when you don’t know the collection well • Seeing some documents may lead users to refine their understanding of the information they are seeking

The approach of RF • The user issues a (short, simple) query • The system returns an initial set of retrieval results • The user marks some returned documents as relevant or not relevant • The system computes a better representation of the information need based on the user feedback • The system displays a revised set of retrieval results

An example of RF • Image search provides a good example of relevance feedback: it is a domain where users often have difficulty formulating what they want in words, but can easily indicate relevant or nonrelevant images • http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Instructions • Browse: If the first page displayed doesn't include any interesting images, click browse to see the next page • Search: Once you find some initial images you are interested in, click on them to select and press search • Iterate: After the search results are displayed, select/unselect more relevant images and click search • The system is based on relevance feedback and it learns while you select more images and iterate

6.2.1 The Rocchio algorithm for RF • The classic algorithm • A model based on the vector space model (VSM) • Relevance feedback can improve both recall and precision • But, in practice, it has been shown to be most useful for increasing recall

The underlying theory (1/5) • We want to find a query vector that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents

The underlying theory (2/5) • If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find: $\vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \mathrm{sim}(\vec{q}, C_r) - \mathrm{sim}(\vec{q}, C_{nr}) \right]$

The underlying theory (3/5) • Under cosine similarity, the optimal query vector for separating the relevant and nonrelevant documents is: $\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$

The underlying theory (4/5)

The underlying theory (5/5) • The optimal query is the vector difference between the centroids of the relevant and nonrelevant documents • The key is getting the full set of relevant documents and the full set of nonrelevant documents based on users’ feedback
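A small worked example of the centroid difference described above. The three-dimensional document vectors are made-up toy data, and the function names centroid and optimal_query are hypothetical; this is only a sketch that assumes both document sets are fully known and non-empty.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def optimal_query(relevant, nonrelevant):
    """Vector difference between the relevant and nonrelevant centroids."""
    mu_r = centroid(relevant)
    mu_nr = centroid(nonrelevant)
    return [r - nr for r, nr in zip(mu_r, mu_nr)]


# Toy document vectors over three terms (assumed data).
relevant = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

print(optimal_query(relevant, nonrelevant))
# prints [1.0, -1.0, 0.0]
```

The first term occurs only in relevant documents and gets a positive weight, the second occurs only in nonrelevant documents and gets a negative weight, and the third is equally common in both sets and cancels out.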

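Rocchio algorithm • $\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$ • $\vec{q}_0$ is the original query vector • $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents respectively • α, β and γ are weights attached to each term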

Rocchio algorithm • Positive feedback turns out to be much more valuable than negative feedback, so most IR systems set γ < β • Reasonable values might be α = 1, β = 0.75, and γ = 0.15
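A minimal sketch of a Rocchio update using the weights suggested above (α = 1, β = 0.75, γ = 0.15). The function name rocchio and the toy vectors are assumptions for illustration. Clipping negative weights to zero is also an assumption of this sketch: it is a common convention, but it is not stated on the slides.

```python
def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query: alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    dims = len(q0)

    def centroid(vectors):
        # An empty feedback set contributes nothing to the update.
        if not vectors:
            return [0.0] * dims
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    mu_r = centroid(relevant)
    mu_nr = centroid(nonrelevant)
    q = [alpha * q0[i] + beta * mu_r[i] - gamma * mu_nr[i] for i in range(dims)]
    # Assumption of this sketch: negative term weights are set to zero.
    return [max(0.0, weight) for weight in q]


q0 = [1.0, 0.0, 0.0]             # original query uses only the first term
relevant = [[1.0, 1.0, 0.0]]     # documents the user marked relevant
nonrelevant = [[0.0, 0.0, 1.0]]  # documents the user marked nonrelevant

print(rocchio(q0, relevant, nonrelevant))
# prints [1.75, 0.75, 0.0]
```

The second term was absent from the original query but enters the revised query through the relevant document, which is how the update tends to raise recall.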

Ide dec-hi • Another alternative is to use only the highest-ranked marked nonrelevant document as negative feedback; this variant has been reported to give the most consistent performance
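A sketch of this variant, written as the Rocchio update with the nonrelevant set reduced to a single document. It assumes the single document used is the highest-ranked marked nonrelevant one, and that the caller passes nonrelevant documents in ranking order so the first entry is that document. The function name and toy vectors are hypothetical, and the weights reuse α = 1, β = 0.75, γ = 0.15.

```python
def rocchio_top_nonrelevant(q0, relevant, nonrelevant_ranked,
                            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update that subtracts only the top-ranked nonrelevant document."""
    dims = len(q0)
    if relevant:
        mu_r = [sum(column) / len(relevant) for column in zip(*relevant)]
    else:
        mu_r = [0.0] * dims
    # Assumption: the list is in ranking order, so index 0 is the highest-ranked document.
    top_nr = nonrelevant_ranked[0] if nonrelevant_ranked else [0.0] * dims
    return [alpha * q0[i] + beta * mu_r[i] - gamma * top_nr[i] for i in range(dims)]


q0 = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0]]
nonrelevant_ranked = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # first entry ranked highest

print(rocchio_top_nonrelevant(q0, relevant, nonrelevant_ranked))
# prints [1.75, 0.75, -0.15]
```

The second nonrelevant document is ignored entirely, so the weight of the second term is not reduced.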

The first assumption of RF • The user has to have sufficient knowledge to be able to make an initial query which is at least somewhere close to the documents they desire • Cases where relevance feedback alone is not sufficient include: • Misspellings • Mismatch of searcher’s vocabulary versus collection vocabulary • Laptop vs. notebook computer • Cross-language information retrieval • Documents in the same language cluster more closely together

The second assumption of RF (1/3) • The term distribution in all relevant documents will be similar to that in the documents marked by the users • The term distribution in all nonrelevant documents will be different from that in relevant documents

The second assumption of RF (2/3) • This approach does not work well if the relevant documents are a multimodal class • Subsets of the documents using different vocabulary, such as Burma vs. Myanmar • A query for which the answer set is inherently disjunctive, such as Pop stars who once worked at Burger King • Instances of a general concept, which often appear as a disjunction of more specific concepts • For example, felines
