Explore query expansion, reformulation, and relevance feedback in information retrieval. Learn global & local methods for refining queries with thesaurus, synonyms, and user feedback.
The Principle of Information Retrieval, Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011
Query refining • Query refining • Query expansion • Query reformulation • Why use query refining? • Synonymy • Personalization • …
Two types • Global methods • Expanding or reformulating query terms independent of the query and of the results returned for it • Local methods • Adjusting a query relative to the documents that initially appear to match the query
The types of Global methods • Query expansion/reformulation with a thesaurus • WordNet • Automatic thesaurus generation • Techniques like spelling correction
The types of local methods • Relevance feedback • Pseudo-relevance feedback, also known as blind relevance feedback • (Global) indirect relevance feedback
6.1.1 Query reformulation • Users give additional input on query words or phrases, possibly suggested by the IR system • The key is building a thesaurus for query reformulation • Use of a controlled vocabulary that is maintained by human editors • Library of Congress Subject Headings • The Dewey Decimal system • An automatically derived thesaurus based on word co-occurrence statistics • Query reformulations based on query log mining
6.1.2 Methods of query reformulation • Vocabulary tools • Automatic thesaurus generation
6.1.2.1 Vocabulary tools for query reformulation • The system can suggest search terms by means of a thesaurus or a controlled vocabulary • It can also show the user how the query was processed, including information about • Words that were omitted from the query • Words that were stemmed to • The number of hits on each term or phrase • Whether words were dynamically turned into phrases
The advantages • Does not require any user input • Some systems can perform automatic query expansion with a thesaurus • In the PubMed system, neoplasm is added automatically to a search for cancer • Increases recall • Widely used in many science and engineering fields
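The automatic expansion described above can be sketched in a few lines of Python. This is only an illustration: `TOY_THESAURUS` and `expand_query` are hypothetical names, and the two entries are taken from examples in these slides rather than from any real vocabulary such as the one PubMed uses.

```python
# Hypothetical toy thesaurus (an assumption for illustration only);
# a real system would rely on a curated or automatically built resource.
TOY_THESAURUS = {
    "cancer": ["neoplasm"],
    "laptop": ["notebook"],
}


def expand_query(terms, thesaurus):
    """Return the original terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in terms:
        # Keep the original term first, then append its synonyms (if any).
        for candidate in [term] + thesaurus.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["cancer", "treatment"], TOY_THESAURUS))
# ['cancer', 'neoplasm', 'treatment']
```

No user input is needed: the extra term is added silently, which raises recall at the risk of lower precision.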
Methods • Exploit word co-occurrence statistics over the text to find the most similar words • Feasible and common • Use a grammatical analysis of the text to exploit grammatical relations or dependencies • More advanced but more complicated
Computation of co-occurrence thesaurus • We begin with a term-document matrix $A$, where each cell $A_{t,d}$ is a weighted count $w_{t,d}$ for term $t$ and document $d$ • If we then calculate $C = AA^T$, then $C_{u,v}$ is a similarity score between terms $u$ and $v$, with a larger number being better
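A minimal sketch of this computation in plain Python follows. The three terms, the four-document matrix, and the use of raw counts as weights are assumptions made for illustration, and the function names are hypothetical.

```python
def cooccurrence_scores(A):
    """Compute C = A * A^T for a term-document matrix A (one row per term)."""
    n_terms = len(A)
    return [
        [sum(a * b for a, b in zip(A[u], A[v])) for v in range(n_terms)]
        for u in range(n_terms)
    ]


def most_similar(C, terms, u):
    """Return the term other than u with the highest similarity score to u."""
    row = C[terms.index(u)]
    others = [i for i in range(len(terms)) if terms[i] != u]
    return terms[max(others, key=lambda i: row[i])]


# Toy data (assumed): rows are terms, columns are documents, cells are raw counts.
terms = ["car", "automobile", "banana"]
A = [
    [1, 1, 0, 0],  # car
    [1, 1, 1, 0],  # automobile
    [0, 0, 0, 1],  # banana
]

C = cooccurrence_scores(A)
print(C)                              # [[2, 2, 0], [2, 3, 0], [0, 0, 1]]
print(most_similar(C, terms, "car"))  # automobile
```

The diagonal of C only reflects how frequent a term is, so it is skipped when looking for the nearest neighbour of a term.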
The disadvantages • Computationally expensive • Requires dimensionality reduction via Latent Semantic Indexing • Requires a domain-specific thesaurus • The quality of the associations is a problem • Term ambiguity easily introduces irrelevant, statistically correlated terms • Apple computer may expand to Apple red fruit computer • Does not retrieve many additional documents • Since the terms in the automatic thesaurus are highly correlated in documents anyway
6.2 Relevance feedback • Relevance feedback is one of the most used and most successful approaches to query refining
The basic idea • It may be difficult to formulate a good query when you don’t know the collection well • Seeing some documents may lead users to refine their understanding of the information they are seeking
The approach of RF • The user issues a (short, simple) query • The system returns an initial set of retrieval results • The user marks some returned documents as relevant or not relevant • The system computes a better representation of the information need based on the user feedback • The system displays a revised set of retrieval results
An example of RF • Image search provides a good example of relevance feedback: it is a domain where users often have difficulty putting what they want into words, but can easily indicate relevant or nonrelevant images • http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Instructions • Browse: If the first page displayed doesn't include any interesting images, click browse to see the next page • Search: Once you find some initial images you are interested in, click on them to select them and press search • Iterate: After the search results are displayed, select/unselect more relevant images and click search • The system is based on relevance feedback, and it learns as you select more images and iterate
6.2.1 The Rocchio algorithm for RF • The classic algorithm • A model based on the vector space model (VSM) • Relevance feedback can improve both recall and precision • But, in practice, it has been shown to be most useful for increasing recall
The underlying theory 1-5 • We want to find a query vector $\vec{q}$ that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents
The underlying theory 2-5 • If $C_r$ is the set of relevant documents and $C_{nr}$ is the set of nonrelevant documents, then we wish to find: $\vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \mathrm{sim}(\vec{q}, C_r) - \mathrm{sim}(\vec{q}, C_{nr}) \right]$
The underlying theory 3-5 • Under cosine similarity, the optimal query vector for separating the relevant and nonrelevant documents is: $\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j$
The underlying theory 5-5 • The optimal query is the vector difference between the centroids of the relevant and nonrelevant documents • The key is getting the full set of relevant documents and the full set of nonrelevant documents, which in practice can only be approximated from users’ feedback
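The centroid difference can be checked on a toy example. The sketch below assumes two-dimensional document vectors with made-up weights; `centroid` and `optimal_query` are hypothetical helper names.

```python
def centroid(docs):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]


def optimal_query(relevant, nonrelevant):
    """Vector difference between the relevant and nonrelevant centroids."""
    return [r - n for r, n in zip(centroid(relevant), centroid(nonrelevant))]


# Toy vectors (assumed): dimension 0 is typical of relevant documents,
# dimension 1 of nonrelevant ones.
relevant = [[2.0, 0.0], [4.0, 2.0]]
nonrelevant = [[0.0, 2.0], [0.0, 4.0]]

print(optimal_query(relevant, nonrelevant))  # [3.0, -2.0]
```

The result points toward the relevant centroid [3.0, 1.0] and away from the nonrelevant centroid [0.0, 3.0].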
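Rocchio algorithm • $\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$ • $\vec{q}_0$ is the original query vector and $\vec{q}_m$ is the modified query • $D_r$ and $D_{nr}$ are the sets of known relevant and nonrelevant documents respectively • α, β and γ are the weights attached to the three terms of the formula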
Rocchio algorithm • Positive feedback also turns out to be much more valuable than negative feedback, and so most IR systems set γ < β • Reasonable values might be α = 1, β = 0.75, and γ = 0.15
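A minimal Rocchio sketch in plain Python, using the weights suggested above as defaults: the modified query is α times the original query, plus β times the centroid of the known relevant documents, minus γ times the centroid of the known nonrelevant documents. The toy vectors are assumptions, and clipping negative weights to zero is a common convention added here rather than something stated on the slide.

```python
def centroid(docs, dim):
    """Mean vector of docs, or the zero vector when docs is empty."""
    if not docs:
        return [0.0] * dim
    return [sum(column) / len(docs) for column in zip(*docs)]


def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move q0 toward the relevant centroid and away from the nonrelevant one."""
    dim = len(q0)
    c_r = centroid(relevant, dim)
    c_nr = centroid(nonrelevant, dim)
    q_m = [alpha * q + beta * r - gamma * n for q, r, n in zip(q0, c_r, c_nr)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in q_m]


# Toy three-term vocabulary (assumed): the query only uses term 0.
q0 = [1.0, 0.0, 0.0]
relevant = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
nonrelevant = [[0.0, 0.0, 2.0]]

print(rocchio(q0, relevant, nonrelevant))  # [1.0, 1.5, 0.0]
```

Term 1, which occurs only in the marked relevant documents, enters the query with weight 1.5, while term 2 would get a negative weight of -0.3 and is dropped.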
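Ide dec-hi • Another alternative is to use as negative feedback only the marked nonrelevant document that the IR system ranked highest • Some studies suggest that this variant is the most effective, or at least the most consistent, performer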
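A sketch of this variant under stated assumptions: the function below adds the marked relevant documents without normalization and subtracts only the first (highest-ranked) marked nonrelevant document, which is how Ide dec-hi is commonly written. The default weights of 1, the toy vectors, and the clipping of negative weights are assumptions for illustration.

```python
def ide_dec_hi(q0, relevant, ranked_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Update q0 using all relevant documents but only the top-ranked nonrelevant one.

    ranked_nonrelevant must list the marked nonrelevant documents in the
    order the system ranked them, highest-ranked first.
    """
    q_m = [alpha * w for w in q0]
    for doc in relevant:
        q_m = [q + beta * d for q, d in zip(q_m, doc)]
    if ranked_nonrelevant:
        top = ranked_nonrelevant[0]  # the only document used as negative feedback
        q_m = [q - gamma * d for q, d in zip(q_m, top)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in q_m]


q0 = [1.0, 0.0, 0.0]
relevant = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
ranked_nonrelevant = [[0.0, 0.0, 1.0], [3.0, 0.0, 0.0]]

print(ide_dec_hi(q0, relevant, ranked_nonrelevant))  # [1.0, 3.0, 0.0]
```

The second nonrelevant document is ignored, so the weight of term 0 is left untouched even though that document uses it heavily.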
The first assumption of RF • The user has to have sufficient knowledge to be able to make an initial query which is at least somewhere close to the documents they desire • Cases where relevance feedback alone is not sufficient include: • Misspellings • Mismatch of searcher’s vocabulary versus collection vocabulary • Laptop vs. notebook computer • Cross-language information retrieval • Documents in the same language cluster more closely together
The second assumption of RF 1-3 • The term distribution in all relevant documents will be similar to that in the documents marked by the users • The term distribution in all nonrelevant documents will be different from that in relevant documents
The second assumption of RF 2-3 • This approach does not work well if the relevant documents are a multimodal class • Subsets of the documents using different vocabulary, such as Burma vs. Myanmar • A query for which the answer set is inherently disjunctive, such as Pop stars who once worked at Burger King • Instances of a general concept, which often appear as a disjunction of more specific concepts • For example, felines