CS 430: Information Discovery
Lecture 25: Query Refinement
Course Administration
No class next Tuesday, December 3.
Information Discovery Tasks
Text Retrieval
• provide good ranking for a query
Text Classification
• classify documents by their semantic content
Information Extraction
• extract particular attributes from a document
Topic Detection and Tracking
• find and track new topics in a stream of documents
(From Lecture 24)
Basic Search: The Human in the Loop
[Diagram: the user, in the loop, searches the index (which returns hits) and browses the repository (which returns objects).]
Basic Query: Query Refinement
[Flowchart: query formulation and search, then display of summary results; if there are no hits, the user reformulates the query or enters a new query; otherwise the retrieved information is displayed, and the user may again reformulate the query or pose a new one.]
Reformulation of Query
Manual
• Add or remove search terms
• Change Boolean operators
• Change phrases, regular expressions, etc.
Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms, e.g., from a thesaurus
Machine learning may be effective in developing an optimal query.
Query Reformulation: Vocabulary Tools
User Interface Feedback
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase
Suggestions
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary
Query Reformulation: Document Tools
Feedback usually consists of document excerpts or surrogates
• Suggests to the user how the system has interpreted the query
Effective at suggesting how to restrict a search
• Shows examples of false hits
Less good at suggesting how to expand a search
• No examples of missed items
Relevance Feedback
Concept:
• The user specifies an initial query and carries out a search.
• The system returns a results set (or the most highly ranked hits).
• The user examines the results set and indicates which documents are relevant.
• The system uses this information to generate an improved query and repeats the search.
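A minimal sketch of this loop in Python; search, get_user_judgments, and refine_query are hypothetical placeholders for the system's own components, not part of any named library:

    # Minimal sketch of the relevance feedback loop described above.
    # search(), get_user_judgments(), and refine_query() are hypothetical
    # placeholders for the system's own components.
    def relevance_feedback(initial_query, search, get_user_judgments,
                           refine_query, max_iterations=3):
        query = initial_query
        for _ in range(max_iterations):
            hits = search(query)                    # run the current query
            relevant, non_relevant = get_user_judgments(hits)
            if not relevant and not non_relevant:
                break                               # no feedback given; stop
            query = refine_query(query, relevant, non_relevant)
        return query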
Relevance Feedback (concept)
[Diagram from Lecture 4: hits from the original search plotted as points, with x marking documents identified as non-relevant and o marking documents identified as relevant; the reformulated query moves from the original query toward the relevant documents.]
Document Vectors as Points on a Surface
• Normalize all document vectors to be of length 1.
• The ends of the vectors then all lie on a surface with unit radius.
• For similar documents, we can represent parts of this surface as a flat region.
• Similar documents are represented as points that are close together on this surface.
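A small numeric illustration in Python with NumPy (the two document vectors are invented toy data): after normalization to unit length, cosine similarity reduces to a plain dot product, so similar documents are exactly the nearby points on the sphere.

    import numpy as np

    def normalize(v):
        # Scale a vector to unit length; its endpoint then lies on the unit sphere.
        return v / np.linalg.norm(v)

    # Invented toy term-weight vectors for two similar documents.
    d1 = normalize(np.array([2.0, 1.0, 0.0]))
    d2 = normalize(np.array([1.8, 1.1, 0.1]))

    # For unit vectors, cosine similarity is just the dot product;
    # values near 1.0 mean the points are close together on the sphere.
    print(np.dot(d1, d2))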
Theoretically Best Query
[Diagram: documents plotted as points, with x marking non-relevant documents and o marking relevant documents; the optimal query lies at the center of the cluster of relevant documents, well separated from the non-relevant ones.]
Theoretically Best Query
For a specific query Q, let:
• D_R be the set of all relevant documents
• D_N-R be the set of all non-relevant documents
• sim(Q, D_R) be the mean similarity between query Q and the documents in D_R
• sim(Q, D_N-R) be the mean similarity between query Q and the documents in D_N-R
The theoretically best query would maximize:
F = sim(Q, D_R) - sim(Q, D_N-R)
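Given these definitions, F can be computed directly for any candidate query. A sketch in Python with NumPy, assuming all vectors have been normalized to unit length so that each similarity is a dot product:

    import numpy as np

    def objective_f(query, relevant_docs, non_relevant_docs):
        # F = sim(Q, D_R) - sim(Q, D_N-R), taking sim as the mean cosine
        # similarity; all vectors are assumed to be unit length, so each
        # similarity is a dot product.
        sim_rel = np.mean([np.dot(query, d) for d in relevant_docs])
        sim_non = np.mean([np.dot(query, d) for d in non_relevant_docs])
        return sim_rel - sim_non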
Estimating the Best Query
In practice, D_R and D_N-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(Q, D_R) and sim(Q, D_N-R).
Rocchio's Modified Query
Modified query vector =
original query vector
+ mean vector of relevant documents found by the original query
- mean vector of non-relevant documents found by the original query
Query Modification
Q1 = Q0 + (1/n1) Σ(i=1..n1) Ri - (1/n2) Σ(i=1..n2) Si
where:
Q0 = vector for the initial query
Q1 = vector for the modified query
Ri = vector for relevant document i
Si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
(Rocchio, 1971)
Positive and Negative Feedback
Q1 = α Q0 + (β/n1) Σ(i=1..n1) Ri - (γ/n2) Σ(i=1..n2) Si
α, β, and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
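A compact sketch of this weighted update in Python with NumPy; the parameter names alpha, beta, and gamma follow the standard presentation of Rocchio's formula, and setting all three to 1 recovers the 1971 formula on the previous slide. Here relevant and non_relevant are lists of document vectors:

    import numpy as np

    def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
        # Q1 = alpha*Q0 + (beta/n1)*sum(Ri) - (gamma/n2)*sum(Si),
        # where each (1/n)*sum(...) term is computed as a mean vector.
        q1 = alpha * np.asarray(q0, dtype=float)
        if len(relevant) > 0:            # guard against an empty relevant set
            q1 = q1 + beta * np.mean(relevant, axis=0)
        if len(non_relevant) > 0:        # guard against an empty non-relevant set
            q1 = q1 - gamma * np.mean(non_relevant, axis=0)
        return q1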
Difficulties with Relevance Feedback
[Diagram: as in the previous figures, x marks non-relevant and o marks relevant documents. The hits from the initial query are contained in a small gray shaded area, so the reformulated query, which is based only on those hits, may remain far from the optimal query.]
Effectiveness of Relevance Feedback
Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small
When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with study of thesauruses and other manual enhancements
Profile
A profile is a stored query that is run at regular intervals.
Example: a physicist provides a list of keywords describing research interests. A query is run against arXiv.org once a month to identify new articles that match that query.
Since a profile is run many times, it is worth investing effort to have the best possible query.
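A sketch of a profile as a stored query in Python; the profile fields, the run_query callable, and its since parameter are illustrative assumptions, not a real arXiv API:

    from datetime import date

    # Illustrative sketch only: a profile is a stored query plus metadata.
    # run_query() and its "since" parameter are hypothetical placeholders,
    # not a real arXiv API.
    profile = {
        "owner": "physicist@example.edu",
        "query": ["quark-gluon plasma", "heavy-ion collisions"],
        "last_run": None,
    }

    def run_profile(profile, run_query):
        # Find new articles matching the stored query since the last run.
        hits = run_query(profile["query"], since=profile["last_run"])
        profile["last_run"] = date.today()
        return hits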
Profile: Training Data
Relevance feedback and machine learning methods both need feedback from the user(s) about which hits are useful. Profiles provide such data:
• Click-through data -- observe which hits the user downloads or otherwise follows up
• Direct user feedback -- the user provides feedback manually