CS 430: Information Discovery
Lecture 25: Query Refinement
Course Administration
No class next Tuesday, December 3.
Information Discovery Tasks
Text Retrieval
• provide good ranking for a query
Text Classification
• classify documents by their semantic content
Information Extraction
• extract particular attributes from a document
Topic Detection and Tracking
• find and track new topics in a stream of documents
(From Lecture 24)
Basic Search: The Human in the Loop
[Diagram: the user, in the loop, searches the index (which returns hits) and browses the repository (which returns objects).]
Basic Query: Query Refinement
[Flowchart: query formulation and search, then display of summary results; if there are no hits, the user reformulates the query or enters a new query; otherwise the retrieved information is displayed, and the user may again reformulate the query or pose a new one.]
Reformulation of Query
Manual
• Add or remove search terms
• Change Boolean operators
• Change phrases, regular expressions, etc.
Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms, e.g., from a thesaurus
Machine learning may be effective in developing an optimal query.
Query Reformulation: Vocabulary Tools
User Interface Feedback
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase
Suggestions
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary
Query Reformulation: Document Tools
Feedback usually consists of document excerpts or surrogates
• Suggests to the user how the system has interpreted the query
Effective at suggesting how to restrict a search
• Shows examples of false hits
Less good at suggesting how to expand a search
• No examples of missed items
Relevance Feedback
Concept:
• The user specifies an initial query and carries out a search.
• The system returns a results set (or the most highly ranked hits).
• The user examines the results set and indicates which documents are relevant.
• The system uses this information to generate an improved query and repeats the search.
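A minimal sketch of this loop in Python; search, get_user_judgments, and refine_query are hypothetical placeholders for the system's own components, not part of any named library:

    # Minimal sketch of the relevance feedback loop described above.
    # search(), get_user_judgments(), and refine_query() are hypothetical
    # placeholders for the system's own components.
    def relevance_feedback(initial_query, search, get_user_judgments,
                           refine_query, max_iterations=3):
        query = initial_query
        for _ in range(max_iterations):
            hits = search(query)                    # run the current query
            relevant, non_relevant = get_user_judgments(hits)
            if not relevant and not non_relevant:
                break                               # no feedback given; stop
            query = refine_query(query, relevant, non_relevant)
        return query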
Relevance Feedback (concept)
[Diagram from Lecture 4: hits from the original search plotted as points, with x marking documents identified as non-relevant and o marking documents identified as relevant; the reformulated query moves from the original query toward the relevant documents.]
Document Vectors as Points on a Surface
• Normalize all document vectors to be of length 1.
• The ends of the vectors then all lie on a surface with unit radius.
• For similar documents, we can represent parts of this surface as a flat region.
• Similar documents are represented as points that are close together on this surface.
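A small numeric illustration in Python with NumPy (the two document vectors are invented toy data): after normalization to unit length, cosine similarity reduces to a plain dot product, so similar documents are exactly the nearby points on the sphere.

    import numpy as np

    def normalize(v):
        # Scale a vector to unit length; its endpoint then lies on the unit sphere.
        return v / np.linalg.norm(v)

    # Invented toy term-weight vectors for two similar documents.
    d1 = normalize(np.array([2.0, 1.0, 0.0]))
    d2 = normalize(np.array([1.8, 1.1, 0.1]))

    # For unit vectors, cosine similarity is just the dot product;
    # values near 1.0 mean the points are close together on the sphere.
    print(np.dot(d1, d2))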
Theoretically Best Query
[Diagram: documents plotted as points, with x marking non-relevant documents and o marking relevant documents; the optimal query lies at the center of the cluster of relevant documents, well separated from the non-relevant ones.]
Theoretically Best Query
For a specific query Q, let:
• D_R be the set of all relevant documents
• D_N-R be the set of all non-relevant documents
• sim(Q, D_R) be the mean similarity between query Q and the documents in D_R
• sim(Q, D_N-R) be the mean similarity between query Q and the documents in D_N-R
The theoretically best query would maximize:
F = sim(Q, D_R) - sim(Q, D_N-R)
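Given these definitions, F can be computed directly for any candidate query. A sketch in Python with NumPy, assuming all vectors have been normalized to unit length so that each similarity is a dot product:

    import numpy as np

    def objective_f(query, relevant_docs, non_relevant_docs):
        # F = sim(Q, D_R) - sim(Q, D_N-R), taking sim as the mean cosine
        # similarity; all vectors are assumed to be unit length, so each
        # similarity is a dot product.
        sim_rel = np.mean([np.dot(query, d) for d in relevant_docs])
        sim_non = np.mean([np.dot(query, d) for d in non_relevant_docs])
        return sim_rel - sim_non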
Estimating the Best Query
In practice, D_R and D_N-R are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(Q, D_R) and sim(Q, D_N-R).
Rocchio's Modified Query
Modified query vector =
original query vector
+ mean vector of relevant documents found by the original query
- mean vector of non-relevant documents found by the original query
Query Modification
Q1 = Q0 + (1/n1) Σ(i=1..n1) Ri - (1/n2) Σ(i=1..n2) Si
where:
Q0 = vector for the initial query
Q1 = vector for the modified query
Ri = vector for relevant document i
Si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
(Rocchio, 1971)
Positive and Negative Feedback
Q1 = α Q0 + (β/n1) Σ(i=1..n1) Ri - (γ/n2) Σ(i=1..n2) Si
α, β, and γ are weights that adjust the importance of the three vectors.
If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.
If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
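A compact sketch of this weighted update in Python with NumPy; the parameter names alpha, beta, and gamma follow the standard presentation of Rocchio's formula, and setting all three to 1 recovers the 1971 formula on the previous slide. Here relevant and non_relevant are lists of document vectors:

    import numpy as np

    def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
        # Q1 = alpha*Q0 + (beta/n1)*sum(Ri) - (gamma/n2)*sum(Si),
        # where each (1/n)*sum(...) term is computed as a mean vector.
        q1 = alpha * np.asarray(q0, dtype=float)
        if len(relevant) > 0:            # guard against an empty relevant set
            q1 = q1 + beta * np.mean(relevant, axis=0)
        if len(non_relevant) > 0:        # guard against an empty non-relevant set
            q1 = q1 - gamma * np.mean(non_relevant, axis=0)
        return q1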
Difficulties with Relevance Feedback
[Diagram: as in the previous figures, x marks non-relevant and o marks relevant documents. The hits from the initial query are contained in a small gray shaded area, so the reformulated query, which is based only on those hits, may remain far from the optimal query.]
Effectiveness of Relevance Feedback
Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small
When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with study of thesauruses and other manual enhancements
Profile
A profile is a stored query that is run at regular intervals.
Example: a physicist provides a list of keywords describing research interests. A query is run against arXiv.org once a month to identify new articles that match that query.
Since a profile is run many times, it is worth investing effort to have the best possible query.
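A sketch of a profile as a stored query in Python; the profile fields, the run_query callable, and its since parameter are illustrative assumptions, not a real arXiv API:

    from datetime import date

    # Illustrative sketch only: a profile is a stored query plus metadata.
    # run_query() and its "since" parameter are hypothetical placeholders,
    # not a real arXiv API.
    profile = {
        "owner": "physicist@example.edu",
        "query": ["quark-gluon plasma", "heavy-ion collisions"],
        "last_run": None,
    }

    def run_profile(profile, run_query):
        # Find new articles matching the stored query since the last run.
        hits = run_query(profile["query"], since=profile["last_run"])
        profile["last_run"] = date.today()
        return hits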
Profile: Training Data
Relevance feedback and machine learning methods both need feedback from the user(s) about which hits are useful. Profiles provide such data:
• Click-through data -- observe which hits the user downloads or otherwise follows up
• Direct user feedback -- the user provides feedback manually