Recall: Query Reformulation Approaches
• Relevance feedback based
  • vector model (Rocchio …)
  • probabilistic model (Robertson & Sparck Jones, Croft …)
• Cluster based Query Expansion
  • Local analysis: derive information from retrieved document set
  • Global analysis: derive information from corpus
Local Analysis
• “Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents.” [MIR]
• In relevance feedback, clusters are built from interaction with the user about documents.
• Local analysis automatically exploits the retrieved documents by identifying terms related to those in the query.
Term Clusters
• Association clusters: model co-occurrence of stems in the retrieved documents; expand the query using co-occurring terms
  • unnormalized: groups by large frequencies
  • normalized: groups by rarity
• Metric clusters: factor in intra-document distance
• Problem: expensive to compute on the fly
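A minimal Python sketch of association clusters, assuming the usual co-occurrence formulation: the correlation c(u,v) is the sum over retrieved documents of f(u,j) × f(v,j), the unnormalized score is c(u,v) itself, and the normalized score is c(u,v) / (c(u,u) + c(v,v) − c(u,v)). The function names and the toy documents are illustrative and do not come from the slides.

```python
from collections import Counter

def association_matrix(docs):
    """docs: list of stem lists from the locally retrieved set.
    Returns (stems, c) where c[u, v] = sum_j f(u, j) * f(v, j)."""
    freqs = [Counter(d) for d in docs]
    stems = sorted(set().union(*freqs))
    c = {(u, v): sum(f[u] * f[v] for f in freqs) for u in stems for v in stems}
    return stems, c

def normalized(c, u, v):
    """Normalized association: favors stems that rarely occur without u."""
    return c[u, v] / (c[u, u] + c[v, v] - c[u, v])

def neighbors(stems, c, u, n, normalize=False):
    """The n stems most strongly associated with stem u."""
    score = (lambda v: normalized(c, u, v)) if normalize else (lambda v: c[u, v])
    return sorted((v for v in stems if v != u), key=score, reverse=True)[:n]

# Toy "retrieved set" (assumed data, for illustration only)
docs = [["jaguar", "car", "car", "speed"],
        ["jaguar", "cat", "jungle"],
        ["jaguar", "car", "engine"]]
stems, c = association_matrix(docs)
print(neighbors(stems, c, "jaguar", 2))
print(neighbors(stems, c, "jaguar", 2, normalize=True))
```

The full matrix is quadratic in the number of stems, which is one reason the slide calls this expensive to compute on the fly.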
Global Analysis
• All documents are analyzed for term relationships.
• Two approaches:
  • Similarity thesaurus: relates the whole query to new terms. The focus is on the concept underlying the terms: each term is indexed by the documents in which it appears.
  • Statistical thesaurus: cluster documents into a class hierarchy.
Similarity Thesaurus Basis
• Term weights use an inverse term frequency (itf), defined for each document dj.
• N is the number of documents, t is the number of distinct terms in the collection, and tj is the number of distinct terms in document dj.
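The quantities above can be turned into term vectors roughly as follows. This is a sketch under stated assumptions: it takes itf_j = log(t / tj) and the weight (0.5 + 0.5 × f(i,j) / max_j f(i,j)) × itf_j, normalized to unit length per term, which is how the Qiu & Frei scheme is usually presented; the slide's own formula is not reproduced in the text, so treat these as assumed. The name `term_vectors` is hypothetical, and documents are assumed to be non-empty.

```python
import math
from collections import Counter

def term_vectors(docs):
    """docs: list of non-empty token lists.
    Returns {term: [w_1 .. w_N]}, one unit-length vector per term."""
    freqs = [Counter(d) for d in docs]
    vocab = set().union(*freqs)
    t = len(vocab)                                # distinct terms in collection
    itf = [math.log(t / len(f)) for f in freqs]   # assumed: itf_j = log(t / t_j)
    vectors = {}
    for term in vocab:
        max_f = max(f[term] for f in freqs)       # >= 1 because term is in vocab
        raw = [(0.5 + 0.5 * f[term] / max_f) * itf[j] if f[term] else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors[term] = [w / norm if norm else 0.0 for w in raw]
    return vectors

# Toy collection (assumed data)
vectors = term_vectors([["a", "b"], ["a", "c", "d"], ["b"]])
print(vectors["a"])
```

Note the role reversal compared with the usual vector model: here documents are the dimensions and terms are the vectors, so a long document (large tj) contributes less to each term it contains.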
Similarity Thesaurus Creation
• The thesaurus is a matrix of correlation factors between pairs of indexing terms.
Relationship between terms and query [figure from Qiu & Frei, “Concept Based Query Expansion”, SIGIR-93]
Query Expansion w/Similarity Thesaurus
• Represent the query in the concept space of the index terms (as a weight vector).
• Based on the global similarity thesaurus, compute a similarity sim(q,kv) between the query q and each term kv.
• Expand the query with the top r ranked terms, giving each added term a weight derived from sim(q,kv).
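The three steps can be sketched in Python. The formulas here are assumptions based on the usual presentation of Qiu & Frei's method, since the slide text does not reproduce them: the correlation c(u,v) is the dot product of the term vectors ku and kv, sim(q,kv) is the sum over query terms u of w(u,q) × c(u,v), and an added term is weighted by sim(q,kv) divided by the sum of the query weights. As a simplification, only terms not already in the query are considered. `expand_query` and the toy vectors are hypothetical.

```python
def expand_query(query_weights, term_vectors, r):
    """query_weights: {term: w(u,q)}; term_vectors: {term: weights over docs}.
    Returns the query extended with the r terms most similar to it."""
    def corr(u, v):
        # Thesaurus entry c(u,v): dot product of the two term vectors
        return sum(a * b for a, b in zip(term_vectors[u], term_vectors[v]))

    total = sum(query_weights.values())
    # sim(q, kv) for every candidate term not already in the query
    sims = {v: sum(w * corr(u, v) for u, w in query_weights.items())
            for v in term_vectors if v not in query_weights}
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(query_weights)
    expanded.update({v: sims[v] / total for v in top})
    return expanded

# Toy term vectors over two documents (assumed data)
toy_vectors = {"car": [1.0, 0.0], "auto": [0.8, 0.6], "tree": [0.0, 1.0]}
print(expand_query({"car": 1.0}, toy_vectors, r=1))
```

Because similarity is computed against the whole query rather than term by term, a candidate has to be close to the query's overall concept to be selected.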
Global 2: Statistical Thesaurus
• Thesaurus construction relies on high-discrimination, low-frequency terms.
• Such terms are hard to cluster…
• So, build classes by clustering similar documents instead.
• Cluster similarity is the minimum cosine (vector model) similarity between any two documents, one from each cluster.
Complete Link Algorithm [Crouch & Yang]
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu,Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
Hierarchy Example [from MIR notes]
Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A
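The complete link steps can be sketched in Python on the four example documents. This is a sketch under assumptions: terms are weighted by raw term frequency (the slides do not say which weights the example uses), so the similarity values need not match the original figure, and `stop_similarity` is a hypothetical stop criterion.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def complete_link(docs, stop_similarity=0.0):
    """Returns the merge history [(Cu, Cv, similarity), ...], i.e. the hierarchy."""
    vectors = [Counter(d) for d in docs]
    clusters = [(i,) for i in range(len(docs))]   # one cluster per document

    def cluster_sim(cu, cv):
        # Complete link: similarity of the LEAST similar cross-cluster pair
        return min(cosine(vectors[i], vectors[j]) for i in cu for j in cv)

    merges = []
    while len(clusters) > 1:
        cu, cv = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        sim = cluster_sim(cu, cv)
        if sim < stop_similarity:                 # stop criterion (assumed form)
            break
        clusters.remove(cu)
        clusters.remove(cv)
        clusters.append(tuple(sorted(cu + cv)))
        merges.append((cu, cv, sim))
    return merges

docs = ["DDABCABC", "ECEAAD", "DCBBDABCA", "A"]   # Doc1..Doc4, one char per term
for cu, cv, sim in complete_link(docs):
    print(cu, cv, round(sim, 2))
```

Document indices are zero-based, so `(0,)` is Doc1. Recomputing every pairwise cluster similarity on each pass keeps the code close to the listed steps; a practical implementation would cache the document similarity matrix.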
Query Expansion w/Statistical Thesaurus
• Select the terms for each class:
  • a threshold on similarity determines which clusters are used
  • NDC determines the maximum number of documents in a cluster
  • MIDF determines the minimum IDF for any term (i.e., how rare it must be)
• Compute the thesaurus class weight for the terms
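A rough Python sketch of how the three parameters could filter clusters and terms. The selection rule is a simplification and an assumption: a cluster is kept if its similarity meets the threshold and it has at most NDC documents, and a term is kept if its IDF, taken here as log(N / document frequency), is at least MIDF. The class weight computation is not shown because the slides do not give its formula. `thesaurus_classes` and the toy data are hypothetical.

```python
import math

def thesaurus_classes(clusters, docs, threshold, ndc, midf):
    """clusters: list of (doc_ids, similarity) from the cluster hierarchy.
    docs: list of term sets. Returns one sorted term list per selected class."""
    n = len(docs)

    def idf(term):
        df = sum(term in d for d in docs)         # document frequency
        return math.log(n / df)

    classes = []
    for doc_ids, sim in clusters:
        if sim < threshold or len(doc_ids) > ndc:
            continue                              # cluster too loose or too large
        terms = set().union(*(docs[i] for i in doc_ids))
        rare = sorted(t for t in terms if idf(t) >= midf)
        if rare:
            classes.append(rare)
    return classes

# Toy data (assumed): "a" occurs everywhere, so its IDF is 0
docs = [{"a", "b", "x"}, {"a", "b", "y"}, {"a", "z"}, {"a"}]
clusters = [((0, 1), 0.9), ((0, 1, 2), 0.5), ((2, 3), 0.2)]
print(thesaurus_classes(clusters, docs, threshold=0.4, ndc=2, midf=0.5))
```

Small changes to any of the three parameters can add or drop whole classes, which illustrates why the method is sensitive to its settings.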
Global Analysis Summary
• The thesaurus approach has been effective for improving queries…
• However:
  • it requires expensive processing (a static corpus is required)
  • statistical thesaurus generation exploits low-frequency terms better but is sensitive to parameter settings
Relevance Feedback/Query Reformulation Summary
• Relevance feedback and query expansion approaches have been shown to be effective at improving relevance, sometimes at the expense of precision.
• Users resist relevance feedback: it takes time and understanding.
• Query reformulation can be computationally expensive for search engines/IR systems.
Search Engine Use of Query Feedback
• Relevance feedback
  • explicit: tried, but mostly abandoned
  • indirect: Teoma (ranks documents higher that users look at more often)
• Similar/related pages or searches
  • suggest expanded queries or offer to search for related pages (AltaVista and MSN Search used to do this)
  • Google: Find Similar
  • Teoma
• Web log data mining
Behavior-Based Ranking
• AskJeeves used user behavior to change results ranking:
  • For each query Q, record which URLs are followed
  • Use click-through counts to order URLs for subsequent submissions of Q
• Pseudo-relevance feedback
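The record-then-reorder idea can be sketched in a few lines of Python. This illustrates the general technique only and is not AskJeeves' actual implementation; the class name, method names, and example URLs are hypothetical.

```python
from collections import Counter, defaultdict

class ClickThroughRanker:
    """Reorders results for a query by how often each URL was followed."""

    def __init__(self):
        self.clicks = defaultdict(Counter)        # query -> Counter of URLs

    def record_click(self, query, url):
        self.clicks[query][url] += 1

    def rerank(self, query, urls):
        counts = self.clicks.get(query, Counter())
        # sorted() is stable: URLs with equal counts keep their original order
        return sorted(urls, key=lambda u: -counts[u])

ranker = ClickThroughRanker()
for url in ["b.example", "b.example", "c.example"]:
    ranker.record_click("jaguar", url)
print(ranker.rerank("jaguar", ["a.example", "b.example", "c.example"]))
```

Counts are kept per query string, so this only helps for repeated submissions of the same Q, as the slide describes.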
Teoma: Indirect Relevance
• Combines indirect relevance judgments with its own link analysis
• “Subject-Specific Popularity ranks a site based on the number of same-subject specific pages that reference it.” [Teoma.com page]
• Clustering usage:
  • Refine: models communities to suggest search classification
  • Resources: suggests authoritative sites within a designated community
Web Log Mining
• Monitoring what people are querying is standard operating procedure for large search engines
• Goals:
  • learn associations between common terms based on a large number of queries
  • identify trends in user behavior that should be addressed by the system
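One simple way to learn term associations from a query log is to count how often two terms appear in the same query. The Python sketch below assumes whitespace tokenization and a hypothetical `min_count` cutoff; production log mining would also handle sessions, stop words, and spam, none of which are covered here.

```python
from collections import Counter
from itertools import combinations

def term_associations(queries, min_count=2):
    """Returns {(term_a, term_b): count} for pairs co-occurring in queries."""
    pair_counts = Counter()
    for q in queries:
        terms = sorted(set(q.lower().split()))    # each pair once per query
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Toy query log (assumed data)
log = ["cheap flights paris", "paris flights", "paris hotels", "cheap hotels"]
print(term_associations(log))
```

Pairs that pass the cutoff could then feed query suggestions or expansion, in the same spirit as the association clusters used in local analysis.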