Explore advanced query operations in information retrieval including user relevance feedback, automatic local and global analysis, clustering techniques, and the use of thesaurus to improve search results. Discover techniques to enhance query effectiveness.
Special Topics in Computer Science • The Art of Information Retrieval • Chapter 5: Query Operations • Alexander Gelbukh • www.Gelbukh.com
Previous chapter: Conclusions • Query languages (width-wide): • words, phrases, proximity, fuzzy Boolean, natural language • Query languages (depth-wide): • Pattern matching • If return sets, can be combined using Boolean model • Combining with structure • Hierarchical structure • Standardized low level languages: protocols • Reusable
Previous chapter: Trends and research topics • Models: to better understand the user needs • Query languages: flexibility, power, expressiveness, functionality • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
Query operations • Users have difficulties formulating queries • Program improves the query • Interactive mode: using the user’s feedback • Using info from the retrieved set • Using linguistic information or information from the collection • Query expansion • add new terms • Term rewriting • modify weights
1st method: User relevance feedback • User examines the top 10 (20) docs and marks the relevant ones • System uses this to construct a new query • Moved toward relevant docs • Away from irrelevant • Good: simplicity • Note: Throughout the chapter, the correct spelling is Rocchio
User relevance feedback: Vector Space Model • Best vector to distinguish good from bad docs: avg good minus avg bad
User relevance feedback: Vector Space Model • Equally good results • Original query gives important info: • Relevant docs give more info than irrelevant ones: γ < β (β weights the relevant docs, γ the irrelevant ones) • γ = 0: Positive feedback
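The Rocchio-style update on these slides (keep the original query, move toward the average relevant doc, move away from the average irrelevant doc) might be sketched as below. This is a minimal sketch, not the formula from the slides: the function name `rocchio_update`, the dict-based vectors, the default values of `alpha`, `beta`, `gamma`, and the clipping of negative weights are all assumptions made for illustration.

```python
from collections import defaultdict


def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the average relevant doc and away from the average irrelevant one.

    All vectors are dicts mapping term -> weight (a hypothetical representation).
    The default alpha, beta, gamma are illustrative only, not values from the slides.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Dropping non-positive weights is a design choice of this sketch.
    return {t: w for t, w in new_query.items() if w > 0}
```

Calling it with `gamma=0` gives the positive-feedback variant: irrelevant docs are ignored and only terms from the query and the relevant docs survive.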
User relevance feedback: Probabilistic Model • User feedback: • Smoothing is usually applied • Bad: • No document weights • Previous history lost • No new terms, only weights are changed
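As an illustration of term re-weighting with smoothing, the sketch below estimates the two term probabilities from the user's judgments and combines them into a log-odds weight. The 0.5 smoothing constant, the function names, and the parameter names are assumptions of this sketch (the slide's own formula did not survive); documents are treated as plain sets of terms, which matches the "no document weights" limitation.

```python
import math


def feedback_term_weight(n_docs, n_relevant, docs_with_term, relevant_with_term):
    """Term weight after feedback, with 0.5 smoothing to avoid zero probabilities.

    n_docs: collection size; n_relevant: docs the user marked relevant;
    docs_with_term: docs containing the term; relevant_with_term: marked docs containing it.
    """
    p_rel = (relevant_with_term + 0.5) / (n_relevant + 1.0)
    p_nonrel = (docs_with_term - relevant_with_term + 0.5) / (n_docs - n_relevant + 1.0)
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_nonrel) / p_nonrel)


def rank_score(query_terms, doc_terms, weights):
    """Sum the weights of the original query terms present in the document."""
    return sum(weights[t] for t in query_terms if t in doc_terms)
```

Only the weights of the original query terms change here; no new terms enter the query.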
... a variant for Probabilistic Model • Similarity is multiplied by TF (term frequency) • Not exactly, but this is the idea • Initially, IDF is also taken into account • Details in the book • Still no query expansion, only re-weighting the original terms
Evaluation of Relevance Feedback • Simplistic: • Evaluate precision and recall after the feedback cycle • Not realistic since includes the user’s own feedback • Better: • Only consider unseen data • Use the rest of the collection • Not as good figures • Useful to compare different methods, not to compare precision/recall before and after feedback
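The "only consider unseen data" idea can be sketched as follows: documents the user already judged are removed both from the ranking and from the set of relevant documents before precision and recall are computed. The function name and the set-based inputs are assumptions of this sketch.

```python
def residual_precision_recall(ranking, relevant, seen):
    """Precision and recall computed only over documents the user has not already judged.

    ranking: doc ids returned by the modified query, best first;
    relevant: set of all relevant doc ids; seen: set of ids shown during feedback.
    """
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = relevant - seen
    hits = sum(1 for d in residual_ranking if d in residual_relevant)
    precision = hits / len(residual_ranking) if residual_ranking else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall
```

Because the easy, already-seen relevant docs are excluded, the figures come out lower, which is why they serve for comparing feedback methods with each other rather than with the original query.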
2nd method: Automatic local analysis • Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships • Based on clustering techniques • Global vs Local strategy: • Global: the whole collection is used for this • Local: the retrieved set. Similar to feedback, but automatic. • Local analysis: seems to give better results (better adaptation to the specific query) but time-consuming. • Good for local collections, not for Web • Build clusters of words; add to each keyword its neighbors
Clustering (words) • Association clusters • Terms that co-occur in the docs • The clusters are the n terms that occur most frequently together with the query terms (normalized vs. non-) • Metric clusters (better) • Multiplies the number of co-occurrences by the proximity in the text • Terms that occur in the same sentence are more related • Scalar clusters • Terms co-occurring with the same other terms are related • Relatedness of two words = scalar product of centroids of their association clusters
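The three kinds of clusters can be illustrated with a small sketch over tokenized documents. It assumes each retrieved doc is a list of tokens, uses inverse token distance as the proximity measure for metric clusters, and uses the cosine between association rows for scalar clusters; these choices and all function names are assumptions of the sketch, not definitions taken from the slides.

```python
import math
from collections import Counter


def association_matrix(docs):
    """c[u][v] = sum over docs of freq(u, doc) * freq(v, doc). docs are lists of tokens."""
    c = {}
    for doc in docs:
        freq = Counter(doc)
        for u, fu in freq.items():
            row = c.setdefault(u, {})
            for v, fv in freq.items():
                row[v] = row.get(v, 0) + fu * fv
    return c


def normalized_association(c, u, v):
    """Normalized variant: c_uv / (c_uu + c_vv - c_uv)."""
    cuv = c.get(u, {}).get(v, 0)
    return cuv / (c[u][u] + c[v][v] - cuv)


def metric_correlation(docs, u, v):
    """Sum of 1/distance over all pairs of positions of u and v within the same doc."""
    total = 0.0
    for doc in docs:
        pos_u = [i for i, t in enumerate(doc) if t == u]
        pos_v = [i for i, t in enumerate(doc) if t == v]
        total += sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)
    return total


def scalar_correlation(c, u, v):
    """Cosine between the association rows of u and v."""
    terms = set(c[u]) | set(c[v])
    dot = sum(c[u].get(t, 0) * c[v].get(t, 0) for t in terms)
    norm_u = math.sqrt(sum(x * x for x in c[u].values()))
    norm_v = math.sqrt(sum(x * x for x in c[v].values()))
    return dot / (norm_u * norm_v)


def neighbors(c, term, n):
    """The n terms most strongly associated with `term` (the term itself excluded)."""
    ranked = sorted((t for t in c[term] if t != term), key=lambda t: c[term][t], reverse=True)
    return ranked[:n]
```

For query expansion, `neighbors` would be called for each query term and the returned terms appended to the query.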
... variant (local clustering) Metric-like reasoning: • Break the retrieved docs into passages (say, 300 words) • Use them as docs; use TF-IDF • Choose words related (use TF-IDF) to the whole query • Better: words occurring near each other are more related • Tune for each collection, not 5:
3rd Method: Automatic Global Analysis • Uses all docs in the collection • Builds a thesaurus • The terms related to the whole query are added (query expansion)
Similarity thesaurus • Relatedness = occur in the same docs. • Matrix doc x term frequency • Inverse term frequency: divided by the size of the doc • Relatedness = correlation between rows of the matrix • Query: centroid, weighted (weighted sum). • Relatedness between a term and this centroid = cosine • Best terms are added to the query, with weights:
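A simplified sketch of expansion with a similarity thesaurus: each term is a vector over documents (frequency divided by document length), the query is the weighted sum of its terms' vectors, and the terms closest to that centroid by cosine are added. The exact weighting scheme, the use of the cosine value itself as the weight of an added term, and the names `term_vectors`, `expand_query`, and `top_k` are assumptions of this sketch.

```python
import math
from collections import Counter, defaultdict


def term_vectors(docs):
    """Each term becomes a vector over documents: frequency divided by document length."""
    vectors = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for term, freq in Counter(doc).items():
            vectors[term][doc_id] = freq / len(doc)
    return vectors


def cosine(a, b):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def expand_query(query, docs, top_k=2):
    """query: dict term -> weight. Returns the query plus the top_k terms closest to its centroid."""
    vectors = term_vectors(docs)
    centroid = defaultdict(float)
    for term, weight in query.items():
        for doc_id, w in vectors.get(term, {}).items():
            centroid[doc_id] += weight * w
    candidates = {t: cosine(centroid, v) for t, v in vectors.items() if t not in query}
    best = sorted(candidates, key=candidates.get, reverse=True)[:top_k]
    expanded = dict(query)
    expanded.update({t: candidates[t] for t in best})
    return expanded
```

Note that candidates are compared with the whole query centroid, not with individual query terms, which is the point of this method.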
(Global) Statistical thesaurus... • Terms added must be discriminative, i.e., low frequency • Difficult to cluster (no info) • Solution: First cluster docs; the frequency increases • Clustering docs, e.g.: • Each doc is a cluster • Merge two most similar clusters = their docs are similar • Repeat until <condition> page 136:
... statistical thesaurus • Convert the cluster hierarchy into a set of clusters • Use a threshold similarity level to cut the hierarchy • Don’t take too large clusters • Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class • threshold • These give clusters of words • Calculate weight of each class of terms. Add these terms with this weight to the query terms
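The document-clustering step and the threshold cut might look like the sketch below: start with one cluster per doc, repeatedly merge the most similar pair, and stop once no pair reaches the threshold. The complete-link similarity (two clusters are as similar as their least similar pair of docs), the cosine measure, and the name `cluster_docs` are assumptions of this sketch; selecting the low-frequency terms inside each resulting cluster and weighting them is not shown.

```python
import math


def cosine(a, b):
    """Cosine between two sparse vectors stored as dicts term -> weight."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def cluster_docs(docs, threshold):
    """Agglomerative clustering that stops when no pair of clusters reaches `threshold`.

    docs: list of dicts term -> weight. Returns a list of clusters (lists of doc indices).
    """
    clusters = [[i] for i in range(len(docs))]

    def similarity(c1, c2):
        # Complete link (an assumption): the least similar pair of docs decides.
        return min(cosine(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = [(similarity(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        best, a, b = max(pairs)
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Stopping at the threshold is equivalent to building the full hierarchy and then cutting it at that similarity level; a higher threshold yields more, smaller clusters.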
Research topics • Interactive interfaces • Graphical, 2D or 3D • Refining global analysis techniques • Application of linguistic methods. Stemming. Ontologies • Local analysis for the Web (now too expensive) • Combine the three techniques (feedback, local, global)
Conclusions • Relevance feedback • Simple, understandable • Needs user attention • Term re-weighting • Local analysis for query expansion • Co-occurrences in the retrieved docs • Usually gives better results than global analysis • Computationally expensive • Global analysis • Not as good results, since what is good for the whole collection is not good for a specific query • Linguistic methods, dictionaries, ontologies, stemming, ...
Exam • Questions and exercises • You do what you consider appropriate • On Oct 23 or maybe Nov 6 (??), discuss • The class on Oct 30 is moved to Oct 23
Thank you! • Till October 23 • October 23: discussion of the midterm exam (class moved from October 30)