Special Topics in Computer Science: The Art of Information Retrieval. Chapter 5: Query Operations. Alexander Gelbukh www.Gelbukh.com
Previous chapter: Conclusions • Query languages (width-wide): • words, phrases, proximity, fuzzy Boolean, natural language • Query languages (depth-wide): • Pattern matching • If queries return sets, the results can be combined using the Boolean model • Combining with structure • Hierarchical structure • Standardized low-level languages: protocols • Reusable
Previous chapter: Trends and research topics • Models: to better understand the user needs • Query languages: flexibility, power, expressiveness, functionality • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
Query operations • Users have difficulties formulating queries • Program improves the query • Interactive mode: using the user’s feedback • Using info from the retrieved set • Using linguistic information or information from the collection • Query expansion • add new terms • Term rewriting • modify weights
1st method: User relevance feedback • User examines the top 10 (20) docs and marks the relevant ones • System uses this to construct a new query • Moved toward relevant docs • Away from irrelevant ones • Good: simplicity • Note: throughout this chapter, the correct spelling is Rocchio
User relevance feedback: Vector Space Model • Best vector to distinguish good from bad docs: avg good minus avg bad
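The "avg good minus avg bad" idea can be written out as a formula. This is a sketch in the usual textbook notation, which is an assumption here because the slide's own formula did not survive: C_r is the set of relevant docs in the collection, N is the collection size, and d_j is a document vector.

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
\;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

In practice C_r is not known in advance, which is why the feedback methods approximate it from the docs the user marks.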
User relevance feedback: Vector Space Model • The variants of the formula give equally good results • Original query gives important info (weight α) • Relevant docs give more info than irrelevant ones: γ < β • γ = 0: positive feedback
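A minimal sketch of Rocchio-style query modification, assuming sparse vectors stored as Python dicts. The function name `rocchio`, the parameter names `alpha`, `beta`, `gamma` (weights for the original query, the relevant docs and the irrelevant docs) and the default values are illustrative assumptions, not values taken from the slides.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return a modified query vector (dict: term -> weight).

    query        -- original query vector
    relevant     -- list of doc vectors the user marked relevant
    non_relevant -- list of doc vectors the user left unmarked
    The default weights are illustrative assumptions only.
    """
    new_query = defaultdict(float)
    # Keep the original query: it carries important information.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Move toward the average relevant document.
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    # Move away from the average irrelevant document.
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Dropping non-positive weights is a common convention, assumed here.
    return {t: w for t, w in new_query.items() if w > 0}


if __name__ == "__main__":
    q = {"jaguar": 1.0}
    good = [{"jaguar": 1.0, "cat": 1.0}]
    bad = [{"jaguar": 1.0, "car": 1.0}]
    print(rocchio(q, good, bad))
    print(rocchio(q, good, bad, gamma=0.0))  # positive feedback only
```

Note that, unlike pure re-weighting, this variant also brings new terms (here "cat") into the query.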
User relevance feedback: Probabilistic Model • User feedback: • Smoothing is usually applied • Bad: • No document weights • Previous history lost • No new terms, only weights are changed
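A sketch of how term weights can be re-estimated from user feedback in the probabilistic model. The estimates with the 0.5 adjustment follow the usual textbook smoothing; the function and parameter names are hypothetical, and the slide's own formula did not survive, so this may differ from it in detail.

```python
import math


def feedback_term_weight(n_rel_with_term, n_rel, n_with_term, n_docs):
    """Weight of one query term after a feedback cycle.

    n_rel_with_term -- relevant (marked) docs that contain the term
    n_rel           -- relevant (marked) docs in total
    n_with_term     -- docs in the collection that contain the term
    n_docs          -- docs in the collection
    """
    # Smoothed estimate of P(term | relevant).
    p_rel = (n_rel_with_term + 0.5) / (n_rel + 1.0)
    # Smoothed estimate of P(term | not relevant).
    p_non = (n_with_term - n_rel_with_term + 0.5) / (n_docs - n_rel + 1.0)
    # Log-odds weight used when ranking documents that contain the term.
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_non) / p_non)


if __name__ == "__main__":
    # Term frequent among relevant docs, rare in the collection.
    print(feedback_term_weight(4, 5, 10, 1000))
    # Term rare among relevant docs, common in the collection.
    print(feedback_term_weight(1, 5, 500, 1000))
```

Only the weights of the original query terms change; no terms are added, which matches the drawback listed on the slide.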
... a variant for Probabilistic Model • Similarity is multiplied by TF (term frequency) • Not exactly, but this is the idea • Initially, IDF is also taken into account • Details in the book • Still no query expansion, only re-weighting the original terms
Evaluation of Relevance Feedback • Simplistic: • Evaluate precision and recall after the feedback cycle • Not realistic, since the result includes the docs the user has already marked • Better: • Only consider unseen data • Use the rest of the collection • The figures are lower • Useful to compare different methods, not to compare precision/recall before and after feedback
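The "only consider unseen data" evaluation can be sketched as follows. The function name and the cutoff parameter `k` are assumptions made for illustration; docs are identified by arbitrary hashable ids.

```python
def residual_precision_recall(ranking, relevant, seen, k):
    """Precision and recall at cutoff k over the residual collection.

    ranking  -- doc ids returned by the modified query, best first
    relevant -- set of all relevant doc ids
    seen     -- set of doc ids the user examined during feedback
    """
    # Remove every doc the user has already seen, relevant or not.
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = relevant - seen
    top = residual_ranking[:k]
    hits = sum(1 for d in top if d in residual_relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
    print(residual_precision_recall(ranking, {"d1", "d3", "d5"}, {"d1", "d2"}, 2))
```

Because the highly ranked relevant docs were usually among those the user saw, the residual figures tend to be lower than the original ones, so they are only meaningful for comparing feedback methods with each other.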
2nd method: Automatic local analysis • Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships • Based on clustering techniques • Global vs. local strategy: • Global: the whole collection is used for this • Local: the retrieved set. Similar to feedback, but automatic. • Local analysis: seems to give better results (better adaptation to the specific query) but is time-consuming. • Good for local collections, not for the Web • Build clusters of words; add to each keyword its neighbors
Clustering (words) • Association clusters • Terms that co-occur in the docs • The clusters are the n terms that occur most frequently together with the query terms (normalized or unnormalized) • Metric clusters (better) • Multiplies the number of co-occurrences by the proximity in the text • Terms that occur in the same sentence are more related • Scalar clusters • Terms co-occurring with the same other terms are related • Relatedness of two words = scalar product of centroids of their association clusters
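A small sketch of association and metric clusters over a retrieved set, with docs given as lists of tokens. The scoring choices are assumptions: the association score is the unnormalized product of term frequencies summed over docs, and the metric score sums 1/distance over all pairs of occurrences within a doc. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def association_scores(docs, term):
    """Unnormalized correlation: sum over docs of f(term, d) * f(other, d)."""
    scores = defaultdict(int)
    for doc in docs:
        freq = Counter(doc)
        if term not in freq:
            continue
        for other, f in freq.items():
            if other != term:
                scores[other] += freq[term] * f
    return scores


def metric_scores(docs, term):
    """Sum of 1/distance (in words) over occurrence pairs in the same doc."""
    scores = defaultdict(float)
    for doc in docs:
        positions = [i for i, w in enumerate(doc) if w == term]
        for j, other in enumerate(doc):
            if other == term:
                continue
            for i in positions:
                scores[other] += 1.0 / abs(i - j)
    return scores


def top_neighbors(scores, n):
    """The n best-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


if __name__ == "__main__":
    docs = [["query", "expansion", "adds", "terms"],
            ["query", "terms", "query", "weights"]]
    print(top_neighbors(association_scores(docs, "query"), 2))
    print(top_neighbors(metric_scores(docs, "query"), 2))
```

The neighbors returned for each query term are the candidates to add to the query.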
... variant (local clustering) Metric-like reasoning: • Break the retrieved docs into passages (say, 300 words) • Use them as docs; use TF-IDF • Choose words related (use TF-IDF) to the whole query • Better: words occurring near each other are more related • Tune for each collection, not 5:
3rd Method: Automatic Global Analysis • Uses all docs in the collection • Builds a thesaurus • The terms related to the whole query are added (query expansion)
Similarity thesaurus • Relatedness = occurring in the same docs. • Matrix doc x term frequency • Inverse term frequency: divided by the size of the doc • Relatedness = correlation between rows of the matrix • Query: centroid, weighted (weighted sum). • Relatedness between a term and this centroid = cosine • The best terms are added to the query, with weights:
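A simplified sketch of expansion with a similarity thesaurus. Each term is represented by a vector over docs; the weighting used here (term frequency divided by doc length) is a crude stand-in for the inverse-term-frequency weighting and is an assumption, as are all function names.

```python
import math
from collections import Counter, defaultdict


def term_vectors(docs):
    """Map each term to a sparse vector {doc_id: length-normalized frequency}."""
    vectors = defaultdict(dict)
    for doc_id, doc in enumerate(docs):
        for term, f in Counter(doc).items():
            vectors[term][doc_id] = f / len(doc)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def expand(query_weights, vectors, n):
    """Return the n terms closest to the weighted query centroid."""
    centroid = defaultdict(float)
    for term, wq in query_weights.items():
        for doc_id, w in vectors.get(term, {}).items():
            centroid[doc_id] += wq * w
    candidates = {t: cosine(vec, centroid)
                  for t, vec in vectors.items() if t not in query_weights}
    return sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    docs = [["car", "engine", "wheel"], ["car", "engine"], ["cat", "fur"]]
    print(expand({"car": 1.0}, term_vectors(docs), 2))
```

The key design point is that candidates are compared with the whole query (the centroid), not with individual query terms, so a term related to only one query word scores lower.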
(Global) Statistical thesaurus... • Terms added must be discriminative: low frequency • Such terms are difficult to cluster (too little info) • Solution: first cluster the docs; the frequency increases • Clustering docs, e.g.: • Each doc is a cluster • Merge the two most similar clusters = their docs are similar • Repeat until <condition> page 136:
... statistical thesaurus • Convert the cluster hierarchy into a set of clusters • Use a threshold similarity level to cut the hierarchy • Don't take clusters that are too large • Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class • threshold • These give clusters of words • Calculate the weight of each class of terms. Add these terms with this weight to the query terms
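A sketch of the doc-clustering route to a statistical thesaurus. It assumes complete-link merging with cosine similarity, stops merging once the best similarity falls below a threshold (for complete link this is equivalent to cutting the hierarchy at that level), and uses a plain document-frequency cap as the low-frequency test. The names `max_size` and `max_df` are hypothetical parameters, and the class weights are not computed here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-frequency Counters."""
    dot = sum(f * b[t] for t, f in a.items())
    na = math.sqrt(sum(f * f for f in a.values()))
    nb = math.sqrt(sum(f * f for f in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def complete_link_clusters(docs, threshold):
    """Merge the two most similar clusters until similarity < threshold."""
    vectors = [Counter(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # each doc starts as a cluster

    def sim(c1, c2):
        # Complete link: cluster similarity is that of the least similar pair.
        return min(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, (a, b) = max(
            ((sim(clusters[a], clusters[b]), (a, b))
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda x: x[0])
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def thesaurus_classes(docs, clusters, max_size, max_df):
    """Low-frequency terms of each sufficiently small cluster form a class."""
    df = Counter(t for d in docs for t in set(d))
    classes = []
    for cluster in clusters:
        if len(cluster) > max_size:
            continue  # skip clusters that are too large
        terms = sorted({t for i in cluster for t in docs[i] if df[t] <= max_df})
        if len(terms) > 1:
            classes.append(terms)
    return classes


if __name__ == "__main__":
    docs = [["neural", "network", "training"], ["neural", "network", "layers"],
            ["stock", "market", "prices"], ["stock", "market", "crash"]]
    clusters = complete_link_clusters(docs, 0.5)
    print(clusters)
    print(thesaurus_classes(docs, clusters, max_size=2, max_df=1))
```

The quadratic search for the best pair is fine for a toy example; a real collection would need a more efficient clustering implementation.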
Research topics • Interactive interfaces • Graphical, 2D or 3D • Refining global analysis techniques • Application of linguistic methods. Stemming. Ontologies • Local analysis for the Web (now too expensive) • Combine the three techniques (feedback, local, global)
Conclusions • Relevance feedback • Simple, understandable • Needs user attention • Term re-weighting • Local analysis for query expansion • Co-occurrences in the retrieved docs • Usually gives better results than global analysis • Computationally expensive • Global analysis • Not as good results, since what is good for the whole collection is not good for a specific query • Linguistic methods, dictionaries, ontologies, stemming, ...
Exam • Questions and exercises • You do what you consider appropriate • On Oct 23 or maybe Nov 6 (??), discuss • The class on Oct 30 is moved to Oct 23
Thank you! Till October 23 • October 23: discussion of the midterm exam; class moved from October 30