Special Topics in Computer Science The Art of Information Retrieval Chapter 5: Query Operations

Alexander Gelbukh www.Gelbukh.com

Previous chapter: Conclusions • Query languages (width-wide): • words, phrases, proximity, fuzzy Boolean, natural language • Query languages (depth-wide): • Pattern matching • If return sets, can be combined using Boolean model • Combining with structure • Hierarchical structure • Standardized low level languages: protocols • Reusable

Previous chapter: Trends and research topics • Models: to better understand the user needs • Query languages: flexibility, power, expressiveness, functionality • Visual languages • Example: library shown on the screen. Act: take books, open catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Query operations • Users have difficulties formulating queries • Program improves the query • Interactive mode: using the user’s feedback • Using info from the retrieved set • Using linguistic information or information from the collection • Query expansion • add new terms • Term re-weighting • modify weights

1st method: User relevance feedback • User examines the top 10 (20) docs and marks the relevant ones • System uses this to construct a new query • Moved toward relevant docs • Away from irrelevant • Good: simplicity • Note: throughout this chapter, the correct spelling is Rocchio

User relevance feedback: Vector Space Model • Best vector to distinguish good from bad docs: average of the good docs minus average of the bad docs
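Written out, "avg good minus avg bad" corresponds to the following expression. The notation is an assumption introduced here, not taken from the slide: C_r is the set of relevant docs, N the collection size, and d_j a document vector.

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
\;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

In practice C_r is unknown, which is why the feedback methods approximate it with the docs the user has marked.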

User relevance feedback: Vector Space Model • Equally good results • Weights: α for the original query, β for the relevant docs, γ for the irrelevant docs • Original query gives important info: α • Relevant docs give more info than irrelevant ones: γ < β • γ = 0: Positive feedback
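A minimal sketch of Rocchio-style query modification, assuming vectors are stored as term-to-weight dictionaries. The function name `rocchio`, the default values of alpha, beta, and gamma, and the clipping of negative weights are illustrative assumptions, not values from the slides.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return a modified query as a dict term -> weight.

    query: dict term -> weight (the original query)
    relevant / nonrelevant: lists of dicts (docs marked by the user)
    alpha, beta, gamma: illustrative defaults, to be tuned per collection
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            # Move toward the average relevant doc.
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            # Move away from the average irrelevant doc.
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            # Assumption: negative weights are dropped rather than kept.
            new_query[t] = w
    return new_query


if __name__ == "__main__":
    q = {"car": 1.0}
    good = [{"car": 1.0, "auto": 1.0}, {"auto": 1.0}]
    bad = [{"car": 1.0, "wash": 2.0}]
    print(rocchio(q, good, bad))
    # Positive feedback only: ignore the irrelevant docs.
    print(rocchio(q, good, bad, gamma=0.0))
```

Note that the new query contains "auto", a term absent from the original query, so this method both expands the query and re-weights its terms.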

User relevance feedback: Probabilistic Model • User feedback: term probabilities are re-estimated from the docs marked relevant • Smoothing is usually applied • Bad: • No document weights • Previous history lost • No new terms, only weights are changed
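A sketch of how the term weights might be re-estimated from user feedback in the probabilistic model. The 0.5 smoothing constants and all names here are assumptions chosen for illustration; the slide's own formula did not survive extraction.

```python
import math


def feedback_term_weight(n_rel_with_term, n_rel, n_docs_with_term, n_docs):
    """Weight of one query term after a feedback cycle.

    n_rel_with_term: marked-relevant docs that contain the term
    n_rel: docs marked relevant by the user
    n_docs_with_term: docs in the collection that contain the term
    n_docs: collection size
    """
    # Smoothed estimate of P(term | relevant); 0.5 avoids zero probabilities.
    p_rel = (n_rel_with_term + 0.5) / (n_rel + 1.0)
    # Smoothed estimate of P(term | non-relevant).
    p_non = (n_docs_with_term - n_rel_with_term + 0.5) / (n_docs - n_rel + 1.0)
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_non) / p_non)


def score(doc_terms, query_weights):
    """Rank value of a doc: sum of the weights of the query terms it contains."""
    return sum(w for t, w in query_weights.items() if t in doc_terms)


if __name__ == "__main__":
    weights = {
        "retrieval": feedback_term_weight(3, 4, 10, 100),
        "the": feedback_term_weight(0, 4, 50, 100),
    }
    print(weights)
    print(score({"retrieval", "system"}, weights))
```

Only the weights of the original query terms change, which illustrates the drawbacks listed above: no new terms are added, and nothing from earlier weights is carried over.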

... a variant for Probabilistic Model • Similarity is multiplied by TF (term frequency) • Not exactly, but this is the idea • Initially, IDF is also taken into account • Details in the book • Still no query expansion, only re-weighting the original terms

Evaluation of Relevance Feedback • Simplistic: • Evaluate precision and recall after the feedback cycle • Not realistic, since it includes the user’s own feedback • Better: • Only consider unseen data • Use the rest of the collection • Figures are not as good • Useful to compare different methods, not to compare precision/recall before and after feedback
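The "only consider unseen data" idea can be sketched as follows: docs the user already judged are removed from both the ranking and the relevant set before computing the figures. The function name and the cutoff parameter `k` are hypothetical.

```python
def residual_precision_recall(ranking, relevant, seen, k):
    """Precision and recall at cutoff k over the docs the user has not seen.

    ranking: doc ids returned by the modified query, best first
    relevant: ids of all relevant docs
    seen: ids of docs shown to the user during the feedback cycle
    """
    seen = set(seen)
    residual_ranking = [d for d in ranking if d not in seen]
    residual_relevant = set(relevant) - seen
    top = residual_ranking[:k]
    hits = sum(1 for d in top if d in residual_relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_relevant) if residual_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
    print(residual_precision_recall(ranking, {"d1", "d3", "d5"}, {"d1", "d2"}, k=2))
```

Because the highly ranked relevant docs were typically among those already seen, these figures tend to be lower than those of the original query, so they are suited only for comparing feedback methods with each other.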

2nd method: Automatic local analysis • Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships • Based on clustering techniques • Global vs. local strategy: • Global: the whole collection is used for this • Local: the retrieved set. Similar to feedback, but automatic. • Local analysis: seems to give better results (better adaptation to the specific query) but is time-consuming. • Good for local collections, not for the Web • Build clusters of words; add to each keyword its neighbors

Clustering (words) • Association clusters • Terms that co-occur in the docs • The clusters are the n terms that occur most frequently together with the query terms (normalized vs. non-normalized) • Metric clusters (better) • Multiplies the number of co-occurrences by the proximity in the text • Terms that occur in the same sentence are more related • Scalar clusters • Terms co-occurring with the same other terms are related • Relatedness of two words = scalar product of centroids of their association clusters
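A sketch of association and scalar clusters over the locally retrieved docs, assuming each doc is a list of (already stemmed) terms. All function names are hypothetical, and the scalar similarity is implemented here as the cosine between rows of the association matrix, which is one reading of the "scalar product" bullet. Metric clusters, which additionally need word positions, are not covered.

```python
import math


def association_matrix(docs):
    """c[u][v] = sum over docs of freq(u, doc) * freq(v, doc)."""
    freqs = []
    for doc in docs:
        f = {}
        for w in doc:
            f[w] = f.get(w, 0) + 1
        freqs.append(f)
    vocab = sorted({w for f in freqs for w in f})
    c = {u: {v: 0 for v in vocab} for u in vocab}
    for f in freqs:
        for u, fu in f.items():
            for v, fv in f.items():
                c[u][v] += fu * fv
    return c


def normalized(c, u, v):
    """Normalized association: favors rarer terms over merely frequent ones."""
    return c[u][v] / (c[u][u] + c[v][v] - c[u][v])


def scalar_similarity(c, u, v):
    """Cosine between the association vectors (matrix rows) of u and v."""
    dot = sum(c[u][t] * c[v][t] for t in c)
    nu = math.sqrt(sum(x * x for x in c[u].values()))
    nv = math.sqrt(sum(x * x for x in c[v].values()))
    return dot / (nu * nv)


def neighbors(c, u, n):
    """The n terms most strongly associated with u (ties broken alphabetically)."""
    others = [v for v in c if v != u]
    return sorted(others, key=lambda v: (-c[u][v], v))[:n]


if __name__ == "__main__":
    docs = [["car", "auto", "car"], ["car", "wash"], ["auto", "dealer"]]
    c = association_matrix(docs)
    print(neighbors(c, "car", 2))
    print(normalized(c, "car", "auto"))
    print(scalar_similarity(c, "car", "dealer"))
```

Query expansion then adds, for each query term, the terms returned by `neighbors` (or the analogous ranking by `normalized` or `scalar_similarity`).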

... variant (local clustering) • Metric-like reasoning: • Break the retrieved docs into passages (say, 300 words) • Use them as docs; use TF-IDF • Choose words related (use TF-IDF) to the whole query • Better: words occurring near each other are more related • Tune the constants for each collection (the value 5 is not universal)

3rd Method: Automatic Global Analysis • Uses all docs in the collection • Builds a thesaurus • The terms related to the whole query are added (query expansion)

Similarity thesaurus • Relatedness = occurring in the same docs • Matrix: doc x term frequency • Inverse term frequency: divided by the size of the doc • Relatedness = correlation between rows of the matrix • Query: centroid, weighted (weighted sum) • Relatedness between a term and this centroid = cosine • The best terms are added to the query, each with its own weight
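A rough sketch of expansion with a similarity thesaurus, assuming each doc is a non-empty list of terms. The weighting below (a 0.5 damping constant and a logarithmic inverse term frequency) is one common formulation used here as an assumption, since the slide's formula did not survive; the function names are hypothetical.

```python
import math


def term_vectors(docs):
    """Represent each term as a unit-length vector over the documents."""
    vocab = sorted({w for d in docs for w in d})
    t = len(vocab)
    # Inverse term frequency: docs with many distinct terms count less.
    itf = [math.log(t / len(set(d))) for d in docs]
    vectors = {}
    for term in vocab:
        freqs = [d.count(term) for d in docs]
        max_f = max(freqs)
        raw = [(0.5 + 0.5 * f / max_f) * itf[j] if f > 0 else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors


def expansion_terms(vectors, query_weights, r):
    """Return the r non-query terms closest to the weighted query centroid."""
    n_docs = len(next(iter(vectors.values())))
    centroid = [0.0] * n_docs
    for term, w in query_weights.items():
        for j, x in enumerate(vectors[term]):
            centroid[j] += w * x
    total = sum(query_weights.values())
    scored = []
    for term, vec in vectors.items():
        if term in query_weights:
            continue
        sim = sum(a * b for a, b in zip(centroid, vec))
        # Weight of the added term: similarity scaled by total query weight.
        scored.append((sim / total, term))
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [(term, weight) for weight, term in scored[:r]]


if __name__ == "__main__":
    docs = [["car", "auto", "engine"], ["car", "auto"],
            ["wash", "soap"], ["wash", "car"]]
    vectors = term_vectors(docs)
    print(expansion_terms(vectors, {"auto": 1.0}, r=2))
```

The key design point is that candidate terms are compared with the query as a whole (the centroid), not with individual query terms.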

(Global) Statistical thesaurus... • Terms added must be discriminative ⇒ low frequency • Difficult to cluster (no info) • Solution: first cluster docs; the frequency increases • Clustering docs, e.g.: • Each doc is a cluster • Merge two most similar clusters = their docs are similar • Repeat until <condition> (see the book, page 136)

... statistical thesaurus • Convert the cluster hierarchy into a set of clusters • Use a threshold similarity level to cut the hierarchy • Don’t take too large clusters • Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class (use a threshold) • These give clusters of words • Calculate the weight of each class of terms. Add these terms with this weight to the query terms
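A sketch of the doc-clustering step, assuming docs are term-to-weight dictionaries, cosine similarity between docs, and a complete-link rule (the similarity of two clusters is that of their least similar pair of docs). Stopping as soon as the best merge falls below the threshold plays the role of cutting the hierarchy at that level. The linkage choice and the names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two term -> weight dicts."""
    dot = sum(a.get(t, 0.0) * w for t, w in b.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_docs(docs, threshold):
    """Agglomerative clustering; returns lists of doc indices."""
    clusters = [[i] for i in range(len(docs))]  # each doc starts as a cluster

    def link(c1, c2):
        # Complete link: the least similar pair decides.
        return min(cosine(docs[i], docs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    docs = [{"car": 1, "auto": 1}, {"car": 1, "auto": 2},
            {"soap": 1, "wash": 1}, {"soap": 1}]
    print(cluster_docs(docs, threshold=0.5))
```

The low-frequency terms shared by the docs of each resulting cluster would then form the thesaurus classes; that selection step is not shown here.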

Research topics • Interactive interfaces • Graphical, 2D or 3D • Refining global analysis techniques • Application of linguistic methods. Stemming. Ontologies • Local analysis for the Web (now too expensive) • Combine the three techniques (feedback, local, global)

Conclusions • Relevance feedback • Simple, understandable • Needs user attention • Term re-weighting • Local analysis for query expansion • Co-occurrences in the retrieved docs • Usually gives better results than global analysis • Computationally expensive • Global analysis • Not as good results, since what is good for the whole collection is not good for a specific query • Linguistic methods, dictionaries, ontologies, stemming, ...

Exam • Questions and exercises • You do what you consider appropriate • On Oct 23 or maybe Nov 6 (??), discuss • The class on Oct 30 is moved to Oct 23

Thank you! • Till October 23 • October 23: discussion of the midterm exam; class moved from October 30
