Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash
Layout • History • Models & Implementations • Evaluation • Key Techniques • Term Weighting • Query Modification • Other Techniques and Applications • Conclusion
History • Starts around 3000 BC with the Sumerians • Major IR developments start in the 1950s and 1960s • 1945 – Vannevar Bush, "As We May Think" • 1950s – H. P. Luhn • 1960s – • SMART system – Gerard Salton • Cranfield evaluations – Cyril Cleverdon • 1970s & 1980s – • Various models for document retrieval, tested on small text collections • 1992 – • TREC – Text Retrieval Conference • Other fields such as retrieval of spoken information, non-English language retrieval, and information filtering • Modern textual IR – WWW search, 1996–1998
Models & Implementations • IR systems • Boolean systems • Ranked Retrieval Systems • Models • Vector space model • Probabilistic Model • Inference Network Model • Implementation
Models & Implementations.. Vector space model • Each word in the vocabulary is an independent dimension • Documents and queries are vectors in this high-dimensional space • Term weights are non-negative, so all vectors lie in the positive quadrant • Numeric similarity between query vector and document vector – commonly the cosine of the angle between them
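The cosine similarity described on this slide can be sketched in a few lines. This is a minimal illustration, not the SMART implementation: it assumes raw term counts as weights and whitespace tokenization, and the helper names `to_vector` and `cosine` are made up for the example.

```python
import math
from collections import Counter

def to_vector(text):
    """Map text to a sparse term-count vector (one dimension per word)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_u * norm_v)

# Toy query and documents (assumed data, for illustration only).
query = to_vector("information retrieval")
doc_a = to_vector("modern information retrieval overview")
doc_b = to_vector("cooking with garlic")

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    [("doc_a", cosine(query, doc_a)), ("doc_b", cosine(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because all weights are non-negative, the cosine always falls between 0 and 1, so documents sharing no terms with the query score exactly 0.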
Models & Implementations.. Probabilistic Model – Probabilistic Ranking Principle (PRP) • Documents are ranked by decreasing probability of their relevance to a query • Maron and Kuhns – 1960 • Probability of relevance for doc D, by Bayes' rule: $P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}$
Models & Implementations.. Assumptions:
Models & Implementations.. • Inference Network Model • Retrieval is modeled as an inference process in an inference network • A document instantiates a term with a certain strength, and credit from multiple terms is accumulated to score the document • Strength of instantiation of a term – can be treated as the term's weight in the document • In its simplest form, document ranking in this model is similar to ranking in the vector space or probabilistic models
Models & Implementations.. • Implementation • Inverted list – for each term, the list of documents that contain it • Stop words – very common words are dropped from the index • Stemming – small gains for English, more effective for languages with many word inflections, e.g. German • Multiword phrases • Techniques to generate the list of phrases – linguistic, statistical
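The inverted list and stop-word ideas on this slide can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the stop-word set is a tiny made-up sample, tokenization is a plain whitespace split, there is no stemming, and the function names are hypothetical.

```python
from collections import defaultdict

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}

def build_inverted_index(docs):
    """Map each non-stop-word term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    return index

def search_and(index, query):
    """Boolean AND search: return doc ids containing every query term."""
    terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

# Toy collection (assumed data).
docs = {
    1: "the history of information retrieval",
    2: "evaluation of retrieval systems",
    3: "the vector space model",
}
index = build_inverted_index(docs)
```

With this index, a query only touches the postings of its own terms instead of scanning every document, which is the reason inverted lists are the usual implementation choice.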
Evaluation • Objective evaluation • Cranfield tests • Measures of search effectiveness – • Recall – proportion of relevant documents retrieved by the system • Precision – proportion of the retrieved documents that are relevant • Average precision – average of the precision values at different recall points
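The three measures above can be computed directly from a ranked result list and a set of relevance judgments. The sketch below assumes binary relevance and uses the common non-interpolated form of average precision (precision at the rank of each relevant document, averaged over all relevant documents); other variants exist, and the function names are illustrative.

```python
def precision_recall(ranked, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved_relevant = sum(1 for d in ranked if d in relevant)
    precision = retrieved_relevant / len(ranked) if ranked else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average the precision measured at the rank of each relevant hit.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank  # precision at this recall point
    return total / len(relevant) if relevant else 0.0

# Toy judgments (assumed data): d5 is relevant but never retrieved.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d5"}
p, r = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents are found; average precision additionally rewards placing the relevant documents near the top.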
Key Techniques • Term weighting • Term frequency (tf) – • Raw tf – non-optimal • Dampened tf (logarithmic tf) – works better • Okapi weighting • Pivoted normalization weighting • Document frequency – terms that occur in fewer documents get more weight • Document length – weights are normalized so long documents are not unduly favored • Query modification/expansion via relevance feedback
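How the three factors combine can be sketched as below. This is a simplified illustration, not the exact Okapi or pivoted normalization formula: it assumes a single logarithm for dampening, `ln((N + 1) / df)` for the document-frequency factor, and a pivoted length normalizer with an assumed slope of 0.2. All function and parameter names are hypothetical.

```python
import math

def dampened_tf(tf):
    """Logarithmic tf: grows much more slowly than the raw count."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def idf(num_docs, df):
    """Document-frequency factor: rarer terms receive a higher weight."""
    return math.log((num_docs + 1) / df)

def pivoted_length_norm(doc_len, avg_doc_len, slope=0.2):
    """Length normalizer: equals 1 for a document of average length."""
    return (1.0 - slope) + slope * (doc_len / avg_doc_len)

def term_weight(tf, df, num_docs, doc_len, avg_doc_len, slope=0.2):
    """Combine tf, document frequency, and document length into one weight."""
    if tf == 0 or df == 0:
        return 0.0
    norm = pivoted_length_norm(doc_len, avg_doc_len, slope)
    return dampened_tf(tf) * idf(num_docs, df) / norm

# Same tf and document length, different document frequency (assumed numbers).
rare = term_weight(tf=3, df=2, num_docs=1000, doc_len=100, avg_doc_len=100)
common = term_weight(tf=3, df=500, num_docs=1000, doc_len=100, avg_doc_len=100)
```

The slope controls how strongly long documents are penalized; the value used here is only a placeholder and would normally be tuned on a test collection.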
Key Techniques • Query modification/expansion • Adding synonyms – suffers from lack of query context • Relevance feedback – Rocchio in 1965 • Uses the user's relevance judgments to modify the query • Quite effective • Pseudo-feedback for short user queries • The top few docs retrieved by the initial user query are assumed 'relevant', and relevance feedback is run on them to generate a new query
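Relevance feedback in the Rocchio style moves the query vector toward the judged-relevant documents and away from the non-relevant ones. The sketch below uses the commonly quoted form of the update; the `alpha`, `beta`, and `gamma` defaults are illustrative assumptions, and the sparse-dict representation and function names are made up for the example.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse vectors (dicts of term -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, w in vec.items():
            total[term] += w
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query: alpha*q + beta*rel_centroid - gamma*nonrel_centroid."""
    new_query = defaultdict(float)
    for t, w in query.items():
        new_query[t] += alpha * w
    for t, w in centroid(relevant).items():
        new_query[t] += beta * w
    for t, w in centroid(non_relevant).items():
        new_query[t] -= gamma * w
    # Terms pushed to zero or below are dropped from the new query.
    return {t: w for t, w in new_query.items() if w > 0}

# Toy judged documents (assumed data).
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 2.0}, {"jaguar": 1.0, "engine": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 3.0}]
expanded = rocchio(query, relevant, non_relevant)
```

For pseudo-feedback, the same function can be called with the top few documents from the initial ranking passed as `relevant` and an empty list as `non_relevant`, since no user judgments are available.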
Other Techniques and Applications • Cluster hypothesis – documents that cluster together have a similar relevance profile for a query • Natural Language Processing (NLP) – • So far has added little to retrieval effectiveness beyond statistical techniques • Other IR fields besides document ranking • Information Filtering (IF), Topic Detection and Tracking (TDT), speech retrieval, cross-language retrieval
Conclusion • About 40 years of experience in IR • Statistical techniques have so far proven the most effective