Information Retrieval and Text Processing • Huge literature dating back to the 1950’s! • SIGIR/TREC - home for much of this • Readings: • Salton, Wong, and Yang, “A Vector Space Model for Automatic Indexing”, CACM, Nov. 1975, Vol. 18, No. 11 • Turtle and Croft, “Inference Networks for Document Retrieval”, SIGIR 1990 [OPTIONAL]
IR/TP applications • Search • Filtering • Summarization • Classification • Clustering • Information extraction • Knowledge management • Author identification • …and more
Types of search • Recall -- finding documents one knows exist, e.g., an old e-mail message or an RFC • Discovery -- finding “interesting” documents given a high-level goal • Classic IR search is focused on discovery
Classic discovery problem • Corpus: fixed collection of documents, typically “nice” docs (e.g., NYT articles) • Problem: retrieve documents relevant to user’s information need
Classical search • Task → (conception) → Info need → (formulation) → Query → search over corpus → Results → (refinement feeds back into the query)
Definitions • Task: e.g., write a Web crawler • Information need: perception of the documents needed to accomplish the task, e.g., the relevant Web specs • Query: the character string given to a search engine, which one hopes will return the desired documents
Conception • Translating the task into an information need • Mis-conception: identifying too little (tips on high-bandwidth DNS lookups) and/or too much (the TCP spec) as relevant to the task • Sometimes a little extra breadth in the results can tip the user off to the need to refine the information need, but there is not much research on handling this automatically
Translation • Translating the info need into the query syntax of a particular search engine • Mis-translation: getting this wrong • Operator error (is “a b” == a&b or a|b?) • Polysemy -- same word, different meanings • Synonymy -- different words, same meaning • Automation: “NLP”, “easy syntax”, “query expansion”, “Q&A” (see the sketch below)
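A minimal sketch of one of these automation ideas, query expansion, using a hand-built synonym table (the table and the example query are invented for illustration; a real system might consult a thesaurus or co-occurrence statistics):

```python
# Minimal query-expansion sketch: expand each query term with
# hand-picked synonyms (a stand-in for a real thesaurus).
SYNONYMS = {
    "crawler": ["spider", "robot"],
    "spec": ["specification", "rfc"],
}

def expand_query(query):
    """Return the original terms plus any known synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("web crawler spec"))
# ['web', 'crawler', 'spider', 'robot', 'spec', 'specification', 'rfc']
```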
Refinement • Modification of the query, typically in light of particular results, to better meet the info need • Lots of work on refining queries automatically (often with some input from the user, e.g., “relevance feedback”)
Precision and recall • Classic metrics of search-result “goodness” • Recall = fraction of all good docs retrieved • |relevant results| / |all relevant docs in corpus| • Precision = fraction of results that are good • |relevant results| / |result-set size|
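A small sketch of both metrics computed from a result set and a relevance-judged corpus (the document IDs and judgments are invented):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|
    Recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result set and ground truth
print(precision_recall(["d1", "d3", "d7"], ["d1", "d2", "d3", "d9"]))
# (0.666..., 0.5)
```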
Precision and recall • Recall/precision trade-off: • Return everything ==> great recall, bad precision • Return nothing ==> great precision, bad recall • Precision curves • Search engine produces total ranking • Plot precision at 10%, 20%, …, 100% recall
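A sketch of tracing such a curve from a total ranking: walk down the ranked list and record precision at each point where recall increases (the ranking and relevance judgments are invented):

```python
def precision_curve(ranking, relevant):
    """(recall, precision) points observed as the ranking is traversed."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

# Hypothetical ranking with two relevant docs, d1 and d3
print(precision_curve(["d3", "d5", "d1", "d8"], ["d1", "d3"]))
# [(0.5, 1.0), (1.0, 0.666...)]
```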
Other metrics • Novelty / anti-redundancy • Information content of the result set is disjoint • Comprehensible • Returned documents can be understood by the user • Accurate / authoritative • Citation ranking!! • Freshness
Classic search techniques • Boolean • Ranked boolean • Vector space • Probabilistic / Bayesian
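A minimal sketch of the simplest of these, boolean retrieval over an inverted index (the toy documents are invented):

```python
from collections import defaultdict

# Build an inverted index: term -> set of doc IDs containing it.
docs = {
    "d1": "vector space model for indexing",
    "d2": "boolean retrieval with an inverted index",
    "d3": "probabilistic retrieval model",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def and_query(*terms):
    """AND query: intersect the posting sets of all terms."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(and_query("retrieval", "model"))       # {'d3'}
print(index["retrieval"] | index["model"])   # OR query: {'d1', 'd2', 'd3'}
```

Ranked boolean, vector space, and probabilistic models all start from this same term-to-document mapping but score and order the matches rather than returning a flat set.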
Term vector basics • Basic abstraction for information retrieval • Useful for measuring “semantic” similarity of text • A row of a term-document table is a “term vector” • Columns are word stems and phrases • Trying to capture “meaning”
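A sketch of building raw term-count vectors over a small shared vocabulary (the documents are invented; stemming and phrase detection are omitted):

```python
from collections import Counter

docs = {
    "doc1": "web crawler fetches web pages",
    "doc2": "crawler obeys robots exclusion",
    "doc3": "search engine indexes web pages",
}

# Fixed column order: the vocabulary across the collection.
vocab = sorted({term for text in docs.values() for term in text.split()})

def term_vector(text):
    """One row of the term-document table: a count per vocabulary term."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

vectors = {d: term_vector(text) for d, text in docs.items()}
print(vocab)
print(vectors["doc1"])  # doc1 row: [1, 0, 0, 1, 0, 0, 1, 0, 0, 2]
```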
Everything’s a vector!! • Documents are vectors • Document collections are vectors • Queries are vectors • Topics are vectors
Cosine measurement of similarity • cos(E1, E2) = (E1 · E2) / (|E1| * |E2|) • Rank docs against queries, measure similarity of docs, etc. • In the example: • cos(doc1, doc2) ~ 1/3 • cos(doc1, doc3) ~ 2/3 • cos(doc2, doc3) ~ 1/2 • So: doc1 and doc3 are closest
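A sketch of the cosine measure itself; the vectors below are invented, and in practice they would be the TF*IDF-weighted vectors described on the next slide:

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)"""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical 4-term vectors for a document and a query
doc = [2, 0, 1, 3]
query = [1, 0, 0, 1]
print(round(cosine(doc, query), 3))  # ~0.945
```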
Weighting of terms in vectors • Salton’s “TF*IDF” • TF = term frequency in document • DF = doc frequency of term (# docs with term) • IDF = inverse doc freq. = 1/DF • Weight of term = TF * IDF • “Importance” of term determined by: • Count of term in doc (high ==> important) • Number of docs with term (low ==> important)
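A sketch of this weighting in exactly the form stated above, TF times 1/DF; many real systems instead log-scale the IDF, and the toy documents are invented:

```python
from collections import Counter

docs = {
    "d1": "web crawler fetches web pages",
    "d2": "crawler obeys robots exclusion",
    "d3": "search engine indexes web pages",
}

# Document frequency: number of docs containing each term.
df = Counter()
for text in docs.values():
    df.update(set(text.split()))

def tfidf_vector(text):
    """Weight of term = TF (count in this doc) * IDF (1 / doc frequency)."""
    tf = Counter(text.split())
    return {term: count * (1.0 / df[term]) for term, count in tf.items()}

print(tfidf_vector(docs["d1"]))
# 'web' occurs twice in d1 but in 2 of 3 docs overall: weight 2 * 1/2 = 1.0
```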
Relevance-feedback in VSM • Rocchio formula: • Q’ = F[Q, Relevant, Irrelevant] • Where F is a weighted sum, such as: Q’[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t] (c typically negative, pushing the query away from irrelevant docs)
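A sketch of one Rocchio-style update on dictionary-keyed term vectors, following the weighted-sum form above with a positive weight on relevant docs and a negative one on irrelevant docs (the coefficients and vectors are invented):

```python
from collections import defaultdict

def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=-0.25):
    """Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t], with c negative."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += a * w
    for vec in relevant:
        for t, w in vec.items():
            new_q[t] += b * w
    for vec in irrelevant:
        for t, w in vec.items():
            new_q[t] += c * w
    # Drop terms whose weight was pushed to zero or below.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"crawler": 1.0}
print(rocchio(q,
              relevant=[{"crawler": 0.5, "robots": 1.0}],
              irrelevant=[{"dns": 1.0}]))
# {'crawler': 1.375, 'robots': 0.75}
```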
Remarks on VSM • Principled way of solving many IR/text processing problems, not just search • Tons of variations on VSM • Different term weighting schemes • Different similarity formulas • Normalization itself is a huge sub-industry
All of this goes out the window on the Web • Very small, unrefined queries • Recall not an issue • Quality is the issue (want the most relevant) • Precision-at-ten matters (how many total losers?) • Scale precludes heavy VSM techniques • Corpus assumptions (e.g., unchanging, uniform quality) do not hold • “Adversarial IR” -- a new challenge on the Web • Still, VSM is an important tool for Web Archeology