Information Retrieval and Text Processing • Huge literature dating back to the 1950s! • SIGIR/TREC - home for much of this • Readings: • Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, CACM, Nov. 1975, Vol. 18, No. 11 • Turtle, Croft, Inference Networks for Document Retrieval, ???, [OPTIONAL]
IR/TP applications • Search • Filtering • Summarization • Classification • Clustering • Information extraction • Knowledge management • Author identification • …and more...
Types of search • Recall -- finding documents one knows exist, e.g., an old e-mail message or RFC • Discovery -- finding “interesting” documents given a high-level goal • Classic IR search is focused on discovery
Classic discovery problem • Corpus: fixed collection of documents, typically “nice” docs (e.g., NYT articles) • Problem: retrieve documents relevant to user’s information need
Classical search pipeline: Task → (conception) → Info Need → (formulation) → Query → (search against Corpus) → Results, with refinement looping back to the Query
Definitions • Task: e.g., write a Web crawler • Information need: perception of the documents needed to accomplish the task, e.g., Web specs • Query: sequence of characters given to a search engine that one hopes will return the desired documents
Conception • Translating the task into an information need • Mis-conception: identifying too little (tips on high-bandwidth DNS lookups) and/or too much (TCP spec) as relevant to the task • Sometimes a little extra breadth in the results can tip the user off to the need to refine the info need, but there is not much research into handling this automatically
Translation • Translating the info need into the query syntax of a particular search engine • Mis-translation: getting this wrong • Operator error (is “a b” == a&b or a|b ?) • Polysemy -- same word, different meanings • Synonymy -- different words, same meaning • Automation: “NLP”, “easy syntax”, “query expansion”, “Q&A”
Refinement • Modification of the query, typically in light of particular results, to better meet the info need • Lots of work on refining queries automatically (often with some input from the user, e.g., “relevance feedback”)
Precision and recall • Classic metrics of search-result “goodness” • Recall = fraction of all good docs retrieved • |relevant results| / |all relevant docs in corpus| • Precision = fraction of results that are good • |relevant results| / |result-set size|
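A minimal sketch of these two metrics in Python, assuming a retrieved result list and the set of all relevant documents in the corpus are known (the doc ids are illustrative):

```python
def precision_recall(results, relevant):
    """results: list of retrieved doc ids; relevant: set of all relevant doc ids in the corpus."""
    retrieved_relevant = sum(1 for d in results if d in relevant)
    precision = retrieved_relevant / len(results) if results else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; the corpus holds 6 relevant docs.
print(precision_recall(["d1", "d2", "d3", "d4"],
                       {"d1", "d2", "d3", "d7", "d8", "d9"}))   # -> (0.75, 0.5)
```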
Precision and recall • Recall/precision trade-off: • Return everything ==> great recall, bad precision • Return nothing ==> great precision, bad recall • Precision curves • Search engine produces total ranking • Plot precision at 10%, 20%, .., 100% recall
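One common way to compute the curve from the total ranking is interpolation: at each recall level, take the best precision achieved at or beyond that level. A sketch, with an illustrative ranking and relevance judgments:

```python
def precision_at_recall_levels(ranking, relevant,
                               levels=(0.1, 0.2, 0.3, 0.4, 0.5,
                                       0.6, 0.7, 0.8, 0.9, 1.0)):
    points = []                       # (recall, precision) after each relevant hit
    hits = 0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))
    # interpolated precision: best precision achieved at or beyond each recall level
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

ranking = ["d3", "d9", "d1", "d4", "d2"]
relevant = {"d1", "d2", "d3"}
print(precision_at_recall_levels(ranking, relevant))
```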
Other metrics • Novelty / anti-redundancy • Information content of the result set is disjoint • Comprehensibility • Returned documents can be understood by the user • Accurate / authoritative • Citation ranking!! • Freshness
Classic search techniques • Boolean • Ranked boolean • Vector space • Probabilistic / Bayesian
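For the boolean case, a minimal sketch of retrieval over an inverted index; the tiny corpus and whitespace tokenization here are illustrative only:

```python
from collections import defaultdict

docs = {
    "d1": "tcp congestion control",
    "d2": "dns lookup performance",
    "d3": "tcp retransmission and dns caching",
}

index = defaultdict(set)                  # term -> set of doc ids (postings list)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def AND(*terms):
    return set.intersection(*(index[t] for t in terms))

def OR(*terms):
    return set.union(*(index[t] for t in terms))

print(AND("tcp", "dns"))   # {'d3'}
print(OR("tcp", "dns"))    # {'d1', 'd2', 'd3'}
```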
Term vector basics • Basic abstraction for information retrieval • Useful for measuring “semantic” similarity of text • Each row of a document-by-term table is a “term vector” • Columns are word stems and phrases • Trying to capture “meaning”
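A sketch of building such term vectors as sparse term-to-count maps (stemming and phrase detection are omitted; the sample text is illustrative):

```python
from collections import Counter

def term_vector(text):
    """Map each term to its count in the document (one row of the document-by-term table)."""
    return Counter(text.lower().split())

doc1 = term_vector("web crawler fetches web pages")
print(doc1)   # Counter({'web': 2, 'crawler': 1, 'fetches': 1, 'pages': 1})
```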
Everything’s a vector!! • Documents are vectors • Document collections are vectors • Queries are vectors • Topics are vectors
Cosine measurement of similarity • cos(E1, E2) = E1 · E2 / (|E1| * |E2|) • Rank docs against queries, measure similarity of docs, etc. • In the example: • cos(doc1, doc2) ~ 1/3 • cos(doc1, doc3) ~ 2/3 • cos(doc2, doc3) ~ 1/2 • So: docs 1 and 3 are closest
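A sketch of the cosine measure over sparse term vectors represented as dicts from term to weight:

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = v1 . v2 / (|v1| * |v2|) over sparse term-weight dicts."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# e.g., rank documents against a query vector:
# ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```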
Weighting of terms in vectors • Salton’s “TF*IDF” • TF = term frequency in document • DF = doc frequency of term (# docs with term) • IDF = inverse doc freq. = 1/DF • Weight of term = TF * IDF • “Importance” of term determined by: • Count of term in doc (high ==> important) • Number of docs with term (low ==> important)
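A sketch of this TF*IDF weighting over a small illustrative corpus; the raw 1/DF form follows the slide (log-scaled IDF is a common variant):

```python
from collections import Counter

corpus = {
    "d1": "tcp congestion control",
    "d2": "dns lookup performance",
    "d3": "tcp retransmission and dns caching",
}

tf = {d: Counter(text.split()) for d, text in corpus.items()}   # term counts per doc
df = Counter(term for counts in tf.values() for term in counts)  # docs containing each term

def tfidf(doc_id):
    """Weight of term = TF * (1 / DF)."""
    return {term: count * (1.0 / df[term]) for term, count in tf[doc_id].items()}

print(tfidf("d1"))   # 'tcp' is down-weighted (appears in 2 docs); the others get weight 1.0
```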
Relevance-feedback in VSM • Rocchio formula: • Q’ = F[Q, Relevant, Irrelevant] • Where F is a weighted sum, such as: Q’[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t] (typically a, b > 0 and c < 0, so terms from irrelevant documents are pushed out of the query)
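A sketch of this Rocchio update over sparse term-weight dicts; the constants a, b, c below are illustrative defaults, with c negative so terms from judged-irrelevant documents are down-weighted:

```python
from collections import defaultdict

def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=-0.25):
    """Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t] over term-weight dicts."""
    q_new = defaultdict(float)
    for t, w in query.items():
        q_new[t] += a * w
    for vec in relevant:
        for t, w in vec.items():
            q_new[t] += b * w
    for vec in irrelevant:
        for t, w in vec.items():
            q_new[t] += c * w
    return dict(q_new)

# Usage: feed the reweighted query back into cosine ranking.
# q2 = rocchio(query_vec, [doc_vecs["d1"]], [doc_vecs["d2"]])
```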
Remarks on VSM • Principled way of solving many IR/text processing problems, not just search • Tons of variations on VSM • Different term weighting schemes • Different similarity formulas • Normalization itself is a huge sub-industry
Much of this goes out the window on the Web • Very small, unrefined queries • Recall not an issue • Quality is the issue (want the most relevant) • Precision-at-ten matters (how many total losers?) • Scale precludes heavy VSM techniques • Corpus assumptions (e.g., unchanging, uniform quality) do not hold • “Adversarial IR” - a new challenge on the Web • Still, VSM is an important tool for Web Archeology