
Information retrieval: overview


Presentation Transcript


  1. Information retrieval: overview

  2. Information Retrieval and Text Processing • Huge literature dating back to the 1950s! • SIGIR/TREC - home for much of this • Readings: • Salton, Wong, Yang, A Vector Space Model for Automatic Indexing, CACM Nov 75 V18N11 • Turtle, Croft, Inference Networks for Document Retrieval, ???, [OPTIONAL]

  3. IR/TP applications • Search • Filtering • Summarization • Classification • Clustering • Information extraction • Knowledge management • Author identification • …and more…

  4. Types of search • Recall -- finding documents one knows exist, e.g., an old e-mail message or RFC • Discovery -- finding “interesting” documents given a high-level goal • Classic IR search is focused on discovery

  5. Classic discovery problem • Corpus: fixed collection of documents, typically “nice” docs (e.g., NYT articles) • Problem: retrieve documents relevant to user’s information need

  6. Classical search • Flow from the slide’s diagram: Task → (Conception) → Info Need → (Formulation) → Query → Search against the Corpus → Results → (Refinement, back to the Query)

  7. Definitions • Task: e.g., write a Web crawler • Information need: perception of the documents needed to accomplish the task, e.g., Web specs • Query: sequence of characters given to a search engine that one hopes will return the desired documents

  8. Conception • Translating the task into an information need • Mis-conception: identifying too little (tips on high-bandwidth DNS lookups) and/or too much (TCP spec) as relevant to the task • Sometimes a little extra breadth in the results can tip the user off to the need to refine the info need, but there is not much research into dealing with this automatically

  9. Translation • Translating the info need into the query syntax of a particular search engine • Mis-translation: getting this wrong • Operator error (is “a b” == a&b or a|b ?) • Polysemy -- same word, different meanings • Synonymy -- different words, same meaning • Automation: “NLP”, “easy syntax”, “query expansion” (sketched below), “Q&A”
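
A minimal sketch of the “query expansion” idea mentioned above, assuming a hand-built synonym table; the table, query, and function name are hypothetical, not part of the original slides.

```python
# Hypothetical synonym table for naive query expansion.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms):
    """Return the original query terms plus any known synonyms, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_query(["fast", "car"]))
# ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']
```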

  10. Refinement • Modification of the query, typically in light of particular results, to better meet the info need • Lots of work on refining the query automatically (often with some input from the user, e.g., “relevance feedback”)

  11. Precision and recall • Classic metrics of search-result “goodness” • Recall = fraction of all good docs retrieved • |relevant results| / |all relevant docs in corpus| • Precision = fraction of results that are good • |relevant results| / |result-set size|
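
The two definitions above translate directly into code; a minimal sketch with a hypothetical result set and relevance judgments:

```python
def precision_recall(results, relevant):
    """Precision = |relevant ∩ results| / |results|; Recall = |relevant ∩ results| / |relevant|."""
    results, relevant = set(results), set(relevant)
    hits = len(results & relevant)
    precision = hits / len(results) if results else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 5 returned docs are relevant,
# out of 4 relevant docs in the whole corpus.
print(precision_recall(results=["d1", "d2", "d3", "d4", "d5"],
                       relevant=["d1", "d3", "d5", "d9"]))  # (0.6, 0.75)
```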

  12. Precision and recall • Recall/precision trade-off: • Return everything ==> great recall, bad precision • Return nothing ==> great precision, bad recall • Precision curves • Search engine produces a total ranking • Plot precision at 10%, 20%, ..., 100% recall (sketched below)
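
A rough sketch of such a precision curve, assuming a hypothetical total ranking and relevance judgments; it records precision the first time recall reaches each 10% level, which is a simplification of the interpolated curves used in practice.

```python
def precision_at_recall_levels(ranking, relevant):
    """Precision recorded the first time recall reaches each 10% level,
    walking the engine's total ranking from the top."""
    relevant = set(relevant)
    curve, hits = {}, 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        recall, precision = hits / len(relevant), hits / rank
        for level in range(1, 11):
            target = level / 10
            if recall >= target and target not in curve:
                curve[target] = precision
    return curve

# Hypothetical ranking; the corpus contains two relevant documents, d3 and d1.
print(precision_at_recall_levels(["d3", "d7", "d1", "d9"], relevant=["d3", "d1"]))
# Recall levels up to 50% get precision 1.0; levels 60%-100% get precision 2/3.
```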

  13. Other metrics • Novelty / anti-redundancy • Information content of the result set is disjoint • Comprehensible • Returned documents can be understood by the user • Accurate / authoritative • Citation ranking!! • Freshness

  14. Classic search techniques • Boolean • Ranked boolean • Vector space • Probabilistic / Bayesian
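
For the first technique on the list, a minimal sketch of Boolean retrieval over an inverted index; the toy corpus and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Documents containing every query term (classic Boolean AND)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy corpus.
docs = {"d1": "web crawler design", "d2": "dns lookup tips", "d3": "web crawler politeness"}
index = build_inverted_index(docs)
print(boolean_and(index, ["web", "crawler"]))  # {'d1', 'd3'}
```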

  15. Term vector basics • Basic abstraction for information retrieval • Useful for measuring “semantic” similarity of text • A row in the slide’s table (not reproduced in this transcript) is a “term vector” • Columns are word stems and phrases • Trying to capture “meaning” (a hypothetical matrix of the same shape is sketched below)
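
The slide’s table does not survive in the transcript; a hypothetical term-document matrix of the same shape might look like this, with each row a document’s term vector and each column a word stem. The terms, documents, and counts are invented for illustration.

```python
# Hypothetical term-document matrix: rows are documents ("term vectors"),
# columns are word stems / phrases; entries are raw term counts.
TERMS = ["crawl", "index", "rank", "web"]

DOC_VECTORS = {
    "doc1": [2, 0, 1, 3],
    "doc2": [0, 4, 1, 0],
    "doc3": [1, 0, 2, 2],
}

for doc, vec in DOC_VECTORS.items():
    print(doc, dict(zip(TERMS, vec)))
```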

  16. Everything’s a vector!! • Documents are vectors • Document collections are vectors • Queries are vectors • Topics are vectors

  17. Cosine measurement of similarity • cos(E1, E2) = (E1 · E2) / (|E1| * |E2|) • Rank docs against queries, measure similarity of docs, etc. • In the slide’s example (from its table): • cos(doc1, doc2) ~ 1/3 • cos(doc1, doc3) ~ 2/3 • cos(doc2, doc3) ~ 1/2 • So: doc1 & doc3 are closest
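
A minimal sketch of the cosine formula above; the vectors are the hypothetical counts from the earlier sketch, not the slide’s own table, so the 1/3, 2/3, 1/2 values are not reproduced.

```python
import math

def cosine(v1, v2):
    """cos(v1, v2) = (v1 · v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical vectors (raw counts over ["crawl", "index", "rank", "web"]).
doc1, doc2, doc3 = [2, 0, 1, 3], [0, 4, 1, 0], [1, 0, 2, 2]
print(cosine(doc1, doc2), cosine(doc1, doc3), cosine(doc2, doc3))
# Under these made-up counts, doc1 and doc3 come out closest.
```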

  18. Weighting of terms in vectors • Salton’s “TF*IDF” • TF = term frequency in document • DF = doc frequency of term (# docs with term) • IDF = inverse doc freq. = 1/DF • Weight of term = TF * IDF • “Importance” of term determined by: • Count of term in doc (high ==> important) • Number of docs with term (low ==> important)
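
A sketch of TF*IDF weighting following the slide’s simplified IDF = 1/DF (log(N/DF) is the more common variant in practice); the corpus is hypothetical.

```python
from collections import Counter

def tf_idf_vectors(docs):
    """Weight each term by TF * IDF, with IDF = 1/DF as on the slide."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()                      # number of docs containing each term
    for terms in tokenized.values():
        df.update(set(terms))
    return {
        d: {t: tf * (1.0 / df[t]) for t, tf in Counter(terms).items()}
        for d, terms in tokenized.items()
    }

# Hypothetical corpus: "crawler" appears in only one doc, so it gets the highest weight there.
docs = {"d1": "web crawler crawler", "d2": "web standards", "d3": "web cache"}
print(tf_idf_vectors(docs)["d1"])  # {'web': 0.33..., 'crawler': 2.0}
```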

  19. Relevance-feedback in VSM • Rocchio formula: • Q’ = F[Q, Relevant, Irrelevant] • Where F is a weighted sum, such as: Q’[t] = a*Q[t] + b*Σ_i R_i[t] + c*Σ_i I_i[t]
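
A minimal sketch of that weighted sum over sparse term-weight vectors; the coefficient values and the example vectors are hypothetical, with c chosen negative so the query moves away from terms in documents judged irrelevant.

```python
def rocchio(query, relevant, irrelevant, a=1.0, b=0.75, c=-0.15):
    """Q'[t] = a*Q[t] + b*sum_i R_i[t] + c*sum_i I_i[t], per the slide's formula.
    Vectors are sparse dicts mapping term -> weight; coefficients are hypothetical."""
    terms = set(query) | {t for d in relevant for t in d} | {t for d in irrelevant for t in d}
    return {
        t: a * query.get(t, 0.0)
           + b * sum(d.get(t, 0.0) for d in relevant)
           + c * sum(d.get(t, 0.0) for d in irrelevant)
        for t in terms
    }

# Hypothetical query and judged documents (term -> weight).
print(rocchio({"web": 1.0, "crawler": 1.0},
              relevant=[{"crawler": 0.8, "politeness": 0.5}],
              irrelevant=[{"web": 0.9, "design": 0.4}]))
```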

  20. Remarks on VSM • Principled way of solving many IR/text processing problems, not just search • Tons of variations on VSM • Different term weighting schemes • Different similarity formulas • Normalization itself is a huge sub-industry

  21. All of this goes out the window on the Web • Very small, unrefined queries • Recall not an issue • Quality is the issue (want the most relevant) • Precision-at-ten matters (how many total losers?) • Scale precludes heavy VSM techniques • Corpus assumptions (e.g., unchanging, uniform quality) do not hold • “Adversarial IR” - a new challenge on the Web • Still, VSM is an important tool for Web Archeology
