CS246 Basic Information Retrieval
Today’s Topic • Basic Information Retrieval (IR) • Bag of words assumption • Boolean Model • Inverted index • Vector-space model • Document-term matrix • TF-IDF vector and cosine similarity • Phrase queries • Spell correction
Information-Retrieval System • Information source: Existing text documents • Keyword-based/natural-language query • The system returns best-matching documents given the query • Challenge • Both queries and data are “fuzzy” • Unstructured text and “natural language” query • What documents are good matches for a query? • Computers do not “understand” the documents or the queries • Developing a computerizable “model” is essential to implement this approach
Bag of Words: Major Simplification • Consider each document as a “bag of words” • “bag” vs “set” • Ignore word ordering, but keep word count • Consider queries as bags of words as well • Great oversimplification, but works adequately in many cases • “John loves only Jane” vs “Only John loves Jane” • The limitation still shows up on current search engines • Still, how do we match documents and queries?
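A minimal sketch of the bag-of-words idea, assuming a naive tokenizer (lowercase, split on whitespace). The helper name `bag_of_words` is made up for illustration. It shows that the two example sentences become indistinguishable once word order is dropped.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("John loves only Jane")
b = bag_of_words("Only John loves Jane")

# Word order is discarded, so both sentences map to the same bag.
print(a == b)
```

A `Counter` keeps the word counts (a "bag"), whereas a plain `set` would drop them.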
Boolean Model • Return all documents that contain the words in the query • Simplest model for information retrieval • No notion of “ranking” • A document is either a match or non-match • Q: How to find and return matching documents? • Basic algorithm? • Useful data structure?
Inverted Index • Allows quick lookup of the ids of documents containing a particular word • Q: How can we use this to answer “UCLA Physics”? [Figure: a lexicon/dictionary (DIC) with entries Stanford, UCLA, MIT, …, each pointing to its postings list PL(Stanford), PL(UCLA), PL(MIT), …]
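One common way to answer a two-word Boolean query such as “UCLA Physics” is to intersect the two postings lists. The sketch below assumes each postings list is kept sorted by docid; the `index` contents are hypothetical.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-intersect two postings lists that are sorted by docid."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical postings lists, for illustration only.
index = {"ucla": [1, 4, 7, 9], "physics": [2, 4, 9, 12]}
result = intersect(index["ucla"], index["physics"])
print(result)
```

Because both lists are sorted, the intersection takes time linear in their combined length.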
Size of Inverted Index (1) • 100M docs, 10KB/doc, 1000 unique words/doc, 10B/word, 4B/docid • Q: Document collection size? • Q: Inverted index size? • Heaps’ law: Vocabulary size = k n^b, where n is the number of words in the collection, with 30 < k < 100 and 0.4 < b < 1 • k = 50 and b = 0.5 are a good rule of thumb
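A back-of-the-envelope sketch of the two questions above. It rests on several assumptions not stated on the slide: 1KB is taken as 1000 bytes, n in Heaps' law is the total number of word occurrences in the collection, a postings entry holds only a 4-byte docid, and a dictionary entry holds only the 10-byte word (no pointers).

```python
n_docs = 100_000_000
bytes_per_doc = 10_000        # 10KB per document, assuming 1KB = 1000 bytes
unique_words_per_doc = 1000
bytes_per_word = 10
bytes_per_docid = 4

# Document collection size: 100M docs x 10KB.
collection_bytes = n_docs * bytes_per_doc

# Postings lists: one docid per (document, unique word) pair.
postings_bytes = n_docs * unique_words_per_doc * bytes_per_docid

# Dictionary: vocabulary size from Heaps' law with k = 50, b = 0.5.
total_words = n_docs * (bytes_per_doc // bytes_per_word)
k, b = 50, 0.5
vocabulary_size = k * total_words ** b
dictionary_bytes = vocabulary_size * bytes_per_word

print(f"collection: {collection_bytes / 1e12:.1f} TB")
print(f"postings:   {postings_bytes / 1e9:.0f} GB")
print(f"dictionary: {dictionary_bytes / 1e6:.0f} MB")
```

Under these assumptions the postings lists dominate the index size by several orders of magnitude.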
Size of Inverted Index (2) • Q: Between dictionary and postings lists, which one is larger? • Q: Lengths of postings lists? • Zipf’s law: collection term frequency ∝ 1/frequency rank • Q: How do we construct an inverted index?
Inverted Index Construction
C: set of all documents (corpus)
DIC: dictionary of inverted index
PL(w): postings list of word w
For each document d ∈ C:
    Extract all words in content(d) into W
    For each w ∈ W:
        If w ∉ DIC, then add w to DIC
        Append id(d) to PL(w)
Q: What if the index is larger than main memory?
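The construction loop above can be sketched in Python as follows, assuming the whole index fits in memory and that word extraction is a naive lowercase whitespace split. The function name and the toy corpus are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(corpus: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(list)              # plays the role of DIC plus PL(w)
    for doc_id in sorted(corpus):          # ascending ids keep postings sorted
        words = set(corpus[doc_id].lower().split())   # W: unique words of d
        for w in words:
            index[w].append(doc_id)        # append id(d) to PL(w)
    return dict(index)

# Toy corpus, for illustration only.
corpus = {1: "UCLA physics", 2: "Stanford physics", 3: "UCLA"}
index = build_inverted_index(corpus)
print(index["ucla"], index["physics"])
```

Here the `defaultdict` handles the "add w to DIC if missing" step implicitly.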
Inverted-Index Construction • For large text corpus • Block-sorted based construction • Partition and merge
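A rough sketch of the partition-and-merge idea. In a real system each sorted block would be written to disk and the merge would stream from files; here in-memory lists stand in for the on-disk blocks, which is a simplifying assumption. The function names are made up for illustration.

```python
import heapq
from itertools import groupby

def make_block(docs):
    """Build the sorted (word, docid) pairs for one block that fits in memory."""
    pairs = {(w, doc_id) for doc_id, text in docs for w in text.lower().split()}
    return sorted(pairs)

def merge_blocks(blocks):
    """Merge the sorted blocks and group docids into one postings list per word."""
    merged = heapq.merge(*blocks)          # streams pairs in sorted order
    return {w: [doc_id for _, doc_id in group]
            for w, group in groupby(merged, key=lambda pair: pair[0])}

# Two toy blocks, as if the corpus had been partitioned into two parts.
blocks = [
    make_block([(1, "a b"), (2, "b c")]),
    make_block([(3, "a c")]),
]
merged_index = merge_blocks(blocks)
print(merged_index)
```

`heapq.merge` only needs the current head of each block, which is why the same pattern works when blocks are far larger than memory.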
Evaluation: Precision and Recall • Q: Are all matching documents what users want? • Basic idea: a model is good if it returns a document if and only if it is “relevant”. • R: set of “relevant” documents • D: set of documents returned by a model
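The slide text does not spell out the formulas, so the sketch below uses the standard definitions: precision = |R ∩ D| / |D| and recall = |R ∩ D| / |R|. Returning 0.0 for an empty set is a convention chosen here, not something taken from the slides.

```python
def precision_recall(relevant: set, returned: set) -> tuple[float, float]:
    """Compute (precision, recall) for relevant set R and returned set D."""
    hits = len(relevant & returned)                       # |R ∩ D|
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 4 relevant documents, 3 returned, 2 of them relevant.
p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5})
print(p, r)
```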
Vector-Space Model • Main problem of Boolean model • Too many matching documents when the corpus is large • Any way to “rank” documents? • Matrix interpretation of Boolean model • Document – Term matrix • Boolean 0 or 1 value for each entry • Basic idea • Assign real-valued weight to the matrix entries depending on the importance of the term • “the” vs “UCLA” • Q: How should we assign the weights?
TF-IDF Vector • A term t is important for document d • If t appears many times in d, or • If t is a “rare” term • TF: term frequency • # occurrences of t in d • IDF: inverse document frequency • Based on DF, the # documents containing t • TF-IDF weighting • TF × log(N/DF), where N is the total number of documents • Q: How to use it to compute query-document relevance?
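A small sketch of the weighting, reading it as TF × log(N/DF), where DF is the number of documents containing the term and N is the number of documents. The natural logarithm, the whitespace tokenizer, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return a sparse TF-IDF vector (word -> weight) for each document."""
    n = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # DF: number of documents in which each word appears at least once.
    df = Counter(w for c in counts.values() for w in c)
    return {d: {w: tf * math.log(n / df[w]) for w, tf in c.items()}
            for d, c in counts.items()}

docs = {"d1": "the ucla campus", "d2": "the car", "d3": "the the race"}
vectors = tf_idf_vectors(docs)
print(vectors["d1"])
```

A word such as "the" that appears in every document gets weight 0, while a word found in only one document gets the largest IDF.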
Cosine Similarity • Represent both query and document as a TF-IDF vector • Take the inner product of the two normalized vectors to compute their similarity • Note: |Q| does not matter for document ranking. Division by |D| penalizes longer documents.
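For reference, the inner product of the two normalized vectors can be written out as below. The component symbols q_t and d_t (the TF-IDF weights of term t in the query and in the document) are notation introduced here, not taken from the slides.

```latex
\[
\cos(Q, D) \;=\; \frac{Q \cdot D}{|Q|\,|D|}
\;=\; \frac{\sum_t q_t\, d_t}{\sqrt{\sum_t q_t^2}\;\sqrt{\sum_t d_t^2}}
\]
```

Since |Q| is the same for every document, dropping it rescales all scores by the same factor and leaves the ranking unchanged.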
Cosine Similarity: Example • idf(UCLA)=10, idf(good)=0.1, idf(university) = idf(car) = idf(racing) = 1 • Q = (UCLA, university), D = (car, racing) • Q = (UCLA, university), D = (UCLA, good) • Q = (UCLA, university), D = (university, good)
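The three cases above can be checked numerically. This sketch assumes every listed word occurs once (TF = 1), so each weight equals the given idf value, and it computes cosine similarity as the inner product divided by the product of the vector lengths.

```python
import math

idf = {"UCLA": 10, "good": 0.1, "university": 1, "car": 1, "racing": 1}

def vec(words):
    """Sparse TF-IDF vector, assuming each word occurs once (TF = 1)."""
    return {w: idf[w] for w in words}

def cosine(q, d):
    """Cosine similarity between two sparse vectors (word -> weight)."""
    dot = sum(weight * d.get(w, 0.0) for w, weight in q.items())
    norm_q = math.sqrt(sum(x * x for x in q.values()))
    norm_d = math.sqrt(sum(x * x for x in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = vec(["UCLA", "university"])
documents = {
    "car racing": ["car", "racing"],
    "UCLA good": ["UCLA", "good"],
    "university good": ["university", "good"],
}
scores = {name: cosine(q, vec(words)) for name, words in documents.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Matching on the rare term "UCLA" yields a score close to 1, matching only on "university" yields about 0.1, and sharing no term yields 0.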
Finding High Cosine-Similarity Documents • Q: Under vector-space model, does precision/recall make sense? • Q: How to find the documents with highest cosine similarity from corpus? • Q: Any way to avoid complete scan of corpus?
Inverted Index for TF-IDF • Q · di = 0 if di has no query words • Consider only the documents with query words • Inverted Index: Word → Document [Figure: a lexicon storing each word with its IDF (Stanford, UCLA, MIT, …; e.g., 1/3530, 1/9860, 1/937) and, per word, a postings list of (docid, TF) entries (e.g., D1: 2, D14: 30, D376: 8); TF may be normalized by document size]
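A sketch of how such an index avoids a full scan: scores are accumulated only for documents that appear in the postings lists of the query words. The data layout (`idf`, `postings`, `doc_norm`) and all values are hypothetical; query term frequency is taken as 1, and |Q| is omitted because it does not affect the ranking.

```python
import math
from collections import defaultdict

def rank(query_words, idf, postings, doc_norm, top_k=10):
    """Rank documents by Q · D / |D|, touching only postings of query words."""
    scores = defaultdict(float)
    for w in query_words:
        for doc_id, tf in postings.get(w, []):
            # Query weight idf[w] times document weight tf * idf[w].
            scores[doc_id] += idf[w] * (tf * idf[w])
    ranked = sorted(scores, key=lambda d: scores[d] / doc_norm[d], reverse=True)
    return ranked[:top_k]

# Hypothetical index contents, for illustration only.
idf = {"ucla": 2.0, "physics": 1.0}
postings = {
    "ucla": [("D1", 1), ("D2", 3)],
    "physics": [("D2", 1), ("D3", 2)],
}
# Precomputed vector lengths |D|; D4 contains no query word and is never visited.
doc_norm = {"D1": 2.0, "D2": math.sqrt(37), "D3": 2.0, "D4": 5.0}

top = rank(["ucla", "physics"], idf, postings, doc_norm)
print(top)
```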
Phrase Queries • “Harvard University Boston” exactly as a phrase • Q: How can we support this query? • Two approaches • Biword index • Positional index • Q: Pros and cons of each approach? • Rule of thumb: 2x to 4x size increase for a positional index compared to a docid-only index
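A minimal sketch of the positional-index approach, assuming a lowercase whitespace tokenizer. Each word maps to the documents containing it and, per document, the list of positions where it occurs; a phrase matches when its words appear at consecutive positions. The function names and the toy corpus are illustrative.

```python
from collections import defaultdict

def build_positional_index(corpus):
    """word -> {docid -> [positions of the word in that document]}"""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in corpus.items():
        for pos, w in enumerate(text.lower().split()):
            index[w][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return sorted ids of documents containing the words as an exact phrase."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[words[0]][doc_id]
        # The k-th word must appear exactly k positions after the first word.
        if any(all(s + k in index[w][doc_id] for k, w in enumerate(words))
               for s in starts):
            hits.append(doc_id)
    return hits

corpus = {
    1: "Harvard University Boston",
    2: "Boston University near Harvard",
    3: "visit Harvard University Boston today",
}
pos_index = build_positional_index(corpus)
print(phrase_query(pos_index, "Harvard University Boston"))
```

A biword index would instead store each adjacent word pair as a dictionary entry, trading a larger dictionary for simpler lookups.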
Spell correction • Q: What is the user’s intention for the query “Britnie Spears”? How can we find the correct spelling? • Given a user-typed word w, find its correct spelling c. • Probabilistic approach: Find c with the highest probability P(c|w). • Q: How to estimate it? • Bayes’ rule: P(c|w) = P(w|c)P(c)/P(w) • Q: What are these probabilities and how can we estimate them? • Rule of thumb: 75% of misspellings are within edit distance 1; 98% are within edit distance 2.
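One simple way to put the Bayes formulation to work, shown here as a sketch under strong assumptions rather than the method from the lecture: P(c) is estimated from word counts, and P(w|c) is approximated crudely by preferring candidates at edit distance 1 over those at distance 2 and treating all candidates at the same distance as equally likely. A transposition of adjacent letters is counted as a single edit. The word counts are invented for illustration.

```python
import string
from collections import Counter

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick the most frequent known word within edit distance 1, then 2."""
    if word in counts:
        return word
    candidates = [c for c in edits1(word) if c in counts]
    if not candidates:
        candidates = [c for e in edits1(word) for c in edits1(e) if c in counts]
    # Among equally distant candidates, the highest count stands in for P(c).
    return max(candidates, key=counts.get) if candidates else word

# Hypothetical word counts standing in for a real corpus.
counts = Counter({"britney": 50, "spears": 40, "britain": 30})
print(correct("britnie", counts), correct("spearz", counts))
```

Limiting the search to edit distance 2 follows the rule of thumb above, which suggests it covers the large majority of misspellings.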
Summary • Boolean model • Vector-space model: TF-IDF weight, cosine similarity • Inverted index: for the Boolean model and for the TF-IDF model • Phrase queries • Spell correction