Information Retrieval and Web Search
Vasile Rus, PhD
vrus@memphis.edu
www.cs.memphis.edu/~vrus/teaching/ir-websearch/
Indexing and Searching: Outline
Searching
- Sequential online search over raw text: appropriate for small or fast-changing texts
- Index structures for fast access: appropriate for large and semi-static collections
- Hybrid methods: online + indexed search
Indexing
- Inverted files
- Suffix arrays
- Signature files
IR System Architecture
[Architecture diagram: the User Interface takes the user need and text through Text Operations to a logical view; Query Operations produce a query that Searching runs against the Index; Indexing and the Database Manager maintain the inverted file and the Text Database; retrieved docs pass through Ranking and are returned as ranked docs, with user feedback looping back into query operations.]
IR System Components
- Text Operations forms index words (tokens)
  - Tokenization
  - Stopword removal
  - Stemming
- Indexing constructs an inverted index of word-to-document pointers
  - Mapping from keywords to document ids

Example documents:
Doc 1: I did enact Julius Caesar; I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
IR System Components (cont'd)
- Searching retrieves documents that contain a given query token from the inverted index
- Ranking scores all retrieved documents according to a relevance metric
- User Interface manages interaction with the user:
  - Query input and document output
  - Relevance feedback
  - Visualization of results
- Query Operations transform the query to improve retrieval:
  - Query expansion using a thesaurus
  - Query transformation using relevance feedback
Comments on Vector Space Models
- Simple, mathematically based approach
- Considers both local (tf) and global (idf) word importance measures
- Provides partial matching and ranked results
- Tends to work quite well in practice despite obvious weaknesses
- Allows efficient implementation for large document collections
Problems with the Vector Space Model
- Missing semantic information (e.g., word sense)
- Missing syntactic information (e.g., phrase structure, word order, proximity)
- Assumption of term independence (e.g., ignores synonymy)
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
  - Given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, each less frequently
Naïve Implementation
1. Convert all documents in collection D to tf-idf weighted vectors dj over keyword vocabulary V
2. Convert the query to a tf-idf weighted vector q
3. For each dj in D, compute the score sj = cosSim(dj, q)
4. Sort documents by decreasing score
5. Present the top-ranked documents to the user

Time complexity: O(|V|·|D|), which is bad for large V and D!
With |V| = 10,000 and |D| = 100,000, |V|·|D| = 1,000,000,000
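As a concrete illustration, here is a minimal Perl sketch of the naïve approach. The %doc_vecs representation (document id mapped to a hash of tf-idf weights) and the function names are assumptions for this sketch, not something defined in the slides:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# cosSim(q, d): dot product of two sparse vectors divided by the
# product of their Euclidean lengths.
sub cos_sim {
    my ($q, $d) = @_;
    my $dot  = sum(0, map { $q->{$_} * ($d->{$_} // 0) } keys %$q);
    my $qlen = sqrt(sum(0, map { $_ ** 2 } values %$q));
    my $dlen = sqrt(sum(0, map { $_ ** 2 } values %$d));
    return ($qlen && $dlen) ? $dot / ($qlen * $dlen) : 0;
}

# Score every single document against the query; with dense vectors
# this is the O(|V|*|D|) step that makes the naive approach slow.
sub naive_retrieve {
    my ($query_vec, $doc_vecs) = @_;
    my %score = map { $_ => cos_sim($query_vec, $doc_vecs->{$_}) }
                keys %$doc_vecs;
    return sort { $score{$b} <=> $score{$a} } keys %score;
}
```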
Practical Implementation
- Based on the observation that documents containing none of the query keywords do not affect the final ranking
- Try to identify only those documents that contain at least one query keyword
- Actual implementation of an inverted index
Step 1: Preprocessing
- Implement the preprocessing functions:
  - Tokenization
  - Stop word removal
  - Stemming
- Map everything to lowercase; keep only sequences of alphabetic characters
- Input: documents read one by one from the collection
- Output: tokens to be added to the index (no punctuation, no stop words, stemmed)
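A minimal Perl sketch of this step. The stop-word list is a tiny illustrative one, and the stemmer is a deliberately crude suffix stripper standing in for a real algorithm such as Porter's; both are assumptions, not the course's actual code:

```perl
use strict;
use warnings;

# Small stop-word list for illustration; a real system would load a
# much longer one from a file.
my %stop = map { $_ => 1 } qw(a an and are as at be by for from in
                              is it of on or that the to was with);

# Very crude suffix stripper, a stand-in for a real stemmer such as
# Porter's algorithm (it over-stems, as the sample output shows).
sub stem {
    my ($w) = @_;
    $w =~ s/(?:ing|ed|es|s)$// if length($w) > 4;
    return $w;
}

# Lowercase, keep only alphabetic sequences, drop stop words, stem.
sub preprocess {
    my ($text) = @_;
    my @tokens = grep { !$stop{$_} } (lc($text) =~ /([a-z]+)/g);
    return map { stem($_) } @tokens;
}

print join(" ", preprocess("Indexing and searching of stemmed words")), "\n";
# prints: index search stemm word
```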
Step 2: Indexing
- Build an inverted index, with an entry for each word in the vocabulary
- Input: tokens obtained from the preprocessing module
- Output: an inverted index for fast access
Step 2 (cont'd)
- Many data structures are appropriate for fast access: B-trees, skip lists, hashtables
- We need one entry for each word in the vocabulary; for each entry, keep:
  - A list of all documents where the word appears, together with the corresponding frequency: TF
  - The number of documents in which the word appears: DF
Step 2 (cont'd)

Index file layout (each index term points to its df and a posting list of (Dj, tfj) pairs):

Index term    df    Posting list (Dj, tfj)
computer       3    (D7, 4), …
database       2    (D1, 3), …
science        4    (D2, 4), …
system         1    (D5, 2), …
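A minimal Perl sketch of this structure, using a nested hash (term mapped to its df and its posting list). The two toy documents are made up for illustration, with tokens assumed to have already come out of Step 1:

```perl
use strict;
use warnings;

# Toy documents; tokens are assumed to be already preprocessed.
my %docs = (
    Doc1 => [qw(enact julius caesar kill capitol brutus kill)],
    Doc2 => [qw(caesar noble brutus told caesar ambitious)],
);

my %index;   # term => { df => ..., postings => { doc id => tf } }
while (my ($doc, $tokens) = each %docs) {
    my %tf;
    $tf{$_}++ for @$tokens;                 # term frequencies in this doc
    while (my ($term, $count) = each %tf) {
        $index{$term}{postings}{$doc} = $count;
        $index{$term}{df}++;                # one increment per document
    }
}

# Print the index in the layout of the figure above.
for my $term (sort keys %index) {
    my $p = $index{$term}{postings};
    printf "%-10s df=%d  %s\n", $term, $index{$term}{df},
           join(", ", map { "($_, $p->{$_})" } sort keys %$p);
}
```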
Step 2 (cont'd)
- TF and IDF for each token can be computed in one pass
- Cosine similarity also requires document lengths
- A second pass is needed to compute document vector lengths:
  - The length of a document vector is the square root of the sum of the squares of the weights of its tokens
  - The weight of a token is TF * IDF
  - Therefore we must wait until IDFs are known (and therefore until all documents are indexed) before document lengths can be determined
- In the second pass, keep a list or hash table of all document ids, and for each document determine its length
Computing IDF
Let N be the total number of documents
For each token T in V:
    Determine m, the number of documents in which T occurs
    Set IDF(T) = log(N/m)
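A direct Perl transcription of this pseudocode, assuming the %docs and %index structures from the indexing sketch above:

```perl
my $N = scalar keys %docs;           # N: total number of documents
my %idf;
for my $term (keys %index) {
    my $m = $index{$term}{df};       # m: documents in which T occurs
    $idf{$term} = log($N / $m);      # Perl's log() is the natural log
}
```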
Computing Document Lengths
Initialize the length of every document vector to 0.0
For each token T in V:
    Let I be the IDF weight of T
    For each document D in which T occurs:
        Let C be the count of T in D
        Increment the length of D by (I*C)^2
For each document D:
    Set the length of D to the square root of its current stored length
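And the second pass in Perl, continuing the same sketch. Walking the index term by term and visiting each term's posting list accumulates exactly the per-document sums described above:

```perl
my %len;   # document id => vector length
for my $term (keys %index) {
    my $i = $idf{$term};             # I: the IDF weight of the term
    while (my ($doc, $c) = each %{ $index{$term}{postings} }) {
        $len{$doc} += ($i * $c) ** 2;          # accumulate (I*C)^2
    }
}
$len{$_} = sqrt($len{$_}) for keys %len;       # square root at the end
```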
Time Complexity of Indexing
- Creating the vector and indexing a document of n tokens is O(n), so indexing m such documents is O(mn)
- Computing token IDFs for a vocabulary V is O(|V|)
- Computing vector lengths is also O(mn)
- Since |V| ≤ mn, the complete process is O(mn), which is also the complexity of just reading in the corpus
Step 3: Retrieval
- Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words
- Incrementally compute the cosine similarity of each indexed document as query words are processed one by one
- To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, with the document id as key and the partial accumulated score as value
- Input: query and inverted index (from Step 2)
- Output: similarity values between the query and documents
Retrieval with an Inverted Index
- Tokens that do not appear in both the query and the document do not affect cosine similarity
  - The product of the token weights is zero and does not contribute to the dot product
- The query is usually fairly short, so its vector is extremely sparse
- Use the inverted index to find the limited set of documents that contain at least one of the query words
Inverted Query Retrieval Efficiency
- Assume that, on average, a query word appears in B documents:

  Q = q1 q2 … qn
  q1 → D11 … D1B
  q2 → D21 … D2B
  …
  qn → Dn1 … DnB

- Retrieval time is then O(|Q| B), typically much better than naïve retrieval, which examines all N documents in O(|V| N) time, since |Q| << |V| and B << N
Processing the Query
- Incrementally compute the cosine similarity of each indexed document as query words are processed one by one
- To accumulate a total score for each retrieved document, store retrieved documents in a hashtable, with the DocumentReference as key and the partial accumulated score as value
Inverted-Index Retrieval Algorithm
Initialize an empty list R of retrieved documents
For each token T in query Q:
    Let I be the IDF of T, and K the count of T in Q
    Set the weight of T in Q: W = K * I
    Let P be the posting list of T
    For each entry O in P:
        Let D be the document of O, and C the count in O (tf of T in D)
        If D is not already in R (D was not previously retrieved):
            Add D to R and initialize its score to 0.0
        Increment D's score by W * I * C  (product of T's weight in Q and T's weight in D)
Retrieval Algorithm (cont'd)
Compute the length L of the vector Q (square root of the sum of the squares of its weights)
For each retrieved document D in R:
    Let S be the current accumulated score of D  (S is the dot product of D and Q)
    Let Y be the length of D
    Normalize D's final score to S / (L * Y)
Sort the retrieved documents in R by final score and return the results
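The two slides above translate into the following Perl sketch, again assuming the %index, %idf, and %len structures built in Step 2 (the names and the zero-length guard are mine, not from the slides):

```perl
sub retrieve {
    my ($query_tokens, $index, $idf, $len) = @_;
    my (%qtf, %retrieved);
    $qtf{$_}++ for @$query_tokens;          # K: count of each token in Q

    my $qlen = 0;
    for my $t (keys %qtf) {
        next unless exists $index->{$t};    # skip out-of-vocabulary tokens
        my $w = $qtf{$t} * $idf->{$t};      # W = K * I, weight of T in Q
        $qlen += $w ** 2;
        # Walk T's posting list; only these documents can score > 0.
        while (my ($d, $c) = each %{ $index->{$t}{postings} }) {
            $retrieved{$d} += $w * $idf->{$t} * $c;   # add W * I * C
        }
    }
    $qlen = sqrt($qlen);                    # L: length of the query vector
    return {} if $qlen == 0;                # no informative query term matched

    # Normalize each accumulated dot product S to S / (L * Y).
    $retrieved{$_} /= ($qlen * $len->{$_}) for keys %retrieved;
    return \%retrieved;
}
```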
Step 4: Ranking
- Sort the hashtable of retrieved documents by cosine similarity value:
  sort { $retrieved{$b} <=> $retrieved{$a} } keys %retrieved
- Return the documents in descending order of their relevance
- Input: similarity values between query and documents
- Output: ranked list of documents in decreasing order of relevance
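Putting Steps 3 and 4 together with the structures from the earlier sketches (the query tokens are assumed to be preprocessed the same way as the documents):

```perl
my $retrieved = retrieve([qw(kill ambitious)], \%index, \%idf, \%len);
for my $doc (sort { $retrieved->{$b} <=> $retrieved->{$a} } keys %$retrieved) {
    printf "%s\t%.4f\n", $doc, $retrieved->{$doc};   # ranked list
}
```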
Indexing and Searching: Summary
Next: Intro to Web Search