CS144 Discussion Week 4
Information Retrieval
Young Cha
Oct. 25, 2013
Projects
• Project 2 deadline is 11pm today (10/25)
  • With 2 grace days: 11pm 10/27 (Sun)
  • Please double-check your implementation before submission
• Project 3 has 3 parts and 2 submission deadlines
  • Part A: Building indexes (due 11/1). No grace period allowed!
  • Part B: Implementing Java search functions (due 11/8)
  • Part C: Publishing Java class as Web service (due 11/8)
• You may resubmit your Project 2 after fixing bugs
• We don’t grade your Part A submission, but we may check how different it is from your Part B/C submission. If it is largely different, briefly describe what has changed in README.txt
Boolean Model
• Bag of words
  • Order doesn’t matter
• Boolean query
  • AND/OR/NOT
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
[Figure: inverted index for the three documents, showing the lexicon/dictionary and the postings lists]
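The lexicon and postings lists for the three example documents can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lowercase, letters-only tokenizer and the helper names `tokenize`, `build_index`, and `postings` are hypothetical and do not come from the slides.

```python
import re

# The three example documents from the slide.
docs = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (an assumed normalization)."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each lexicon term to its postings list (sorted doc ids)."""
    index = {}
    for doc_id, text in docs.items():
        # The Boolean model only records presence, so order and repeats are dropped.
        for term in set(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_index(docs)


def postings(term):
    """Return the set of doc ids containing the term (empty if the term is unseen)."""
    return set(index.get(term.lower(), []))


# Boolean queries become set operations on postings lists.
bruins_and_trojans = postings("Bruins") & postings("Trojans")  # AND
beat_or_envy = postings("beat") | postings("envy")             # OR
bruins_not_trojans = postings("Bruins") - postings("Trojans")  # AND NOT

for term in sorted(index):
    print(term, index[term])
```

Using sets keeps each Boolean operator a one-liner; a real index would typically merge sorted postings lists instead.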
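Vector Model
• Tf-idf
  • f x log(N/n), where f is the term's frequency in the document, N is the number of documents, and n is the number of documents containing the term
• Cosine similarity
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
* Used N/n instead of log(N/n) for simplicity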
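As a rough sketch of the vector model on the same three documents, the Python below weights each term by f x (N/n), following the slide's simplification, and compares two documents by cosine similarity. The tokenizer, the `use_log` switch, and the function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter

docs = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (an assumed normalization)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs, use_log=False):
    """Return {doc_id: {term: weight}} with weight = f * N/n, or f * log(N/n)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts.values() for term in c)

    def idf(term):
        ratio = n_docs / df[term]
        return math.log(ratio) if use_log else ratio

    return {
        doc_id: {term: f * idf(term) for term, f in c.items()}
        for doc_id, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors[1], vectors[2])
print(vectors[3])
print(round(sim_12, 4))
```

With the true log(N/n) weighting (`use_log=True`), "Bruins" appears in every document and would receive weight 0, which is one reason the simplified N/n form is easier to follow in a small example.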
Precision & Recall
• 1K docs in a corpus
  • 50 relevant docs
• Among 10 docs retrieved by a search engine,
  • 3 are relevant
  • 7 are irrelevant
• Precision? |R ∩ D|/|D| = 3/10 = 0.3
• Recall? |R ∩ D|/|R| = 3/50 = 0.06
• R: Relevant, D: Retrieved
[Figure: Venn diagram of R and D within all docs (3 in both, 7 retrieved only, 47 relevant only); plot of precision vs. recall for Search Engine A and Search Engine B]
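The same numbers can be checked with set arithmetic. In this small sketch the document ids are made up; only the counts (50 relevant, 10 retrieved, 3 in both) come from the slide.

```python
# R: the 50 relevant docs (ids are hypothetical).
relevant = set(range(50))
# D: the 10 retrieved docs, 3 relevant (ids 0-2) and 7 irrelevant (ids 100-106).
retrieved = {0, 1, 2} | set(range(100, 107))

hits = relevant & retrieved                 # R ∩ D
precision = len(hits) / len(retrieved)      # |R ∩ D| / |D|
recall = len(hits) / len(relevant)          # |R ∩ D| / |R|

print(precision, recall)
```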
Index Size Estimation
• Given that
  • 100M docs
  • 5 KB/doc
  • 400 unique words/doc
  • 20 bytes/word
  • 10 bytes/docid
• Questions
  • Document collection size? 100M x 5KB = 500GB
  • Inverted index size? 400GB + 200KB
    • Size of postings list? 100M x 400 x 10B = 400GB
    • Size of lexicon? (C=1, k=0.5 in C·n^k) (100M)^0.5 x 20B = 200KB
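The estimate can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes) and, as on the slide, uses the number of documents as n in C·n^k; the variable names are illustrative.

```python
num_docs = 100_000_000        # 100M docs
bytes_per_doc = 5_000         # 5 KB/doc (decimal units assumed)
unique_words_per_doc = 400
bytes_per_word = 20
bytes_per_docid = 10
C, k = 1, 0.5                 # constants in C * n^k from the slide

# Document collection: every doc stored in full.
collection_bytes = num_docs * bytes_per_doc

# Postings lists: one docid entry per (doc, unique word) pair.
postings_bytes = num_docs * unique_words_per_doc * bytes_per_docid

# Lexicon: estimated vocabulary size times the bytes needed per word.
vocabulary_size = C * num_docs ** k
lexicon_bytes = vocabulary_size * bytes_per_word

index_bytes = postings_bytes + lexicon_bytes

print(collection_bytes / 1e9, "GB collection")
print(postings_bytes / 1e9, "GB postings")
print(lexicon_bytes / 1e3, "KB lexicon")
```

The postings lists dominate: the lexicon contributes only about 200 KB to an index of roughly 400 GB.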
Topic-model based IR
• Topic models assume that there are hidden topics behind words
• An IR system with topic models can match a doc containing "automobile" to the query "vehicle", as it assumes they come from the same topic
[Figure: a searcher's "car" and an author's "automobile" can be matched through topic-model based IR]
Document Corpus Example
• Document corpus (textual dataset) matrix
• Assumed hidden (latent) topics behind docs/words
• We can infer topics by analyzing co-occurrence of docs and words
• We can generate docs by multiplying assumed doc-topic and topic-word matrices
• Example docs
  • doc1: auto auto ... vehicle vehicle …
  • doc2: film theater film theater …
  • doc3: film … theater … vehicle … auto …
[Figure: Document-Word (observed) = Document-Topic (assumed) x Topic-Word (assumed); inference goes from the observed matrix to the assumed ones, generation goes the other way]
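The generation direction can be illustrated with a toy matrix product. All numbers below are hypothetical: two assumed topics ("cars" and "movies"), four words, and hand-picked weights chosen to loosely resemble the three example docs. The `matmul` helper is a plain-Python stand-in for a linear algebra library.

```python
words = ["auto", "vehicle", "film", "theater"]

# Document-Topic matrix (assumed): rows are docs, columns are topics (cars, movies).
doc_topic = [
    [1.0, 0.0],   # doc1: all about cars
    [0.0, 1.0],   # doc2: all about movies
    [0.5, 0.5],   # doc3: a mix of both
]

# Topic-Word matrix (assumed): rows are topics, columns follow `words`.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],   # cars topic
    [0.0, 0.0, 0.5, 0.5],   # movies topic
]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Generation: Document-Word = Document-Topic x Topic-Word.
doc_word = matmul(doc_topic, topic_word)

for name, row in zip(["doc1", "doc2", "doc3"], doc_word):
    print(name, dict(zip(words, row)))
```

Inference runs the other way: given only an observed document-word matrix, a topic model estimates the two factor matrices.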
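Latent Semantic Indexing (LSI) by SVD
• SVD: C (n x p) = U (n x n) x S (n x p, diagonal) x V^T (p x p), where the n rows of C are docs (D) and the p columns are words (W)
• Rank reduction to k: C_k (n x p, rank-k approximation) = U_k (n x k) x S_k (k x k) x V_k^T (k x p)
• U_k is the doc-topic (D-T) matrix and V_k^T is the topic-word (T-W) matrix, where T denotes the k topics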
Latent Semantic Indexing (LSI) by SVD
• Query is viewed as a document
  • Query matching is a process to find a similar document
• The query q is a 1 x p vector over words (W)
• C_k (n x p, rank-k approximation) x q^T (p x 1) = similarity vector (n x 1)
  • Each value in the vector represents the similarity between q and d_i
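Query matching against a rank-k approximation amounts to a matrix-vector product. The sketch below does not compute an SVD: the matrix `C_k` is a hand-written, hypothetical rank-2 matrix standing in for the result of rank reduction, and the vocabulary and query are likewise invented for illustration.

```python
words = ["auto", "vehicle", "film", "theater"]

# Hypothetical rank-2 approximation C_k (n x p): rows are docs, columns follow `words`.
C_k = [
    [0.5, 0.5, 0.0, 0.0],       # d_1
    [0.0, 0.0, 0.5, 0.5],       # d_2
    [0.25, 0.25, 0.25, 0.25],   # d_3
]


def query_vector(query_terms):
    """Build the 1 x p query vector q, with 1 for each query term in the vocabulary."""
    return [1 if w in query_terms else 0 for w in words]


def score(matrix, q):
    """Compute matrix x q^T: one similarity value per document."""
    return [sum(weight * q_j for weight, q_j in zip(row, q)) for row in matrix]


q = query_vector({"vehicle"})
scores = score(C_k, q)

# Rank documents from most to least similar to the query.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

print(scores)
print(["d_%d" % (i + 1) for i in ranking])
```

Because the approximation spreads weight across co-occurring words, a document row can score above zero for "vehicle" even if the original document mostly used "auto".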
Example Topics - PLSI
• We can group words using the Topic-Word (T-W) matrix
Lucene Example
• Goal: build an index for hotels to support keyword search
• Each Hotel item has id, name, city, description
  • E.g. 1, Hotel Rivoli, Paris, If you like historical Paris …
  • 40 hotels
• Requirements
  • Search over name, city, description or full text
  • In a search result page, you should show name, city and description
  • May need to be combined with an RDB for a complex query
    • E.g. modern hotel in New York with price < $100
Lucene Example • We first need to create an IndexWriter
Lucene Example • Which fields should be stored? Which should be indexed?
Lucene Example • Now we can perform search using the index