CS144 Discussion Week 4 Information Retrieval

CS144 Discussion Week 4
Information Retrieval
Young Cha
Oct. 25, 2013

Projects
• Project 2 deadline is 11pm today (10/25)
  • 2 grace days → 11pm 10/27 (Sun)
  • Please double-check your implementation before submission
• Project 3 has 3 parts and 2 submission deadlines
  • Part A: Building indexes (by 11/1). No grace period allowed!
  • Part B: Implementing Java search functions (by 11/8)
  • Part C: Publishing Java class as Web service (by 11/8)
• You may resubmit your Project 2 after fixing bugs
• We don't grade your Part A submission, but we may check how different it is from your Part B/C submission → if it is largely different, briefly write down what has changed in README.txt

Boolean Model
• Bag of words
  • Order doesn't matter
• Boolean query
  • AND/OR/NOT
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
[Figure: inverted index for the 3 documents, with a lexicon/dictionary pointing to postings lists]
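The lexicon and postings lists for the three documents above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation; the helper names `tokenize`, `build_index`, and `postings` are made up for this example, and lowercasing plus stripping punctuation is an assumed tokenization.

```python
import re
from collections import defaultdict

docs = {
    1: "Bruins beat Trojans",
    2: "Trojans envy Bruins",
    3: "Bruins! Go Bruins!",
}

def tokenize(text):
    # Assumed tokenization: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    # Lexicon term -> sorted postings list of doc IDs.
    # Bag of words: word order and repeat counts are discarded.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(docs)

def postings(term):
    # Postings as a set so Boolean operators map to set operations.
    return set(index.get(term.lower(), []))

bruins_and_trojans = sorted(postings("bruins") & postings("trojans"))
beat_or_envy = sorted(postings("beat") | postings("envy"))
bruins_not_trojans = sorted(postings("bruins") - postings("trojans"))

print(index)
print(bruins_and_trojans)   # Bruins AND Trojans
print(beat_or_envy)         # beat OR envy
print(bruins_not_trojans)   # Bruins AND NOT Trojans
```

Here AND, OR, and NOT become set intersection, union, and difference over postings lists, which is why only doc3 survives "Bruins AND NOT Trojans".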

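Vector Model
• Tf-idf
  • f x log(N/n), where f is the term frequency in the doc, N is the number of docs, and n is the number of docs containing the term
• Cosine similarity
• 3 documents
  • doc1: Bruins beat Trojans
  • doc2: Trojans envy Bruins
  • doc3: Bruins! Go Bruins!
* Used N/n instead of log(N/n) for simplicity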
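As a rough sketch of the vector model on the same three documents, the Python below computes tf-idf weights and cosine similarity. Following the slide's footnote it uses N/n by default rather than log(N/n); the function names and the `use_log` switch are assumptions made for this example.

```python
import math
import re
from collections import Counter

docs = {
    "doc1": "Bruins beat Trojans",
    "doc2": "Trojans envy Bruins",
    "doc3": "Bruins! Go Bruins!",
}

def tokenize(text):
    # Assumed tokenization: lowercase, keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

N = len(docs)
# tf[doc][term] = f, the term frequency within the document.
tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
# df[term] = n, the number of documents containing the term.
df = Counter(term for counts in tf.values() for term in counts)

def tfidf(doc_id, use_log=False):
    # Weight = f * N/n (slide simplification) or f * log(N/n).
    vec = {}
    for term, f in tf[doc_id].items():
        idf = N / df[term]
        if use_log:
            idf = math.log(idf)
        vec[term] = f * idf
    return vec

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1, d2, d3 = tfidf("doc1"), tfidf("doc2"), tfidf("doc3")
print(d1)
print(cosine(d1, d2), cosine(d1, d3))
```

With N/n weighting, "Bruins" appears in every document and gets the lowest weight, so doc1 and doc2 are similar mainly through "Trojans". With `use_log=True` the weight of "Bruins" becomes log(1) = 0 and it stops contributing at all.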

Precision & Recall
• 1K docs in a corpus
• 50 relevant docs
• Among 10 docs retrieved by a search engine,
  • 3 are relevant
  • 7 are irrelevant
• Precision? |R∩D|/|D| = 3/10 = 0.3
• Recall? |R∩D|/|R| = 3/50 = 0.06
• R: Relevant, D: Retrieved
[Figure: Venn diagram over all docs with R and D: 3 docs in both, 7 retrieved only, 47 relevant only. Precision vs. recall plot for Search Engine A and Search Engine B.]
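The two ratios above can be checked with a short Python sketch. The doc IDs are arbitrary placeholders chosen only so that the set sizes match the slide (50 relevant, 10 retrieved, 3 in common); `precision_recall` is a name invented for this example.

```python
def precision_recall(relevant, retrieved):
    # precision = |R ∩ D| / |D|, recall = |R ∩ D| / |R|
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Placeholder doc IDs: 50 relevant docs, 10 retrieved, 3 overlapping.
relevant = set(range(50))
retrieved = {0, 1, 2} | set(range(100, 107))

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Note that the corpus size (1K docs) does not enter either formula; only the relevant and retrieved sets matter.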

Index Size Estimation
• Given that
  • 100 M docs
  • 5 KB/doc
  • 400 unique words/doc
  • 20 bytes/word
  • 10 bytes/docid
• Questions
  • Document collection size? 100M x 5KB = 500GB
  • Inverted index size? 400GB + 200KB (postings list + lexicon)
  • Size of postings list? 100M x 400 x 10B = 400GB
  • Size of lexicon? (C=1, k=0.5 in C·n^k) (100M)^0.5 x 20B = 200KB
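The arithmetic above is easy to reproduce in Python. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes), as the slide's round numbers suggest, and takes n in C·n^k to be the number of documents, as the slide's calculation does. All variable names are made up for this example.

```python
num_docs = 100_000_000          # 100 M docs
bytes_per_doc = 5_000           # 5 KB/doc (decimal KB assumed)
unique_words_per_doc = 400
bytes_per_word = 20
bytes_per_docid = 10
C, k = 1, 0.5                   # lexicon size = C * n^k

collection_bytes = num_docs * bytes_per_doc
# One docid entry per (doc, unique word) pair.
postings_bytes = num_docs * unique_words_per_doc * bytes_per_docid
# round() guards against floating-point error in the fractional power.
lexicon_words = round(C * num_docs ** k)
lexicon_bytes = lexicon_words * bytes_per_word
index_bytes = postings_bytes + lexicon_bytes

print(collection_bytes / 1e9, "GB collection")
print(postings_bytes / 1e9, "GB postings")
print(lexicon_bytes / 1e3, "KB lexicon")
print(index_bytes, "bytes inverted index")
```

The postings list dominates: the lexicon grows only with the square root of n here, so it is negligible next to the 400 GB of postings.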

Topic-model based IR
• Topic models assume that there are hidden topics behind words
• An IR system with topic models can match a doc containing "automobile" for a query "vehicle", as it assumes they come from the same topic
[Figure: topic-model based IR; the searcher's words and the author's words (car … automobile) can be matched]

Document Corpus Example
• Document corpus (textual dataset) → matrix
• Assumed hidden (latent) topics behind docs/words
• We can infer topics by analyzing co-occurrence of docs and words
• We can generate docs by multiplying assumed doc-topic and topic-word matrices
• Example docs
  • doc1: auto auto ... vehicle vehicle …
  • doc2: film theater film theater …
  • doc3: film … theater … vehicle … auto …
[Figure: Document-Word (observed) = Document-Topic (assumed) x Topic-Word (assumed). Inference goes from the observed matrix to the assumed ones; generation goes the other way.]
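To make the generation direction concrete, here is a small pure-Python sketch that multiplies a doc-topic matrix by a topic-word matrix. All numbers are hypothetical, chosen only to mirror the three example docs (one "cars" topic, one "movies" topic); they are not taken from the slides.

```python
words = ["auto", "vehicle", "film", "theater"]

# Hypothetical doc-topic matrix: rows = doc1..doc3, cols = 2 topics.
doc_topic = [
    [1.0, 0.0],   # doc1: all "cars" topic
    [0.0, 1.0],   # doc2: all "movies" topic
    [0.5, 0.5],   # doc3: an even mix
]

# Hypothetical topic-word matrix: rows = 2 topics, cols = words.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],   # "cars" topic
    [0.0, 0.0, 0.5, 0.5],   # "movies" topic
]

def matmul(a, b):
    # Plain (rows of a) x (columns of b) matrix product.
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b)))
         for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Generation: doc-word = doc-topic x topic-word.
doc_word = matmul(doc_topic, topic_word)
for name, row in zip(["doc1", "doc2", "doc3"], doc_word):
    print(name, dict(zip(words, row)))
```

Inference is the reverse problem: given only an observed doc-word matrix, estimate the two factor matrices.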

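Latent Semantic Indexing (LSI) by SVD
• SVD of the doc-word matrix C (rows: D (docs), columns: W (words)):
  $C = U S V^T$, with $C$: n x p, $U$: n x n, $S$ (diagonal): n x p, $V^T$: p x p
• Rank reduction to k, with T (topics) as the reduced dimension:
  $C_k = U_k S_k V_k^T$, with $U_k$: n x k (D-T), $S_k$: k x k, $V_k^T$: k x p (T-W), $C_k$ (rank-k approximation): n x p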

Latent Semantic Indexing (LSI) by SVD
• Query is viewed as a document → query matching is a process to find a similar document
• q is a 1 x p vector over W (words)
• $C_k q^T$: (n x p) x (p x 1) = n x 1, where $C_k$ is the rank-k approximation with rows D (docs) and columns W (words)
• Each value in the vector represents the similarity between q and d_i
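The query-matching step can be sketched in pure Python as a matrix-vector product. The rank-k matrix below is a hypothetical stand-in, not the output of an actual SVD; it is chosen so that doc1 carries some weight on "vehicle", the kind of smoothing a rank-k approximation can introduce.

```python
words = ["auto", "vehicle", "film", "theater"]

# Hypothetical C_k (n x p): rows = docs, columns = words.
# These values are illustrative only, not computed by SVD.
C_k = [
    [0.6, 0.4, 0.0, 0.0],       # doc1
    [0.0, 0.0, 0.5, 0.5],       # doc2
    [0.25, 0.25, 0.25, 0.25],   # doc3
]

def query_vector(terms):
    # Treat the query as a 1 x p document over the same vocabulary.
    return [1.0 if w in terms else 0.0 for w in words]

def match(C_k, q):
    # C_k q^T: an n x 1 vector, one similarity score per document.
    return [sum(c * x for c, x in zip(row, q)) for row in C_k]

q = query_vector({"vehicle"})
scores = match(C_k, q)
# Rank docs by score, highest first (1-based doc numbers).
ranking = sorted(range(1, len(scores) + 1), key=lambda i: -scores[i - 1])
print(scores)
print(ranking)
```

In practice the scores are often length-normalized (cosine similarity) so that long documents are not favored; that step is omitted here for brevity.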

Example Topics - PLSI
• We can group words with Topic-Word matrix

Lucene Example
• Goal: build index for hotels to support keyword search
• Each Hotel item has id, name, city, description
  • E.g. 1, Hotel Rivoli, Paris, If you like historical Paris …
• 40 hotels
• Requirements
  • Search over name, city, description or full text
  • In a search result page, you should show name, city and description
  • May need to be incorporated with RDB for a complex query
    • E.g. modern hotel in New York with price < $100

Lucene Example
• We first need to create an IndexWriter

Lucene Example
• Which field to store? to index?

Lucene Example
• Now we can perform search using the index
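The store-versus-index decision can be illustrated with a toy Python sketch. This is not Lucene's API; it only mimics the idea that a stored field is returned with search results while an indexed field is searchable. The second hotel, the field name `content` for full text, and all function names are made up for this example.

```python
import re
from collections import defaultdict

hotels = [
    {"id": 1, "name": "Hotel Rivoli", "city": "Paris",
     "description": "If you like historical Paris ..."},
    # Made-up entry for illustration only.
    {"id": 2, "name": "Hotel Example", "city": "New York",
     "description": "A modern hotel (hypothetical entry)"},
]

# Stored: shown on the result page. Indexed: searchable.
STORED_FIELDS = ("name", "city", "description")
INDEXED_FIELDS = ("name", "city", "description", "content")

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

stored = {}                                     # id -> stored fields
index = defaultdict(lambda: defaultdict(set))   # field -> term -> ids

for hotel in hotels:
    stored[hotel["id"]] = {f: hotel[f] for f in STORED_FIELDS}
    searchable = {f: hotel[f] for f in STORED_FIELDS}
    # Full text is indexed but not stored: it only duplicates
    # the other fields, so there is nothing extra to display.
    searchable["content"] = " ".join(hotel[f] for f in STORED_FIELDS)
    for field in INDEXED_FIELDS:
        for term in tokenize(searchable[field]):
            index[field][term].add(hotel["id"])

def search(field, keyword):
    # Single-keyword lookup; returns stored fields of matching hotels.
    ids = index[field].get(keyword.lower(), set())
    return [dict(id=i, **stored[i]) for i in sorted(ids)]

print(search("city", "paris"))
print(search("content", "modern"))
```

A condition such as price < $100 does not fit a keyword index well, which is why the requirements mention combining the index with an RDB: the keyword search yields candidate ids, and the database applies the numeric filter.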
