CMSC 11500 Introduction to Computer Programming
November 27, 2002
Information Retrieval: aka “Google-lite”
Roadmap
• Information Retrieval (IR)
  • Goal: Match Information Need to Document Concept
  • Solution: Vector Space Model
    • Representation of Documents and Queries
    • Computing Similarity
• Implementation:
  • Indexing: Documents -> Vectors
  • Query Construction: Query -> Vector
  • Retrieval: Finding “Best” match: Query/Document
The Information Retrieval Task
• Goal:
  • Match the information need expressed by the user (the Query)
  • With concepts in documents (the Document collection)
• Issues:
  • How do we represent documents and queries?
  • How do we know if they're “similar”? Match?
Vector Space Model
• Represent documents and queries with
  • Pattern of words
  • i.e., a query matches documents that have lots of the same words
• Vector of word occurrences:
  • Each position in vector = word
  • Value of position x in vector = # times word x occurs
• Similarity:
  • Dot product of document vector & query vector
  • Biggest dot product wins
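The similarity measure on this slide can be sketched as follows. This is a minimal illustration, not the course's own code: the four-position vocabulary and the counts in doc-vec and query-vec are made up for the example, and the function assumes both vectors have the same length.

```scheme
;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the products of corresponding positions.
;; Assumes v1 and v2 have the same length.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; Hypothetical word counts over a four-word vocabulary
(define doc-vec   (vector 2 0 1 3))
(define query-vec (vector 1 0 0 1))

(dot-product doc-vec query-vec)   ; => 5  (2*1 + 0*0 + 1*0 + 3*1)
```

A document that shares no words with the query scores 0, so larger values mean more shared word occurrences.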
Vector Space Model
[Figure: vector space with axes tv, program, computer]
• Two documents: “computer program”, “tv program”
• Query “computer program”: matches 1st doc exactly (distance 0, vs. 2 for the 2nd doc)
• Query “educational program”: matches both docs equally (distance = 1 to each)
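The numbers on this slide can be reproduced with a short sketch. Two assumptions are made here that the slide does not state: "distance" is taken to be the sum of absolute differences between vector positions (this choice happens to give the slide's values), and the word "educational", which has no axis, is simply left out of the query vector. The name vec-distance is hypothetical.

```scheme
;; vec-distance: (vectorof number) (vectorof number) -> number
;; Sum of absolute differences between positions (an assumed measure).
(define (vec-distance v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (abs (- (vector-ref v1 i) (vector-ref v2 i))))))))

;; Positions: 0 = tv, 1 = program, 2 = computer
(define doc-1 (vector 0 1 1))           ; "computer program"
(define doc-2 (vector 1 1 0))           ; "tv program"
(define q-computer    (vector 0 1 1))   ; query "computer program"
(define q-educational (vector 0 1 0))   ; query "educational program"

(vec-distance q-computer doc-1)      ; => 0
(vec-distance q-computer doc-2)      ; => 2
(vec-distance q-educational doc-1)   ; => 1
(vec-distance q-educational doc-2)   ; => 1
```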
Information Retrieval in Scheme
• Representation:
  • A vector-rep is (vectorof number)
  • (define-struct doc-rep (id vec))
  • A doc is (make-doc-rep id vec)
    • Where id: symbol; vec: vector-rep
  • A doc-index is (listof doc)
  • A query is vector-rep
  • A simple-web-page (swp) is:
    • (make-swp h b)
    • Where (define-struct swp (h b)); h: symbol; b: (listof symbol)
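To make these data definitions concrete, here is a small sketch that builds one value of each kind. It assumes a DrScheme/Racket-style define-struct; the page name, its words, and the three-word vocabulary (0 = tv, 1 = program, 2 = computer) are invented for illustration.

```scheme
(define-struct doc-rep (id vec))   ; id: symbol; vec: vector-rep
(define-struct swp (h b))          ; h: symbol; b: (listof symbol)

;; A hypothetical simple web page: header plus body words
(define page-1 (make-swp 'page-1 '(computer program computer)))

;; Its document representation, written out by hand:
;; tv occurs 0 times, program once, computer twice
(define doc-1 (make-doc-rep 'page-1 (vector 0 1 2)))

;; A doc-index is just a list of docs
(define index-1 (list doc-1))

(swp-h page-1)                       ; => page-1
(doc-rep-id (car index-1))           ; => page-1
(vector-ref (doc-rep-vec doc-1) 2)   ; => 2
```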
Three Steps to IR
• Three phases:
  • Indexing: Build collection of document representations
    • Convert web pages to doc-rep
    • Vectors of word counts
  • Query construction:
    • Convert query text to vector of word counts
  • Retrieval:
    • Compute similarity between query and doc representation
    • Return closest match
Words-to-vector
(define (words-to-vector wlist wvec)
  ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
  ;; Adds 1 to wvec at the dictionary position of each word in wlist
  (cond ((null? wlist) wvec)
        (else
         (let* ((wpos (posn (car wlist) dict))
                (cur-count (vector-ref wvec wpos)))
           (vector-set! wvec wpos (+ cur-count 1))
           (words-to-vector (cdr wlist) wvec)))))

(define (posn wd dict)
  ;; Finds the vector position of word wd in dict;
  ;; signals an error if wd is not in the dictionary
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))
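The slides do not show how dict, map-wd, or map-num are defined, so the sketch below fills them in with one possible choice: a list of (word . position) pairs with car and cdr as the accessors. That representation, the three-word vocabulary, and the extended error message are assumptions made only so the example runs on its own.

```scheme
;; Assumed dictionary: a list of (word . position) pairs
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))
(define dict (list (cons 'tv 0) (cons 'program 1) (cons 'computer 2)))
(define dict-size 3)

;; posn: look up the vector position of wd in dict
(define (posn wd dict)
  (cond ((null? dict) (error "missing word:" wd))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: add 1 to wvec for each word in wlist
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let* ((wpos (posn (car wlist) dict))
                (cur-count (vector-ref wvec wpos)))
           (vector-set! wvec wpos (+ cur-count 1))
           (words-to-vector (cdr wlist) wvec)))))

(define counts
  (words-to-vector '(computer program computer) (make-vector dict-size 0)))

(vector-ref counts 0)   ; => 0  (tv)
(vector-ref counts 1)   ; => 1  (program)
(vector-ref counts 2)   ; => 2  (computer)
```

Note that words-to-vector mutates the vector it is given, which is why each call starts from a fresh (make-vector dict-size 0).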
Indexing
(define (build-index swp-list)
  ;; build-index: (listof swp) -> (listof doc-rep)
  ;; Convert text of web pages to list of vector document reps
  (cond ((null? swp-list) '())
        (else
         (cons (make-doc-rep (swp-h (car swp-list))
                             (words-to-vector (swp-b (car swp-list))
                                              (make-vector dict-size 0)))
               (build-index (cdr swp-list))))))
Query Construction
(define (build-query wlist)
  ;; build-query: (listof symbol) -> vector-rep
  ;; Convert query text to vector of word occurrence counts
  (words-to-vector wlist (make-vector dict-size 0)))
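Retrieval
(define (retrieve query index)
  ;; retrieve: vector-rep (listof doc-rep) -> symbol
  ;; Finds id of document with best match with query
  ;; Note: max here must be a helper that takes a scoring function and a
  ;; list and returns the highest-scoring element; Scheme's built-in max
  ;; only compares numbers
  (doc-rep-id
   (max (lambda (doc) (dot-product (doc-rep-vec doc) query))
        index)))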
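One way the retrieval step could be made to run end to end is sketched below. The helper best-match is hypothetical (it stands in for the function-taking max used on the slide), dot-product is written out because the slides do not show it, and the two-document index reuses the tv/program/computer vocabulary as an assumption. The sketch uses a DrScheme/Racket-style define-struct and assumes a non-empty index.

```scheme
(define-struct doc-rep (id vec))

;; dot-product: sums products of corresponding positions (equal lengths assumed)
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

;; best-match: hypothetical helper; returns the item with the largest score.
;; Assumes items is non-empty; on ties the earlier item is kept.
(define (best-match score items)
  (let loop ((best (car items)) (rest (cdr items)))
    (cond ((null? rest) best)
          ((> (score (car rest)) (score best))
           (loop (car rest) (cdr rest)))
          (else (loop best (cdr rest))))))

;; retrieve: vector-rep (listof doc-rep) -> symbol
(define (retrieve query index)
  (doc-rep-id
   (best-match (lambda (doc) (dot-product (doc-rep-vec doc) query))
               index)))

;; Positions: 0 = tv, 1 = program, 2 = computer
(define index
  (list (make-doc-rep 'doc-1 (vector 0 1 1))     ; "computer program"
        (make-doc-rep 'doc-2 (vector 1 1 0))))   ; "tv program"

(retrieve (vector 0 1 1) index)   ; => doc-1  (scores 2 vs. 1)
(retrieve (vector 1 0 0) index)   ; => doc-2  (scores 0 vs. 1)
```

Because raw dot products grow with word counts, long documents tend to score higher under this measure; the sketch keeps the slide's measure unchanged rather than normalizing.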