
Information Retrieval: aka “Google-lite”

CMSC 11500 Introduction to Computer Programming, November 27, 2002


Presentation Transcript


  1. CMSC 11500 Introduction to Computer Programming, November 27, 2002. Information Retrieval: aka “Google-lite”

  2. Roadmap
     • Information Retrieval (IR)
       • Goal: Match Information Need to Document Concept
       • Solution: Vector Space Model
         • Representation of Documents and Queries
         • Computing Similarity
     • Implementation:
       • Indexing: Documents -> Vectors
       • Query Construction: Query -> Vector
       • Retrieval: Finding “Best” match: Query/Document

  3. The Information Retrieval Task
     • Goal:
       • Match the information need expressed by the user (the Query)
       • With concepts in documents (the Document collection)
     • Issues:
       • How do we represent documents and queries?
       • How do we know if they're “similar”? Do they match?

  4. Vector Space Model
     • Represent documents and queries by their pattern of words
       • i.e., a query matches documents that share many of its words
     • Vector of word occurrences:
       • Each position in the vector = one word
       • Value at position x in the vector = number of times word x occurs
     • Similarity:
       • Dot product of document vector and query vector
       • The document with the largest dot product wins

  5. Vector Space Model
     [Figure: vector space with axes Tv, Program, Computer]
     • Two documents: “computer program”, “tv program”
     • Query “computer program”: matches 1st doc exactly: distance = 2 vs 0
     • Query “educational program”: matches both equally: distance = 1
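
The dot-product similarity from slide 4 can be tried on this example. The following is a minimal sketch, not code from the slides: the vocabulary order (tv, program, computer) and the names doc1, doc2, query1, and query2 are assumptions made for illustration, and `dot-product` is one possible definition of the helper the retrieval slide relies on. The slide's “distance” figures may use a different measure; the sketch computes only dot products.

```scheme
;; Assumed vocabulary order: position 0 = tv, 1 = program, 2 = computer.
(define doc1 (vector 0 1 1))   ; "computer program"
(define doc2 (vector 1 1 0))   ; "tv program"

;; dot-product: (vectorof number) (vectorof number) -> number
;; Sums the products of corresponding positions; assumes equal lengths.
(define (dot-product v1 v2)
  (let loop ((i 0) (sum 0))
    (if (= i (vector-length v1))
        sum
        (loop (+ i 1)
              (+ sum (* (vector-ref v1 i) (vector-ref v2 i)))))))

(define query1 (vector 0 1 1)) ; "computer program"
(define query2 (vector 0 1 0)) ; "educational program" ("educational" is not in the vocabulary)

(dot-product query1 doc1) ; 2
(dot-product query1 doc2) ; 1
(dot-product query2 doc1) ; 1
(dot-product query2 doc2) ; 1
```

Under these assumptions the first query scores highest against the first document, while the second query scores both documents equally, since “program” is the only word it shares with either.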

  6. Information Retrieval in Scheme
     • Representation:
       • A vector-rep is (vectorof number)
       • (define-struct doc-rep (id vec))
       • A doc is (make-doc-rep id vec)
         • where id: symbol; vec: vector-rep
       • A doc-index is (listof doc)
       • A query is a vector-rep
       • A simple-web-page (swp) is (make-swp header body)
         • where (define-struct swp (header body)); header: symbol; body: (listof symbol)

  7. Three Steps to IR
     • Three phases:
       • Indexing: Build collection of document representations
         • Convert web pages to doc-rep
         • Vectors of word counts
       • Query construction:
         • Convert query text to vector of word counts
       • Retrieval:
         • Compute similarity between query and doc representation
         • Return closest match

  8. Words-to-vector

     ;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
     ;; Adds 1 to the count in wvec for each word in wlist; mutates and returns wvec.
     ;; Uses the global dictionary dict to map words to vector positions.
     (define (words-to-vector wlist wvec)
       (cond ((null? wlist) wvec)
             (else
              (let ((wpos (posn (car wlist) dict)))
                (let ((cur-count (vector-ref wvec wpos)))
                  (vector-set! wvec wpos (+ cur-count 1))
                  (words-to-vector (cdr wlist) wvec))))))

     ;; posn: looks up the vector position of word wd in dictionary dict
     (define (posn wd dict)
       (cond ((null? dict) (error "missing word"))
             ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
             (else (posn wd (cdr dict)))))
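
The slide leaves `dict`, `map-wd`, and `map-num` undefined. One way to try the code is sketched below, assuming each dictionary entry is a (word . position) pair; that representation, the three-word dictionary, and the definition of `dict-size` are all assumptions made for illustration, not definitions taken from the slides.

```scheme
;; Assumed representation: each dictionary entry is a pair (word . position).
(define (map-wd entry) (car entry))
(define (map-num entry) (cdr entry))

;; Hypothetical three-word dictionary.
(define dict (list (cons 'tv 0) (cons 'program 1) (cons 'computer 2)))
(define dict-size (length dict))

;; posn: symbol (listof entry) -> number
;; Returns the vector position of wd, or signals an error if wd is unknown.
(define (posn wd dict)
  (cond ((null? dict) (error "missing word"))
        ((eq? (map-wd (car dict)) wd) (map-num (car dict)))
        (else (posn wd (cdr dict)))))

;; words-to-vector: (listof symbol) (vectorof num) -> (vectorof num)
;; Increments the count in wvec for every word in wlist.
(define (words-to-vector wlist wvec)
  (cond ((null? wlist) wvec)
        (else
         (let ((wpos (posn (car wlist) dict)))
           (vector-set! wvec wpos (+ (vector-ref wvec wpos) 1))
           (words-to-vector (cdr wlist) wvec)))))

(words-to-vector '(computer program program) (make-vector dict-size 0))
;; => #(0 2 1)
```

Note that a word missing from the dictionary raises an error here, so a query such as “educational program” would need either a larger dictionary or a variant of `posn` that skips unknown words.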

  9. Indexing

     ;; build-index: (listof swp) -> (listof doc-rep)
     ;; Convert text of web pages to list of vector document reps
     (define (build-index swp-list)
       (cond ((null? swp-list) '())
             (else
              (cons (make-doc-rep (swp-header (car swp-list))
                                  (words-to-vector (swp-body (car swp-list))
                                                   (make-vector dict-size 0)))
                    (build-index (cdr swp-list))))))

  10. Query Construction

      ;; build-query: (listof symbol) -> vector-rep
      ;; Convert query text to vector of word occurrence counts
      (define (build-query wlist)
        (words-to-vector wlist (make-vector dict-size 0)))

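  11. Retrieval

      ;; retrieve: vector-rep (listof doc-rep) -> symbol
      ;; Finds id of document with best match with query
      ;; argmax (from racket/list) returns the element of index that
      ;; maximizes the given function; index must be non-empty.
      (define (retrieve query index)
        (doc-rep-id
         (argmax (lambda (doc) (dot-product (doc-rep-vec doc) query))
                 index)))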
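
The retrieval step can also be written without any library helper. The sketch below is one possible version under stated assumptions, not the course's implementation: documents are modeled as plain (id . vec) pairs standing in for the doc-rep structure, `best-doc` is a hypothetical helper, the two sample documents use an assumed vocabulary order (tv, program, computer), and ties go to the earlier document.

```scheme
;; Assumed stand-ins for the slide's doc-rep structure: a doc is a pair (id . vec).
(define (make-doc-rep id vec) (cons id vec))
(define (doc-rep-id doc) (car doc))
(define (doc-rep-vec doc) (cdr doc))

;; dot-product via lists; assumes both vectors have the same length.
(define (dot-product v1 v2)
  (apply + (map * (vector->list v1) (vector->list v2))))

;; best-doc (hypothetical helper): returns whichever of best and the docs in
;; others scores highest against query; the earlier doc wins ties.
(define (best-doc query best others)
  (cond ((null? others) best)
        ((> (dot-product (doc-rep-vec (car others)) query)
            (dot-product (doc-rep-vec best) query))
         (best-doc query (car others) (cdr others)))
        (else (best-doc query best (cdr others)))))

;; retrieve: vector-rep (listof doc-rep) -> symbol
;; Assumes index is non-empty.
(define (retrieve query index)
  (doc-rep-id (best-doc query (car index) (cdr index))))

(define index
  (list (make-doc-rep 'doc1 (vector 0 1 1))    ; "computer program"
        (make-doc-rep 'doc2 (vector 1 1 0))))  ; "tv program"

(retrieve (vector 0 1 1) index) ; => doc1
(retrieve (vector 1 0 0) index) ; => doc2
```

This version recomputes the best score on every comparison, which keeps the code short; carrying the running best score as an extra argument would avoid the repeated dot products on a large index.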
