IR Theory: IR Basics & Vector Space Model

IR Theory:IR Basics & Vector Space Model

IR Approach Information Seeker Authors Information Need Concepts Query String Document Text Is the document relevant to the query? Search Engine

IR System Architecture Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine

Step 1: Representation Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine

How to represent text? • How do we represent the complexities of language? • Computers don’t “understand” documents or queries • Simple, yet effective approach: “bag of words” • Treat all the words in a document as index terms for that document • Disregard order, structure, meaning, etc. of the words Bag of Words McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. … 16 × said 14 × McDonalds 12 × fat 11 × fries 8 × new 6 × company french nutrition 5 × food oil percent reduce taste Tuesday … Search Engine

aid 0 1 all 0 1 back 1 0 brown 1 0 come 0 1 dog 1 0 fox 1 0 good 0 1 jump 1 0 lazy 1 0 men 0 1 now 0 1 over 1 0 party 0 1 quick 1 0 their 0 1 time 0 1 Bag-of-Word Representation Document 1 Term Document 1 Document 2 The quick brown fox jumped over the lazy dog’s back. Stopword List for is of Document 2 the to Now is the time for all good men to come to the aid of their party. Search Engine

Step 2: Term Weighting Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine

Term Weight: What & How? • What is term weight? • Numerical estimate of term importance • How should we estimate the term importance? • Terms that appear often in a document should get high weights • The more often a document contains the term “dog”, the more likely that the document is “about” dogs. • Terms that appear in many documents should get low weights • Words like “the”, “a”, “of” appear in (nearly) all documents. • Term frequency in long documents should count less than those in short ones • How do we compute it? • Term frequency • Inverse document frequency • Document length Search Engine

Step 3: Matching/Ranking Documents Query Representation Module Representation Module Document Representation Query Representation Matching/Ranking Module Results Search Engine

Boolean vs. Vector Space Model • Boolean model • Based on the notion of sets • Documents are retrieved only if they satisfy Boolean conditions specified in the query • Does not impose a ranking on retrieved documents • Exact match • Vector space model • Based on geometry, the notion of vectors in high dimensional space • Documents are ranked based on their similarity to the query • Best/partial match Search Engine

Boolean Model: Overview • Weights assigned to terms are either “0” or “1” • “0” represents “absence”: term isn’t in the document • “1” represents “presence”: term is in the document • Build queries by combining terms with Boolean operators • AND, OR, NOT • The system returns all documents that satisfy the query A OR B A B A AND B A AND NOT(B) Search Engine

Boolean Model: Strength • Boolean operators approximate natural language • To find documents about a good party • AND can discover relationships between concepts • good party • OR can discover alternate terminology • excellent party, wild party, etc. • NOT can discover alternate meanings • Democratic party • Precise, if you know the right strategies • Precise, if you have an idea of what you’re looking for • Efficient for the computer Search Engine

Boolean Model: Weakness • Natural language is way more complex • Boolean logic insufficient to capture the richness of language • AND “discovers” nonexistent relationships • Terms in different sentences, paragraphs, … • Money is good, but I won’t be party to stealing. • Guessing terminology for OR is hard • good, nice, excellent, outstanding, awesome, … • Guessing terms to exclude is even harder! • Democratic party, party to a lawsuit, … • No control over size of result set • Too many documents or none • All documents in the result set are considered “equally good” • No Partial Matching • Documents that “don’t quite match” the query may be useful also Search Engine

Vector Space Model: Representation • “Bags of words” can be represented as vectors • Computational efficiency • Ease of manipulation • Geometric metaphor: “arrows” • A vector is a set of values recorded in any consistent order “The quick brown fox jumped over the lazy dog’s back” Vector Bag of words  (1, 1, 1, 1, 1, 1, 1, 1, 2) 1st position corresponds to “back” 2nd position corresponds to “brown” 3rd position corresponds to “dog” 4th position corresponds to “fox” 5th position corresponds to “jump” 6th position corresponds to “lazy” 7th position corresponds to “over” 8th position corresponds to “quick” 9th position corresponds to “the” Search Engine

Vector Space Model: Ranked Retrieval • Order documents by “relevance” • Relevance = how likely they are to be relevant to the information need • Some documents are “better” than others • Users can decide when to stop reading • Best (partial) match • Documents need not have all query terms • Documents with more query terms should be “better” • Estimate relevance with query-document similarity • Treat the query as if it were a document • Create a query bag-of-words • Compute term weights • Find its similarity to each document • Rank order the documents by similarity • Works surprisingly well Search Engine

Vector Space Model: 3-D Example • A vector A in a 3-dimensional space • Represented with initial point at the origin of a rectangular coordinate system. • Projections of A on the x, y, and z axes: Ax, Ay, and Az • the (rectangular) components of A in the x, y, and z directions • each axis represents a term (e.g., x = all, y = brown, z = cat) z Az A y Ay Ax x Search Engine

Vector Space Model: Postulate t3 d2 d3 d1 θ t1 d5 t2 d4 Documents that are “close together” in vector space “talk about” the same things Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”) Search Engine

Similarity Measures: Set-based • Simple matching function • Dice’s coefficient • Jaccard’s coefficient A = (wd1, wd2, wd3, wd4, wd5) B = (wd2, wd4, wd6) • A  B:intersection ofA and B • the set of elements that belongs to both A and B • A  B = (wd2, wd4) • A  B: union ofA and B • the set of elements that belongs to either A or B. • A  B = (wd1, wd2, wd3, wd4, wd5, wd6) • |A| : cardinality ofA • the number of elements in A • |A| = 5 • |B| = 3 • |A  B| = 2 • |A  B| = 6 • Similarity Scores • Simple: |A  B| = 2 • Dice: 2* |A  B| / (|A|+ |B|) = 2*2 /8 = 1/2 • Jaccard: |A  B| / |A  B| = 2/6 = 1/3 Search Engine

Similarity Measures: Set-based Example Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) |O1| = 4 O2 = (1, 0, 0, 0, 1, 1, 0, 0) |O2| = 3 O3 = (1, 0, 0, 1, 1, 1, 0, 0) |O3| = 4 O4 = (1, 1, 0, 1, 0, 1, 1, 0) |O4| = 5 O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O5| = 6 |O1 O2| = |(A1)| = 1 |O1 O3| = |(A1, A4)| = 2 |O1 O4| = |(A1, A4)| = 2 |O1 O5| = |(A1, A3, A4 , A8)| = 4 |O1 O2| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1 O3| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1 O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7 |O1 O5| = |(A1, A2, A3, A4, A7, A8)| = 6 • Simple matching function • By Dice’s coefficient • By Jaccard’s coefficient Search Engine

Similarity Measures: Vector-based • Cosine Similarity (n-dimensional space) • Dot/Scalar product of vectors / product of vector lengths • Dot product = sum (product of each axis component) A = (A1, A2, A3, A4) B = (B1, B2, B3, B4)AB = (A1B1+A2B2+A3B3+A4B4) • Vector length = square root of sum (square of each axis component) |A| = sqrt [(A1)2+ (A2)2+ (A3)2+ (A4)2] |B| = sqrt [(B1)2+ (B2)2+ (B3)2+ (B4)2] • Cosine Similarity (3-dimensional space) A = (Ax, Ay, Az) B = (Bx, By, Bz) Search Engine

Similarity Measures: Vector-based Example Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) O2 = (1, 0, 0, 0, 1, 1, 0, 0) O3 = (1, 0, 0, 1, 1, 1, 0, 0) O4 = (1, 1, 0, 1, 0, 1, 1, 0) O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O1| = sqrt(12+02+12+12+02+02+02+12) = sqrt(4) |O2| = sqrt(12+02+02+02+12+12+02+02) = sqrt(3) |O3| = sqrt(12+02+02+12+12+12+02+02) = sqrt(4) |O4| = sqrt(12+12+02+12+02+12+12+02) = sqrt(5) |O5| = sqrt(12+12+12+12+02+02+12+12) = sqrt(6) O1O2 = (1*1+0*0+1*0+1*0+0*1+0*1+0*0+1*0) = 1 O1O3 = (1*1+0*0+1*0+1*1+0*1+0*1+0*0+1*0) = 2 O1O4 = (1*1+0*1+1*0+1*1+0*0+0*1+0*1+1*0) = 2 O1O5 = (1*1+0*1+1*1+1*1+0*0+0*0+0*1+1*1) = 4 • Compute cosine similarities • Rank objects O2 through O5 by descending order of similarity to O1 Search Engine

Text Analysis: Word Frequency B. Croft (Umass) • TREC Volume 3 Corpus • Number of documents: 336,310 • Total word occurrences: 125,720,891 • Unique words: 508,209 • Zipf Distribution • Rank*Frequency = constant • Population, Wealth, Popularity • A few words are very common • Most words are very rare • Term Weights • Represents the ability of terms to identify relevant items & to distinguish them from non-relevant material • Very common & very rare words are notvery useful for indexing (Luhn, 1958) • Good • Smaller index  Faster retrieval • Bad • Lost gems & broken phrases Search Engine

aid 0 1 all 0 1 back 1 0 brown 1 0 come 0 1 dog 2 1 fox 3 0 good 0 1 jump 1 0 lazy 1 0 men 0 1 now 0 1 over 1 0 party 0 1 quick 1 0 their 0 1 time 0 1 Text Analysis: Term Weighting • Term Weighting Factors • Term frequency (tf) • Number of times that a term occurs in a given document • tf(dogd1) = 2, tf(dogd2) = 1 • tf(foxd1) = 3, tf(foxd2) = 0 • tf(partyd1) = 0, tf(partyd2) = 1 • Inverse document frequency (idf) • (Simple)1/number of document in which a term occurs • idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1 • (Default)log(Nd/ number of document in which a term occurs) • Nd = number of document in a collection • idf(dog)=log(2/2)=0, idf(fox)=log(2/1)=0.3, idf(party) = log(2/1)=0.3 • Document length (dlen) • Number of tokens in a document • Token = an instance/occurrence of a word (not unique word) • dlen(d1) = 11, dlen(d2) = 10 • tfidf formula Term d1 d2 wki = weight of term k in document ifki = frequency of term k in document i (tf)Nd = number of documents in collectiondk = number of documents in which term k appears (postings) Search Engine

Similarity Measures: using Term Weights Document – Term array • Compute term weights (e.g., tf*idf) • Nd = 5 • d1=5, d2=4, d3=1, d4=3, d5=2, d6=3, d7=4, d8=2 Search Engine

Similarity Measures: using Term Weights • Compute query-document cosine similarity with tf*idf weights Search Engine

IR Theory: IR Basics & Vector Space Model