Basic IR: Modeling

Basic IR Task:
• Match a subset of documents to the user's query
• Slightly more complex: also rank the resulting documents by predicted relevance

The derivation of relevance leads to different IR models.
Concepts: Term-Document Incidence

Imagine a matrix of terms × documents, with a 1 where the term appears in the document and a 0 otherwise.
• How are queries satisfied against this matrix?
• What problems does it raise?
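To make this concrete, here is a minimal Python sketch of an incidence matrix over a made-up three-document corpus (all document names and terms are invented for illustration):

```python
# A minimal sketch of a term-document incidence matrix over a toy corpus.
docs = {
    "d1": "brutus killed caesar",
    "d2": "caesar ruled rome",
    "d3": "brutus visited rome",
}

terms = sorted({t for text in docs.values() for t in text.split()})

# incidence[term][doc] is 1 if the term appears in the doc, else 0
incidence = {
    term: {d: int(term in text.split()) for d, text in docs.items()}
    for term in terms
}

for term in terms:
    print(term, [incidence[term][d] for d in sorted(docs)])
```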
Concepts: Term Frequency
• To support document ranking, we need more than term incidence.
• Term frequency records the number of times a given term appears in each document.
• Intuition: the more times a term appears in a document, the more central it is to the topic of the document.
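A raw term-frequency count is one line in Python (toy text invented for illustration):

```python
from collections import Counter

# Term frequency: raw occurrence counts per document.
tokens = "to be or not to be".split()
tf = Counter(tokens)
print(tf)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```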
Concept: Term Weight
• Weights represent the importance of a given term for characterizing a document.
• wij is the weight for term i in document j.
IR Models (taxonomy from the MIR text)

User task: retrieval (ad hoc, filtering) and browsing.
• Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext
Classic Models: Basic Concepts
• ki is an index term
• dj is a document
• t is the total number of index terms
• K = (k1, k2, …, kt) is the set of all index terms
• wij >= 0 is a weight associated with (ki, dj)
• wij = 0 indicates that the term does not belong to the doc
• vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with the document dj
• gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)
Classic: Boolean Model
• Based on set theory: map queries with Boolean operations to set operations
• Select documents from the term-document incidence matrix

Pros: clean formalism, precise semantics, easy to implement.
Cons: exact matching only, with no notion of ranking (detailed on the next slide).
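As an illustration of the query-to-set-operation mapping, here is a sketch over made-up postings sets:

```python
# Boolean retrieval as set operations over toy postings
# (which documents contain which term; data invented for illustration).
postings = {
    "brutus": {"d1", "d3"},
    "caesar": {"d1", "d2"},
    "rome":   {"d2", "d3"},
}

# "brutus AND rome"       -> set intersection
print(postings["brutus"] & postings["rome"])    # {'d3'}
# "caesar AND NOT brutus" -> set difference
print(postings["caesar"] - postings["brutus"])  # {'d2'}
# "brutus OR caesar"      -> set union
print(postings["brutus"] | postings["caesar"])  # {'d1', 'd2', 'd3'}
```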
Exact Matching Ignores…
• term frequency in the document
• term scarcity in the corpus
• size of the document
• ranking (every match is equally "relevant")
Vector Model
• Vector of term weights based on term frequency
• Compute similarity between query and document where both are vectors
• vec(dj) = (w1j, w2j, ..., wtj) and vec(q) = (w1q, w2q, ..., wtq)
• Similarity is the cosine of the angle between the vectors.
Cosine Measure

[Figure: dj and q drawn as vectors; the similarity is the cosine of the angle between them.]

sim(q, dj) = vec(dj) · vec(q) / (|vec(dj)| * |vec(q)|)
           = Σi (wij * wiq) / (sqrt(Σi wij^2) * sqrt(Σi wiq^2))

Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1.

(from MIR notes)
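A direct transcription of the formula, assuming plain Python lists as weight vectors (the sample vectors are the d1 and q vectors from the worked example below):

```python
import math

# Cosine similarity between a query and a document weight vector,
# following the formula above.
def cosine(q, d):
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

print(round(cosine([0.22, 0.47, 0.85], [0.33, 0.0, 0.42]), 2))  # 0.81
```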
How to Set wij Weights? TF-IDF
• Within a document: term frequency (tf) measures term density within the document.
• Across documents: inverse document frequency (idf) measures the informativeness, or rarity, of a term across the corpus.
TF * IDF Computation

wij = tf(i,j) * idf(i) = (freq(i,j) / max_l freq(l,j)) * log(N / ni)

where N is the number of documents in the corpus and ni is the number of documents containing term i.
• What happens as the number of occurrences in a document increases?
• What happens as the term becomes more rare?
TF * IDF
• TF may be normalized: tf(i,d) = freq(i,d) / max_l freq(l,d), where the max runs over all terms l in document d.
• IDF is computed as idf(i) = log(N / ni):
  • normalized to the size of the corpus (N docs, ni of which contain term i)
  • as a log, to make the TF and IDF values comparable
• IDF requires a static corpus.
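Putting the two pieces together, a sketch of the document weighting, assuming the natural log (which is what the worked example below uses):

```python
import math
from collections import Counter

# TF-IDF as defined above: tf normalized by the maximum term frequency
# in the document, idf(i) = log(N / ni).
def tfidf_vector(doc_tokens, corpus):
    N = len(corpus)
    freqs = Counter(doc_tokens)
    max_freq = max(freqs.values())
    weights = {}
    for term, f in freqs.items():
        n_i = sum(1 for d in corpus if term in d)  # docs containing term i
        weights[term] = (f / max_freq) * math.log(N / n_i)
    return weights

# Toy corpus of three tokenized documents, invented for illustration.
corpus = [["k1", "k1", "k3"], ["k2", "k3"], ["k1", "k2"]]
print(tfidf_vector(corpus[0], corpus))  # {'k1': 0.405..., 'k3': 0.202...}
```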
How to Set wiq Weights?
• Create the vector directly from the query
• Use a modified tf-idf: wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)
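A sketch of the query weighting; the helper name query_vector and the doc_freq parameter are ours, but the formula is the one applied in step 2 of the example below:

```python
import math
from collections import Counter

# Modified tf-idf for query weights:
# wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / ni)
# doc_freq maps each term to ni, the number of docs containing it.
def query_vector(query_tokens, N, doc_freq):
    freqs = Counter(query_tokens)
    max_freq = max(freqs.values())
    return {
        t: (0.5 + 0.5 * f / max_freq) * math.log(N / doc_freq[t])
        for t, f in freqs.items()
    }

# The query [1 2 3] from the example: k1 once, k2 twice, k3 three times,
# over N = 7 docs with k1 in 5 docs, k2 in 4, k3 in 3.
q = query_vector(["k1"] + ["k2"] * 2 + ["k3"] * 3, 7, {"k1": 5, "k2": 4, "k3": 3})
print({t: round(w, 2) for t, w in q.items()})  # {'k1': 0.22, 'k2': 0.47, 'k3': 0.85}
```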
The Vector Model: Example

[Figure: the example corpus of seven documents d1–d7 over three index terms k1, k2, k3; from MIR notes.]
The Vector Model: Example (cont.)

1. Compute the tf-idf vector for each document (log here is the natural log).

For the first document:
k1: (2/2) * log(7/5) = .33
k2: 0 * log(7/4) = 0
k3: (1/2) * log(7/3) = .42

For the rest: [.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]

(from MIR notes)
The Vector Model: Example (cont.)

2. Compute the tf-idf for the query [1 2 3]:

k1: (.5 + (.5 * 1)/3) * log(7/5)
k2: (.5 + (.5 * 2)/3) * log(7/4)
k3: (.5 + (.5 * 3)/3) * log(7/3)

which is: [.22 .47 .85]
The Vector Model: Example (cont.)

3. Compute the similarity for each document:

d1: d1 · q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
|d1| = sqrt(.33^2 + .42^2) = .53
|q| = sqrt(.22^2 + .47^2 + .85^2) = 1.0
sim = .43 / (.53 * 1.0) = .81

d2: .22, d3: .93, d4: .23, d5: .97, d6: .51, d7: .47
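The whole ranking can be replayed in a few lines. Small differences from the slide's figures (e.g. d3 at .94 rather than .93) are artifacts of starting from the two-decimal vectors:

```python
import math

# Rank the seven example documents by cosine similarity to the query,
# using the rounded tf-idf vectors from the slides above.
vectors = {
    "d1": [0.33, 0.00, 0.42], "d2": [0.34, 0.00, 0.00],
    "d3": [0.00, 0.19, 0.85], "d4": [0.34, 0.00, 0.00],
    "d5": [0.08, 0.28, 0.85], "d6": [0.17, 0.56, 0.00],
    "d7": [0.00, 0.56, 0.00],
}
q = [0.22, 0.47, 0.85]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Sort by descending similarity: d5, d3, d1, d6, d7, then d2/d4.
for doc, sim in sorted(((d, cosine(q, v)) for d, v in vectors.items()),
                       key=lambda pair: -pair[1]):
    print(doc, round(sim, 2))
```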
Vector Model Implementation Issues
• The term × document matrix is sparse.
• Store the term count, the term weight, or the count weighted by idfi?
• What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
• How do we efficiently compute the cosine for a large index?
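One common answer to the sparsity question, sketched: store only the non-zero weights as an inverted index and accumulate dot products term by term (weights here are made up):

```python
# Sparse representation: an inverted index mapping each term to the
# documents where its weight is non-zero.
index = {
    "k1": {"d1": 0.33, "d2": 0.34},
    "k2": {"d6": 0.56, "d7": 0.56},
    "k3": {"d1": 0.42, "d3": 0.85},
}

# Only documents sharing at least one query term are ever touched,
# so scoring walks just the relevant postings lists.
def dot_scores(query_weights, index):
    scores = {}
    for term, wq in query_weights.items():
        for doc, wd in index.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + wq * wd
    return scores

print(dot_scores({"k1": 0.22, "k3": 0.85}, index))
```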
Heuristics for Computing Cosine for a Large Index
• Consider only documents with non-zero cosines.
• Focus on non-zero cosines for rare (high-idf) words.
• Pre-compute document adjacency:
  • for each term, pre-compute the k nearest docs
  • for a t-term query, compute cosines from the query to the union of the t pre-computed lists, and choose the top k
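A sketch of the last heuristic, with invented per-term lists; candidate_pool is a hypothetical helper name:

```python
# Pre-computed adjacency: for each term, keep only its k best documents,
# then compute full cosines just for the union of those lists.
champion_lists = {
    "k1": ["d5", "d2"],  # the k nearest docs for k1, best first
    "k2": ["d7", "d6"],
    "k3": ["d5", "d3"],
}

def candidate_pool(query_terms, champion_lists):
    pool = set()
    for term in query_terms:
        pool |= set(champion_lists.get(term, []))
    return pool  # score only these documents with the full cosine

print(sorted(candidate_pool(["k1", "k3"], champion_lists)))  # ['d2', 'd3', 'd5']
```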
The TF-IDF Vector Model: Pros/Cons
• Pros:
  • term weighting improves retrieval quality
  • the cosine ranking formula sorts documents according to their degree of similarity to the query
• Cons:
  • assumes independence of index terms