TF/IDF Ranking
Vector space model • Documents are also treated as a “bag” of words or terms. • Each document is represented as a vector. • Term Frequency (TF) Scheme: • Weight of a term t_i in document d_j is the number of times that t_i appears in d_j, denoted by f_ij.
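As a minimal sketch (the function name is illustrative, not from the slides), the raw counts f_ij of the TF scheme can be computed with Python's Counter:

```python
from collections import Counter

def term_frequencies(document):
    """Raw TF scheme: the weight of term t_i in document d_j is
    simply its count f_ij in that document."""
    return Counter(document.lower().split())

tf = term_frequencies("the red apple is the best apple")
# f_ij for "apple" is 2; for "red" it is 1
```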
Why not just frequency? • A shortcoming of the TF scheme is that it does not consider the situation where a term appears in many documents of the collection. • E.g. "flight" in a document collection about airplanes. • Such a term may not be discriminative.
TF-IDF term weighting scheme • The most well-known weighting scheme. • TF: (normalized) term frequency: tf_ij = f_ij / maxf_j, where maxf_j is the largest raw frequency of any term in d_j. • IDF: inverse document frequency. • Penalizes terms (words) that occur too often in the document collection: idf_i = log(N / df_i), where N: total number of docs, df_i: the number of docs that t_i appears in. • The final TF-IDF term weight is: w_ij = tf_ij × idf_i. • Each document will be a vector of such numbers.
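A hedged sketch of the weight w_ij = tf_ij × idf_i, assuming precomputed document frequencies df and collection size N are passed in (both names are illustrative inputs, not from the slides):

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, df, N):
    """Compute w_ij = tf_ij * idf_i for one document, with
    tf_ij = f_ij / maxf_j and idf_i = log(N / df_i)."""
    f = Counter(doc_terms)          # raw counts f_ij
    maxf = max(f.values())          # maxf_j for this document
    return {t: (c / maxf) * math.log(N / df[t]) for t, c in f.items()}

# Toy collection of N = 4 docs; "apple" appears in 1 doc, "red" in 2.
w = tfidf_vector(["apple", "apple", "red"], df={"apple": 1, "red": 2}, N=4)
```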
Retrieval in the vector space model • Query q is represented in the same way as a document. • The weight w_iq of each term t_i in q can also be computed in the same way as in a document. • Relevance of d_j to q: compare the similarity of query q and document d_j. • For this, use cosine similarity (the cosine of the angle between the two vectors). • The bigger the cosine, the smaller the angle and the higher the similarity: cosine(q, d_j) = (q · d_j) / (|q| |d_j|), i.e., the dot product of the two unit vectors: Σ_i w_iq × w_ij / ( √(Σ_{i=1..h} w_iq²) × √(Σ_{i=1..k} w_ij²) ), where h is the number of words (terms) in q, and k is the number of words (terms) in d_j.
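The cosine formula above can be sketched over sparse weight vectors stored as dicts (an illustrative helper, not from the slides):

```python
import math

def cosine_similarity(q_vec, d_vec):
    """Cosine of the angle between two sparse weight vectors:
    (q . d) / (|q| |d|)."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    norm_q = math.sqrt(sum(w * w for w in q_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d_vec.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Two identical vectors score 1.0 (angle 0), and vectors with no terms in common score 0.0.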
Document Frequency • Suppose query is: calpurnia animal
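To illustrate what document frequency does to such a query, here is a sketch with hypothetical df values (the slide's collection statistics are not reproduced here, so these numbers are made up): the rare term gets a much larger idf and dominates the ranking.

```python
import math

N = 1_000_000                               # hypothetical collection size
df = {"calpurnia": 1, "animal": 100_000}    # hypothetical document frequencies
idf = {t: math.log(N / df[t]) for t in df}
# idf["calpurnia"] is far larger than idf["animal"], so documents
# containing the rare term "calpurnia" rank higher for this query.
```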
Computing the Cosine Similarity in Practice • Only the terms mentioned by the query matter in q · d. • |d| can be computed offline for each document and stored in a document table (docid, |d|). • E.g. d = "I like the red apple." Suppose the idf's are: • I: 1, like: 2, red: 5, apple: 10. Since the tf's in this example are 1 for each word, |d| = √(1² + 2² + 5² + 10²) = 11.40 • |q| is easily computed online. • E.g. if q = red apple, |q| = √(5² + 10²) = 11.18 • Score for this document is: q · d / (|q| |d|) = (5·5 + 10·10) / (11.40 · 11.18) ≈ 0.98
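The worked example above can be replayed directly (tf = 1 everywhere, so each weight is just the idf):

```python
import math

# Document "I like the red apple." with the slide's idf weights
# ("the" is left out, as in the slide's |d| computation):
d = {"i": 1.0, "like": 2.0, "red": 5.0, "apple": 10.0}
q = {"red": 5.0, "apple": 10.0}                      # query: red apple

dot = sum(w * d.get(t, 0.0) for t, w in q.items())   # 5*5 + 10*10 = 125
norm_d = math.sqrt(sum(w * w for w in d.values()))   # ~11.40
norm_q = math.sqrt(sum(w * w for w in q.values()))   # ~11.18
score = dot / (norm_q * norm_d)                      # ~0.98
```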
Computing the Cosine Similarity in Practice (2) • We store the maxf_j's in the document table, which will now have (docid, |d_j|, maxf_j) for each document d_j. • We store the idf's in a word table, which will have (t_i, idf_i) for each word t_i. • This can be implemented by using a HashMap of word-doc_counter pairs. • After building this HashMap, iterate over it and insert pairs (word, N/doc_counter) in the word table. • Note: idf's are retrieved for each word mentioned in the query. These are the only words that matter in computing the dot product in the numerator of the cosine similarity formula. • To compute tf_ij = f_ij / maxf_j, we retrieve • f_ij from the inverted index • maxf_j from the document table
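A sketch of building the word table described above, using a Python dict in place of the HashMap (the input format is an assumption; the table stores N/doc_counter exactly as the slide states, with no log applied here):

```python
from collections import defaultdict

def build_word_table(docs):
    """docs: dict mapping docid -> list of terms.
    Returns the word table {word: N / doc_counter}."""
    doc_counter = defaultdict(int)      # word -> number of docs containing it
    for terms in docs.values():
        for word in set(terms):         # count each word once per document
            doc_counter[word] += 1
    N = len(docs)
    return {word: N / c for word, c in doc_counter.items()}

# "a" appears in both docs, "b" in one of the two.
table = build_word_table({1: ["a", "b"], 2: ["a"]})
```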