This lecture introduces the concepts of Boolean search and ranking techniques in information retrieval. It explores the strengths and weaknesses of Boolean search and discusses the importance of ranking documents based on their relevance to a query. The lecture also covers term weighting techniques such as term frequency and inverse document frequency.
Information Retrieval
For the MSc Computer Science Programme
Lecture 2
Introduction to Information Retrieval (Manning et al. 2007), Chapters 6 & 7
Dell Zhang, Birkbeck, University of London
Boolean Search • Strength • Docs either match or not. • Good for expert users with precise understanding of their needs and the corpus. • Weakness • Not good for (the majority of) users with poor Boolean formulation of their needs. • Applications may consume 1000’s of results, but most users don’t want to wade through 1000’s of results – cf. use of Web search engines.
Beyond Boolean Search • Solution: Ranking • We wish to return, in order, the documents most likely to be useful to the searcher. • How can we rank/order the docs in the corpus with respect to a query? • Assign a score – say in [0, 1] – to each doc for each query.
Document Scoring • Idea: More is Better • If a document talks about a topic more, then it is a better match. • That is to say, a document is more relevant if it contains more relevant terms. • This leads to the problem of term weighting.
Bag-Of-Words (BOW) Model • Term-Document Count Matrix • Each document corresponds to a vector in ℕ^|V| (one component per term in the vocabulary V), i.e., a column of the matrix. The matrix element A(i,j) is the number of occurrences of the i-th term in the j-th doc.
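As an illustration (not part of the original slides), here is a minimal Python sketch of how such a term-document count matrix can be built; the toy corpus and variable names are made up for this example.

```python
from collections import Counter

# Toy corpus (made up for illustration); each string is one document.
docs = [
    "john is quicker than mary",
    "mary is quicker than john",
    "ides of march",
]

# Vocabulary: one axis (row) per distinct term.
vocab = sorted({t for d in docs for t in d.split()})

# A(i, j) = number of occurrences of the i-th term in the j-th doc.
counts = [Counter(d.split()) for d in docs]
A = [[counts[j][term] for j in range(len(docs))] for term in vocab]

for term, row in zip(vocab, A):
    print(f"{term:>8}: {row}")
```

Note that the first two toy documents produce identical columns, which is exactly the bag-of-words indistinguishability discussed on the next slide.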
Bag-Of-Words (BOW) Model • Simplification • In the BOW model, • the doc • John is quicker than Mary. • is indistinguishable from the doc • Mary is quicker than John.
Term Frequency (TF) • Digression: Terminology • WARNING: In a lot of IR literature, "frequency" is used to mean "count". • Thus term frequency in IR literature is used to mean the number of occurrences of a term in a document, not divided by document length (which would actually make it a frequency). • We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.
Term Frequency (TF) • What is the relative importance of • 0 vs. 1 occurrence of a term in a doc, • 1 vs. 2 occurrences, • 2 vs. 3 occurrences, ……? • Can just use raw tf. • While it seems that more is better, a lot isn't proportionally better than a few. • So another option commonly used in practice is the log-scaled weight: wf(t,d) = 1 + log tf(t,d) if tf(t,d) > 0, and 0 otherwise.
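A few lines of Python showing this sublinear (log-scaled) weighting as a sketch; the function name wf and the choice of the natural log are conventions assumed here, not dictated by the slides.

```python
import math

def wf(tf: int) -> float:
    """Log-scaled term-frequency weight: 1 + log(tf) if tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Many more occurrences only add a little weight: not proportionally better.
print([round(wf(tf), 2) for tf in (0, 1, 2, 10, 100)])  # [0.0, 1.0, 1.69, 3.3, 5.61]
```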
Term Frequency (TF) • The score of a document d for a query q is the sum of the tf values of the query terms appearing in d: Score(q, d) = Σ_{t ∈ q} tf(t,d) • 0 if no query terms appear in the document • wf can be used instead of tf in the above
Term Frequency (TF) • Is TF good enough for weighting? • Ignorance of document length • Long docs are favored because they're more likely to contain query terms. • This can be fixed to some extent by normalizing for document length. [more on this later]
Term Frequency (TF) • Is TF good enough for weighting? • Ignorance of term rarity in the corpus • Consider the query ides of march. • Julius Caesar has 5 occurrences of ides, while no other play has ides. • march occurs in over a dozen plays. • All the plays contain of. • By this weighting scheme, the top-scoring play is likely to be the one with the most ofs.
Document/Collection Frequency • Which of these tells you more about a doc? • 5 occurrences of of? • 5 occurrences of march? • 5 occurrences of ides? • We’d like to attenuate the weight of a common term. But what is “common”? • Collection Frequency (CF) • the number of occurrences of the term in the corpus • Document Frequency (DF) • the number of docs in the corpus containing the term
Document/Collection Frequency • DF may be better than CF

Word        CF       DF
try         10422    8760
insurance   10440    3997

So how do we make use of DF?
Inverse Document Frequency (IDF) • Could just be the reciprocal of DF (idf_i = 1/df_i). • But by far the most commonly used version is: idf_i = log(N / df_i), where N is the total number of docs in the corpus.
Inverse Document Frequency (IDF) • Prof Karen Spärck Jones 1935-2007
TFxIDF • The TFxIDF weighting scheme combines: • Term Frequency (TF) • a measure of term density in a doc • Inverse Document Frequency (IDF) • a measure of the informativeness of a term: its rarity across the whole corpus
TFxIDF • Each term i in each document d is assigned a TFxIDF weight: w(i,d) = tf(i,d) × log(N / df_i) • Increases with the number of occurrences within a doc. • Increases with the rarity of the term across the whole corpus. What is the weight of a term that occurs in all of the docs?
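A minimal sketch of computing these TFxIDF weights, using raw counts for tf and log(N/df) for idf; the toy corpus and names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, chosen so that "of" appears in every doc.
docs = [
    "ides of march",
    "the ides of march have come",
    "march of the toys",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of docs containing each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    """TFxIDF weight per term: raw count times log(N / df)."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for text, toks in zip(docs, tokenized):
    print(text, "->", {t: round(w, 2) for t, w in tfidf(toks).items()})
```

Here "of" occurs in all three toy docs, so its weight is tf × log(N/N) = 0, which answers the question above.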
Term-Document Matrix (Real-Valued) The matrix element A(i,j) is the log-scaled TFxIDF weight. Note: can be > 1.
Vector Space Model • Docs → Vectors • Each doc j can now be viewed as a vector of TFxIDF values, one component for each term. • So we have a vector space • Terms are axes • Docs live in this space • May have 20,000+ dimensions • even with stemming
Vector Space Model • Prof Gerard Salton 1927-1995 The SMART information retrieval system
Vector Space Model [Figure: docs d1–d5 plotted as vectors in a space whose axes are the terms t1, t2, t3] • First application: Query-By-Example (QBE) • Given a doc d, find others "like" it. • Now that d is a vector, find vectors (docs) "near" it. Postulate: Documents that are "close together" in the vector space talk about the same things.
Vector Space Model • Queries → Vectors • Regard a query as a (very short) document. • Return the docs ranked by the closeness of their vectors to the query, also represented as a vector.
Desiderata for Proximity • If d1 is near d2, then d2 is near d1. • If d1 near d2, and d2 near d3, then d1 is not far from d3. • No doc is closer to d than d itself.
Euclidean Distance • The distance between dj and dk is dist(dj, dk) = √( Σ_i (A(i,j) − A(i,k))² ) • Why is this not a great idea? • We still haven't dealt with the issue of length normalization: long documents would be more similar to each other by virtue of length, not topic. • However, we can implicitly normalize by looking at angles instead.
Cosine Similarity • Vector Normalization • A vector can be normalized (given a length of 1) by dividing each of its components by its length – here we use the L2 norm: ||d||_2 = √( Σ_i d_i² ) • This maps vectors onto the unit sphere. • Then longer documents don't get more weight.
Cosine Similarity • The cosine of the angle between two vectors: sim(dj, dk) = (dj · dk) / (||dj|| ||dk||) = Σ_i A(i,j) A(i,k) / ( √(Σ_i A(i,j)²) √(Σ_i A(i,k)²) ) The denominator involves the lengths of the vectors. This means normalization.
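A short, self-contained sketch of cosine-based ranking; the weight vectors and query below are invented for illustration, and the query is treated as just another vector.

```python
import math

def cosine(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# TFxIDF-style weight vectors (one component per term), made up for illustration.
docs = {
    "d1": [2.0, 0.0, 1.5],
    "d2": [0.0, 3.1, 0.4],
    "d3": [1.2, 0.1, 2.0],
}
query = [1.0, 0.0, 1.0]  # the query, regarded as a very short document

# Rank docs by decreasing cosine similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```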
Cosine Similarity [Figure: the angle θ between doc vectors d1 and d2 in a space with term axes t1, t2, t3] • The similarity between dj and dk is captured by the cosine of the angle between their vectors. No triangle inequality for similarity.
Cosine Similarity - Exercise • Rank the following by decreasing cosine similarity: • Two docs that have only frequent words (the, a, an, of) in common. • Two docs that have no words in common. • Two docs that have many rare words in common (wingspan, tailfin).
Cosine Similarity - Exercise • Show that, for normalized vectors, Euclidean distance measure gives the same proximity ordering as the cosine similarity measure.
Cosine Similarity - Example • Docs • Austen's Sense and Sensibility (SaS) • Austen's Pride and Prejudice (PaP) • Bronte's Wuthering Heights (WH) cos(SaS, PaP) = 0.996 x 0.993 + 0.087 x 0.120 + 0.017 x 0.000 = 0.999 cos(SaS, WH) = 0.996 x 0.847 + 0.087 x 0.466 + 0.017 x 0.254 = 0.889
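The arithmetic can be checked with a few lines of Python; the three components per book are the normalized term weights given above (the slide does not name the terms), and since the vectors are already length-normalized, the cosine reduces to a plain dot product.

```python
# Normalized weight vectors copied from the example above.
SaS = [0.996, 0.087, 0.017]
PaP = [0.993, 0.120, 0.000]
WH  = [0.847, 0.466, 0.254]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))

print(round(dot(SaS, PaP), 3))  # 0.999
print(round(dot(SaS, WH), 3))   # 0.888 (the slide's 0.889, up to rounding)
```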
Vector Space Model - Summary • What’s the real point of using vector space? • Every query can be viewed as a (very short) doc. • Every query becomes a vector in the same space as the docs. • Can measure each doc’s proximity to the query. • It provides a natural measure of scores/ranking – no longer Boolean. • Docs (and queries) are expressed as bags of words.
Vector Space Model - Exercise • How would you augment the inverted index built in previous lectures to support cosine ranking computations? • Walk through the steps of serving a query using the Vector Space Model.
Efficient Cosine Ranking • Computing a single cosine • For every term t, store tf(t,d) with each doc d in t's postings list. • Some tradeoffs on whether to store the term count, the term weight, or the weight multiplied by IDF. • At query time, accumulate the component-wise sum.
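A rough sketch of term-at-a-time scoring with accumulators; the postings, query weights, and document lengths below are invented, and which quantity is stored in the postings (count, weight, or weight × IDF) is one of the tradeoffs mentioned above.

```python
from collections import defaultdict

# Postings: term -> list of (doc_id, stored weight for that term in that doc).
postings = {
    "ides":  [("JC", 3.2)],
    "march": [("JC", 0.9), ("MND", 0.7)],
    "of":    [("JC", 0.0), ("MND", 0.0), ("HAM", 0.0)],
}
query_weights = {"ides": 1.0, "march": 1.0, "of": 1.0}
doc_lengths = {"JC": 4.1, "MND": 2.3, "HAM": 3.0}  # precomputed vector lengths

# Accumulate the component-wise sum, one query term at a time.
scores = defaultdict(float)
for term, wq in query_weights.items():
    for doc, wd in postings.get(term, []):
        scores[doc] += wq * wd

# Dividing by the doc length normalizes for document length;
# the query length is the same for every doc, so it does not affect the ranking.
for doc in scores:
    scores[doc] /= doc_lengths[doc]

print(sorted(scores.items(), key=lambda x: x[1], reverse=True))
```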
Efficient Cosine Ranking • Computing the k largest cosines • Search as a kNN problem • Find the k docs "nearest" to the query (those with the largest query-doc cosines) in the vector space. • Do not need to totally order all docs in the corpus. • Use a heap for selecting the top k docs • A binary tree in which each node's value > the values of its children • Takes 2n operations to construct, then each of the k "winners" can be read off in log n steps. • For n = 1M and k = 100, this is about 10% of the cost of a complete sort.
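A small sketch of heap-based top-k selection, assuming the per-doc cosine scores have already been computed; Python's heapq does the heap bookkeeping.

```python
import heapq
import random

# Pretend these are cosine scores for n = 1,000,000 docs.
random.seed(0)
scores = {f"doc{i}": random.random() for i in range(1_000_000)}

k = 100
# Heap-based selection: read off the k "winners" without fully sorting all n scores.
top_k = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
print(top_k[:3])
```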
Efficient Cosine Ranking • Heuristics • Avoid computing cosines from the query to each of the n docs, at the risk of occasionally getting an answer wrong. • For example, cluster pruning.
Take Home Messages • TFxIDF • Vector Space Model • docs and queries as vectors • cosine similarity • efficient cosine ranking