Relevance Ranking
Introduction • In relational algebra, the response to a query is always an unordered set of qualifying tuples. • Keyword queries are not precise • Rate each document for how likely it is to satisfy the user’s information need. • Present the results in a ranked list.
Relevance ranking • Recall and Precision • The Vector-Space Model • A broad class of ranking algorithms based on this model • Relevance Feedback and Rocchio’s Method • Probabilistic Relevance Feedback Models • Advanced Issues
Measures for a search engine • How fast does it index? (number of documents/hour) • How fast does it search? (latency as a function of index size) • Expressiveness of the query language: ability to express complex information needs • Uncluttered UI • Is it free?
Measures for a search engine • All of the preceding criteria are measurable • The key measure: user happiness. What is this? • Speed of response and size of index are factors, but blindingly fast, useless answers won’t make a user happy • We need a way of quantifying user happiness
Happiness: elusive to measure • Most common proxy: relevance of search results • But how do you measure relevance? • Measuring relevance requires 3 elements: a benchmark document collection, a benchmark suite of queries, and a (usually binary) assessment of Relevant or Nonrelevant for each query and each document • There is some work on more-than-binary judgments, but binary is the standard
Evaluating an IR system • Note: the information need is translated into a query • Relevance is assessed relative to the information need, not the query • E.g., information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine." Query: wine red white heart attack effective • You evaluate whether the doc addresses the information need, not whether it contains these words
Standard relevance benchmarks • TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years • Human experts mark, for each query and each doc, Relevant or Nonrelevant
Unranked retrieval evaluation: Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved) • Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant) • Precision P = tp / (tp + fp) • Recall R = tp / (tp + fn)
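A minimal sketch of these two definitions, assuming the retrieved set and the relevance judgments are available as Python sets of document ids (the function name and the toy ids are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for an unranked retrieved set (sets of doc ids)."""
    tp = len(retrieved & relevant)      # relevant docs that were retrieved
    fp = len(retrieved - relevant)      # retrieved but not relevant
    fn = len(relevant - retrieved)      # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Toy ids: precision = 1/3, recall = 1/2
print(precision_recall({"d1", "d2", "d3"}, {"d1", "d4"}))
```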
Should we instead use the accuracy measure for evaluation? • Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant” • The accuracy of an engine: the fraction of these classifications that are correct • Accuracy is a commonly used evaluation measure in machine-learning classification work • Why is this not a very useful evaluation measure in IR?
Why not just use accuracy? • How to build a 99.9999% accurate search engine on a low budget: return nothing for every query (“Snoogle.com — Search for: … 0 matching results found.”), since almost all documents are nonrelevant • People doing information retrieval want to find something and have a certain tolerance for junk
Precision/Recall • You can get high recall (but low precision) by retrieving all docs for all queries! • Recall is a non-decreasing function of the number of docs retrieved • In a good system, precision decreases as either the number of docs retrieved or recall increases • This is not a theorem, but a result with strong empirical confirmation
Evaluating ranked results • Model • For each query q, an exhaustive set of relevant documents Dq ⊆ D is identified • A query q is submitted to the query system • A ranked list of documents (d1, d2, …, dn) is returned • Corresponding to the list, we can compute a 0/1 relevance list (r1, r2, …, rn): ri = 1 if di ∈ Dq, and ri = 0 if di ∉ Dq • The recall at rank k ≥ 1 is defined as Recall(k) = (1/|Dq|) · Σi=1..k ri • The precision at rank k is defined as Precision(k) = (1/k) · Σi=1..k ri
Evaluating ranked results • Example • Dq = {d1, d5, d7, d10} • Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2) • Then the relevance list is (1, 1, 0, 1, 0, 1, 0, 0) • Recall(2) = (1/4)*(1+1) = 0.5 • Recall(5) = (1/4)*(1+1+0+1+0) = 0.75 • Recall(6) = (1/4)*(1+1+0+1+0+1) = 1 • Precision(2) = (1/2)*(1+1) = 1 • Precision(5) = (1/5)*(1+1+0+1+0) = 0.6 • Precision(6) = (1/6)*(1+1+0+1+0+1) = 2/3
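The rank-k measures can be computed directly from the 0/1 relevance list; a small sketch that reproduces the example's numbers (the function names are mine):

```python
def recall_at_k(rels, k, num_relevant):
    """Recall at rank k: fraction of all relevant docs found in the top k."""
    return sum(rels[:k]) / num_relevant

def precision_at_k(rels, k):
    """Precision at rank k: fraction of the top k docs that are relevant."""
    return sum(rels[:k]) / k

# 0/1 relevance list from the example; |Dq| = 4
rels = [1, 1, 0, 1, 0, 1, 0, 0]
print(recall_at_k(rels, 2, 4), recall_at_k(rels, 5, 4), recall_at_k(rels, 6, 4))   # 0.5 0.75 1.0
print(precision_at_k(rels, 2), precision_at_k(rels, 5), precision_at_k(rels, 6))   # 1.0 0.6 0.666...
```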
Another figure of merit: Average precision • The sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents. • The average precision is 1 only if the engine retrieves all relevant documents and ranks them ahead of any irrelevant document.
Average precision • Example • Dq = {d1, d5, d7, d10} • Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2) • Relevance list: (1, 1, 0, 1, 0, 1, 0, 0) • Average precision = (1/4)*(1 + 1 + 3/4 + 4/6) ≈ 0.854
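A short sketch of average precision under the same setup (the relevance list and |Dq| = 4 come from the example above; the helper name is mine):

```python
def average_precision(rels, num_relevant):
    """Sum of precision at each relevant hit position, divided by |Dq|."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank      # precision at this relevant position
    return total / num_relevant

rels = [1, 1, 0, 1, 0, 1, 0, 0]
print(average_precision(rels, 4))     # (1 + 1 + 3/4 + 4/6) / 4 ≈ 0.854
```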
Evaluation at large search engines • For a large corpus in rapid flux, such as the web, it is impossible to determine Dq. • Recall is difficult to measure on the web • Search engines often use precision at top k, e.g., k = 10 • . . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.
Evaluation at large search engines • Search engines also use non-relevance-based measures. • Clickthrough on first result • Not very reliable if you look at a single clickthrough … but pretty reliable in the aggregate. • A/B testing
A/B testing • Purpose: test a single innovation • Prerequisite: you have a large search engine up and running • Have most users use the old system • Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation • Evaluate with an “automatic” measure like clickthrough on first result • Now we can directly see if the innovation improves user happiness • Probably the evaluation methodology that large search engines trust most
Problem with Boolean search: feast or famine • Boolean queries often result in either too few (= 0) or too many (1000s of) results • Query 1: “standard user dlink 650” → 200,000 hits • Query 2: “standard user dlink 650 no card found” → 0 hits • It takes skill to come up with a query that produces a manageable number of hits • With a ranked list of documents it does not matter how large the retrieved set is
Scoring as the basis of ranked retrieval • We wish to return in order the documents most likely to be useful to the searcher • How can we rank-order the documents in the collection with respect to a query? • Assign a score – say in [0, 1] – to each document • This score measures how well document and query “match”.
Term-document count matrices • Consider the number of occurrences of a term in a document • Each document is a count vector in ℕ|V|: one column of the term-document count matrix, with one component per vocabulary term
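A tiny illustration of building such count vectors, with a made-up three-document corpus and a whitespace split standing in for real tokenization:

```python
from collections import Counter

# Made-up corpus; split() stands in for real preprocessing.
docs = {
    "d1": "red wine heart attack wine",
    "d2": "white wine tasting notes",
    "d3": "heart attack risk factors",
}

# One axis per vocabulary term; each document becomes a count vector in N^|V|.
vocab = sorted({t for text in docs.values() for t in text.split()})
counts = {d: Counter(text.split()) for d, text in docs.items()}
count_vectors = {d: [counts[d][t] for t in vocab] for d in docs}

print(vocab)
print(count_vectors["d1"])   # term counts for d1, aligned with vocab order
```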
The vector space model • Documents are represented as vectors in a multidimensional Euclidean space. • Each axis in this space corresponds to a term. • The coordinate of document d in the direction corresponding to term t is determined by two quantities, Term frequency and Inverse document frequency.
Term frequency tf • The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d. • We want to use tf when computing query-document match scores. But how? • Raw term frequency is not what we want: • A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term. • But not 10 times more relevant. • Relevance does not increase proportionally with term frequency.
Term frequency tf • Normalization is needed! • There are many normalization methods. • Normalize by the sum of term counts: TFt,d = n(d,t) / Σt′ n(d,t′) (the total number of term occurrences in document d) • Log frequency weighting • The Cornell SMART system dampens tf with: TF(d,t) = 0 if n(d,t) = 0, and TF(d,t) = 1 + log(1 + log n(d,t)) otherwise (base-10 logs, as in the example below)
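A sketch of the two schemes named above; the SMART-style form uses base-10 logs so that it reproduces the numbers in the worked example later in this section:

```python
import math

def tf_sum_normalized(n_dt, doc_length):
    """tf normalized by the total number of term occurrences in the document."""
    return n_dt / doc_length

def tf_smart(n_dt):
    """SMART-style damped tf: 0 if the term is absent, else 1 + log(1 + log n(d,t))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log10(1.0 + math.log10(n_dt))

print(tf_smart(1))     # 1.0
print(tf_smart(100))   # 1 + log10(1 + 2) ≈ 1.477, matching the later example
```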
Document frequency • Rare terms are more informative than frequent terms • Recall stop words • Consider a term in the query that is rare in the collection (e.g., arachnocentric) • A document containing this term is very likely to be relevant to the query arachnocentric • → We want a high weight for rare terms like arachnocentric.
Collection vs. Document frequency • The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences within a document • The document frequency of t is the number of documents that contain t • Example: which word is a better search term (and should get a higher weight)?
Document frequency • Consider a query term that is frequent in the collection (e.g., high, increase, line) • A document containing such a term is more likely to be relevant than a document that doesn’t, but it’s not a sure indicator of relevance • → For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms • We will use document frequency (df) to capture this in the score • df (≤ N) is the number of documents that contain the term
Inverse document frequency • dft is the document frequency of t: the number of documents that contain t • dft is an inverse measure of the informativeness of t • We define the idf (inverse document frequency) of t by idft = log(N/dft) • We use log(N/dft) instead of N/dft to “dampen” the effect of idf • The SMART system again uses a variant: IDF(t) = log((1 + N)/dft), as in the example below
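A minimal sketch of both idf variants (base-10 logs assumed, matching the example that follows):

```python
import math

def idf(N, df_t):
    """Standard damped idf: log(N / df_t)."""
    return math.log10(N / df_t)

def idf_smart(N, df_t):
    """SMART-style idf with the (1 + N) numerator used in the example below."""
    return math.log10((1 + N) / df_t)

print(idf_smart(9, 1))   # log10(10/1) = 1.0
print(idf_smart(9, 5))   # log10(10/5) ≈ 0.301
```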
tf-idf weighting • The tf-idf weight of a term is the product of its tf weight and its idf weight: wt,d = tft,d × idft • Best known weighting scheme in information retrieval • Note: the “-” in tf-idf is a hyphen, not a minus sign! • Alternative names: tf.idf, tf x idf • Increases with the number of occurrences within a document • Increases with the rarity of the term in the collection
Example • Collection size N = 9 • idft1 = log((1+9)/1) = 1 • idft2 = log((1+9)/5) = 0.301 • idft3 = log((1+9)/6) = 0.222 • tft1,d1 = 1 + log(1 + log 100) = 1.477 • tft2,d2 = 1 + log(1 + log 10) = 1.301 • Therefore, wt1,d1 = 1.477 × 1 = 1.477 and wt2,d2 = 1.301 × 0.301 ≈ 0.392
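Reusing the tf_smart and idf_smart sketches defined above, the example's weights fall out directly:

```python
# tf_smart and idf_smart are the sketch functions defined earlier in this section.
w_t1_d1 = tf_smart(100) * idf_smart(9, 1)   # ≈ 1.477 * 1.000 ≈ 1.477
w_t2_d2 = tf_smart(10) * idf_smart(9, 5)    # ≈ 1.301 * 0.301 ≈ 0.392
print(w_t1_d1, w_t2_d2)
```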
cosine(query, document) • The proximity measure between query and documents • cos(q, d) = (q/|q|) · (d/|d|) = Σi qi di / (√(Σi qi²) · √(Σi di²)), i.e., the dot product of the length-normalized (unit) vectors • qi is the tf-idf weight of term i in the query • di is the tf-idf weight of term i in the document • cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
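A small sketch of cosine similarity over sparse term-weight dictionaries (the dict representation is my own assumption, not anything prescribed here):

```python
import math

def cosine(q, d):
    """Cosine similarity between two sparse tf-idf vectors (dicts: term -> weight)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

print(cosine({"wine": 1.0, "red": 0.5}, {"wine": 2.0, "red": 1.0}))   # ≈ 1.0 (same direction)
print(cosine({"wine": 1.0}, {"heart": 1.0}))                          # 0.0 (orthogonal)
```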
Summary – vector space ranking • Represent the query as a weighted tf-idf vector • Represent each document as a weighted tf-idf vector • Compute the cosine similarity score between the query vector and each document vector • Rank documents with respect to the query by score • Return the top K (e.g., K = 10) to the user
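Putting the steps together, a sketch of the ranking loop; it reuses the cosine function above, and doc_vecs is assumed to be a dict mapping document ids to term-weight dicts:

```python
def rank_top_k(query_vec, doc_vecs, k=10):
    """Score every document against the query with cosine and return the top k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```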
Relevance Feedback The initial response from a search engine may not satisfy the user’s information need The average Web query is only two words long. Users can rarely express their information need within two words However, if the response list has at least some relevant documents, sophisticated users can learn how to modify their queries by adding or negating additional keywords. Relevance feedback automates this query refinement process.
Relevance Feedback : basic idea • Relevance feedback: user feedback on relevance of docs in initial set of results • User issues a (short, simple) query • The user marks some results as relevant or non-relevant. • The system computes a better representation of the information need based on feedback. • Relevance feedback can go through one or more iterations.
Relevance Feedback: Example Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Key concept: Centroid • The centroid is the center of mass of a set of points • Recall that we represent documents as points in a high-dimensional space • Definition: the centroid of a set of documents C is μ(C) = (1/|C|) · Σd∈C d, i.e., the vector average of the document vectors in C
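A one-line centroid computation over a matrix whose rows are document vectors (NumPy assumed; the toy vectors are made up):

```python
import numpy as np

def centroid(doc_matrix):
    """Center of mass of a set of document vectors (one row per document)."""
    return doc_matrix.mean(axis=0)

# Three made-up 4-dimensional document vectors
C = np.array([[1.0, 0.0, 2.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 0.0, 0.0, 1.0]])
print(centroid(C))   # [1.0, 0.333..., 1.0, 0.333...]
```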
Rocchio Algorithm • The Rocchio algorithm uses the vector space model to pick a relevance-feedback query • Rocchio seeks the query qopt that maximizes the difference between its similarity to the relevant and to the non-relevant documents: qopt = arg maxq [ sim(q, μ(Cr)) − sim(q, μ(Cnr)) ] • That is, it tries to separate docs marked relevant from docs marked non-relevant
The Theoretically Best Query [figure: x = non-relevant documents, o = relevant documents, with the optimal query vector marked]
Rocchio 1971 Algorithm (SMART) • Used in practice: qm = α·q0 + β·(1/|Dr|)·Σdj∈Dr dj − γ·(1/|Dnr|)·Σdj∈Dnr dj • Dr = set of known relevant doc vectors, Dnr = set of known non-relevant doc vectors (different from Cr and Cnr) • qm = modified query vector; q0 = original query vector; α, β, γ: weights (hand-chosen or set empirically) • The new query moves toward relevant documents and away from non-relevant documents!
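A sketch of the Rocchio update with NumPy vectors; the default α, β, γ values are illustrative choices, not values prescribed here, and negative weights are clamped to zero as noted in the "Subtleties" slide below:

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward the centroid of the judged relevant
    documents and away from the centroid of the judged non-relevant ones.
    All arguments are term-weight vectors of equal dimension."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs) > 0:
        qm = qm + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs) > 0:
        qm = qm - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(qm, 0.0)   # negative term weights are ignored (set to 0)

# Toy 3-term vectors for illustration
q0 = [1.0, 0.0, 1.0]
rel = [[0.0, 2.0, 1.0], [0.0, 1.0, 1.0]]
nonrel = [[3.0, 0.0, 0.0]]
print(rocchio(q0, rel, nonrel))
```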
Relevance feedback on initial query [figure: o = known relevant documents, x = known non-relevant documents, showing the initial query and the revised query]
Subtleties to note • Tradeoff α vs. β/γ: if we have a lot of judged documents, we want higher β/γ • γ is commonly set to zero: it is harder for users to give negative feedback, and it is also harder to use, since relevant documents often form a tight cluster but non-relevant documents rarely do • Some weights in the query vector can go negative • Negative term weights are ignored (set to 0)
Relevance Feedback in vector spaces • We can modify the query based on relevance feedback and apply the standard vector space model • Relevance feedback can improve recall and precision • Relevance feedback is most useful for increasing recall in situations where recall is important: users can be expected to review results and to take time to iterate
Evaluation of relevance feedback strategies • Use q0 and compute a precision–recall graph • Use qm and compute a precision–recall graph • Assessing on all documents in the collection gives spectacular improvements, but … it’s cheating! This is partly due to the known relevant documents being ranked higher • Must evaluate with respect to documents not seen by the user • Use documents in the residual collection (the set of documents minus those assessed relevant) • Measures are usually then lower than for the original query, but this is a more realistic evaluation, and relative performance can be validly compared
Evaluation of relevance feedback • Most satisfactory: use two collections, each with its own relevance assessments • q0 and user feedback from the first collection • qm run on the second collection and measured • Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.