Retrieval Models Probabilistic and Language Models
Basic Probabilistic Model • D represented by binary vector d = (d1, d2, … dn) where di = 0/1 indicates absence/presence of term i • pi = P(di = 1|R) and 1 − pi = P(di = 0|R) • qi = P(di = 1|NR) and 1 − qi = P(di = 0|NR) • Assume conditional independence: P(d|R) is the product of the probabilities for the components of d (i.e. the probability of getting a particular vector of 1's and 0's) • Likelihood estimate converted to a linear discriminant function: g(d) = log [P(d|R)/P(d|NR)] + log [P(R)/P(NR)]
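To make the discriminant concrete, here is a minimal sketch in Python of g(d) under the conditional-independence assumption; the term probabilities p, q and the default prior log-odds are made-up illustrative values, not estimates from any real collection.

    import math

    def g(d, p, q, prior_log_odds=0.0):
        # g(d) = log P(d|R)/P(d|NR) + log P(R)/P(NR); with conditional independence
        # each term contributes log(p_i/q_i) if present, log((1-p_i)/(1-q_i)) if absent
        score = prior_log_odds
        for d_i, p_i, q_i in zip(d, p, q):
            if d_i == 1:
                score += math.log(p_i / q_i)
            else:
                score += math.log((1 - p_i) / (1 - q_i))
        return score

    # Toy example: three index terms with hypothetical probabilities
    print(g(d=[1, 0, 1], p=[0.8, 0.4, 0.6], q=[0.3, 0.5, 0.2]))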
Basic Probabilistic Model • Need to calculate • "relevant to query given term appears" and • "non-relevant to query given term appears" • These values can be based on known relevance judgments • ratio of relevant docs with the term to relevant docs without the term • ratio of non-relevant docs with the term to non-relevant docs without the term • We rarely have relevance information • Estimating these probabilities is the same problem as determining weighting formulae in less formal models • constant (Croft and Harper combination match) • proportional to probability of occurrence in the collection • more accurately, proportional to log(probability of occurrence) (Greiff, 1998)
What is a Language Model? • Probability distribution over strings of a text • How likely is a given string in a given "language"? • e.g. consider probabilities for the following strings: • p1 = P("a quick brown dog") • p2 = P("dog quick a brown") • p3 = P("быстрая brown dog") • p4 = P("быстрая собака") ("quick dog" in Russian) • English: p1 > p2 > p3 > p4 • … depends on what "language" we are modeling • In most of IR, we assume p1 == p2 • For some applications we will want p3 to be highly probable
Basic Probability Review • Probabilities • P(s) = probability of event s occurring, e.g. P("moo") • P(s|M) = probability of s occurring given M • P("moo"|"cow") > P("moo"|"cat") • (Sum of P(s) over all s) = 1, and P(not s) = 1 − P(s) • Independence • If events w1 and w2 are independent, then P(w1 AND w2) = P(w1) * P(w2), so … • Bayes' rule: P(M|s) = P(s|M) P(M) / P(s)
Language Models (LMs) • What are we modeling? • M: the "language" we are trying to model • s: an observation (a string of tokens from the vocabulary) • P(s|M): probability of observing "s" in M • M can be thought of as a "source" or generator • a mechanism that can create strings that are legal in M • P(s|M) = probability of getting "s" during random sampling from M
LMs in IR • Task: given a query, retrieve relevant documents • Use LMs to model the process of query generation • Every document in a collection defines a "language" • Consider all possible sentences the author could have written down • Some are more likely to occur than others • Depends on subject, topic, writing style, language, etc. • P(s|M) = probability the author would write down string "s" • Like writing a billion variations of a doc and counting the # of times we see "s"
LMs in IR • Now suppose "Q" is the user's query • What is the probability that the author of document D would write down "Q"? • Rank documents D in the collection by P(Q|MD) • the probability of observing "Q" during random sampling from the language model of document D
LMs in IR • Advantages: • Formal mathematical model (theoretical foundation) • Simple, well understood framework • Integrates both indexing and retrieval models • Natural use of collection statistics • Avoids tricky issues of “relevance”, “aboutness” • Disadvantages: • Difficult to incorporate notions of “relevance”, user preferences • Relevance feedback/query expansion is not straightforward • Can’t accommodate phrases, passages, Boolean operators • There are extensions of LM which overcome some issues
Major Issues of applying LMs • What kind of LM should we use? • Unigram • Higher-order model • How can we estimate model parameters? • Can use smoothed relative frequency (counting) for estimation • How can we use the model for ranking?
Unigram LMs • Words are “sampled” independently of each other • Metaphor: randomly pulling words from an urn “with replacement” • Joint probability decomposes into a product of marginals • Estimation of probabilities: simple counting
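A minimal sketch of the "simple counting" estimate and the product-of-marginals assumption; the toy document string is invented for illustration.

    from collections import Counter

    def unigram_mle(tokens):
        # Maximum-likelihood unigram model: P(w|M) = count(w) / total number of tokens
        counts = Counter(tokens)
        total = len(tokens)
        return {w: c / total for w, c in counts.items()}

    doc = "the quick brown dog chased the quick cat".split()
    model = unigram_mle(doc)
    print(model["quick"])                   # 2/8 = 0.25
    print(model["quick"] * model["dog"])    # joint P("quick dog") = product of marginals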
Higher-order LMs • Unigram model assumes word independence • Cannot capture surface form: P(“fat cat”) = P(“cat fat”) • Higher-order models • N-gram: conditioning on preceding words • Cache: condition on a window • Grammar: condition on parse tree • Are they useful? • No improvements from n-gram, grammar-based models • Some work on cache-like models • Parameter estimation is prohibitively expensive!
Predominant Model is Multinomial • Fundamental event: what is the identity of the i'th query token? • Observation is a sequence of events, one for each query token • Original model was multiple-Bernoulli • Fundamental event: does the word w occur in the query? • Observation is a vector of binary events, one for each possible word
Multinomial or Multiple-Bernoulli? • The two models are fundamentally different • entirely different event spaces (random variables involved) • both assume word independence (though it has different meanings) • both use smoothed relative-frequency (counting) for estimation • Multinomial • can account for multiple word occurrences in the query • well understood: lots of research in related fields (and now in IR) • possibility for integration with ASR/MT/NLP (same event space) • Multiple-Bernoulli • arguably better suited to IR (directly checks presence of query terms) • provisions for explicit negation of query terms ("A but not B") • no issues with observation length • binary events: a word either occurs or does not occur
Ranking with Language Models • Standard approach: query-likelihood • Estimate an LM, MD, for every document D in the collection • Rank docs by the probability of "generating" the query from the document: P(q1 … qk | MD) = ∏i P(qi | MD) • Drawbacks: • no notion of relevance: everything is random sampling • user feedback / query expansion is not part of the model • examples of relevant documents cannot help us improve the LM MD • the only option is augmenting the original query Q with extra terms • however, we could make use of sample queries for which D is relevant • does not directly allow weighted or structured queries
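A sketch of query-likelihood ranking in log space. To keep the product from collapsing to zero on query terms missing from a document, it interpolates with a collection model (Jelinek-Mercer-style smoothing, covered later in these slides); the documents, query, and lambda value are invented.

    import math
    from collections import Counter

    def query_likelihood(query, doc_tokens, collection_model, lam=0.5):
        # log P(q1 ... qk | M_D) = sum_i log P(q_i | M_D), smoothed with the collection model
        counts = Counter(doc_tokens)
        score = 0.0
        for q in query:
            p_doc = counts[q] / len(doc_tokens)
            p_col = collection_model.get(q, 1e-9)
            score += math.log(lam * p_doc + (1 - lam) * p_col)
        return score

    docs = {"d1": "quick brown dog".split(), "d2": "brown cow moo".split()}
    all_tokens = [w for toks in docs.values() for w in toks]
    collection = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

    query = "brown dog".split()
    ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d], collection), reverse=True)
    print(ranking)   # d1 ranks above d2 for this query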
Ranking: Document-likelihood • Flip the direction of the query-likelihood approach • estimate a language model MQ for the query Q • rank docs D by the likelihood of being a random sample from MQ • MQ expected to “predict” a typical relevant document • Problems: • different doc lengths, probabilities not comparable • favors documents that contain frequent (low content) words • consider “ideal” (highest-ranked) document for a given query
Ranking: Model Comparison • Combine the advantages of the two ranking methods • estimate a model of both the query, MQ, and the document, MD • directly compare the similarity of the two models • a natural measure of similarity is cross-entropy (others exist): H(MQ || MD) = − Σw P(w|MQ) log P(w|MD) • Cross-entropy is not symmetric: use H(MQ||MD) • the reverse direction works consistently worse and favors different documents • use the reverse if ranking multiple queries with respect to one document
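A sketch of ranking by cross-entropy H(MQ || MD) between a query model and a document model, where lower is better; the tiny models and the epsilon floor for unseen words are illustrative choices.

    import math

    def cross_entropy(model_q, model_d, eps=1e-10):
        # H(M_Q || M_D) = - sum_w P(w|M_Q) * log P(w|M_D); eps avoids log(0)
        return -sum(p_q * math.log(model_d.get(w, eps)) for w, p_q in model_q.items())

    mq = {"brown": 0.5, "dog": 0.5}                     # toy query model
    md1 = {"quick": 1/3, "brown": 1/3, "dog": 1/3}      # toy document models
    md2 = {"brown": 1/3, "cow": 1/3, "moo": 1/3}
    print(cross_entropy(mq, md1) < cross_entropy(mq, md2))   # True: md1 fits the query better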
Summary of LM • Use unigram models • no consistent benefit from using higher-order models • estimation is much more complex (bi-gram, etc.) • Use multinomial models • well studied, consistent with other fields • extend the multiple-Bernoulli model to non-binary events? • Use model comparison for ranking • allows feedback, expansion, etc. • Estimation is a crucial step • very significant impact on performance (more than the other choices) • key to cross-language, cross-media and other applications
Estimation • Want to estimate MQ and/or MD from Q and/or D • General problem: • given a string of text S (= Q or D), estimate its language model MS • S is commonly assumed to be an i.i.d. random sample from MS • independent and identically distributed • Basic language models: • maximum-likelihood estimator and the zero-frequency problem • discounting techniques: • Laplace correction, Lidstone correction, absolute discounting, leave-one-out discounting, Good-Turing method • interpolation/back-off techniques: • Jelinek-Mercer smoothing, Dirichlet smoothing, Witten-Bell smoothing, Zhai-Lafferty two-stage smoothing, interpolation vs. back-off • Bayesian estimation
Maximum Likelihood • Count relative frequencies of words in S • Pml(w| MS) = #(w,S)/|S| • maximum-likelihood property: • assigns highest possible likelihood to the observation • unbiased estimator: • if we repeat estimation an infinite number of times with different starting points S, we will get correct probabilities (on average) • this is not very useful…
The Zero-frequency problem • Suppose some event not in our observation S • Model will assign zero probability to that event • And to any set of events involving the unseen event • Happens very frequently with language • It is incorrect to infer zero probabilities • especially when creating a model from short samples
Discounting Methods • Laplace correction: • add 1 to every count, re-normalize • problematic for large vocabularies • Lidstone correction: • add a small constant ε to every count, re-normalize • Absolute discounting: • subtract a constant ε, re-distribute the probability mass
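A sketch of the Laplace/Lidstone corrections over a fixed vocabulary; the vocabulary, document, and ε value are invented for illustration (ε = 1 gives the Laplace correction).

    from collections import Counter

    def lidstone(tokens, vocabulary, eps=0.5):
        # Add eps to every count, then re-normalize over the whole vocabulary
        counts = Counter(tokens)
        total = len(tokens) + eps * len(vocabulary)
        return {w: (counts[w] + eps) / total for w in vocabulary}

    vocab = ["quick", "brown", "dog", "cat"]
    model = lidstone("quick brown dog".split(), vocab, eps=0.5)
    print(model["cat"])          # unseen word gets 0.5 / 5 = 0.1 instead of zero
    print(sum(model.values()))   # still normalizes to 1.0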
Discounting Methods • Smoothing: Two possible approaches • Interpolation • Adjust probabilities for all events, both seen and unseen • “interpolate” ML estimates with General English expectations (computed as relative frequency of a word in a large collection) • reflects expected frequency of events • Back-off: • Adjust probabilities only for unseen events • Leave non-zero probabilities as they are • Rescale everything to sum to one: • rescales “seen” probabilities by a constant • Interpolation tends to work better • And has a cleaner probabilistic interpretation
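A sketch contrasting the two approaches; doc_model is assumed to hold (possibly discounted) per-word estimates for the document, collection_model the "general English" relative frequencies, and lam/alpha are illustrative parameters.

    def interpolate(word, doc_model, collection_model, lam=0.7):
        # Interpolation: adjust every word, seen or unseen, by mixing the
        # document estimate with the general-English expectation
        return lam * doc_model.get(word, 0.0) + (1 - lam) * collection_model.get(word, 0.0)

    def back_off(word, doc_model, collection_model, alpha=0.1):
        # Back-off: leave non-zero "seen" estimates as they are; only unseen
        # words fall back to the (rescaled) collection model
        if word in doc_model:
            return doc_model[word]
        return alpha * collection_model.get(word, 0.0)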
Types of Evaluation • Might evaluate several aspects: • assistance in formulating queries • speed of retrieval • resources required • presentation of documents • ability to find relevant documents • Evaluation is generally comparative • System A vs. B • System A vs. A′ • Most common evaluation: retrieval effectiveness
The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will have different judgments • Same users may judge differently at different times • Degree of relevance of different documents may vary
The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) are relevant • A document is either relevant or not
Relevance • In a small collection - the relevance of each document can be checked • With real collections, never know full set of relevant documents • Any retrieval model includes an implicit definition of relevance • Satisfiability of a FOL expression • Distance • P(Relevance|query,document)
Evaluation • Set of queries • Collection of documents (corpus) • Relevance judgements: which documents are correct and incorrect for each query • If small collection, can review all documents • Not practical for large collections • Any ideas about how we might approach collecting relevance judgments for very large collections? [Figure: example query about potato farming and the nutritional value of potatoes, with retrieved documents ("growing potatoes", "potato blight", "Mr. Potato Head", "nutritional info for spuds") marked as relevant or non-relevant]
Finding Relevant Documents • Pooling • Retrieve documents using several auto techniques • Judge top n documents for each technique • Relevant set is union • Subset of true relevant set • Possible to estimate size of relevant set by sampling • When testing: • How should unjudged documents be treated? • How might this affect results?
Test Collections • To compare the performance of two techniques: • each technique used to evaluate same queries • results (set or ranked list) compared using metric • most common measures - precision and recall • Usually use multiple measures to get different views of performance • Usually test with multiple collections – • performance is collection dependent
Evaluation • Let retrieved = 100, relevant = 25, relevant & retrieved = 10 • Recall = 10/25 = 0.40 • the ability to return ALL relevant items • Precision = 10/100 = 0.10 • the ability to return ONLY relevant items [Figure: Venn diagram of the retrieved, relevant, and relevant & retrieved document sets]
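The same numbers as set arithmetic, in a minimal sketch; the doc-id ranges are made up so that the overlap comes out to 10.

    def precision_recall(retrieved, relevant):
        rel_ret = len(retrieved & relevant)            # relevant AND retrieved
        return rel_ret / len(retrieved), rel_ret / len(relevant)

    retrieved = set(range(100))        # 100 retrieved docs (hypothetical ids)
    relevant = set(range(90, 115))     # 25 relevant docs, 10 of them retrieved
    precision, recall = precision_recall(retrieved, relevant)
    print(precision, recall)           # 0.1 0.4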
Precision and Recall • Precision and recall well-defined for sets • For ranked retrieval • Compute value at fixed recall points (e.g. precision at 20% recall) • Compute a P/R point for each relevant document, interpolate • Compute value at fixed rank cutoffs (e.g. precision at rank 20)
Precision at Fixed Recall: 1 Query • Find precision given the total number of docs retrieved at a given recall value • Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} • |Rq| = 10, the number of relevant docs for q • Ranking of retrieved docs in the answer set of q: • 10% recall => 0.1 * 10 rel docs = 1 rel doc retrieved • one doc retrieved to get 1 rel doc: precision = 1/1 = 100% • 20% recall => 0.2 * 10 rel docs = 2 rel docs retrieved • 3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667 • 30% recall => 0.3 * 10 rel docs = 3 rel docs retrieved • 6 docs retrieved to get 3 rel docs: precision = 3/6 = 0.5 • What is precision at recall values from 40-100%?
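A sketch of the computation above; the slide's actual ranking list isn't reproduced here, so the example ranking below is hypothetical, built so that relevant docs appear at ranks 1, 3, and 6 and therefore reproduces the worked precision values.

    def precision_at_recall_levels(ranked, relevant, levels=(0.1, 0.2, 0.3)):
        # For each recall level, walk down the ranking until that recall is reached
        # and report precision at that rank cutoff (no interpolation)
        results = {}
        for level in levels:
            found = 0
            for rank, doc in enumerate(ranked, start=1):
                if doc in relevant:
                    found += 1
                if found / len(relevant) >= level - 1e-9:   # recall level reached
                    results[level] = found / rank
                    break
        return results

    relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
    # Hypothetical ranking: relevant docs at ranks 1, 3 and 6, as in the worked example
    ranked = ["d3", "n1", "d5", "n2", "n3", "d9", "n4", "n5", "n6", "n7"]
    print(precision_at_recall_levels(ranked, relevant))   # {0.1: 1.0, 0.2: 0.67 (2/3), 0.3: 0.5}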
Recall/Precision Curve • |Rq| = 10, the number of relevant docs for q • precision at each recall level for the ranking of retrieved docs in the answer set of q:
Recall    Precision
0.1       1/1 = 1.00
0.2       2/3 = 0.67
0.3       3/6 = 0.50
0.4       4/10 = 0.40
0.5       5/15 = 0.33
0.6–1.0   0.00
[Figure: recall/precision curve plotting precision (%) against recall (%) for these values]
Averaging • Hard to compare individual P/R graphs or tables • microaverage: each relevant doc is a point in the average • done with respect to a parameter • e.g. coordination level matching (# of shared query terms) • average across the total number of relevant documents at each match level • Let L = relevant docs, T = retrieved docs, λ = coordination level, NR = total # of relevant docs for all queries, Lλ = total # of relevant docs retrieved at or above level λ for all queries, Tλ = total # of docs retrieved at or above level λ for all queries • microaveraged recall at λ = Lλ / NR and precision at λ = Lλ / Tλ
Averaging • Let 100 and 80 be the # of relevant docs for queries 1 and 2, respectively • calculate actual recall and precision values for both queries • Hard to compare individual precision/recall tables, so take averages • There are 100 + 80 = 180 relevant docs for all queries • one row per match level:
Recall                    Precision
(10+8)/180 = 0.10         (10+8)/(10+10) = 0.90
(20+24)/180 = 0.24        (20+24)/(25+40) = 0.68
(40+40)/180 = 0.44        (40+40)/(66+80) = 0.55
(60+56)/180 = 0.64        (60+56)/290 = 0.40
(80+72)/180 = 0.84        (80+72)/446 = 0.34
Averaging and Interpolation • macroaverage - each query is a point in the avg • can be independent of any parameter • average of precision values across several queries at standard recall levels e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20 • their actual recall points are: .33, .67, and 1.0 (why?) • their precision is .25, .22, and .15 (why?) • Average over all relevant docs • rewards systems that retrieve relevant docs at the top (.25+.22+.15)/3= 0.21
Averaging and Interpolation • Interpolation • actual recall levels of individual queries are seldom equal to standard levels • interpolation estimates the best possible performance value between two known values • e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20 • their precision at actual recall is .25, .22, and .15
Averaging and Interpolation • Actual recall levels of individual queries are seldom equal to the standard levels • Interpolated precision at the ith recall level, Ri, is the maximum precision at all points p such that Ri ≤ p ≤ Ri+1 • assume only 3 relevant docs are retrieved, at ranks 4, 9, 20 • their actual recall points are .33, .67, and 1.0 • their precision is .25, .22, and .15 • what is the interpolated precision at the standard recall points?
Recall level              Interpolated precision
0.0, 0.1, 0.2, 0.3        0.25
0.4, 0.5, 0.6             0.22
0.7, 0.8, 0.9, 1.0        0.15
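A sketch of interpolation over the standard 11 recall levels, using the convention of taking the maximum precision at any actual recall point at or beyond the level, which reproduces the table above; the (recall, precision) points are the ones from this example.

    def interpolated_precision(points, levels=None):
        # points: list of (recall, precision) pairs for one query
        if levels is None:
            levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
        table = {}
        for level in levels:
            candidates = [p for r, p in points if r >= level - 1e-9]
            table[level] = max(candidates) if candidates else 0.0
        return table

    # Relevant docs at ranks 4, 9, 20 -> actual (recall, precision) points
    points = [(1/3, 0.25), (2/3, 0.22), (1.0, 0.15)]
    print(interpolated_precision(points))
    # 0.0-0.3 -> 0.25, 0.4-0.6 -> 0.22, 0.7-1.0 -> 0.15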
Document Level Averages • Precision after a given number of docs retrieved • e.g.) 5, 10, 15, 20, 30, 100, 200, 500, & 1000 documents • Reflects the actual system performance as a user might see it • Each precision avg is computed by summing precisions at the specified doc cut-off and dividing by the number of queries • e.g. average precision for all queries at the point where n docs have been retrieved
R-Precision • Precision after R documents are retrieved • R = number of relevant docs for the query • Average R-Precision • mean of the R-Precisions across all queries • e.g.) assume 2 queries having 50 and 10 relevant docs; the system retrieves 17 and 7 relevant docs in the top 50 and top 10 documents retrieved, respectively • the R-Precisions are 17/50 = 0.34 and 7/10 = 0.70, so the average R-Precision is (0.34 + 0.70)/2 = 0.52
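A minimal sketch of R-Precision for a single query; the tiny ranking and relevant set are invented.

    def r_precision(ranked, relevant):
        # Precision after R = |relevant| documents have been retrieved
        r = len(relevant)
        return sum(1 for d in ranked[:r] if d in relevant) / r

    relevant = {"d1", "d2", "d3"}             # R = 3
    ranked = ["d1", "x7", "d2", "d3", "x9"]
    print(r_precision(ranked, relevant))      # 2 relevant in the top 3 -> 2/3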
Evaluation • Recall-Precision value pairs may co-vary in ways that are hard to understand • Would like to find composite measures • A single number measure of effectiveness • primarily ad hoc and not theoretically justifiable • Some attempt to invent measures that combine parts of the contingency table into a single number measure
Symmetric Difference • A is the retrieved set of documents • B is the relevant set of documents • A Δ B (the symmetric difference) is the shaded area in the Venn diagram
E measure (van Rijsbergen) • used to emphasize precision or recall • like a weighted average of precision and recall • large α increases the importance of precision • can transform via α = 1/(β² + 1), β = P/R • when α = 1/2, β = 1: precision and recall are equally important • E = normalized symmetric difference of the retrieved and relevant sets: Eβ=1 = |A Δ B| / (|A| + |B|) • F = 1 − E is typical (good results mean larger values of F)
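A sketch of E in its β parameterization and of F = 1 − E; the precision/recall inputs reuse the earlier 0.10/0.40 example.

    def e_measure(precision, recall, beta=1.0):
        # E = 1 - (beta^2 + 1) * P * R / (beta^2 * P + R); beta < 1 emphasizes precision
        if precision == 0 or recall == 0:
            return 1.0
        return 1 - (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

    p, r = 0.10, 0.40
    print(e_measure(p, r))        # E with beta = 1
    print(1 - e_measure(p, r))    # F1 = 2PR/(P+R) = 0.16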
Other Single-Valued Measures • Breakeven point • the point at which precision = recall • Expected search length • saves users from having to look through non-relevant docs • Swets model • uses statistical decision theory to express recall, precision, and fallout in terms of conditional probabilities • Utility measures • assign costs to each cell in the contingency table • sum (or average) the costs over all queries • Many others...