Learn about NDCG and its importance in evaluating the effectiveness and efficiency of search engines. Explore the Vector Space Model and Language Models for retrieval.
INF 141: Information Retrieval Discussion Session Week 8 – Winter 2010 TA: Sara Javanmardi
Assignment 5 • Ranking • You can do it in groups of 1, 2 or 3
General Questions Question 1: Calculating NDCG
Evaluation • Evaluation is key to building effective and efficient search engines • measurement usually carried out in controlled laboratory experiments • online testing can also be done • Effectiveness, efficiency and cost are related • e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration • efficiency and cost targets may impact effectiveness
Evaluation • Precision • Recall • NDCG
Effectiveness Measures • A is the set of relevant documents, B is the set of retrieved documents • Precision = |A ∩ B| / |B| (fraction of retrieved documents that are relevant) • Recall = |A ∩ B| / |A| (fraction of relevant documents that are retrieved)
Problems • Users look at only the top part of the ranked results • Precision at rank p (e.g., p = 10) measures precision over just the top p results • Problem: precision at rank p ignores the order of the results within the top p
Discounted Cumulative Gain • Popular measure for evaluating web search and related tasks • Two assumptions: • Highly relevant documents are more useful than marginally relevant documents • the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain • Uses graded relevance as a measure of the usefulness, or gain, from examining a document • Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks • Typical discount is 1/log (rank) • With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Discounted Cumulative Gain • DCG is the total gain accumulated at a particular rank p: DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i) • Alternative formulation: DCG_p = Σ_{i=1..p} (2^{rel_i} − 1) / log2(1 + i) • used by some web search companies • emphasis on retrieving highly relevant documents
Example ranking: fair, fair, good
Normalized DCG • DCG numbers are averaged across a set of queries at specific rank values • DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking • makes averaging easier for queries with different numbers of relevant documents
NDCG Example • Labels • Perfect (5) • Excellent (4) • Good (3) • Fair (2) • Bad (0) -> no gain for bad • Ranking to evaluate: fair, fair, good • Perfect ranking: • good, fair, fair • Calculate DCG for fair, fair, good, then divide by the DCG of the perfect ranking to normalize
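A minimal sketch of this calculation in Python, assuming the graded gains listed above (good = 3, fair = 2) and the 1/log2(rank) discount from the earlier slide:

```python
import math

def dcg(gains):
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)."""
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

ranking = [2, 2, 3]                    # system ranking: fair, fair, good
ideal = sorted(ranking, reverse=True)  # perfect ranking: good, fair, fair

ndcg = dcg(ranking) / dcg(ideal)
print(round(dcg(ranking), 3), round(dcg(ideal), 3), round(ndcg, 3))
# 5.893  6.262  0.941
```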
General Questions Question 2 & 3: Retrieval Models 1) Vector Space Model 2) Language Models
Vector Space Model • 3-d pictures useful, but can be misleading for high-dimensional space
Vector Space Model • Documents ranked by distance between points representing query and documents • Similarity measure more common than a distance or dissimilarity measure • e.g. Cosine correlation
Similarity Calculation • Consider two documents D1, D2 and a query Q • D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
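A small sketch of the computation with these vectors; the cosine values come out to roughly 0.87 for D1 and 0.97 for D2, so D2 is ranked higher:

```python
import math

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q = (1.5, 1.0, 0.0)

print(round(cosine(D1, Q), 2))  # ≈ 0.87
print(round(cosine(D2, Q), 2))  # ≈ 0.97 -> D2 is ranked above D1
```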
Term Weights • tf.idf weight • Term frequency weight measures importance in the document, e.g. the number of occurrences f_{ik} of term k in document i (possibly normalized by document length) • Inverse document frequency measures importance in the collection: idf_k = log(N / n_k), where N is the number of documents and n_k is the number of documents containing term k • Some heuristic modifications
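A minimal sketch of one common tf.idf variant (raw term frequency times log-scaled idf); the exact weighting used in the assignment may differ:

```python
import math

def tf_idf(f_td, N, n_t):
    """Weight of term t in document d.
    f_td: occurrences of t in d (term frequency)
    N:    number of documents in the collection
    n_t:  number of documents containing t"""
    tf = f_td                 # importance of the term in the document
    idf = math.log(N / n_t)   # importance of the term in the collection
    return tf * idf

# Illustrative numbers only: a term occurring 3 times in the document,
# in a collection of 500,000 documents where 1,000 documents contain it
print(round(tf_idf(3, 500_000, 1_000), 2))
```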
Language Model • Unigram language model • probability distribution over the words in a language • generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them • N-gram language model • some applications use bigram and trigram language models where probabilities depend on previous words
LMs for Retrieval • 3 possibilities: • probability of generating the query text from a document language model • probability of generating the document text from a query language model • comparing the language models representing the query and document topics • Models of topical relevance
Query-Likelihood Model • Rank documents by the probability that the query could be generated by the document model (i.e. same topic) • Given query, start with P(D|Q) • Using Bayes' Rule: P(D|Q) ∝ P(Q|D) P(D) • Assuming the prior P(D) is uniform, rank by P(Q|D); with a unigram model, P(Q|D) = Π_{i=1..n} P(q_i|D)
Estimating Probabilities • Obvious estimate for unigram probabilities is the maximum likelihood estimate P(q_i|D) = f_{q_i,D} / |D|, where f_{q_i,D} is the number of times query word q_i occurs in D and |D| is the number of words in D • makes the observed value of f_{q_i,D} most likely • If query words are missing from the document, the score will be zero • Missing 1 out of 4 query words is scored the same as missing 3 out of 4
Smoothing • Document texts are a sample from the language model • Missing words should not have zero probability of occurring • Smoothing is a technique for estimating probabilities for missing (or unseen) words • lower (or discount) the probability estimates for words that are seen in the document text • assign that “left-over” probability to the estimates for the words that are not seen in the text
Estimating Probabilities • Estimate for unseen words is α_D P(q_i|C) • P(q_i|C) is the probability for query word i in the collection language model for collection C (background probability) • α_D is a parameter • Estimate for words that occur is (1 − α_D) P(q_i|D) + α_D P(q_i|C) • Different forms of estimation come from different α_D
Jelinek-Mercer Smoothing • α_D is a constant, λ • Gives estimate of p(q_i|D) = (1 − λ) f_{q_i,D}/|D| + λ P(q_i|C) • Ranking score: log P(Q|D) = Σ_{i=1..n} log[(1 − λ) f_{q_i,D}/|D| + λ P(q_i|C)] • Use logs for convenience • accuracy problems multiplying small numbers
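A minimal sketch of the Jelinek-Mercer ranking score, assuming raw term counts for the document and the collection (the λ value and the counts in the usage example are only illustrative):

```python
import math

def jm_score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    """log P(Q|D) = sum_i log[(1 - lam) * f_{qi,D}/|D| + lam * c_{qi}/|C|]"""
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts.get(t, 0) / doc_len      # maximum likelihood estimate
        p_coll = coll_counts.get(t, 0) / coll_len   # background (collection) probability
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

doc = {"president": 15, "lincoln": 25}
coll = {"president": 160_000, "lincoln": 2_400}
print(jm_score(["president", "lincoln"], doc, 1_800, coll, 1_000_000_000, lam=0.5))
```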
Where is the tf.idf Weight? • The document-dependent part of the score is proportional to the term frequency and inversely proportional to the collection frequency
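One way to make this visible is the standard rewriting of the Jelinek-Mercer score (a sketch; terms that are the same for every document are dropped because they do not affect the ranking):

```latex
\log P(Q|D)
  = \sum_{i} \log\Big[(1-\lambda)\,\frac{f_{q_i,D}}{|D|} + \lambda\,P(q_i|C)\Big]
  \;\stackrel{rank}{=}\;
  \sum_{i:\,f_{q_i,D}>0} \log\Big[1 + \frac{(1-\lambda)\,f_{q_i,D}}{\lambda\,|D|\,P(q_i|C)}\Big]
```

Inside the logarithm, each matching term grows with f_{q_i,D} (term frequency) and shrinks with P(q_i|C) (collection frequency), which is the tf.idf-like behavior.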
Dirichlet Smoothing • α_D = μ / (|D| + μ) depends on document length • Gives probability estimate p(q_i|D) = (f_{q_i,D} + μ P(q_i|C)) / (|D| + μ) • and document score log P(Q|D) = Σ_{i=1..n} log[(f_{q_i,D} + μ c_{q_i}/|C|) / (|D| + μ)]
Query Likelihood Example • For the term “president” • f_{q_i,D} = 15, c_{q_i} = 160,000 • For the term “lincoln” • f_{q_i,D} = 25, c_{q_i} = 2,400 • number of word occurrences in the document |d| is assumed to be 1,800 • number of word occurrences in the collection is 10^9 • 500,000 documents times an average of 2,000 words • μ = 2,000
Query Likelihood Example • Negative number because summing logs of small numbers
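A minimal sketch of the computation with Dirichlet smoothing and the constants above (natural log is assumed):

```python
import math

def dirichlet_log_prob(f_qd, c_q, doc_len, coll_len, mu=2000):
    """log[(f_{qi,D} + mu * c_{qi}/|C|) / (|D| + mu)] for one query term."""
    return math.log((f_qd + mu * c_q / coll_len) / (doc_len + mu))

doc_len = 1_800             # |d|
coll_len = 1_000_000_000    # 500,000 documents x 2,000 words

president = dirichlet_log_prob(15, 160_000, doc_len, coll_len)
lincoln = dirichlet_log_prob(25, 2_400, doc_len, coll_len)
print(round(president, 2), round(lincoln, 2), round(president + lincoln, 2))
# ≈ -5.51  -5.02  -10.54
```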
General Questions Question 3: PageRank
PageRank • Billions of web pages, some more informative than others • Links can be viewed as information about the popularity (authority?) of a web page • can be used by ranking algorithm • Inlink count could be used as simple measure • Link analysis algorithms like PageRank provide more reliable ratings • less susceptible to link spam
Random Surfer Model • Browse the Web using the following algorithm: • Choose a random number r between 0 and 1 • If r < λ: • Go to a random page • If r ≥ λ: • Click a link at random on the current page • Start again • PageRank of a page is the probability that the “random surfer” will be looking at that page • links from popular pages will increase PageRank of pages they point to
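A minimal simulation of the random surfer (the 3-page link graph and the λ value here are purely illustrative):

```python
import random

def random_surfer(links, steps=100_000, lam=0.15):
    """Estimate PageRank as the fraction of steps the surfer spends on each page."""
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = random.choice(pages)
    for _ in range(steps):
        if random.random() < lam or not links[page]:
            page = random.choice(pages)         # jump to a random page
        else:
            page = random.choice(links[page])   # click a link at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(random_surfer(links))
```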
Dangling Links • Random jump prevents getting stuck on pages that • do not have links • contain only links that no longer point to other pages • have links forming a loop • Links that point to the first two types of pages are called dangling links • may also be links to pages that have not yet been crawled
PageRank • PageRank (PR) of page C = PR(A)/2 + PR(B)/1 • More generally, PR(u) = Σ_{v ∈ B_u} PR(v) / L_v • where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v (not counting duplicate links)
PageRank • Don’t know PageRank values at start • Assume equal values (1/3 in this case), then iterate: • first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17 • second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17 • third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25 • Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2
PageRank • Taking the random page jump into account, 1/3 chance of going to any page when r < λ • PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1) • More generally, PR(u) = λ/N + (1 − λ) · Σ_{v ∈ B_u} PR(v) / L_v • where N is the number of pages, λ typically 0.15
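A minimal sketch of the iteration, assuming the link graph implied by the example above (A links to B and C, B links to C, C links to A); with λ = 0 it reproduces the converged values 0.4, 0.4, 0.2, and λ = 0.15 gives the version with random jumps:

```python
def pagerank(links, lam=0.15, iterations=100):
    """Iterate PR(u) = lam/N + (1 - lam) * sum over v in B_u of PR(v)/L_v."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1 / N for p in pages}   # start with equal values
    for _ in range(iterations):
        pr = {u: lam / N + (1 - lam) * sum(pr[v] / len(links[v])
                                           for v in pages if u in links[v])
              for u in pages}
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links, lam=0))     # converges to about A: 0.4, B: 0.2, C: 0.4
print(pagerank(links, lam=0.15))  # with random jumps, lambda = 0.15
```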
Link Quality • Link quality is affected by spam and other factors • e.g., link farms to increase PageRank • trackback links in blogs can create loops • links from comments section of popular blogs • Blog services modify comment links to contain rel=nofollow attribute • e.g., “Come visit my <a rel=nofollow href="http://www.page.com">web page</a>.”
Wikipedia: Most vandalized pages • This page lists some articles that have undergone repeated vandalism. • Manual vandalism detection: “please watch the following articles closely and revert any vandalism as appropriate.” • Changes related to "Wikipedia:Most vandalized pages" • Language model distance: http://en.wikipedia.org/w/index.php?title=President_pro_tempore_of_the_Oklahoma_Senate&diff=prev&oldid=346008144 • Repeating characters: http://en.wikipedia.org/w/index.php?title=Flag_of_Berlin&diff=prev&oldid=346006207 and http://en.wikipedia.org/w/index.php?title=Flag_of_Berlin&diff=prev&oldid=346005984 • References removed
Cosine Similarity http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html