
INF 141: Information Retrieval

Learn about NDCG and its importance in evaluating the effectiveness and efficiency of search engines. Explore the Vector Space Model and Language Models for retrieval.


Presentation Transcript


  1. INF 141: Information Retrieval Discussion Session Week 8 – Winter 2010 TA: Sara Javanmardi

  2. Assignment 5 • Ranking • You can do it in groups of 1, 2 or 3

  3. General Questions Question 1 : Calculating NDCG

  4. Evaluation • Evaluation is key to building effective and efficient search engines • measurement usually carried out in controlled laboratory experiments • online testing can also be done • Effectiveness, efficiency and cost are related • e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration • efficiency and cost targets may impact effectiveness

  5. Evaluation • Precision • Recall • NDCG

  6. Effectiveness Measures A is set of relevant documents, B is set of retrieved documents

  7. Problems • Users look at only the top part of the ranked results • Precision at rank p (e.g., p = 10) evaluates only the top p results • Problem: the order of the results within the top p does not affect the score (see the sketch below)
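A minimal Python sketch (not from the slides) illustrating the problem: two runs that retrieve the same ten documents get the same precision at rank 10, even though one puts the relevant documents first. The document IDs and relevance judgments are made up for illustration.

```python
def precision_at_p(ranking, relevant, p=10):
    """Fraction of the top-p retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:p] if doc in relevant) / p

relevant = {"d1", "d3", "d7"}
run_a = ["d1", "d3", "d7", "d9", "d2", "d4", "d5", "d6", "d8", "d10"]  # relevant first
run_b = ["d9", "d2", "d4", "d5", "d6", "d8", "d10", "d1", "d3", "d7"]  # relevant last

# Same top-10 set, so the same P@10 = 0.3, despite very different orderings.
print(precision_at_p(run_a, relevant), precision_at_p(run_b, relevant))
```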

  8. Discounted Cumulative Gain • Popular measure for evaluating web search and related tasks • Two assumptions: • Highly relevant documents are more useful than marginally relevant documents • The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined

  9. Discounted Cumulative Gain • Uses graded relevance as a measure of the usefulness, or gain, from examining a document • Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks • Typical discount is 1/log (rank) • With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3

  10. Discounted Cumulative Gain • DCG is the total gain accumulated at a particular rank p: DCGp = rel1 + Σi=2..p reli / log2(i) • Alternative formulation: DCGp = Σi=1..p (2^reli − 1) / log2(1 + i) • used by some web search companies • emphasis on retrieving highly relevant documents

  11. Example ranking (figure on the original slide): the top three results are labeled Fair, Fair, Good

  12. Normalized DCG • DCG numbers are averaged across a set of queries at specific rank values • DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking • makes averaging easier for queries with different numbers of relevant documents

  13. NDCG Example • Labels • Perfect (5) • Excellent (4) • Good (3) • Fair (2) • Bad (0) -> no gain for Bad • Perfect ranking: Good, Fair, Fair • Calculate DCG for the ranking Fair, Fair, Good, then divide by the DCG of the perfect ranking to normalize (see the sketch below)
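A minimal Python sketch of this calculation, assuming the gain values Fair = 2 and Good = 3 from the labels above and the discount 1/log2(rank), with no discount at rank 1:

```python
import math

def dcg(gains):
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)."""
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

ranking = [2, 2, 3]   # system ranking: Fair, Fair, Good
perfect = [3, 2, 2]   # perfect ranking: Good, Fair, Fair

ndcg = dcg(ranking) / dcg(perfect)
print(round(dcg(ranking), 3), round(dcg(perfect), 3), round(ndcg, 3))
# -> 5.893 6.262 0.941
```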

  14. General Questions Questions 2 & 3: Retrieval Models 1) Vector Space Model 2) Language Models

  15. Vector Space Model • 3-d pictures are useful, but can be misleading for high-dimensional spaces

  16. Vector Space Model • Documents ranked by distance between points representing query and documents • Similarity measure more common than a distance or dissimilarity measure • e.g. Cosine correlation

  17. Similarity Calculation • Consider two documents D1, D2 and a query Q • D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
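A minimal Python sketch computing the cosine correlation for these vectors; the two scores follow directly from the numbers on the slide:

```python
import math

def cosine(x, y):
    """Cosine correlation: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0.0)

print(round(cosine(D1, Q), 2))  # ~0.87
print(round(cosine(D2, Q), 2))  # ~0.97 -> D2 is ranked above D1
```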

  18. Term Weights • tf.idf weight • Term frequency weight measures importance in the document: commonly tfik = fik, the number of occurrences of term k in document i, often normalized by document length • Inverse document frequency measures importance in the collection: idfk = log(N / nk), where N is the number of documents and nk is the number of documents containing term k • Some heuristic modifications
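A minimal Python sketch of one common tf.idf variant (length-normalized term frequency times log(N/nk)); the exact formulas and heuristic modifications on the original slide may differ, and the counts below are hypothetical:

```python
import math

def tf_idf(f_ik, doc_len, N, n_k):
    """One common tf.idf variant.

    f_ik    - occurrences of term k in document i
    doc_len - total term occurrences in document i
    N       - number of documents in the collection
    n_k     - number of documents containing term k
    """
    tf = f_ik / doc_len          # term frequency weight (importance in the document)
    idf = math.log(N / n_k)      # inverse document frequency (importance in the collection)
    return tf * idf

# Hypothetical term: occurs 5 times in a 100-word document and
# appears in 1,000 of 500,000 documents.
print(round(tf_idf(5, 100, 500_000, 1_000), 3))
```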

  19. Language Model • Unigram language model • probability distribution over the words in a language • generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them • N-gram language model • some applications use bigram and trigram language models where probabilities depend on previous words
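A minimal Python sketch of the "bucket" idea for a unigram model: estimate word probabilities from a toy text, then sample words with replacement according to that distribution. The toy text is made up for illustration.

```python
import random
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(text)
total = sum(counts.values())

# Unigram language model: P(w) = count(w) / total tokens
words = list(counts)
probs = [counts[w] / total for w in words]

# Pull words out of the "bucket" and replace them (sampling with replacement).
random.seed(0)
print(random.choices(words, weights=probs, k=5))
```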

  20. LMs for Retrieval • 3 possibilities: • probability of generating the query text from a document language model • probability of generating the document text from a query language model • comparing the language models representing the query and document topics • Models of topical relevance

  21. Query-Likelihood Model • Rank documents by the probability that the query could be generated by the document model (i.e. same topic) • Given a query, start with P(D|Q) • Using Bayes’ Rule: P(D|Q) ∝ P(Q|D)P(D) • Assuming the prior P(D) is uniform and a unigram model: P(Q|D) = Πi=1..n P(qi|D)

  22. Estimating Probabilities • Obvious estimate for unigram probabilities is P(qi|D) = fqi,D / |D|, where fqi,D is the number of times word qi occurs in document D and |D| is the number of words in D • Maximum likelihood estimate • makes the observed value of fqi,D most likely • If query words are missing from the document, the score will be zero • Missing 1 out of 4 query words is scored the same as missing 3 out of 4

  23. Smoothing • Document texts are a sample from the language model • Missing words should not have zero probability of occurring • Smoothing is a technique for estimating probabilities for missing (or unseen) words • lower (or discount) the probability estimates for words that are seen in the document text • assign that “left-over” probability to the estimates for the words that are not seen in the text

  24. Estimating Probabilities • Estimate for unseen words is αD P(qi|C) • P(qi|C) is the probability for query word i in the collection language model for collection C (background probability) • αD is a parameter • Estimate for words that do occur is (1 − αD) P(qi|D) + αD P(qi|C) • Different forms of estimation come from different choices of αD

  25. Jelinek-Mercer Smoothing • αD is a constant, λ • Gives estimate of p(qi|D) = (1 − λ) fqi,D/|D| + λ cqi/|C|, where cqi is the number of times qi occurs in the collection and |C| is the total number of word occurrences in the collection • Ranking score: P(Q|D) = Πi [(1 − λ) fqi,D/|D| + λ cqi/|C|] • Use logs for convenience • avoids accuracy problems from multiplying many small numbers (see the sketch below)
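A minimal Python sketch of the Jelinek-Mercer smoothed score, summed in log space. The counts reuse the “president”/“lincoln” numbers from the example a few slides below, purely for illustration; λ = 0.1 is an arbitrary choice.

```python
import math

def jm_score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lam=0.1):
    """Sum of log[(1 - lambda) * f_{qi,D}/|D| + lambda * c_{qi}/|C|] over query terms."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts.get(t, 0) / doc_len    # document model estimate
        p_coll = coll_counts.get(t, 0) / coll_len # collection (background) estimate
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

doc_counts = {"president": 15, "lincoln": 25}
coll_counts = {"president": 160_000, "lincoln": 2_400}
print(round(jm_score(["president", "lincoln"], doc_counts, 1_800, coll_counts, 10**9), 2))
```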

  26. Where is the tf.idf Weight? • Rearranging the log of the ranking score exposes a component that is proportional to the term frequency and inversely proportional to the collection frequency

  27. Dirichlet Smoothing • αD depends on document length: αD = μ / (|D| + μ) • Gives probability estimation of p(qi|D) = (fqi,D + μ cqi/|C|) / (|D| + μ) • and document score log P(Q|D) = Σi log[(fqi,D + μ cqi/|C|) / (|D| + μ)]

  28. Query Likelihood Example • For the term “president” • fqi,D = 15, cqi = 160,000 • For the term “lincoln” • fqi,D = 25, cqi = 2,400 • number of word occurrences in the document |D| is assumed to be 1,800 • number of word occurrences in the collection is 10^9 • 500,000 documents times an average of 2,000 words • μ = 2,000 (see the sketch below)
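A minimal Python sketch that plugs these numbers into the Dirichlet-smoothed score from the previous slide; with natural logs the result is roughly -10.5:

```python
import math

def dirichlet_term(f_qi_D, c_qi, doc_len, coll_len, mu=2000):
    """log of (f_{qi,D} + mu * c_{qi}/|C|) / (|D| + mu)."""
    return math.log((f_qi_D + mu * c_qi / coll_len) / (doc_len + mu))

coll_len = 10**9   # 500,000 documents * ~2,000 words
doc_len = 1_800

score = (dirichlet_term(15, 160_000, doc_len, coll_len)    # "president"
         + dirichlet_term(25, 2_400, doc_len, coll_len))    # "lincoln"
print(round(score, 2))
```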

  29. Query Likelihood Example • Negative number because summing logs of small numbers

  30. Query Likelihood Example

  31. General Questions Question 3 : PageRank

  32. PageRank • Billions of web pages, some more informative than others • Links can be viewed as information about the popularity (authority?) of a web page • can be used by ranking algorithm • Inlink count could be used as simple measure • Link analysis algorithms like PageRank provide more reliable ratings • less susceptible to link spam

  33. Random Surfer Model • Browse the Web using the following algorithm: • Choose a random number r between 0 and 1 • If r < λ: • Go to a random page • If r ≥ λ: • Click a link at random on the current page • Start again • PageRank of a page is the probability that the “random surfer” will be looking at that page • links from popular pages will increase PageRank of pages they point to
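A minimal Python sketch simulating the random surfer on a small graph. The three-page graph (A links to B and C, B links to C, C links to A) is an assumption inferred from the PR(C) formula and iteration values on the next slides; the visit frequencies approximate PageRank.

```python
import random
from collections import Counter

# Assumed three-page graph consistent with the later slides.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

lam = 0.15
random.seed(0)
visits = Counter()
page = random.choice(pages)

for _ in range(100_000):
    visits[page] += 1
    if random.random() < lam or not links[page]:
        page = random.choice(pages)           # jump to a random page
    else:
        page = random.choice(links[page])     # click a link at random

total = sum(visits.values())
print({p: round(visits[p] / total, 2) for p in pages})
```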

  34. Dangling Links • Random jump prevents getting stuck on pages that • do not have links • contain only links that no longer point to other pages • have links forming a loop • Links that point to the first two types of pages are called dangling links • may also be links to pages that have not yet been crawled

  35. PageRank • PageRank (PR) of page C = PR(A)/2 + PR(B)/1 • More generally, PR(u) = Σv∈Bu PR(v)/Lv • where Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v (not counting duplicate links)

  36. PageRank • Don’t know PageRank values at start • Assume equal values (1/3 in this case), then iterate: • first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17 • second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17 • third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25 • Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2
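A minimal Python sketch of this iteration, using the same assumed three-page graph as above (A links to B and C, B links to C, C links to A), so that Bu and Lv match PR(C) = PR(A)/2 + PR(B)/1:

```python
# B_u: pages pointing to u; L_v: number of outgoing links from v.
in_links = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_count = {"A": 2, "B": 1, "C": 1}

pr = {p: 1 / 3 for p in in_links}             # start with equal values
for _ in range(50):
    pr = {u: sum(pr[v] / out_count[v] for v in in_links[u]) for u in in_links}

print({p: round(pr[p], 2) for p in pr})       # -> A = 0.4, B = 0.2, C = 0.4
```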

  37. PageRank • Taking the random page jump into account, there is a 1/3 chance of going to any particular page when r < λ • PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1) • More generally, PR(u) = λ/N + (1 − λ) Σv∈Bu PR(v)/Lv • where N is the number of pages, λ typically 0.15 (see the sketch below)
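Extending the sketch above with the random jump term, again on the assumed three-page graph, with λ = 0.15 and N = 3:

```python
lam, N = 0.15, 3
in_links = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_count = {"A": 2, "B": 1, "C": 1}

pr = {p: 1 / N for p in in_links}
for _ in range(50):
    pr = {u: lam / N + (1 - lam) * sum(pr[v] / out_count[v] for v in in_links[u])
          for u in in_links}

print({p: round(pr[p], 3) for p in pr})
```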

  38. Link Quality • Link quality is affected by spam and other factors • e.g., link farms to increase PageRank • trackback links in blogs can create loops • links from comments section of popular blogs • Blog services modify comment links to contain rel=nofollow attribute • e.g., “Come visit my <a rel=nofollow href="http://www.page.com">web page</a>.”

  39. Trackback Links

  40. Wikipedia: Most vandalized pages • This page lists some articles that have undergone repeated vandalism • Manual vandalism detection: “please watch the following articles closely and revert any vandalism as appropriate.” • Changes related to "Wikipedia:Most vandalized pages": • Language model distance: http://en.wikipedia.org/w/index.php?title=President_pro_tempore_of_the_Oklahoma_Senate&diff=prev&oldid=346008144 • Repeating characters: http://en.wikipedia.org/w/index.php?title=Flag_of_Berlin&diff=prev&oldid=346006207 and http://en.wikipedia.org/w/index.php?title=Flag_of_Berlin&diff=prev&oldid=346005984 • References removed

  41. Cosine Similarity http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
