

  1. CSM06 Information Retrieval
  Lecture 4: Web IR part 1
  Dr Andrew Salway, a.salway@surrey.ac.uk

  2. Lecture 4: OVERVIEW
  • Previously we looked at IR techniques that index a document based on the words that occur in the document
  • Some of these techniques are applied in web search engines (though the vector space model may not be appropriate). However, web IR can also exploit a distinctive feature of information on the web – its hypertext link structure
  • Use of anchor text for indexing web pages
  • The PageRank algorithm, based on link structure analysis
  • Other techniques for ranking web pages

  3. Challenges for IR on the Web
  • High volume of information
  • Heterogeneous information (multimedia and multilingual)
  • Diverse users – hence diverse information needs, and many inexperienced users
  • Average query length of only 2–5 words
  • Poorly structured and low-quality information

  4. Scale
  • Projected worldwide Internet population in 2005: 1.07 billion users, www.clickz.com/stats/web_worldwide/
  • Early in 2005 Google claimed to index over 8 billion web pages; Yahoo recently claimed 19 billion; Google now claims to index three times more pages than its nearest competitor, http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
  • Given the low overlap between different search engines' results for the same query, the total number of web pages is likely much greater than the number indexed by any single search engine

  5. Requirements of Web Search Engine Users?
  • Fast response time
  • Some relevant results on the first page; perhaps less concern with retrieving all relevant results
  • Good coverage of the web, at least of ‘important’ sites
  • Up-to-date links
  • Simple and intuitive to use – both for making queries and for understanding results
  NB. Some of these requirements contrast with those of expert researchers using specialist information retrieval systems

  6. User Goals (Information Needs)
  • Queries are used to express a user’s goal (or information need), but note that the same query might be used for quite different goals (Rose and Levinson 2004)

  7. User Goals: Rose and Levinson’s classification (2004)
  • Navigational – wanting to reach a specific, known website
  • Informational – “my goal is to learn something by reading or viewing web pages”, e.g. closed and open-ended questions, advice
  • Resource – “my goal is to obtain a resource (not information) available on web pages”, e.g. downloading music, interacting with an online shopping service
  NOTE: prior to the web, most IR was concerned only with informational queries

  8. User Goals: Rose and Levinson’s classification (2004)
  • The more a search engine understands about a user’s goal, the better the results it can provide
  • User goals may be deduced not only from the query, but also from:
    • The results returned by the search engine
    • Results clicked on by the user
    • Further searches / actions by the user

  9. Opportunity… • Web search engines can exploit the fact that information on the web is in the form of hypertext…

  10. Hypertext
  • The web is, in some senses at least, hypertextual, i.e. it can be viewed as a network of nodes (e.g. pages) and links (between pages)

  11. Hypertext
  • Links suggest relatedness of topic, and perhaps also act as a recommendation
  • Topological information about the hypertext graph, gained by link structure analysis, can be exploited for ranking
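
  A minimal illustration of this graph view (the page names and links are invented for the example): representing the link graph as adjacency lists makes simple topological measures such as in-degree and out-degree easy to compute.

    # Toy link graph: each page maps to the pages it links to.
    # Page names are made up purely for illustration.
    from collections import defaultdict

    links = {
        "home":  ["about", "news"],
        "about": ["home"],
        "news":  ["home", "about"],
    }

    # Out-degree: how many pages each page links to
    out_degree = {page: len(targets) for page, targets in links.items()}

    # In-degree: how many pages link to each page
    in_degree = defaultdict(int)
    for source, targets in links.items():
        for target in targets:
            in_degree[target] += 1

    print(out_degree)        # {'home': 2, 'about': 1, 'news': 2}
    print(dict(in_degree))   # {'about': 2, 'news': 1, 'home': 2}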

  12. Use of Anchor Text (Brin and Page 1998)
  • Words in the anchor text can be used to index the webpage being linked to – the text in an anchor may give a good description of the page it points to, e.g.
  <a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a>
  • The words in the anchor text might be a better indicator of what the webpage is about than the words in the webpage itself
  • Anchor text is also useful for resources, such as images, that cannot themselves be analysed for keywords
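
  A rough sketch of the idea (not Brin and Page's actual implementation; the class, function and index names are my own): anchor text found in one page is credited to the page it links to, so a page can be retrieved by terms that only appear in other pages' descriptions of it.

    # Credit anchor text to the *target* URL, building a small inverted index.
    from html.parser import HTMLParser
    from collections import defaultdict

    class AnchorExtractor(HTMLParser):
        """Collects (target_url, anchor_text) pairs from a page's HTML."""
        def __init__(self):
            super().__init__()
            self.anchors = []
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href:
                self.anchors.append((self._href, " ".join(self._text).strip()))
                self._href = None

    # Inverted index: term -> set of URLs whose *incoming* anchor text contains it
    anchor_index = defaultdict(set)

    def index_page(html):
        parser = AnchorExtractor()
        parser.feed(html)
        for url, text in parser.anchors:
            for term in text.lower().split():
                anchor_index[term].add(url)

    index_page('<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>')
    print(anchor_index["beckham"])   # {'www.bio.com/beckhambio.html'}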

  13. PageRank (Brin and Page 1998)
  • “Google makes use of both link structure and anchor text”
  • “The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines”
  • PageRank is “an objective measure of [a web page’s] citation importance that corresponds well with people’s subjective idea of importance”

  14. Calculating PageRank
  PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
  PR(A) = the PageRank of webpage A
  C(A) = the number of links out of webpage A
  T1…Tn = the webpages that point to webpage A
  d = a damping factor set between 0 and 1 (Brin and Page suggest 0.85)
  In reality, the calculation of PageRank is iterative
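
  A sketch of the iterative calculation in Python, following the formula above (the toy graph, the fixed iteration count and d = 0.85 are illustrative choices; real implementations iterate until the values converge and must also handle pages with no out-links).

    # Iterative PageRank over a small, made-up link graph.
    def pagerank(links, d=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        pr = {p: 1.0 for p in pages}               # initial guess
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                # sum of PR(Ti)/C(Ti) over all pages Ti that link to `page`
                incoming = sum(pr[src] / len(targets)
                               for src, targets in links.items() if page in targets)
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))   # C accumulates the highest rank in this toy graph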

  15. Web-adjacency Analysis (a similar idea to PageRank)
  • Kleinberg and colleagues proposed a method for identifying authoritative web pages:
    • Identify a set of relevant pages (as normal)
    • Identify those with a large in-degree, i.e. lots of pages point to them (cf. ‘impact’)
    • Ensure that the authorities selected are referred to by a number of the same hubs, i.e. pages with a large out-degree

  16. Web-adjacency Analysis
  • “Hubs and authorities exhibit what could be called a mutually reinforcing relationship” (Kleinberg 1998)
  • Computing authority and hub values for web pages is an iterative process over a graph in which each node is a web page
  • Each node is given two weights, relating to in-degree and out-degree; the total in-degree weights and total out-degree weights are kept constant
  • The weights are modified at each iteration depending on the weights of connected nodes
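
  A simplified sketch of the iteration (this follows the spirit of Kleinberg's hubs-and-authorities idea, but the function name, the toy graph and the normalisation details are my own simplifications): authority weights are updated from the hub weights of the pages that point in, hub weights from the authority weights of the pages pointed to, and both weight vectors are rescaled each round so the values stay bounded.

    import math

    def hits(links, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = set(links) | {t for targets in links.values() for t in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # authority weight: sum of hub weights of the pages that point to p
            auth = {p: sum(hub[src] for src, targets in links.items() if p in targets)
                    for p in pages}
            # hub weight: sum of authority weights of the pages p points to
            hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
            # normalise each weight vector (here to unit length) so values stay bounded
            for weights in (auth, hub):
                norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
                for p in weights:
                    weights[p] /= norm
        return hub, auth

    # Two 'hub' pages both point at the same two pages: those two pages
    # emerge as the authorities, and the pointing pages as the hubs.
    links = {"hub1": ["a1", "a2"], "hub2": ["a1", "a2"]}
    hub, auth = hits(links)
    print(auth)   # a1 and a2 receive the highest authority weights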

  17. Some other Factors used to rank Web Pages (Hock 2001)
  • Popularity of the page: measured either by how many other web pages link to it, or by how many people have clicked on it when they submitted the same query
  • Frequency of search terms: needs to take account of document length, and of web page authors’ attempts to affect ranking by deliberate repetition
  • Number of query terms matched: but remember that many queries are only one or two words

  18. Other Factors (continued…)
  • Rarity of terms: rank pages containing rare search terms more highly (cf. TF-IDF)
  • Weighting by field: give a higher ranking to pages that include search terms in important fields, e.g. the title
  • Proximity of terms: rank pages more highly if the search terms occur near one another
  • Order of query terms: give priority to pages containing the search term that was entered first
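
  As an illustration of how several of these factors might be combined, here is a hypothetical scoring function (the field names, the weights and the combination rule are invented for the example, not taken from any particular search engine): it mixes length-normalised term frequency, term rarity and a boost for matches in the title field.

    import math

    def score(query_terms, page, num_docs, doc_freq):
        """page: dict with 'title' and 'body' as lists of lowercased words;
        doc_freq: term -> number of documents in the collection containing it."""
        total = 0.0
        body_length = len(page["body"]) or 1
        for term in query_terms:
            tf = page["body"].count(term) / body_length             # frequency, normalised by length
            idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))  # rarity: rare terms count more
            field_boost = 2.0 if term in page["title"] else 1.0     # weighting by field (title)
            total += tf * idf * field_boost
        return total

    page = {"title": ["david", "beckham", "biography"],
            "body": ["a", "biography", "of", "david", "beckham", "born", "in", "london"]}
    print(score(["beckham", "biography"], page, num_docs=1000,
                doc_freq={"beckham": 20, "biography": 150}))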

  19. Set Reading for Lecture 4
  • Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, SECTIONS 1 and 2. Explains Google’s use of anchor text and PageRank. www-db.stanford.edu/~backrub/google.html
  • Hock (2001), The Extreme Searcher’s Guide to Web Search Engines, pages 25-31. Gives an overview of some factors used by web search engines to rank web pages. AVAILABLE in Main Library collection and in Library Article Collection.

  20. Exercise
  • Explore the idea of PageRank using an online PageRank calculator, e.g. www.markhorrell.com/seo/pagerank.shtml OR www.webworkshop.net/pagerank_calculator.php3

  21. Further Reading
  • Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th International WWW Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
  • Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking: Bringing Order to the Web”. http://dbpubs.stanford.edu:8090/pub/1999-66
  • Belew (2000), Finding Out About, pages 195-199, for an overview of Kleinberg’s work on web-adjacency analysis and authorities and hubs.
  • Kleinberg (1998), “Authoritative Sources in a Hyperlinked Environment”, Journal of the ACM. http://citeseer.nj.nec.com/87928.html
  • Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY / ARTICLE COLLECTION. **This comprehensive article reviews a lot of the ideas covered so far in this module and discusses them in the context of Web IR. NOTE: it is already a little out of date in places because of the rapid change of the Web.

  22. Lecture 4: LEARNING OUTCOMES
  After this lecture you should be able to:
  • Explain how the challenges of web IR differ from those facing the developers of traditional IR systems
  • Explain how web search engines can exploit the hypertext structure of the web to index and rank web pages, e.g. using anchor text and PageRank
  • Explain how PageRank is calculated
  • Discuss and critique a range of factors used by web search engines to rank web pages

  23. Reading ahead for LECTURE 5
  If you want to read about next week’s lecture topics, see:
  • Dean and Henzinger (1999), “Finding Related Pages in the World Wide Web”, pages 1-10. http://citeseer.ist.psu.edu/dean99finding.html
  • Agichtein, Lawrence and Gravano (2001), “Learning Search Engine Specific Query Transformations for Question Answering”, Procs. 10th International WWW Conference. **Section 1 and Section 3** www.cs.columbia.edu/~eugene/papers/www10.pdf
  • Oppenheim, Morris and McKnight (2000), “The Evaluation of WWW Search Engines”, Journal of Documentation, 56(2), pages 194-205. In Library Article Collection.
