CSM06 Information Retrieval

CSM06 Information Retrieval Lecture 4: Web IR part 1 Dr Andrew Salway a.salway@surrey.ac.uk

Lecture 4: OVERVIEW • Previously we looked at IR techniques that indexed a document based on the words that occur in the document • Some of these techniques are applied in web search engines (but VSM may not be appropriate). However, web IR can also exploit a distinctive feature of information on the web – hypertext link structure • Use of anchor text for indexing web pages • The PageRank algorithm based on link structure analysis • Other techniques for ranking web pages

Challenges for IR on the Web • High volume of information • Heterogeneous information (multimedia and multilingual) • Diverse users - hence diverse information needs, and many inexperienced users • Average query length 2-5 words • Poorly structured and low quality information

Scale • Projection of worldwide Internet population in 2005 = 1.07 billion users, www.clickz.com/stats/web_worldwide/ • Early in 2005 Google claimed to index over 8 billion web pages; Yahoo recently claimed 19 billion; Google now claims to index three times more pages than its nearest competitor http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482 • Given the low overlap in search engine results for a given query, it is likely that the total number of webpages is much greater than the number indexed by any single web search engine

Requirements of Web Search Engine Users? • Fast response time • Some relevant results in first page; maybe less concern with getting all relevant results • Good coverage of web, at least of ‘important sites’ • Up-to-date links • Simple and intuitive to use – making queries and understanding results NB. Some of these requirements contrast with those of expert researchers using specialist information retrieval systems

User Goals (Information Needs) • Queries are used to express a user’s goal (or information need), but note that the same query might be used for quite different goals (Rose and Levinson 2004)

User Goals: Rose and Levinson’s classification (2004) • Navigational – wanting a specific known website • Informational – “my goal is to learn something by reading or viewing web pages” – e.g. closed and open-ended questions, advice • Resource – “my goal is to obtain a resource (not information) available on web pages” – e.g. download music, interact with online shopping service NOTE: prior to web most IR was concerned only with Informational queries

User Goals: Rose and Levinson’s classification (2004) • The more a search engine understands about a user’s goal, the better the results it can provide • User goals may be deduced not only from the query, but also from: – the results returned by the search engine – the results clicked on by the user – further searches / actions by the user

Opportunity… • Web search engines can exploit the fact that information on the web is in the form of hypertext…

Hypertext • The web is, in some senses at least, hypertextual, i.e. it can be viewed as networks of nodes (e.g. pages) and links (between pages)

Hypertext • Links suggest – relatedness of topic / perhaps also a recommendation • Topological information about the hypertext graph gained by link structure analysis can be exploited for ranking

Use of Anchor Text (Brin and Page 1998) • Words in the anchor text can be used to index the webpage being linked to – the text in an anchor may give a good description of the page it points to, e.g. <a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a> • The words in the anchor text might be a better indicator of what the webpage is about than the words in the webpage itself • Anchor text is also useful for indexing resources such as images, which cannot be analysed as keywords
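The idea of indexing a target page by the anchor text of links that point to it can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenisation, no stop-word removal); the names AnchorCollector and index_by_anchor_text are hypothetical and are not taken from Brin and Page.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (target URL, anchor text)
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def index_by_anchor_text(pages):
    """Map each anchor-text word to the set of URLs that the anchors point to."""
    index = defaultdict(set)
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        for href, text in parser.links:
            for word in text.lower().split():
                index[word].add(href)  # the TARGET page is indexed, not the source
    return index


pages = ['<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>']
index = index_by_anchor_text(pages)
print(sorted(index["beckham"]))
```

Note that the target page is never fetched here: it becomes findable under "beckham" purely because another page describes it that way.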

PageRank (Brin and Page 1998) • “Google makes use of both link structure and anchor text” • “The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines” • PageRank is “an objective measure of [a web page’s] citation importance that corresponds well with people’s subjective idea of importance”

Calculating PageRank • PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) • PR(A) = the PageRank of webpage A • C(A) = the number of links out of webpage A • T1…Tn = the webpages that point to webpage A • d = a damping factor set between 0 and 1 • In practice, the calculation of PageRank is iterative, because the PageRank of each page depends on the PageRanks of the pages that link to it
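A minimal sketch of the iterative calculation, applying the formula above repeatedly until the values settle. The three-page graph, the function name pagerank, the starting value of 1.0 and the fixed number of iterations are all assumptions made for illustration; d = 0.85 is the value Brin and Page say is usually used. Pages with no outgoing links simply pass on no PageRank in this sketch.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of (distinct) pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical web of three pages: A links to B and C, B links to C, C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C ends up with the highest PageRank because it receives all of B's PageRank and half of A's, whereas B receives only half of A's.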

Web-adjacency Analysis (a similar idea to PageRank) • Kleinberg and colleagues proposed a method for identifying authoritative web-pages • Identify set of relevant pages (as normal) • Identify those with a large in-degree, i.e. lots of pages point to them (cf. ‘impact’) • Ensure that the authorities selected are referred to by a number of the same hubs, i.e. those with a large out-degree

Web-adjacency Analysis • “Hubs and authorities exhibit what could be called a mutually reinforcing relationship” (Kleinberg 1998) • Computing authority and hub values for web-pages is an iterative process over a graph, where each node is a web-page • Two weights are given to each node relating to in-degree and out-degree: total in-degree weights and total out-degree weights are kept constant • Weights are modified each iteration depending on weights of connected nodes
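The mutually reinforcing update can be sketched as follows. This is an illustrative sketch, not Kleinberg's own code: the function name hits and the tiny example graph are assumptions, and "kept constant" is implemented here by rescaling each set of weights so that its squares sum to 1 after every iteration.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority weight is the sum of the hub weights of pages pointing to it
        auth = {
            p: sum(hub[src] for src, targets in links.items() if p in targets)
            for p in pages
        }
        # A page's hub weight is the sum of the authority weights of pages it points to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Rescale so the total (squared) weight of each kind stays constant
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical graph: three hub-like pages pointing at two candidate authorities
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a1 gets the largest authority weight because all three hubs point to it, and h1 and h2 get larger hub weights than h3 because they point to both authorities.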

Some other Factors used to rank Web Pages (Hock 2001) • Popularity of the Page: measured either by how many other web-pages link to it, or by how many people have clicked on it when they had the same query • Frequency of search terms: need to consider the length of the document, and web-page authors’ attempts to affect ranking by deliberate repetition • Number of query terms matched: but remember many queries are only one or two words

Other Factors (continued…) • Rarity of terms: rank pages containing rare search terms more highly (cf. TFIDF) • Weighting by Field: give high ranking to pages including search terms in important fields, e.g. Title • Proximity of Terms: rank pages more highly if search terms occur near one another • Order of Query Terms: give priority to pages containing the search term entered first
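A toy scoring function shows how a few of these factors might be combined: term frequency normalised by document length, rarity of terms (an IDF-style weight) and weighting by field (a boost when the term appears in the title). It is a sketch only; the formula, the title_boost value, the smoothing in idf and the three example documents are all assumptions, and real search engines do not publish their exact weightings.

```python
import math


def idf(term, docs):
    """Rarer terms get a higher weight (smoothed so the weight is never zero)."""
    n_containing = sum(1 for d in docs if term in d["body"].lower().split())
    return math.log((1 + len(docs)) / (1 + n_containing)) + 1


def score(query, doc, docs, title_boost=2.0):
    body = doc["body"].lower().split()
    title = doc["title"].lower().split()
    total = 0.0
    for term in query.lower().split():
        # Frequency of the term, divided by length so long pages are not favoured
        tf = body.count(term) / len(body) if body else 0.0
        weight = tf * idf(term, docs)
        if term in title:          # weighting by field
            weight *= title_boost
        total += weight
    return total


docs = [
    {"title": "David Beckham biography", "body": "beckham played football for england"},
    {"title": "Football news", "body": "football news and football scores for england"},
    {"title": "England travel", "body": "travel guide for england"},
]
ranking = sorted(docs, key=lambda d: score("beckham football", d, docs), reverse=True)
print([d["title"] for d in ranking])
```

The first document ranks highest even though the second mentions "football" more often, because "beckham" is the rarer term and also appears in the first document's title.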

Set Reading for Lecture 4 • Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. SECTIONS 1 and 2. Explains Google’s use of anchor text and PageRank. www-db.stanford.edu/~backrub/google.html • Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. Gives an overview of some factors used by web search engines to rank webpages. AVAILABLE in Main Library collection and in Library Article Collection.

Exercise • Explore the idea of PageRank using an online PageRank calculator, e.g. www.markhorrell.com/seo/pagerank.shtml OR www.webworkshop.net/pagerank_calculator.php3

Further Reading • Rose and Levinson (2004), “Understanding User Goals in Web Search”, 13th International WWW Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf • Page, Brin, Motwani and Winograd (1999), “The PageRank Citation Ranking: Bringing Order to the Web”. http://dbpubs.stanford.edu:8090/pub/1999-66 • Belew (2000), Finding Out About, pages 195-199, for an overview of Kleinberg’s work on web-adjacency analysis and authorities and hubs. • Kleinberg (1998), “Authoritative Sources in a Hyperlinked Environment”, Journal of the ACM. http://citeseer.nj.nec.com/87928.html • Kobayashi and Takeda (2000), “Information Retrieval on the Web”, ACM Computing Surveys 32(2), pp. 144-173. AVAILABLE IN LIBRARY / ARTICLE COLLECTION. This comprehensive article reviews a lot of the ideas covered so far in this module and discusses them in the context of Web IR. NOTE: it is already a little out of date in places because of the rapid changes of the Web.

Lecture 4: LEARNING OUTCOMES After this lecture you should be able to: • Explain how the challenges of web IR are different from those facing the developers of traditional IR systems • Explain how web search engines can exploit the hypertext structure of the web to index and rank web pages, e.g. using Anchor Text and PageRank • Explain how PageRank is calculated • Discuss and critique a range of factors used by web search engines to rank web pages

Reading ahead for LECTURE 5 If you want to read about next week’s lecture topics, see: • Dean and Henzinger (1999), “Finding Related Pages in the World Wide Web”. Pages 1-10. http://citeseer.ist.psu.edu/dean99finding.html • Agichtein, Lawrence and Gravano (2001), “Learning Search Engine Specific Query Transformations for Question Answering”, Procs. 10th International WWW Conference. SECTIONS 1 and 3. www.cs.columbia.edu/~eugene/papers/www10.pdf • Oppenheim, Morris and McKnight (2000), “The Evaluation of WWW Search Engines”, Journal of Documentation, 56(2). Pages 194-205. In Library Article Collection.
