The Anatomy of a Large-Scale Hypertextual Web Search Engine

A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith, November 2005

Agenda • Introduction • Overview of Google • PageRank • Motivation & Description • Example • Issues & Comparison • Further Work • Application • Conclusions

Introduction • About the paper • Brin & Page, 1998, Stanford University • Details a prototype search engine, Google • Covers both architecture and algorithms • Cited in web metrics with relation to significance • Also relevant to Web Graph Properties • PageRank • Covered in a separate paper from Brin & Page • Is the primary metric used in the paper

Overview : What is Google? • Web search engine • Tackles the scalability and manipulation issues faced by earlier search engines • Academic • Built on strong understanding of web metrics • Use of hyperlink structures • Transparent • Initially released into the public domain • Support for informatics research

Overview : Architecture • [Diagram of components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Barrels, Sorter, Lexicon, Doc Index, Checksums, PageRank, Searcher]

Overview : Google Architecture (Explanation for handout only.) • URL Server: Finds pages to surf. • Crawler: Downloads pages and places them in the repository. • Store Server: Document compression. • Repository: Cached copies of most web pages. • Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file. • URL Resolver: Converts relative URLs into absolute URLs and creates the Links file. • Links file: Ordered pairs of document IDs where a hyperlink exists between them. • Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon. • Lexicon: Dictionary of all possible search keywords. • Doc Index: Maps document identifier codes to URLs. • PageRank: An influential web metric used to sort Google’s matches. • Searcher: Performs searches!
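
The URL Resolver and Links file steps above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the `resolve_links` helper, the `doc_ids` lookup table and the example.org URLs are all assumptions made for the example, and link targets without a known document ID are simply skipped here.

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs, doc_ids):
    """Convert relative hrefs to absolute URLs and emit (from_id, to_id) pairs.

    doc_ids is a hypothetical URL -> document ID table (the inverse of the Doc Index).
    """
    from_id = doc_ids[base_url]
    pairs = []
    for href in hrefs:
        absolute = urljoin(base_url, href)   # relative URL -> absolute URL
        to_id = doc_ids.get(absolute)
        if to_id is not None:                # assumption: unknown targets are dropped
            pairs.append((from_id, to_id))
    return pairs

# Hypothetical crawl data.
doc_ids = {
    "http://example.org/a/index.html": 1,
    "http://example.org/a/about.html": 2,
    "http://example.org/contact.html": 3,
}
links = resolve_links(
    "http://example.org/a/index.html",
    ["about.html", "../contact.html", "missing.html"],
    doc_ids,
)
print(links)
```

The resulting ordered pairs of document IDs are the kind of record the Links file holds, and they are the input PageRank needs.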

Overview : Forward Index • Indexer identifies keyword ‘hits’ in a document • Maps document (page) IDs to word IDs in the Lexicon • Word IDs are partially sorted into 64 barrels • Word IDs within a barrel are unsorted. • An individual document may be spread over several barrels. • However, not useful for search!

Overview : Inverted Index • Want to know in what documents a key word occurs • Need the ‘Inverted Index’ • Sorts the forward index into its inverted form • Function performed by the ‘Sorter’
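
The relationship between the two indexes can be shown with a small sketch. The data and the `invert` function are illustrative assumptions; the real Sorter works barrel by barrel on disk rather than on an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical forward index: document ID -> word IDs that occur in it.
forward_index = {
    1: [10, 42, 7],
    2: [42, 99],
    3: [7, 42],
}

def invert(forward):
    """Re-sort (document -> words) into (word -> sorted list of documents)."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    return {word_id: sorted(docs) for word_id, docs in sorted(inverted.items())}

inverted_index = invert(forward_index)
print(inverted_index[42])  # documents containing word 42
```

A keyword lookup is now a single dictionary access, which is why the forward index alone is not useful for search.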

Overview : Ranking System • Proximity of keyword ‘hits’ • This is the sum of the distances between them • Hits have ‘types’ • Types: body text, heading text, anchor text, URL, … • Relative font size factor used • Count how many hits occur of each type and range of proximity values • Apply a function to each type-proximity count • These form a type-proximity vector, C

Overview : Ranking System (2) • [Plot: f(x) against hit count, x] • V = C·W (dot product) is computed. • W is the importance associated with each type-proximity class. • Combine V with the PageRank score • Effect of increasing hits declines • Prevents large scale manipulation
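
The scoring steps on these two slides might be sketched as below. Everything numeric here is an assumption: the slides give neither the weights W, the exact shape of f(x), nor the way V is combined with PageRank, so the sketch uses made-up weights, a logarithm for the tapering function, and a simple linear blend.

```python
import math

# Hypothetical weights W, keyed by (hit type, proximity bin); bin 0 = hits adjacent.
TYPE_PROX_WEIGHTS = {
    ("title", 0): 5.0,
    ("anchor", 0): 4.0,
    ("body", 0): 2.0,
    ("body", 1): 1.0,
}

def count_weight(hit_count):
    """Assumed f(x): rises with the hit count, then flattens out."""
    return math.log1p(hit_count)

def ir_score(hit_counts):
    """V = C . W, where C[k] = f(hit count in type-proximity class k)."""
    return sum(
        count_weight(hit_counts.get(key, 0)) * weight
        for key, weight in TYPE_PROX_WEIGHTS.items()
    )

def combined_score(hit_counts, pagerank, mix=0.5):
    """Assumed linear blend of V and PageRank (the real combination is not given)."""
    return mix * ir_score(hit_counts) + (1 - mix) * pagerank

honest = {("title", 0): 1, ("body", 0): 3}
stuffed = {("body", 1): 200}   # keyword stuffing in low-value positions
print(ir_score(honest), ir_score(stuffed))
```

Under these assumed weights, a page with 200 low-value body hits still scores below a page with one title hit and three adjacent body hits, which is the declining effect of extra hits that the slide describes.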

PageRank : Motivation • Academic Citation Analysis* attempted, but… • Web has no formal quality control or peer review • Possible to inflate citation counts artificially • Web pages vary more than academic papers • Consider: • One link from the University’s main page, or one link from Yahoo’s main page… • Which citation should carry the higher weight ? *Also known as bibliometrics

PageRank : Description • Informal Definition: • “A page has a high rank if the sum of the ranks of its backlinks is high” • Handles ‘Yahoo’ case on previous slide • Intuitive Definition: • Corresponds to the Random Surfer Model • User keeps clicking on links ‘linearly’ then gets bored and restarts at a random location • Now for the maths…

PageRank : Description (2) • Formal Definition: $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + cE(u)$ • $c$ is a damping factor, set to 0.85 • $N_v$ is the number of out-links from page $v$ • $B_u$ is the set of pages that link to page $u$ (its backlinks) • $cE(u)$ corresponds to the surfer getting ‘bored’

PageRank : Example • [Diagram: example network with nodes A, B, C, D, E] • Considering an example network • Calculating A: • c = damping factor • N = out-degree • R = PageRank

PageRank : Example (2) • [Diagram: example network with nodes A, B, C, D, E] • Initially set all PageRank to 1 • First Iteration:

PageRank : Example (3) • Repeat process for B, C, D and E • Feed computed values into next iteration
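
The iteration described in the example slides can be sketched in a few lines. The edges of the five-node network below are an assumption, since the original diagram is not reproduced here, and the sketch assumes the ‘bored surfer’ term satisfies c·E(u) = 1 − c so that the average rank stays at 1. Each page's new rank is c times the sum of R(v)/N_v over its backlinks v, plus c·E(u).

```python
def pagerank(out_links, c=0.85, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u) + c * E(u)."""
    pages = list(out_links)
    e = {u: (1 - c) / c for u in pages}    # assumption: c * E(u) = 1 - c
    rank = {u: 1.0 for u in pages}         # initially set all PageRank to 1
    # B_u: the pages that link to u.
    back_links = {u: [v for v in pages if u in out_links[v]] for u in pages}
    for _ in range(iterations):
        # Feed the values from the previous iteration into the next one.
        rank = {
            u: c * sum(rank[v] / len(out_links[v]) for v in back_links[u]) + c * e[u]
            for u in pages
        }
    return rank

# Hypothetical link structure for nodes A to E (page -> pages it links to).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
    "E": ["A", "D"],
}
first = pagerank(graph, iterations=1)
final = pagerank(graph)
print(first["A"], final)
```

With these assumed edges, A is linked from C (one out-link), D and E (two out-links each), so the first iteration gives A a rank of 0.85 × (1 + 0.5 + 0.5) + 0.15 = 1.85.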

PageRank : Analysis • Number of iterations needed to converge grows roughly with log n • Constrained by the time to build a full-text index more than anything • Rank ‘Sinks’ • Caused by two pages that point to each other but not to any other pages: rank accumulates • Solved by random surfer model • Manipulation: ‘Google Bombing’ • French Military ‘Victories’ links to ‘Defeats’ • ‘Miserable Failure’ links to George Bush biography
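
The rank sink problem can be reproduced on a toy graph. In this sketch the three-page graph, the `iterate` helper and the choice c·E = 1 − c are all assumptions made for illustration: pages B and C link only to each other, and A links into the loop.

```python
def iterate(out_links, c, e, steps=20):
    """Repeatedly apply R(u) = c * sum(R(v) / N_v) + c * e with a uniform e."""
    pages = list(out_links)
    rank = {u: 1.0 for u in pages}
    for _ in range(steps):
        new = {u: c * e for u in pages}
        for v, targets in out_links.items():
            for u in targets:
                new[u] += c * rank[v] / len(targets)
        rank = new
    return rank

# B and C point only at each other; A points into the loop.
graph = {"A": ["B"], "B": ["C"], "C": ["B"]}

no_escape = iterate(graph, c=1.0, e=0.0)            # pure link-following
with_escape = iterate(graph, c=0.85, e=0.15 / 0.85)  # surfer sometimes gets bored
print(no_escape)
print(with_escape)
```

Without the escape term, all of the rank ends up inside the B/C loop and A is left with none. With the random restart, A keeps a small rank and the loop no longer absorbs everything.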

PageRank : Comparison • Web Graph Properties • Uses graph of the entire web: depends on full crawl • More sophisticated than simply summing in/out-degrees • Web Page Significance • Uses Boolean Spread Activation – match all words • Enhanced citation analysis – building on work of Kleinberg, Egghe & Rousseau • Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities

PageRank : Further Work • Personalised PageRank, Haveliwala, 1999 • In-memory, block-oriented algorithm • PageRank can be computed in an hour on a 450 MHz Pentium III using less than 100 MB of main memory • Compute PageRank on the client-side • Use local information: bookmarks, searches, history • Provide the link structure of the web on a DVD • 11/11/05, “Personalized Search” released

PageRank : Further Work (2) • Topic Sensitive PageRank, Haveliwala, 2002 • Improve Google by giving weight to the informational relationship between sites • A) Uniform Results • Similar to ‘current’ Google but with topics • B) Personalised to a particular user • Based on previous searches and users’ surfing habits
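
One way to picture personalisation is to make the ‘bored surfer’ restart non-uniform, so that the surfer jumps back to pages the user cares about instead of to a random page. The sketch below is only an illustration of that idea, not Haveliwala's algorithm: the graph, the `personalised_rank` helper and the rule for spreading the restart mass over the favourites are assumptions.

```python
def personalised_rank(out_links, favourites, c=0.85, steps=50):
    """PageRank-style iteration where restarts land only on favourite pages."""
    pages = list(out_links)
    # Assumption: the total restart mass, (1 - c) per page, is shared equally
    # among the favourites; restart[u] plays the role of c * E(u).
    jump = (1 - c) * len(pages) / len(favourites)
    restart = {u: (jump if u in favourites else 0.0) for u in pages}
    rank = {u: 1.0 for u in pages}
    for _ in range(steps):
        new = dict(restart)
        for v, targets in out_links.items():
            for u in targets:
                new[u] += c * rank[v] / len(targets)
        rank = new
    return rank

# Hypothetical link structure (page -> pages it links to).
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
    "E": ["A", "D"],
}
uniform = personalised_rank(graph, favourites=set(graph))  # every page is a restart target
biased = personalised_rank(graph, favourites={"E"})        # e.g. E is bookmarked
print(uniform["E"], biased["E"])
```

Bookmarking E raises the rank of E and of the pages it links to, while the total rank in the graph stays the same. A topic-sensitive variant would use one such restart set per topic rather than per user.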

Applications : Google • Google Inc. • Largest search engine • Technologies utilised by others (e.g. Yahoo!) • Biggest ever technology IPO, 2004 • Redefining search • Set a trend for other search providers • Raised importance of quality web search results • Combining information retrieval methods • Business model based on advertising • Potential area for conflict • Over 100 factors now influence results

Applications : PageRank • Back-link prediction • Desire for optimal web crawling strategy • Better indicator than citation counts! • Improving user navigation • ‘The PageRank Proxy’ • Providing PageRank information with links • Establishing trust • Wealth of authors on the web, who to trust? • Use PageRank to rate trust

Applications : The Future • Internal Development • Project no longer in academic realm • Lacks the transparency initially intended • Role of PageRank unclear • Likely focus on extensions and results tuning • External Development • APIs • Allowing innovative use of Google technologies • Open Source Code • Focused on developing infrastructure

Conclusions • Academic Background • Success from strong academic understanding • Raised profile of informatics and search • Good platform for future research • Success as a failure • Intention for transparency and use in academia • Commercial success has removed transparency • Potentially bad for further research in this area

Summary • We have seen: • The architecture used by Google • PageRank as a web metric • Strengths and potential manipulations • The commercial success of Google • Applications • Potential areas of future research

References • Work by Brin & Page (now at Google) • Brin, S., Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107-117. • Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project. • More papers at: http://www.google.com on many aspects of web metrics and search in general • PageRank • http://www.iprcom.com/papers/pagerank/ • Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu • http://en.wikipedia.org/wiki/Google_bomb

References (2) • Further Developments • Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’, Technical report, Stanford University, Stanford, CA. • Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. • Commercial Aspect • http://money.cnn.com/2004/04/29/technology/google/ • http://www.google.com/corporate/history.html • Web Metrics • Dhyani, D., Ng, W. K. and Bhowmick, S. (2002), ‘A survey of web metrics’, ACM Computing Surveys, 34(4):469-503.
