
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Dive into the inner workings of a large-scale search engine prototype presented by Sergey Brin and Lawrence Page. Explore the PageRank algorithm, data structures, crawling, indexing, searching, and more in this detailed overview.




Presentation Transcript


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical and Computer Engineering

  2. Overview • Work done at Stanford University • Presented as a prototype of a large-scale search engine • 26 million pages, 147 GB of fetched content • "Google" is a play on "googol" (10^100) • Issues • Scaling • Exploiting structure in hypertext • PageRank Algorithm • Architecture • Data Structures, Crawling, Indexing, Searching • Results

  3. PageRank Algorithm using the link graph • Anchor Text • Associate the anchor text of a link with the page it points to • Information Retrieval • TREC => well-controlled, homogeneous collections • Not equipped to handle hypertext documents • Vector Space Model not enough
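The link-graph ranking idea can be sketched as power iteration over a toy graph. This minimal sketch uses the normalized (probability-distribution) form of PageRank with the paper's damping factor d = 0.85; the graph and function names are illustrative.

```python
# Minimal PageRank sketch over a toy link graph (adjacency lists).
# Uses the normalized variant, so ranks sum to 1 when every page has outlinks.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page gets a base share (1 - d) / n ...
        new_rank = {p: (1.0 - d) / n for p in pages}
        # ... plus a damped share of the rank of each page linking to it
        for page, outlinks in links.items():
            if outlinks:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += d * share
        rank = new_rank
    return rank

# A links to B and C; B links to C; C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Here C ends up ranked highest: it receives half of A's rank plus all of B's, which is the intuition behind "a page is important if important pages link to it".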

  4. Architecture • URL Server • Distributed Crawlers • Storeserver • Repository • Indexer • Barrels • URL Resolver • Sorter • DumpLexicon • Searcher

  5. Data Structures • BigFiles • Repository • Document Index • Lexicon • Hit Lists • Forward Index • Inverted Index

  6. Repository • Full HTML of every webpage • Compressed using zlib • Prefixed by docID, length, URL • Files stored one after another
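A repository record along these lines can be sketched with zlib, as the slide notes. The exact header field widths (docID, compressed length, URL length) are assumptions; the paper does not fix them at this granularity.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte docID, 4-byte compressed length,
# 2-byte URL length, then the URL bytes, then zlib-compressed HTML.
# Records like this are appended one after another to form the repository.

HEADER = struct.Struct("<IIH")

def pack_record(doc_id, url, html):
    payload = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    return HEADER.pack(doc_id, len(payload), len(url_bytes)) + url_bytes + payload

def unpack_record(buf):
    doc_id, comp_len, url_len = HEADER.unpack_from(buf, 0)
    off = HEADER.size
    url = buf[off:off + url_len].decode("utf-8")
    html = zlib.decompress(buf[off + url_len:off + url_len + comp_len]).decode("utf-8")
    return doc_id, url, html

record = pack_record(7, "http://example.com/", "<html>hi</html>")
```

Storing the length in the header is what lets a reader skip from one record to the next without parsing the compressed payload.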

  7. Document Index • Fixed width ISAM index • Stores document status, pointer to repository, document checksum • If document has been crawled, ptr to variable length docinfo file stored • Otherwise ptr to URLlist stored
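The fixed width is what makes the ISAM index directly seekable by docID. A sketch under assumed field widths (the paper does not publish the exact layout): a status byte, an 8-byte pointer into the repository or URL list, and an 8-byte checksum.

```python
import io
import struct

# Illustrative fixed-width document-index record, keyed by docID.
# Fields: docID, status byte, pointer (repository/docinfo or URLlist),
# checksum. Widths are assumptions, not the paper's exact layout.
RECORD = struct.Struct("<IBQQ")

def write_entry(f, doc_id, status, pointer, checksum):
    f.seek(doc_id * RECORD.size)        # fixed width => direct seek by docID
    f.write(RECORD.pack(doc_id, status, pointer, checksum))

def read_entry(f, doc_id):
    f.seek(doc_id * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))

index_file = io.BytesIO()               # stands in for the on-disk index
write_entry(index_file, 3, status=1, pointer=1024, checksum=0xDEADBEEF)
entry = read_entry(index_file, 3)
```

Because every record has the same size, looking up a document is one seek and one read, with no search at all.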

  8. Hit Lists • Plain and Fancy hits • 2 bytes for each hit • Length of hit list stored before hit
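A plain hit as the paper describes it fits in the 2 bytes the slide mentions: one capitalization bit, 3 bits of font size (the value 7 is reserved to flag fancy hits), and 12 bits of word position. A small packing sketch:

```python
# Pack a "plain" hit into 16 bits:
#   bit 15      : capitalization flag
#   bits 12-14  : relative font size (0-6; 7 marks a fancy hit)
#   bits 0-11   : word position within the document (0-4095)

def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 7      # 7 is reserved for fancy hits
    assert 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 5, 100)
```

Hand-tuned bit packing like this is why hit lists dominate the index size budget yet stay compact.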

  9. Forward Index • Stored in 64 barrels. • If a document contains words in a barrel, then the docID is recorded into the barrel, with the list of wordID’s and hitlists. • Each wordID stored as a relative difference from the minimum wordID in a barrel. (24 bits for the wordID, 8 for hitlist length).
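The relative-wordID trick can be sketched as bit packing: because each barrel covers a contiguous wordID range, a 24-bit offset from the barrel's minimum wordID plus an 8-bit hit-list length fit in a single 32-bit word. Names here are illustrative.

```python
# Forward-barrel entry: 24-bit wordID delta within the barrel's range
# plus an 8-bit hit-list length, packed into one 32-bit integer.

def encode_forward_entry(word_id, barrel_min, hit_count):
    delta = word_id - barrel_min
    assert 0 <= delta < (1 << 24) and 0 <= hit_count < (1 << 8)
    return (delta << 8) | hit_count

def decode_forward_entry(entry, barrel_min):
    return barrel_min + (entry >> 8), entry & 0xFF

entry = encode_forward_entry(word_id=5_000_123, barrel_min=5_000_000, hit_count=42)
```

The saving is real but modest per entry; across billions of postings it adds up, which is the design pressure behind the 64-barrel split.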

  10. Inverted Index • Same barrels as forward index, but processed by the sorter. • For every wordID, doclist of docIDs generated, with corresponding hitlists. • Two sets of inverted barrels, one for hitlists with anchor or title text, another for all hitlists.
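What the sorter does can be shown in miniature: per-document forward postings become per-word doclists. The real barrels are sorted on disk; an in-memory dict stands in for that here, and the structure is illustrative.

```python
from collections import defaultdict

def invert(forward):
    """forward: dict docID -> {wordID: hitlist}.
    Returns dict wordID -> list of (docID, hitlist), docIDs ascending."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):          # ascending docIDs keep doclists sorted
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)

# Two documents: doc 1 contains word 10; doc 2 contains words 10 and 20.
index = invert({1: {10: [0, 4]}, 2: {10: [1], 20: [2]}})
```

Keeping doclists sorted by docID is what makes the merge-style query evaluation on the searching slide possible.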

  11. Indexing the Web • Parser – flex used to generate a lexical analyzer – “involved a fair amount of work” • Indexing Documents into barrels • Every word hashed into a wordID • Occurrences translated into hit lists and written into forward barrels • Lexicon needs to be shared • Extra words written into a log, processed by one final indexer
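The shared-lexicon scheme can be sketched as follows: words already in the base lexicon resolve to wordIDs immediately, while unseen words go to a log that one final indexing pass resolves. The data structures are illustrative, not the paper's.

```python
# Sketch of the shared-lexicon lookup used while writing forward barrels.
# Unknown words are deferred to a log rather than mutating the shared
# lexicon from many parallel indexers.

def lookup_word_ids(words, lexicon, extra_log):
    ids = []
    for w in words:
        if w in lexicon:
            ids.append(lexicon[w])
        else:
            extra_log.append(w)     # resolved later by the one final indexer
            ids.append(None)
    return ids

base_lexicon = {"web": 0, "search": 1}
log = []
ids = lookup_word_ids(["web", "hypertext", "search"], base_lexicon, log)
```

Deferring the rare words is what lets many indexers run in parallel against a fixed base lexicon without locking.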

  12. Searching • (1) Parse the query. • (2) Convert words into wordIDs. • (3) Seek to the start of the doclist in the short barrel for every word. • (4) Scan through the doclists until there is a document that matches all the search terms. • (5) Compute the rank of that document for the query. • (6) If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step (4). • (7) If we are not at the end of any doclist, go to step (4). • (8) Sort the documents that have matched by rank and return the top k.
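The scan in the middle of these steps amounts to a parallel walk over sorted doclists, emitting docIDs present in every list. A sketch of that core intersection (the short-barrel/full-barrel fallback is omitted for brevity):

```python
# Intersect sorted doclists, one per query word. Each cursor advances
# until every list is positioned on the same docID, which is a match.

def intersect_doclists(doclists):
    """doclists: list of ascending lists of docIDs, one per query word."""
    positions = [0] * len(doclists)
    matches = []
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        top = max(current)
        if all(c == top for c in current):
            matches.append(top)
            positions = [p + 1 for p in positions]
        else:
            # advance every list that is behind the largest docID seen
            positions = [p + (dl[p] < top) for p, dl in zip(positions, doclists)]
    return matches

matches = intersect_doclists([[1, 3, 5, 7], [3, 4, 5], [2, 3, 5, 9]])
```

Because the doclists are sorted, each cursor only ever moves forward, so the whole query costs one linear pass per term.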

  13. Ranking… • Count weight generated for each word in query • Dot product taken with type weight vector (for single word queries) or with type-prox weight vector (for multiple word queries) • Combined with PageRank to give final score.
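An illustrative version of this scoring: per-type hit counts are damped into count-weights, dotted with a type-weight vector, then combined with PageRank. The specific weights, the damping curve, and the combination function are all assumptions; the paper deliberately does not publish its parameters.

```python
# Hypothetical type weights: hits in titles and anchors count for more
# than plain-text hits. Values are illustrative, not Google's.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def ir_score(hit_counts_by_type, max_count=40):
    score = 0.0
    for hit_type, count in hit_counts_by_type.items():
        # counts taper off: beyond max_count, extra hits add nothing
        count_weight = min(count, max_count) / max_count
        score += count_weight * TYPE_WEIGHTS.get(hit_type, 1.0)
    return score

def final_score(hit_counts_by_type, pagerank, alpha=0.5):
    # assumed linear blend of IR score and PageRank
    return alpha * ir_score(hit_counts_by_type) + (1 - alpha) * pagerank

score = final_score({"title": 2, "plain": 30}, pagerank=0.4)
```

The capped count-weight captures the paper's point that rank should not grow without bound as a term repeats, which blunts simple keyword-stuffing.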

  14. Results • High quality pages • zlib – 3:1 ratio • 9 days to download 26 million pages • Indexer and crawler ran simultaneously • Future work: • Query caching, smart disk allocation, updates • User context, relevance feedback

  15. Footnote …foot in mouth!! • “we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.”
