
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Dive into the inner workings of a large-scale search engine prototype presented by Sergey Brin and Lawrence Page. Explore the PageRank algorithm, data structures, crawling, indexing, searching, and more in this detailed overview.




Presentation Transcript


  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical and Computer Engineering

  2. Overview • Work done at Stanford University • Presented as a prototype of a large-scale search engine • 26 million pages, 147 GB of fetched content • "Google" is a play on "googol" (10^100) • Issues • Scaling • Exploiting structure in hypertext • PageRank Algorithm • Architecture • Data Structures, Crawling, Indexing, Searching • Results

  3. PageRank Algorithm using the link graph • Anchor Text • Associate the anchor text of a link with the page it points to • Information Retrieval • TREC => well-controlled, homogeneous collections • Not equipped to handle hypertext documents • Vector Space Model not enough
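The link-graph ranking idea can be sketched as power iteration over a toy graph. This minimal sketch uses the normalized (probability-distribution) form of PageRank with the paper's damping factor d = 0.85; the graph and function names are illustrative.

```python
# Minimal PageRank sketch over a toy link graph (adjacency lists).
# Uses the normalized variant, so ranks sum to 1 when every page has outlinks.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page gets a base share (1 - d) / n ...
        new_rank = {p: (1.0 - d) / n for p in pages}
        # ... plus a damped share of the rank of each page linking to it
        for page, outlinks in links.items():
            if outlinks:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += d * share
        rank = new_rank
    return rank

# A links to B and C; B links to C; C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Here C ends up ranked highest: it receives half of A's rank plus all of B's, which is the intuition behind "a page is important if important pages link to it".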

  4. Architecture • URL Server • Distributed Crawlers • Storeserver • Repository • Indexer • Barrels • URL Resolver • Sorter • DumpLexicon • Searcher

  5. Data Structures • BigFiles • Repository • Document Index • Lexicon • Hit Lists • Forward Index • Inverted Index

  6. Repository • Full HTML of every webpage • Compressed using zlib • Prefixed by docID, length, URL • Files stored one after another
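A repository record along these lines can be sketched with zlib, as the slide notes. The exact header field widths (docID, compressed length, URL length) are assumptions; the paper does not fix them at this granularity.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte docID, 4-byte compressed length,
# 2-byte URL length, then the URL bytes, then zlib-compressed HTML.
# Records like this are appended one after another to form the repository.

HEADER = struct.Struct("<IIH")

def pack_record(doc_id, url, html):
    payload = zlib.compress(html.encode("utf-8"))
    url_bytes = url.encode("utf-8")
    return HEADER.pack(doc_id, len(payload), len(url_bytes)) + url_bytes + payload

def unpack_record(buf):
    doc_id, comp_len, url_len = HEADER.unpack_from(buf, 0)
    off = HEADER.size
    url = buf[off:off + url_len].decode("utf-8")
    html = zlib.decompress(buf[off + url_len:off + url_len + comp_len]).decode("utf-8")
    return doc_id, url, html

record = pack_record(7, "http://example.com/", "<html>hi</html>")
```

Storing the length in the header is what lets a reader skip from one record to the next without parsing the compressed payload.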

  7. Document Index • Fixed width ISAM index • Stores document status, pointer to repository, document checksum • If document has been crawled, ptr to variable length docinfo file stored • Otherwise ptr to URLlist stored
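The fixed width is what makes the ISAM index directly seekable by docID. A sketch under assumed field widths (the paper does not publish the exact layout): a status byte, an 8-byte pointer into the repository or URL list, and an 8-byte checksum.

```python
import io
import struct

# Illustrative fixed-width document-index record, keyed by docID.
# Fields: docID, status byte, pointer (repository/docinfo or URLlist),
# checksum. Widths are assumptions, not the paper's exact layout.
RECORD = struct.Struct("<IBQQ")

def write_entry(f, doc_id, status, pointer, checksum):
    f.seek(doc_id * RECORD.size)        # fixed width => direct seek by docID
    f.write(RECORD.pack(doc_id, status, pointer, checksum))

def read_entry(f, doc_id):
    f.seek(doc_id * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))

index_file = io.BytesIO()               # stands in for the on-disk index
write_entry(index_file, 3, status=1, pointer=1024, checksum=0xDEADBEEF)
entry = read_entry(index_file, 3)
```

Because every record has the same size, looking up a document is one seek and one read, with no search at all.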

  8. Hit Lists • Plain and Fancy hits • 2 bytes for each hit • Length of hit list stored before hit
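A plain hit as the paper describes it fits in the 2 bytes the slide mentions: one capitalization bit, 3 bits of font size (the value 7 is reserved to flag fancy hits), and 12 bits of word position. A small packing sketch:

```python
# Pack a "plain" hit into 16 bits:
#   bit 15      : capitalization flag
#   bits 12-14  : relative font size (0-6; 7 marks a fancy hit)
#   bits 0-11   : word position within the document (0-4095)

def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 7      # 7 is reserved for fancy hits
    assert 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 5, 100)
```

Hand-tuned bit packing like this is why hit lists dominate the index size budget yet stay compact.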

  9. Forward Index • Stored in 64 barrels. • If a document contains words in a barrel, then the docID is recorded into the barrel, with the list of wordID’s and hitlists. • Each wordID stored as a relative difference from the minimum wordID in a barrel. (24 bits for the wordID, 8 for hitlist length).
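The relative-wordID trick can be sketched as bit packing: because each barrel covers a contiguous wordID range, a 24-bit offset from the barrel's minimum wordID plus an 8-bit hit-list length fit in a single 32-bit word. Names here are illustrative.

```python
# Forward-barrel entry: 24-bit wordID delta within the barrel's range
# plus an 8-bit hit-list length, packed into one 32-bit integer.

def encode_forward_entry(word_id, barrel_min, hit_count):
    delta = word_id - barrel_min
    assert 0 <= delta < (1 << 24) and 0 <= hit_count < (1 << 8)
    return (delta << 8) | hit_count

def decode_forward_entry(entry, barrel_min):
    return barrel_min + (entry >> 8), entry & 0xFF

entry = encode_forward_entry(word_id=5_000_123, barrel_min=5_000_000, hit_count=42)
```

The saving is real but modest per entry; across billions of postings it adds up, which is the design pressure behind the 64-barrel split.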

  10. Inverted Index • Same barrels as forward index, but processed by the sorter. • For every wordID, doclist of docIDs generated, with corresponding hitlists. • Two sets of inverted barrels, one for hitlists with anchor or title text, another for all hitlists.
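What the sorter does can be shown in miniature: per-document forward postings become per-word doclists. The real barrels are sorted on disk; an in-memory dict stands in for that here, and the structure is illustrative.

```python
from collections import defaultdict

def invert(forward):
    """forward: dict docID -> {wordID: hitlist}.
    Returns dict wordID -> list of (docID, hitlist), docIDs ascending."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):          # ascending docIDs keep doclists sorted
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)

# Two documents: doc 1 contains word 10; doc 2 contains words 10 and 20.
index = invert({1: {10: [0, 4]}, 2: {10: [1], 20: [2]}})
```

Keeping doclists sorted by docID is what makes the merge-style query evaluation on the searching slide possible.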

  11. Indexing the Web • Parser – flex used to generate a lexical analyzer – “involved a fair amount of work” • Indexing Documents into barrels • Every word hashed into a wordID • Occurrences translated into hit lists and written into forward barrels • Lexicon needs to be shared • Extra words written into a log, processed by one final indexer
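The shared-lexicon scheme can be sketched as follows: words already in the base lexicon resolve to wordIDs immediately, while unseen words go to a log that one final indexing pass resolves. The data structures are illustrative, not the paper's.

```python
# Sketch of the shared-lexicon lookup used while writing forward barrels.
# Unknown words are deferred to a log rather than mutating the shared
# lexicon from many parallel indexers.

def lookup_word_ids(words, lexicon, extra_log):
    ids = []
    for w in words:
        if w in lexicon:
            ids.append(lexicon[w])
        else:
            extra_log.append(w)     # resolved later by the one final indexer
            ids.append(None)
    return ids

base_lexicon = {"web": 0, "search": 1}
log = []
ids = lookup_word_ids(["web", "hypertext", "search"], base_lexicon, log)
```

Deferring the rare words is what lets many indexers run in parallel against a fixed base lexicon without locking.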

  12. Searching • (1) Parse the query. • (2) Convert words into wordIDs. • (3) Seek to the start of the doclist in the short barrel for every word. • (4) Scan through the doclists until there is a document that matches all the search terms. • (5) Compute the rank of that document for the query. • (6) If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step (4). • (7) If we are not at the end of any doclist, go to step (4). • (8) Sort the documents that have matched by rank and return the top k.
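The scan in the middle of these steps amounts to a parallel walk over sorted doclists, emitting docIDs present in every list. A sketch of that core intersection (the short-barrel/full-barrel fallback is omitted for brevity):

```python
# Intersect sorted doclists, one per query word. Each cursor advances
# until every list is positioned on the same docID, which is a match.

def intersect_doclists(doclists):
    """doclists: list of ascending lists of docIDs, one per query word."""
    positions = [0] * len(doclists)
    matches = []
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        top = max(current)
        if all(c == top for c in current):
            matches.append(top)
            positions = [p + 1 for p in positions]
        else:
            # advance every list that is behind the largest docID seen
            positions = [p + (dl[p] < top) for p, dl in zip(positions, doclists)]
    return matches

matches = intersect_doclists([[1, 3, 5, 7], [3, 4, 5], [2, 3, 5, 9]])
```

Because the doclists are sorted, each cursor only ever moves forward, so the whole query costs one linear pass per term.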

  13. Ranking… • Count weight generated for each word in query • Dot product taken with type weight vector (for single word queries) or with type-prox weight vector (for multiple word queries) • Combined with PageRank to give final score.
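An illustrative version of this scoring: per-type hit counts are damped into count-weights, dotted with a type-weight vector, then combined with PageRank. The specific weights, the damping curve, and the combination function are all assumptions; the paper deliberately does not publish its parameters.

```python
# Hypothetical type weights: hits in titles and anchors count for more
# than plain-text hits. Values are illustrative, not Google's.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

def ir_score(hit_counts_by_type, max_count=40):
    score = 0.0
    for hit_type, count in hit_counts_by_type.items():
        # counts taper off: beyond max_count, extra hits add nothing
        count_weight = min(count, max_count) / max_count
        score += count_weight * TYPE_WEIGHTS.get(hit_type, 1.0)
    return score

def final_score(hit_counts_by_type, pagerank, alpha=0.5):
    # assumed linear blend of IR score and PageRank
    return alpha * ir_score(hit_counts_by_type) + (1 - alpha) * pagerank

score = final_score({"title": 2, "plain": 30}, pagerank=0.4)
```

The capped count-weight captures the paper's point that rank should not grow without bound as a term repeats, which blunts simple keyword-stuffing.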

  14. Results • High quality pages • zlib – 3:1 ratio • 9 days to download 26 million pages • Indexer and crawler ran simultaneously • Future work: • Query caching, smart disk allocation, updates • User context, relevance feedback

  15. Footnote …foot in mouth!! • “we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.”
