Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g., Google)
The Web
• Index sizes
  • 1994: World Wide Web Worm (McBryan) indexes 110,000 pages/documents
  • 1997: search engines claim from 2 million to 100 million pages indexed
  • 2005: Google claims 8 billion (a later claim of 25 billion was removed)
• Queries
  • 1994: WWWW handled 1,500 queries per day
  • 1997: AltaVista handled 20 million queries per day
  • 2002: Google handled 250 million queries per day
  • 2004: 2.5 billion queries per day
• Clearly, Web search engines must scale to both increased corpus size and increased use
Google Design Goals
• Scale up to work with much larger collections
• In 1997, older search engines were breaking down due to manipulation by advertisers, etc.
• Exhibit high precision among top-ranked documents
  • Many documents are kind of relevant
  • For the Web, "relevant" should mean only the very best documents
• Provide data sets for academic research on search engines
Changes from Earlier Engines
• PageRank
  • Ranks the importance of pages
  • Uses citation patterns, as in academic research
  • Normalizes for the number of links on a page
  • A variation on the probability that a random surfer would view a page (formula below)
• Anchor Text
  • Stores the text of anchors with the pages they link to
  • Incoming link anchors can be a better index than the page's own content
  • Allows uncrawled and non-textual content to be indexed
  • Not a new idea: used by WWWW in 1994
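As given in the paper, the PageRank of a page $A$ that pages $T_1 \ldots T_n$ point to is

$$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right)$$

where $C(T)$ is the number of links going out of page $T$ and $d$ is a damping factor (usually set to 0.85): the chance that the random surfer keeps following links rather than jumping to a random page.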
Related Work
• Information retrieval
  • Traditional IR benchmarks (TREC '96) treated 20 GB as a very large corpus
  • Term-vector models assume all documents have some value
    • Short documents (few words) that match a query term get ranked highest (illustrated below)
    • "bill clinton" returns a page that only says "bill clinton sucks"
• Different from well-controlled collections
  • Content has abnormal vocabulary (e.g. product codes)
  • Content generated by systems (e.g. complex IDs, binhexed data)
  • Content includes bugs/flakiness/disinformation
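A minimal illustration of the short-document problem, using hypothetical documents and a naive length-normalized term-frequency score (not any particular engine's formula):

```python
# Naive scoring: fraction of the document's words that match the query.
# Shows why a trivially short page can outrank a substantive one.

def tf_score(query, doc):
    words = doc.lower().split()
    matches = sum(words.count(q) for q in query)
    return matches / len(words)

query = ["bill", "clinton"]
short_doc = "bill clinton sucks"
long_doc = ("bill clinton was the 42nd president of the united states "
            "serving two terms from 1993 to 2001")

print(tf_score(query, short_doc))  # 0.667 -- the three-word page wins
print(tf_score(query, long_doc))   # ~0.118 -- the informative page ranks below it
```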
Google Anatomy
• URL Server
  • Manages lists of URLs to be retrieved
• Crawlers
  • Retrieve content for the URLs provided
• Store Server
  • Compresses and stores content in the repository (flow sketched below)
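A toy sketch of that pipeline, assuming a single process with an in-memory queue (the real system ran several distributed crawlers); fetch() is a stand-in for a real HTTP client, and link extraction/re-queueing is omitted:

```python
import zlib
from collections import deque

url_queue = deque(["http://example.com/"])  # URL Server: URLs to be retrieved
repository = []                             # Store Server output

def fetch(url):
    # Stand-in for a real HTTP fetch; returns raw page bytes.
    return b"<html>...</html>"

while url_queue:
    url = url_queue.popleft()                      # URL Server hands out a URL
    page = fetch(url)                              # Crawler retrieves the content
    repository.append((url, zlib.compress(page)))  # Store Server compresses and stores
```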
Google Anatomy
• Indexer
  • Reads and uncompresses content in the repository
  • Parses documents for word occurrences ("hits") and links
  • Records the word, position in document, relative font size, and capitalization for each hit
  • Hits are divided into barrels (loop sketched below)
  • Records the anchor text plus source and destination information for each link in an Anchors file
  • Generates a lexicon (vocabulary list)
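A toy sketch of the indexer loop under stated assumptions (the document is already tokenized; the lexicon contents and barrel routing are illustrative); hit encoding itself is detailed later under Hit Lists:

```python
NUM_BARRELS = 64
MAX_WORD_ID = 14_000_000
LEXICON = {"anatomy": 17, "search": 530_000, "engine": 9_100_000}  # word -> wordID

barrels = [[] for _ in range(NUM_BARRELS)]  # each barrel covers a wordID range

def index_document(doc_id, words):
    for position, word in enumerate(words):
        word_id = LEXICON[word]
        hit = (position,)  # placeholder; real hits also record font size and caps
        barrel = word_id * NUM_BARRELS // (MAX_WORD_ID + 1)  # route by wordID range
        barrels[barrel].append((doc_id, word_id, hit))

index_document(1, ["anatomy", "search", "engine"])
```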
Google Anatomy
• URL Resolver
  • Converts relative URLs into absolute URLs (see the sketch below)
  • Generates a document ID for each absolute URL
  • Puts anchor text into the forward index for the page the link points to
  • Generates a database of links as pairs of document IDs
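A minimal sketch of URL resolution and docID assignment; the hash-based ID scheme here is an assumption for illustration (the real system keys docIDs by URL checksum via the document index):

```python
import hashlib
from urllib.parse import urljoin

doc_ids = {}   # absolute URL -> document ID
links = []     # link database: (from docID, to docID) pairs

def resolve(base_url, relative_url):
    """Convert a relative URL to absolute and assign it a document ID."""
    absolute = urljoin(base_url, relative_url)
    if absolute not in doc_ids:
        # Hypothetical ID scheme, for illustration only.
        doc_ids[absolute] = int(hashlib.sha1(absolute.encode()).hexdigest()[:8], 16)
    return absolute, doc_ids[absolute]

page_url, page_id = resolve("http://a.com/", "index.html")
target_url, target_id = resolve(page_url, "b/page.html")
links.append((page_id, target_id))  # one entry per link found on the page
```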
Google Anatomy
• Sorter
  • Re-sorts (in place) the barrels by word ID to generate the inverted index (sketched below)
  • Produces a list of word IDs and offsets into the inverted index
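A minimal sketch of the sorter's job, assuming a barrel is a list of (docID, wordID, hit) postings: re-sorting by wordID turns the document-ordered forward barrel into a word-ordered inverted barrel.

```python
# Forward barrel: postings ordered by docID, as the indexer wrote them.
forward_barrel = [
    (1, 42, "hit-a"), (1, 7, "hit-b"),
    (2, 42, "hit-c"), (3, 7, "hit-d"),
]

# Sorter: re-sort by (wordID, docID) to produce the inverted barrel.
inverted_barrel = sorted(forward_barrel, key=lambda p: (p[1], p[0]))

# Offsets into the inverted index, one per wordID, for DumpLexicon/the Searcher.
offsets = {}
for i, (doc_id, word_id, hit) in enumerate(inverted_barrel):
    offsets.setdefault(word_id, i)

print(offsets)  # {7: 0, 42: 2}
```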
Google Anatomy
• PageRank
  • Generates page rankings based on links (iteration sketched below)
• DumpLexicon
  • Takes the lexicon and inverted index and generates the lexicon used by the Searcher
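A minimal power-iteration sketch of the PageRank formula shown earlier; the four-page graph and the damping factor d = 0.85 are illustrative:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> rank."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    ranks = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {p: 1 - d for p in pages}              # the (1 - d) base term
        for page, outs in links.items():
            if outs:
                share = d * ranks[page] / len(outs)  # normalize by out-link count
                for target in outs:
                    new[target] += share
        ranks = new
    return ranks

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}))
```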
Google Data Structures
• BigFiles
  • Used to create a virtual file system spanning multiple file systems
  • Addressable by 64-bit integers
• Repository
  • Page contents compressed using zlib to balance speed and compression factor
  • Stores document ID, length, and URL as a prefix to each document (layout sketched below)
  • Documents stored sequentially
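A sketch of the repository's sequential record layout: the paper prefixes each zlib-compressed document with its docID, length, and URL, but the exact field widths here are assumptions for illustration.

```python
import struct
import zlib

def write_record(f, doc_id, url, html):
    compressed = zlib.compress(html)  # zlib: chosen for its speed/compression balance
    url_bytes = url.encode()
    # Assumed layout: docID (8 bytes), URL length (2), payload length (4), URL, payload.
    f.write(struct.pack("<QHI", doc_id, len(url_bytes), len(compressed)))
    f.write(url_bytes)
    f.write(compressed)               # records simply follow one another

with open("repository.dat", "wb") as f:
    write_record(f, 1, "http://example.com/", b"<html>hello</html>")
```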
Google Data Structures
• Document Index
  • Ordered by document ID
  • Includes document status, a pointer into the repository, a document checksum, and other statistics
• Lexicon
  • 14 million words (rare words not included) concatenated together with separating nulls (sketched below)
  • A hash table of pointers
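A minimal sketch of the null-separated lexicon, assuming the hash table maps each word to its byte offset in the concatenated string:

```python
words = ["anatomy", "engine", "search"]

# Concatenate the vocabulary with separating NUL bytes.
lexicon = b"\0".join(w.encode() for w in words) + b"\0"

# Hash table of pointers: word -> byte offset of that word in the lexicon.
offsets, pos = {}, 0
for w in words:
    offsets[w] = pos
    pos += len(w) + 1  # +1 for the NUL separator

def lookup(word):
    start = offsets[word]
    end = lexicon.index(b"\0", start)
    return lexicon[start:end].decode()

print(lookup("engine"))  # "engine"
```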
Google Data Structures
• Hit Lists
  • Each hit encoded in two bytes (packing sketched below)
  • Plain and fancy hits
  • Plain hits include:
    • 1 capitalization bit
    • 3 font-size bits
    • 12 position bits
  • Fancy hits (identified by 111 in the font-size bits) include:
    • 4 bits for hit type
    • 8 bits for position
  • Anchor-type hits have:
    • 4 bits for a hash of the document ID of the anchor
    • 4 bits for position
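A sketch of packing a plain hit into two bytes with the bit widths listed above; the field order within the 16 bits is an assumption for illustration:

```python
def pack_plain_hit(cap, font, position):
    """1 capitalization bit, 3 font-size bits, 12 position bits (high to low)."""
    assert font != 0b111           # 111 in the font bits marks a fancy hit
    assert position < (1 << 12)    # positions past 4095 would need clamping
    return (cap << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(cap=1, font=2, position=187)
print(f"{hit:016b}", unpack_plain_hit(hit))  # 1010000010111011 (1, 2, 187)
```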
Google Data Structures
• Forward Index
  • Stored in 64 barrels
  • Each barrel holds a range of word IDs
  • A barrel entry includes a document ID and the list of word IDs that belong in that barrel
  • Document IDs are duplicated across barrels
  • Word IDs stored as differences from the minimum ID for the barrel (24 bits per word; sketched below)
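A sketch of the 24-bit trick, assuming each barrel knows its minimum wordID: storing wordID minus that minimum takes fewer bits than a full wordID would.

```python
BARREL_MIN = 5_000_000  # illustrative minimum wordID for this barrel

def encode_word_id(word_id):
    delta = word_id - BARREL_MIN
    assert 0 <= delta < (1 << 24)  # the difference must fit in 24 bits
    return delta

def decode_word_id(delta):
    return BARREL_MIN + delta

print(encode_word_id(5_000_123))                  # 123
print(decode_word_id(encode_word_id(5_000_123)))  # 5000123
```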
Google Data Structures
• Inverted Index
  • Same barrels as the forward index
  • The lexicon points to the barrel and to a list of document IDs with their hit lists
  • Two sets of barrels:
    • One for title and link-anchor text (used first)
    • One for the rest (used if the first set yields too few hits; lookup sketched below)
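A sketch of the two-tier lookup, assuming each tier maps a wordID to its posting list and an illustrative hit threshold:

```python
MIN_HITS = 10  # assumed threshold for "enough hits"

def postings(word_id, title_anchor_barrels, full_text_barrels):
    """Search the title/anchor tier first; fall back to the full-text tier."""
    hits = title_anchor_barrels.get(word_id, [])
    if len(hits) < MIN_HITS:
        hits = hits + full_text_barrels.get(word_id, [])
    return hits
```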
Crawling, Indexing, Searching
• Web sites were surprised to be crawled in 1997
  • Required people to deal with email questions
• Parsing HTML is a challenge
  • Does not use YACC to generate a CFG parser: too much overhead (too slow)
  • Uses flex to generate a lexical analyzer
• Ranking
  • Uses a vector of count weights and a vector of type weights
  • Biased toward proximity of search terms
  • Uses PageRank to give a final rank (scoring sketched below)
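A minimal sketch of the count-weight/type-weight ranking described above, with assumed type weights, an assumed taper function, and an assumed way of combining the IR score with PageRank (the paper does not give exact values):

```python
import math

TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}  # assumed values

def count_weight(count, cap=8):
    # Grows with the count at first, then tapers off so huge counts can't dominate.
    return math.log1p(min(count, cap))

def score(hit_counts, pagerank, alpha=0.5):
    # IR score: dot product of count weights with type weights.
    ir = sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())
    return alpha * ir + (1 - alpha) * pagerank  # combined with PageRank for final rank

print(score({"title": 1, "anchor": 3, "plain": 12}, pagerank=2.5))
```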