Web Search – Summer Term 2006
VI. Web Search - Indexing
(c) Wolfgang Hürst, Albert-Ludwigs-University
Indexing in the 1st Google engine
- Parsing of the HTML pages in the repository
- Indexing of the documents
  - Store indexed docs in barrels
  - Encode each word as a wordID
  - Create a lexicon that maps words to wordIDs
  - Store hit lists in forward barrels
  (Note: the indexing process is parallelized)
- Sorting
  - Sort anchor and title hits from the forward barrels into inverted barrels, and all other hits into full-text inverted barrels
Now: description of the major data structures
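The indexing and sorting steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: documents are plain strings split on whitespace, a hit is reduced to a word position, and everything lives in memory in a single process. The function names `build_forward_index` and `invert` are hypothetical and do not come from the slides or from [2].

```python
from collections import defaultdict


def build_forward_index(docs):
    """docs: dict docID -> text. Returns (lexicon, forward index)."""
    lexicon = {}   # word -> wordID
    forward = {}   # docID -> {wordID: [positions]}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # Assign the next free wordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return lexicon, forward


def invert(forward):
    """Re-sort forward entries by wordID: wordID -> [(docID, [positions])]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


docs = {1: "web search engine", 2: "search the web"}
lexicon, forward = build_forward_index(docs)
inverted = invert(forward)
```

The forward index is keyed by docID, which is the natural order while parsing; the sorting step only regroups the same hit lists by wordID so that a query term leads directly to its documents.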
Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
Components: URL Server, Crawlers, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Lexicon, Barrels, Sorters, DumpLexicon, PageRank, Searcher
Repository: one record per page with the fields docID, ecode, URL_len, page_len, URL, page
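A repository record with the fields listed above could be serialized roughly as follows. The slide gives only the field names, so the field widths, the little-endian byte order, and the helper names `pack_record` and `unpack_record` are assumptions made for this sketch. Compressing the page with zlib follows the description in [2]; the slide itself does not mention compression.

```python
import struct
import zlib

# Assumed header layout: docID (4 bytes), ecode (1), URL_len (2), page_len (4).
HEADER = struct.Struct("<IBHI")


def pack_record(doc_id, ecode, url, page):
    """Serialize one page as header + URL + compressed page."""
    url_bytes = url.encode("utf-8")
    page_bytes = zlib.compress(page.encode("utf-8"))
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes


def unpack_record(buf, offset=0):
    """Read one record at offset; return (fields, offset of next record)."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    page_end = start + url_len + page_len
    page = zlib.decompress(buf[start + url_len:page_end]).decode("utf-8")
    return (doc_id, ecode, url, page), page_end
```

Because each record carries its own lengths, records can be stored back to back and read sequentially without any other index.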
Document index (cf. [2], Fig. 1): docID ->
- Current document status
- Pointer to repository
- Document checksum
- Various statistics
- Document info (URL + title) if the document has been crawled
- Pointer to URL list otherwise
Additional file to convert URLs to docIDs: URL checksum -> docID
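The URL-to-docID conversion file can be pictured as a list of (checksum, docID) pairs sorted by checksum and searched with binary search. The sketch below assumes CRC32 as the checksum, since the slides do not name the function, and the class name `UrlToDocId` is hypothetical. Checksum collisions are ignored for brevity.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC32 stands in for the unspecified checksum function (assumption).
    return zlib.crc32(url.encode("utf-8"))


class UrlToDocId:
    """Sorted (URL checksum -> docID) table with binary-search lookup."""

    def __init__(self, url_docid_pairs):
        pairs = sorted((url_checksum(u), d) for u, d in url_docid_pairs)
        self._checksums = [c for c, _ in pairs]
        self._doc_ids = [d for _, d in pairs]

    def lookup(self, url):
        """Return the docID for url, or None if the URL is unknown."""
        c = url_checksum(url)
        i = bisect.bisect_left(self._checksums, c)
        if i < len(self._checksums) and self._checksums[i] == c:
            return self._doc_ids[i]
        return None


table = UrlToDocId([("http://a.example/", 1), ("http://b.example/", 2)])
```

Keeping the table sorted by checksum also allows a batch of URLs to be converted in one pass by sorting the batch and merging it against the table.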
Anchors and links (cf. [2], Fig. 1)
- Anchors: source, destination, and anchor text
- Links: pairs of docIDs
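One way to picture the step from anchors to links is a small resolver that turns each anchor's target into an absolute URL and then into a docID. The tuple layout of an anchor, the dictionary used for the URL-to-docID mapping, and the function name `resolve_links` are assumptions for this sketch. Unknown target URLs are simply skipped here, which is a simplification.

```python
from urllib.parse import urljoin


def resolve_links(anchors, url_to_docid):
    """anchors: iterable of (source_docID, source_url, href, anchor_text).

    Returns the links as (source_docID, destination_docID) pairs.
    """
    links = []
    for src_id, src_url, href, _text in anchors:
        dest_url = urljoin(src_url, href)      # relative -> absolute URL
        dest_id = url_to_docid.get(dest_url)
        if dest_id is not None:                # skip URLs without a docID
            links.append((src_id, dest_id))
    return links
```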
Lexicon, barrels, and hits (cf. [2], Fig. 1)
Inverted index: word -> document
- Lexicon: entries of the form wordID, ndocs, each pointing to a list of documents in the inverted barrels
- Inverted barrels: for each wordID, entries of the form docID, no-of-hits, hit1, hit2, ...
Forward index: document -> word
- Forward barrels: for each docID, entries of the form wordID, no-of-hits, hit1, hit2, ...; the list for a docID ends with a null wordID
Hits
- Fancy hit (URL, title, anchor text, meta tag): capitalization, fontsize, type, position in document
- Plain hit (everything else): capitalization, fontsize, position in document
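The two hit formats can be illustrated by packing each hit into 16 bits. The bit widths used below (1 bit capitalization, 3 bits fontsize, 12 bits position for plain hits; fontsize fixed to 7, 4 bits type, 8 bits position for fancy hits) are taken from [2], not from the slide. The order of the fields within the 16 bits, the clamping of large positions, and the function names are assumptions of this sketch.

```python
FANCY_FONT = 7  # fontsize value reserved to mark a fancy hit


def encode_plain(cap, font, position):
    """Plain hit: 1 bit capitalization, 3 bits fontsize, 12 bits position."""
    if not 0 <= font < FANCY_FONT:
        raise ValueError("plain hits use fontsize 0..6")
    position = min(position, 0xFFF)   # clamp to 12 bits (assumption)
    return (int(cap) << 15) | (font << 12) | position


def encode_fancy(cap, hit_type, position):
    """Fancy hit: 1 bit capitalization, fontsize 7, 4 bits type, 8 bits position."""
    if not 0 <= hit_type < 16:
        raise ValueError("type must fit in 4 bits")
    position = min(position, 0xFF)    # clamp to 8 bits (assumption)
    return (int(cap) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode(hit):
    """Split a 16-bit hit back into its fields."""
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0x7
    if font == FANCY_FONT:
        return {"kind": "fancy", "cap": cap,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "cap": cap, "font": font, "position": hit & 0xFFF}
```

Reserving one fontsize value as the fancy-hit marker lets both formats share the same 16 bits without a separate flag.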
PageRank: component of the architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
Query Processing: data structures involved
- Repository: docID, ecode, URL_len, page_len, URL, page
- Document index: docID -> current document status, pointer to repository, document checksum, various statistics, document info (URL + title)
- PageRank
- Hit list: capitalization, fontsize, type, position in document
- Lexicon: wordID, ndocs
- Inverted index / barrels: docID, no-of-hits, hit1, hit2, ...
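How these structures might work together at query time can be sketched as follows: query words are converted to wordIDs via the lexicon, the document lists from the inverted index are intersected, and the remaining documents are ranked. The scoring formula (number of hits multiplied by PageRank) is a placeholder assumption and not the ranking function of the original engine. The function name `search` and the dictionary-based data layout are hypothetical as well.

```python
def search(query, lexicon, inverted, pagerank, top_k=10):
    """lexicon: word -> wordID; inverted: wordID -> {docID: [hits]};
    pagerank: docID -> rank. Returns up to top_k docIDs, best first."""
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []                  # a query word that occurs nowhere
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # Keep only the documents that contain every query word.
    candidates = set(inverted.get(word_ids[0], {}))
    for w in word_ids[1:]:
        candidates &= set(inverted.get(w, {}))

    def score(doc_id):
        # Placeholder ranking: hit count weighted by PageRank (assumption).
        hit_count = sum(len(inverted[w][doc_id]) for w in word_ids)
        return hit_count * pagerank.get(doc_id, 0.0)

    # Ties are broken by docID to keep the result order deterministic.
    return sorted(candidates, key=lambda d: (-score(d), d))[:top_k]
```

The document index and the repository would then be consulted only for the few top-ranked docIDs, for example to fetch URL and title for the result list.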
Further reading
Note: This was information from a paper from 1998 (with a collection of 25 million pages).
Newer information about the infrastructure and data structures used by Google (today?) can be found in the following references:
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System
- Luiz Andre Barroso, Jeffrey Dean, Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture
which are available at http://labs.google.com/papers/
References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
[3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 19/3, July 2001