Web Search – Summer Term 2006 VI. Web Search - Indexing

(c) Wolfgang Hürst, Albert-Ludwigs-University

Indexing in the 1st Google engine
- Parsing of the HTML pages in the repository
- Indexing of the documents
  - Store the indexed documents in barrels
  - Encode each word as a wordID
  - Create a lexicon that maps words to wordIDs
  - Store hit lists in forward barrels
  (Note: the indexing process is parallelized)
- Sorting
  - Sort the anchor and title hits from the forward barrels into inverted barrels, and all other hits into full-text inverted barrels
Now: description of the major data structures
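The indexing and sorting steps can be illustrated with a minimal sketch. It assumes whitespace tokenization, word positions as the only hit information, and a single in-memory "barrel"; the names build_forward_index and invert are hypothetical and not taken from [2].

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Build a forward index {doc_id: {word_id: [positions]}}.

    docs maps doc_id -> text. lexicon maps word -> wordID and is
    extended in place whenever a new word is seen.
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # Assign the next free wordID the first time a word appears.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort forward entries by wordID: {word_id: [(doc_id, positions)]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


docs = {1: "web search engines index the web", 2: "search the index"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
print(inverted[lexicon["search"]])
```

In this toy version the sort step is a plain regrouping by wordID; the separation into anchor/title barrels and full-text barrels is left out.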

Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
Components: Crawlers, URL Server, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Lexicon, Barrels, Sorters, DumpLexicon, PageRank, Searcher

REPOSITORY: a sequence of records, each consisting of
DOCID | ECODE | URL_LEN | PAGE_LEN | URL | PAGE
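A length-prefixed record layout like the repository's can be sketched as follows. The field widths (4-byte docID, 2-byte ecode, 2-byte URL length, 4-byte page length) and the omission of compression are assumptions made for this illustration; the slides do not specify them.

```python
import struct

# Header layout (assumed widths): docID, ecode, URL length, page length.
HEADER = struct.Struct("<IHHI")


def pack_record(doc_id, ecode, url, page):
    """Serialize one repository record: fixed header, then URL, then page."""
    url_bytes = url.encode("utf-8")
    page_bytes = page.encode("utf-8")
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    return header + url_bytes + page_bytes


def unpack_records(buffer):
    """Yield (doc_id, ecode, url, page) for each record in the buffer."""
    offset = 0
    while offset < len(buffer):
        doc_id, ecode, url_len, page_len = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        url = buffer[offset:offset + url_len].decode("utf-8")
        offset += url_len
        page = buffer[offset:offset + page_len].decode("utf-8")
        offset += page_len
        yield doc_id, ecode, url, page


repository = (
    pack_record(1, 0, "http://a.example/", "<html>A</html>")
    + pack_record(2, 0, "http://b.example/", "<html>B</html>")
)
for record in unpack_records(repository):
    print(record)
```

Because every record carries its own lengths, the repository can be read sequentially without any other data structure.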

Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)

DOCUMENT INDEX: DOCID ->
- Current document status
- Pointer to repository
- Document checksum
- Various statistics
- Document info (URL + title) if the document has been crawled
- Pointer to URL list otherwise

Additional file to convert URLs to docIDs: URL checksum -> DOCID
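The URL-to-docID conversion can be sketched as a lookup in a table sorted by URL checksum. CRC32 is only a stand-in here, since the slides do not name the checksum function, and the binary search over a sorted list is an assumption about how the file is queried.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC32 is a stand-in; the actual checksum function is not given.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    """Return a list of (checksum, doc_id) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table; return the docID or None."""
    key = url_checksum(url)
    # (key,) sorts before every (key, doc_id), so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_checksum_table({"http://a.example/": 1, "http://b.example/": 2})
print(lookup_docid(table, "http://b.example/"))
```

Storing checksums instead of full URLs keeps the table entries at a fixed, small size, at the cost of possible collisions, which this sketch ignores.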

Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)

ANCHORS: source, destination, and anchor text
LINKS: pairwise docIDs
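How anchor records might be turned into docID pairs can be sketched as below, loosely following the role of the URL resolver described in [2]: relative URLs are made absolute and then mapped to docIDs. The function name resolve_anchors, the tuple layout, and the decision to skip unknown destinations are assumptions for this example.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, url_to_docid):
    """Convert anchor records into link pairs and per-target anchor text.

    anchors: list of (source_doc_id, source_url, href, anchor_text).
    Returns (links, anchor_text_by_doc), where links is a list of
    (source_doc_id, destination_doc_id) pairs.
    """
    links = []
    anchor_text_by_doc = {}
    for source_id, source_url, href, text in anchors:
        absolute = urljoin(source_url, href)  # relative -> absolute URL
        destination_id = url_to_docid.get(absolute)
        if destination_id is None:
            continue  # destination has no docID in this toy example
        links.append((source_id, destination_id))
        # Anchor text is associated with the page the link points to.
        anchor_text_by_doc.setdefault(destination_id, []).append(text)
    return links, anchor_text_by_doc


url_to_docid = {"http://a.example/b.html": 2, "http://b.example/": 3}
anchors = [
    (1, "http://a.example/dir/page.html", "../b.html", "page b"),
    (1, "http://a.example/dir/page.html", "http://b.example/", "site b"),
    (2, "http://a.example/b.html", "missing.html", "nowhere"),
]
links, anchor_text_by_doc = resolve_anchors(anchors, url_to_docid)
print(links)
```

The resulting pairs are the kind of input a link-based ranking such as PageRank works on.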

Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)

INVERTED INDEX: WORD -> DOCUMENT
LEXICON, one entry per word:
  WORDID, NDOCS
INVERTED BARRELS, for each lexicon entry a list of:
  DOCID, NO-OF-HITS, HIT1, HIT2, ...
  DOCID, NO-OF-HITS, HIT1, HIT2, ...
  . . .

FORWARD INDEX: DOCUMENT -> WORD
FORWARD BARRELS, for each document:
  DOCID
    WORDID, NO-OF-HITS, HIT1, HIT2, ...
    WORDID, NO-OF-HITS, HIT1, HIT2, ...
    . . .
    NULL WORDID

HITS:
- Fancy hit (URL, title, anchor text, meta tag): capitalization, font size, type, position in document
- Plain hit (everything else): capitalization, font size, position in document
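The two hit formats can be illustrated by packing each hit into 16 bits. The bit widths used here (1 bit capitalization, 3 bits font size, 12 bits position for plain hits; font size 7 as the fancy marker, 4 bits type, 8 bits position for fancy hits) follow the description in [2] and are not stated on the slides. The clamping of large positions and all function names are assumptions of this sketch.

```python
FANCY_FONT = 0b111  # font-size value reserved to mark a fancy hit


def encode_plain_hit(capitalized, font_size, position):
    """Pack a plain hit: 1 bit cap, 3 bits font size, 12 bits position."""
    if not 0 <= font_size < FANCY_FONT:
        raise ValueError("plain hits use font sizes 0..6")
    position = min(position, 0xFFF)  # clamp (assumed handling of overflow)
    return (int(capitalized) << 15) | (font_size << 12) | position


def encode_fancy_hit(capitalized, hit_type, position):
    """Pack a fancy hit: 1 bit cap, marker 7, 4 bits type, 8 bits position."""
    if not 0 <= hit_type <= 0xF:
        raise ValueError("hit type must fit in 4 bits")
    position = min(position, 0xFF)  # clamp (assumed handling of overflow)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into a dictionary of its fields."""
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0b111
    if font_size == FANCY_FONT:
        return {"kind": "fancy", "capitalized": capitalized,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "capitalized": capitalized,
            "font_size": font_size, "position": hit & 0xFFF}


print(decode_hit(encode_plain_hit(True, 3, 100)))
print(decode_hit(encode_fancy_hit(False, 2, 5)))
```

Keeping a hit this compact matters because hit lists account for most of the space in both the forward and the inverted barrels.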

Architecture of the 1st Google Search Engine: PAGERANK (cf. [2], Fig. 1)

Query Processing

REPOSITORY: DOCID | ECODE | URLLEN | PAGELEN | URL | PAGE
DOCUMENT INDEX: DOCID ->
- Current document status
- Pointer to repository
- Document checksum
- Various statistics
- Document info (URL + title)
PAGERANK
HITLIST: capitalization, font size, type, position in document
LEXICON: WORDID, NDOCS
INVERTED INDEX / BARRELS:
  DOCID, NO-OF-HITS, HIT1, HIT2, ...
  DOCID, NO-OF-HITS, HIT1, HIT2, ...
  . . .
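How these structures could work together at query time is sketched below: query words are mapped to wordIDs via the lexicon, the document lists from the inverted index are intersected, and the candidates are ranked. The scoring formula (number of hits times PageRank) is a placeholder assumption, not the ranking function of [2], and the sample data is invented for the example.

```python
def search(query, lexicon, inverted, pagerank):
    """Return docIDs containing all query words, best score first.

    lexicon: word -> wordID
    inverted: wordID -> list of (doc_id, hit_list)
    pagerank: doc_id -> PageRank value
    """
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word cannot match any document
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # One {doc_id: hit_list} mapping per query word.
    doc_lists = [dict(inverted.get(word_id, [])) for word_id in word_ids]

    # Keep only documents that appear in every document list.
    candidates = set(doc_lists[0])
    for doc_list in doc_lists[1:]:
        candidates &= set(doc_list)

    def score(doc_id):
        # Placeholder scoring: total hit count weighted by PageRank.
        hit_count = sum(len(doc_list[doc_id]) for doc_list in doc_lists)
        return hit_count * pagerank.get(doc_id, 0.0)

    return sorted(candidates, key=lambda d: (-score(d), d))


sample_lexicon = {"web": 0, "search": 1}
sample_inverted = {
    0: [(1, [0, 5]), (3, [2])],
    1: [(1, [1]), (2, [0]), (3, [0, 4])],
}
sample_pagerank = {1: 0.2, 2: 0.5, 3: 0.4}
print(search("web search", sample_lexicon, sample_inverted, sample_pagerank))
```

The document index and repository would then be used to fetch the URL, title, and page content for the returned docIDs.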

Further reading
Note: This information is from a 1998 paper (with a collection of 25 million pages).
Newer information about the infrastructure and data structures used by Google (today?) can be found in the following references:
- Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System
- Luiz Andre Barroso, Jeffrey Dean, Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture
which are available at http://labs.google.com/papers/

References - Indexing
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001, Chapter 4 (Indexing)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998, Chapter 4 (System Anatomy)
[3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 19/3, July 2001
