The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca K.ApazaH@gmail.com San Pablo Catholic University
ABSTRACT • Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. • Designed to crawl and index the Web efficiently. • Technology and web proliferation. • Anyone can publish anything they want.
INTRODUCTION • Challenges for information retrieval. • Google: a common spelling of googol (10^100). • Web Search Engines – Scaling Up: • World Wide Web Worm (WWWW). • Google: Scaling with the Web
DESIGN GOALS • “The best navigation service should make it easy to find almost anything on the web.” • “Junk results” often wash out any results that a user is interested in. • People are still only willing to look at the first few tens of results.
Push More Development • Push development and understanding into the academic realm. • Build systems that a reasonable number of people can actually use. • Support novel research activities on large-scale web data.
PAGE RANK: Bringing Order to the Web • The PageRank of a page A is given as follows: PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is a damping factor (usually set to 0.85). • Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.
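As a rough illustration of how this recurrence is evaluated in practice (the tiny three-page link graph and d = 0.85 are assumptions for the sketch, not data from the paper), the values can simply be iterated until they settle:

    # Minimal PageRank sketch (assumed toy link graph; d = 0.85 as in the paper).
    # links[page] lists the pages that `page` points to.
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
    }

    d = 0.85
    pr = {page: 1.0 for page in links}           # initial PageRank for every page

    for _ in range(50):                          # iterate until the values settle
        new_pr = {}
        for page in links:
            # sum PR(T)/C(T) over all pages T that link to `page`
            incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr

    print(pr)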
ANCHOR TEXT • The text of links is treated in a special way in our search engine: it is associated with the page the link points to. • Anchors often provide more accurate descriptions of web pages than the pages themselves. • Makes it possible to return documents which cannot be indexed by a text-based engine, such as images, programs, and pages that were never crawled.
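A minimal sketch of the anchor-text idea (the page data below is invented for illustration): each link's anchor words are indexed under the page the link points to, so a page can be found by how others describe it even if it was never fetched:

    # Sketch of anchor-text propagation (hypothetical input data).
    # Each crawled page contributes its link anchor text to the *target* URL.
    crawled_pages = {
        "http://a.example/": [("http://b.example/", "cheap flights"),
                              ("http://c.example/", "weather forecast")],
        "http://b.example/": [("http://c.example/", "daily weather")],
    }

    anchor_index = {}                      # target URL -> list of anchor words
    for source, links in crawled_pages.items():
        for target, anchor_text in links:
            anchor_index.setdefault(target, []).extend(anchor_text.split())

    # "http://c.example/" is findable by "weather" even if it was never fetched.
    print(anchor_index["http://c.example/"])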
MAJOR DATA STRUCTURES • A disk seek requires about 10 ms to complete. • Google's data structures are designed so that disk seeks are avoided whenever possible.
BIG FILES • Virtual files spanning multiple file systems. • Addressable by 64-bit integers. • The BigFiles package handles allocation and deallocation of file descriptors.
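A rough sketch of the addressing idea, not the actual BigFiles implementation: a 64-bit virtual offset is mapped onto several ordinary files of fixed size (the 1 GB chunk size and the file-naming scheme are assumptions, and writes spanning a chunk boundary are ignored here):

    import os

    CHUNK = 2**30                                    # assumed 1 GB per underlying file

    class BigFile:
        """One virtual file addressed by a 64-bit offset, backed by ordinary files."""
        def __init__(self, prefix):
            self.prefix = prefix

        def _locate(self, offset):
            # map a 64-bit virtual offset to (underlying file name, local offset)
            return f"{self.prefix}.{offset // CHUNK:04d}", offset % CHUNK

        def write(self, offset, data):
            name, local = self._locate(offset)
            with open(name, "r+b" if os.path.exists(name) else "wb") as f:
                f.seek(local)
                f.write(data)

        def read(self, offset, size):
            name, local = self._locate(offset)
            with open(name, "rb") as f:
                f.seek(local)
                return f.read(size)

    big = BigFile("/tmp/repo")
    big.write(3 * CHUNK + 10, b"hello")              # lands in file /tmp/repo.0003
    print(big.read(3 * CHUNK + 10, 5))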
REPOSITORY • Contains the full HTML of every page. • Pages are compressed with zlib. • All the other data structures can be rebuilt from the repository alone.
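A minimal sketch of repository-style storage, assuming an illustrative record layout rather than the paper's exact format: each entry keeps the docID, the URL, and the zlib-compressed HTML:

    import zlib

    # Sketch of a repository record (illustrative layout, not the paper's format).
    def pack_record(doc_id, url, html):
        compressed = zlib.compress(html.encode("utf-8"))
        header = f"{doc_id}\t{url}\t{len(compressed)}\n".encode("utf-8")
        return header + compressed

    def unpack_html(record):
        header, _, body = record.partition(b"\n")    # header ends at the first newline
        return zlib.decompress(body).decode("utf-8")

    record = pack_record(42, "http://example.com/", "<html><body>hello</body></html>")
    print(unpack_html(record))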
DOCUMENT INDEX • Each entry includes: • Document status • Pointer into the repository • Document checksum • Various statistics • Used to convert URLs into docIDs.
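A small sketch of the URL-to-docID step (the in-memory dict stands in for the sorted checksum file described in the paper, and crc32 is an assumed checksum function):

    import zlib

    # Sketch of URL -> docID resolution (dict stands in for the sorted checksum file).
    checksum_to_docid = {}
    next_docid = 0

    def url_to_docid(url):
        global next_docid
        checksum = zlib.crc32(url.encode("utf-8"))   # assumed checksum function
        if checksum not in checksum_to_docid:        # new URL: assign a fresh docID
            checksum_to_docid[checksum] = next_docid
            next_docid += 1
        return checksum_to_docid[checksum]

    print(url_to_docid("http://example.com/"))       # 0
    print(url_to_docid("http://example.com/"))       # same docID again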
LEXICON • Fits in 256 MB of main memory. • Two parts: • A list of the words • A hash table of pointers
Hit List: • A list of occurrences of a particular word in a particular document. • Uses a compact, hand-optimized encoding (Huffman coding was considered). • Forward Index: • Stored in 64 barrels, partitioned by wordID range. • Inverted Index: • The same barrels after being processed by the sorter. • Doclists sorted by docID.
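A simplified sketch of the sorter's job, using toy in-memory data instead of 64 on-disk barrels: forward-index records keyed by docID are regrouped by wordID to produce doclists:

    from collections import defaultdict

    # Toy forward index: docID -> list of (wordID, hit positions in the document).
    forward = {
        1: [(10, [0, 7]), (42, [3])],
        2: [(42, [1]), (99, [5, 6])],
    }

    # The "sorter" step, simplified: regroup the same records by wordID
    # so that each word gets a doclist.
    inverted = defaultdict(list)
    for doc_id, postings in forward.items():
        for word_id, hits in postings:
            inverted[word_id].append((doc_id, hits))

    for word_id in inverted:
        inverted[word_id].sort()                     # doclists sorted by docID

    print(dict(inverted))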
CRAWLING THE WEB • Involves interacting with hundreds of thousands of web servers and various name servers. • A single URLserver serves lists of URLs to a number of crawlers. • The crawlers are implemented in Python.
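The sketch below is only a single-threaded illustration of the fetch-parse-enqueue loop; the real system runs several distributed crawlers fed by the URLserver, with DNS caching, robots.txt handling, and hundreds of connections open at once (the seed URL and regex-based link extraction are assumptions):

    import re
    import urllib.request
    from collections import deque

    # Single-threaded crawl-loop illustration (assumed seed; no robots.txt, no DNS cache).
    seed_urls = ["http://example.com/"]
    frontier = deque(seed_urls)
    seen = set(seed_urls)

    while frontier and len(seen) < 10:               # tiny limit for the sketch
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                                 # skip unreachable pages
        # extract outgoing links and enqueue the ones we have not seen yet
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)

    print(f"crawled/queued {len(seen)} URLs")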
INDEXING THE WEB • Parsing • Must handle a huge array of possible errors. • A YACC-generated CFG parser was rejected; instead, flex is used to generate a lexical analyzer. • Indexing Documents into Barrels • Every word is converted into a wordID. • Sorting • The barrels are re-sorted to generate the inverted index.
SEARCHING • Query evaluation proceeds in steps (a simplified sketch in code follows this list): 1. Parse the query. 2. Convert words into wordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms. 5. Compute the rank of that document for the query. 6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. 7. If we are not at the end of any doclist, go to step 4. 8. Sort the documents that have matched by rank and return the top k.
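A simplified sketch of the evaluation loop above (in-memory doclists; the short/full barrel fallback and the real ranking function are omitted, and the doclists and scores are invented):

    # Query evaluation sketch: intersect doclists, then sort the matches by rank.
    doclists = {                                     # word -> docIDs containing it
        "search": [1, 3, 5, 9],
        "engine": [2, 3, 9],
    }
    pagerank = {1: 0.9, 2: 0.4, 3: 1.7, 5: 0.3, 9: 1.1}   # assumed scores

    def evaluate(query):
        words = query.split()
        # keep only documents that contain every query word
        matching = set(doclists[words[0]])
        for w in words[1:]:
            matching &= set(doclists[w])
        # rank the matches (here just by PageRank) and return them best-first
        return sorted(matching, key=lambda d: pagerank[d], reverse=True)

    print(evaluate("search engine"))                 # [3, 9]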
The Ranking System • Factors: • Hit position • Font size • Capitalization • PageRank • Proximity (for multi-word queries)
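A hedged sketch of how such factors might be combined for a single-word query; the dot-product-of-weights form follows the paper's description, but every number below is made up:

    # Sketch of single-word ranking (type weights, cap, and alpha are made-up numbers).
    type_weights = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}

    def ir_score(hit_counts, cap=5):
        # count-weights increase with the number of hits but are capped,
        # so many repetitions of a word stop helping after a point
        return sum(type_weights[t] * min(n, cap) for t, n in hit_counts.items())

    def final_rank(hit_counts, pagerank, alpha=0.5):
        # simple combination of the IR score with PageRank (alpha is an assumption)
        return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

    print(final_rank({"title": 1, "plain": 3}, pagerank=2.0))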
CONCLUSIONS • Google is designed to be a scalable search engine. • It provides high-quality search results over a rapidly growing World Wide Web. • Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. • It is a complete architecture for gathering web pages, indexing them, and performing search queries over them.