
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Kevin Mauricio Apaza Huaranca, K.ApazaH@gmail.com, San Pablo Catholic University.



  1. The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca K.ApazaH@gmail.com San Pablo Catholic University

  2. ABSTRACT • Google is a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. • It is designed to crawl and index the Web efficiently. • Technology advances rapidly and the Web keeps proliferating. • Anyone can publish anything they want.

  3. INTRODUCTION • Challenges for information retrieval. • Google: a common spelling of googol (10^100). • Web Search Engines – Scaling Up: • World Wide Web Worm (WWWW). • Google: Scaling with the Web

  4. DESIGN GOALS • “The best navigation service should make it easy to find almost anything on the Web” • “Junk results” often wash out the results a user is interested in. • People are still only willing to look at the first few tens of results.

  5. Push More Development • Push more development and understanding into the academic realm. • Build systems that a reasonable number of people can actually use. • Support novel research activities on large-scale web data.

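  6. PAGERANK: Bringing Order to the Web • The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages that link to A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1. • Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.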
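
A minimal sketch of how the formula on this slide can be evaluated by repeated substitution. The three-page link graph, the function name `pagerank`, the iteration count, and d = 0.85 are illustrative assumptions, not values taken from the slides. One caveat worth checking numerically: with the (1-d) term exactly as written, the ranks of a graph in which every page has an outgoing link sum to the number of pages; dividing (1-d) by the page count is the variant whose ranks sum to one.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T linking to A."""
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links to `page` contributes PR(T) / C(T),
            # where C(T) is the number of links going out of T.
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: every page has at least one outgoing link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 3))
print("sum:", round(sum(ranks.values()), 3))
```

Page C collects links from both A and B, so it ends up with the highest rank in this toy graph.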

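  7. ANCHOR TEXT • The text of links is treated in a special way in our search engine. • Anchors often describe a page more accurately than the page itself. • Anchors may exist for documents which cannot be indexed by a text-based search engine.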

  8. SYSTEM ANATOMY: ARCHITECTURE OVERVIEW

  9. MAJOR DATA STRUCTURES • A disk seek requires about 10 ms to complete. • Google is designed to avoid disk seeks whenever possible.

  10. BIG FILES • Virtual files spanning multiple file systems • Addressable by 64-bit integers • The package also handles allocation and deallocation of file descriptors

  11. REPOSITORY • Contains the full HTML of every web page • Each page is compressed using zlib • All the other data structures can be rebuilt from the repository alone
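
A minimal sketch of a zlib-compressed repository record, using Python's standard `zlib` and `struct` modules. The header layout (docID, URL length, page length), the helper names `pack_record` and `unpack_record`, and the sample page are simplifying assumptions for illustration, not the actual on-disk format.

```python
import struct
import zlib

# Assumed header: 64-bit docID, 16-bit URL length, 32-bit page length.
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Serialize one page and compress the whole packet with zlib."""
    url_bytes = url.encode("utf-8")
    html_bytes = html.encode("utf-8")
    packet = HEADER.pack(doc_id, len(url_bytes), len(html_bytes)) + url_bytes + html_bytes
    return zlib.compress(packet)


def unpack_record(blob):
    """Decompress a record and recover (docID, URL, HTML)."""
    packet = zlib.decompress(blob)
    doc_id, url_len, page_len = HEADER.unpack_from(packet)
    start = HEADER.size
    url = packet[start:start + url_len].decode("utf-8")
    html = packet[start + url_len:start + url_len + page_len].decode("utf-8")
    return doc_id, url, html


page = "<html><body>" + "hypertext " * 200 + "</body></html>"
blob = pack_record(7, "http://example.com/", page)
print(len(page), "characters stored in", len(blob), "bytes")
```

Because each record carries the docID, the URL, and the full page, a pass over such records is enough to regenerate derived structures, which is the property the last bullet relies on.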

  12. DOCUMENT INDEX • Each entry includes: • Document status • Pointer into the repository • Document checksum • Various statistics • A separate file converts URLs into docIDs
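
One way to convert URLs into docIDs is a table of (checksum, docID) pairs sorted by checksum and searched with binary search. The sketch below assumes that approach; the use of CRC-32 as the checksum, the function names, and the sample URLs are illustrative choices, and a real table would also need a policy for checksum collisions.

```python
import bisect
import zlib


def checksum(url):
    """CRC-32 of the URL; the choice of checksum here is an assumption."""
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table for the URL's checksum."""
    key = checksum(url)
    # (key,) sorts before every (key, docID) pair, so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_table({
    "http://example.com/": 1,
    "http://example.com/about": 2,
    "http://example.org/": 3,
})
print(lookup(table, "http://example.com/about"))
```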

  13. LEXICON • Fits in memory on a machine with 256 MB of main memory • Two parts: • List of words • Hash table of pointers

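  14. Hit List: • List of occurrences of a word in a document • Uses a hand-optimized compact encoding, chosen over Huffman coding because it needs far less bit manipulation. • Forward Index: • Stored in 64 barrels • Inverted Index: • The same barrels after they have been processed by the sorter • Doclists sorted by docID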
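
A minimal sketch of a compact two-byte hit, following the bit allocation the original paper describes for a plain hit: 1 capitalization bit, 3 font-size bits, and 12 position bits. The function names and the decision to clamp large positions to 4095 are assumptions made for this example.

```python
def encode_plain_hit(capitalized, font_size, position):
    """Pack a hit into 16 bits: 1 bit capitalization, 3 bits font, 12 bits position."""
    if not 0 <= font_size < 7:
        # 7 is treated as reserved here; the paper uses it to flag "fancy" hits.
        raise ValueError("font_size must be in 0..6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 0xFFF)  # assumption: clamp positions past 4095
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = encode_plain_hit(True, 3, 42)
print(hex(hit), decode_plain_hit(hit))
```

Packing each hit into a fixed 16-bit word keeps decoding to a few shifts and masks, which is the trade-off against Huffman coding noted on the slide.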

  15. CRAWLING THE WEB • Involves interacting with hundreds of thousands of web servers and various name servers • A single URLserver serves lists of URLs to a number of crawlers. • Both the URLserver and the crawlers are implemented in Python.

  16. INDEXING THE WEB • Parsing • Must handle a huge array of possible errors • Instead of using YACC to generate a CFG parser, flex is used to generate a lexical analyzer • Indexing Documents into Barrels • Every word is converted into a wordID • Sorting • The sorter generates the inverted index from the forward barrels
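
The barrel and sorting steps can be sketched roughly as follows: words are mapped to wordIDs through an in-memory lexicon, hits are grouped per document (a forward index), and the result is regrouped per wordID with doclists in docID order (an inverted index). Whitespace tokenization, plain dictionaries in place of barrels, and all names are assumptions made for this example.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Map docID -> {wordID: [positions]}, assigning new wordIDs as words appear."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Regroup by wordID; each doclist ends up sorted by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


lexicon = {}
docs = {2: "web search engine", 1: "web crawler"}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
print(inverted[lexicon["web"]])
```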

  17. SEARCHING • (1) Parse the query. • (2) Convert words into wordIDs. • (3) Seek to the start of the doclist in the short barrel for every word. • (4) Scan through the doclists until there is a document that matches all the search terms. • (5) Compute the rank of that document for the query. • (6) If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4. • (7) If we are not at the end of any doclist, go to step 4. • (8) Sort the documents that have matched by rank and return the top k.
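
The query steps on this slide can be sketched as a merge over docID-sorted doclists, first in the short barrels and then in the full barrels. This is a simplified illustration: the barrels are plain dictionaries from wordID to a sorted list of docIDs, the `rank` callable stands in for the real ranking function, both tiers are always scanned, and every name here is hypothetical.

```python
import heapq


def match_all(doclists):
    """Yield the docIDs that appear in every doclist (each sorted by docID)."""
    if not doclists:
        return
    cursors = [0] * len(doclists)
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            yield target
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are still behind the largest docID.
            cursors = [c + (doc_id < target) for c, doc_id in zip(cursors, current)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Parse, convert to wordIDs, scan short then full barrels, return the top k."""
    word_ids = [lexicon.get(word) for word in query.lower().split()]
    if not word_ids or None in word_ids:
        return []  # empty query, or a word that is not in the lexicon
    matched = set()
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(word_id, []) for word_id in word_ids]
        matched.update(match_all(doclists))
    return heapq.nlargest(k, matched, key=rank)


lexicon = {"web": 0, "search": 1}
short_barrel = {0: [1, 3], 1: [3]}
full_barrel = {0: [1, 2, 3, 5], 1: [2, 3, 5]}
scores = {2: 0.5, 3: 0.9, 5: 0.1}
results = search("web search", lexicon, short_barrel, full_barrel, scores.get, k=2)
print(results)
```

Keeping doclists sorted by docID is what makes the merge in `match_all` a single forward pass over each list.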

  18. The Ranking System • Factors: • Position • Font • Capitalization • PageRank • Proximity
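
A rough sketch of how such factors might be combined for a single-word query: hits are grouped by type, each type has a weight, counts stop adding value beyond a cap, and the resulting score is mixed with PageRank. Every number, the hit-type names, the hard cap, and the linear mix controlled by `alpha` are made-up assumptions for illustration; the slides do not give the actual weights or the combining function.

```python
# Hypothetical type weights: more prominent hit types count more.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "large_font": 3.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the hit count, then stop growing at an assumed cap."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of type weights and count weights."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())


def final_rank(hit_counts, pagerank, alpha=0.5):
    """Assumed linear mix of the IR score and PageRank."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank


print(final_rank({"title": 1, "plain": 3}, pagerank=2.0))
```

Capping the count keeps a page that repeats a word hundreds of times from dominating purely through repetition.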

  19. RESULTS AND PERFORMANCE

  20. STORAGE REQUIREMENTS

  21. SEARCH TIMES

  22. CONCLUSIONS • Google is designed to be a scalable search engine. • Its primary goal is to provide high-quality search results over a rapidly growing World Wide Web. • Google employs a number of techniques to improve search quality. • It is a complete architecture for gathering web pages, indexing them, and performing search queries.

  23. The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca K.ApazaH@gmail.com San Pablo Catholic University
