The Anatomy of a Large-Scale Hypertextual Web Search Engine
In: 7th International WWW Conference (1998)
Authors: Brin S. and Page L.
Presented By: Shiliang Xue
Introduction
Google is designed to scale well to extremely large data sets. This requires:
• Fast crawling technology
• Efficient use of storage space
• An efficient indexing system
• Fast query handling
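System Features
1. Google makes use of the link structure of the Web to calculate a quality ranking (PageRank) for each web page:
   PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
   where T1 ... Tn are the pages that link to page A, C(T) is the number of links going out of page T, and d is a damping factor between 0 and 1.
2. Google uses link (anchor) text to improve search results.
   • Most search engines associate the text of a link with the page that the link is on.
   • Google additionally associates the text of a link with the page that it points to.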
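The PageRank formula on this slide can be evaluated by simple repeated substitution. The following is a minimal sketch of that idea, not the production implementation: the function name `pagerank`, the dictionary-based graph representation, the default `d=0.85`, and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    `links` maps each page to the pages it links to (hypothetical input format).
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct outgoing links of each page.
    out_count = {p: len(set(links.get(p, ()))) for p in pages}

    # For each page, the list of pages that link to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    # Start every page at rank 1.0 and apply the formula repeatedly.
    ranks = {p: 1.0 for p in pages}
    for _ in range(iterations):
        ranks = {
            p: (1 - d) + d * sum(ranks[t] / out_count[t] for t in incoming[p])
            for p in pages
        }
    return ranks


if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

In this toy graph, page C is linked from both A and B, so it ends up ranked above B, which is linked only from A.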
System Features
Aside from PageRank and the use of anchor text, Google has several other features:
• Location information for all hits, so that proximity can be used in search
• Visual presentation details, such as the font size of words
• Full raw HTML of pages, available in a repository
System Anatomy
Architecture Overview
• Crawler
• URLserver
• Storeserver
• Indexer
• URLresolver
• Sorter
• Searcher
System Anatomy
Major Data Structures
• BigFiles
• Repository
• Document Index
• Lexicon
• Hit Lists
• Forward Index
• Inverted Index
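To make the relationship between the forward index and the inverted index concrete, here is a small sketch of turning one into the other. It is an illustration under simplifying assumptions, not the on-disk barrel format: the function name `invert`, the nested-dictionary layout, and the reduction of a hit to a plain word position are all hypothetical.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn docID -> {wordID: hits} into wordID -> [(docID, hits), ...].

    Each doclist comes out ordered by docID, because documents are
    visited in sorted order.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    # Return a plain dict ordered by wordID.
    return dict(sorted(inverted.items()))


if __name__ == "__main__":
    # Hypothetical forward index: hits are simplified to word positions.
    forward = {2: {7: [0], 9: [3]}, 1: {7: [1, 4]}}
    print(invert(forward))
```

The forward index answers "which words occur in this document", while the inverted index answers "which documents contain this word", which is the lookup a query needs.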
System Anatomy
Working Procedure
• Crawling the Web
• Indexing the Web
• Parsing
• Indexing Documents into Barrels
• Sorting
• Searching
• The Ranking System
• Feedback
System Anatomy
Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
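The query steps above can be sketched in a few lines. This is a simplified illustration, not the engine's actual code: it replaces the sequential doclist scan of steps 4 to 7 with a set intersection, assumes the full barrels also contain the documents found in the short barrels, and uses hypothetical names throughout (`search`, `lexicon`, `short_barrel`, `full_barrel`, and a caller-supplied `rank` function).

```python
import heapq


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Return the top-k (docID, score) pairs for documents matching every query word.

    lexicon: word -> wordID; each barrel: wordID -> list of docIDs (its doclist).
    rank: hypothetical scoring function rank(doc_id, word_ids) -> number.
    """
    words = query.lower().split()                       # step 1: parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                       # an unknown word matches nothing
    word_ids = [lexicon[w] for w in words]              # step 2: words -> wordIDs

    scored = {}
    for barrel in (short_barrel, full_barrel):          # steps 3 and 6: short, then full
        doclists = [set(barrel.get(w, ())) for w in word_ids]
        for doc_id in set.intersection(*doclists):      # step 4: docs matching all terms
            if doc_id not in scored:
                scored[doc_id] = rank(doc_id, word_ids)  # step 5: rank the document

    # step 8: sort matches by rank and keep the top k
    return heapq.nlargest(k, scored.items(), key=lambda item: item[1])


if __name__ == "__main__":
    lexicon = {"web": 1, "search": 2}
    short_barrel = {1: [10], 2: [10, 11]}
    full_barrel = {1: [10, 12, 13], 2: [10, 11, 12]}
    scores = {10: 3.0, 12: 1.0}
    print(search("web search", lexicon, short_barrel, full_barrel,
                 lambda doc_id, word_ids: scores.get(doc_id, 0.0), k=2))
```

Consulting the short barrels first means the documents most likely to be relevant are found early; a real scan could stop once enough matches have been collected, which this sketch does not attempt.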
Conclusion
• Google is designed to be a scalable search engine.
• The primary goal of Google is to provide high-quality search results.
• Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.