
Anatomy of a Large-Scale Hypertextual Web Search Engine

This presentation discusses the anatomy of a large-scale web search engine, focusing on the problems with existing search engines, motivation, methods, architecture, and major applications. It covers topics such as PageRank, anchor text, proximity, visual presentation, crawling, indexing, and searching.



Presentation Transcript


  1. A Presentation on The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page. Presented by Qian Liu, Computer and Information Sciences Department

  2. Problem • Size of the Web: • On the order of hundreds of terabytes • Still growing • Problems with search engines: • Alta Vista, Excite, etc.: • Return huge numbers of document entries • Too many low-quality or marginally relevant matches

  3. Problem • Yahoo: • Expensive • Slow to improve • Cannot cover all esoteric topics • Problems with users: • Inexperienced • Do not provide tightly constrained keywords

  4. Motivation and Applications • To improve the quality of web search engines • Scale to keep up with the growth of the Web • Academic search engine research • Current search engine technology: advertising-oriented • "Open" search engine • Support research activities on large-scale web data

  5. Methods • Basic Idea: • Q: "How can a search engine automatically identify high-quality web pages for my topic?" • A: Hypertextual information improves search precision: • Link structure • Anchor text • Proximity • Visual presentation

  6. Methods • PageRank: a citation-based measure of page importance • Link structure: latent human annotation of importance

  7.–14. [Figure-only slides; no text transcribed]

  15. Methods • PageRank • Why does PageRank work? • Users often want information from a "trusted" source • Collaborative trust • Inexpensive to compute • Allows fast updates • Fewer privacy implications • Only public information is used
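The citation-importance computation behind PageRank can be illustrated with a short power-iteration sketch. This is a toy graph with a simplified dangling-page rule, not the production algorithm; the damping factor 0.85 is the value reported in the paper:

```python
def pagerank(links, d=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # A page passes its rank evenly to the pages it links to.
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += d * share
            else:
                # Dangling page: spread its rank uniformly (one common fix).
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy graph: A and C both link to B, B links back to A, nothing links to C.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["B"]})
```

With this graph, B (two inlinks) outranks A (one inlink), which outranks C (none); the ranks stay normalized to sum to 1.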

  16. Methods • Anchor Text • Anchor text is associated not only with the page the link is on, but also with the page the link points to • Often gives accurate descriptions of web pages • Allows searching non-indexable web pages
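The idea of indexing anchor text under the link's target can be sketched with a hypothetical structure (the URLs and names here are illustrative, not Google's actual data structures):

```python
from collections import defaultdict

# anchor_index maps a target URL to the anchor texts of links pointing at it,
# so a page can be found by words it never contains.
anchor_index = defaultdict(list)

def record_link(source_url, target_url, anchor_text):
    # Associate the anchor text with the page the link POINTS TO,
    # not just the page the link is on.
    anchor_index[target_url].append(anchor_text)

record_link("http://a.example", "http://maps.example/campus.gif", "campus map")
record_link("http://b.example", "http://maps.example/campus.gif", "map of campus")

# The image has no indexable text of its own, yet it is now searchable:
hits = [target for target, anchors in anchor_index.items()
        if any("map" in a for a in anchors)]
```

This is how a text query can return pages (images, programs) that a purely content-based index could never match.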

  17. Methods • Proximity • Hits • Hit locations • Multi-word search: • Calculate proximity, i.e. how far apart the hits occur in the document (or anchor)
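Proximity for a multi-word query can be computed from the hit locations, e.g. as the smallest span of positions that covers one hit from each query word (a brute-force sketch, not the paper's exact scheme):

```python
from itertools import product

def min_proximity(hit_lists):
    """Smallest position span covering one hit from each query word.

    hit_lists holds one sorted list of hit locations per query word;
    a smaller span means the words occur closer together."""
    best = float("inf")
    for combo in product(*hit_lists):  # exhaustive; fine for short hit lists
        best = min(best, max(combo) - min(combo))
    return best

# Query "large scale": hit positions of each word in one document (or anchor)
span = min_proximity([[3, 40], [4, 90]])  # positions 3 and 4 are adjacent
```

Documents where the query words appear close together (small span) can then be ranked above documents where they are scattered.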

  18. Methods • Visual Presentation • Font size: larger/bolder fonts get higher weights • Capitalization gets higher weights

  19. Methods Architecture and Major Data Structures

  20. Methods • Major Applications: • Crawling • Indexing • Searching

  21. Crawling • How a crawler works: • Works within a defined webspace • Requests URLs • Stores the returned objects into a file system • Examines the content of each object • Scans for HTML anchor tags <A..> • Ignores URLs not conforming to the specified rules; visits URLs conforming to the rules
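The crawl loop above can be sketched in a few lines of Python. This is a minimal, synchronous sketch (nothing like Google's distributed crawler); the `fetch` and `in_webspace` callables are assumptions supplied by the caller:

```python
import re
from urllib.parse import urljoin

ANCHOR_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)

def crawl(start_url, fetch, in_webspace, limit=100):
    """Breadth-first crawl: fetch(url) -> html, in_webspace(url) -> bool."""
    seen, frontier, stored = {start_url}, [start_url], {}
    while frontier and len(stored) < limit:
        url = frontier.pop(0)
        html = fetch(url)                     # request the URL
        stored[url] = html                    # store the returned object
        for href in ANCHOR_RE.findall(html):  # scan for <A ...> anchor tags
            link = urljoin(url, href)
            # Ignore URLs not conforming to the webspace rules.
            if in_webspace(link) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored

# Toy "web": a dict standing in for the network.
pages = {
    "http://x/": '<a href="/a">A</a> <a href="http://other/">O</a>',
    "http://x/a": "",
}
stored = crawl("http://x/",
               fetch=lambda u: pages.get(u, ""),
               in_webspace=lambda u: u.startswith("http://x/"))
```

Here the link to `http://other/` falls outside the defined webspace and is ignored, while `/a` is resolved against the base URL and visited.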

  22. Crawling • Google’s web Crawling System: • Fast distributed crawling system • URLServer serves URLs to crawlers • Each crawler keeps 300 connections open at once • Different states: • 1. Looking up DNS • 2. Connecting to host • 3. Sending request • 4. Receiving response

  23. Indexing • Uses Flex to generate a lexical analyzer • Parses each document • Converts words into wordIDs • Converts each document into a set of hits • The sorter sorts the result by wordID to generate the inverted index
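The lexicon/wordID/hit pipeline can be sketched as follows, with a simple regex tokenizer standing in for the Flex-generated lexer (the structures are simplified stand-ins for the paper's barrels and lexicon):

```python
import re

lexicon = {}  # word -> wordID

def word_id(word):
    """Assign the next wordID on first sight of a word."""
    return lexicon.setdefault(word, len(lexicon))

def parse(doc_id, text):
    """Convert a document into hits: (wordID, docID, position)."""
    return [(word_id(w), doc_id, pos)
            for pos, w in enumerate(re.findall(r"[a-z]+", text.lower()))]

# Forward pass: hits come out grouped by document.
hits = parse(0, "web search engine") + parse(1, "large scale web crawler")

# Sorter: re-sort by wordID to produce the inverted index.
inverted = {}
for wid, doc, pos in sorted(hits):
    inverted.setdefault(wid, []).append((doc, pos))
```

After the sort, each wordID maps to a doclist of (docID, position) pairs, which is exactly what the searcher scans at query time.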

  24. Searching • 1. Seek to the start of the doclist for every word • 2. Scan through the doclists until there is a document that matches all the search terms • 3. Compute the rank of that document for the query • 4. If we are not at the end of any doclist, go to step 2 • 5. Sort the documents that have matched by rank and return the top k
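The steps above amount to intersecting the query words' doclists and ranking the matches. A set-based simplification (the real searcher scans sorted doclists in parallel; the rank function here is a stub):

```python
def search(query_words, inverted, rank, k=10):
    """Intersect doclists and return the top-k matching docIDs by rank."""
    doclists = [inverted[w] for w in query_words]  # step 1: one doclist per word
    matches = set(doclists[0])
    for dl in doclists[1:]:                        # steps 2 & 4: docs in ALL lists
        matches &= set(dl)
    scored = [(rank(d, query_words), d) for d in matches]     # step 3
    return [d for _, d in sorted(scored, reverse=True)[:k]]   # step 5

inverted = {"web": [1, 2, 3], "search": [2, 3, 5]}
top = search(["web", "search"], inverted,
             rank=lambda d, q: -d)  # stub rank: prefer lower docIDs
```

Only documents 2 and 3 appear in both doclists, so only they are ranked and returned.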

  25. Searching • Ranking: • Ranking Function: • PageRank • Type weight • Count weight • Proximity
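How type weights, count weights, and PageRank might combine can be sketched as below. The weights and the log taper are illustrative assumptions; the paper does not publish its actual values:

```python
import math

# Illustrative hit-type weights: title and anchor hits count more than
# plain-text hits (assumed values, not the paper's).
TYPE_WEIGHT = {"title": 8, "anchor": 6, "plain": 1}

def ir_score(hits):
    """Count-weighted sum over hit types; the log makes count weight
    taper off, so endlessly repeating a word stops helping."""
    return sum(TYPE_WEIGHT[t] * math.log1p(c) for t, c in hits.items())

def final_rank(hits, pagerank):
    # Combine content relevance (IR score) with link-based importance.
    return ir_score(hits) * pagerank

# One title hit on a reputable page beats many plain-text hits on an obscure one:
a = final_rank({"title": 1, "plain": 3}, pagerank=0.4)
b = final_rank({"plain": 10}, pagerank=0.1)
```

The multiplicative combination is one plausible choice; the point is only that both content evidence and PageRank feed the final rank.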

  26. Results • A search on “bill clinton”: • High-quality pages • Non-crawlable pages • No results about a Bill other than Clinton • No results about a Clinton other than Bill

  27. Comparison with Other Search Engines • 1. Breadth-first search vs. depth-first search • 2. Comparison with WebCrawler: • WebCrawler: Files that the WebCrawler cannot index, such as pictures, sounds, etc., are not retrieved. • Google: Uses anchor text • 3. Number of crawlers: • WebCrawler: 15 • Google: typically 3

  28. Comparison with Other Search Engines (continued) • 4. Quantity vs. quality • Alta Vista: Favors quantity • Google: Provides quality search

  29. Weak Points of Study • 1. To limit response time, once a certain number of matching documents are found, the searcher stops scanning, sorts, and returns results, yielding sub-optimal results. • 2. Lacks features such as boolean operators, negation, etc. • 3. Search efficiency: no optimizations such as query caching, subindices on common terms, and other common optimizations.

  30. Suggestions for Future Study • 1. Using link structure: when calculating PageRank, exclude links between two pages within the same web domain (such links often serve navigation functions and do not confer authority). • 2. Personalize PageRank by increasing the weight of a user’s homepage or bookmarks. “99% of the Web information is useless to 99% of the Web users.”

  31. Suggestions for Future Study (continued) • 3. Make use of hubs, i.e. collections of links to authorities. • 4. In addition to anchor text, use the text surrounding links, too.

  32. Conclusions • Quality search results • Techniques: • PageRank • Anchor text • Proximity • A complete architecture for crawling, indexing, and searching
