Hinrich Schütze and Christina Lioma Lecture 20: Crawling


Presentation Transcript


  1. Hinrich Schütze and Christina Lioma Lecture 20: Crawling

  2. Overview • A simple crawler • A real crawler

  3. Outline • A simple crawler • A real crawler

  4. Basic crawler operation • Initialize queue with URLs of known seed pages • Repeat • Take URL from queue • Fetch and parse page • Extract URLs from page • Add URLs to queue • Fundamental assumption: The web is well linked.
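The loop on slide 4 can be sketched in a few lines of Python. This is only an illustration (the names basic_crawl and LinkExtractor are ours, and it ignores politeness, robots.txt and duplicate detection, which the rest of the lecture addresses):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request


class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def basic_crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # initialize queue with seed pages
    seen = set(seed_urls)                    # avoid adding the same URL twice
    while frontier and max_pages > 0:
        url = frontier.popleft()             # take URL from queue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                         # unfetchable page: skip it
        parser = LinkExtractor()
        parser.feed(html)                    # parse page
        for link in parser.links:            # extract URLs from page
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)    # add URLs to queue
        max_pages -= 1
```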

  5. Magnitude of the crawling problem • To fetch 20,000,000,000 pages in one month . . . • . . . we need to fetch almost 8000 pages per second! • Actually: many more, since many of the pages we attempt to crawl will be duplicates, unfetchable, spam etc.
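The figure on slide 5 is simple arithmetic, assuming a 30-day month:

```python
pages = 20_000_000_000                  # pages to fetch in one month
seconds_per_month = 30 * 24 * 3600      # 2,592,000 seconds
print(pages / seconds_per_month)        # ~7716, i.e. almost 8000 pages per second
```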

  6. What a crawler must do

  7. Robots.txt • Protocol for giving crawlers (“robots”) limited access to a website • Examples: • User-agent: * • Disallow: /yoursite/temp/ • User-agent: searchengine • Disallow: / • Important: cache the robots.txt file of each site we are crawling
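Python's standard library ships a robots.txt parser (urllib.robotparser). A minimal sketch of checking and caching robots.txt per host might look as follows; the cache layout and the 24-hour TTL are assumptions, not part of the lecture:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

ROBOTS_TTL = 24 * 3600        # assumed: re-fetch a site's robots.txt once a day
_robots_cache = {}            # host -> (RobotFileParser, time fetched)

def allowed(url, user_agent="*"):
    """Return True if robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    host = parts.netloc
    cached = _robots_cache.get(host)
    if cached is None or time.time() - cached[1] > ROBOTS_TTL:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{host}/robots.txt")
        try:
            rp.read()                 # fetch and parse the site's robots.txt
        except OSError:
            pass                      # network error; a real crawler would retry or back off
        cached = (rp, time.time())
        _robots_cache[host] = cached
    return cached[0].can_fetch(user_agent, url)
```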

  8. Example of a robots.txt (nih.gov) User-agent: PicoSearch/1.0 Disallow: /news/information/knight/ Disallow: /nidcd/ ... Disallow: /news/research_matters/secure/ Disallow: /od/ocpl/wag/ User-agent: * Disallow: /news/information/knight/ Disallow: /nidcd/ ... Disallow: /news/research_matters/secure/ Disallow: /od/ocpl/wag/ Disallow: /ddir/ Disallow: /sdminutes/

  9. What any crawler should do • Be capable of distributed operation • Be scalable: need to be able to increase crawl rate by adding more machines and extra bandwidth • Performance and efficiency: make efficient use of various system resources • Fetch pages of higher quality first • Continuous operation: get fresh versions of already crawled pages • Extensible: crawlers should be designed to be extensible

  10. Outline • A simple crawler • A real crawler

  11. Basic crawl architecture

  12. URL frontier

  13. URL frontier • The URL frontier is the data structure that holds and manages the URLs we have seen but not yet crawled. • Can include multiple pages from the same host • Must avoid trying to fetch them all at the same time • Must keep all crawling threads busy

  14. URL normalization • Some URLs extracted from a document are relative URLs. • E.g., at http://mit.edu, we may have aboutsite.html • This is the same as: http://mit.edu/aboutsite.html • During parsing, we must normalize (expand) all relative URLs.
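A sketch of this expansion using urllib.parse; dropping the #fragment is our addition (common practice, though not mentioned on the slide):

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url, href):
    """Expand a (possibly relative) link found on base_url into an absolute URL."""
    absolute = urljoin(base_url, href)          # relative -> absolute
    absolute, _fragment = urldefrag(absolute)   # drop any #fragment
    return absolute

print(normalize("http://mit.edu", "aboutsite.html"))   # http://mit.edu/aboutsite.html
```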

  15. Content seen • For each page fetched: check if the content is already in the index • Check this using document fingerprints • Skip documents whose content has already been indexed
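A minimal content-seen test, assuming an exact-duplicate fingerprint (here a SHA-1 hash of the page bytes; real crawlers may use other fingerprinting or near-duplicate schemes):

```python
import hashlib

seen_fingerprints = set()

def content_seen(content: bytes) -> bool:
    """Return True if this page content has already been indexed."""
    fp = hashlib.sha1(content).hexdigest()   # document fingerprint (exact duplicates only)
    if fp in seen_fingerprints:
        return True                          # skip: content already in the index
    seen_fingerprints.add(fp)
    return False
```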

  16. Distributing the crawler • Run multiple crawl threads, potentially at different nodes • Usually geographically distributed nodes • Partition hosts being crawled into nodes
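One simple way to partition hosts across crawl nodes is to hash the host name, so that every URL of a given host lands on the same node; the node count and hash choice below are illustrative assumptions:

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4    # assumed number of crawl nodes

def node_for(url):
    """Map a URL's host to one crawl node, so all URLs of a host are
    crawled (and politeness is enforced) at a single node."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES
```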

  17. Distributed crawler

  18. URL frontier: Two main considerations • Politeness: Don’t hit a web server too frequently • E.g., insert a time gap between successive requests to the same server • Freshness: Crawl some pages (e.g., news sites) more often than others

  19. Mercator URL frontier • URLs flow in from the top into the frontier. • Front queues manage prioritization. • Back queues enforce politeness. • Each queue is FIFO.

  20. Mercator URL frontier: Front queues • Prioritizer assigns to each URL an integer priority between 1 and F. • Then appends the URL to the corresponding queue • Heuristics for assigning priority: refresh rate, PageRank etc.
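A sketch of the prioritizer and front queues; the priority function here is a random placeholder, whereas a real prioritizer would use refresh rate, PageRank, etc. as the slide says:

```python
import random
from collections import deque

F = 10                                      # assumed number of front queues
front_queues = [deque() for _ in range(F)]  # one FIFO queue per priority level

def priority(url):
    """Placeholder: a real prioritizer uses refresh rate, PageRank, etc."""
    return random.randint(1, F)             # integer priority between 1 and F

def add_to_frontier(url):
    front_queues[priority(url) - 1].append(url)   # append to the corresponding front queue
```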

  21. Mercator URL frontier: Front queues • Selection from front queues is initiated by back queues • Pick a front queue from which to select the next URL: round robin, randomly, or a more sophisticated variant • But with a bias in favor of high-priority front queues
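One possible biased selection scheme; the linear weighting and the assumption that index 0 holds the highest-priority queue are ours, since the slide leaves the exact bias open:

```python
import random

def pick_front_queue(front_queues):
    """Pick a non-empty front queue, biased in favor of high-priority queues
    (assuming index 0 holds the highest priority). Round robin or a purely
    random pick would also match the slide."""
    F = len(front_queues)
    non_empty = [i for i in range(F) if front_queues[i]]
    if not non_empty:
        return None
    weights = [F - i for i in non_empty]    # linear bias toward high-priority queues
    return random.choices(non_empty, weights=weights, k=1)[0]
```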

  22. Mercator URL frontier: Back queues • Invariant 1. Each back queue is kept non-empty while the crawl is in progress. • Invariant 2. Each back queue only contains URLs from a single host. • Maintain a table from hosts to back queues.

  23. Mercator URL frontier: Back queues • In the heap: • One entry for each back queue • The entry is the earliest time t_e at which the host corresponding to the back queue can be hit again. • The earliest time t_e is determined by (i) the last access to that host and (ii) the time gap heuristic
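A sketch of this heap bookkeeping with Python's heapq; the fixed 2-second gap is an assumed politeness heuristic:

```python
import heapq
import time

MIN_GAP = 2.0        # assumed time gap heuristic (fixed 2 seconds per host)
heap = []            # one entry per back queue: (t_e, back_queue_id)

def release_back_queue(queue_id, last_access_time):
    """After a host has been accessed, record the earliest time t_e at which
    it may be hit again: (i) the last access time plus (ii) the time gap."""
    t_e = last_access_time + MIN_GAP
    heapq.heappush(heap, (t_e, queue_id))
```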

  24. Mercator URL frontier: Back queues • How the fetcher interacts with back queues: • Repeat (i) extract current root q of the heap (q is a back queue) • and (ii) fetch URL u at head of q . . .
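Putting the pieces together, a fetcher thread's interaction with the back queues might look like the following sketch; back_queues and fetch are placeholders for the crawler's own structures:

```python
import heapq
import time

MIN_GAP = 2.0   # assumed politeness gap between two requests to the same host

def fetch_loop(heap, back_queues, fetch):
    """Sketch of one fetcher thread. back_queues maps a back-queue id to a
    deque of URLs for a single host; fetch is whatever download routine
    the crawler uses."""
    while heap:
        t_e, q = heapq.heappop(heap)       # (i) extract the current root q of the heap
        wait = t_e - time.time()
        if wait > 0:
            time.sleep(wait)               # respect the earliest-access time t_e for q's host
        url = back_queues[q].popleft()     # (ii) fetch the URL u at the head of q
        fetch(url)
        if back_queues[q]:                 # q still has URLs for this host
            heapq.heappush(heap, (time.time() + MIN_GAP, q))
        # if q is now empty, it must be refilled from the front queues
        # before it re-enters the heap (see the next slide)
```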

  25. Mercator URL frontier: Back queues • When we have emptied a back queue q: • Repeat (i) pull URLs u from the front queues and (ii) add u to its corresponding back queue . . . • . . . until we get a u whose host does not have a back queue. • Then put u in q and create a heap entry for it.
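A sketch of this refill step; host_to_queue is the host-to-back-queue table from slide 22, the front-queue selection here is the simplest possible one, and a complete implementation would also drop the emptied queue's old host from that table:

```python
import heapq
import time
from urllib.parse import urlparse

MIN_GAP = 2.0   # assumed politeness gap

def refill_back_queue(q, front_queues, back_queues, host_to_queue, heap):
    """Back queue q has just been emptied: pull URLs u from the front queues
    and route them to their hosts' back queues, until we find a u whose host
    has no back queue yet; that host takes over q."""
    while True:
        fq = next((f for f in front_queues if f), None)   # simplest selection; a real crawler
        if fq is None:                                    # biases toward high-priority queues
            return                                        # nothing left to pull
        u = fq.popleft()
        host = urlparse(u).netloc
        if host in host_to_queue:
            back_queues[host_to_queue[host]].append(u)    # u's host already has a back queue
        else:
            host_to_queue[host] = q                       # u's host takes over the emptied queue q
            back_queues[q].append(u)
            heapq.heappush(heap, (time.time() + MIN_GAP, q))  # create a heap entry for q
            return
```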

  26. Mercator URL frontier • URLs flow in from the top into the frontier. • Front queues manage prioritization. • Back queues enforce politeness.
