Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
International Journal of World Wide Web, v.2(4), p.219-229, Dec. 1999
Sun Woo Kim, May 23, 2006
Content
• Extensibility
• Crawler traps and other hazards
• Results of an extended crawl
• Conclusions
Extensibility
• Extensibility: extending the crawler with new functionality
  • New protocol and processing modules
  • Different versions of most of its major components
• Ingredients
  • Interface: an abstract class
  • Mechanism: a configuration file
  • Infrastructure
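The "mechanism" ingredient can be sketched as follows: a configuration file names the concrete class to use for each pluggable component, and the crawler loads it by name. Mercator itself is written in Java; this is a minimal Python sketch, and the file layout, the `[components]` section, and the use of `collections.deque` as a stand-in frontier are all assumptions made for illustration, not Mercator's actual configuration format.

```python
import configparser
import importlib

# Hypothetical configuration: one line per pluggable component,
# mapping a role to the dotted name of the class that implements it.
SAMPLE_CONFIG = """
[components]
frontier = collections.deque
"""


def load_class(dotted_name):
    """Import and return the class named by 'package.module.ClassName'."""
    module_name, _, class_name = dotted_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def load_components(config_text):
    """Return a dict mapping each configured role to its implementing class."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {role: load_class(name) for role, name in parser["components"].items()}


components = load_components(SAMPLE_CONFIG)
frontier = components["frontier"]()  # instantiate whichever class was configured
```

Swapping in a different implementation then means editing the configuration only; the code that uses `frontier` does not change as long as the replacement honors the same interface.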
Protocol and processing modules
• Abstract Protocol class
  • fetch method: downloads the document
  • newURL method: parses a given string as a URL
• Abstract Analyzer class
  • process method: processes the document appropriately
• Different Analyzer subclasses
  • GifStats
  • TagCounter
  • WebLinter: runs the Weblint program
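The two abstract classes above might look roughly like this. It is a Python sketch rather than Mercator's Java code: the method names follow the slide (`newURL` becomes `new_url`), but the signatures are assumptions, and the `TagCounter` shown here simply counts HTML start tags, which is a guess at its behavior based on its name.

```python
from abc import ABC, abstractmethod
from collections import Counter
from html.parser import HTMLParser


class Protocol(ABC):
    """Interface for a protocol module (one subclass per URL scheme)."""

    @abstractmethod
    def fetch(self, url):
        """Download the document at url and return its contents."""

    @abstractmethod
    def new_url(self, text):
        """Parse a given string into this protocol's URL representation."""


class Analyzer(ABC):
    """Interface for a processing module applied to downloaded documents."""

    @abstractmethod
    def process(self, document):
        """Process the downloaded document appropriately."""


class _StartTagParser(HTMLParser):
    """Helper that increments a shared counter for every start tag seen."""

    def __init__(self, counts):
        super().__init__()
        self._counts = counts

    def handle_starttag(self, tag, attrs):
        self._counts[tag] += 1


class TagCounter(Analyzer):
    """Illustrative Analyzer subclass: tallies HTML start tags across documents."""

    def __init__(self):
        self.counts = Counter()

    def process(self, document):
        _StartTagParser(self.counts).feed(document)


counter = TagCounter()
counter.process("<html><body><p>one</p><p>two</p></body></html>")
```

Because the crawler only ever calls `process`, a new analysis can be added by writing another `Analyzer` subclass without touching the crawler core.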
Alternative URL frontier
• Drawback of the default frontier on an intranet
  • Multiple hosts might be assigned to the same thread
• Solution
  • A URL frontier component that dynamically assigns hosts to worker threads
  • Maximizes the number of busy worker threads
  • Is well suited to host-limited crawls
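One way to picture dynamic host assignment: the frontier keeps a queue per host, and any idle worker may claim any host that has pending URLs and is not already being served by another worker. The class and method names below (`DynamicHostFrontier`, `claim`, `release`) are hypothetical, and the sketch assumes that at most one worker should contact a given host at a time; it is not Mercator's implementation.

```python
import threading
from collections import deque
from urllib.parse import urlsplit


class DynamicHostFrontier:
    """Per-host URL queues; hosts are handed to workers on demand."""

    def __init__(self):
        self._queues = {}            # host -> deque of pending URLs
        self._busy = set()           # hosts currently claimed by some worker
        self._lock = threading.Lock()

    def add(self, url):
        host = urlsplit(url).hostname
        with self._lock:
            self._queues.setdefault(host, deque()).append(url)

    def claim(self):
        """Return (host, url) for an idle host with pending work, or None."""
        with self._lock:
            for host, queue in self._queues.items():
                if queue and host not in self._busy:
                    self._busy.add(host)
                    return host, queue.popleft()
        return None

    def release(self, host):
        """Mark the host idle again once the worker has finished its request."""
        with self._lock:
            self._busy.discard(host)
```

With a static assignment, two hosts mapped to the same thread would be served one after the other while other threads sit idle; here a second worker calling `claim` simply receives the other host.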
As a random walker
• Random walker
  • Starts at a random page taken from a set of seeds
  • The next page is selected by choosing a random link from the current page
• Differences from a normal crawl
  • A page may be revisited multiple times
  • Only one link is followed each time
• To support random walking
  • A new URL frontier
    • Records only the URLs discovered in the most recently fetched file
  • Document fingerprint set
    • Never rejects documents as already having been seen
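The two replacement components for random walking can be sketched as below. The names are hypothetical, and falling back to a random seed when the current page has no links is an assumption added so the sketch always has something to return; the slide does not say how dead ends are handled.

```python
import random


class RandomWalkFrontier:
    """Frontier that remembers only the links of the most recently fetched page."""

    def __init__(self, seeds, rng=None):
        self._seeds = list(seeds)
        self._links = []
        self._rng = rng or random.Random()

    def record_links(self, links):
        # Replace, rather than extend: links from older pages are forgotten.
        self._links = list(links)

    def next_url(self):
        # Start from (or, by assumption, fall back to) a random seed.
        pool = self._links or self._seeds
        return self._rng.choice(pool)


class NeverSeenFingerprintSet:
    """Fingerprint set that never reports a document as already seen."""

    def add(self, fingerprint):
        pass  # nothing is remembered, so pages can be revisited

    def contains(self, fingerprint):
        return False
```

Each step follows exactly one link, and because the fingerprint set never rejects anything, the walker is free to return to pages it has already visited.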
URL aliases
• Four causes
  • Host name aliases: canonicalize
    • coke.com and cocacola.com both resolve to 203.134.241.178
  • Omitted port numbers: use the default value, 80
  • Alternative paths on the same host: cannot be avoided
    • digital.com/index.html and digital.com/home.html
  • Replication across different hosts: cannot be avoided
    • Mirror sites
• Where aliases cannot be avoided: rely on the content-seen test
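The first two causes can be handled by canonicalizing URLs before they are compared. The sketch below is an illustration only: `HOST_ALIASES` is a hypothetical hand-written table standing in for whatever host name resolution the crawler actually performs, and only HTTP's default port is covered.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80}

# Hypothetical alias table: maps an alias to the host name chosen as canonical.
HOST_ALIASES = {"cocacola.com": "coke.com"}


def canonicalize(url):
    """Return url with a canonical host name and an explicit port number."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname drops any user info
    host = HOST_ALIASES.get(host, host)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    netloc = host if port is None else f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))


a = canonicalize("http://Coke.com/index.html")
b = canonicalize("http://cocacola.com:80/index.html")
same_document = a == b
```

Alternative paths such as digital.com/index.html and digital.com/home.html still canonicalize to different strings, which is why those cases are left to the content-seen test.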
Session IDs embedded in URLs
• Session identifiers
  • Used to track the browsing behavior of visitors
  • Create a potentially infinite set of URLs
  • Represent a special case of alternative paths
• Handled by the document fingerprinting technique
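A content-seen test based on document fingerprints might be sketched like this. SHA-256 is used here purely as a convenient stand-in fingerprint function; the slide does not say which fingerprinting scheme is used, and the class and method names are hypothetical.

```python
import hashlib


class ContentSeenTest:
    """Remembers a fingerprint of every document processed so far."""

    def __init__(self):
        self._fingerprints = set()

    def is_new(self, document: bytes) -> bool:
        """Return True the first time a given document body is seen."""
        fingerprint = hashlib.sha256(document).digest()
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


seen = ContentSeenTest()
page = b"<html><body>Welcome</body></html>"
first = seen.is_new(page)    # fetched via a URL carrying one session ID
second = seen.is_new(page)   # same bytes fetched via a different session ID
```

The test only catches pages whose bytes are identical. If the server writes the session ID into the page body itself, the fingerprints differ and the duplicate goes undetected.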
Crawler traps
• Crawler trap
  • Causes a crawler to crawl indefinitely
  • Unintentional: symbolic links
  • Intentional: traps built with CGI programs
    • Antispam traps, traps to catch search engine crawlers
• Solution
  • No automatic technique
  • But traps are easily noticed
  • Manually exclude the site
    • Using the customizable URL filter
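Manual exclusion through a URL filter can be as simple as a list of hosts the operator has flagged. The sketch below is a hypothetical filter, not Mercator's own; `trap.example` and `ok.example` are placeholder host names.

```python
from urllib.parse import urlsplit


class HostExclusionFilter:
    """URL filter that rejects every URL on a manually excluded host."""

    def __init__(self, excluded_hosts=()):
        self._excluded = {host.lower() for host in excluded_hosts}

    def exclude(self, host):
        """Add a host after an operator has noticed a trap on it."""
        self._excluded.add(host.lower())

    def allows(self, url):
        """Return True if the crawler may download this URL."""
        host = urlsplit(url).hostname   # already lowercased by urlsplit
        return host is not None and host not in self._excluded


url_filter = HostExclusionFilter()
url_filter.exclude("trap.example")
blocked = not url_filter.allows("http://trap.example/a/b/a/b/a/b")
```

The crawler would consult `allows` before adding a URL to the frontier, so excluding a host takes effect for all URLs discovered afterwards.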
Performance
• Hardware: Digital Ultimate Workstation
  • Two 533 MHz Alpha processors
  • 2 GB of RAM and 118 GB of local disk
• Crawl run in May 1999
  • 77.4 million HTTP requests in 8 days
  • 112 docs/sec and 1,682 KB/sec
• CPU cycles
  • 37%: JIT-compiled Java bytecode
  • 19%: Java runtime
  • 44%: Unix kernel
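The reported figures are consistent with one another, which a few lines of arithmetic can confirm. The average transfer per request computed at the end is derived here from the two reported rates; it is not a number stated on the slide, and it treats "8 days" as exactly 8 × 24 hours.

```python
http_requests = 77.4e6            # total HTTP requests reported
crawl_seconds = 8 * 24 * 60 * 60  # 8 days, assumed to be exact

# Average request rate over the whole crawl.
docs_per_sec = http_requests / crawl_seconds

# Derived figure: average kilobytes transferred per request.
kb_per_doc = 1682 / 112

# The three CPU shares should account for all cycles.
cpu_total = 37 + 19 + 44

print(f"{docs_per_sec:.1f} docs/sec, {kb_per_doc:.1f} KB per request, {cpu_total}% CPU")
```

Rounded to whole documents, the computed rate matches the 112 docs/sec on the slide, and the two rates together imply roughly 15 KB transferred per request.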
Selected Web statistics (1)
• Relationship between URLs and HTTP requests
Selected Web statistics (2)
• Breakdown of HTTP status codes (annotated "relatively low")
Selected Web statistics (3)
• Size of successfully downloaded documents (annotated "80%")
Selected Web statistics (4)
• Distribution of MIME types
Conclusions
• Use of Java
  • Made implementation easier and more elegant
  • Threads, garbage collection, objects, exceptions, etc.
• Scalability
• Extensibility