Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
World Wide Web, vol. 2, no. 4, pp. 219-229, Dec. 1999.
Presented May 23, 2006 by Sun Woo Kim
Content
• Extensibility
• Crawler traps and other hazards
• Results of an extended crawl
• Conclusions
Extensibility
• Extensibility
  • Mercator can be extended with new functionality
  • New protocol and processing modules
  • Different versions of most of its major components
• Ingredients
  • Interface: an abstract class
  • Mechanism: a configuration file
  • Infrastructure
Protocol and processing modules
• Abstract Protocol class
  • fetch method: downloads the document
  • newURL method: parses a given URL string
• Abstract Analyzer class
  • process method: processes the downloaded document appropriately
• Different Analyzer subclasses
  • GifStats
  • TagCounter
  • WebLinter: runs the Weblint program
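The module interfaces above can be sketched roughly as follows. This is an illustrative Python sketch, not Mercator's actual code (which is Java); only the class and method names come from the slides, and the toy TagCounter body is an assumption about what such a module might do.

```python
from abc import ABC, abstractmethod
import re

class Protocol(ABC):
    """Sketch of the abstract Protocol class named on the slide."""
    @abstractmethod
    def fetch(self, url: str) -> bytes:
        """Download the document addressed by url."""
    @abstractmethod
    def new_url(self, spec: str) -> str:
        """Parse a protocol-specific URL string."""

class Analyzer(ABC):
    """Sketch of the abstract Analyzer class named on the slide."""
    @abstractmethod
    def process(self, url: str, content: bytes) -> None:
        """Process a downloaded document appropriately."""

class TagCounter(Analyzer):
    """Toy stand-in for the TagCounter module: counts opening HTML tags."""
    def __init__(self):
        self.counts = {}
    def process(self, url, content):
        for tag in re.findall(rb"<\s*([a-zA-Z][a-zA-Z0-9]*)", content):
            key = tag.lower().decode()
            self.counts[key] = self.counts.get(key, 0) + 1
```

New modules plug in by subclassing one of the abstract classes and being registered via the configuration file.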
Alternative URL frontier
• Drawback on an intranet
  • Multiple hosts might be assigned to the same worker thread
• Solution
  • A URL frontier component that dynamically assigns hosts to threads
  • Maximizes the number of busy worker threads
  • Well-suited to host-limited crawls
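The dynamic-assignment idea can be sketched like this. A minimal sketch, assuming one queue per host and a "busy" set of hosts currently held by a thread; the class name and internals are hypothetical, not Mercator's implementation.

```python
import collections
import threading

class DynamicHostFrontier:
    """Hands an idle thread the next host no other thread is working on,
    so threads stay busy even when the crawl covers only a few hosts."""
    def __init__(self):
        self._lock = threading.Lock()
        self._queues = collections.defaultdict(collections.deque)  # host -> URLs
        self._busy = set()  # hosts currently assigned to some thread

    def add(self, host, url):
        with self._lock:
            self._queues[host].append(url)

    def next_url(self):
        """Return (host, url) for an unassigned, non-empty host, else None."""
        with self._lock:
            for host, queue in self._queues.items():
                if queue and host not in self._busy:
                    self._busy.add(host)
                    return host, queue.popleft()
            return None

    def done_with(self, host):
        """Release a host so another thread may be assigned to it."""
        with self._lock:
            self._busy.discard(host)
```

Because a host is released only when its current thread finishes, politeness (one connection per host) is preserved while idle threads immediately pick up any other pending host.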
As a random walker
• Random walker
  • Starts at a random page taken from a set of seeds
  • The next page is selected by choosing a random link from the current page
• Differences from a standard crawl
  • A page may be revisited multiple times
  • Only one link is followed each time
• To support random walking
  • A new URL frontier that records only the URLs discovered in the most recently fetched page
  • A document fingerprint set that never rejects documents as already having been seen
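The random-walk configuration above can be sketched as a short loop. This is an illustrative sketch, not the paper's code; `links_of(url)` is a hypothetical callback standing in for "fetch the page and extract its links".

```python
import random

def random_walk(seeds, links_of, steps, rng=None):
    """Follow one random link at a time; the 'frontier' holds only the
    links of the most recently fetched page, and revisits are allowed."""
    rng = rng or random.Random()
    page = rng.choice(list(seeds))
    visited = [page]
    for _ in range(steps):
        frontier = links_of(page)      # only the latest page's out-links
        if not frontier:               # dead end: restart from a random seed
            page = rng.choice(list(seeds))
        else:
            page = rng.choice(list(frontier))
        visited.append(page)           # pages may appear multiple times
    return visited
```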
URL aliases
• Four causes of URL aliasing
  • Host name aliases (can be canonicalized)
    • coke.com and cocacola.com both resolve to 203.134.241.178
  • Omitted port numbers (default value: 80)
  • Alternative paths on the same host (cannot be avoided)
    • digital.com/index.html and digital.com/home.html
  • Replication across different hosts (cannot be avoided)
    • Mirror sites
• The unavoidable cases are caught by the content-seen test
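The two avoidable causes can be handled by URL normalization, sketched below. A minimal sketch, assuming HTTP only: lower-case the host and drop an explicit default port 80. Host-name aliases and mirrors need DNS resolution or the content-seen test and are deliberately not handled here.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Normalize host case and the default HTTP port in a URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep an explicit port only when it is not the HTTP default (80).
    if port is not None and not (parts.scheme == "http" and port == 80):
        host = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
```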
Session IDs embedded in URLs
• Session identifiers
  • Used by sites to track the browsing behavior of their visitors
  • Create a potentially infinite set of URLs for the same document
  • Represent a special case of alternative paths
• Handled by the document fingerprinting technique
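A content-seen test of the kind the slides describe can be sketched as follows: session-ID variants of a URL return identical content, so hashing the document body catches them even though the URLs differ. Mercator uses compact document fingerprints; SHA-256 here is just a convenient stand-in, and the class name is hypothetical.

```python
import hashlib

class ContentSeenTest:
    """Reject documents whose body has been fingerprinted before,
    regardless of which URL delivered them."""
    def __init__(self):
        self._seen = set()

    def is_new(self, content: bytes) -> bool:
        fingerprint = hashlib.sha256(content).digest()
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```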
Crawler traps
• A crawler trap causes a crawler to crawl indefinitely
  • Unintentional: e.g., a cyclic symbolic link
  • Intentional: traps built with CGI programs
    • Antispam traps, traps to catch search engine crawlers
• Solution
  • No automatic technique
  • But traps are easily noticed
  • Manually exclude the site using the customizable URL filter
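The manual exclusion step might look like the sketch below: a URL filter consulted before any URL enters the frontier, with a host list the operator edits by hand. The class and method names are illustrative, not Mercator's API.

```python
from urllib.parse import urlsplit

class UrlFilter:
    """Reject URLs whose host is on a manually maintained exclusion list."""
    def __init__(self, excluded_hosts=()):
        self.excluded = set(excluded_hosts)

    def exclude(self, host):
        """Operator action: blacklist a trap site noticed in the logs."""
        self.excluded.add(host)

    def accepts(self, url):
        host = urlsplit(url).hostname or ""
        return host not in self.excluded
```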
Performance
• Hardware: Digital Ultimate Workstation
  • Two 533 MHz Alpha processors
  • 2 GB of RAM and 118 GB of local disk
• Crawl run in May 1999
  • 77.4 million HTTP requests in 8 days
  • 112 docs/sec and 1,682 KB/sec
• CPU cycle breakdown
  • 37%: JIT-compiled Java bytecode
  • 19%: Java runtime
  • 44%: Unix kernel
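The two throughput figures are consistent with each other, as a quick check shows:

```python
# 77.4 million requests over an 8-day crawl works out to
# the ~112 docs/sec quoted on the slide.
requests = 77.4e6
seconds = 8 * 24 * 3600
rate = requests / seconds
print(round(rate))  # -> 112
```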
Selected Web statistics (1)
• Relationship between URLs and HTTP requests [figure omitted]

Selected Web statistics (2)
• Breakdown of HTTP status codes [figure omitted]

Selected Web statistics (3)
• Size of successfully downloaded documents [figure omitted]

Selected Web statistics (4)
• Distribution of MIME types [figure omitted]
Conclusions
• Use of Java
  • Made the implementation easier and more elegant
  • Threads, garbage collection, objects, exceptions, etc.
• Scalability
• Extensibility

Fin.