
Mercator: A scalable, extensible Web crawler



  1. Mercator: A scalable, extensible Web crawler
  Allan Heydon and Marc Najork, World Wide Web, 1999
  2006. 5. 23 Young Geun Han

  2. Contents
  • Introduction
  • Related Work
  • Architecture of a scalable Web crawler
  • Extensibility
  • Crawler traps and other hazards
  • Results of an extended crawl
  • Conclusions

  3. 1. Introduction
  • The motivations of this work
    • Due to the competitive nature of the search engine business, Web crawler design is not well-documented in the literature
    • To collect statistics about the Web
  • Mercator, a scalable, extensible Web crawler
    • By scalable
      • Mercator is designed to scale up to the entire Web
      • They achieve scalability by implementing their data structures so that they use a bounded amount of memory, regardless of the size of the crawl
      • The vast majority of their data structures are stored on disk, and small parts of them are stored in memory for efficiency
    • By extensible
      • Mercator is designed in a modular way, with the expectation that new functionality will be added by third parties

  4. 2. Related work (1)
  • Web crawlers are almost as old as the Web itself
    • The first crawler, Matthew Gray's Wanderer, 1993 (roughly coincided with the first release of NCSA Mosaic)
  • Google search engine [Brin and Page 1998; Google]
    • A distributed system that uses multiple machines for crawling
    • The crawler consists of five functional components, each of which may run on a different machine:
      • URL Server: reads URLs and forwards them to multiple crawler processes
      • Crawler: uses asynchronous I/O to fetch data from up to 300 Web servers and transmits the pages to a Store Server
      • Store Server: compresses the pages and stores them to disk
      • Indexer: reads the pages from disk, extracts links from HTML pages, and saves the links to a different disk file
      • URL Resolver: reads the link file, resolves the URLs, and saves the absolute URLs
  [Diagram: Google's crawling architecture; the slide's figure also shows the other Google components (Sorter, Doc index, Barrels, Anchors, Repository)]

  5. 2. Related work (2)
  • Internet Archive [Burner 1997; InternetArchive]
    • The Internet Archive also uses multiple machines to crawl the Web
    • Each crawler process is assigned up to 64 sites to crawl
    • Each crawler reads a list of seed URLs and uses asynchronous I/O to fetch pages from per-site queues in parallel
    • When a page is downloaded, the crawler extracts the links and adds them to the appropriate site queue
    • Using a batch process, it merges "cross-site" URLs into the site-specific seed sets, filtering out duplicates in the process
  • SPHINX [Miller and Bharat 1998]
    • The SPHINX system provides some customizability features (a mechanism for limiting which pages are crawled, document processing code)
    • SPHINX is targeted towards site-specific crawling, and is therefore not designed to be scalable

  6. 3. Architecture of a scalable Web crawler
  • The basic algorithm of any Web crawler takes a list of seed URLs as input and repeats the following steps (see the sketch below):
    • Remove a URL from the URL list
    • Determine the IP address of its host name
    • Download the corresponding document
    • Extract any links contained in the document
    • For each of the extracted links, ensure that it is an absolute URL
    • Add each URL to the list of URLs to download, provided it has not been encountered before
  • Functional components
    • a component (URL frontier) for storing the list of URLs to download
    • a component for resolving host names into IP addresses
    • a component for downloading documents using the HTTP protocol
    • a component for extracting links from HTML documents
    • a component for determining whether a URL has been encountered before
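A minimal single-threaded rendering of this loop in Java may make the steps concrete. The paper describes the algorithm abstractly; the HttpClient fetch, the regex-based link extractor, and the purely in-memory frontier and seen-set below are illustrative simplifications, not Mercator's implementation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal, single-threaded rendering of the basic crawl loop.
public class BasicCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    public static void crawl(List<String> seeds, int maxPages) throws Exception {
        Queue<String> frontier = new ArrayDeque<>(seeds); // URLs still to download
        Set<String> seen = new HashSet<>(seeds);          // URL-seen test
        HttpClient client = HttpClient.newHttpClient();

        while (!frontier.isEmpty() && seen.size() < maxPages) {
            String url = frontier.remove();               // 1. remove a URL
            HttpResponse<String> resp = client.send(      // 2-3. resolve host + download
                HttpRequest.newBuilder(URI.create(url)).build(),
                HttpResponse.BodyHandlers.ofString());

            Matcher m = HREF.matcher(resp.body());        // 4. extract links
            while (m.find()) {
                String link = URI.create(url).resolve(m.group(1)).toString(); // 5. absolutize
                if (seen.add(link)) {                     // 6. add only if not seen before
                    frontier.add(link);
                }
            }
        }
    }
}
```

Mercator's contribution lies precisely in replacing the unbounded in-memory Queue and Set here with bounded-memory, disk-backed equivalents, as the following slides describe.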

  7. 3.1 Mercator's components (1)
  [Figure 1. Mercator's main components: worker threads pull URLs from the URL frontier, resolve them with the DNS resolver, fetch documents through protocol modules (HTTP, FTP, Gopher) into the RIS, run the content-seen test against the document fingerprint set (Doc FPs), hand the document to processing modules (Link Extractor, Tag Counter, GIF Stats), and pass extracted links through the URL filter and the URL-seen test (backed by the URL set) back into the frontier; numbered arrows trace this loop]

  8. 3.1 Mercator's components (2)
  • The first step of this loop is to remove an absolute URL from the shared URL frontier for downloading
  • The protocol module's fetch method downloads the document from the Internet into a per-thread RewindInputStream (RIS)
  • The worker thread invokes the content-seen test to determine whether this document has been seen before
  • Based on the downloaded document's MIME type, the worker invokes the process method of each processing module associated with that MIME type
  • Each extracted link is converted into an absolute URL, and tested against a user-supplied URL filter to determine if it should be downloaded
  • The worker performs the URL-seen test, which checks if the URL has been seen before
  • If the URL is new, it is added to the frontier

  9. 3.2 The URL frontier
  [Diagram: two snapshots of the URL frontier with one FIFO subqueue per host (naver.com, daum.net, www.ssu.ac.kr); each protocol module thread fetches from the head of the subqueue for its host]
  • The URL frontier is the data structure that contains all the URLs that remain to be downloaded
  • To implement the politeness constraint, the default version of Mercator's URL frontier is implemented by a collection of distinct FIFO subqueues
  • There is one FIFO subqueue per worker thread
  • When a new URL is added, the FIFO subqueue in which it is placed is determined by the URL's canonical host name (see the sketch below)
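A minimal sketch of such a frontier, assuming URLs are assigned to subqueues by hashing the canonical host name (the class and method names are illustrative, not Mercator's API). Because all URLs for a host land in one subqueue served by one thread, at most one connection is open to any host at a time:

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// A sketch of a politeness-preserving frontier: one FIFO subqueue per worker
// thread, chosen by hashing the URL's canonical host name. All URLs for a
// given host land in the same subqueue, so only one thread contacts a host.
public class UrlFrontier {
    private final List<Queue<String>> subqueues;

    public UrlFrontier(int numWorkerThreads) {
        subqueues = new java.util.ArrayList<>();
        for (int i = 0; i < numWorkerThreads; i++) subqueues.add(new ArrayDeque<>());
    }

    public synchronized void add(String url) {
        String host = URI.create(url).getHost().toLowerCase(); // canonical host name
        int q = Math.floorMod(host.hashCode(), subqueues.size());
        subqueues.get(q).add(url);
    }

    // Worker thread i polls only its own subqueue.
    public synchronized String remove(int threadIndex) {
        return subqueues.get(threadIndex).poll();
    }
}
```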

  10. 3.3 The HTTP protocol module
  [Diagram: a fixed-size cache (2^18 entries, LRU replacement) mapping host names to robots exclusion rules, e.g.:
    www.naver.com  -> User-agent: *, Disallow: /tmp/          (last used 2006.05.23 09:00)
    www.daum.net   -> User-agent: googlebot, Disallow: /cafe/ (last used 2006.05.23 09:20)
    www.ssu.ac.kr  -> no rules                                (last used 2006.05.23 10:00)
    www.google.com -> User-agent: *, Disallow: /calendar/     (last used 2006.05.23 10:10)]
  • The purpose of a protocol module is to fetch the document corresponding to a given URL using the appropriate network protocol
  • Network protocols supported by Mercator include HTTP, FTP, and Gopher
  • Mercator implements the Robots Exclusion Protocol
    • To avoid downloading the Robots Exclusion file (robots.txt) on every request, Mercator's HTTP protocol module maintains a fixed-size cache mapping host names to their robots exclusion rules (see the sketch below)
  • Mercator uses its own "lean and mean" HTTP protocol module
    • Its requests time out after 1 minute, and it has minimal synchronization and allocation overhead
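A sketch of a fixed-size robots-rules cache with LRU replacement, using LinkedHashMap's access-order mode; the RobotsCache and RobotsRules names are assumptions for illustration, and the 2^18 capacity comes from the slide:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of a fixed-size, LRU-evicting robots.txt cache, assuming a
// RobotsRules value type that holds the parsed exclusion rules for one host.
// The slide shows a cache of 2^18 entries; LinkedHashMap's access-order
// mode gives LRU replacement for free.
public class RobotsCache {
    public record RobotsRules(java.util.List<String> disallowedPrefixes) {}

    private static final int MAX_ENTRIES = 1 << 18;

    private final Map<String, RobotsRules> cache =
        new LinkedHashMap<>(1024, 0.75f, /*accessOrder=*/true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, RobotsRules> eldest) {
                return size() > MAX_ENTRIES; // evict the least-recently-used host
            }
        };

    public synchronized RobotsRules get(String host) { return cache.get(host); }

    public synchronized void put(String host, RobotsRules rules) { cache.put(host, rules); }
}
```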

  11. 3.4 Rewind input stream
  [Diagram: a worker thread (1) takes a URL from the frontier, (2) the HTTP protocol module initializes the RIS, and (3) the RIS is passed, with rewinding, to each processing module (Link Extractor, Tag Counter, GIF Stats)]
  • Mercator's design allows the same document to be processed by multiple processing modules
  • To avoid reading a document over the network multiple times, Mercator caches the document locally using an abstraction called a RewindInputStream (RIS)
  • A RIS caches small documents (64 KB or less) entirely in memory, while larger documents are temporarily written to a backing file (with a 1 MB limit)
  • A RIS also provides a method for rewinding its position to the beginning of the stream, and various lexing methods that make it easy to build MIME-type-specific parsers
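A simplified, in-memory-only sketch of the rewind abstraction (Mercator spills documents larger than 64 KB to a backing file, which this sketch omits for brevity):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// A simplified sketch of a rewindable input stream: the document is read
// once from the network and buffered, then each processing module can
// rewind() and re-read it without touching the network again.
public class RewindInputStream extends InputStream {
    private final byte[] buffer;
    private ByteArrayInputStream in;

    public RewindInputStream(InputStream network, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n, total = 0;
        while ((n = network.read(chunk)) != -1 && total < maxBytes) {
            out.write(chunk, 0, n); // buffer the document as it arrives
            total += n;
        }
        buffer = out.toByteArray();
        rewind();
    }

    // Reset the read position to the beginning of the buffered document.
    public void rewind() { in = new ByteArrayInputStream(buffer); }

    @Override
    public int read() { return in.read(); }
}
```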

  12. 3.5 Content-seen test (1)
  [Diagram: the same index.html reachable under multiple URLs (www.ssu.ac.kr, www3.ssu.ac.kr, it.ssu.ac.kr) and mirrored on two servers, Cases A-D]
  • A Web crawler may download the same document contents multiple times
    • Many documents are available under multiple, different URLs
    • There are also many cases in which documents are mirrored on multiple servers
  • To prevent processing a document more than once, a Web crawler may wish to perform a content-seen test to decide if the document has already been processed
  • To save space and time, Mercator uses a data structure called the document fingerprint set that stores a 64-bit checksum of the contents of each downloaded document
    • Mercator computes the checksum using Broder's implementation [Broder 1993] of Rabin's fingerprinting algorithm [Rabin 1981]
    • Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint
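A sketch of the content-seen test. Mercator computes 64-bit Rabin fingerprints via Broder's implementation; as an easily available stand-in, this sketch truncates a SHA-256 digest to 64 bits, and an in-memory set stands in for the two-level fingerprint set of the next slide:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// A sketch of the content-seen test: map document contents to a 64-bit
// checksum and keep a set of checksums already encountered.
public class ContentSeenTest {
    private final Set<Long> documentFingerprints = new HashSet<>();

    // Stand-in 64-bit checksum: the first 8 bytes of a SHA-256 digest.
    public static long fingerprint(byte[] contents) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(contents);
        return ByteBuffer.wrap(digest).getLong();
    }

    // Returns true if the document's contents have been seen before.
    public boolean seen(byte[] contents) throws Exception {
        return !documentFingerprints.add(fingerprint(contents));
    }
}
```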

  13. 3.5 Content-seen test (2)
  • Mercator maintains two independent sets of fingerprints
    • A small hash table kept in memory
    • A large sorted list kept in a single disk file, accessed via Java's random access files and protected by a readers-writer lock, with an in-memory index of the disk file
  • The test proceeds as follows (see the sketch below):
    1. Check the fingerprint in the in-memory hash table
    2. If not seen there, check the fingerprint in the disk file
    3. If still not seen, add the new fingerprint to the in-memory table
    4. When the hash table fills up, merge its contents with the fingerprints on disk
    5. Update the in-memory index of the disk file after the merge
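A compact sketch of that two-level structure, with a sorted array standing in for the sorted disk file and its index (Mercator's actual disk file, random-access I/O, and readers-writer locking are elided):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// A sketch of the two-level fingerprint set: a small in-memory hash table
// in front of a large sorted array standing in for the sorted disk file.
// When the table fills up, it is merged into the sorted array (in Mercator,
// a merge with the disk file that also rebuilds the file's in-memory index).
public class FingerprintSet {
    private final Set<Long> inMemory = new HashSet<>();
    private long[] onDisk = new long[0]; // stand-in for the sorted disk file
    private final int maxInMemory;

    public FingerprintSet(int maxInMemory) { this.maxInMemory = maxInMemory; }

    public boolean addIfNotSeen(long fp) {
        if (inMemory.contains(fp)) return false;               // 1. check memory
        if (Arrays.binarySearch(onDisk, fp) >= 0) return false; // 2. check "disk"
        inMemory.add(fp);                                       // 3. add to memory
        if (inMemory.size() >= maxInMemory) merge();            // 4-5. merge on fill-up
        return true;
    }

    private void merge() {
        long[] merged = new long[onDisk.length + inMemory.size()];
        int i = 0;
        for (long fp : inMemory) merged[i++] = fp;
        System.arraycopy(onDisk, 0, merged, i, onDisk.length);
        Arrays.sort(merged);
        onDisk = merged;
        inMemory.clear();
    }
}
```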

  14. 3.6 URL filters
  [Diagram: the URL filter sits between the link extractor and the URL-seen test; example of a domain-based filter whose crawl method returns true or false depending on the URL's domain (www.ssu.ac.kr, www.naver.com, www.daum.net)]
  • The URL filtering mechanism provides a customizable way to control the set of URLs that are downloaded
  • The URL filter class has a single crawl method that takes a URL and returns a boolean value indicating whether or not to crawl that URL (see the sketch below)
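The slide's description translates almost directly into Java. The interface and example domain filter below are illustrative (the names are assumptions, though the single boolean-returning crawl method is the slide's own description):

```java
import java.net.URI;

// A sketch of the URL filter hook: a single crawl method that takes a URL
// and returns whether it should be downloaded.
public interface UrlFilter {
    boolean crawl(String url);

    // Example user-supplied filter: only crawl URLs under the given domain.
    class DomainFilter implements UrlFilter {
        private final String domain;

        public DomainFilter(String domain) { this.domain = domain; }

        @Override
        public boolean crawl(String url) {
            String host = URI.create(url).getHost();
            return host != null && (host.equals(domain) || host.endsWith("." + domain));
        }
    }
}
```

For example, new UrlFilter.DomainFilter("ssu.ac.kr").crawl("http://www.ssu.ac.kr/index.html") returns true, while a www.naver.com URL returns false.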

  15. 3.7 Domain name resolution
  • Before contacting a Web server, a Web crawler must use DNS to map the host name into an IP address
  • Mercator tried to alleviate the DNS bottleneck by caching DNS results, but that was only partially effective
  • The Java interface to DNS lookups and the DNS interface on most Unix systems are synchronized, so concurrent DNS requests are serialized
  • To avoid this bottleneck, Mercator uses its own multi-threaded DNS resolver that can resolve host names much more rapidly than either the Java or Unix resolver (see the sketch below)
    • DNS lookups had accounted for 87% of each thread's elapsed time; the multi-threaded resolver reduced that to 25%
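A sketch of the concurrency structure only: lookups run on a dedicated thread pool and results are cached. Mercator's resolver additionally bypassed the synchronized resolver interface by speaking the DNS protocol directly, which this sketch does not attempt:

```java
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A sketch of a multi-threaded DNS resolver front-end: lookups run on a
// dedicated thread pool so worker threads are not serialized behind a
// single synchronized resolver, and results are cached.
public class MultiThreadedDnsResolver {
    private final ExecutorService pool = Executors.newFixedThreadPool(32);
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();

    public Future<InetAddress> resolve(String hostName) {
        return pool.submit(() -> {
            InetAddress cached = cache.get(hostName);
            if (cached != null) return cached;
            InetAddress address = InetAddress.getByName(hostName);
            cache.put(hostName, address);
            return address;
        });
    }
}
```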

  16. 3.8 URL-seen test (1)
  • To avoid downloading and processing a document multiple times, a URL-seen test must be performed on each extracted link
    • (Performing the test only when a URL is removed from the frontier would result in a much larger frontier)
  • To perform the URL-seen test, all of the URLs seen by Mercator are stored in canonical form in a large table called the URL set
  • To save space, Mercator doesn't store the textual representation of each URL in the URL set, but rather a fixed-sized checksum
  • To reduce the number of operations on the backing disk file, Mercator keeps an in-memory cache of popular URLs

  17. 3.8 URL-seen test (2)
  • Unlike the document fingerprints, the stream of URLs has a non-trivial amount of locality (URL locality)
  • Mercator uses an in-memory cache of 2^18 entries with an LRU-like clock replacement policy (see the sketch below)
    • Hit rates: 66.2% on the in-memory cache and 9.5% on the table of recently-added URLs; of the remaining requests, about 8% hit the file buffer, leaving about 16% to miss entirely
  • Each URL set membership test induces one-sixth as many kernel calls as a membership test on the document fingerprint set (each membership test on the URL set results in an average of 0.16 seek and 0.17 read kernel calls)
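A sketch of an LRU-like clock replacement cache of the kind described: each slot carries a reference bit, and the clock hand evicts the first entry whose bit is clear, giving recently used entries a second chance. The names and exact structure are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of an LRU-like clock replacement cache for popular URL checksums.
public class ClockCache {
    private final long[] slots;                                // cached checksums
    private final boolean[] referenced;                        // reference bit per slot
    private final Map<Long, Integer> index = new HashMap<>();  // checksum -> slot
    private int hand = 0, used = 0;

    public ClockCache(int capacity) {  // the slide uses 2^18 entries
        slots = new long[capacity];
        referenced = new boolean[capacity];
    }

    public boolean contains(long checksum) {
        Integer slot = index.get(checksum);
        if (slot == null) return false;
        referenced[slot] = true; // mark as recently used
        return true;
    }

    public void add(long checksum) {
        if (index.containsKey(checksum)) return;
        if (used < slots.length) {          // a free slot is still available
            slots[used] = checksum;
            referenced[used] = true;
            index.put(checksum, used++);
            return;
        }
        while (referenced[hand]) {          // sweep: second chance for referenced entries
            referenced[hand] = false;
            hand = (hand + 1) % slots.length;
        }
        index.remove(slots[hand]);          // evict the unreferenced entry
        slots[hand] = checksum;
        referenced[hand] = true;
        index.put(checksum, hand);
        hand = (hand + 1) % slots.length;
    }
}
```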

  18. 3.8 URL-seen test (3)
  • Host name locality
    • Host name locality arises because many links found in Web pages are to different documents on the same server
  • To preserve this locality, they compute the checksum of a URL by merging two independent fingerprints (see the sketch below):
    • The fingerprint of the URL's host name
    • The fingerprint of the complete URL
  • These two fingerprints are merged so that the high-order bits of the checksum derive from the host name fingerprint
    • As a result, checksums for URLs with the same host component are numerically close together
  • The host name locality in the stream of URLs translates into access locality on the URL set's backing disk file, allowing the kernel's file system buffers to service read requests from memory more often
  • On extended crawls, this technique results in a significant reduction in disk load and a significant performance improvement
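A sketch of the merged checksum. The 32/32 bit split and the FNV-1a stand-in hash are assumptions for illustration; Mercator merges two Rabin fingerprints, but the key property (host bits in the high-order positions, so same-host URLs sort together) is the same:

```java
import java.net.URI;

// A sketch of the merged URL checksum: high-order bits come from a
// fingerprint of the host name, low-order bits from a fingerprint of the
// complete URL, so URLs on the same host are numerically close together
// in the sorted URL set.
public class UrlChecksum {
    public static long checksum(String url) {
        String host = URI.create(url).getHost();
        if (host == null) host = "";
        long hostFp = fingerprint(host);
        long urlFp = fingerprint(url);
        return (hostFp << 32) | (urlFp & 0xFFFFFFFFL); // host bits high, URL bits low
    }

    // Stand-in 32-bit fingerprint (FNV-1a); Mercator uses Rabin fingerprints.
    private static long fingerprint(String s) {
        long h = 0x811C9DC5L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h = (h * 0x01000193L) & 0xFFFFFFFFL;
        }
        return h;
    }
}
```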

  19. 3.9 Synchronous vs. asynchronous I/O
  • Google and Internet Archive crawlers
    • Use single-threaded crawling processes and asynchronous I/O to perform multiple downloads in parallel
    • They are designed from the ground up to scale to multiple machines
  • Mercator
    • Uses a multi-threaded process in which each thread performs synchronous I/O (this leads to a much simpler program structure)
    • It would not be too difficult to adapt Mercator to run on multiple machines
  [Diagram: Google and Archive crawlers run one single-threaded process per machine, multiplexing many Web servers via asynchronous I/O; Mercator runs one multi-threaded process with one synchronous connection per thread]
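A minimal illustration of Mercator's side of the contrast: each worker thread simply blocks on its own connection, and concurrency comes from the number of threads rather than from multiplexing sockets. The URL and thread count below are placeholders:

```java
import java.io.InputStream;
import java.net.URL;

// A sketch of Mercator's concurrency model: many worker threads, each doing
// simple blocking (synchronous) I/O, rather than one thread multiplexing
// many sockets with asynchronous I/O.
public class Workers {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) { // Mercator ran hundreds of worker threads
            new Thread(() -> {
                try (InputStream in = new URL("http://example.com/").openStream()) {
                    in.readAllBytes(); // blocks this thread only; others keep running
                } catch (Exception e) {
                    // a failed fetch affects only this worker
                }
            }).start();
        }
    }
}
```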

  20. 3.10 Checkpointing
  • To complete a crawl of the entire Web, Mercator writes regular snapshots of its state to disk
    • An interrupted or aborted crawl can easily be restarted from the latest checkpoint
  • Mercator's core classes and all user-supplied modules are required to implement the checkpointing interface
  • Checkpointing is coordinated using a global readers-writer lock (see the sketch below)
    • Each worker thread acquires a read share of the lock while processing a downloaded document
    • Once a day, Mercator's main thread acquires the lock in write mode; once it holds the lock, it arranges for the checkpoint methods to be invoked
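A sketch of the locking discipline using Java's ReentrantReadWriteLock. The Checkpointable interface name is an assumption, but the read-share-per-document and daily write-mode acquisition follow the slide: when the main thread holds the write lock, no document is mid-processing, so the snapshotted state is consistent.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A sketch of checkpoint coordination with a global readers-writer lock.
public class CheckpointCoordinator {
    public interface Checkpointable { void checkpoint(); }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Called by each worker thread around the processing of one document.
    public void processDocument(Runnable work) {
        lock.readLock().lock();
        try {
            work.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called by the main thread (e.g., once a day): blocks until all read
    // shares are released, then invokes every module's checkpoint method.
    public void checkpoint(Iterable<Checkpointable> modules) {
        lock.writeLock().lock();
        try {
            for (Checkpointable m : modules) m.checkpoint();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```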
