
Parallel Crawlers


Presentation Transcript


  1. Parallel Crawlers Junghoo Cho (UCLA) Hector Garcia-Molina (Stanford) May 2002 Ke Gong

  2. Crawler • Single-process crawler: hard to scale, heavy network loading • Parallel crawler • Scalability: increase the number of crawling processes • Network-load dispersion: crawl geographically adjacent pages • Network-load reduction: crawl only through the local network

  3. Architecture of a parallel crawler • Independent • Each crawling process starts with its own set of seed URLs and follows links without consulting other crawling processes. • Dynamic Assignment • A central coordinator logically divides the Web into small partitions and dynamically assigns each partition to a crawling process for download. • Static Assignment • The Web is partitioned and assigned to each crawling process (C-proc) before the crawl starts.

  4. Three modes for static assignment • Independent • Dynamic Assignment • Static Assignment • Firewall mode: S1 crawls a->b->c (inter-partition links are ignored) • Cross-over mode: S1 crawls a->b->c->g->h->d->e (inter-partition links may be followed, causing overlap) • Exchange mode: S1 crawls a->b->c->d->e (inter-partition URLs are forwarded to the C-proc that owns them)
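The three modes differ only in what a C-proc does with a URL that falls outside its own partition. The sketch below (not code from the paper; partition_of, frontier and peers are hypothetical stand-ins for the partition function, the local download queue and the channels to the other crawling processes) illustrates that per-URL decision.

    # Minimal sketch of how a C-proc might treat a newly discovered URL
    # under each static-assignment mode.
    def handle_discovered_url(url, my_id, partition_of, frontier, peers, mode):
        owner = partition_of(url)          # which C-proc owns this URL
        if owner == my_id:
            frontier.append(url)           # our own URL: always crawl it
        elif mode == "firewall":
            pass                           # ignore inter-partition links
        elif mode == "cross-over":
            frontier.append(url)           # download it anyway (causes overlap)
        elif mode == "exchange":
            peers[owner].send(url)         # forward the URL to its owner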

  5. URL exchange minimization • Batch communication • Instead of transferring an inter-partition URL immediately after it is discovered, a crawling process may wait for a while to collect a set of URLs and send them in a batch. • Replication • If we replicate the most “popular” URLs at each crawling process and stop transferring them between crawling processes, we may significantly reduce URL exchanges.
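A rough sketch of both optimizations, assuming an invented UrlExchanger helper: inter-partition URLs are buffered and flushed in batches, and URLs in the replicated “popular” set are never transferred at all. The peer channel object and the replicated URL set are assumptions, not details from the paper.

    # Sketch only: batching plus replication for URL exchange.
    from collections import defaultdict

    class UrlExchanger:
        def __init__(self, peers, replicated_urls, batch_size=1000):
            self.peers = peers                       # owner id -> send channel
            self.replicated = set(replicated_urls)   # popular URLs known to every C-proc
            self.batch_size = batch_size
            self.buffers = defaultdict(list)

        def queue(self, owner, url):
            if url in self.replicated:
                return                               # replicated URLs are never exchanged
            self.buffers[owner].append(url)
            if len(self.buffers[owner]) >= self.batch_size:
                self.flush(owner)

        def flush(self, owner):
            batch, self.buffers[owner] = self.buffers[owner], []
            if batch:
                self.peers[owner].send(batch)        # one message instead of many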

  6. Partition Function • 1. URL-hash based • Hash the whole URL • Pages in the same site can be assigned to different C-procs • Locality of links not reflected • 2. Site-hash based • Hash only the site name of the URL • Locality preserved • Partitions evenly loaded • 3. Hierarchical • Partition based on domain names, countries or other features • Fewer inter-partition links • Partitions may not be evenly loaded
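For illustration, the first two partition functions could look like the sketch below; the exact hash function used in the paper is not specified here, so MD5 modulo the number of crawling processes is only an assumption.

    # Illustrative URL-hash and site-hash partition functions (assumed hashing).
    import hashlib
    from urllib.parse import urlparse

    def url_hash_partition(url, n_procs):
        """Hash the whole URL: pages of one site may land in different partitions."""
        return int(hashlib.md5(url.encode()).hexdigest(), 16) % n_procs

    def site_hash_partition(url, n_procs):
        """Hash only the site name, so a whole site stays in one partition and
        most (intra-site) links become intra-partition links."""
        site = urlparse(url).netloc
        return int(hashlib.md5(site.encode()).hexdigest(), 16) % n_procs

With the site-hash function, for example, http://www.dmoz.org/Arts/ and http://www.dmoz.org/News/ always map to the same C-proc, whereas the URL-hash function may split them across partitions.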

  7. Parallel Crawler Models We want to evaluate how the different modes affect our crawling results, so we need an evaluation model.

  8. Evaluation Models • Overlap: (N-I)/I • N: the total number of pages downloaded • I: the number of unique pages downloaded • Coverage: I/U • U: the total number of pages the crawler has to download • I: the number of unique pages downloaded • Quality: |PN∩AN|/|PN| • PN: top N important pages from an ideal crawler • AN: top N important pages from an actual crawler • Communication overhead: C/N • C: the total number of inter-partition URLs exchanged • N: the total number of pages downloaded
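Given crawl statistics, these four metrics are direct to compute; the small helper below simply restates the definitions above (the input counts and top-N page sets would come from crawl logs).

    # Direct translations of the four evaluation metrics defined above.
    def overlap(n_downloaded, n_unique):
        """(N - I) / I: duplicate downloads per unique page."""
        return (n_downloaded - n_unique) / n_unique

    def coverage(n_unique, n_required):
        """I / U: fraction of the required pages that were actually downloaded."""
        return n_unique / n_required

    def quality(ideal_top_n, actual_top_n):
        """|PN ∩ AN| / |PN|: agreement with an ideal single-process crawler."""
        return len(set(ideal_top_n) & set(actual_top_n)) / len(ideal_top_n)

    def communication_overhead(n_exchanged, n_downloaded):
        """C / N: inter-partition URLs exchanged per downloaded page."""
        return n_exchanged / n_downloaded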

  9. Dataset • Pages were collected using the Stanford WebBase crawler over a period of two weeks in December 1999. • The WebBase crawler started with the 1 million URLs listed in Open Directory (http://www.dmoz.org) and followed links. • The dataset contains 40M pages. • Many dynamically generated pages were still downloaded by the crawler.

  10. Firewall mode and coverage • When a relatively small number of crawling processes run in parallel, a crawler using the firewall mode provides good coverage. • The firewall mode is not a good choice if the crawler needs good coverage with a large number of crawling processes. • Increasing the number of seed URLs helps reach better coverage.

  11. Cross-over mode and overlap • With a larger number of crawling processes, we have to accept a higher overlap in order to obtain the same coverage. • Overlap stays at zero until the coverage becomes relatively large. • High coverage in cross-over mode therefore means high overlap.

  12. Exchange mode and communication • Site-hash partitioning has significantly lower communication overhead compared to URL-hash partitioning. • The network bandwidth used for URL exchange is relatively small compared to the bandwidth used for actual page downloads. • We can significantly reduce the communication overhead by replicating a relatively small number of URLs.

  13. Quality and batch communication • As the number of crawling processes increases, the quality of downloaded pages becomes worse, unless they exchange messages often. • The quality of the firewall-mode crawler (x = 0) is significantly worse than that of the single-process crawler (x → ∞) when the crawler downloads a relatively small fraction of the pages. • The communication overhead does not increase linearly as the number of URL exchanges increases. • One does not need a large number of URL exchanges to achieve high quality. (In this experiment the crawler downloaded 500K pages.)

  14. Summary • The firewall mode is a good choice if we want to run fewer than 4 crawling processes and still achieve high coverage. • Cross-over crawlers incur quite significant overlap. • A crawler based on the exchange mode consumes little network bandwidth for URL exchanges (less than 1% of the overall network bandwidth) and can further minimize overhead by adopting the batch communication technique. • By replicating between 10,000 and 100,000 popular URLs, we can reduce the communication overhead by roughly 40%.
