Measuring the performance of parallel crawlers in different modes
CSCI 572 Class Project
Huy Pham
PhD, Computer Science
Spring 2011
Project inspired by the research paper on parallel crawlers
• Site S1 is crawled by crawler C1 and site S2 is crawled by crawler C2.
• In Firewall mode, crawlers ignore inter-partition links (C1 ignores g and C2 ignores d). Firewall mode produces no overlap and runs quickly (no communication between crawlers), but some data can be missed because inter-partition links are dropped.
• In Cross-over mode, crawlers also follow inter-partition links and therefore download more pages than in Firewall mode, but overlap becomes an issue (g and d get downloaded twice).
[Figure: Two parallel crawlers]
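The trade-off between the two modes above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the seven-page link graph, the page names (A1..A4 in one partition, B1..B3 in the other), and the helper names are hypothetical and are not the pages from the slide's figure or the project's actual code. Cross-over mode is also simplified here to "follow every link".

```python
from collections import deque

# Hypothetical link graph: pages A* belong to partition A, pages B* to partition B.
LINKS = {
    "A1": ["A2", "B2"],   # A1 -> B2 is an inter-partition link
    "A2": [],
    "A3": ["A4"],         # A3 is reachable only from partition B
    "A4": [],
    "B1": ["B2", "B3"],
    "B2": ["A2"],         # inter-partition link
    "B3": ["A3"],         # inter-partition link
}

def partition(page):
    """Partition label is the first character of the (hypothetical) page name."""
    return page[0]

def crawl(seed, mode):
    """Breadth-first crawl from seed; returns pages in download order."""
    own = partition(seed)
    seen = {seed}
    queue = deque([seed])
    downloaded = []
    while queue:
        page = queue.popleft()
        downloaded.append(page)
        for target in LINKS[page]:
            # Firewall mode drops every link that leaves the crawler's partition.
            if mode == "firewall" and partition(target) != own:
                continue
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return downloaded

def run(mode):
    """Run two independent crawlers (no communication) and summarize the result."""
    c1 = crawl("A1", mode)
    c2 = crawl("B1", mode)
    unique = set(c1) | set(c2)
    return {
        "downloads": len(c1) + len(c2),
        "unique": len(unique),
        "overlap": sorted(set(c1) & set(c2)),
        "missed": sorted(set(LINKS) - unique),
    }

if __name__ == "__main__":
    for mode in ("firewall", "cross-over"):
        print(mode, run(mode))
```

On this toy graph, Firewall mode shows no overlap but misses the pages reachable only through the other partition, while Cross-over mode reaches every page at the cost of downloading two of them twice.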
[Figure: Crawler 1 and Crawler 2; Viterbi, LAS, usc.edu]
Continued
• In Exchange mode, crawlers periodically and incrementally exchange inter-partition links, which avoids overlap and increases coverage.
• Implementation: crawl two websites in parallel, the USC Viterbi School of Engineering and the USC School of Letters, Arts and Sciences (LAS). The two sites have their own data and also share many links (generally to each other and to the main USC website). Data from the USC website is ignored in Firewall mode, overlap occurs in Cross-over mode when the two sites point to each other, and Exchange mode is expected to prove the best of the three.
[Figure: Nutch, Solr, Viterbi crawler, LAS crawler, DBMS, Indexing]
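Exchange mode can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the project's Nutch-based implementation: the link graph, the page names (A* and B*), the seeds, and the `crawl_exchange` helper are all hypothetical, and "periodically" is modeled as one exchange after each crawler drains its current frontier.

```python
from collections import deque

# Hypothetical link graph: pages A* belong to crawler A, pages B* to crawler B.
LINKS = {
    "A1": ["A2", "B2"],
    "A2": [],
    "A3": ["A4"],         # reachable only through a link from partition B
    "A4": [],
    "B1": ["B2", "B3"],
    "B2": ["A2"],
    "B3": ["A3"],
}
SEEDS = {"A": "A1", "B": "B1"}

def owner(page):
    """The crawler responsible for a page is given by the first character of its name."""
    return page[0]

def crawl_exchange():
    """Each crawler downloads only its own pages and forwards foreign links to their owner."""
    frontier = {c: deque([seed]) for c, seed in SEEDS.items()}
    seen = {c: {seed} for c, seed in SEEDS.items()}
    inbox = {c: [] for c in SEEDS}        # links received from the other crawler
    downloaded = {c: [] for c in SEEDS}

    while any(frontier.values()):
        # Crawl phase: every crawler drains its own frontier.
        for c in SEEDS:
            while frontier[c]:
                page = frontier[c].popleft()
                downloaded[c].append(page)
                for target in LINKS[page]:
                    dest = owner(target)
                    if dest != c:
                        inbox[dest].append(target)   # hand over instead of downloading
                    elif target not in seen[c]:
                        seen[c].add(target)
                        frontier[c].append(target)
        # Exchange phase: each crawler adds received links it has not seen yet.
        for c in SEEDS:
            for url in inbox[c]:
                if url not in seen[c]:
                    seen[c].add(url)
                    frontier[c].append(url)
            inbox[c].clear()
    return downloaded

if __name__ == "__main__":
    result = crawl_exchange()
    print(result)
```

In this sketch every page is downloaded exactly once and by its owning crawler, which is the behavior the slide attributes to Exchange mode; the price is the communication step between crawlers.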