Measuring the performance of parallel crawlers in different modes
CSCI 572 Class Project
Huy Pham, PhD – Computer Science
Spring 2011
Project inspired by the research paper on parallel crawlers
• Site S1 is crawled by crawler C1 and site S2 is crawled by crawler C2.
• In Firewall mode, crawlers ignore inter-partition links (C1 ignores g and C2 ignores d). Firewall mode produces no overlap and runs quickly (no communication between crawlers), but some data can be missed because inter-partition links are dropped.
• In Cross-over mode, crawlers also follow inter-partition links, so they download more pages than in Firewall mode, but overlap becomes an issue (g and d get downloaded twice).
• In Exchange mode, crawlers periodically and incrementally exchange inter-partition links, which avoids overlap and increases coverage.
[Figure: Two parallel crawlers]
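The three modes differ only in how a crawler treats a link that leaves its own partition. The sketch below is a minimal illustration of that decision, not the project's actual code: the function names `host_of` and `handle_link`, the return labels, and the hostnames `s1.example` and `s2.example` are all assumptions made for the example.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase hostname of a URL, or an empty string."""
    return urlparse(url).hostname or ""


def handle_link(mode, own_site, url):
    """Decide what a crawler does with a discovered link.

    Returns "crawl", "ignore", or "forward" (hypothetical labels).
    own_site is the hostname of the partition this crawler owns.
    """
    host = host_of(url)
    in_partition = host == own_site or host.endswith("." + own_site)
    if in_partition:
        return "crawl"      # every mode follows its own partition's links
    if mode == "firewall":
        return "ignore"     # inter-partition links are dropped
    if mode == "cross-over":
        return "crawl"      # followed anyway, so pages may be fetched twice
    if mode == "exchange":
        return "forward"    # handed to the crawler that owns the partition
    raise ValueError("unknown mode: " + mode)


# C1 owns s1.example and finds a link to page g on s2.example.
for mode in ("firewall", "cross-over", "exchange"):
    print(mode, handle_link(mode, "s1.example", "http://s2.example/g"))
```

Keeping the decision in one function makes it easy to run the same crawl three times with only the mode changed, which is what a comparison between modes needs.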
Implementation
• Two websites are crawled in parallel: the USC School of Letters, Arts and Sciences (LAS) and the USC main page, usc.edu. Each site has its own data, and the two share many links pointing to each other.
• In Firewall mode, data from domains other than LAS and usc.edu is ignored. Only the two domains are crawled, so there is no overlap. This data is used to test the data from the cross-over and exchange modes.
• In cross-over mode, besides the data from the two domains (Viterbi and LAS), only data from usc.edu is crawled, in order to limit the amount of data retrieved. The reason is that pages in the two domains link to other sites such as experiencela.com, thegrovela.com…, and those sites often contain too much data to handle. Overlap is expected in cross-over mode because usc.edu and LAS link to each other, so the same data gets crawled twice.
Data(firewall mode) - Data(exchange mode) = overlapping
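The domain restriction described above can be sketched as a simple scope filter applied to every discovered link. This is only an illustration under stated assumptions: `in_scope` and `ALLOWED` are hypothetical names, and `las.example.org` is a placeholder hostname standing in for the LAS site, whose real address is not given here.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_domains):
    """True if the URL's host is one of the allowed domains or a subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Placeholder scope: usc.edu plus a stand-in hostname for LAS (assumption).
ALLOWED = ("usc.edu", "las.example.org")

links = [
    "http://www.usc.edu/about",
    "http://experiencela.com/events",   # off-site, skipped
    "http://las.example.org/dept",
    "http://notusc.edu/page",           # similar name, different domain
]
kept = [u for u in links if in_scope(u, ALLOWED)]
print(kept)
```

Matching on the exact host or a dot-prefixed suffix avoids accepting unrelated hosts whose names merely end in the same characters.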
In exchange mode, the two crawlers (LAS and usc.edu) exchange batches of information. When a crawler sees a page whose domain is not the one it is supposed to crawl, it stores the URL in a batch; when the batch is full, it sends the batch to the corresponding crawler.
[Figure: architecture diagram; labels: crawlers for Viterbi, usc.edu and LAS, Nutch, Solr, Indexing, DBMS]
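The batching behaviour described above can be sketched as a small buffer that collects foreign URLs per target crawler and hands a batch over once it is full. This is a hedged illustration rather than the project's implementation: the class name `ExchangeBuffer`, the `send` callback, the batch size, and the example URLs are assumptions.

```python
class ExchangeBuffer:
    """Collects out-of-partition URLs and sends them in fixed-size batches."""

    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send      # callable(target, list_of_urls), supplied by caller
        self.pending = {}     # target crawler name -> URLs not yet sent

    def add(self, target, url):
        """Store a URL for its owning crawler; send when the batch is full."""
        batch = self.pending.setdefault(target, [])
        batch.append(url)
        if len(batch) >= self.batch_size:
            self.send(target, batch)
            self.pending[target] = []   # start a fresh batch

    def flush(self):
        """Send any partially filled batches, e.g. at the end of a crawl."""
        for target, batch in self.pending.items():
            if batch:
                self.send(target, batch)
        self.pending = {}


# Example: the LAS crawler finds three usc.edu links with a batch size of 2.
sent = []
buf = ExchangeBuffer(batch_size=2, send=lambda t, urls: sent.append((t, urls)))
for path in ("a", "b", "c"):
    buf.add("usc.edu", "http://www.usc.edu/" + path)
buf.flush()
print(sent)
```

A larger batch size means fewer messages between crawlers but a longer delay before the owning crawler learns about a link, so the size is a tuning parameter.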
Cross-over mode
• Graph of the percentage of overlap as a function of the total amount of data crawled.
[Figure: example graph]
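One way to produce the points for such a graph is to replay the combined download log and record the overlap percentage at regular intervals. The sketch below assumes that overlap is measured as repeated downloads divided by total downloads; that definition, the function name `overlap_curve`, and the sample log are assumptions, not results from the project.

```python
def overlap_curve(fetch_log, step):
    """Return (total_downloads, overlap_percent) points every `step` downloads.

    fetch_log is an ordered list of (crawler, url) pairs. A download counts
    as overlap when its URL was already fetched earlier by any crawler.
    """
    seen = set()
    duplicates = 0
    points = []
    for total, (_crawler, url) in enumerate(fetch_log, start=1):
        if url in seen:
            duplicates += 1
        else:
            seen.add(url)
        if total % step == 0:
            points.append((total, 100.0 * duplicates / total))
    return points


# Made-up log: pages d and g are each fetched by both crawlers.
log = [("C1", "a"), ("C2", "e"), ("C1", "d"),
       ("C2", "d"), ("C1", "g"), ("C2", "g")]
for total, pct in overlap_curve(log, step=2):
    print(total, round(pct, 1))
```

The number of unique pages at any point is the total minus the duplicates, which is the overlap-free quantity needed when comparing cross-over mode against exchange mode.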
Comparing Exchange and Cross-Over modes
• Graph of the true data (overlap excluded) that the two crawlers have retrieved over time. The total data retrieved by each crawler is approximately the same, but once the overlap has been calculated and excluded from the cross-over mode, its retrieved data is less than that of the exchange mode.
[Figure: example graph; labels: Data retrieved, overlapping]