Web Crawlers Nutch
Agenda • What are web crawlers • Main policies in crawling • Nutch • Nutch architecture
Web crawlers • Crawl (visit) web pages and download them • Start from one page and determine which page(s) to visit next • This choice is what makes a crawler good or bad, efficient or inefficient • It depends mainly on the crawling policies used
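The loop described above (start from one page, download it, decide where to go next) can be sketched as follows. This is a minimal illustration, not a production crawler: `get_links` is a hypothetical callback standing in for real HTTP fetching and link extraction, `TOY_WEB` is an invented in-memory link graph, and breadth-first order is just one possible choice of "which page next".

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    `get_links(url)` is a hypothetical callback that downloads a page and
    returns the URLs it links to; it stands in for real HTTP fetching.
    """
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # pages already queued, to avoid repeats
    visited = []               # pages in the order they were downloaded
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A toy in-memory "web" (assumption) used in place of the network.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

order = crawl("a", lambda url: TOY_WEB.get(url, []))
```

Replacing the `deque` with a priority queue is where a selection policy would plug in: the frontier would then hand out the most important page rather than the oldest one.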
Crawl policies • Selection policy • Re-visit policy • Politeness policy • Parallelization policy
Selection policy • PageRank • Path-ascending crawling • Focused crawling
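As one illustration of a selection policy, path-ascending crawling is commonly described as also visiting every parent path of each URL the crawler finds. The sketch below only generates those parent URLs; the function name `ascending_paths` and the example URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def ascending_paths(url):
    """Return the parent-directory URLs of `url`, deepest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    result = []
    # Drop one trailing path segment at a time until only the root is left.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth > 0:
            path += "/"
        result.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return result

parents = ascending_paths("http://example.com/a/b/page.html")
```

Each returned URL would be added to the crawl frontier alongside the original link.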
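Re-visit policy • Freshness: whether the local copy still matches the live page • Age: how long the local copy has been out of date
Politeness policy • So that crawlers don’t overload web servers • Set a delay between GET requests
Parallelization policy • Distributed web crawling • To maximize download rate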
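The "delay between GET requests" idea from the politeness policy can be sketched as a small per-host gate. This is an illustrative sketch under assumptions: the class name `PolitenessGate`, the fixed delay, and the per-host granularity are all choices made here, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time
from urllib.parse import urlsplit

class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock          # returns the current time in seconds
        self.sleep = sleep          # blocks for the given number of seconds
        self.last_request = {}      # host -> time of the last request

    def wait(self, url):
        """Block until `url`'s host may be contacted; return seconds waited."""
        host = urlsplit(url).netloc
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited
```

A crawler would call `gate.wait(url)` immediately before each GET. Requests to different hosts are not delayed against each other, so a parallel crawler can keep its download rate up while each individual server sees at most one request per delay interval.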
Nutch • Is an open-source web crawler • Nutch Web Search Application • Maintains a DB of pages and links • Pages have scores, assigned by analysis • Fetches high-scoring, out-of-date pages • Distributed search front end • Based on Lucene • http://lucene.apache.org/nutch/
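The "fetches high-scoring, out-of-date pages" bullet can be illustrated with a small selection routine over a page database. This is not Nutch's code or API (Nutch itself is written in Java); `PageRecord`, `select_for_fetch`, and the `max_age` threshold are hypothetical names and parameters chosen only to show the idea.

```python
from dataclasses import dataclass

@dataclass
class PageRecord:
    url: str
    score: float         # importance assigned by link analysis
    last_fetched: float  # time of the last download, in seconds

def select_for_fetch(pages, now, max_age, top_n):
    """Pick the `top_n` highest-scoring pages whose copy is older than `max_age`."""
    stale = [p for p in pages if now - p.last_fetched > max_age]
    stale.sort(key=lambda p: p.score, reverse=True)
    return stale[:top_n]

# Invented sample data for illustration.
db = [
    PageRecord("http://example.com/", score=0.9, last_fetched=0.0),
    PageRecord("http://example.com/news", score=0.5, last_fetched=0.0),
    PageRecord("http://example.com/about", score=0.7, last_fetched=95.0),
]
fetch_list = select_for_fetch(db, now=100.0, max_age=10.0, top_n=2)
```

Here the "about" page is skipped despite its score because its copy is still recent, which is the combination of selection and re-visit policy the bullet describes.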