190 likes | 306 Views
Zhao Dongsheng 2008.9.29. Introduction to Nutch. Summary. What's Nutch Nutch's architecture How to use Nutch About the first homework. What's Nutch. Written in java Open-source project An Application that can build SE Behind a lot of web sites. What's Nutch. Lucene and Nutch
E N D
Zhao Dongsheng 2008.9.29 Introduction to Nutch
Summary • What's Nutch • Nutch's architecture • How to use Nutch • About the first homework
What's Nutch • Written in java • Open-source project • An Application that can build SE • Behind a lot of web sites
What's Nutch • Lucene and Nutch • Nutch grow out of Lucene • Both open-source project • Both written in java • But Lucene is a Java library for text indexing and search • Nutch is an Application • Nutch uses lucene for indexing
Nutch's core components • Fecher • Requests web pages • Parses and extracts links • Web DB • Page DB • Used for fetch sheduling • Link DB • Store link gragh • Store anchor text with each link • Link-analysis and Anchor text indexing
Nutch's core components (cont.) • Indexer • Creates inverted index • Uses Lucene • Searcher • Finds relelant docs quickly • Ranks the docs • Summarizing
Functions Nutch supports • Politeness when crawling • Duplicates removing • PageRank analysis • Distributed searching • Summarizing • ......
Nutch's Technical Goals • Fetch several billion pages per month • Maintain an index of these pages • Search that index up to 1000 times per second • Provide very high quality search results • Operate at minimal cost
Source code & API • Source Dirs • analysis crawl html plugin scoring segment tools fetcher indexer net parse protocol searcher ... • crawl/Crawl.java • fetcher/Fetcher.java
How to use Nutch • Download & unpack • Nutch required JVM • Set environment variables • Configure • Specify root URLs • Specify URLs filters • Optionally specify • Number of threads • Levels to crawl • Fetch delay
How to use Nutch (cont.) • Root URLs Example • http://www.pku.edu.cn • URL Filter Example • crawl-urlfilter.txt • -^(file|ftp|mailto): • -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$ • -[?*!@=] • +^http://([a-z0-9]*\.)*pku.edu.cn/
How to use Nutch (cont.) • Run Nutch • Just a command line • bin/nutch crawl myurl.txt -dir mycrawl -depth 4 >& crawl.log • Use Tomcat to experience!
About the first Homework • About web crawling • Familiar with Nutch & java • Fetch blog/bbs etc ? • Need your advice!
Q & A thanks!