Nutch in a Nutshell (part I)
Presented by Liew Guo Min and Zhao Jin
Outline
• Overview
• Nutch as a web crawler
• Nutch as a complete web search engine
• Special features
• Installation/Usage (with demo)
• Exercises
Overview
• Complete web search engine
  • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
• Java-based, open source
• Features:
  • Customizable
  • Extensible (next meeting)
  • Distributed (next meeting)
Nutch as a crawler
[Diagram: the Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a fetch list in a new segment; the Fetcher gets webpages/files from the web and writes them into the segment; the Parser reads and writes segment data; the CrawlDBTool updates the CrawlDB with the results.]
Nutch as a complete web search engine
[Diagram: the Indexer builds a Lucene index from the CrawlDB, the LinkDB, and the segments; the Searcher queries the Lucene index and serves results through the GUI (Tomcat).]
Special Features
• Customizable
  • Configuration files (XML)
    • Required user parameters (see the sketch below):
      • http.agent.name
      • http.agent.description
      • http.agent.url
      • http.agent.email
    • Adjustable parameters for every component, e.g. for the fetcher:
      • threads per host
      • threads per IP
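A minimal sketch of conf/nutch-site.xml setting the required agent parameters; all the values shown are placeholders:

    <?xml version="1.0"?>
    <configuration>
      <!-- Identify your crawler to the sites it visits -->
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>A test crawler for the Nutch tutorial</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler</value>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler@example.com</value>
      </property>
    </configuration>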
Special Features
• URL filters (text file)
  • Regular expressions to filter URLs during crawling, e.g.:
    • To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    • To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
  • (A sketch of a complete filter file follows)
• Plugin information (XML)
  • The metadata of the plugins (more details next week)
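A sketch of what a complete conf/crawl-urlfilter.txt might look like; rules are applied top to bottom, and the first rule whose pattern matches decides (+ accepts, - rejects):

    # skip URLs with common binary/image suffixes
    -\.(gif|jpg|png|ico|css|exe|zip|gz)$
    # skip URLs containing characters that often indicate session IDs, queries, etc.
    -[?*!@=]
    # accept anything in the apache.org domain
    +^http://([a-z0-9]*\.)*apache.org/
    # reject everything else
    -.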
Installation & Usage
• Installation
  • Software needed:
    • Nutch release
    • Java
    • Apache Tomcat (for the GUI)
    • Cygwin (for Windows)
Installation & Usage
• Usage (a command sketch follows this list)
  • Crawling
    • Initial URLs (text file or DMOZ file)
    • Required parameters (conf/nutch-site.xml)
    • URL filters (conf/crawl-urlfilter.txt)
  • Indexing
    • Automatic
  • Searching
    • Location of files (WAR file, index)
    • The Tomcat server
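A sketch of a typical crawl-and-search session; the seed directory urls/, the crawl depth/size, and the Tomcat paths are assumptions:

    # one-step crawl: inject, generate, fetch, parse, update and index
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # deploy the search GUI: Tomcat serves the Nutch WAR file;
    # point the webapp at the index (the usual approach is to set
    # searcher.dir in the webapp's nutch-site.xml, or to start Tomcat
    # from the directory that contains the crawl output)
    cp nutch-*.war $CATALINA_HOME/webapps/ROOT.war
    $CATALINA_HOME/bin/startup.sh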
Exercises
• Questions:
  • What are the things that need to be done before starting a crawl job with Nutch?
  • What are the ways to tell Nutch what to crawl and what not to? What can you do if you are the owner of a website?
  • Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  • What do you think are good crawling behaviors?
  • Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate search index rankings?
  • What are the advantages of using Nutch instead of commercial search engines?
Answers
• What are the things that need to be done before starting a crawl job with Nutch? (A shell sketch follows.)
  • Set the CLASSPATH to include the Lucene core
  • Set the JAVA_HOME path
  • Create a folder containing the URLs to be crawled
  • Amend the crawl-urlfilter file
  • Amend the nutch-site.xml file to include the user parameters
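A shell sketch of that setup; the JDK path and the seed URL are assumptions:

    export JAVA_HOME=/usr/lib/jvm/java-6-sun   # adjust to your JDK location
    mkdir urls
    echo "http://lucene.apache.org/nutch/" > urls/seed.txt
    # then edit conf/crawl-urlfilter.txt and conf/nutch-site.xml as shown earlier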
• What are the ways to tell Nutch what to crawl and what not to?
  • URL filters
  • Crawl depth
  • Scoring function for URLs
• What can you do if you are the owner of a website?
  • Web server administrators: use the Robots Exclusion Protocol by adding rules to /robots.txt (example below)
  • HTML authors: add the robots META tag
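For example, a /robots.txt that keeps all compliant crawlers out of a private directory, and the equivalent per-page META tag (the directory name is a placeholder):

    # /robots.txt
    User-agent: *
    Disallow: /private/

    <!-- in the page's <head> -->
    <meta name="robots" content="noindex,nofollow">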
• Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  • To ensure accountability (although tracing is still possible without them)
• What do you think are good crawling behaviors?
  • Be accountable
  • Test locally
  • Don't hog resources
  • Stay with it
  • Share results
• Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate search index rankings?
  • True, but one can always make changes in Nutch to minimize the effect.
• What are the advantages of using Nutch instead of commercial search engines?
  • Open source
  • Transparent
  • Able to define what is returned in searches and how the index is ranked
Exercises
• Hands-on exercises:
  • Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
  • Repeat the crawling process without using the crawl command (a sketch of the individual steps follows)
  • Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
    • Crawl only webpages and PDFs, but nothing else
    • Crawl the files on your hard disk
    • Crawl but do not parse
  • (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
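As a hint for the second exercise, a sketch of the step-by-step equivalent of the crawl command; the directory names (urls/, crawl/) are assumptions:

    bin/nutch inject crawl/crawldb urls                  # seed the CrawlDB
    bin/nutch generate crawl/crawldb crawl/segments      # create a fetch list
    s1=`ls -d crawl/segments/2* | tail -1`               # newest segment
    bin/nutch fetch $s1                                  # fetch (and parse) it
    bin/nutch updatedb crawl/crawldb $s1                 # fold results into the CrawlDB
    # repeat generate/fetch/updatedb for more depth, then build the link DB and index:
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*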
Next Meeting
• Special features
  • Extensible
  • Distributed
• Feedback and discussion
References
• http://lucene.apache.org/nutch/ -- official website
• http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
• http://lucene.apache.org/nutch/release/ -- Nutch source code
• www.nutchinstall.blogspot.com -- installation guide
• http://www.robotstxt.org/wc/robots.html -- the web robots pages