Nutch in a Nutshell (part I)
Presented by Liew Guo Min and Zhao Jin
Outline
• Overview
• Nutch as a web crawler
• Nutch as a complete web search engine
• Special features
• Installation/Usage (with demo)
• Exercises
Overview
• Complete web search engine
  • Nutch = Crawler + Indexer/Searcher (Lucene) + GUI + Plugins + MapReduce & Distributed FS (Hadoop)
• Java-based, open source
• Features:
  • Customizable
  • Extensible (next meeting)
  • Distributed (next meeting)
Nutch as a crawler
[Diagram: the Injector seeds the CrawlDB with the initial URLs; the Generator reads the CrawlDB and generates a fetch list in a new segment; the Fetcher gets webpages/files from the web and writes them into the segment; the Parser reads and writes segment data; the CrawlDBTool updates the CrawlDB with the results.]
Nutch as a complete web search engine
[Diagram: the Indexer builds a Lucene index from the CrawlDB, the LinkDB, and the segments; the Searcher queries the Lucene index and serves results through the GUI (Tomcat).]
Special Features
• Customizable
  • Configuration files (XML)
    • Required user parameters (see the sketch below):
      • http.agent.name
      • http.agent.description
      • http.agent.url
      • http.agent.email
    • Adjustable parameters for every component, e.g. for the fetcher:
      • threads per host
      • threads per IP
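A minimal sketch of conf/nutch-site.xml setting the required agent parameters; all the values shown are placeholders:

    <?xml version="1.0"?>
    <configuration>
      <!-- Identify your crawler to the sites it visits -->
      <property>
        <name>http.agent.name</name>
        <value>MyNutchCrawler</value>
      </property>
      <property>
        <name>http.agent.description</name>
        <value>A test crawler for the Nutch tutorial</value>
      </property>
      <property>
        <name>http.agent.url</name>
        <value>http://example.com/crawler</value>
      </property>
      <property>
        <name>http.agent.email</name>
        <value>crawler@example.com</value>
      </property>
    </configuration>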
Special Features
• URL filters (text file)
  • Regular expressions to filter URLs during crawling, e.g.:
    • To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    • To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
  • (A sketch of a complete filter file follows)
• Plugin information (XML)
  • The metadata of the plugins (more details next week)
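A sketch of what a complete conf/crawl-urlfilter.txt might look like; rules are applied top to bottom, and the first rule whose pattern matches decides (+ accepts, - rejects):

    # skip URLs with common binary/image suffixes
    -\.(gif|jpg|png|ico|css|exe|zip|gz)$
    # skip URLs containing characters that often indicate session IDs, queries, etc.
    -[?*!@=]
    # accept anything in the apache.org domain
    +^http://([a-z0-9]*\.)*apache.org/
    # reject everything else
    -.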
Installation & Usage
• Installation
  • Software needed:
    • Nutch release
    • Java
    • Apache Tomcat (for the GUI)
    • Cygwin (for Windows)
Installation & Usage
• Usage (a command sketch follows this list)
  • Crawling
    • Initial URLs (text file or DMOZ file)
    • Required parameters (conf/nutch-site.xml)
    • URL filters (conf/crawl-urlfilter.txt)
  • Indexing
    • Automatic
  • Searching
    • Location of files (WAR file, index)
    • The Tomcat server
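A sketch of a typical crawl-and-search session; the seed directory urls/, the crawl depth/size, and the Tomcat paths are assumptions:

    # one-step crawl: inject, generate, fetch, parse, update and index
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

    # deploy the search GUI: Tomcat serves the Nutch WAR file;
    # point the webapp at the index (the usual approach is to set
    # searcher.dir in the webapp's nutch-site.xml, or to start Tomcat
    # from the directory that contains the crawl output)
    cp nutch-*.war $CATALINA_HOME/webapps/ROOT.war
    $CATALINA_HOME/bin/startup.sh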
Exercises
• Questions:
  • What are the things that need to be done before starting a crawl job with Nutch?
  • What are the ways to tell Nutch what to crawl and what not to? What can you do if you are the owner of a website?
  • Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  • What do you think are good crawling behaviors?
  • Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate search index rankings?
  • What are the advantages of using Nutch instead of commercial search engines?
Answers
• What are the things that need to be done before starting a crawl job with Nutch? (A shell sketch follows.)
  • Set the CLASSPATH to include the Lucene core
  • Set the JAVA_HOME path
  • Create a folder containing the URLs to be crawled
  • Amend the crawl-urlfilter file
  • Amend the nutch-site.xml file to include the user parameters
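A shell sketch of that setup; the JDK path and the seed URL are assumptions:

    export JAVA_HOME=/usr/lib/jvm/java-6-sun   # adjust to your JDK location
    mkdir urls
    echo "http://lucene.apache.org/nutch/" > urls/seed.txt
    # then edit conf/crawl-urlfilter.txt and conf/nutch-site.xml as shown earlier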
• What are the ways to tell Nutch what to crawl and what not to?
  • URL filters
  • Crawl depth
  • Scoring function for URLs
• What can you do if you are the owner of a website?
  • Web server administrators: use the Robots Exclusion Protocol by adding rules to /robots.txt (example below)
  • HTML authors: add the robots META tag
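For example, a /robots.txt that keeps all compliant crawlers out of a private directory, and the equivalent per-page META tag (the directory name is a placeholder):

    # /robots.txt
    User-agent: *
    Disallow: /private/

    <!-- in the page's <head> -->
    <meta name="robots" content="noindex,nofollow">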
• Starting from v0.8, Nutch won't run unless some minimum user parameters, such as http.robots.agents, are set. What do you think is the reason behind this?
  • To ensure accountability (although tracing is still possible without them)
• What do you think are good crawling behaviors?
  • Be accountable
  • Test locally
  • Don't hog resources
  • Stay with it
  • Share results
• Do you think an open-source search engine like Nutch would make it easier for spammers to manipulate search index rankings?
  • True, but one can always make changes in Nutch to minimize the effect.
• What are the advantages of using Nutch instead of commercial search engines?
  • Open source
  • Transparent
  • Able to define what is returned in searches and how the index is ranked
Exercises
• Hands-on exercises:
  • Install Nutch, crawl a few webpages using the crawl command, and perform a search on them using the GUI
  • Repeat the crawling process without using the crawl command (a sketch of the individual steps follows)
  • Modify your configuration to perform each of the following crawl jobs, and think about when each would be useful:
    • Crawl only webpages and PDFs, but nothing else
    • Crawl the files on your hard disk
    • Crawl but do not parse
  • (Challenging) Modify Nutch so that you can unpack the crawled files in the segments back into their original state
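As a hint for the second exercise, a sketch of the step-by-step equivalent of the crawl command; the directory names (urls/, crawl/) are assumptions:

    bin/nutch inject crawl/crawldb urls                  # seed the CrawlDB
    bin/nutch generate crawl/crawldb crawl/segments      # create a fetch list
    s1=`ls -d crawl/segments/2* | tail -1`               # newest segment
    bin/nutch fetch $s1                                  # fetch (and parse) it
    bin/nutch updatedb crawl/crawldb $s1                 # fold results into the CrawlDB
    # repeat generate/fetch/updatedb for more depth, then build the link DB and index:
    bin/nutch invertlinks crawl/linkdb crawl/segments/*
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*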
Next Meeting
• Special features
  • Extensible
  • Distributed
• Feedback and discussion
References
• http://lucene.apache.org/nutch/ -- official website
• http://wiki.apache.org/nutch/ -- Nutch wiki (seriously outdated; take with a grain of salt)
• http://lucene.apache.org/nutch/release/ -- Nutch source code
• www.nutchinstall.blogspot.com -- installation guide
• http://www.robotstxt.org/wc/robots.html -- the web robots pages