Nutch Tutorial: Apache's Open-Source Search Engine Solution

Nutch Tutorial IST 516 Fall 2010 Dongwon Lee, Ph.D. Wonhong Nam, Ph.D.

What is Nutch?
• Apache has open-source solutions for two components of a search engine:
  • Crawler: Nutch
  • Indexer: Lucene, then Solr, then Lucene/Solr (merged in 2010)
• A project headed by Doug Cutting
• Goal: an open-source search engine scalable enough to index the entire web (billions of pages)
• Nutch includes a Java crawler, an HTML parser, the Lucene search/index library, and lots more

Features of Nutch
• Robot crawler, can use a proxy
• Includes hosts via grep, exclusion by host names and suffixes
• Continuous indexing
• FTP indexing login option
• Index logging options
• Flexible query parsing
• Includes a link-analysis module (mainly for multi-site search)
• Includes approximately fifteen relevance quality adjustment options
• Caches the original page for display

Workflow of Nutch
• There are two paths through a search engine: the index path and the query path
• The index path shows how the index is filled with documents
• Documents are fed to an analyzer, which transforms them into the appropriate weighted terms (or scores) and passes them to the IndexWriter
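To make the two paths concrete, here is a minimal Python sketch. It is not Nutch or Lucene code: the names `analyze`, `ToyIndexWriter`, and `search` are hypothetical, and plain term counts stand in for the weights a real analyzer would produce.

```python
import re
from collections import Counter, defaultdict


def analyze(text):
    """Toy analyzer: lowercase, split into word tokens, weight each term by its count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


class ToyIndexWriter:
    """Toy stand-in for an index writer: stores term -> {doc_id: weight}."""

    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Index path: document -> analyzer -> weighted terms -> index.
        for term, weight in analyze(text).items():
            self.postings[term][doc_id] = weight


def search(index, query):
    """Query path: analyze the query the same way, then sum the weights of matching terms."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, weight in index.postings.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


writer = ToyIndexWriter()
writer.add_document("d1", "Nutch is a crawler. Nutch uses Lucene.")
writer.add_document("d2", "Lucene is an index library.")
results = search(writer, "nutch lucene")
print(results)
```

The query goes through the same analyzer as the documents, so query terms and index terms line up.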

Connection Steps
• For security reasons, the ist516 server is accessible only from IST’s VLabs
• First, log in to IST’s VLabs environment
• Second, from VLabs, log in to the ist516 server

Connecting to VLabs
• From Windows/Mac Remote Desktop, log in to VLabs using your PSU ID/PWD
• Note the “UP\PSU-ID” format for the user name

Connecting to ist516.ist.psu.edu
• A UNIX server is prepared for proj #2
• ist516.ist.psu.edu (130.203.136.10)
• Can be accessed via the SSH protocol only
• If an SSH client is not pre-installed, get one from https://downloads.its.psu.edu/ under “File Transfer”

Connecting to ist516.ist.psu.edu
• If an SSH client is pre-installed in VLabs, use it
• In “Quick connect”, use the provided team ID/PWD

ist516.ist.psu.edu
• Tomcat (Apache’s Java servlet container and web server) and Nutch are already installed on the server
• Nutch is under each team's home directory (e.g., /home/team-ID/nutch-1.0)
• Modify the files under "nutch-1.0/conf" to change the behavior of Nutch as you wish

Running Tomcat and Nutch
• To start or stop the Tomcat server, type: start-tomcat or stop-tomcat
• To run Nutch, type at the command line: nutch
• You can also provide various parameters: nutch [parameters]
• The server has most typical UNIX software installed, including:
  • wget: downloads content from a URL
  • nano: a small editor that Windows users may find useful and familiar
  • Emacs: a full-fledged, powerful UNIX editor

Crawling in Nutch
• There are two approaches to crawling:
  • Intranet crawling, with the crawl command
  • Whole-web crawling, with much greater control, using the lower-level inject, generate, fetch, and updatedb commands
• Intranet crawling is more suitable for small-scale projects
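The four lower-level steps form a loop. The toy simulation below runs entirely in memory to show how they fit together. It only borrows the command names: `LINKS` is a made-up link graph, nothing is downloaded, and none of these functions are Nutch's actual API.

```python
# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/"],
}


def inject(crawldb, seeds):
    """Add seed URLs to the crawl database as not-yet-fetched."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb):
    """Build a fetch list from every URL that has not been fetched yet."""
    return sorted(url for url, status in crawldb.items() if status == "unfetched")


def fetch(fetchlist):
    """Pretend to download each page and return the links found on it."""
    return {url: LINKS.get(url, []) for url in fetchlist}


def updatedb(crawldb, fetched):
    """Mark fetched pages and record newly discovered links as unfetched."""
    for url, outlinks in fetched.items():
        crawldb[url] = "fetched"
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")


crawldb = {}
inject(crawldb, ["http://a.example/"])
rounds = 0
while True:
    fetchlist = generate(crawldb)
    if not fetchlist:
        break  # nothing left to fetch
    updatedb(crawldb, fetch(fetchlist))
    rounds += 1

print(rounds, sorted(crawldb))
```

Each pass through the loop goes one link deeper from the seeds, which is roughly what a depth limit controls in the one-step crawl command.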

1. Intranet Crawling
• Create a text file, say urlfile.txt, containing some seed URLs, e.g., http://pike.psu.edu/
• Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl
• E.g., to limit the crawl to the pike.psu.edu domain, the line should read: +^http://([a-z0-9]*\.)*pike.psu.edu/
• This will include any URL in the domain pike.psu.edu
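Before starting a crawl, it can help to check which URLs a filter line would accept. The sketch below tests the expression from this slide with Python's `re` module. It assumes that Nutch applies the expression as a partial match and that Python and Java regular expressions agree on this simple pattern. The candidate URLs are made up for illustration.

```python
import re

# The filter line from the slide: "+" means include; the rest is the regular expression.
RULE = r"+^http://([a-z0-9]*\.)*pike.psu.edu/"
pattern = re.compile(RULE[1:])

# Hypothetical candidate URLs to try against the rule.
candidates = [
    "http://pike.psu.edu/",
    "http://www.pike.psu.edu/classes/",
    "http://www.psu.edu/",
    "https://pike.psu.edu/",
]
accepted = [url for url in candidates if pattern.search(url)]
print(accepted)
```

Only the first two candidates pass: the rule requires the http scheme and a host ending in pike.psu.edu. Because the dots in pike.psu.edu are unescaped, they match any character, so a host such as pikexpsu.edu would also pass. Writing pike\.psu\.edu is stricter.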

1. Intranet Crawling
• Edit the file conf/nutch-site.xml accordingly
• At a minimum, insert the following property and fill in a proper value:
<property>
  <name>http.agent.name</name>
  <value>YOUR-CRAWLER-NAME-HERE</value>
  <description></description>
</property>
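A quick way to confirm that the property is set is to parse the file and look it up. The sketch below does this on an in-memory example. It assumes the usual layout of a `<configuration>` root holding `<property>` elements. The agent name `ist516-team1-crawler`, the description text, and the helper `read_properties` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Example file content; "ist516-team1-crawler" is a made-up agent name.
SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>ist516-team1-crawler</value>
    <description>Name the crawler reports to web servers.</description>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        props[name] = value
    return props


props = read_properties(SITE_XML)
agent = props.get("http.agent.name", "")
print("http.agent.name =", agent if agent else "(missing: set it before crawling)")
```

To check a real file, read conf/nutch-site.xml into a string and pass it to `read_properties`.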

1. Intranet Crawling
• Use the crawl command for crawling. Its options include:
  • -dir: names the directory to put the crawl in
  • -depth: indicates the link depth from the root page that should be crawled
  • -delay: determines the number of seconds between accesses to each host
  • -threads: determines the number of threads that will fetch in parallel
• E.g., a typical call might be:
> nutch crawl urlfile.txt -dir crawl.test -depth 3 >& log
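If you launch crawls from a script, it is safer to assemble the command as an argument list than to paste strings together. The sketch below builds the call shown above using only options named on this slide, and prints it without running it. The helper `build_crawl_command` is hypothetical.

```python
import shlex


def build_crawl_command(url_file, crawl_dir, depth=None, threads=None):
    """Assemble the argument list for the crawl command from the options named above."""
    argv = ["nutch", "crawl", url_file, "-dir", crawl_dir]
    if depth is not None:
        argv += ["-depth", str(depth)]
    if threads is not None:
        argv += ["-threads", str(threads)]
    return argv


argv = build_crawl_command("urlfile.txt", "crawl.test", depth=3)
command_line = shlex.join(argv)  # quoted for display; nothing is executed here
print(command_line)
```

To actually run it on the server, one option is to pass `argv` to `subprocess.run` with `stdout` pointed at an open log file and `stderr=subprocess.STDOUT`, which mirrors the `>& log` redirection.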

1. Intranet Crawling
• The indexer uses the downloaded content to generate an inverted index of all terms and all pages
• The document set is divided into a set of index segments, each of which is fed to a single searcher process
• Each searcher also draws upon the Web content fetched earlier, so it can provide a cached copy of any Web page
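The segment-and-searcher arrangement can be pictured with a small in-memory model. This is an illustrative sketch, not Nutch code: `PAGES`, `ToySearcher`, and `split_into_segments` are made up, and the page texts are invented.

```python
from collections import defaultdict

# Hypothetical fetched pages: URL -> page text.
PAGES = {
    "http://pike.psu.edu/": "pike research group at penn state",
    "http://pike.psu.edu/classes/": "classes on search engines",
    "http://pike.psu.edu/pubs/": "research papers on search",
}


class ToySearcher:
    """One searcher owns one index segment plus the cached content of its pages."""

    def __init__(self, pages):
        self.cache = dict(pages)
        self.index = defaultdict(set)  # inverted index: term -> set of URLs
        for url, text in pages.items():
            for term in text.split():
                self.index[term].add(url)

    def search(self, term):
        return self.index.get(term, set())

    def cached(self, url):
        return self.cache.get(url)


def split_into_segments(pages, n):
    """Deal the pages round-robin into n segments."""
    segments = [{} for _ in range(n)]
    for i, url in enumerate(sorted(pages)):
        segments[i % n][url] = pages[url]
    return segments


searchers = [ToySearcher(seg) for seg in split_into_segments(PAGES, 2)]
# A query is sent to every searcher and the per-segment hits are merged.
hits = sorted(url for s in searchers for url in s.search("search"))
# The searcher that holds a hit can also return its cached copy.
cached_copy = next(s.cached(hits[0]) for s in searchers if s.cached(hits[0]) is not None)
print(hits, cached_copy)
```

Each searcher sees only its own segment, so a complete answer requires asking all of them.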

2. Internet Crawling
• Whole-web crawling requires more steps than intranet crawling
• Explore it for your proj #2
• Refer to: http://wiki.apache.org/nutch/NutchTutorial

3. Searching
• Tomcat is installed, and each group has its own webapp directory, which holds the Nutch war file
• To search, put the Nutch war file into your servlet container:
> cp ~/nutch-1.0/nutch*.war ~/tomcat/webapps/ROOT.war
• Go to the directory that your crawler created and start the Tomcat server:
> cd crawl.test
> start-tomcat

3. Searching
• Connect your browser to http://ist516.ist.psu.edu:900?, where ? is your group number
• E.g., Team 1: http://ist516.ist.psu.edu:9001/
• To access this URL, first log in to VLabs (vlabs.up.ist.psu.edu, with your PSU ID/PWD) and browse from there
• Refer to the VLabs Tutorial for more details: http://pike.psu.edu/classes/ist516/2010-fall/s/slides/vlabs-tutorial.ppt


Editing Nutch Look
• To change the look and feel of the search interface:
• search.html is generated automatically, so do not edit it
• Instead, change the XML files directly:
  • ~/nutch-1.0/src/web/pages/en/search.xml
  • ~/nutch-1.0/src/web/pages/en/about.xml
  • ~/nutch-1.0/src/web/pages/en/help.xml
• For more details on how to edit the Nutch look, see: http://www.stevekallestad.com/wiki/Editing_nutch

Reference
• Apache’s Official Nutch Tutorial: http://wiki.apache.org/nutch/NutchTutorial
• Peter Wang’s Nutch Tutorial: http://zillionics.com/resources/articles/NutchGuideForDummies.htm
• IST 441’s Nutch Tutorial: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
