Nutch Tutorial: Apache's Open-Source Search Engine Solution

Learn about Nutch, an open-source solution for web crawling and indexing using Lucene/Solr components. Explore its features, connection steps, crawling techniques, and searching capabilities. Access the server and modify settings to customize Nutch behavior for your needs.


Presentation Transcript

  1. Nutch Tutorial. IST 516, Fall 2010. Dongwon Lee, Ph.D., and Wonhong Nam, Ph.D.

  2. What is Nutch?
     • Apache offers open-source solutions for the two main components of a search engine:
     • Crawler: Nutch
     • Indexer: Lucene and Solr (merged into Lucene/Solr in 2010)
     • A project headed by Doug Cutting
     • Goal: an open-source search engine scalable enough to index the entire web (billions of pages)
     • Nutch includes a Java crawler, an HTML parser, the Lucene search/index library, and lots more

  3. Features of Nutch
     • Robot crawler; can use a proxy
     • Includes hosts via grep; excludes by host names and suffixes
     • Continuous indexing
     • FTP indexing login option
     • Index logging options
     • Flexible query parsing
     • Includes a link-analysis module (mainly for multi-site search)
     • Includes approximately fifteen relevance quality adjustment options
     • Caches the original page for display

  4. Workflow of Nutch
     • There are two paths through a search engine: the index path and the query path
     • The index path shows how the index gets filled with documents
     • The documents are fed to an analyzer, which transforms them into the appropriate weighted terms (or scores) and passes them to the IndexWriter

  5. Connection Steps
     • For security reasons, the ist516 server is accessible only from IST’s VLabs
     • First, log in to IST’s VLabs environment
     • Second, from VLabs, log in to the ist516 server

  6. Connecting to VLabs
     • From a Windows/Mac remote-desktop client, log in to VLabs using your PSU ID/PWD
     • Note the “UP\PSU-ID” format for the user name

  7. Connecting to ist516.ist.psu.edu
     • A UNIX server is prepared for proj #2: ist516.ist.psu.edu
     • It can be accessed via the SSH protocol only
     • If no SSH client is pre-installed, get one from https://downloads.its.psu.edu/ (see “File Transfer”)

  8. Connecting to ist516.ist.psu.edu
     • If an SSH client is pre-installed in VLabs, use it
     • Choose “Quick connect” and use the provided team ID/PWD

  9. ist516.ist.psu.edu
     • Tomcat (Apache’s Java servlet container and web server) and Nutch are already installed on the server
     • Nutch is under each team's home directory (eg, /home/team-ID/nutch-1.0)
     • Modify the files under "nutch-1.0/conf" to change the behavior of Nutch as you wish

  10. Running Tomcat and Nutch
     • To start or stop the Tomcat server, type: start-tomcat or stop-tomcat
     • To run Nutch, type at the command line: nutch
     • You can also provide various parameters: nutch [parameters]
     • The server has most typical UNIX software installed, including:
     • wget: downloads things from a URL
     • nano: a small editor that Windows users may find useful/familiar
     • Emacs: a full-fledged, powerful UNIX editor

  11. Crawling in Nutch
     • There are two approaches to crawling:
     • Intranet crawling, with the crawl command
     • Whole-web crawling, with much greater control, using the lower-level inject, generate, fetch and updatedb commands
     • Intranet crawling is more suitable for small-scale projects

  12. 1. Intranet Crawling
     • Create a text file, say urlfile.txt, containing some seed URLs. Eg, http://pike.psu.edu/
     • Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl
     • Eg, to limit the crawl to the pike.psu.edu domain, the line should read:
       +^http://([a-z0-9]*\.)*pike.psu.edu/
     • This will include any URL in the domain pike.psu.edu
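
The filter line above can be tried out before crawling. The following is a minimal sketch, assuming `grep -E` treats this particular pattern the same way Nutch's Java regular expressions do; the `check_url` helper is hypothetical and not part of Nutch. The leading "+" in the filter file marks an include rule and is not part of the pattern itself.

```shell
#!/bin/sh
# Pattern from conf/crawl-urlfilter.txt, without the leading "+" (the include marker).
pattern='^http://([a-z0-9]*\.)*pike.psu.edu/'

# Hypothetical helper: prints "accept" if the URL matches the pattern, else "reject".
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://pike.psu.edu/'              # the domain itself
check_url 'http://www.pike.psu.edu/classes/'  # a subdomain of pike.psu.edu
check_url 'http://www.psu.edu/'               # outside pike.psu.edu
```

The first two URLs print accept and the third prints reject. The real filter file holds other rules as well, so this only checks the single line edited here.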

  13. 1. Intranet Crawling
     • Edit the file conf/nutch-site.xml accordingly
     • At a minimum, insert the following property and set an appropriate value for it:
       <property>
         <name>http.agent.name</name>
         <value>YOUR-CRAWLER-NAME-HERE</value>
         <description></description>
       </property>
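
For context, a complete file around that property might look like the sketch below. It assumes the usual `<configuration>` root element of Nutch configuration files and writes to a scratch directory rather than the real conf directory; `CONF_DIR`, `AGENT_NAME` and the description text are illustrative choices, not values from the course.

```shell
#!/bin/sh
# Write a minimal nutch-site.xml into a scratch directory (not the real conf/ directory).
conf_dir="${CONF_DIR:-$(mktemp -d)}"
agent_name="${AGENT_NAME:-YOUR-CRAWLER-NAME-HERE}"

mkdir -p "$conf_dir"
cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
    <description>Name the crawler reports to web servers.</description>
  </property>
</configuration>
EOF

# Show where the file was written so it can be inspected or copied.
echo "wrote $conf_dir/nutch-site.xml"
```

After checking the result, the file can be copied over nutch-1.0/conf/nutch-site.xml by hand.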

  14. 1. Intranet Crawling
     • Use the crawl command for crawling. Its options include:
     • -dir: names the directory to put the crawl in
     • -depth: indicates the link depth from the root page that should be crawled
     • -delay: determines the number of seconds between accesses to each host
     • -threads: determines the number of threads that will fetch in parallel
     • Eg, a typical call might be:
       > nutch crawl urlfile.txt -dir crawl.test -depth 3 >& log

  15. 1. Intranet Crawling
     • The indexer uses the downloaded contents to generate an inverted index of all terms and all pages
     • The document set is divided into a set of index segments, each of which is fed to a single searcher process
     • Each searcher also draws upon the Web content fetched earlier, so it can provide a cached copy of any Web page
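
To make "inverted index" concrete, here is a toy sketch in shell and awk. It is not how Lucene stores or scores its index; it only illustrates the mapping from each term to the pages that contain it. The `build_index` function and the two sample pages are made up for illustration.

```shell
#!/bin/sh
# Toy inverted index: map each lowercase term to the list of documents containing it.
# Input on stdin: one document per line, formatted as "doc_id<TAB>text".
build_index() {
  awk -F '\t' '{
      # Split the text into lowercase alphanumeric terms and emit "term<TAB>doc_id".
      n = split(tolower($2), words, /[^a-z0-9]+/)
      for (i = 1; i <= n; i++)
        if (words[i] != "") print words[i] "\t" $1
    }' | LC_ALL=C sort -u | awk -F '\t' '
      # Input is sorted by term, so equal terms are adjacent: group their doc ids.
      $1 != term { if (term != "") print term ": " docs; term = $1; docs = $2; next }
      { docs = docs " " $2 }
      END { if (term != "") print term ": " docs }'
}

printf 'page1\tNutch crawls the web\npage2\tLucene indexes the web\n' | build_index
```

Terms that occur in both sample pages, such as "the" and "web", are listed with both page ids; a query for a term then only needs to look up one line instead of scanning every page.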

  16. 2. Internet Crawling
     • Requires more steps than intranet crawling
     • Explore it for your proj #2
     • Refer to: http://wiki.apache.org/nutch/NutchTutorial
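
As a starting point, one round of the lower-level inject, generate, fetch and updatedb commands could be sketched as follows. The argument layout is an assumption based on the official tutorial and should be checked against it; the directory names, the `run` wrapper and the `NEWEST` placeholder are hypothetical. By default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch of a whole-web crawl round. Commands are printed, not executed,
# unless DRY_RUN=0 is set on a machine where `nutch` is on the PATH.
DRY_RUN="${DRY_RUN:-1}"
crawl_dir="${CRAWL_DIR:-crawl.test}"   # hypothetical output directory
seed_dir="${SEED_DIR:-urls}"           # hypothetical directory holding seed URL files

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Done once: load the seed URLs into the crawl database.
inject_seeds() {
  run nutch inject "$crawl_dir/crawldb" "$seed_dir"
}

# Repeated per round: pick URLs to fetch, fetch them, fold the results back in.
crawl_round() {
  run nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments"
  if [ "$DRY_RUN" = 1 ]; then
    segment="$crawl_dir/segments/NEWEST"   # placeholder; no segment exists in a dry run
  else
    segment=$(ls -d "$crawl_dir"/segments/* | tail -1)
  fi
  run nutch fetch "$segment"
  run nutch updatedb "$crawl_dir/crawldb" "$segment"
}

inject_seeds
crawl_round
```

Calling `crawl_round` several times roughly corresponds to increasing the crawl depth. Further steps, such as building the search index, are left to the official tutorial.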

  17. 3. Searching
     • Tomcat is installed, and each group has its own webapp directory, which holds the nutch war file
     • To search, put the nutch war file into your servlet container:
       > cp ~/nutch-1.0/nutch*.war ~/tomcat/webapps/ROOT.war
     • Go to the directory that your crawler created and run the Tomcat server:
       > cd crawl.test
       > start-tomcat

  18. 3. Searching
     • Connect your browser to http://ist516.ist.psu.edu:900?, where ? is your group number
     • Eg, Team1: http://ist516.ist.psu.edu:9001/
     • To access this URL, first log in to VLabs and connect from there: vlabs.up.ist.psu.edu + PSU ID/PWD
     • Refer to the VLabs Tutorial for more details: http://pike.psu.edu/classes/ist516/2010-fall/s/slides/vlabs-tutorial.ppt

  19. 3. Searching

  20. Editing Nutch Look
     • To change the look & feel of the search interface:
     • search.html is generated automatically, so do not edit it
     • Instead, change the XML files directly:
     • ~/nutch-1.0/src/web/pages/en/search.xml
     • ~/nutch-1.0/src/web/pages/en/about.xml
     • ~/nutch-1.0/src/web/pages/en/help.xml
     • For more details on how to edit the Nutch look, see: http://www.stevekallestad.com/wiki/Editing_nutch

  21. Reference
     • Apache’s Official Nutch Tutorial: http://wiki.apache.org/nutch/NutchTutorial
     • Peter Wang’s Nutch Tutorial: http://zillionics.com/resources/articles/NutchGuideForDummies.htm
     • IST 441’s Nutch Tutorial: http://clgiles.ist.psu.edu/IST441/materials/nutch-lucene/nutch-crawling-and-searching.pdf
