
Lucene & Nutch

Presentation Transcript


  1. Lucene & Nutch
  • Lucene
    • Project name
    • Started as a text indexing engine
  • Nutch
    • A complete web search engine, including crawling, indexing, and searching
    • Indexes 100M+ pages, crawls >10M pages/day
    • Provides a distributed architecture
    • Written in Java
    • Ports to other languages are works in progress

  2. Lucene
  • Open source search project: http://lucene.apache.org
  • Index and search local files (a small API sketch follows)
    • Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
    • Extract the files
    • Build an index for a directory:
      java org.apache.lucene.demo.IndexFiles dir_path
    • Try a search at the command line:
      java org.apache.lucene.demo.SearchFiles
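
  The demo classes above are thin wrappers around the Lucene API. Below is a minimal sketch of doing the same thing programmatically with the Lucene 2.2-era API; the index directory, field names, and document contents are placeholders, not part of the original slides.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            // Build an index in ./index ("true" creates a fresh index).
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("path", "docs/hello.txt", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", "hello lucene and nutch", Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Search the "contents" field for the term "nutch".
            IndexSearcher searcher = new IndexSearcher("index");
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse("nutch");
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("path"));
            }
            searcher.close();
        }
    }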

  3. Deploy Lucene
  • Copy luceneweb.war to your {tomcat-home}/webapps
  • Browse to http://localhost:8080/luceneweb; Tomcat will deploy the web app
  • Edit webapps/luceneweb/configuration.jsp and point "indexLocation" to your indexes (sketch below)
  • Search at http://localhost:8080/luceneweb
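
  The edit to configuration.jsp amounts to setting a single variable. A sketch, assuming the variable is called indexLocation as on the slide; the path is an example, and the exact contents of the file differ between Lucene releases.

    <%
      // webapps/luceneweb/configuration.jsp
      // Point this at the directory created by org.apache.lucene.demo.IndexFiles.
      String indexLocation = "/home/user/lucene-2.2.0/index";
    %>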

  4. Nutch
  • A complete search engine: http://lucene.apache.org/nutch/release/
  • Modes
    • Intranet/local search
    • Internet search
  • Usage
    • Crawl
    • Index
    • Search

  5. Intranet Search
  • Configuration
    • Input URLs: create a directory and a seed file
      $ mkdir urls
      $ echo http://www.cs.ucsb.edu > urls/ucsb
    • Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
    • Edit conf/nutch-site.xml (see the sketch after this slide)
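
  A sketch of the two edits for a cs.ucsb.edu crawl, assuming the stock Nutch 0.9 configuration files; the agent name value is a placeholder (Nutch refuses to fetch if http.agent.name is left empty).

    # conf/crawl-urlfilter.txt: accept only URLs inside the cs.ucsb.edu domain
    +^http://([a-z0-9]*\.)*cs.ucsb.edu/

    <!-- conf/nutch-site.xml: identify your crawler to the sites it fetches -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>my-test-crawler</value>
      </property>
    </configuration>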

  6. Intranet: Running the Crawl
  • Crawl options include:
    • -dir dir names the directory to put the crawl in
    • -threads threads determines the number of threads that will fetch in parallel
    • -depth depth indicates the link depth from the root page that should be crawled
    • -topN N determines the maximum number of pages that will be retrieved at each level, up to the depth
  • E.g. $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50

  7. Intranet Search
  • Deploy the Nutch war file
    • rm -rf TOMCAT_DIR/webapps/ROOT*
    • cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
  • The webapp finds indexes in ./crawl, relative to where you start Tomcat
    • TOMCAT_DIR/bin/catalina.sh start
  • Search at http://localhost:8080/
  • CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080

  8. Internet Crawling
  • Concepts (a typical on-disk layout follows)
    • crawldb: status and metadata for every known URL
    • linkdb: list of known links to each URL
    • segments: each is a set of URLs that are fetched as a unit
    • indexes: Lucene-format indexes
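
  After a crawl these four structures appear as subdirectories of the crawl directory. A typical layout (segment names are fetch timestamps; the ones shown are examples):

    crawl/
      crawldb/            status and metadata for every known URL
      linkdb/             inbound links for each URL
      segments/
        20070801121523/   one segment per generate/fetch round
        20070802093011/
      indexes/            Lucene-format indexes built from the segments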

  9. Internet Crawling Process
  • 1. Get seed URLs
  • 2. Fetch
  • 3. Update the crawl DB
  • 4. Compute the top-scoring URLs, then go back to step 2 (see the loop sketch below)
  • 5. Create the index
  • 6. Deploy
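
  Steps 2-4 form a loop. A minimal shell sketch of that loop, built from the commands on the following slides; the number of rounds and the -topN value are examples.

    # Whole-web crawl loop: generate a fetchlist, fetch it, fold the results back in.
    for round in 1 2 3; do
      bin/nutch generate kids/crawldb kids/segments -topN 50000
      s1=`ls -d kids/segments/2* | tail -1`   # newest segment
      bin/nutch fetch $s1
      bin/nutch updatedb kids/crawldb $s1
    done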

  10. Seed URLs
  • URLs from the DMOZ Open Directory
    • wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
    • gunzip content.rdf.u8.gz
    • mkdir dmoz
    • bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
  • Kids-search URLs from ask.com
  • Inject the URLs
    • bin/nutch inject kids/crawldb 67k-url/
  • Edit conf/nutch-site.xml

  11. Fetch
  • Generate a fetchlist from the database
    • $ bin/nutch generate kids/crawldb kids/segments
  • Save the name of the fetchlist in variable s1
    • s1=`ls -d kids/segments/2* | tail -1`
  • Run the fetcher on this segment
    • bin/nutch fetch $s1

  12. Update Crawl DB and Re-fetch
  • Update the crawl db with the results of the fetch
    • bin/nutch updatedb kids/crawldb $s1
  • Generate the top-scoring 50K pages
    • bin/nutch generate kids/crawldb kids/segments -topN 50000
  • Re-fetch
    • s1=`ls -d kids/segments/2* | tail -1`
    • bin/nutch fetch $s1

  13. Index, Deploy, and Search
  • Invert the links (builds the link database used when indexing)
    • bin/nutch invertlinks kids/linkdb kids/segments/*
  • Index the segments
    • bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
  • Deploy and search
    • Same as in intranet search
  • Demo of 1M pages (570K + 500K)

  14. Issues
  • The default crawling cycle is 30 days for all URLs (see the config sketch below)
  • Duplicates are pages that have the same URL or the same MD5 hash of the page content
  • The JavaScript parser uses a regular expression to extract URL literals from code
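
  The 30-day cycle can be overridden in conf/nutch-site.xml. A sketch, assuming the Nutch 0.x property name db.default.fetch.interval (measured in days); check conf/nutch-default.xml for the exact name and units in your release.

    <!-- conf/nutch-site.xml: re-fetch pages every 7 days instead of the default 30 -->
    <property>
      <name>db.default.fetch.interval</name>
      <value>7</value>
    </property>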
