Lucene & Nutch
• Lucene
  • Project name
  • Started as a text indexing engine
• Nutch
  • A complete web search engine, including crawling, indexing, and searching
  • Indexes 100M+ pages, crawls >10M pages/day
  • Provides a distributed architecture
  • Written in Java
  • Ports to other languages are works in progress
Lucene
• Open source search project: http://lucene.apache.org
• Indexes & searches local files
• Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
• Extract the files
• Build an index for a directory:
  java org.apache.lucene.demo.IndexFiles dir_path
• Try a search at the command line:
  java org.apache.lucene.demo.SearchFiles
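The demo classes must be on the classpath first; the slide omits that step. A minimal sketch, assuming the 2.2.0 archive unpacks to lucene-2.2.0/ with lucene-core-2.2.0.jar and lucene-demo-2.2.0.jar at its top level:

  # Unpack the release and put the core and demo jars on the classpath
  tar xzf lucene-2.2.0.tar.gz
  cd lucene-2.2.0
  export CLASSPATH=lucene-core-2.2.0.jar:lucene-demo-2.2.0.jar

  # Index the src/ directory; the demo writes the index to ./index
  java org.apache.lucene.demo.IndexFiles src

  # Interactive search over ./index: type a query, get matching file paths
  java org.apache.lucene.demo.SearchFiles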
Deploy Lucene
• Copy luceneweb.war to your {tomcat-home}/webapps
• Browse to http://localhost:8080/luceneweb
  • Tomcat will deploy the web app
• Edit webapps/luceneweb/configuration.jsp
  • Point "indexLocation" to your indexes
• Search at http://localhost:8080/luceneweb
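Put together, the deployment is a handful of commands; a sketch assuming Tomcat is installed at /opt/tomcat and the demo index sits in /home/user/index (both paths are placeholders):

  cp luceneweb.war /opt/tomcat/webapps/
  # Request the app once so Tomcat unpacks the war into webapps/luceneweb
  curl http://localhost:8080/luceneweb/
  # Then set indexLocation in configuration.jsp to the index path,
  # e.g. indexLocation = "/home/user/index"
  vi /opt/tomcat/webapps/luceneweb/configuration.jsp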
Nutch
• A complete search engine: http://lucene.apache.org/nutch/release/
• Modes
  • Intranet/local search
  • Internet search
• Usage
  • Crawl
  • Index
  • Search
Intranet Search
• Configuration
  • Input URLs: create a directory and a seed file
    $ mkdir urls
    $ echo http://www.cs.ucsb.edu > urls/ucsb
  • Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
  • Edit conf/nutch-site.xml (see the sketch after this list)
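A sketch of the two edits. The URL-filter line follows the stock pattern shipped in crawl-urlfilter.txt; the agent name in nutch-site.xml is a placeholder value, but some value is required before Nutch will fetch at all:

  # conf/crawl-urlfilter.txt — accept only URLs within cs.ucsb.edu
  +^http://([a-z0-9]*\.)*cs.ucsb.edu/

  <!-- conf/nutch-site.xml — identify the crawler to web servers -->
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>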
Intranet: Running the Crawl
• Crawl options include:
  • -dir dir names the directory to put the crawl in
  • -threads threads determines the number of threads that will fetch in parallel
  • -depth depth indicates the link depth from the root page that should be crawled
  • -topN N determines the maximum number of pages that will be retrieved at each level, up to the depth
• E.g.: $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Intranet Search
• Deploy the Nutch war file
  • rm -rf TOMCAT_DIR/webapps/ROOT*
  • cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
• The webapp finds indexes in ./crawl, relative to where you start Tomcat
  • TOMCAT_DIR/bin/catalina.sh start
• Search at http://localhost:8080/
• CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080
Internet Crawling
• Concepts (a typical on-disk layout is sketched below)
  • crawldb: all URL info
  • linkdb: list of known links to each URL
  • segments: each is a set of URLs that are fetched as a unit
  • indexes: Lucene-format indexes
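After a crawl, these structures sit side by side under the crawl directory; a sketch of a typical layout (the timestamped segment names are illustrative):

  crawl/
    crawldb/            # fetch state and score for every known URL
    linkdb/             # incoming links for each URL
    segments/
      20071011123456/   # one generate/fetch round, named by timestamp
      20071012083000/
    indexes/            # Lucene-format indexes built from the segments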
Internet Crawling Process
1. Get seed URLs
2. Fetch
3. Update crawl DB
4. Compute top URLs, go to step 2 (this loop is sketched below)
5. Create index
6. Deploy
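Steps 2–4 form the core loop; a minimal sketch using the Nutch 0.9 commands from the following slides (the directory names and the number of rounds are placeholders):

  # Repeat the generate/fetch/update cycle for a few rounds
  for i in 1 2 3; do
    bin/nutch generate kids/crawldb kids/segments -topN 50000  # pick top URLs
    s=`ls -d kids/segments/2* | tail -1`                       # newest segment
    bin/nutch fetch $s                                         # fetch its pages
    bin/nutch updatedb kids/crawldb $s                         # fold results back in
  done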
Seed URLs
• URLs from the DMOZ Open Directory
  • wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
  • gunzip content.rdf.u8.gz
  • mkdir dmoz
  • bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
• Kids-search URLs from ask.com
• Inject the URLs into the crawl db
  • bin/nutch inject kids/crawldb 67k-url/
• Edit conf/nutch-site.xml
Fetch
• Generate a fetchlist from the database
  • $ bin/nutch generate kids/crawldb kids/segments
• Save the name of the fetchlist in variable s1
  • s1=`ls -d kids/segments/2* | tail -1`
• Run the fetcher on this segment
  • bin/nutch fetch $s1
Update Crawl DB and Re-fetch
• Update the crawl db with the results of the fetch
  • bin/nutch updatedb kids/crawldb $s1
• Generate a fetchlist of the top-scoring 50K pages
  • bin/nutch generate kids/crawldb kids/segments -topN 50000
• Re-fetch
  • s1=`ls -d kids/segments/2* | tail -1`
  • bin/nutch fetch $s1
Index, Deploy, and Search
• Invert the links (build the link database, so incoming anchor text can be indexed with its target pages)
  • bin/nutch invertlinks kids/linkdb kids/segments/*
• Index the segments
  • bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
• Deploy & search
  • Same as in intranet search
• Demo of 1M pages (570K + 500K)
Issues
• The default re-crawl cycle is 30 days, applied uniformly to all URLs
• Duplicates are pages with the same URL or the same MD5 hash of page content
• The JavaScript parser uses regular expressions to extract URL literals from code
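The 30-day cycle is configurable; a sketch of overriding it in conf/nutch-site.xml, assuming the Nutch 0.9 property name db.default.fetch.interval with its value in days (check nutch-default.xml for the exact name in your release):

  <!-- conf/nutch-site.xml: re-fetch pages every 7 days instead of 30 -->
  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
  </property>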