Lucene & Nutch
• Lucene
  • Project name
  • Started as a text indexing engine
• Nutch
  • A complete web search engine, including crawling, indexing, and searching
  • Indexes 100M+ pages, crawls >10M pages/day
  • Provides a distributed architecture
• Written in Java
  • Ports to other languages are works in progress
Lucene
• Open-source search project: http://lucene.apache.org
• Index & search local files
  • Download lucene-2.2.0.tar.gz from http://www.apache.org/dyn/closer.cgi/lucene/java/
  • Extract the files
  • Build an index for a directory:
    java org.apache.lucene.demo.IndexFiles dir_path
  • Try a search at the command line:
    java org.apache.lucene.demo.SearchFiles
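The two demo commands need the Lucene jars on the classpath. A minimal end-to-end run might look like the sketch below; the jar names and the src target directory are assumptions based on the 2.2.0 distribution layout:

  cd lucene-2.2.0
  export CLASSPATH=lucene-core-2.2.0.jar:lucene-demos-2.2.0.jar
  java org.apache.lucene.demo.IndexFiles src    # builds the index under ./index
  java org.apache.lucene.demo.SearchFiles       # reads queries from stdin, searches ./index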
Deploy Lucene
• Copy luceneweb.war to your {tomcat-home}/webapps
• Browse to http://localhost:8080/luceneweb
  • Tomcat will deploy the web app
• Edit webapps/luceneweb/configuration.jsp
  • Point "indexLocation" to your indexes
• Search at http://localhost:8080/luceneweb
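Put together, deployment is a short shell sequence; this is a sketch, with TOMCAT_HOME standing in for your Tomcat install and the index path only an example:

  cp luceneweb.war $TOMCAT_HOME/webapps/
  $TOMCAT_HOME/bin/catalina.sh start
  # after Tomcat expands the war, edit
  #   $TOMCAT_HOME/webapps/luceneweb/configuration.jsp
  # and set indexLocation to your index directory, e.g. /home/you/index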
Nutch
• A complete search engine: http://lucene.apache.org/nutch/release/
• Modes
  • Intranet/local search
  • Internet search
• Usage
  • Crawl
  • Index
  • Search
Intranet Search
• Configuration
  • Input URLs: create a directory and a seed file
    $ mkdir urls
    $ echo http://www.cs.ucsb.edu > urls/ucsb
  • Edit conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with cs.ucsb.edu
  • Edit conf/nutch-site.xml (see the notes below)
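For reference, the stock crawl-urlfilter.txt contains a domain filter with a placeholder; after the edit for this example it reads:

  +^http://([a-z0-9]*\.)*cs.ucsb.edu/

In conf/nutch-site.xml, the main property a fresh install needs before it will fetch is http.agent.name, which identifies your crawler to web servers; the value here is only an example:

  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>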
Intranet: Running the Crawl
• Crawl options:
  • -dir dir: names the directory to put the crawl in
  • -threads threads: determines the number of threads that fetch in parallel
  • -depth depth: indicates the link depth from the root page that should be crawled
  • -topN N: determines the maximum number of pages retrieved at each level, up to the given depth
• E.g. $ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Intranet Search
• Deploy the Nutch war file
  $ rm -rf TOMCAT_DIR/webapps/ROOT*
  $ cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
• The webapp finds indexes in ./crawl, relative to where you start Tomcat
  $ TOMCAT_DIR/bin/catalina.sh start
• Search at http://localhost:8080/
• CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080
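Because the index path is resolved relative to Tomcat's working directory, start Tomcat from the directory that contains crawl/; a sketch, with the path only an example:

  cd /data/nutch-0.9               # the directory where crawl/ was created
  TOMCAT_DIR/bin/catalina.sh start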
Internet Crawling
• Concepts
  • crawldb: all URL info
  • linkdb: the list of known links to each URL
  • segments: each segment is a set of URLs that are fetched as a unit
  • indexes: Lucene-format indexes
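One way to peek at these structures on disk, assuming a finished crawl under ./crawl (the readdb -stats flag is from the stock Nutch command set):

  bin/nutch readdb crawl/crawldb -stats    # URL counts broken down by fetch status
  ls crawl/segments/                       # one timestamped directory per fetch round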
Internet Crawling Process
1. Get seed URLs
2. Fetch
3. Update the crawl DB
4. Compute the top URLs; go to step 2
5. Create the index
6. Deploy
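Steps 2-4 form a loop that deepens the crawl one round at a time. Stitched together from the commands on the next slides, two rounds look like this (the kids/ directories and the -topN value follow the later examples):

  for i in 1 2; do
    bin/nutch generate kids/crawldb kids/segments -topN 50000   # step 4: pick the top-scoring URLs
    s1=`ls -d kids/segments/2* | tail -1`                       # newest segment
    bin/nutch fetch $s1                                         # step 2: fetch it
    bin/nutch updatedb kids/crawldb $s1                         # step 3: fold results into the crawl DB
  done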
Seed URLs
• URLs from the DMOZ Open Directory
  $ wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
  $ gunzip content.rdf.u8.gz
  $ mkdir dmoz
  $ bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
• Kids-search URLs from ask.com
• Inject the URLs into the crawl db
  $ bin/nutch inject kids/crawldb 67k-url/
• Edit conf/nutch-site.xml
Fetch
• Generate a fetchlist from the database
  $ bin/nutch generate kids/crawldb kids/segments
• Save the name of the fetchlist segment in variable s1
  $ s1=`ls -d kids/segments/2* | tail -1`
• Run the fetcher on this segment
  $ bin/nutch fetch $s1
Update Crawl DB and Re-fetch
• Update the crawl db with the results of the fetch
  $ bin/nutch updatedb kids/crawldb $s1
• Generate a fetchlist of the top-scoring 50K pages
  $ bin/nutch generate kids/crawldb kids/segments -topN 50000
• Re-fetch
  $ s1=`ls -d kids/segments/2* | tail -1`
  $ bin/nutch fetch $s1
Index, Deploy, and Search
• Build the link database (invert links)
  $ bin/nutch invertlinks kids/linkdb kids/segments/*
• Index the segments
  $ bin/nutch index kids/indexes kids/crawldb kids/linkdb kids/segments/*
• Deploy & search
  • Same as in intranet search
• Demo of 1M pages (570K + 500K)
Issues
• The default re-crawl cycle is 30 days for all URLs
• Duplicates are pages with the same URL or the same MD5 hash of the page content
• The JavaScript parser uses regular expressions to extract URL literals from code