Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006
Why use Nutch? • Front-end to large collections of documents • Demonstrate research without writing lots of extra code
Outline • Nutch - information retrieval • Pros & Cons • Crawling the Local Filesystem • How Nutch Works • Indexing a Database • Query Filters: Searching with Nutch
Nutch • Open source search engine • Written in Java • Built on top of Apache Lucene
Advantages of Nutch • Scalable • Index local host or entire Internet • Portable • Runs anywhere with Java • Flexible • Plugin system + API • Code pretty easy to read & work with • Better than implementing it yourself!
Disadvantages of Nutch • Documentation still somewhat lacking • Not yet fully mature • No GUI • Odd Tomcat setup • Several “gotchas”
Crawling the Local Filesystem • Step 1: Create list of files to index

file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
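A seed list like the one above can be generated with a short script. A minimal sketch (the directory path is the one from the slide; any root works, and the `.pl` suffix filter is just for this example):

```python
import os

def list_files(root, suffix=".pl"):
    """Walk a directory tree and collect paths of files with the given suffix."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                paths.append(os.path.join(dirpath, name))
    return paths

# Usage: write one path per line into the seed file, e.g.:
#   with open("file_list", "w") as out:
#       out.writelines(p + "\n" for p in list_files("/data0/projects/clairlib/CLAIR"))
```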
Crawling the Local Filesystem • Step 2: Edit Configuration • crawl-urlfilter.txt • Very restrictive by default • Must allow file: URLs
crawl-urlfilter.txt default

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.
crawl-urlfilter.txt

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# allow everything else
+.
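The first-match-wins semantics of this file can be sketched in a few lines of Python. This is a hypothetical stand-in for Nutch's regex URL filter, not its actual code; a case-insensitive flag stands in for the upper/lowercase alternatives in the pattern above:

```python
import re

# (prefix, pattern) pairs mirroring the modified filter above:
# reject unparseable suffixes, then allow everything else.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|png)$",
                     re.IGNORECASE)),
    ("+", re.compile(r".")),
]

def accept(url):
    """Return True if the first matching rule is '+', False otherwise."""
    for prefix, pattern in RULES:
        if pattern.search(url):
            return prefix == "+"
    return False  # a URL matching no pattern is ignored
```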
Crawling the Local Filesystem • Step 3: Edit Configuration • nutch-site.xml (overrides nutch-default.xml) • Enable protocol-file plugin and parse plugins

<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In
    any case you need at least include the nutch-extensionpoints
    plugin. By default Nutch includes crawling just HTML and plain
    text via HTTP, and basic indexing and search plugins.
    </description>
  </property>
</nutch-conf>
Crawling the Local Filesystem • Step 4: Run the crawl • bin/nutch crawl myurls • Step 5: Start Tomcat • GOTCHA: must start in the crawl directory! • Or edit WEB-INF/classes/nutch-site.xml

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
  </property>
</nutch-conf>
Modifying the Results Page • Just customize search.jsp! • For example, display external ‘citations’ link instead of ‘anchors’

(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>">
 <i18n:message key="explain"/></a>)
(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>
How Nutch Works • Protocol plugin: Protocol.getProtocolOutput takes a URL and returns a Content object (byte[] content, String contentType, URL url, Properties metadata)
How Nutch Works • Parsing plugins: Parser.getParse takes that Content and returns a Parse (String text, plus ParseData data: Properties metadata, Outlink[] outlinks, String title, ParseStatus status)
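The two-stage pipeline above (a Protocol produces Content, a Parser turns Content into a Parse) can be modeled with a self-contained sketch. These classes and functions are Python stand-ins for the Java interfaces named on the slides, not actual Nutch code:

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    """What Protocol.getProtocolOutput returns for a URL."""
    url: str
    content: bytes
    content_type: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Parse:
    """What Parser.getParse extracts from a Content."""
    text: str
    title: str = ""
    outlinks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def file_protocol(url):
    """Toy protocol plugin: fetch a file: URL from the local filesystem."""
    path = url[len("file:"):]
    with open(path, "rb") as f:
        data = f.read()
    return Content(url=url, content=data, content_type="text/plain")

def text_parser(content):
    """Toy parser plugin: plain text needs no real parsing."""
    return Parse(text=content.content.decode("utf-8", errors="replace"))
```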
Indexing a Database • Need to write a new plugin • Luckily interface is pretty simple • Much less tightly coupled than full-text search inside database
Indexing a Database • Approach • Get the text out • Generate a 1:1 mapping from URLs to documents in the database
Indexing a Database • Protocol plugin • Replaces default ‘http’ plugin • Converts http request to database request
Indexing a Database • Parse plugin • Replaces text or HTML parser • Protocol plugin gets the text and metadata, so don’t need to do much here
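Under that division of labor, the protocol plugin's job — turning a URL into a database lookup via the 1:1 URL-to-document mapping — might look like this sketch. The db://docs/NN URL scheme and the table layout here are invented for illustration:

```python
import sqlite3

def fetch_from_db(conn, url):
    """Toy 'database protocol' plugin: map a URL like db://docs/42 to a row lookup."""
    docid = int(url.rsplit("/", 1)[-1])  # 1:1 mapping: URL <-> document id
    row = conn.execute(
        "SELECT title, body FROM docs WHERE id = ?", (docid,)
    ).fetchone()
    if row is None:
        raise KeyError(url)
    title, body = row
    # Return text plus metadata, so the parse plugin has little left to do.
    return {"url": url, "text": body, "metadata": {"title": title}}
```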
Indexing a Database • Configuration - plugin.xml
Indexing a Database • Configuration - nutch-site.xml • Add correct plugin • Make sure Nutch can find plugin • $NUTCH_HOME/plugins
Improving the Plugin • Configuration via XML • Determine which database to use for what URLs • Automatically ‘crawl’ database • Pass unknown URLs to default plugin
Searching with Nutch • Parse query - NutchAnalysis • Filter query - QueryFilters • Pass to Lucene - IndexSearcher • Optimization/caching - LuceneQueryOptimizer • Translate hits from Lucene back to Nutch
Query Filter • QueryFilter.filter() translates a Nutch Query into a Lucene Query
Date Query Filter • Date query filter restricts by date
Basic Query Filter • Boosts weight of particular fields • Manipulates phrases
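Field boosting of this sort can be sketched as a scoring tweak. This is a toy scorer standing in for the effect of the boosts BasicQueryFilter applies when building the Lucene query; the field names and boost values are invented:

```python
# Per-field boosts (invented values): a title hit counts
# for more than a plain content hit.
FIELD_BOOSTS = {"title": 2.0, "anchor": 1.5, "content": 1.0}

def score(doc, query_terms):
    """Count query-term occurrences per field, weighted by that field's boost."""
    total = 0.0
    for f, boost in FIELD_BOOSTS.items():
        words = doc.get(f, "").lower().split()
        total += boost * sum(words.count(t.lower()) for t in query_terms)
    return total
```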
Additional Query Filters • Could implement relevance feedback in this framework • Manual relevance feedback • could add morelike:somedocument operator • Automatic relevance feedback - extend BasicQueryFilter
Additional Capabilities • Distributed searching • Nutch Distributed File System • MapReduce a la Google • More
Nutch Distributed Filesystem • Write-once • Stream-oriented (append-only, sequential read) • Distributed, transparent, replicated, fault-tolerant • Distribute index and content
MapReduce • Distributed processing technique • Idea from functional programming
Map • Apply same operation to several data items • Example (Python):

def getDocument(docid):
    """ fetch document with given docid from database """
    # do some stuff
    ...
    return document

docids = [1, 2, 3, 4, 5]
documents = map(getDocument, docids)

• Mapping for individual items is independent - distributable!
Reduce • Combine results of map operation • Simple example - sum of squares

measurements = [4, 2, 6, 9]

def sum(x, y):
    return x + y

def square(x):
    return x * x  # note: x^2 is XOR in Python, not exponentiation

result = reduce(sum, map(square, measurements))
MapReduce in Nutch • Can be used to distribute crawling, indexing, etc.
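Combining the two primitives, a word count over fetched pages — the canonical MapReduce example — sketches how indexing work could be split up: the map phase runs independently per document (distributable), and the reduce phase merges the partial results:

```python
from functools import reduce  # reduce moved to functools in Python 3

def map_phase(doc):
    """Map: each document independently yields (word, 1) pairs."""
    return [(w.lower(), 1) for w in doc.split()]

def reduce_phase(counts, pair):
    """Reduce: fold a (word, count) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

def word_count(docs):
    pairs = [p for doc in docs for p in map_phase(doc)]
    return reduce(reduce_phase, pairs, {})
```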
Conclusions • Nutch is • featureful • flexible • extensible • scalable • Get started with Nutch: http://lucene.apache.org/nutch • Sample plugins and code samples: http://umich.edu/~aelkiss/nutch