Extensible Information Retrieval with Apache Nutch Aaron Elkiss 16-Feb-2006
Why use Nutch? • Front-end to large collections of documents • Demonstrate research without writing lots of extra code
Outline • Nutch - information retrieval • Pros & Cons • Crawling the Local Filesystem • How Nutch Works • Indexing a Database • Query Filters: Searching with Nutch
Nutch • Open source search engine • Written in Java • Built on top of Apache Lucene
Advantages of Nutch • Scalable • Index local host or entire Internet • Portable • Runs anywhere with Java • Flexible • Plugin system + API • Code pretty easy to read & work with • Better than implementing it yourself!
Disadvantages of Nutch • Documentation still somewhat lacking • Not yet fully mature • No GUI • Odd Tomcat setup • Several “gotchas”
Crawling the Local Filesystem • Step 1: Create list of files to index

file_list:
/data0/projects/clairlib/CLAIR/aleClairlib.pl
/data0/projects/clairlib/CLAIR/buildALE.pl
/data0/projects/clairlib/CLAIR/get_cosine_example.pl
/data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
/data0/projects/clairlib/CLAIR/makeCorpus.pl
/data0/projects/clairlib/CLAIR/normalize_cosines.pl
/data0/projects/clairlib/CLAIR/queryALE.pl
/data0/projects/clairlib/CLAIR/testCluster.pl
/data0/projects/clairlib/CLAIR/testCorpusDownload.pl
/data0/projects/clairlib/CLAIR/testDocument.pl
/data0/projects/clairlib/CLAIR/testDocumentPair.pl
/data0/projects/clairlib/CLAIR/testIP.pl
/data0/projects/clairlib/CLAIR/testUtil.pl
/data0/projects/clairlib/CLAIR/testWebSearch.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
/data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
/data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl
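A seed list like the one above can be generated with a short script. A minimal sketch (the directory path is the one from the slide; any root works, and the `.pl` suffix filter is just for this example):

```python
import os

def list_files(root, suffix=".pl"):
    """Walk a directory tree and collect paths of files with the given suffix."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffix):
                paths.append(os.path.join(dirpath, name))
    return paths

# Usage: write one path per line into the seed file, e.g.:
#   with open("file_list", "w") as out:
#       out.writelines(p + "\n" for p in list_files("/data0/projects/clairlib/CLAIR"))
```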
Crawling the Local Filesystem • Step 2: Edit Configuration • crawl-urlfilter.txt • Very restrictive by default • Must allow file: URLs
crawl-urlfilter.txt default

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# skip everything else
-.
crawl-urlfilter.txt

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# allow everything else
+.
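The first-match-wins semantics of this file can be sketched in a few lines of Python. This is a hypothetical stand-in for Nutch's regex URL filter, not its actual code; a case-insensitive flag stands in for the upper/lowercase alternatives in the pattern above:

```python
import re

# (prefix, pattern) pairs mirroring the modified filter above:
# reject unparseable suffixes, then allow everything else.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|png)$",
                     re.IGNORECASE)),
    ("+", re.compile(r".")),
]

def accept(url):
    """Return True if the first matching rule is '+', False otherwise."""
    for prefix, pattern in RULES:
        if pattern.search(url):
            return prefix == "+"
    return False  # a URL matching no pattern is ignored
```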
Crawling the Local Filesystem • Step 3: Edit Configuration • nutch-site.xml (overrides nutch-default.xml) • Enable protocol-file plugin and parse plugins

<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In
    any case you need at least include the nutch-extensionpoints
    plugin. By default Nutch includes crawling just HTML and plain
    text via HTTP, and basic indexing and search plugins.
    </description>
  </property>
</nutch-conf>
Crawling the Local Filesystem • Step 4: Run the crawl • bin/nutch crawl myurls • Step 5: Start Tomcat • GOTCHA: must start in the crawl directory! • Or edit WEB-INF/classes/nutch-site.xml

<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
  </property>
</nutch-conf>
Modifying the Results Page • Just customize search.jsp! • For example, display external ‘citations’ link instead of ‘anchors’

(<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>">
 <i18n:message key="explain"/></a>)
(<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
<%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>
How Nutch Works • Protocol plugin: Protocol.getProtocolOutput takes a URL and returns a Content object (byte[] content, String contentType, URL url, Properties metadata)
How Nutch Works • Parsing plugins: Parser.getParse takes that Content and returns a Parse (String text, plus ParseData data: Properties metadata, Outlink[] outlinks, String title, ParseStatus status)
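The two-stage pipeline above (a Protocol produces Content, a Parser turns Content into a Parse) can be modeled with a self-contained sketch. These classes and functions are Python stand-ins for the Java interfaces named on the slides, not actual Nutch code:

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    """What Protocol.getProtocolOutput returns for a URL."""
    url: str
    content: bytes
    content_type: str
    metadata: dict = field(default_factory=dict)

@dataclass
class Parse:
    """What Parser.getParse extracts from a Content."""
    text: str
    title: str = ""
    outlinks: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def file_protocol(url):
    """Toy protocol plugin: fetch a file: URL from the local filesystem."""
    path = url[len("file:"):]
    with open(path, "rb") as f:
        data = f.read()
    return Content(url=url, content=data, content_type="text/plain")

def text_parser(content):
    """Toy parser plugin: plain text needs no real parsing."""
    return Parse(text=content.content.decode("utf-8", errors="replace"))
```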
Indexing a Database • Need to write a new plugin • Luckily interface is pretty simple • Much less tightly coupled than full-text search inside database
Indexing a Database • Approach • Get the text out • Generate a 1:1 mapping from URLs to documents in the database
Indexing a Database • Protocol plugin • Replaces default ‘http’ plugin • Converts http request to database request
Indexing a Database • Parse plugin • Replaces text or HTML parser • Protocol plugin gets the text and metadata, so don’t need to do much here
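Under that division of labor, the protocol plugin's job — turning a URL into a database lookup via the 1:1 URL-to-document mapping — might look like this sketch. The db://docs/NN URL scheme and the table layout here are invented for illustration:

```python
import sqlite3

def fetch_from_db(conn, url):
    """Toy 'database protocol' plugin: map a URL like db://docs/42 to a row lookup."""
    docid = int(url.rsplit("/", 1)[-1])  # 1:1 mapping: URL <-> document id
    row = conn.execute(
        "SELECT title, body FROM docs WHERE id = ?", (docid,)
    ).fetchone()
    if row is None:
        raise KeyError(url)
    title, body = row
    # Return text plus metadata, so the parse plugin has little left to do.
    return {"url": url, "text": body, "metadata": {"title": title}}
```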
Indexing a Database • Configuration - plugin.xml
Indexing a Database • Configuration - nutch-site.xml • Add correct plugin • Make sure Nutch can find plugin • $NUTCH_HOME/plugins
Improving the Plugin • Configuration via XML • Determine which database to use for what URLs • Automatically ‘crawl’ database • Pass unknown URLs to default plugin
Searching with Nutch • Parse query - NutchAnalysis • Filter query - QueryFilters • Pass to Lucene - IndexSearcher • Optimization/caching - LuceneQueryOptimizer • Translate hits from Lucene back to Nutch
Query Filter • QueryFilter.filter() translates a Nutch Query into a Lucene Query
Date Query Filter • Date query filter restricts by date
Basic Query Filter • Boosts weight of particular fields • Manipulates phrases
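Field boosting of this sort can be sketched as a scoring tweak. This is a toy scorer standing in for the effect of the boosts BasicQueryFilter applies when building the Lucene query; the field names and boost values are invented:

```python
# Per-field boosts (invented values): a title hit counts
# for more than a plain content hit.
FIELD_BOOSTS = {"title": 2.0, "anchor": 1.5, "content": 1.0}

def score(doc, query_terms):
    """Count query-term occurrences per field, weighted by that field's boost."""
    total = 0.0
    for f, boost in FIELD_BOOSTS.items():
        words = doc.get(f, "").lower().split()
        total += boost * sum(words.count(t.lower()) for t in query_terms)
    return total
```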
Additional Query Filters • Could implement relevance feedback in this framework • Manual relevance feedback • could add morelike:somedocument operator • Automatic relevance feedback - extend BasicQueryFilter
Additional Capabilities • Distributed searching • Nutch Distributed File System • MapReduce a la Google • More
Nutch Distributed Filesystem • Write-once • Stream-oriented (append-only, sequential read) • Distributed, transparent, replicated, fault-tolerant • Distribute index and content
MapReduce • Distributed processing technique • Idea from functional programming
Map • Apply same operation to several data items • Example (Python):

def getDocument(docid):
    """ fetch document with given docid from database """
    # do some stuff
    ...
    return document

docids = [1, 2, 3, 4, 5]
documents = map(getDocument, docids)

• Mapping for individual items is independent - distributable!
Reduce • Combine results of map operation • Simple example - sum of squares

measurements = [4, 2, 6, 9]

def sum(x, y):
    return x + y

def square(x):
    return x * x  # note: x^2 is XOR in Python, not exponentiation

result = reduce(sum, map(square, measurements))
MapReduce in Nutch • Can be used to distribute crawling, indexing, etc.
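Combining the two primitives, a word count over fetched pages — the canonical MapReduce example — sketches how indexing work could be split up: the map phase runs independently per document (distributable), and the reduce phase merges the partial results:

```python
from functools import reduce  # reduce moved to functools in Python 3

def map_phase(doc):
    """Map: each document independently yields (word, 1) pairs."""
    return [(w.lower(), 1) for w in doc.split()]

def reduce_phase(counts, pair):
    """Reduce: fold a (word, count) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

def word_count(docs):
    pairs = [p for doc in docs for p in map_phase(doc)]
    return reduce(reduce_phase, pairs, {})
```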
Conclusions • Nutch is • featureful • flexible • extensible • scalable • Get started with Nutch: http://lucene.apache.org/nutch • Sample plugins and code samples: http://umich.edu/~aelkiss/nutch