
Extensible Information Retrieval with Apache Nutch



  1. Extensible Information Retrieval with Apache Nutch • Aaron Elkiss • 16-Feb-2006

  2. Why use Nutch? • Front-end to large collections of documents • Demonstrate research without writing lots of extra code

  3. Outline • Nutch - information retrieval • Pros & Cons • Crawling the Local Filesystem • How Nutch Works • Indexing a Database • Query Filters: Searching with Nutch

  4. Nutch • Open source search engine • Written in Java • Built on top of Apache Lucene

  5. Advantages of Nutch • Scalable • Index local host or entire Internet • Portable • Runs anywhere with Java • Flexible • Plugin system + API • Code pretty easy to read & work with • Better than implementing it yourself!

  6. Disadvantages of Nutch • Documentation still somewhat lacking • Not yet fully mature • No GUI • Odd Tomcat setup • Several “gotchas”

  7. Crawling the Local Filesystem • Step 1: Create list of files to index

  file_list:
  /data0/projects/clairlib/CLAIR/aleClairlib.pl
  /data0/projects/clairlib/CLAIR/buildALE.pl
  /data0/projects/clairlib/CLAIR/get_cosine_example.pl
  /data0/projects/clairlib/CLAIR/lookUpTFIDF.pl
  /data0/projects/clairlib/CLAIR/makeCorpus.pl
  /data0/projects/clairlib/CLAIR/normalize_cosines.pl
  /data0/projects/clairlib/CLAIR/queryALE.pl
  /data0/projects/clairlib/CLAIR/testCluster.pl
  /data0/projects/clairlib/CLAIR/testCorpusDownload.pl
  /data0/projects/clairlib/CLAIR/testDocument.pl
  /data0/projects/clairlib/CLAIR/testDocumentPair.pl
  /data0/projects/clairlib/CLAIR/testIP.pl
  /data0/projects/clairlib/CLAIR/testUtil.pl
  /data0/projects/clairlib/CLAIR/testWebSearch.pl
  /data0/projects/clairlib/CLAIR/NSIR/bin/testNSIR.pl
  /data0/projects/clairlib/CLAIR/NSIR/bin/nsir_web.pl
  /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Parser.pl
  /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/Tnt2PreCass.pl
  /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanEmptySentences.pl
  /data0/projects/clairlib/CLAIR/NSIR/lib/NSIR/utilities/cleanPunctuation_tnt.pl

  8. Crawling the Local Filesystem • Step 2: Edit Configuration • crawl-urlfilter.txt • Very restrictive by default • Must allow file: URLs

  9. crawl-urlfilter.txt (default)

  # Each non-comment, non-blank line contains a regular expression
  # prefixed by '+' or '-'. The first matching pattern in the file
  # determines whether a URL is included or ignored. If no pattern
  # matches, the URL is ignored.

  # skip file:, ftp:, & mailto: urls
  -^(file|ftp|mailto):

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

  # skip everything else
  -.

  10. crawl-urlfilter.txt (modified)

  # Each non-comment, non-blank line contains a regular expression
  # prefixed by '+' or '-'. The first matching pattern in the file
  # determines whether a URL is included or ignored. If no pattern
  # matches, the URL is ignored.

  # skip image and other suffixes we can't yet parse
  -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

  # allow everything else
  +.

  11. Crawling the Local Filesystem • Step 3: Edit Configuration • nutch-site.xml (overrides nutch-default.xml) • Enable protocol-file plugin and parse plugins

  <nutch-conf>
    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-file|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded. In any
      case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins.</description>
    </property>
  </nutch-conf>

  12. Crawling the Local Filesystem • Step 4: Run the crawl: bin/nutch crawl myurls • Step 5: Start Tomcat • GOTCHA: Tomcat must be started in the crawl directory! • Or edit WEB-INF/classes/nutch-site.xml:

  <nutch-conf>
    <property>
      <name>searcher.dir</name>
      <value>/oriole0/nutch-0.7.1/crawl-20051208231019</value>
    </property>
  </nutch-conf>

  13. Modifying the Results Page • Just customize search.jsp! • For example, display an external ‘citations’ link instead of ‘anchors’:

  (<a href="../explain.jsp?<%=id%>&query=<%=URLEncoder.encode(queryString)%>"><i18n:message key="explain"/></a>)
  (<a href="http://oriole.eecs.umich.edu/cgi-bin/citations.pl?<%=url%>">citations</a>)
  <%-- (<a href="../anchors.jsp?<%=id%>"><i18n:message key="anchors"/></a>) --%>

  14. How Nutch Works • Protocol plugin: Protocol.getProtocolOutput takes a URL and returns a Content object holding byte[] content, String contentType, URL url, and Properties metadata (sketched below).
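In Java terms, the slide's diagram amounts to roughly the following. This is a hedged sketch based only on the names shown on the slide; exact package, class, and method signatures differ between Nutch releases.

  // Hedged sketch of the protocol extension point, following the slide's
  // diagram only; real Nutch versions package and name these differently.
  import java.net.URL;
  import java.util.Properties;

  interface Protocol {
      // Fetch the raw bytes behind a URL, whatever the scheme.
      Content getProtocolOutput(URL url);
  }

  class Content {
      byte[] content;      // raw document bytes
      String contentType;  // e.g. "text/html"
      URL url;             // where the bytes came from
      Properties metadata; // protocol-level metadata such as HTTP headers
  }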

  15. How Nutch Works • Parsing plugins: Parser.getParse takes the Content produced by Protocol.getProtocolOutput and returns a Parse holding the extracted String text plus a ParseData (Properties metadata, Outlink[] outlinks, String title, ParseStatus status); see the sketch below.
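Again as a hedged Java sketch of the slide's diagram (names approximate; Content is as sketched under slide 14):

  // Hedged sketch of the parsing extension point, following the slide's diagram.
  import java.util.Properties;

  interface Parser {
      // Turn fetched Content into extracted text plus structured parse data.
      Parse getParse(Content content);
  }

  class Parse {
      String text;    // extracted full text, which is what gets indexed
      ParseData data; // everything else learned while parsing
  }

  class ParseData {
      Properties metadata; // parser-level metadata (e.g. author, date)
      Outlink[] outlinks;  // links found in the document, for further crawling
      String title;        // document title, shown on the results page
      ParseStatus status;  // whether parsing succeeded
  }

  // Minimal stand-ins so the sketch is self-contained:
  class Outlink { java.net.URL toUrl; String anchor; }
  class ParseStatus { boolean success; }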

  16. Indexing a Database • Need to write a new plugin • Luckily interface is pretty simple • Much less tightly coupled than full-text search inside database

  17. Indexing a Database • Approach • Get the text out • Generate a 1:1 mapping from URLs to documents in the database

  18. Indexing a Database • Protocol plugin • Replaces default ‘http’ plugin • Converts http request to database request
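A hedged sketch of the core of such a plugin, using plain JDBC against the Protocol/Content shapes sketched under slides 14-15. The URL scheme (a docid query parameter), the JDBC URL, and the docs table with id, title, and body columns are all invented for illustration; the talk's actual plugin is not reproduced in the transcript.

  // Hypothetical database-backed protocol: maps a URL like
  // http://dbsearch/doc?docid=42 onto a row fetched over JDBC.
  import java.net.URL;
  import java.sql.*;
  import java.util.Properties;

  class DatabaseProtocol implements Protocol {
      public Content getProtocolOutput(URL url) {
          int docid = Integer.parseInt(url.getQuery().replace("docid=", ""));
          try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/corpus");
               PreparedStatement stmt = conn.prepareStatement(
                       "SELECT title, body FROM docs WHERE id = ?")) {
              stmt.setInt(1, docid);
              ResultSet rs = stmt.executeQuery();
              rs.next();
              Content c = new Content();
              c.url = url;
              c.contentType = "text/plain";
              c.content = rs.getString("body").getBytes();
              c.metadata = new Properties();
              c.metadata.setProperty("title", rs.getString("title"));
              return c;
          } catch (SQLException e) {
              throw new RuntimeException("database fetch failed for " + url, e);
          }
      }
  }

Because each document has exactly one URL (slide 17's 1:1 mapping), the rest of Nutch never needs to know the content came from a database.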

  19. Indexing a Database • Parse plugin • Replaces text or HTML parser • Protocol plugin gets the text and metadata, so don’t need to do much here
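Since the protocol plugin already delivers clean text and a title, the corresponding parser can be nearly a pass-through. A hedged sketch, with the same caveats as above:

  // Hypothetical pass-through parser for database content.
  class DatabaseParser implements Parser {
      public Parse getParse(Content content) {
          Parse p = new Parse();
          p.text = new String(content.content);  // text came straight from the DB
          p.data = new ParseData();
          p.data.title = content.metadata.getProperty("title");
          p.data.outlinks = new Outlink[0];      // a database row has no hyperlinks
          p.data.metadata = content.metadata;
          return p;
      }
  }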

  20. Indexing a Database • Configuration - plugin.xml
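The transcript does not preserve the plugin.xml from this slide. The sketch below shows the general shape of a Nutch plugin descriptor; the plugin id, jar name, and class names are placeholders, not the ones from the talk.

  <?xml version="1.0" encoding="UTF-8"?>
  <plugin id="protocol-database" name="Database Protocol Plug-in"
          version="0.0.1" provider-name="example.org">
    <runtime>
      <!-- the jar containing the plugin classes -->
      <library name="protocol-database.jar">
        <export name="*"/>
      </library>
    </runtime>
    <!-- hook the implementation class into the Protocol extension point -->
    <extension id="org.example.nutch.protocol.database"
               point="org.apache.nutch.protocol.Protocol">
      <implementation id="DatabaseProtocol"
                      class="org.example.nutch.protocol.DatabaseProtocol"/>
    </extension>
  </plugin>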

  21. Indexing a Database • Configuration - nutch-site.xml • Add correct plugin • Make sure Nutch can find plugin • $NUTCH_HOME/plugins
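Concretely, that means editing the plugin.includes property from slide 11 so the hypothetical database plugins replace the http protocol and the stock parsers, for example:

  <nutch-conf>
    <property>
      <name>plugin.includes</name>
      <value>nutch-extensionpoints|protocol-database|urlfilter-regex|parse-database|index-basic|query-(basic|site|url)</value>
    </property>
  </nutch-conf>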

  22. Improving the Plugin • Configuration via XML • Determine which database to use for what URLs • Automatically ‘crawl’ database • Pass unknown URLs to default plugin
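None of this existed in the talk's plugin; as one hypothetical shape for the per-database configuration, an extra property in nutch-site.xml could map URL prefixes to JDBC connection strings:

  <nutch-conf>
    <property>
      <name>db.protocol.mappings</name>
      <value>http://dbsearch/clairlib/=jdbc:mysql://localhost/clairlib</value>
      <description>Invented property: URL prefix to JDBC URL. URLs matching
      no prefix would fall through to the default http plugin.</description>
    </property>
  </nutch-conf>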

  23. Searching with Nutch • Parse query - NutchAnalysis • Filter query - QueryFilters • Pass to Lucene - IndexSearcher • Optimization/caching - LuceneQueryOptimizer • Translate hits from Lucene back to Nutch
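All of these stages sit behind the searcher bean that search.jsp drives. Below is a hedged sketch of using it programmatically; class and method names follow Nutch 0.7's org.apache.nutch.searcher package as used by search.jsp, but treat the details as approximate.

  // Hedged sketch: run a query through the full pipeline the slide lists.
  import org.apache.nutch.searcher.*;

  public class SearchDemo {
      public static void main(String[] args) throws Exception {
          NutchBean bean = new NutchBean();          // locates the index via searcher.dir
          Query query = Query.parse("extensible retrieval"); // NutchAnalysis + QueryFilters
          Hits hits = bean.search(query, 10);        // Lucene IndexSearcher underneath
          for (int i = 0; i < hits.getLength(); i++) {
              HitDetails details = bean.getDetails(hits.getHit(i));
              System.out.println(details.getValue("url"));
          }
      }
  }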

  24. Query Filter • QueryFilter.filter() translates a Nutch Query into the Lucene Query that is actually run (sketched below).
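As a hedged example, a filter in the style of BasicQueryFilter that boosts matches in a hypothetical title field might look like this. The filter signature matches Nutch 0.7; the clause-walking details are approximate, and the three-argument BooleanQuery.add comes from the Lucene 1.x API of that era.

  // Hedged sketch of a query filter: contributes boosted Lucene clauses
  // for a hypothetical "title" field for each term in the Nutch query.
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.TermQuery;
  import org.apache.nutch.searcher.Query;
  import org.apache.nutch.searcher.QueryFilter;

  public class TitleBoostFilter implements QueryFilter {
      public BooleanQuery filter(Query input, BooleanQuery output) {
          for (Query.Clause c : input.getClauses()) {
              if (c.isPhrase()) continue;           // skip phrase clauses for brevity
              TermQuery tq = new TermQuery(new Term("title", c.getTerm().toString()));
              tq.setBoost(2.0f);                    // weight title matches higher
              output.add(tq, false, false);         // optional clause (Lucene 1.x API)
          }
          return output;
      }
  }

Registering it is the same plugin.xml dance as before, this time against the QueryFilter extension point.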

  25. Date Query Filter • Date query filter restricts by date

  26. Basic Query Filter • Boosts weight of particular fields • Manipulates phrases

  27. Additional Query Filters • Could implement relevance feedback in this framework • Manual relevance feedback • could add morelike:somedocument operator • Automatic relevance feedback - extend BasicQueryFilter

  28. Additional Capabilities • Distributed searching • Nutch Distributed File System • MapReduce a la Google • More

  29. Nutch Distributed Filesystem • Write-once • Stream-oriented (append-only, sequential read) • Distributed, transparent, replicated, fault-tolerant • Distribute index and content

  30. MapReduce • Distributed processing technique • Idea from functional programming

  31. Map • Apply same operation to several data items • Example (Python):

  def getDocument(docid):
      """ fetch document with given docid from database """
      # do some stuff ...
      return document

  docids = [1, 2, 3, 4, 5]
  documents = map(getDocument, docids)

  • Mapping for individual items is independent - distributable!

  32. Reduce • Combine results of map operation • Simple example - sum of squares:

  measurements = [4, 2, 6, 9]

  def sum(x, y):
      return x + y

  def square(x):
      return x ** 2  # ** is exponentiation; ^ would be bitwise XOR

  # reduce is a builtin in Python 2; in Python 3, import it from functools
  result = reduce(sum, map(square, measurements))

  33. MapReduce in Nutch • Can be used to distribute crawling, indexing, etc.

  34. Conclusions • Nutch is • featureful • flexible • extensible • scalable • Get started with Nutch: http://lucene.apache.org/nutch • Sample plugins and code samples: http://umich.edu/~aelkiss/nutch
