1 / 20

Introduction to Nutch

Introduction to Nutch. CSCI 572: Information Retrieval and Search Engines Summer 2011. Outline. What is Nutch? Motivation Architecture What currently exists? How I got involved Deploying Nutch on NASA’s Planetary Data System (PDS) Free text “Google-like” search of the PDS catalog

aira
Download Presentation

Introduction to Nutch

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2011

  2. Outline • What is Nutch? • Motivation • Architecture • What currently exists? • How I got involved • Deploying Nutch on NASA’s Planetary Data System (PDS) • Free text “Google-like” search of the PDS catalog • Architecture/Implementation

  3. What is Nutch? • The brainchild of Doug Cutting • Research/programmer guru who has worked at several high profile research labs (Yahoo, Bell Labs) • Nutch builds upon Cutting’s lower level text indexing library and API called Lucene • Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene

  4. Motivation • Observation: Web Search is a commodity • Why can’t it be provided freely? • Allows tweaking of typically “hidden” ranking algorithms • Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

  5. Motivation • Value-added capabilities • Improving fetching speed • Parsing and handling of the hundreds of different content types available on the internet • Handling different protocols for obtaining content • Better ranking algorithms (OPIC, PageRank) • More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework

  6. Nutch’s Architecture • Nutch Core facilities • Parsing • Indexing • Crawling • Content Management • Querying • Plugin Framework • Nutch’s extension points • Scoring, Parsing, Indexing, Querying, URLFiltering

  7. Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page

  8. What Currently Exists? • Version 0.6.x • First easily deployable version • Version 0.7.x • Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system • Version 0.8.x • Completely new underlying architecture based on Hadoop • Parse plugins framework, multi-valued metadata container • Parser Factory enhancement • Version 0.9.x • Major bug fixes • Hadoop, and Lucene library upgrades • Version 1.0 • Flexible filter framework • Flexible scoring • Initial integration with Tika • Full Search Engine functionality and capabilities, in production at large scale (Internet Archive) • Version 1.1, For full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt

  9. What Doesn’t? • Plenty! • Bug fixes (> 200 issues in JIRA right now with no resolution) • Nutch 2.0 architecture • http://search-lucene.com/m/gbrBF1RMWk9 • Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

  10. How I got involved • In this very class! • Okay well it used to be called Cs599, but you get the picture • Started out by contributing RSS parsing plugin • My final project in 599 • Moved on from there to • NUTCH-88, redesign of the parsing framework • NUTCH-139, Metadata container support • NUTCH-210, Web Context application file • And various other bug fixes, and contributions here and there • Mailing list support • Wiki support • Became committer in October 2006 • Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

  11. Real world application of Nutch • I work at NASA’s Jet Propulsion Laboratory • NASA’s Planetary Data System • NASA’s archive for all planetary science data collected by missions over the past 30 years • Collected 20 TB over the past 30 years • Increasing to over 200 TB in the next 3 years! • Built up a catalog of all data collected • Where does Nutch fit in?

  12. Where does Nutch fit into the PDS? • PDS Management Council decide they want “Google-like” search of the PDS catalog • Our plan: use Nutch to implement capability for PDS

  13. Existing PDS Search Engine Architecture (e.g. Nutch, Google) Tomcat Crawler pds.war P D S - D Index Indexer PDS Catalog Catalog Metadata Lucene Web Server Parser Query PDS Parser PDS Extract PDS Google-like Search Architecture

  14. Approach • Export PDS catalog datasets in RDF format (flat files) • Use nutch to crawl RDF files • protocol-file plugin in Nutch • Wrote our own parse-pds plugin • Parse the RDF files, and then extract the metadata • Wrote our own index-pds plugin • Index the fields that we want from the parsed metadata • Wrote our own query-pds plugin • Search the index on the fields that we want

  15. Search Interface

  16. Results

  17. Lessons Learned • Nutch currently isn’t exactly simple to deploy, or configure • There is much discussion on mailing lists that refer to “magic configuration” properties that aren’t intuitive • Nutch documentation is currently…lacking • If you know how to use Nutch then it is extremely easy to use, and a time-saver • Active participation in mailing lists, wiki, necessary to use Nutch

  18. Good News • Nutch is here to stay • Only open source, implementation for commodity web search • If you want to start your own Google++, Nutch is a great place to start • Participation is welcome • Look what happened to me (student-> commiter) • Plenty of areas to improve (including documentation)

  19. Your Class Project • It’s probably a good idea to at least take a look at Nutch, whether you use it or not • You can see how a real implementation of theory described in class operates • Implemented in pure Java (1.5) • Add/extend capabilities within Nutch • Help finish plugging Nutch into HBase • Configure Nutch using Spring • Fully integrate Nutch and Solr • Fix *important* bugs • Add more scoring algorithm implementations

  20. Wrapup • Thanks for your attention! • Nutch home page: • http://nutch.apache.org • Mailing lists • dev@nutch.apache.org (developer’s list) • user@nutch.apache.org (user’s list)

More Related