Introduction to Nutch

Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2011

Outline • What is Nutch? • Motivation • Architecture • What currently exists? • How I got involved • Deploying Nutch on NASA’s Planetary Data System (PDS) • Free text “Google-like” search of the PDS catalog • Architecture/Implementation

What is Nutch? • The brainchild of Doug Cutting • Research/programmer guru who has worked at several high profile research labs (Yahoo, Bell Labs) • Nutch builds upon Cutting’s lower level text indexing library and API called Lucene • Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene

Motivation • Observation: Web Search is a commodity • Why can’t it be provided freely? • Allows tweaking of typically “hidden” ranking algorithms • Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Motivation • Value-added capabilities • Improving fetching speed • Parsing and handling of the hundreds of different content types available on the internet • Handling different protocols for obtaining content • Better ranking algorithms (OPIC, PageRank) • More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework

Nutch’s Architecture • Nutch Core facilities • Parsing • Indexing • Crawling • Content Management • Querying • Plugin Framework • Nutch’s extension points • Scoring, Parsing, Indexing, Querying, URLFiltering

Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page

What Currently Exists? • Version 0.6.x • First easily deployable version • Version 0.7.x • Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system • Version 0.8.x • Completely new underlying architecture based on Hadoop • Parse plugins framework, multi-valued metadata container • Parser Factory enhancement • Version 0.9.x • Major bug fixes • Hadoop, and Lucene library upgrades • Version 1.0 • Flexible filter framework • Flexible scoring • Initial integration with Tika • Full Search Engine functionality and capabilities, in production at large scale (Internet Archive) • Version 1.1, For full list, see http://svn.apache.org/repos/asf/nutch/trunk/CHANGES.txt

What Doesn’t? • Plenty! • Bug fixes (> 200 issues in JIRA right now with no resolution) • Nutch 2.0 architecture • http://search-lucene.com/m/gbrBF1RMWk9 • Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

How I got involved • In this very class! • Okay well it used to be called Cs599, but you get the picture • Started out by contributing RSS parsing plugin • My final project in 599 • Moved on from there to • NUTCH-88, redesign of the parsing framework • NUTCH-139, Metadata container support • NUTCH-210, Web Context application file • And various other bug fixes, and contributions here and there • Mailing list support • Wiki support • Became committer in October 2006 • Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

Real world application of Nutch • I work at NASA’s Jet Propulsion Laboratory • NASA’s Planetary Data System • NASA’s archive for all planetary science data collected by missions over the past 30 years • Collected 20 TB over the past 30 years • Increasing to over 200 TB in the next 3 years! • Built up a catalog of all data collected • Where does Nutch fit in?

Where does Nutch fit into the PDS? • PDS Management Council decide they want “Google-like” search of the PDS catalog • Our plan: use Nutch to implement capability for PDS

Existing PDS Search Engine Architecture (e.g. Nutch, Google) Tomcat Crawler pds.war P D S - D Index Indexer PDS Catalog Catalog Metadata Lucene Web Server Parser Query PDS Parser PDS Extract PDS Google-like Search Architecture

Approach • Export PDS catalog datasets in RDF format (flat files) • Use nutch to crawl RDF files • protocol-file plugin in Nutch • Wrote our own parse-pds plugin • Parse the RDF files, and then extract the metadata • Wrote our own index-pds plugin • Index the fields that we want from the parsed metadata • Wrote our own query-pds plugin • Search the index on the fields that we want

Search Interface

Results

Lessons Learned • Nutch currently isn’t exactly simple to deploy, or configure • There is much discussion on mailing lists that refer to “magic configuration” properties that aren’t intuitive • Nutch documentation is currently…lacking • If you know how to use Nutch then it is extremely easy to use, and a time-saver • Active participation in mailing lists, wiki, necessary to use Nutch

Good News • Nutch is here to stay • Only open source, implementation for commodity web search • If you want to start your own Google++, Nutch is a great place to start • Participation is welcome • Look what happened to me (student-> commiter) • Plenty of areas to improve (including documentation)

Your Class Project • It’s probably a good idea to at least take a look at Nutch, whether you use it or not • You can see how a real implementation of theory described in class operates • Implemented in pure Java (1.5) • Add/extend capabilities within Nutch • Help finish plugging Nutch into HBase • Configure Nutch using Spring • Fully integrate Nutch and Solr • Fix *important* bugs • Add more scoring algorithm implementations

Wrapup • Thanks for your attention! • Nutch home page: • http://nutch.apache.org • Mailing lists • dev@nutch.apache.org (developer’s list) • user@nutch.apache.org (user’s list)

Introduction to Nutch