Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond

Chris A. Mattmann • Senior Computer Scientist, NASA Jet Propulsion Laboratory • Adjunct Assistant Professor, Univ. of Southern California • Member, Apache Software Foundation

Roadmap • What is Nutch? • What are the current versions of Nutch? • What can it do? • What did we do right? • What did we do wrong? • Where is Nutch going?

And you are? • Apache Member involved in • Tika (VP,PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion) • Architect/Developer at NASA JPL in Pasadena, CA • Software Architecture/Engineering Prof at USC

Nutch is… • A project originally started by Doug Cutting • Nutch builds upon the lower level text indexing library and API called Lucene • Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene • Allows you to stand up a web-scale search infrastructure

Community • Mailing lists • User: 972 peeps • Dev: 520 peeps • Committers/PMC • 8 peeps • All 8 active: SERIOUSLY • Releases • 11 releases so far • Working on 2.0 Credit: svnsearch.org

Why Nutch? • Observation: Web Search is a commodity • Why can’t it be provided freely? • Allows tweaking of typically “hidden” ranking algorithms • Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities

Why Nutch? • Value-added capabilities • Improving fetching speed • Parsing and handling of the hundreds of different content types available on the internet • Handling different protocols for obtaining content • Better ranking algorithms (OPIC, PageRank) • More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
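As a rough illustration of the "better ranking algorithms" bullet, here is a minimal power-iteration PageRank sketch. It is not Nutch's scoring code; the toy link graph, the damping factor of 0.85, and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outlinked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's damped rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In this toy graph page "c" receives links from both "a" and "b", so it ends up ranked above "b", and the scores always sum to 1.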

Nutch’s Architecture • Nutch Core facilities • Parsing • Indexing • Crawling • Content Acquisition • Querying • Plugin Framework • Nutch’s extension points • Scoring, Parsing, Indexing, Querying, URLFiltering
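To make the idea of extension points concrete, the sketch below shows a tiny plugin registry with a chain of URL filters. The `ExtensionRegistry` class, the two filter functions, and the "return None to reject" convention are hypothetical names and choices for illustration; this is not Nutch's plugin API.

```python
from urllib.parse import urlparse


class ExtensionRegistry:
    """Maps an extension-point name to the plugins registered for it."""

    def __init__(self):
        self._plugins = {}

    def register(self, point, plugin):
        self._plugins.setdefault(point, []).append(plugin)

    def plugins_for(self, point):
        return list(self._plugins.get(point, []))


def http_only(url):
    """Hypothetical filter: keep only http/https URLs."""
    return url if urlparse(url).scheme in ("http", "https") else None


def skip_images(url):
    """Hypothetical filter: drop URLs that look like image files."""
    return None if url.lower().endswith((".png", ".jpg", ".gif")) else url


def apply_url_filters(registry, url):
    """Run each URL filter in order; a filter returns None to reject."""
    for url_filter in registry.plugins_for("URLFilter"):
        url = url_filter(url)
        if url is None:
            return None
    return url


registry = ExtensionRegistry()
registry.register("URLFilter", http_only)
registry.register("URLFilter", skip_images)
```

The core only knows the extension-point name ("URLFilter" here); new behavior is added by registering another plugin rather than changing the core loop.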

Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page

What Currently Exists? • Version 0.6.x • First easily deployable version • Version 0.7.x • Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system • Version 0.8.x • Completely new underlying architecture based on Hadoop • Parse plugins framework, multi-valued metadata container • Parser Factory enhancement • Version 0.9.x • Major bug fixes • Hadoop, and Lucene library upgrades • Version 1.0 • Flexible filter framework • Flexible scoring • Initial integration with Tika • Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)

What are the recent versions? • Version 1.1, upgrade all Nutch library deps (Hadoop, Tika, etc.) and make Fetcher faster • Version 1.2, fix some big time bugs (NPE in distributed search), lots of feature upgrades • You should be using this version

Some active dev areas • Plenty! • Bug fixes (> 200 issues in JIRA right now with no resolution) • Nutch 2.0 architecture • http://search-lucene.com/m/gbrBF1RMWk9 • Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM

Real world application of Nutch • I work at NASA’s Jet Propulsion Laboratory • NASA’s Planetary Data System • NASA’s archive for all planetary science data collected by missions over the past 30 years • Collected 20 TB over the past 30 years • Increasing to over 200 TB in the next 3 years! • Built up a catalog of all data collected • Where does Nutch fit in?

Where does Nutch fit into the PDS? • PDS Management Council decide they want “Google-like” search of the PDS catalog • Our plan: use Nutch to implement capability for PDS

PDS Google-like Search Architecture • [Diagram: existing PDS search engine architecture (e.g. Nutch, Google); components shown: Crawler, PDS Extract, PDS Parser, Indexer, Index (Lucene), PDS Catalog, Catalog Metadata, PDS-D, Query Parser, Web Server, Tomcat, pds.war] • Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S. Hardman, C. Mattmann

Approach • Export PDS catalog datasets in RDF format (flat files) • Use nutch to crawl RDF files • protocol-file plugin in Nutch • Wrote our own parse-pds plugin • Parse the RDF files, and then extract the metadata • Wrote our own index-pds plugin • Index the fields that we want from the parsed metadata • Wrote our own query-pds plugin • Search the index on the fields that we want
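The parse, index, and query steps above can be sketched in a few lines. This is only an illustration of the flow, not the parse-pds, index-pds, or query-pds plugins: the sample RDF, the Dublin Core field names, the `urn:example:` identifiers, and all function names are assumptions, and a plain in-memory dictionary stands in for the Lucene index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical exported catalog file; not real PDS data.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <dc:title>Mars orbiter imaging data</dc:title>
    <dc:subject>Mars</dc:subject>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <dc:title>Saturn ring occultation data</dc:title>
    <dc:subject>Saturn</dc:subject>
  </rdf:Description>
</rdf:RDF>
"""


def parse_rdf(text):
    """Parse step: return {dataset_id: {field: [values]}}."""
    records = {}
    for desc in ET.fromstring(text).iter("{%s}Description" % RDF_NS):
        about = desc.get("{%s}about" % RDF_NS)
        fields = defaultdict(list)
        for child in desc:
            name = child.tag.split("}", 1)[-1]  # strip the XML namespace
            fields[name].append((child.text or "").strip())
        records[about] = dict(fields)
    return records


def build_index(records, indexed_fields):
    """Index step: map (field, lowercase token) to a set of dataset ids."""
    index = defaultdict(set)
    for dataset_id, fields in records.items():
        for field in indexed_fields:
            for value in fields.get(field, []):
                for token in value.lower().split():
                    index[(field, token)].add(dataset_id)
    return index


def search(index, field, term):
    """Query step: look up one term in one indexed field."""
    return sorted(index.get((field, term.lower()), set()))


records = parse_rdf(SAMPLE_RDF)
index = build_index(records, indexed_fields=["title", "subject"])
```

Only the fields named in `indexed_fields` become searchable, which mirrors the "index the fields that we want" and "search the index on the fields that we want" bullets.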

Search Interface

Results

Some Nutch History • In the next few slides, we’ll go through some of Nutch’s history, including my involvement, the history of Nutch dev, and how we came to today

How I got involved • In CS72: Seminar on Search Engines at USC • Okay well it used to be called CS599, but you get the picture • Started out by contributing RSS parsing plugin • My final project in 599 • Moved on from there to • NUTCH-88, redesign of the parsing framework • NUTCH-139, Metadata container support • NUTCH-210, Web Context application file • And various other bug fixes, and contributions here and there • Mailing list support • Wiki support • Became committer in October 2006 • Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member

The Big Yellow Elephant • Before this guy was born • Lots of folks interested in Nutch • Hadoop is born (January 2008) • Credit: svnsearch.org

Post Hadoop Life • Nutch project kind of withered • Well, more than “kind of”: it did wither • Went years between releases • 0.8 to 1.0 took a while • Dev Community went into maintenance mode • Many committers simply went inactive • User Community deteriorated

Some Observations • It was pretty difficult to attract new committers • Took too long to VOTE them in • They were only interested in Hadoop type stuff • Not many organizations were doing web-scale search • Existing active committers dwindled • I was one of them!

Some Observations • There wasn’t a plan for what to do next • What features to work on? • What bugs to fix? • Many considered Nutch to be “production” worthy in its current form, and there were not many internet-scale users, so people just “put up” with its existing issues, e.g., being difficult to configure

Hadoop wasn’t the only spinoff • A lot of us interested in content detection and analysis, another major Nutch strength, went off to work on that in some other Apache project that I can’t remember the name of

How can Nutch reorganize? • Strong feeling from Nutch community that we should take whoever is left and think about what the “next generation” Nutch would look like • (Several cycles of) Mailing threads started by Andrzej Bialecki, Dennis Kubes, Otis Gospodnetic

Initial Nutch2 fizzles • Ended up being a lot of talk, but there wasn’t enough interest to pick up a shovel and help dig the hole • But…there were interesting things going on • Example: Nutchbase work from Dogacan and Enis

What was “Nutchbase”? • Take the Apache implementation of Google’s “BigTable” • Column-oriented storage, high scalability in columns and rows • Store Nutch Web page content +

Lots of interest in Nutchbase • But, sadly maintained as a patch for a year or more • NUTCH-650 HBase integration • Brought about some interesting thoughts • If storage can be abstracted, what about: • Messaging layer (JMS Nutch?) • Parsing? • Indexing (Solr, Lucene, you-name-it)

Post Nutch 1.0 • Nutch 1.0 release was a true “1.”-oh! • Included production features • Those using it were happy, b/c they had bought into the model • Useable, tuneable • But, how do we get to Nutch 2.0?

A few things happen in parallel • 1.1 Release? • I had some free time and was willing to RM a Nutch 1.1 release to get things going • Dogacan, Enis, Julien and Andrzej got interested in moving Nutchbase forward • But took it to the next level…we’ll get back to this • We elected a new committer • Julien Nioche • Patches that had sat for years now got committed

Oh, and Nutch became TLP • Grabbed folks that were active in Nutch community • Decided to move forward with Nutch/HBase as the de-facto platform • No need to maintain home-grown storage formats • And, take it to the next level, to ORM-ness • Decided to make Nutch a “delegator” rather than a workhorse • In other words…

Nutch2: “Delegator” • Indexing/Querying? • Solr has a lot of interest and does tons of work in this area: let’s use it instead of vanilla Lucene • Parsing? • Tika: ditto • Storage • Let’s use the ORM layer that some of the Nutch committers were working on
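A minimal sketch of the "delegator" idea: the crawler coordinates the steps but hands parsing, storage, and indexing to collaborators that could be swapped out. All class names here are hypothetical in-memory stand-ins, not the Tika, Gora, or Solr APIs.

```python
class WhitespaceParser:
    """Stand-in for a parsing library; returns text plus trivial metadata."""

    def parse(self, raw):
        text = " ".join(raw.split())
        return {"text": text, "length": len(text)}


class InMemoryStore:
    """Stand-in for the storage layer."""

    def __init__(self):
        self.pages = {}

    def put(self, url, page):
        self.pages[url] = page


class ListIndexer:
    """Stand-in for a search server; just records submitted documents."""

    def __init__(self):
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)


class Crawler:
    """Coordinates the work but delegates each step to a collaborator."""

    def __init__(self, parser, store, indexer):
        self.parser = parser
        self.store = store
        self.indexer = indexer

    def process(self, url, raw):
        parsed = self.parser.parse(raw)      # delegated parsing
        self.store.put(url, parsed)          # delegated storage
        self.indexer.add({"id": url, "content": parsed["text"]})  # delegated indexing
        return parsed


crawler = Crawler(WhitespaceParser(), InMemoryStore(), ListIndexer())
result = crawler.process("http://example.org/", "  hello   web-scale\nsearch ")
```

Because `Crawler` depends only on the `parse`, `put`, and `add` methods, any component offering those methods can replace a stand-in without touching the crawler itself.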

Enter Gora: “that ORM technology” • Initially baked up at GitHub • Decided to move to the Incubator in Sept 2010 • I was contacted and asked to champion the effort • What is Gora? • Uses Apache Avro to specify objects and their schema • ORM middleware takes Avro specs, generates Java code – plugs for HBase, Cassandra, in-memory SQL store, etc.

Nutch and Gora • Throw out all code in Nutch that had to do with Hadoop’s Writable interface • Generated now by “Web Page” schema in Gora • Web Page is canonical Nutch object for storage • Parse text, parse data, etc. • No more web-db, crawl-db, etc.
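To give a feel for what an Avro record schema for a "Web Page" object could look like, here is a small sketch. The field names, namespace, and the `new_record` helper are assumptions for illustration only; this is not the actual schema shipped with Nutch or Gora, and no Avro library is used (the schema is just built as a dictionary and serialized to JSON).

```python
import copy
import json

# Hypothetical Avro record schema; field names are illustrative only.
WEB_PAGE_SCHEMA = {
    "type": "record",
    "name": "WebPage",
    "namespace": "example.storage",
    "fields": [
        {"name": "baseUrl", "type": ["null", "string"], "default": None},
        {"name": "content", "type": ["null", "bytes"], "default": None},
        {"name": "text", "type": ["null", "string"], "default": None},
        {"name": "fetchTime", "type": "long", "default": 0},
        {"name": "metadata",
         "type": {"type": "map", "values": "string"},
         "default": {}},
    ],
}


def new_record(schema, **values):
    """Build a dict with every schema field, filling in declared defaults."""
    names = {f["name"] for f in schema["fields"]}
    unknown = set(values) - names
    if unknown:
        raise KeyError("fields not in schema: %s" % sorted(unknown))
    # Deep-copy defaults so records never share mutable default values.
    record = {f["name"]: copy.deepcopy(f.get("default"))
              for f in schema["fields"]}
    record.update(values)
    return record


# The JSON text is what a schema file on disk would contain.
schema_json = json.dumps(WEB_PAGE_SCHEMA, indent=2)
page = new_record(WEB_PAGE_SCHEMA, baseUrl="http://example.org/", text="hello")
```

The point of the design is that one schema describes the canonical storage object, and code for a given backend is derived from it rather than hand-written per store.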

Out with the old… • Throw out Nutch webapp • Solr provides REST-ful services to get at metadata/index • We’ll add the REST (pun) for storage/etc. • Throw out Lucene code • Slowly trash existing Nutch parsers

In with the new • Get rid of webapp • Nutch 2.x has seen contributions of REST web services for full crawl cycle, storage I/F • Delegate indexing to Solr • Nutch 1.x first appearance of SolrIndexer and Nutch Solr schema • Delegate parsing to Tika • Nutch 1.1 first appearance of parse-tika • Have been decommissioning existing parsers • Suggested improvements to Tika during this process

Nutch2 Architecture

Learning from our mistakes • Maintenance • Checking in jars made the Nutch checkout huge (even of just the “source”) • Now using Ivy to manage dependencies • Patches sitting? • Not on my watch! Encouragement to find and commit patches that have been sitting for a while, or simply disposition them • People want to use Nutch code as “dep” • Build now includes ability for RM to push to Maven Central NOTE: CHRIS’S OPINION SLIDE

Learning from our mistakes • Community • Folks contributing patches? • Make ’em a committer • Folks providing good testing results? • Make ’em a committer • Folks making good documentation? • Make ’em a committer • It’s the sign of a healthy Apache project if new committers (and members) are being elected NOTE: CHRIS’S OPINION SLIDE

Learning from our mistakes • Configuration of Nutch is hard • It still is • Getting easier though • Anyone have any great ideas or patches to integrate with a DI framework? • Things like Gora, Solr, etc., are making this easier • Providing flexible service interfaces beyond Java APIs • Existing work on NUTCH-932, NUTCH-931 and NUTCH-880 is just the beginning

Interesting work going on • I taught a class on Search Engines this past summer • Some neat projects that I’m working with my students to contribute back to Apache • Implementation of Authority/Hub scoring • Deduplication improvements • Clustering plugin improvements • Work to improve Nutch-Solr-Drupal integration
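For readers unfamiliar with Authority/Hub scoring, a minimal sketch of the HITS iteration follows. It is not the students' implementation or a Nutch scoring plugin; the toy graph, page names, and iteration count are assumptions.

```python
import math


def hits(links, iterations=50):
    """Iterative hub/authority scoring over a dict {page: [outlinked pages]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score vectors to unit length to keep them bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two link pages pointing at two content pages.
toy_links = {
    "portal": ["paper1", "paper2"],
    "blog": ["paper1"],
    "paper1": [],
    "paper2": [],
}
hub_scores, auth_scores = hits(toy_links)
```

Here "paper1" is linked from both "portal" and "blog", so it ends up the strongest authority, while "portal" links to both papers and ends up the strongest hub.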

Wrapup • Nutch has seen tremendous highs and lows over the years • We’re still kicking • The newest version of Nutch (2.0) will have a vastly slimmed down footprint, and will use existing successful frameworks for heavy lifting • Solr, Tika, Gora, Hadoop • If you’re interested in our dev, check us out at http://nutch.apache.org

Alright, I’ll shut up now • Any questions? • THANK YOU! • mattmann@apache.org • @chrismattmann on Twitter

Acknowledgements • Nutch team • Some material inspired from Andrzej Bialecki’s talks here • OODT team at JPL
