Nutch Search Engine Tool

Nutch Search Engine Tool

Nutch overview • A full-fledged web search engine • Functionalities of Nutch • Internet and Intranet crawling • Parsing different document formats (PDF, HTML, XML, JS, DOC,PPT etc.)‏ • Web interface for querying the index • Management of Recrawls

Nutch Architecture • 4 main components • Crawler • Web Database (WebDB, LinkDB, segments)‏ • Indexer • Searcher • Crawler and Searcher are highly decoupled enabling independent scaling • Highly modular, Plugin based architechture

Nutch Architecture Doug Cutting, "Nutch: Open Source Web Search", 22 May 2004, WWW2004, New York

Steps in a Crawl+Index cycle • Create a new WebDB (admin db -create). • Inject root URLs into the WebDB (inject). • Generate a fetchlist from the WebDB in a new segment (generate). • Fetch content from URLs in the fetchlist (fetch). • Update the WebDB with links from fetched pages (updatedb). • Repeat steps 3-5 until the required depth is reached. • Update segments with scores and links from the WebDB (updatesegs). • Index the fetched pages (index). • Eliminate duplicate content (and duplicate URLs) from the indexes (dedup). • Merge the indexes into a single index for searching (merge).

Crawling (cont.) • Can effectively crawl upto ~100M pages • Crawl Statistics on KReSIT site (it.iitb) • Took 153 mins for a deep crawl (depth = 10) • Crawled 4171 documents • Size of crawl on disk 168MB • Size of index ~25MB

Web Database (WebDB)‏ • Persistent data structure for mirroring the structure and properties of the web graph being crawled • The WebDB stores two types of entities • Pages • Links • Optimised for frequent updation

Crawl Structure of it.iitb

Page DB • Page Database • used for fetch scheduling • Contains: • pages indexed and sorted by MD5 and URL • outlinks, fetch information, page score • A set of APIs are provided to perform the various operations

Page 1: Version: 4 URL: http://keaton/tinysite/A.html ID: fb8b9f0792e449cda72a9670b4ce833a Next fetch: Thu Nov 24 11:13:35 GMT 2005 Retries since fetch: 0 Retry interval: 30 days Num outlinks: 1 Score: 1.0 NextScore: 1.0 Page 2: Version: 4 URL: http://keaton/tinysite/B.html ID: 404db2bd139307b0e1b696d3a1a772b4 Next fetch: Thu Nov 24 11:13:37 GMT 2005 Retries since fetch: 0 Retry interval: 30 days Num outlinks: 3 Score: 1.0 NextScore: 1.0 Sample data of PageDB

Link DB • Link Database • Contains: • links sorted by MD5 • links sorted by URL • Represents full link graph. • Stores anchor text associated with each link • Used for: • Link analysis; • Anchor text indexing.

Segments • Collection of pages fetched and indexed by the crawler in a single run • One segment dir for each crawl-fetch-update cycle at a particular depth • Contains raw text and parsed data of the files crawled • Used to return the cached copy of a page and in snippet generation in results page

Segments • segread tool gives a useful summary of all segments. (Parsed, Started, Finished, Dir) • It can also be used to dump the segment data in raw text format. The dump switch gives the following details • Fetcher Output: <url, hash, fetch-date ..>. Entries that go into the WebDB • Content: Raw content including http-headers and other meta-data. stored cached copy of a page • ParseData & ParseText: appropriate parser plugin by looking at the Raw content, is used to generate this data

Nutch API

Plugins • Provide extensions to extension-points • Each extension point defines an interface that must be implemented by extension • Some core extension points • IndexingFilter: add meta-data to indexed fields • Parser: to parse a new type of document • NutchAnalyzer: language specific analyzers

References • Nutch Docs: http://lucene.apache.org/nutch/ • Nutch Wiki: http://wiki.apache.org/nutch/ • Prasad Pingali, CLIA consortium, Nutch Workshop, 2007 • Tom White, Introduction to Nutch, java.net website (http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html)

Nutch Search Engine Tool