Learn how to index documents using Lucene, create a Lucene index, use scoring functions, and search efficiently within the Nutch architecture.
Document Indexing and Scoring in Lucene and Nutch IST 441 Spring 2009 Instructor: Dr. C. Lee Giles Presenter: Saurabh Kataria
Outline • Architecture of Lucene and Nutch • Indexing in Lucene • Searching in Lucene • Lucene’s scoring function
Lucene’s Open Architecture • The pipeline runs Crawling → Parsing → Indexing → Searching. Crawlers (an FS crawler for the file system, Larm for the WWW, an IMAP server connector) fetch raw documents (PDF, HTML, DOC, TXT, …) and hand them to format-specific parsers (PDF parser, HTML parser, TXT parser, …), which produce Lucene Documents. The indexer, using an analyzer (StopAnalyzer, StandardAnalyzer, or a language-specific CN/DE analyzer), writes these Documents into the index, which searchers then query. Spring 2008
Nutch’s architecture • Courtesy of Doug Cutting’s presentation slides at WWW 2004
Nutch’s architecture
• Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding the relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which must then be summarized for display.
• Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing the indexes.
• Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
• Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.
Lucene’s index (conceptual) • An Index holds a collection of Documents; each Document contains one or more Fields; each Field is a Name–Value pair.
Create a Lucene index (step 1) • Create a Lucene document and add fields

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public void createDoc(String title, String body)
{
    Document doc = new Document();
    doc.add(new Field("content", body, Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
}
Create a Lucene index (step 2) • Create an Analyzer • Options:
• WhitespaceAnalyzer – divides text at whitespace
• SimpleAnalyzer – divides text at non-letters and converts to lower case
• StopAnalyzer – like SimpleAnalyzer, but also removes stop words
• StandardAnalyzer – good for most European languages; removes stop words and converts to lower case
Create a Lucene index (step 2) • An example of analyzing a document
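The analysis example can be approximated with a small self-contained sketch. This is not Lucene code: the class name, stop-word list, and splitting rule here are illustrative stand-ins that mimic what SimpleAnalyzer and StopAnalyzer do to a piece of text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnalyzerSketch {
    // Illustrative stop-word list; Lucene's StopAnalyzer ships its own.
    static final List<String> STOP_WORDS = Arrays.asList("the", "a", "an", "and", "to");

    // SimpleAnalyzer-like: split at non-letters, lower-case each token
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+"))
            if (!t.isEmpty()) tokens.add(t.toLowerCase());
        return tokens;
    }

    // StopAnalyzer-like: SimpleAnalyzer output minus stop words
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(simple("The Quick Brown Fox")); // [the, quick, brown, fox]
        System.out.println(stop("The Quick Brown Fox"));   // [quick, brown, fox]
    }
}
```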
Create a Lucene index (step 3) • Create an index writer, add the Lucene document into the index

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.SimpleAnalyzer;

public void writeDoc(Document doc, String idxPath)
{
    try {
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(idxPath, true),
                                             new SimpleAnalyzer(), true);
        writer.addDocument(doc);
        writer.close();
    } catch (IOException exp) {
        System.out.println("I/O Error!");
    }
}
Lucene Index – Behind the Scenes • Inverted Index (Inverted File) • Example: Doc 1: “Penn State Football …”, Doc 2: “Football players …”. The posting table maps each term (football, penn, state, players, …) to the list of documents in which it occurs.
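A minimal inverted index for the two example documents can be sketched in a few lines of plain Java (illustrative only; Lucene’s actual on-disk index files are described on the following slides):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndex {
    // Map each term to the sorted list of document ids containing it
    static Map<String, List<Integer>> build(String[] docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                List<Integer> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
                if (!list.contains(docId)) list.add(docId); // one entry per document
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        String[] docs = { "Penn State Football", "Football players" };
        System.out.println(build(docs));
        // {football=[0, 1], penn=[0], players=[1], state=[0]}
    }
}
```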
Posting table • The posting table is a fast look-up mechanism
• Key: word • Value: posting id, satellite data (#df, offset, …)
• Lucene implements the posting table with Java’s hash table, based on java.util.Hashtable
• Hash function: hc2 = hc1 * 31 + nextChar
• Posting table usage • Indexing: insertion (new terms), update (existing terms) • Searching: lookup, and construct the document vector
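The rolling hash mentioned above (hc2 = hc1 * 31 + nextChar) is the same recurrence used by java.lang.String.hashCode(), as this small check shows:

```java
public class StringHash {
    // Rolling hash over the characters of s: hc2 = hc1 * 31 + nextChar
    static int hash(String s) {
        int hc = 0;
        for (char c : s.toCharArray())
            hc = hc * 31 + c;
        return hc;
    }

    public static void main(String[] args) {
        // Matches the JDK's own String.hashCode()
        System.out.println(hash("state") == "state".hashCode()); // true
        System.out.println(hash("football") == "football".hashCode()); // true
    }
}
```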
Lucene Index Files: Field infos file (.fnm) 1, <content, 0x01>
Lucene Index Files: Term Dictionary file (.tis) 4,<<0,football,1>,2> <<0,penn,1>, 1> <<1,layers,1>,1> <<0,state,1>,2> Document Frequency can be obtained from this file.
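The term dictionary stores each term relative to the previous one as a prefix length plus suffix, which is why “players” appears above as <1,layers,…>: it shares a one-character prefix (“p”) with the preceding term “penn”. A simplified sketch of this prefix coding (the real .tis layout also interleaves field numbers and frequencies):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixCode {
    // Encode each sorted term as <sharedPrefixLength,suffix> relative to its predecessor
    static List<String> compress(String[] sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int p = 0;
            while (p < prev.length() && p < term.length() && prev.charAt(p) == term.charAt(p))
                p++;
            out.add("<" + p + "," + term.substring(p) + ">");
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(compress(new String[]{"football", "penn", "players", "state"}));
        // [<0,football>, <0,penn>, <1,layers>, <0,state>]
    }
}
```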
Lucene Index Files: Term Info index (.tii) 4,<football,1> <penn,3><layers,2> <state,1>
Lucene Index Files: Frequency file (.frq) <<2, 2, 3> <3> <5> <3, 3>> Term Frequency can be obtained from this file.
Lucene Index Files: Position file (.prx) <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>
Query Process in Lucene • A query is resolved in stages:
• Field info (in memory) – constant-time lookup
• Term Info Index (in memory) – constant-time lookup
• Term Dictionary (random file access)
• Frequency File (random file access)
• Position File (random file access)
Search Lucene’s index (step 1) • Construct a query (automatic)

import org.apache.lucene.search.Query;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public void formQuery(String queryString) throws ParseException
{
    QueryParser qp = new QueryParser("content", new StandardAnalyzer());
    Query query = qp.parse(queryString);
}
Search Lucene’s index (step 1) • Types of query:
• Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN]
• wildcard: [nu?ch] [nutc*]
• phrase: [“JAVA TOMCAT”]
• proximity: [“lucene nutch”~10]
• fuzzy: [roam~] matches roams and foam
• date range
• …
Search Lucene’s index (step 2) • Search the index

import java.io.IOException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public void searchIdx(String idxPath, Query query) throws IOException
{
    Directory fsDir = FSDirectory.getDirectory(idxPath, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Hits hits = is.search(query);
}
Search Lucene’s index (step 3) • Display the results

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    // show your results
    System.out.println("id: " + doc.get("id"));
}
Default Scoring Function • Similarity
score(Q,D) = coord(Q,D) · queryNorm(Q) · Σ t in Q ( tf(t in D) · idf(t)² · t.getBoost() · norm(D) )
• Question: What type of IR model does Lucene use?
• Factors: • term-based factors
• tf(t in D): term frequency of term t in document D (default implementation)
• idf(t): inverse document frequency of term t in the entire corpus (default implementation)
Default Scoring Function
• coord(Q,D) = (overlap between Q and D) / (maximum overlap), where the maximum overlap is the maximum possible length of overlap between Q and D
• queryNorm(Q) = 1 / (sum of squared weights)^½, where sum of squared weights = q.getBoost()² · Σ t in Q ( idf(t) · t.getBoost() )²
• If t.getBoost() = 1 and q.getBoost() = 1, then sum of squared weights = Σ t in Q idf(t)², and thus queryNorm(Q) = 1 / ( Σ t in Q idf(t)² )^½
• norm(D) = 1 / (number of terms)^½ (normalization by the total number of terms appearing in document D)
Example:
• D1: hello, please say hello to him. • D2: say goodbye • Q: you say hello
• coord(Q,D) = (overlap between Q and D) / (maximum overlap)
• coord(Q,D1) = 2/3, coord(Q,D2) = 1/2
• queryNorm(Q) = 1 / (sum of squared weights)^½, where sum of squared weights = q.getBoost()² · Σ t in Q ( idf(t) · t.getBoost() )²
• With t.getBoost() = 1 and q.getBoost() = 1, sum of squared weights = Σ t in Q idf(t)²
• queryNorm(Q) = 1/(0.5945² + 1²)^½ = 0.8596
• tf(t in D) = frequency^½
• tf(you,D1) = 0, tf(say,D1) = 1, tf(hello,D1) = 2^½ = 1.4142
• tf(you,D2) = 0, tf(say,D2) = 1, tf(hello,D2) = 0
• idf(t) = ln(N/(nⱼ+1)) + 1
• idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1
• norm(D) = 1 / (number of terms)^½
• norm(D1) = 1/6^½ = 0.4082, norm(D2) = 1/2^½ = 0.7071
• Score(Q,D1) = 2/3 · 0.8596 · (1 · 0.5945² + 1.4142 · 1²) · 0.4082 = 0.4135
• Score(Q,D2) = 1/2 · 0.8596 · (1 · 0.5945²) · 0.7071 = 0.1074
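The arithmetic in this example can be checked with a few lines of plain Java (no Lucene dependency; N = 2 documents, and the tf/idf values are taken directly from the slide):

```java
public class ScoringExample {
    // One summand-free form of the slide's formula:
    // score = coord * queryNorm * (sum of tf * idf^2 over query terms) * norm
    static double score(double coord, double queryNorm, double sumTfIdf2, double norm) {
        return coord * queryNorm * sumTfIdf2 * norm;
    }

    public static void main(String[] args) {
        double idfSay   = Math.log(2.0 / (2 + 1)) + 1;  // 0.5945
        double idfHello = Math.log(2.0 / (1 + 1)) + 1;  // 1.0
        double queryNorm = 1 / Math.sqrt(idfSay * idfSay + idfHello * idfHello); // 0.8596

        // D1: tf(say) = 1, tf(hello) = sqrt(2); D2: tf(say) = 1
        double scoreD1 = score(2.0 / 3, queryNorm,
                1 * idfSay * idfSay + Math.sqrt(2) * 1, 1 / Math.sqrt(6));
        double scoreD2 = score(1.0 / 2, queryNorm,
                1 * idfSay * idfSay, 1 / Math.sqrt(2));

        System.out.printf("score(Q,D1) = %.4f%n", scoreD1); // 0.4135
        System.out.printf("score(Q,D2) = %.4f%n", scoreD2); // 0.1074
    }
}
```

The two printed values reproduce the 0.4135 and 0.1074 computed by hand above.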