Introduction to Lucene Rong Jin
What is Lucene? • Lucene is a high-performance, scalable Information Retrieval (IR) library • Free, open-source project implemented in Java • Originally written by Doug Cutting • Became a project in the Apache Software Foundation in 2001 • It is the most popular free Java IR library • Lucene has been ported to Perl, Python, Ruby, C/C++, and C# (.NET)
Lucene Users • IBM OmniFind Yahoo! Edition • Technorati • Wikipedia • Internet Archive • LinkedIn • Eclipse • JIRA • Apache Roller • jGuru • More than 200 others
The Lucene Family • Lucene & Apache Lucene & Java Lucene: the IR library itself • Nutch: Hadoop-loving crawler, indexer, and searcher for web-scale search engines • Solr: search server • Droids: standalone framework for writing crawlers • Lucene.Net: C# port, incubator graduate • Lucy: C implementation of Lucene • PyLucene: Python port • Tika: content analysis toolkit
Indexing Documents • Each document comprises multiple fields • The Analyzer extracts words (tokens) from the text using a Tokenizer and TokenFilters • IndexWriter creates the inverted index and writes it to disk [Diagram: Document (Field, Field, Field, ...) → Analyzer (Tokenizer → TokenFilter) → IndexWriter → Dictionary / Inverted Index]
Lucene Classes for Indexing • Directory class • An abstract class representing the location of a Lucene index • FSDirectory stores the index in a directory on the filesystem • RAMDirectory holds all its data in memory; useful for smaller indices that can be fully loaded in memory and can be discarded when the application terminates (see the sketch below)
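A minimal sketch of the two Directory flavors, assuming the Lucene 4.x API and a hypothetical index path:

import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

// On-disk index: persists across application runs
Directory fsDir = FSDirectory.open(new File("/path/to/index"));

// In-memory index: fast, but gone when the JVM exits
Directory ramDir = new RAMDirectory();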
Lucene Classes for Indexing • IndexWriter class • Creates a new index or opens an existing one, and adds, removes, or updates documents in the index • Analyzer class • An abstract class for extracting tokens from the text to be indexed • StandardAnalyzer is the most common one (see the sketch below)
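A hedged sketch of opening an IndexWriter with a StandardAnalyzer, Lucene 4.x style; fsDir is the Directory from the previous sketch:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.util.Version;

// The analyzer is applied to every field indexed through this writer
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

// CREATE_OR_APPEND opens an existing index or creates a new one
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);

IndexWriter writer = new IndexWriter(fsDir, config);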
Lucene Classes for Indexing • Document class • A document is a collection of fields • Meta-data such as author, title, subject, and date modified are indexed and stored separately as fields of the document (see the sketch below)
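For illustration, a sketch of building a Document with a tokenized body field and exact-match metadata fields; the field names are made up, and it assumes the Lucene 4.x TextField/StringField helpers plus the writer from the previous sketch:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// TextField: analyzed (tokenized) full text, searchable by individual terms
doc.add(new TextField("contents", "Lucene is a high performance IR library ...", Field.Store.NO));
// StringField: indexed as a single token, good for exact-match metadata
doc.add(new StringField("title", "Introduction to Lucene", Field.Store.YES));
doc.add(new StringField("author", "Rong Jin", Field.Store.YES));
writer.addDocument(doc);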
Index Segments and Merge • Each index consists of multiple segments • Every segment is actually a standalone index itself, holding a subset of all indexed documents • At search time, each segment is visited separately and the results are combined

# ls -lh
total 1.1G
-rw-r--r-- 1 root root 123M 2009-03-14 10:29 _0.fdt
-rw-r--r-- 1 root root  44M 2009-03-14 10:29 _0.fdx
-rw-r--r-- 1 root root   33 2009-03-14 10:31 _9j.fnm
-rw-r--r-- 1 root root 372M 2009-03-14 10:36 _9j.frq
-rw-r--r-- 1 root root  11M 2009-03-14 10:36 _9j.nrm
-rw-r--r-- 1 root root 180M 2009-03-14 10:36 _9j.prx
-rw-r--r-- 1 root root 5.5M 2009-03-14 10:36 _9j.tii
-rw-r--r-- 1 root root 308M 2009-03-14 10:36 _9j.tis
-rw-r--r-- 1 root root   64 2009-03-14 10:36 segments_2
-rw-r--r-- 1 root root   20 2009-03-14 10:36 segments.gen
Index Segments and Merge • Each segment consists of multiple files, named _X.<ext>, where X is the segment name and <ext> identifies which part of the index that file holds • Separate files hold the different parts of the index (term vectors, stored fields, inverted index, etc.) • The optimize() operation merges all the segments into one • Merging involves a lot of disk IO and is time-consuming, but it significantly improves search efficiency (see the sketch below)
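A hedged sketch of the merge step; in Lucene 4.x the optimize() call was replaced by forceMerge(), and writer is the IndexWriter from the earlier sketch:

// Merge all segments down to a single segment.
// Heavy on disk IO and time, but later searches touch far fewer files.
writer.forceMerge(1);
writer.close();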
Lucene Classes for Reading Index • IndexReader class • Reads the index from the indexed files • Terms class • A container for all the terms in a specified field • TermsEnum class • Implements the BytesRefIterator interface, providing an interface for accessing each term
Reading Document Vector • Enable storing term vectors at indexing time:

FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setIndexed(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
fieldType.setStored(true);
doc.add(new Field("contents", contentString, fieldType));
Reading Document Vector • Enable storing term vectors at indexing time (previous slide) • Read the document vector and obtain each term in it:

IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)));
int maxDoc = reader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    Terms terms = reader.getTermVector(i, "contents");
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef text = null;
    while ((text = termsEnum.next()) != null) {
        String termText = text.utf8ToString();
        int docFreq = termsEnum.docFreq();
    }
}
Updating Documents in Index • IndexWriter.addDocument(): adds a document to the existing index • IndexWriter.deleteDocuments(): removes documents matching a term or query from the existing index • IndexWriter.updateDocument(): updates a document in the existing index, implemented as a delete followed by an add (see the sketch below)
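A sketch of the three operations, assuming Lucene 4.x method names and a hypothetical "path" field that uniquely identifies each document:

import org.apache.lucene.index.Term;

// Add a new document
writer.addDocument(doc);

// Delete all documents whose "path" field matches the term
writer.deleteDocuments(new Term("path", "/docs/old-file.txt"));

// Update: atomically delete the matching document(s) and add the new version
writer.updateDocument(new Term("path", "/docs/old-file.txt"), doc);
writer.commit();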
Other Features of Lucene Indexing • Concurrency • Multiple IndexReaders may be open at once on a single index • But only one IndexWriter can be open on an index at once • IndexReaders may be open even while a single IndexWriter is making changes to the index; each IndexReader will always show the index as of the point-in-time that it was opened.
Other Features of Lucene Indexing • A file-based lock prevents two writers from working on the same index at once • If the file write.lock exists in your index directory, a writer currently has the index open; any attempt to create another writer on the same index will hit a LockObtainFailedException (see the sketch below)
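If a writer crashed and left a stale lock behind, the static helpers below can be used to check and clear it; a hedged sketch against the Lucene 4.x API, where dir is the Directory holding the index. Only unlock when you are certain no writer is actually running:

import org.apache.lucene.index.IndexWriter;

if (IndexWriter.isLocked(dir)) {
    // write.lock still exists; clear it only if no other writer is alive
    IndexWriter.unlock(dir);
}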
Lucene Classes for Searching • IndexSearcher class • Search through the index • TopDocs class • A container of pointers to the top N ranked results • Records the docID and score for each of the top N results (docID can be used to retrieve the document)
Lucene Classes for Searching • QueryParser • Parses a text query into a Query object • Needs an Analyzer to extract tokens from the text query • Searching for a single term • Term class • Similar to Field, a pair of name and value • Used together with the TermQuery class to create a query (see the search sketch below)
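Putting the pieces together, a hedged end-to-end search sketch against the Lucene 4.x API; the "contents" and "title" field names are the ones assumed in the earlier indexing sketches:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

IndexSearcher searcher = new IndexSearcher(
    DirectoryReader.open(FSDirectory.open(new File("/path/to/index"))));

// Parse a free-text query against the "contents" field
QueryParser parser = new QueryParser(Version.LUCENE_40, "contents",
    new StandardAnalyzer(Version.LUCENE_40));
Query query = parser.parse("information retrieval");

// TopDocs holds the docID and score of the top 10 hits
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    Document d = searcher.doc(sd.doc);   // docID -> stored fields
    System.out.println(sd.score + "  " + d.get("title"));
}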
Similarity Functions in Lucene • Many similarity functions are implemented in Lucene • Okapi (BM25Similarity) • Language model (LMDirichletSimilarity) • Example:

Similarity simfn = new BM25Similarity();
searcher.setSimilarity(simfn);   // searcher is an IndexSearcher
Similarity Functions in Lucene • DefaultSimilarity (TF-IDF scoring) is used unless another Similarity is set • The Similarity abstraction allows implementing various custom similarity functions
Lucene Scoring in DefaultSimilarity
• tf = sqrt(freq): how often a term appears in the document
• idf = log(numDocs/(docFreq+1)) + 1: how rare the term is across the index
• coord = overlap/maxOverlap: number of query terms found in the document
• lengthNorm = 1/sqrt(numTerms): normalizes for the total number of terms in the field
• queryNorm = 1/sqrt(sumOfSquaredWeights): normalization factor that makes scores across queries comparable
• boost(index): boost of the field at index time
• boost(query): boost of the term at query time
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
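Putting these factors together, the practical scoring function documented for TFIDFSimilarity (linked above) has roughly this shape:

score(q,d) = coord(q,d) * queryNorm(q) * SUM over terms t in q of [ tf(t in d) * idf(t)^2 * boost(t) * norm(t,d) ]

where norm(t,d) folds the index-time field boost and lengthNorm into a single per-field value.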
Customizing Scoring • Subclass DefaultSimilarity and override the method(s) you want to customize • Example: ignore how common a term is across the index • Example: increase the weight of terms in the "title" field (see the sketch below)
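A minimal sketch of the first customization, assuming the Lucene 4.x DefaultSimilarity API: overriding idf() so that rare and common terms are weighted the same. (Per-field boosts, such as up-weighting "title", are usually applied via field or query boosts rather than inside the Similarity.)

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoIdfSimilarity extends DefaultSimilarity {
    // Ignore how common a term is across the index:
    // every term contributes the same idf weight
    @Override
    public float idf(long docFreq, long numDocs) {
        return 1.0f;
    }
}

// Plug it in on both sides:
//   config.setSimilarity(new NoIdfSimilarity());    // IndexWriterConfig, at indexing time
//   searcher.setSimilarity(new NoIdfSimilarity());  // IndexSearcher, at search time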
Queries in Lucene • Lucene supports many types of queries • RangeQuery • PrefixQuery • WildcardQuery, BooleanQuery, PhraseQuery, ... (examples below)
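For illustration, hedged sketches of constructing a few of these query types programmatically (Lucene 4.x API; the "contents" field is the one assumed earlier):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

// All terms starting with "luc"
PrefixQuery prefix = new PrefixQuery(new Term("contents", "luc"));

// * matches any character sequence, ? matches a single character
WildcardQuery wild = new WildcardQuery(new Term("contents", "lu*ne"));

// Exact phrase "information retrieval"
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("contents", "information"));
phrase.add(new Term("contents", "retrieval"));

// Must contain "lucene", must not contain "solr"
BooleanQuery bool = new BooleanQuery();
bool.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);
bool.add(new TermQuery(new Term("contents", "solr")), BooleanClause.Occur.MUST_NOT);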
Analyzers • Basic analyzers • Analyzers for different languages (in analyzers-common) • Chinese, Japanese, Arabic, German, Greek, ….
Analysis in Action
"The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer: [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer: [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analysis in Action
"XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer: [XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
Analyzer: Key Structure • Breaks text into a stream of tokens enumerated by the TokenStream class • createComponents() is the only method an Analyzer subclass is required to implement (Lucene 4.x); it builds the Tokenizer/TokenFilter chain for a field • The returned TokenStream components are reused across calls, saving object allocation and garbage collection
TokenStream Class • Two types of TokenStream • Tokenizer: a TokenStream that tokenizes the input from a Reader, i.e., chunks the input into tokens • TokenFilter: allows you to chain TokenStreams together, i.e., further modify the tokens (removing them, stemming them, and other actions) • A chain usually consists of 1 Tokenizer followed by N TokenFilters
TokenStream Class • Example: StopAnalyzer
[Diagram: Text → LowerCaseTokenizer → TokenStream → StopFilter → TokenStream]
TokenStream Class • Example: StopAnalyzer • Order matters!
[Diagram: Text → LowerCaseTokenizer → TokenStream → StopFilter → TokenStream]
[Diagram: Text → LetterTokenizer → TokenStream → LowerCaseFilter → TokenStream → StopFilter → TokenStream]
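To tie the pieces together, a hedged sketch of a custom Analyzer that reproduces the StopAnalyzer chain above (Lucene 4.x API; the class name is made up):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.util.Version;

public class MyStopAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // 1 Tokenizer ...
        Tokenizer source = new LetterTokenizer(Version.LUCENE_40, reader);
        // ... followed by N TokenFilters; order matters:
        // lowercase first so the (lowercase) stop-word set actually matches
        TokenStream chain = new LowerCaseFilter(Version.LUCENE_40, source);
        chain = new StopFilter(Version.LUCENE_40, chain, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, chain);
    }
}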