Full-Text Search with Lucene

Yonik Seeley, yonik@apache.org
02 May 2007, Amsterdam, Netherlands

What is Lucene
• High performance, scalable, full-text search library
• Written by Doug Cutting, 100% Java
• Focus: Indexing + Searching Documents
• Easily embeddable, no config files
• No crawlers or document parsing

Inverted Index
Documents:
  0  Little Red Riding Hood
  1  Robin Hood
  2  Little Women
Terms and the ids of the documents containing them:
  aardvark
  hood      0 1
  little    0 2
  red       0
  riding    0
  robin     1
  women     2
  zoo
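The term-to-documents mapping on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's actual index format: the class name `InvertedIndexSketch` and its `build` method are hypothetical, and tokenization is a naive lowercase-and-split-on-whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index (hypothetical helper, not a Lucene class): maps each
// lowercased term to the increasing list of document ids that contain it.
class InvertedIndexSketch {

    static Map<String, List<Integer>> build(String[] docs) {
        // TreeMap keeps the terms sorted, like a term dictionary.
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId].toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(token, t -> new ArrayList<>());
                // Doc ids arrive in increasing order, so only the last entry can be a duplicate.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {"Little Red Riding Hood", "Robin Hood", "Little Women"};
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Looking up a term is then a single map access, and a two-term AND query amounts to intersecting two postings lists.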

Basic Application
• Indexing: Document (field1: value1, field2: value2, field3: value3) -> addDocument() -> IndexWriter -> Lucene Index
• Searching: Query -> search() -> IndexSearcher (reads the Lucene Index) -> Hits (Matching Docs)

Indexing Documents
    IndexWriter writer = new IndexWriter(directory, analyzer, true);
    Document doc = new Document();
    doc.add(new Field("title", "Lucene in Action",
                      Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("author", "Erik Hatcher",
                      Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("author", "Otis Gospodnetic",
                      Field.Store.YES, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();

Field Options
• Indexed
  • Necessary for searching or sorting
• Tokenized
  • Text analysis done before indexing
• Stored
• Compressed
• Binary
  • Currently for stored-only fields

Searching an Index
    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser("defaultField", analyzer);
    Query query = parser.parse("title:Lucene");
    Hits hits = searcher.search(query);
    System.out.println("matches:" + hits.length());
    // Assumes at least one match; check hits.length() first in real code.
    Document doc = hits.doc(0);
    System.out.println("first:" + doc.get("title"));
    searcher.close();

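Scoring
• VSM: Vector Space Model
• tf: term frequency, the number of times the term occurs in the field
• lengthNorm: normalization based on the number of tokens in the field (shorter fields score higher)
• idf: inverse document frequency, derived from the number of documents containing the term (rarer terms score higher)
• coord: coordination factor, based on the number of matching query terms
• document boost
• query clause boost
http://lucene.apache.org/java/docs/scoring.html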
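To make these factors concrete, here is a small sketch of how each one might be computed. The formulas follow what Lucene's DefaultSimilarity used in this era as far as I recall (square root of the frequency, a log-based idf, an inverse square root length norm); treat them as assumptions and check the scoring page linked above. The class `ScoringSketch` and the simplified `termScore` are hypothetical and leave out boosts, query normalization, and the lossy encoding of norms.

```java
// Hypothetical sketch of the scoring factors; not Lucene's Similarity class.
class ScoringSketch {

    // tf: grows with the number of occurrences, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: terms found in fewer documents get a larger weight.
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // lengthNorm: a match in a short field counts more than one in a long field.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // coord: fraction of the query terms that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // Simplified single-term score: tf * idf^2 * lengthNorm (boosts omitted).
    static double termScore(int freq, int docFreq, int numDocs, int numTokens) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        // Same term frequency and field length, different rarity.
        System.out.println("rare term:   " + termScore(1, 10, 1000, 4));
        System.out.println("common term: " + termScore(1, 900, 1000, 4));
    }
}
```

With everything else equal, the rarer term produces the higher score, which is the behavior the idf bullet describes.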

Query Construction
Lucene QueryParser
• Example: queryParser.parse("title:spiderman");
• good for IPC, human-entered queries, debugging
• does text analysis and constructs appropriate queries
• not all query types supported
Programmatic query construction
• Example: new TermQuery(new Term("title", "spiderman"))
• explicit, no escaping necessary

Query Examples
• mission impossible
  • EQUIV: mission OR impossible
  • QueryParser default is "optional"
• +mission +impossible -actor:cruise
  • EQUIV: mission AND impossible NOT actor:cruise
• "mission impossible" -actor:cruise
• title:spiderman^10 description:spiderman
• description:"spiderman movie"~10

Query Examples 2
• releaseDate:[2000 TO 2007]
  • Range search: lexicographic ordering, so beware of numbers
• Wildcard searches: te?t, te*t, test*
• spider~
  • Fuzzy search: Levenshtein distance
  • Optional minimum similarity: spider~0.7
• *:*
• (a AND b) OR (c AND d)
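The warning about numbers in range searches is easy to demonstrate with plain string comparison. The sketch below assumes terms are compared as strings, as the slide states; the class `RangePaddingSketch` and its helpers are hypothetical, and zero-padding to a fixed width is one common workaround rather than anything Lucene does for you.

```java
import java.util.Locale;

// Hypothetical demo of lexicographic range matching; not a Lucene class.
class RangePaddingSketch {

    // Inclusive range test using String ordering, the way terms compare.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Left-pad a non-negative number with zeros to a fixed width so that
    // string order agrees with numeric order.
    static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "100" sorts between "1" and "2", so it wrongly matches [1 TO 2].
        System.out.println(inRange("100", "1", "2"));
        // Unpadded: "5" sorts after "10", so it wrongly misses [2 TO 10].
        System.out.println(inRange("5", "2", "10"));
        // Padded to the same width, both cases behave numerically.
        System.out.println(inRange(pad(100, 5), pad(1, 5), pad(2, 5)));
        System.out.println(inRange(pad(5, 5), pad(2, 5), pad(10, 5)));
    }
}
```

The releaseDate:[2000 TO 2007] example works as written only because four-digit years all have the same width.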

Deleting Documents
• IndexReader.deleteDocument(int id)
  • exclusive with IndexWriter
  • powerful
• Deleting with IndexWriter
  • deleteDocuments(Term t)
  • updateDocument(Term t, Document d)
• Deleting does not immediately reclaim space

Performance
• Decrease index segments
  • Lower merge factor
  • Optimize
• Use cached filters
  • Instead of the query '+title:spiderman +released:true'
  • use the query 'title:spiderman' filtered by 'released:true'
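The cached filter idea can be illustrated without Lucene: compute the set of documents matching 'released:true' once as a bit set, keep it, and intersect it with the results of each query. Everything in this sketch is hypothetical (the class `CachedFilterSketch`, the in-memory arrays standing in for an index, the substring match standing in for a term query); it only shows why reusing the filter is cheaper than re-evaluating the clause on every search.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for an index with a cached filter.
class CachedFilterSketch {
    private final String[] titles;
    private final boolean[] released;
    private final Map<String, BitSet> filterCache = new HashMap<>();
    int filterBuilds = 0; // counts how often the filter was actually computed

    CachedFilterSketch(String[] titles, boolean[] released) {
        this.titles = titles;
        this.released = released;
    }

    // Builds the 'released:true' bit set on first use, then reuses it.
    BitSet releasedFilter() {
        return filterCache.computeIfAbsent("released:true", key -> {
            filterBuilds++;
            BitSet bits = new BitSet(released.length);
            for (int i = 0; i < released.length; i++) {
                if (released[i]) {
                    bits.set(i);
                }
            }
            return bits;
        });
    }

    // Evaluates the title clause, then intersects it with the cached filter.
    BitSet search(String titleTerm) {
        BitSet hits = new BitSet(titles.length);
        for (int i = 0; i < titles.length; i++) {
            if (titles[i].toLowerCase().contains(titleTerm)) {
                hits.set(i);
            }
        }
        hits.and(releasedFilter());
        return hits;
    }

    public static void main(String[] args) {
        CachedFilterSketch index = new CachedFilterSketch(
            new String[]{"Spiderman", "Spiderman 2", "Batman"},
            new boolean[]{true, false, true});
        System.out.println(index.search("spiderman"));
        System.out.println(index.search("batman"));
        System.out.println("filter builds: " + index.filterBuilds);
    }
}
```

A filter also does not contribute to the score, so restricting by 'released:true' this way leaves the ranking of 'title:spiderman' unchanged.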

Index Structure
• IndexWriter params
  • MaxBufferedDocs
  • MergeFactor
  • MaxMergeDocs
  • MaxFieldLength
Index files: segments_3 _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del

Search Relevancy
Document Analysis
• Input: PowerShot SD 500
• WhitespaceTokenizer: PowerShot | SD | 500
• WordDelimiterFilter (catenateWords=1): Power | Shot | PowerShot | SD | 500
• LowercaseFilter: power | shot | powershot | sd | 500
Query Analysis
• Input: power-shot sd500
• WhitespaceTokenizer: power-shot | sd500
• WordDelimiterFilter (catenateWords=0): power | shot | sd | 500
• LowercaseFilter: power | shot | sd | 500
Result: A Match!
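The two analysis chains on this slide can be approximated in plain Java to see why the document and the query end up with matching tokens. This is a much-simplified stand-in for the real filters: the class `WordDelimiterSketch`, its splitting rules (non-alphanumeric characters, lower-to-upper case changes, letter/digit boundaries), and the position-free `containsAll` match are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified imitation of whitespace tokenizing, word
// delimiting, and lowercasing; not the real filter implementations.
class WordDelimiterSketch {

    // Splits one token on delimiters, case changes, and letter/digit boundaries.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(cur, parts);
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isLetter(prev) != Character.isLetter(c);
                if (caseChange || typeChange) {
                    flush(cur, parts);
                }
            }
            cur.append(c);
        }
        flush(cur, parts);
        return parts;
    }

    private static void flush(StringBuilder cur, List<String> parts) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Whitespace split, word delimiting, optional catenation of word parts, lowercasing.
    static List<String> analyze(String text, boolean catenateWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            List<String> parts = split(raw);
            for (String p : parts) {
                tokens.add(p.toLowerCase(Locale.ROOT));
            }
            if (catenateWords && parts.size() > 1 && allLetters(parts)) {
                tokens.add(String.join("", parts).toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Each part is all letters or all digits, so checking the first char is enough.
    private static boolean allLetters(List<String> parts) {
        for (String p : parts) {
            if (!Character.isLetter(p.charAt(0))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> doc = analyze("PowerShot SD 500", true);
        List<String> query = analyze("power-shot sd500", false);
        System.out.println("doc:   " + doc);
        System.out.println("query: " + query);
        System.out.println("match: " + doc.containsAll(query));
    }
}
```

Catenating at index time is what lets a query for the single word "powershot" also find the document, since that token would otherwise never be indexed.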

Tokenizers
• Tokenizers break field text into tokens
• source string: “full-text lucene.apache.org”
• StandardTokenizer
  • “full” “text” “lucene.apache.org”
• WhitespaceTokenizer
  • “full-text” “lucene.apache.org”
• LetterTokenizer
  • “full” “text” “lucene” “apache” “org”
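The difference between the whitespace and letter strategies is simple enough to reproduce in plain Java. The sketch below is not Lucene's implementation: `TokenizerSketch` and its two methods are hypothetical helpers that imitate the behavior described on the slide. StandardTokenizer is grammar-based and is not imitated here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical imitations of two tokenizing strategies; not Lucene classes.
class TokenizerSketch {

    // Whitespace strategy: tokens are maximal runs of non-whitespace characters.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Letter strategy: tokens are maximal runs of letters; everything else separates.
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "full-text lucene.apache.org";
        System.out.println(whitespaceTokens(source)); // [full-text, lucene.apache.org]
        System.out.println(letterTokens(source));     // [full, text, lucene, apache, org]
    }
}
```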

TokenFilters
• LowerCaseFilter
• StopFilter
• LengthFilter
• ISOLatin1AccentFilter
• SnowballPorterFilter
  • stemming: reducing words to root form
  • rides, ride, riding => ride
  • country, countries => countri
• contrib/analyzers for other languages

Analyzers
    class MyAnalyzer extends Analyzer {
      private Set myStopSet =
          StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

      public TokenStream tokenStream(String fieldname, Reader reader) {
        TokenStream ts = new StandardTokenizer(reader);
        ts = new StandardFilter(ts);
        ts = new LowerCaseFilter(ts);
        ts = new StopFilter(ts, myStopSet);
        return ts;
      }
    }

Analysis Tips
• Use PerFieldAnalyzerWrapper
• Add same field more than once, analyze differently
  • Boost exact case matches
  • Boost exact tense matches
  • Query with or without synonyms
• Soundex for sounds-like queries

Nutch
• Open source web search application
• Crawlers
• Link-graph database
• Document parsers (HTML, Word, PDF, etc.)
• Language + charset detection
• Utilizes Hadoop (DFS + MapReduce) for massive scalability

Solr
• XML/HTTP, JSON APIs
• Faceted search / navigation
• Flexible Data Schema
• Hit Highlighting
• Configurable Caching
• Replication
• Web admin interface
• Solr Flare: Ruby on Rails user interface

Questions?
