
Full-Text Search with Lucene

Lucene is a Java-based full-text search library for indexing and searching documents. It is easily embeddable and does not include crawlers or document parsers. This presentation covers the basics of Lucene indexing, document querying, and scoring.


Presentation Transcript


  1. Full-Text Search with Lucene • Yonik Seeley (yonik@apache.org) • 02 May 2007 • Amsterdam, Netherlands

  2. What is Lucene • High performance, scalable, full-text search library • Written by Doug Cutting, 100% Java • Focus: Indexing + Searching Documents • Easily embeddable, no config files • No crawlers or document parsing

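  3. Inverted Index
     Documents: 0 = Little Red Riding Hood • 1 = Robin Hood • 2 = Little Women
     Sorted terms, each with the ids of the documents that contain it:
       aardvark (no postings shown)
       hood: 0, 1
       little: 0, 2
       red: 0
       riding: 0
       robin: 1
       women: 2
       zoo (no postings shown)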
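The term-to-document mapping on this slide can be sketched in plain Java, without Lucene. This is only an illustration of the data structure: the class name `InvertedIndexSketch` and the whitespace-plus-lowercase tokenization are assumptions made for the example, and a real Lucene index also stores frequencies, positions, and norms.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {

    // Maps each lowercased term to the sorted set of ids of the documents containing it.
    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>(); // TreeMap keeps terms sorted
        for (int docId = 0; docId < docs.size(); docId++) {
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Little Red Riding Hood", "Robin Hood", "Little Women");
        // Prints one "term -> [doc ids]" line per term, in sorted term order.
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

Looking up a term is then a single map access, which is why search time depends on the number of matching terms rather than on the number of documents scanned.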

  4. Basic Application
     Indexing: Document (field1: value1, field2: value2, field3: value3) -> addDocument() -> IndexWriter -> Lucene Index
     Searching: Query -> search() -> IndexSearcher (reads the Lucene Index) -> Hits (Matching Docs)

  5. Indexing Documents
     // directory and analyzer are assumed to be created elsewhere
     IndexWriter writer = new IndexWriter(directory, analyzer, true); // true = create a new index
     Document doc = new Document();
     doc.add(new Field("title", "Lucene in Action", Field.Store.YES, Field.Index.TOKENIZED));
     // the same field name can be added more than once (multi-valued field)
     doc.add(new Field("author", "Erik Hatcher", Field.Store.YES, Field.Index.TOKENIZED));
     doc.add(new Field("author", "Otis Gospodnetic", Field.Store.YES, Field.Index.TOKENIZED));
     writer.addDocument(doc);
     writer.close();

  6. Field Options • Indexed • Necessary for searching or sorting • Tokenized • Text analysis done before indexing • Stored • Compressed • Binary • Currently for stored-only fields

  7. Searching an Index
     IndexSearcher searcher = new IndexSearcher(directory);
     QueryParser parser = new QueryParser("defaultField", analyzer);
     Query query = parser.parse("title:Lucene");
     Hits hits = searcher.search(query);
     System.out.println("matches:" + hits.length());
     if (hits.length() > 0) { // guard against an empty result set
         Document doc = hits.doc(0);
         System.out.println("first:" + doc.get("title"));
     }
     searcher.close();

  8. Scoring • VSM – Vector Space Model • tf – term frequency: number of times the term occurs in the field • lengthNorm – normalization by the number of tokens in the field • idf – inverse document frequency: based on the number of documents containing the term • coord – coordination factor: based on the number of matching query terms • document boost • query clause boost • See http://lucene.apache.org/java/docs/scoring.html
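To make the scoring factors concrete, here is a minimal sketch of how they might combine for a single term. The formulas (square-root tf, logarithmic idf, inverse-square-root length norm) are modeled on Lucene's classic default similarity as an assumption; the class `ScoringSketch` and its method names are hypothetical, query normalization is left out, and the scoring page linked on the slide is the authoritative description.

```java
// Hypothetical illustration class; formulas are assumptions modeled on
// Lucene's classic default similarity, not a copy of its implementation.
class ScoringSketch {

    // tf: grows with the number of occurrences, but sub-linearly.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf: terms found in fewer documents weigh more.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // lengthNorm: a match in a short field counts more than in a long one.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // coord: fraction of the query's terms that the document matched.
    static double coord(int matchingTerms, int queryTerms) {
        return (double) matchingTerms / queryTerms;
    }

    // Contribution of one query term to one document's score (before coord).
    static double termScore(int freq, int docFreq, int numDocs, int numTokens, double boost) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        // Same term frequency and field length, different rarity of the term.
        System.out.println("rare term:   " + termScore(1, 1, 100, 4, 1.0));
        System.out.println("common term: " + termScore(1, 50, 100, 4, 1.0));
    }
}
```

A document's total score would then be the sum of its per-term contributions multiplied by `coord`, which is why documents matching more of the query terms tend to rank higher.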

  9. Query Construction • Lucene QueryParser • Example: queryParser.parse("title:spiderman"); • good for IPC, human entered queries, debug • does text analysis and constructs appropriate queries • not all query types supported • Programmatic query construction • Example: new TermQuery(new Term("title", "spiderman")) • explicit, no escaping necessary

  10. Query Examples • mission impossible • EQUIV: mission OR impossible • QueryParser default is "optional" • +mission +impossible -actor:cruise • EQUIV: mission AND impossible NOT actor:cruise • "mission impossible" -actor:cruise • title:spiderman^10 description:spiderman • description:"spiderman movie"~10

  11. Query Examples2 • releaseDate:[2000 TO 2007] • Range search: lexicographic ordering, so beware of numbers • Wildcard searches: te?t, te*t, test* • spider~ • Fuzzy search: Levenshtein distance • Optional minimum similarity: spider~0.7 • *:* • (a AND b) OR (c AND d)
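Two points from this slide can be illustrated in plain Java: why lexicographic range ordering misbehaves on unpadded numbers, and what Levenshtein distance measures for fuzzy search. The class `QueryPitfallsSketch` and its helpers are hypothetical names for this sketch. Zero-padding is shown as one common workaround for non-negative integers, and the slide's minimum-similarity threshold is not modeled here.

```java
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene.
class QueryPitfallsSketch {

    // Inclusive range test using plain string (lexicographic) comparison,
    // which is how terms in a text index are ordered.
    static boolean inLexicographicRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Zero-pads a non-negative integer so that string order equals numeric order.
    static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "200" sorts between "1" and "3" as a string, although 200 is not in [1, 3].
        System.out.println(inLexicographicRange("200", "1", "3"));
        // With padding, "0200" is correctly outside ["0001", "0003"].
        System.out.println(inLexicographicRange(pad(200, 4), pad(1, 4), pad(3, 4)));
        System.out.println(levenshtein("spider", "spyder"));
    }
}
```

Four-digit years such as those in releaseDate:[2000 TO 2007] happen to sort correctly because they all have the same width; the problem appears as soon as values differ in length.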

  12. Deleting Documents • IndexReader.deleteDocument(int id) • exclusive with IndexWriter • powerful • Deleting with IndexWriter • deleteDocuments(Term t) • updateDocument(Term t, Document d) • Deleting does not immediately reclaim space

  13. Performance • Decrease index segments • Lower merge factor • Optimize • Use cached filters • Instead of the query '+title:spiderman +released:true', run 'title:spiderman' filtered by 'released:true'
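The benefit of a cached filter is that the set of documents matching a common restriction (such as released:true) is computed once and then intersected with the results of many different queries. The following plain-Java sketch shows that idea with bit sets. The class `CachedFilterSketch`, its methods, and the document ids are all hypothetical, and this is not how Lucene's own filter classes are implemented.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration class; not part of Lucene.
class CachedFilterSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    private int computations = 0;

    // Returns the cached bit set for a filter, computing it only on first use.
    BitSet filterBits(String filterKey, Supplier<BitSet> compute) {
        BitSet cached = cache.get(filterKey);
        if (cached == null) {
            cached = compute.get();
            computations++;
            cache.put(filterKey, cached);
        }
        return cached;
    }

    int computations() {
        return computations;
    }

    // Keeps only the query matches that are also allowed by the filter.
    static BitSet applyFilter(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone(); // leave the inputs untouched
        result.and(filter);
        return result;
    }

    // Convenience helper: builds a bit set from document ids.
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        CachedFilterSketch filters = new CachedFilterSketch();
        Supplier<BitSet> released = () -> docs(0, 1, 2, 4); // made-up ids for released:true
        BitSet first = applyFilter(docs(1, 3, 4), filters.filterBits("released:true", released));
        BitSet second = applyFilter(docs(2, 4, 5), filters.filterBits("released:true", released));
        System.out.println(first + " " + second + " computations=" + filters.computations());
    }
}
```

Because the filter does not take part in scoring, the second query pays only for a bitwise AND rather than for evaluating the restriction again.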

  14. Index Structure • IndexWriter params • MaxBufferedDocs • MergeFactor • MaxMergeDocs • MaxFieldLength • Example index directory listing: segments_3, _0.fnm, _0.fdt, _0.fdx, _0.frq, _0.tis, _0.tii, _0.prx, _0.nrm, _0_1.del

  15. Search Relevancy
     Document analysis: "PowerShot SD 500" -> WhitespaceTokenizer -> PowerShot | SD | 500 -> WordDelimiterFilter (catenateWords=1) -> Power | Shot | PowerShot | SD | 500 -> LowerCaseFilter -> power | shot | powershot | sd | 500
     Query analysis: "power-shot sd500" -> WhitespaceTokenizer -> power-shot | sd500 -> WordDelimiterFilter (catenateWords=0) -> power | shot | sd | 500 -> LowerCaseFilter -> power | shot | sd | 500
     Result: the query tokens all occur in the document tokens. A Match!
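The two analysis chains on this slide can be approximated in plain Java to show why both inputs end up with overlapping tokens. This is a simplified sketch, not the real WordDelimiterFilter: the class `WordDelimiterSketch` is hypothetical, it splits only on non-alphanumeric characters, lower-to-upper case changes, and letter/digit boundaries, and it ignores token positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; a simplified stand-in for the analysis chain.
class WordDelimiterSketch {

    // Splits one token into sub-words; optionally appends each run of
    // adjacent letter-only parts joined together (catenateWords).
    static List<String> split(String token, boolean catenateWords) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) { // delimiter such as '-'
                flush(parts, current);
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isLetter(prev) != Character.isLetter(c);
                if (caseChange || typeChange) {
                    flush(parts, current);
                }
            }
            current.append(c);
        }
        flush(parts, current);

        List<String> out = new ArrayList<>(parts);
        if (catenateWords) {
            StringBuilder run = new StringBuilder();
            int runLength = 0;
            for (String part : parts) {
                if (Character.isLetter(part.charAt(0))) {
                    run.append(part);
                    runLength++;
                } else { // a numeric part ends the current run of words
                    if (runLength > 1) out.add(run.toString());
                    run.setLength(0);
                    runLength = 0;
                }
            }
            if (runLength > 1) out.add(run.toString());
        }
        return out;
    }

    private static void flush(List<String> parts, StringBuilder current) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Whitespace tokenization, word-delimiter splitting, then lowercasing.
    static List<String> analyze(String text, boolean catenateWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            for (String part : split(raw, catenateWords)) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> docTokens = analyze("PowerShot SD 500", true);
        List<String> queryTokens = analyze("power-shot sd500", false);
        System.out.println(docTokens + " / " + queryTokens);
        System.out.println("match: " + docTokens.containsAll(queryTokens));
    }
}
```

Catenating on the document side also lets a query for the single word "powershot" match, which the split tokens alone would not cover.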

  16. Tokenizers • Tokenizers break field text into tokens • StandardTokenizer • source string: “full-text lucene.apache.org” • “full” “text” “lucene.apache.org” • WhitespaceTokenizer • “full-text” “lucene.apache.org” • LetterTokenizer • “full” “text” “lucene” “apache” “org”
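The difference between the whitespace and letter strategies can be reproduced in a few lines of plain Java. The sketch below mimics the behavior described on the slide and is not Lucene's implementation; the class `TokenizerSketch` and its method names are hypothetical. StandardTokenizer is grammar-based and is not sketched here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; mimics the slide's examples, not Lucene's code.
class TokenizerSketch {

    // Splits on runs of whitespace only; punctuation stays inside tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Emits maximal runs of letters; every non-letter character is a separator.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String source = "full-text lucene.apache.org";
        System.out.println(whitespaceTokens(source)); // [full-text, lucene.apache.org]
        System.out.println(letterTokens(source));     // [full, text, lucene, apache, org]
    }
}
```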

  17. TokenFilters • LowerCaseFilter • StopFilter • LengthFilter • ISOLatin1AccentFilter • SnowballPorterFilter • stemming: reducing words to root form • rides, ride, riding => ride • country, countries => countri • contrib/analyzers for other languages
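Token filters are applied as a chain, each one transforming the token stream produced by the previous stage. The sketch below models three of the listed filters (lowercasing, stop-word removal, length filtering) as plain list transformations. The class `TokenFilterSketch`, its method names, and the tiny stop-word set are assumptions for illustration; real Lucene filters wrap a TokenStream instead, and stemming is left out because it needs a full algorithm.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration class; real Lucene filters wrap a TokenStream.
class TokenFilterSketch {

    // Analogue of a lowercase filter: normalizes case so "Robin" matches "robin".
    static List<String> lowerCase(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Analogue of a stop filter: drops very common words.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        return tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());
    }

    // Analogue of a length filter: keeps tokens whose length is within [min, max].
    static List<String> keepLengthBetween(List<String> tokens, int min, int max) {
        return tokens.stream()
                .filter(t -> t.length() >= min && t.length() <= max)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "of"); // illustrative, not Lucene's list
        List<String> tokens = List.of("The", "Rides", "of", "Robin", "Hood");
        // Order matters: lowercase first, so the stop set only needs lowercase entries.
        List<String> result =
                keepLengthBetween(removeStopWords(lowerCase(tokens), stopWords), 3, 20);
        System.out.println(result); // [rides, robin, hood]
    }
}
```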

  18. Analyzers
     class MyAnalyzer extends Analyzer {
         private Set myStopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

         public TokenStream tokenStream(String fieldname, Reader reader) {
             TokenStream ts = new StandardTokenizer(reader);
             ts = new StandardFilter(ts);
             ts = new LowerCaseFilter(ts);
             ts = new StopFilter(ts, myStopSet);
             return ts;
         }
     }

  19. Analysis Tips • Use PerFieldAnalyzerWrapper • Add same field more than once, analyze differently • Boost exact case matches • Boost exact tense matches • Query with or without synonyms • Soundex for sounds-like queries

  20. Nutch • Open source web search application • Crawlers • Link-graph database • Document parsers (HTML, word, pdf, etc) • Language + charset detection • Utilizes Hadoop (DFS + MapReduce) for massive scalability

  21. Solr • XML/HTTP, JSON APIs • Faceted search / navigation • Flexible Data Schema • Hit Highlighting • Configurable Caching • Replication • Web admin interface • Solr Flare: Ruby on Rails user interface

  22. Questions?
