The Lucene Search Engine

Kira Radinsky. Modified by Amit Gross for Lucene 4. Based on material from Thomas Paul and Steven J. Owens.

What is Lucene?
• Doug Cutting's wife's middle name (and her maternal grandmother's first name)
• An open source set of Java classes
• Search engine / document classifier / indexer
• Developed by Doug Cutting (1996)
  • Xerox / Apple / Excite / Nutch / Yahoo / Cloudera
  • Hadoop founder, member of the board of directors of the Apache Software Foundation
• Apache project (originally part of Apache Jakarta). Strong open source community support.
• High-performance, full-featured text search engine library
• Easy to use yet powerful API

Use the Source, Luke
• Document
  • The unit of indexing and search: a collection of Fields.
• Field
  • Represents a section of a Document: a name for the section plus the actual data.
• Analyzer
  • Abstract class (provides the interface).
  • Turns Document text into tokens (for later indexing).
  • StandardAnalyzer is the usual concrete class.
• IndexWriter
  • Creates and maintains indexes.
• IndexSearcher
  • Searches through an index.
• QueryParser
  • Parses a search string into a Query that can be run against an index.
• Query
  • Abstract class that contains the search criteria created by the QueryParser.
• TopDocs
  • Contains the top K hits found in a search by an IndexSearcher: document IDs and their scores.
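To make the "Document -> tokens" step concrete, here is a minimal plain-Java sketch of what an analyzer does conceptually. It is not Lucene code: the class name ToyAnalyzer and its stop-word list are assumptions made for this example, and the real StandardAnalyzer uses Unicode-aware segmentation that is more careful than a regular-expression split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a toy stand-in for the idea of an analyzer.
// It does NOT reproduce the behavior of Lucene's StandardAnalyzer.
class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Turns raw text into a list of normalized terms.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            // Skip empty pieces (leading delimiters) and stop words.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [theory, relativity, explained]
        System.out.println(tokenize("The Theory of Relativity, explained"));
    }
}
```

The same normalization has to be applied at index time and at query time, which is why the indexing and searching slides both construct the same analyzer.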

Indexing a Document

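Document from an article

    // Field classes live in org.apache.lucene.document; Store is Field.Store.
    private Document createDocument(String article, String author, String title,
                                    String topic, String url, Date dateWritten) {
        Document document = new Document();
        // TextField: tokenized and indexed; Store.YES also keeps the original value.
        document.add(new TextField("author", author, Store.YES));
        document.add(new TextField("title", title, Store.YES));
        document.add(new TextField("topic", topic, Store.YES));
        // The article body is searchable but not stored in the index.
        document.add(new TextField("article", article, Store.NO));
        // StoredField: kept for retrieval only, not indexed.
        document.add(new StoredField("URL", url));
        // StringField: indexed as a single token; it takes a String, so convert the Date.
        document.add(new StringField("Date",
                DateTools.dateToString(dateWritten, DateTools.Resolution.DAY), Store.NO));
        return document;
    }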

The Field Object

Store a Document in the index

    // Open (or create) the index directory on disk.
    Directory dir = FSDirectory.open(new File("lucene-index"));

    private void indexDocument(Document document) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        IndexWriter writer = new IndexWriter(dir, iwc);
        try {
            writer.addDocument(document);
        } finally {
            writer.close();  // commits the changes and releases the write lock
        }
    }

Analyzers and Tokenizers

Adding to an Index

    public void indexArticle(String article, String author, String title,
                             String topic, String url, Date dateWritten) throws Exception {
        Document document = createDocument(article, author, title, topic, url, dateWritten);
        indexDocument(document);
    }

Searching the Index

Searching

    // Use the same analyzer that was used at indexing time.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // "article" is the default field for query terms without a field prefix.
    QueryParser qp = new QueryParser(Version.LUCENE_45, "article", analyzer);
    Query q = qp.parse(searchString);
    TopDocs top = searcher.search(q, numResults);

Extracting Document objects

    for (ScoreDoc sd : top.scoreDocs) {
        // Load the stored fields of the hit; fields added with Store.NO are not available here.
        Document doc = searcher.doc(sd.doc);
        // display the articles that were found to the user
    }

Search Criteria
Supports several kinds of search: Boolean (AND, OR, NOT), fuzzy, proximity, wildcard, and range searches.
• author:Henry relativity AND "quantum physics"
• "string theory" NOT Einstein
• "Galileo Kepler"~5
• author:Johnson date:[01/01/2004 TO 01/31/2004]
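As a rough illustration of what the Boolean operators mean, the sketch below builds a toy inverted index in plain Java and combines per-term document sets with AND, OR, and NOT. The class ToyInvertedIndex and its methods are hypothetical names made up for this example. Lucene itself works with scored iterators over postings lists rather than in-memory sets, so treat this only as a picture of the set semantics.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's implementation.
class ToyInvertedIndex {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns a copy so callers can modify the result freely.
    TreeSet<Integer> docsWith(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : new TreeSet<>(hits);
    }

    static TreeSet<Integer> and(Set<Integer> a, Set<Integer> b) {
        TreeSet<Integer> result = new TreeSet<>(a);
        result.retainAll(b);   // intersection
        return result;
    }

    static TreeSet<Integer> or(Set<Integer> a, Set<Integer> b) {
        TreeSet<Integer> result = new TreeSet<>(a);
        result.addAll(b);      // union
        return result;
    }

    static TreeSet<Integer> not(Set<Integer> a, Set<Integer> b) {
        TreeSet<Integer> result = new TreeSet<>(a);
        result.removeAll(b);   // difference: in a but not in b
        return result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "string theory and Einstein");
        index.add(2, "string theory without relativity");
        index.add(3, "Einstein on relativity");

        TreeSet<Integer> stringTheory = and(index.docsWith("string"), index.docsWith("theory"));
        System.out.println(stringTheory);                                 // [1, 2]
        System.out.println(not(stringTheory, index.docsWith("Einstein"))); // [2]
        System.out.println(or(index.docsWith("Einstein"),
                              index.docsWith("relativity")));             // [1, 2, 3]
    }
}
```

Note that a quoted query such as "string theory" is a phrase search in Lucene (the terms must be adjacent), which is stricter than the term-level AND shown here.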

Thread Safety
• Indexing and searching are not only thread safe, but process safe. What this means is that:
  • Multiple index searchers can read the Lucene index files at the same time.
  • An index writer can modify the Lucene index files while searches are ongoing; a searcher sees the changes only after its reader is reopened.
  • Only one index writer can hold the write lock on an index at a time. A second writer that tries to open the same index fails until the first one is closed, so it is important to close the writer so that it releases the file lock.
• The query parser is not thread safe.
• The index writer, however, is thread safe.
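One common way to live with a parser that is not thread safe is to give every thread its own instance. The sketch below shows that pattern in plain Java with ThreadLocal. ToyParser is a hypothetical stand-in with mutable internal state, not Lucene's QueryParser; with Lucene, creating a new QueryParser per query is another simple option.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: per-thread instances of a non-thread-safe helper.
class PerThreadParser {
    // Hypothetical stand-in for a parser that keeps mutable state while parsing,
    // which is what makes sharing one instance across threads unsafe.
    static final class ToyParser {
        private final StringBuilder buffer = new StringBuilder();

        String parse(String query) {
            buffer.setLength(0);
            for (String word : query.trim().split("\\s+")) {
                if (buffer.length() > 0) {
                    buffer.append(" AND ");
                }
                buffer.append(word);
            }
            return buffer.toString();
        }
    }

    // Each thread lazily gets its own ToyParser, so no state is shared.
    private static final ThreadLocal<ToyParser> PARSER =
            ThreadLocal.withInitial(ToyParser::new);

    static String parse(String query) {
        return PARSER.get().parse(query);
    }

    // Parses all queries on a small thread pool and returns results in input order.
    static List<String> parseConcurrently(List<String> queries, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String q : queries) {
                Callable<String> task = () -> parse(q);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints [quantum AND physics, string AND theory]
        System.out.println(parseConcurrently(
                Arrays.asList("quantum physics", "string theory"), 2));
    }
}
```

The thread-safe objects from the list above (the index writer and the searchers) are the ones worth sharing between threads; only the parser needs this per-thread treatment.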

Luke
• Luke is a handy development tool that lets you inspect an existing Lucene index.
• http://code.google.com/p/luke/
