The Lucene Search Engine
Kira Radinsky
Modified by Amit Gross for Lucene 4
Based on material from Thomas Paul and Steven J. Owens
What is Lucene?
• Named after the middle name of Doug Cutting’s wife (also her maternal grandmother’s first name)
• An open-source set of Java classes
  • Search engine / document classifier / indexer
• Developed by Doug Cutting (1996)
  • Xerox / Apple / Excite / Nutch / Yahoo / Cloudera
  • Hadoop founder, member of the board of directors of the Apache Software Foundation
• An Apache project (originally part of Apache Jakarta) with strong open-source community support
• High-performance, full-featured text search engine library
• Easy-to-use yet powerful API
Use the Source, Luke
• Document
  • A collection of Fields; the unit of indexing and search.
• Field
  • Represents a section of a Document: a name for the section plus the actual data.
• Analyzer
  • Abstract class (provides the interface).
  • Turns Document text into tokens for later indexing.
  • StandardAnalyzer is a commonly used implementation.
• IndexWriter
  • Creates and maintains indexes.
• IndexSearcher
  • Searches through an index.
• QueryParser
  • Parses a query string into a Query.
• Query
  • Abstract class that holds the search criteria, typically created by the QueryParser.
• TopDocs
  • Holds the top K hits found in a search by an IndexSearcher: the matching document IDs and their scores.
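How these roles fit together can be sketched with a toy inverted index in plain Java. This is only an illustration of the idea (analyze text into tokens, record which documents contain each token, look a term up), not Lucene code; the class ToyIndex and its methods are hypothetical names, and real Lucene adds scoring, multiple fields, on-disk storage and much more.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy illustration only: NOT Lucene code. All names here are hypothetical.
class ToyIndex {
    // term -> sorted set of IDs of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // "Analyzer" role: lowercase the text and split it on non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "IndexWriter" role: assign a document ID and record every token of the text.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "IndexSearcher" role: return the IDs of documents containing a single term.
    List<Integer> search(String term) {
        List<String> tokens = analyze(term);
        if (tokens.isEmpty()) {
            return new ArrayList<>();
        }
        TreeSet<Integer> hits = postings.get(tokens.get(0));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }
}
```

Because the same analyze step is applied at indexing time and at query time, a search for "RELATIVITY" finds documents that contain "relativity". The same reasoning is why an index and its queries are normally given the same Analyzer.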
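Document from an article
    // Uses org.apache.lucene.document.* (Document, TextField, StringField,
    // StoredField, DateTools, Field.Store) and java.util.Date
    private Document createDocument(String article, String author, String title,
                                    String topic, String url, Date dateWritten) {
        Document document = new Document();
        // Tokenized, searchable, and stored so they can be shown in results
        document.add(new TextField("author", author, Store.YES));
        document.add(new TextField("title", title, Store.YES));
        document.add(new TextField("topic", topic, Store.YES));
        // Searchable, but the full text is not stored in the index
        document.add(new TextField("article", article, Store.NO));
        // Stored only; not searchable
        document.add(new StoredField("URL", url));
        // StringField takes a String, so the Date is encoded first
        document.add(new StringField("Date",
                DateTools.dateToString(dateWritten, DateTools.Resolution.DAY), Store.NO));
        return document;
    }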
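A StringField value is a plain string, so a Date has to be encoded as text before it is indexed. Lucene ships a DateTools class for this; the underlying idea is a fixed-width format whose alphabetical order matches chronological order. A minimal sketch of that idea in plain Java, assuming day resolution and GMT (SortableDates and toDayString are hypothetical names, not Lucene API):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustration only: a hypothetical helper, not part of Lucene.
class SortableDates {
    // Encodes a Date as yyyyMMdd in GMT, so that comparing the strings
    // alphabetically gives the same order as comparing the dates.
    static String toDayString(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(date);
    }
}
```

With such an encoding, a range over dates becomes a range over strings, which is what makes date range searches on a string field workable.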
Store a Document in the index
    // Directory that holds the index files on disk
    Directory dir = FSDirectory.open(new File("lucene-index"));

    private void indexDocument(Document document) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        IndexWriter writer = new IndexWriter(dir, iwc);
        writer.addDocument(document);
        // close() commits the change and releases the write lock
        writer.close();
    }
Adding to an Index
    public void indexArticle(String article, String author, String title,
                             String topic, String url, Date dateWritten) throws Exception {
        Document document = createDocument(article, author, title, topic, url, dateWritten);
        indexDocument(document);
    }
Searching
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // "article" is the default field for query terms without a field prefix
    QueryParser qp = new QueryParser(Version.LUCENE_45, "article", analyzer);
    Query q = qp.parse(searchString);
    TopDocs top = searcher.search(q, numResults);
Extracting Document objects
    for (ScoreDoc sd : top.scoreDocs) {
        Document doc = searcher.doc(sd.doc);
        // display the articles that were found to the user
    }
Search Criteria
The query syntax supports Boolean operators (AND, OR, NOT) as well as fuzzy, proximity, wildcard, and range searches:
• author:Henry relativity AND "quantum physics"
• "string theory" NOT Einstein
• "Galileo Kepler"~5
• author:Johnson date:[01/01/2004 TO 01/31/2004]
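Fuzzy searches (a term followed by ~, for example relativty~) match terms that are within a small edit distance of the query term. The notion of edit distance can be sketched with the classic Levenshtein dynamic program below. It is an illustration of the measure only: Lucene's own fuzzy matching is implemented differently and far more efficiently, and the class name EditDistance is hypothetical.

```java
// Illustration only: a textbook Levenshtein distance, not Lucene's implementation.
class EditDistance {
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j characters of b takes j insertions
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i characters of a into the empty string takes i deletions
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Under this measure a swapped pair of letters, as in "einstien" versus "einstein", counts as two edits (two substitutions).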
Thread Safety
• Indexing and searching are not only thread safe but also process safe. This means that:
  • Multiple index searchers can read the Lucene index files at the same time.
  • An index writer can modify the Lucene index files while searches are ongoing.
  • Only one index writer can hold an index's write lock at a time. A second writer that tries to open the same index fails until the first one is closed, so it is important to close the writer so that it releases the file lock.
• The query parser is not thread safe, so each thread should use its own instance.
• The index writer, however, is thread safe and can be shared between threads.
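The file lock mentioned above can be illustrated with the JDK's own FileLock. This is a sketch of the general idea (whoever holds the lock file may write, everyone else has to wait for it to be released), not Lucene's lock implementation; WriteLockDemo, tryAcquire, demo and the temporary file name are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration only: shows exclusive file locking with the JDK, not Lucene code.
class WriteLockDemo {
    // Tries to take an exclusive lock; returns null if the lock is already held.
    static FileLock tryAcquire(FileChannel channel) throws IOException {
        try {
            return channel.tryLock(); // null if another process holds the lock
        } catch (OverlappingFileLockException e) {
            return null; // this JVM already holds a lock on the file
        }
    }

    // Returns { first acquired, second refused while held, second acquired after release }.
    static boolean[] demo() {
        try {
            Path lockFile = Files.createTempFile("write", ".lock");
            try (FileChannel first = FileChannel.open(lockFile, StandardOpenOption.WRITE);
                 FileChannel second = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
                FileLock held = tryAcquire(first);
                boolean firstAcquired = held != null;
                boolean secondRefused = tryAcquire(second) == null;
                if (held != null) {
                    held.release(); // comparable to closing the writer
                }
                FileLock retry = tryAcquire(second);
                boolean secondAcquired = retry != null;
                if (retry != null) {
                    retry.release();
                }
                return new boolean[] { firstAcquired, secondRefused, secondAcquired };
            } finally {
                Files.deleteIfExists(lockFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The second attempt only succeeds once the first lock has been released, which is why a writer that is never closed keeps other writers out.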
Luke
• Luke is a handy development tool for inspecting an existing Lucene index.
• http://code.google.com/p/luke/