Lucene-Demo Brian Nisonger
Intro • No details about implementation/theory • See the Treehouse Wiki (Lucene) for additional info • A set of Java classes • Not an end-to-end solution • Designed to allow rapid development of IR tools
Index • The first step is to take a set of text documents and build an index • Demo: IndexFiles on Pongo • Two major classes • Analyzer • Used to tokenize the data • More on this later • IndexWriter • IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
IndexWriter • IndexWriter creates an index over documents • First argument: the directory where the index is built or found • Second argument: the Analyzer used to tokenize documents • Third argument: whether a new index should be created (true) or an existing one opened
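What an index writer builds can be sketched without Lucene at all. The following is a minimal toy illustration (the class and method names here are invented for this sketch, not Lucene API): each document's text is tokenized, and every term is mapped to the IDs of the documents that contain it.

```java
import java.util.*;

/** Toy sketch of what an index writer builds: a term -> doc-ID map (inverted index). */
public class ToyIndexWriter {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Tokenize on whitespace, lowercase, and record each term under this doc's ID. */
    public int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Doc IDs whose text contained the term (empty set if none). */
    public SortedSet<Integer> postings(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument("Lucene is a set of Java classes");   // doc 0
        writer.addDocument("Java classes for rapid IR tools");   // doc 1
        System.out.println(writer.postings("java"));    // [0, 1]
        System.out.println(writer.postings("lucene"));  // [0]
    }
}
```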
Analyzer • StandardAnalyzer • Stop-word removal and lowercasing (stemming, e.g. Porter, requires a separate filter) • Krovetz Stemmer example:

    package org.apache.lucene.analysis;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.KStemFilter;
    import java.io.Reader;

    public class KStemAnalyzer extends Analyzer
    {
        public final TokenStream tokenStream(String fieldName, Reader reader)
        {
            return new KStemFilter(new LowerCaseTokenizer(reader));
        }
    }
Analyzer-II • Snowball Stemmer • A stemming language created by Porter for building stemmers • Multilingual analyzers/stemmers • Porter2 • Fully integrated with Lucene 1.9.1 • MyAnalyzer (home built) • Demo
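The filter-chain idea behind all of these analyzers can be sketched in plain Java (a toy illustration; the class name and stop list below are invented, and a real Lucene analyzer returns a TokenStream rather than a list): split the text into lowercase tokens, then drop stop words, just as KStemAnalyzer above wraps one filter around another.

```java
import java.util.*;

/** Toy analyzer chain: lowercase tokenizer followed by a stop-word filter. */
public class ToyAnalyzer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "to"));

    /** Mimics a token-stream chain: tokenize, lowercase, drop stop words. */
    public static List<String> tokenStream(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenStream("The rapid development of IR tools"));
        // [rapid, development, ir, tools]
    }
}
```

A stemming analyzer would simply add one more stage to this chain, rewriting each surviving token to its stem.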
Adding Documents • The next step after creating an index is to add documents • writer.addDocument(FileDocument.Document(file)); • Remember, we already determined how the document will be tokenized • Fields • Can split a document into parts such as title, body, date created, paragraphs
Adding Documents-II • Assigns token/doc IDs • For why this is important, see the Lucene TreeHouse Wiki • Create some type of loop to add all the documents • This is the actual creation of the index; up to this point we merely set the index parameters
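The add-documents loop can be sketched as follows (a toy illustration, not Lucene API: the class name, the field names, and the use of a plain map as a "document" are all invented here). Each document is a set of named fields, and its position in the list plays the role of the assigned doc ID.

```java
import java.util.*;

/** Toy sketch of the add-documents loop: each document is a set of named fields. */
public class ToyAddDocuments {
    /** A "document" is just a map from field name (title, body, ...) to text. */
    static Map<String, String> document(String title, String body) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", title);
        doc.put("body", body);
        return doc;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        String[][] files = {
            {"intro.txt", "Lucene is a set of Java classes"},
            {"index.txt", "The first step is to build an index"},
        };
        // The loop over all documents -- this is where the index actually gets built.
        // A document's position in the list plays the role of its doc ID.
        for (String[] file : files) {
            index.add(document(file[0], file[1]));
        }
        System.out.println(index.size());              // 2  (one entry per document)
        System.out.println(index.get(0).get("title")); // intro.txt
    }
}
```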
Finalizing Index Creation • After that, the index is optimized with writer.optimize(); • Merges segments, etc. • The index is closed with writer.close();
Searching an Index • Open the index • IndexReader reader = IndexReader.open(index); • Create a searcher • Searcher searcher = new IndexSearcher(reader); • Assign an analyzer • Use the same Analyzer used to create the index (Why?)
Searching an Index-II • Parse/Create query • Query query = QueryParser.parse(line, field, analyzer); • Takes a line, looks for a particular field, and runs it through an analyzer to create query • Determine which documents are matches • Hits hits = searcher.search(query);
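The parse-then-match step can be sketched against the toy term-to-doc-ID map from earlier (again an invented illustration, not Lucene API; real query parsing handles fields, operators, and phrases, while this sketch assumes simple AND semantics over whitespace-separated terms):

```java
import java.util.*;

/** Toy search: every query term must appear in a matching document (AND semantics). */
public class ToySearcher {
    /** "Parse" a query line into lowercase terms, then intersect their posting sets. */
    public static SortedSet<Integer> search(Map<String, SortedSet<Integer>> postings,
                                            String line) {
        SortedSet<Integer> hits = null;
        for (String term : line.toLowerCase().split("\\s+")) {
            SortedSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (hits == null) hits = new TreeSet<>(docs);
            else hits.retainAll(docs);   // keep only docs matching every term so far
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> postings = new HashMap<>();
        postings.put("java",    new TreeSet<>(Arrays.asList(0, 1)));
        postings.put("classes", new TreeSet<>(Arrays.asList(0, 1, 2)));
        postings.put("lucene",  new TreeSet<>(Arrays.asList(0)));
        System.out.println(search(postings, "Java classes")); // [0, 1]
        System.out.println(search(postings, "Lucene Java"));  // [0]
    }
}
```

Note that the query terms pass through the same lowercasing as the indexed text; this is why the same Analyzer must be used at search time as at index time.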
Retrieving Documents • Hits creates a collection of documents • Using a loop we can reference each doc • Document doc = hits.doc(i); • This allows us to get info about the document • Name of document, date it was created, words in the document • Relevancy score (TF-IDF) • Demo
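The relevancy score mentioned above rests on TF-IDF. A minimal worked sketch of the core idea (Lucene's actual scoring formula additionally applies length normalization, boosts, and dampened term frequency, so this is a deliberate simplification):

```java
/** Minimal TF-IDF sketch: tf * log(N / df). Lucene's real similarity formula
 *  adds normalization and boosts; this shows only the core intuition. */
public class ToyTfIdf {
    public static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a doc but in only 10 of 1000 docs overall
        // outscores a term appearing 3 times but present in 500 docs:
        System.out.println(tfIdf(3, 10, 1000) > tfIdf(3, 500, 1000)); // true
        // A term that appears in every document carries no weight:
        System.out.println(tfIdf(1, 1000, 1000)); // 0.0
    }
}
```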
Finishing Searching • Return the list of documents • Close the reader
Other Functions • Spans (Example from http://lucene.apache.org/java/docs/api/index.html) • Useful for Phrasal matching • Allows for Passage Retrieval
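Why phrasal matching needs more than a plain term-to-doc map can be sketched as follows (a toy illustration with invented names, not the Lucene Spans API): the index must also record each term's positions, and a phrase matches only when its words occur at consecutive positions.

```java
import java.util.*;

/** Toy phrase match: record each term's positions, then check adjacency. */
public class ToyPhraseMatch {
    /** term -> list of positions at which that term occurs in the document. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> pos = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            pos.computeIfAbsent(tokens[i], t -> new ArrayList<>()).add(i);
        }
        return pos;
    }

    /** True if the phrase's words occur at consecutive positions in the document. */
    static boolean phraseMatch(Map<String, List<Integer>> pos, String phrase) {
        String[] words = phrase.toLowerCase().split("\\s+");
        for (int start : pos.getOrDefault(words[0], Collections.emptyList())) {
            boolean ok = true;
            for (int k = 1; k < words.length; k++) {
                if (!pos.getOrDefault(words[k], Collections.emptyList())
                        .contains(start + k)) {
                    ok = false;
                    break;
                }
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> pos = positions("rapid development of IR tools");
        System.out.println(phraseMatch(pos, "IR tools")); // true
        System.out.println(phraseMatch(pos, "tools IR")); // false
    }
}
```

Allowing a small gap between positions instead of strict adjacency is the same generalization that makes spans useful for passage retrieval.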
Questions? • Any Questions, comments, jokes, opinions??
I said “Good Day” • The END