Lucene
High Level Infrastructure
When you look at building a search solution, you often find that the process is split into two main tasks: building an index, and searching that index. This is definitely the case with Lucene (the only exception is when your search goes directly to the database). We wanted to keep the search interface fairly simple, so the code that interacts with the system sees two main interfaces: IndexBuilder and IndexSearch.
Index Builder
Any process that needs to build an index goes through the IndexBuilder. This is a simple interface that gives you two entry points to the indexing process:
* By passing individual configuration settings to the class, e.g. the path to the index, whether to do an incremental build, and how often to optimize the Lucene index as you add records.
* By passing in an index plan name. IndexBuilder then looks up the settings it needs from the configuration system, which lets you tweak your variables in an external file rather than in code.
Index Sources
The Index Builder abstracts the details of Lucene and of the Index Sources that are used to create the index itself. The search interface is also kept very simple. A search is done via:
    IndexSearch.search(String inputQuery, int resultsStart, int resultsCount);
For example, to look for the terms EJB and WebLogic and return up to the first 10 results:
    IndexSearch.search("EJB and WebLogic", 0, 10);
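The paging behaviour implied by resultsStart and resultsCount can be illustrated with a minimal sketch. Only the parameter names come from the signature above; the class name SearchPagingSketch, the List<String> result type, and the already-ranked hit list are assumptions made for illustration, and no Lucene calls are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the paging contract of IndexSearch.search; not the real class.
final class SearchPagingSketch {
    // Returns up to resultsCount hits, starting at offset resultsStart,
    // from a list that is assumed to be ranked already.
    static List<String> page(List<String> rankedHits, int resultsStart, int resultsCount) {
        if (resultsStart < 0 || resultsCount < 0) {
            throw new IllegalArgumentException("start and count must be non-negative");
        }
        int from = Math.min(resultsStart, rankedHits.size());
        int to = (int) Math.min((long) from + resultsCount, rankedHits.size());
        return new ArrayList<>(rankedHits.subList(from, to));
    }
}
```

Under these assumptions a call with (0, 10) returns the first page and (10, 10) the second; asking for a page past the end yields an empty list rather than an error.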
How to tweak the ranking of records
There is one piece of logic that goes above and beyond munging the data into a Lucene-friendly form. It is in this class that we calculate any boosts that we want to place on fields, or on the document itself. It turns out that we end up with the following boosters: The date boost has been really important for us. We have data that goes back a long time, and we seemed to be returning “old reports” too often. The date-based booster trick has gotten around this, allowing the newest content to bubble up.
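The text does not say how the date boost is computed, so the following is only one plausible sketch. It assumes an exponential decay in which the boost halves every halfLifeDays and never falls below minBoost; the class name, method name, and both parameters are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical date booster; the decay formula is an assumption, not the authors' method.
final class DateBoostSketch {
    // Newest content gets a boost of 1.0; the boost halves every halfLifeDays
    // and is clamped so that it never drops below minBoost.
    static float dateBoost(LocalDate published, LocalDate today,
                           double halfLifeDays, float minBoost) {
        long ageDays = Math.max(0, ChronoUnit.DAYS.between(published, today));
        double decayed = Math.pow(0.5, ageDays / halfLifeDays);
        return (float) Math.max(minBoost, decayed);
    }
}
```

A value like this would be applied as the document or field boost when the record is indexed, so that two records matching a query equally well are ordered with the newer one first.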
Lucene Index Anatomy
You can successfully use Lucene without understanding this directory structure. Feel free to skip this section and treat the directory as a black box without regard to what is inside. When you are ready to dig deeper, you'll find that the files you created in the last section contain statistics and other data to facilitate rapid searching and ranking. An index contains a sequence of documents. In our indexing example, each document represents information about a text file.
Documents
Documents are the primary retrievable units from a Lucene query. Documents consist of a sequence of fields. Fields have a name ("contents" and "filename" in our example). Field values are a sequence of terms.
Terms
A term is the smallest piece of a particular field. Fields have three attributes of interest:
* Stored: the original text is available in the documents returned from a search.
* Indexed: makes this field searchable.
* Tokenized: the text added is run through an analyzer and broken into relevant pieces (only makes sense for indexed fields).
Analysis
Tokenized fields are where the real fun happens. In our example, we are indexing the contents of text files. The goal is to have the words in the text file be searchable, but for practical purposes it doesn't make sense to index every word. Some words like "a", "and", and "the" are generally considered irrelevant for searching and can be optimized out. These are called stop words.
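The effect of stop-word removal can be shown with a small standalone sketch. It does not use Lucene's analyzer classes; the class name StopWordSketch, the three-word stop list, and the lowercase-and-split tokenization are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: lowercases, splits on non-alphanumerics, drops stop words.
final class StopWordSketch {
    // Tiny illustrative stop list taken from the examples in the text.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "and", "the"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            // split() can yield an empty leading token; skip it along with stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

With this sketch, the text "The cat and a dog" is reduced to the two terms "cat" and "dog", which are the only ones that would reach the index.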
Some Inverted Index Strategies
batch-based: use file-sorting algorithms (textbook)
  + fastest to build
  + fastest to search
  - slow to update
b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
  + fast to search
  - update/build does not scale
  - complex implementation
segment based: lots of small indexes (Verity)
  + fast to build
  + fast to update
  - slower to search

What you have to do
Lucene handles the indexing, searching and retrieving, but it doesn't handle:
* managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
* selecting the data files
* parsing the data files
* getting the search string from the user
* displaying the search results to the user
two basic algorithms:
* make an index for a single document
* merge a set of indices
incremental algorithm:
* maintain a stack of segment indices
* create index for each incoming document
* push new indexes onto the stack
* let b=10 be the merge factor; M=∞

    for (size = 1; size < M; size *= b) {
      if (there are b indexes with size docs on top of the stack) {
        pop them off the stack;
        merge them into a single index;
        push the merged index onto the stack;
      } else {
        break;
      }
    }

optimization: single-doc indexes kept in RAM, saves system calls
notes: Indexing In Depth
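The incremental algorithm above can be simulated without touching Lucene by tracking only the document count of each segment. This is a sketch under that assumption; the class name SegmentStackSketch and its methods are hypothetical, and a "merge" here simply adds the counts together.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simulates the segment stack: each entry is the number of docs in one segment.
final class SegmentStackSketch {
    private final int mergeFactor;                            // b in the pseudocode
    private final Deque<Integer> stack = new ArrayDeque<>();  // top of stack first

    SegmentStackSketch(int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("merge factor must be at least 2");
        }
        this.mergeFactor = mergeFactor;
    }

    // Index one document, then merge while b equal-sized segments sit on top.
    void addDocument() {
        stack.push(1);
        // M is unbounded here, so the loop only ends through the break.
        for (long size = 1; ; size *= mergeFactor) {
            if (!topHasFullRun(size)) {
                break;
            }
            int merged = 0;
            for (int i = 0; i < mergeFactor; i++) {
                merged += stack.pop();
            }
            stack.push(merged);
        }
    }

    // True if the top b segments all hold exactly `size` documents.
    private boolean topHasFullRun(long size) {
        int seen = 0;
        for (int segment : stack) {   // iterates from the top of the stack down
            if (segment != size) {
                return false;
            }
            if (++seen == mergeFactor) {
                return true;
            }
        }
        return false;
    }

    // Segment sizes, top of stack first.
    List<Integer> segmentSizes() {
        return new ArrayList<>(stack);
    }
}
```

With b=10, adding 25 documents leaves five single-document segments on top of two 10-document segments, and the 100th document triggers a cascade that ends in one 100-document segment.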
Lucene's Disjunctive Search Algorithm
* described in http://lucene.sf.net/papers/riao97.ps
* since all postings must be processed, the goal is to minimize per-posting computation
* merges postings through a fixed-size array of accumulator buckets
* performs boolean logic with bit masks
* scales well with large queries
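The bucket-and-bit-mask idea can be sketched in a heavily simplified form. This is not Lucene's actual scorer: it assumes each clause's postings are a sorted array of document ids in [0, maxDoc), supports at most 32 clauses (one bit of an int each), uses a deliberately tiny bucket array, and records only which clauses matched instead of accumulating scores. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified disjunctive matcher: docs are processed in windows of BUCKETS ids.
final class BucketScorerSketch {
    static final int BUCKETS = 8;   // fixed-size accumulator array (tiny for illustration)

    // postings[i] holds the sorted doc ids for clause i; clause i owns bit (1 << i).
    // requiredMask: bits that must all be set. prohibitedMask: bits that must all be clear.
    static List<Integer> search(int[][] postings, int requiredMask,
                                int prohibitedMask, int maxDoc) {
        List<Integer> hits = new ArrayList<>();
        int[] bits = new int[BUCKETS];
        int[] cursor = new int[postings.length];   // next unread posting per clause
        for (int windowStart = 0; windowStart < maxDoc; windowStart += BUCKETS) {
            Arrays.fill(bits, 0);
            int windowEnd = windowStart + BUCKETS;
            // Per-posting work is one OR into the bucket for that doc.
            for (int clause = 0; clause < postings.length; clause++) {
                int[] docs = postings[clause];
                while (cursor[clause] < docs.length && docs[cursor[clause]] < windowEnd) {
                    bits[docs[cursor[clause]] - windowStart] |= 1 << clause;
                    cursor[clause]++;
                }
            }
            // Boolean logic is applied once per touched bucket, via the masks.
            for (int slot = 0; slot < BUCKETS; slot++) {
                int b = bits[slot];
                if (b != 0 && (b & requiredMask) == requiredMask && (b & prohibitedMask) == 0) {
                    hits.add(windowStart + slot);
                }
            }
        }
        return hits;
    }
}
```

The point of the design is that the inner loop does a constant amount of work per posting regardless of how many clauses the query has, and the required/prohibited checks cost two mask operations per bucket.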
Summary
Lucene:
* For fast retrieval of key-value pairs.
* For read-oriented data.