
Lucene Part 1

Learn how Lucene stores data in a 2-dimensional way using inverted indexes. Discover efficient indexing techniques to enhance search performance.

mauricef

Presentation Transcript


  1. Lucene Part 1

  2. Lucene Use Case Store data in a 2-dimensional way. How do we do this? Spreadsheet, relational database, X/Y.

  3. Lucene Couple of Problems Relational Database – 1971 – Dr. E.F. Codd. Excellent way to store 2 dimensions. Null. Inapplicable. Science of Data... The Study of Null. Tables: Person (Name, Person_Id, Address); HockeyTeam (Team, Id).

  4. Lucene Example Person: Stephen, 5, Philadelphia. HockeyTeam: Flyers, 5.

  5. Lucene Inverted Indexes Person: Stephen, 5, Philadelphia. HockeyTeam: Flyers, 5. We can easily answer the question "Where does Stephen live?", but what about "Give me all the people living in Philadelphia"? We can do this.

  6. Lucene Inverted Indexes Storage contains: Philadelphia -> Stephen, Chris, Tara
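
The two lookups on these slides can be sketched in plain Java with no Lucene dependency. This is only an illustration of the idea, not how Lucene stores anything: the class name `InvertedIndexSketch`, the method `invert`, and the extra row for "Dana" are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: inverts a person -> city mapping into city -> people.
class InvertedIndexSketch {

    // Builds the inverted view: each distinct city maps to the people who live there.
    static Map<String, List<String>> invert(Map<String, String> cityByPerson) {
        Map<String, List<String>> peopleByCity = new TreeMap<>();
        for (Map.Entry<String, String> row : cityByPerson.entrySet()) {
            peopleByCity.computeIfAbsent(row.getValue(), city -> new ArrayList<>())
                        .add(row.getKey());
        }
        return peopleByCity;
    }

    public static void main(String[] args) {
        // Forward direction: "Where does Stephen live?" is a single lookup.
        Map<String, String> cityByPerson = new LinkedHashMap<>();
        cityByPerson.put("Stephen", "Philadelphia");
        cityByPerson.put("Chris", "Philadelphia");
        cityByPerson.put("Tara", "Philadelphia");
        cityByPerson.put("Dana", "Boston"); // made-up row for contrast
        System.out.println(cityByPerson.get("Stephen"));

        // Inverted direction: "Who lives in Philadelphia?" is now a single lookup too.
        System.out.println(invert(cityByPerson).get("Philadelphia"));
    }
}
```

The forward map answers questions keyed by person; the inverted map answers questions keyed by value, which is the shape a search index needs.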

  7. Lucene Document Centric Create documents on the fly. Add a field called Category that holds the table name. Lucene indexes everything.

  8. Lucene One Developer Says Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered.

  9. Lucene How search engines work Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.
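
A minimal sketch of the scanning step, assuming a posting carries the word, a document ID, and the word's position. The names `PostingScanSketch`, `Posting`, and `scan` are hypothetical, and splitting on non-word characters is a simplification of real tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of producing postings for one document.
class PostingScanSketch {

    // One occurrence of a word: the word, the document it came from, and its position.
    static final class Posting {
        final String word;
        final int docId;
        final int position;

        Posting(String word, int docId, int position) {
            this.word = word;
            this.docId = docId;
            this.position = position;
        }

        @Override
        public String toString() {
            return "<" + word + ", doc " + docId + ", pos " + position + ">";
        }
    }

    // Scans the text once and emits one posting per word occurrence.
    static List<Posting> scan(int docId, String text) {
        List<Posting> postings = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) { // split can yield a leading empty string
                postings.add(new Posting(word, docId, position++));
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        scan(7, "To be, or not to be").forEach(System.out::println);
    }
}
```

Word frequency within the document can be derived from this list by counting postings that share a word.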

  10. Lucene Building A Search Index If you think of the postings as tuples of the form <word, document-id>, a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem. The search index is a list of postings sorted by word.
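
The "indexing is sorting" point can be sketched as follows. Postings are modeled here as `Map.Entry<String, Integer>` pairs of word and document ID, which is an assumption made for brevity; `PostingSortSketch` and `sortByWord` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turning postings in document order into a search index.
class PostingSortSketch {

    // Reorders <word, document-id> postings by word, then by document ID.
    static List<Map.Entry<String, Integer>> sortByWord(List<Map.Entry<String, Integer>> postings) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator
                .comparing((Map.Entry<String, Integer> p) -> p.getKey())
                .thenComparing(p -> p.getValue()));
        return sorted;
    }

    public static void main(String[] args) {
        // Postings as they come out of indexing: grouped by document ID.
        List<Map.Entry<String, Integer>> byDocument = List.of(
                Map.entry("lucene", 1), Map.entry("index", 1),
                Map.entry("search", 2), Map.entry("index", 2));

        // After sorting, all postings for one word sit next to each other.
        System.out.println(sortByWord(byDocument));
    }
}
```

Once the list is sorted by word, finding every document that contains a word is a binary search followed by a short sequential read.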

  11. Lucene An Innovative Implementation Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones -- this keeps the total number of segments small so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes.
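
A rough sketch of the segment idea, assuming each segment is a small sorted map from word to document IDs. It is not Lucene's actual merge policy: the class `SegmentMergeSketch` and the `maxSegments` threshold are made up for illustration, and this version merges everything at once where Lucene merges incrementally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of segment merging; Lucene's real merge policy is more elaborate.
class SegmentMergeSketch {
    // Each segment maps a word to the sorted IDs of documents containing it.
    private final List<SortedMap<String, Set<Integer>>> segments = new ArrayList<>();
    private final int maxSegments; // assumed tuning knob, not a Lucene parameter

    SegmentMergeSketch(int maxSegments) {
        this.maxSegments = maxSegments;
    }

    // Every new document starts life as its own small segment.
    void addDocument(int docId, String... words) {
        SortedMap<String, Set<Integer>> segment = new TreeMap<>();
        for (String word : words) {
            segment.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
        segments.add(segment);
        if (segments.size() > maxSegments) {
            mergeAll();
        }
    }

    // Writes one new segment from the existing ones, then swaps it in.
    // The old segments are never edited, only replaced.
    void mergeAll() {
        SortedMap<String, Set<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, Set<Integer>> segment : segments) {
            segment.forEach((word, docIds) ->
                    merged.computeIfAbsent(word, w -> new TreeSet<>()).addAll(docIds));
        }
        segments.clear();
        segments.add(merged);
    }

    // A search must consult every segment, so fewer segments means less work.
    Set<Integer> search(String word) {
        Set<Integer> hits = new TreeSet<>();
        for (SortedMap<String, Set<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(word, Collections.emptySet()));
        }
        return hits;
    }

    int segmentCount() {
        return segments.size();
    }

    public static void main(String[] args) {
        SegmentMergeSketch index = new SegmentMergeSketch(2);
        index.addDocument(1, "lucene", "index");
        index.addDocument(2, "index");
        index.addDocument(3, "search", "index"); // third segment triggers a merge
        System.out.println(index.segmentCount() + " segment(s); index -> " + index.search("index"));
    }
}
```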

  12. Lucene Preventing Conflicts To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones -- after any active readers have closed them. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.

  13. Lucene Index Segment A Lucene index segment consists of several files: a dictionary index containing one entry for each 100 entries in the dictionary; a dictionary containing one entry for each unique word; and a postings file containing an entry for each posting.

  14. Lucene Flat Files Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.
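
The two-level lookup described above can be sketched with in-memory arrays standing in for the flat files. Everything here is an assumption for illustration: `DictionaryIndexSketch`, the `interval` parameter (100 in the slides, 2 in the demo), and the integer "offsets" are not Lucene's file format.

```java
import java.util.Arrays;

// Hypothetical sketch of a sparse dictionary index over a sorted dictionary.
class DictionaryIndexSketch {
    private final String[] dictionary;   // every unique word, sorted
    private final int[] postingsOffsets; // assumed offset into the postings file, per word
    private final String[] sampledWords; // every interval-th dictionary word
    private final int interval;

    DictionaryIndexSketch(String[] sortedWords, int[] postingsOffsets, int interval) {
        this.dictionary = sortedWords.clone();
        this.postingsOffsets = postingsOffsets.clone();
        this.interval = interval;
        int samples = (sortedWords.length + interval - 1) / interval;
        this.sampledWords = new String[samples];
        for (int i = 0; i < samples; i++) {
            sampledWords[i] = sortedWords[i * interval];
        }
    }

    // Returns the postings offset for the word, or -1 if the word is not in the dictionary.
    int lookup(String word) {
        // Step 1: binary search the small index for the last sample <= word.
        int slot = Arrays.binarySearch(sampledWords, word);
        if (slot < 0) {
            slot = -slot - 2; // insertion point minus one
        }
        if (slot < 0) {
            return -1; // word sorts before the first dictionary entry
        }
        // Step 2: scan at most `interval` dictionary entries starting at that sample.
        int start = slot * interval;
        int end = Math.min(start + interval, dictionary.length);
        for (int i = start; i < end; i++) {
            if (dictionary[i].equals(word)) {
                return postingsOffsets[i];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] words = {"apple", "banana", "cherry", "date", "fig"};
        int[] offsets = {0, 10, 20, 30, 40};
        DictionaryIndexSketch dict = new DictionaryIndexSketch(words, offsets, 2);
        System.out.println(dict.lookup("date"));
        System.out.println(dict.lookup("zebra"));
    }
}
```

Only the sampled words need to stay in memory; the full dictionary is touched for a short sequential scan.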

  15. Lucene Using Lucene - Create an Index The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

  16. Lucene The Code

      public class CreateIndex {
          // usage: CreateIndex index-directory
          public static void main(String[] args) throws Exception {
              String indexPath = args[0];
              IndexWriter writer;

              // An index is created by opening an IndexWriter with
              // create argument set to true.
              writer = new IndexWriter(indexPath, null, true);
              writer.close();
          }
      }

  17. Lucene Index Text Documents IndexFiles.java shows how to add documents -- the files named on the command line -- to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata -- such as name, author, or modification date -- with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling."
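
To show why separate fields make such queries possible, here is a plain-Java sketch that keeps one small inverted index per field. It does not use the Lucene API: `FieldIndexSketch`, `contains`, and `and` are hypothetical names, and the two sample messages are made up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: documents as name-value fields, indexed per field.
class FieldIndexSketch {
    // field name -> term -> IDs of documents whose field contains the term
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Indexes each field of the document separately.
    void addDocument(int docId, Map<String, String> fields) {
        fields.forEach((name, value) -> {
            Map<String, Set<Integer>> terms = index.computeIfAbsent(name, n -> new HashMap<>());
            for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
                }
            }
        });
    }

    // "field contains term": returns the matching document IDs.
    Set<Integer> contains(String field, String term) {
        return new TreeSet<>(index.getOrDefault(field, Map.of())
                .getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
    }

    // Boolean AND is the intersection of the two result sets.
    static Set<Integer> and(Set<Integer> left, Set<Integer> right) {
        Set<Integer> both = new TreeSet<>(left);
        both.retainAll(right);
        return both;
    }

    public static void main(String[] args) {
        FieldIndexSketch mail = new FieldIndexSketch();
        mail.addDocument(1, Map.of("subject", "Java generics", "author", "James Gosling"));
        mail.addDocument(2, Map.of("subject", "Java threads", "author", "Alice Example"));

        // subject contains Java AND author contains Gosling
        System.out.println(and(mail.contains("subject", "Java"),
                               mail.contains("author", "Gosling")));
    }
}
```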

  18. Lucene Indexing In Depth In the code, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.

  19. Lucene Code Example

      public class IndexFiles {
          // usage: IndexFiles index-path file . . .
          public static void main(String[] args) throws Exception {
              String indexPath = args[0];
              IndexWriter writer;

              writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);

              for (int i=1; i<args.length; i++) {
                  System.out.println("Indexing file " + args[i]);
                  InputStream is = new FileInputStream(args[i]);

                  // We create a Document with two Fields, one which contains
                  // the file path, and one the file's contents.
                  Document doc = new Document();
                  doc.add(Field.UnIndexed("path", args[i]));
                  doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));

                  writer.addDocument(doc);
                  is.close();
              }

              writer.close();
          }
      }

  20. Lucene Search Search.java provides an example of how to search the index. While the com.lucene.Query package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object -- a collection of Document objects, one for each document matched by the query -- and an associated relevance score for each document, sorted by score.

  21. Lucene Code Example

      public class Search {
          public static void main(String[] args) throws Exception {
              String indexPath = args[0], queryString = args[1];

              Searcher searcher = new IndexSearcher(indexPath);
              Query query = QueryParser.parse(queryString, "body",
                                              new SimpleAnalyzer());
              Hits hits = searcher.search(query);

              for (int i=0; i<hits.length(); i++) {
                  System.out.println(hits.doc(i).get("path") + "; Score: " +
                                     hits.score(i));
              }
          }
      }

  22. Lucene Query Parsing The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided. The query parser can parse queries like these:

      free AND "text search"    Search for documents containing "free" and the phrase "text search"
      +text search              Search for documents containing "text" and preferentially containing "search"
      giants -football          Search for "giants" but omit documents containing "football"
      author:gosling java       Search for documents containing "gosling" in the author field and "java" in the body

  23. Lucene Beyond Basic Text Documents Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store. The Analyzer converts the contents of each Field into a sequence of tokens.

  24. Lucene Token Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization. Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.
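
The transformations listed above can be sketched as a small pipeline in plain Java. This is not Lucene's Analyzer class: `AnalyzerSketch`, its six-word stop list, and the plural-stripping rule standing in for stemming are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of what an analyzer does; not the Lucene Analyzer API.
class AnalyzerSketch {
    // Tiny illustrative stop list; real stop lists are much longer.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to");

    // Converts raw text into the tokens that would be indexed.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Word identification: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT); // term normalization
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // stop-word elimination
            }
            tokens.add(stripPluralS(token));
        }
        return tokens;
    }

    // Deliberately naive stand-in for stemming: drops a trailing plural "s".
    static String stripPluralS(String token) {
        boolean plural = token.length() > 3 && token.endsWith("s") && !token.endsWith("ss");
        return plural ? token.substring(0, token.length() - 1) : token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Analyzers of Lucene index documents"));
    }
}
```

Because the same analysis is applied to both documents and queries, a query for "document" can match text that said "documents".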

  25. Lucene Analyzer The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them. If a document is not a text file but an HTML or XML file, for example, you can extract content by eliminating markup like HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream, which would convert a document stream into a stream containing only the document's content text, and connect it to the InputStream that retrieves the document. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java. But it would use an application-provided InputStream class to retrieve the document from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
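
A minimal sketch of the filtering idea, using only the JDK. It extends `java.io.FilterReader` rather than the FilterInputStream named above, because it works on characters instead of bytes; that swap, the class name `TagStrippingReader`, and the helper `stripTags` are assumptions for illustration. It simply drops everything between `<` and `>` and is not a real HTML or XML parser.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a reader that passes through only the text outside of tags.
class TagStrippingReader extends FilterReader {
    private boolean insideTag = false;

    TagStrippingReader(Reader in) {
        super(in);
    }

    // Returns the next character that is not part of a tag, or -1 at end of input.
    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;
            } else if (c == '>' && insideTag) {
                insideTag = false;
            } else if (!insideTag) {
                return c;
            }
        }
        return -1;
    }

    // Bulk read built on the single-character read so both paths filter identically.
    @Override
    public int read(char[] buffer, int offset, int length) throws IOException {
        if (length == 0) {
            return 0;
        }
        int count = 0;
        int c;
        while (count < length && (c = read()) != -1) {
            buffer[offset + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    // Convenience helper for the demo: filters a whole string.
    static String stripTags(String markup) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new TagStrippingReader(new StringReader(markup))) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>Lucene</b></p>"));
    }
}
```

The wrapped reader could just as well come from a file or a database column; the indexing code only ever sees the filtered character stream.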

  26. Lucene Summary Lucene is the most flexible and convenient open source search toolkit I've ever used. Cutting describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result. The design seems so simple, you might suspect it is just the obvious way to design a search toolkit. We should all be so lucky as to craft such obvious designs for our own software.
