Lucene • Brian Nisonger • Feb 08, 2006
What is it? • Named after Doug Cutting’s wife’s middle name (also her maternal grandmother’s first name) • An open source set of Java classes • Search Engine/Document Classifier/Indexer • http://lucene.sourceforge.net/talks/pisa/ • Developed by Doug Cutting 1996 • Xerox/Apple/Excite/Nutch • Wrote several papers in IR
What is it: Nuts and Bolts • Modules for IR • Analysis: tokenization, where text is turned into the tokens that get indexed • Document: where the Document ID is created and the date and title of the document are extracted
Nuts and Bolts II • Modules (cont’d) • Index: provides access to indexes and maintains them • Query Parser: where the magic of the query happens • Search: searches across indexes
Nuts and Bolts III • Modules (cont’d) • Search Spans: match terms that occur within K words of each other • Example: find me a document that has Rachael Ray and Alton Brown within 100 words of each other and that also has the term cooking • Store/Util: stores the indexes and handles other housekeeping
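The "within K words" idea from the span example can be sketched in plain Java without the Lucene library. This is only an illustration under simple assumptions: whitespace tokenization, single-word terms, and hypothetical names (`ProximitySketch`, `withinK`) that are not Lucene's span API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ProximitySketch {
    // Token offsets (0-based) at which `term` occurs in `tokens`, in ascending order.
    static List<Integer> positions(String[] tokens, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) out.add(i);
        }
        return out;
    }

    // True if some occurrence of `a` lies within `k` tokens of some occurrence of `b`.
    static boolean withinK(String text, String a, String b, int k) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<Integer> pa = positions(tokens, a.toLowerCase(Locale.ROOT));
        List<Integer> pb = positions(tokens, b.toLowerCase(Locale.ROOT));
        int i = 0, j = 0;
        // Both lists are sorted, so advancing the smaller position finds the closest pair.
        while (i < pa.size() && j < pb.size()) {
            int d = Math.abs(pa.get(i) - pb.get(j));
            if (d <= k) return true;
            if (pa.get(i) < pb.get(j)) i++; else j++;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "rachael and alton talk about cooking";
        System.out.println(withinK(doc, "rachael", "alton", 100));
    }
}
```

A real span query would read term positions from the index rather than re-tokenizing the text; the two-pointer sweep is just the simplest way to show the distance test.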
Theory • Space Optimization for Total Ranking • Cutting et al 1996 • RIAO (Computer Assisted IR) 1997 • http://lucene.sf.net/papers/riao97.ps • Lucene lecture at Pisa • Doug Cutting • Slides from Lecture at University of Pisa 2004 • See previous link
Vector • Terms and documents are represented as vectors • Cosine similarity measures how close two terms/documents are • This similarity can then be used for WSD/Clustering/IR • Example: • Bass,fishing: .6506 • Bass,guitar: .000423 • The much higher score for fishing tells us the document is about fishing, not about guitars
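As a rough sketch of the cosine measure, the code below compares two sparse term-weight vectors stored as maps. The class name `CosineSketch` and the toy weights are assumptions made for illustration; they do not reproduce the scores on the slide.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) sum += x * x;
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w; // only shared dimensions contribute
        }
        double na = norm(a), nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Double> doc = new HashMap<>();
        doc.put("bass", 1.0);
        doc.put("fishing", 1.0);
        Map<String, Double> query = new HashMap<>();
        query.put("fishing", 1.0);
        System.out.println(cosine(doc, query));
    }
}
```

Values near 1 mean the vectors point the same way (many shared, similarly weighted terms); values near 0 mean almost nothing is shared.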
Vectors-IR • “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.” • http://www.perl.com/pub/a/2003/02/19/engine.html • Intro to Comp Ling and its applications to IR • Nisonger 2005 :P
Inverted Index • Term/Doc Id/Weight • Term • “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.” • http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
Inverted Index (cont’d) • Doc Id • A unique “key” that identifies each document • Weight • Binary (term present or absent) • Frequency count • Weighting algorithm
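The Term/Doc Id/Weight layout can be sketched as a map from each term to the documents that contain it. This is a toy in-memory structure, not Lucene's on-disk format; the class name `InvertedIndexSketch`, the regex tokenizer, and the choice of a raw frequency count as the weight are all assumptions for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // term -> (docId -> frequency count of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new TreeMap<>();

    // Tokenize `text` and record one occurrence per token under `docId`.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue; // split can yield a leading empty string
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Documents containing `term`, with their frequency counts (empty if none).
    Map<Integer, Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeMap<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Bass fishing: catching bass");
        index.add(2, "Bass guitar");
        System.out.println(index.lookup("bass"));
    }
}
```

Swapping the count for a 0/1 flag gives the binary weight, and replacing it with a computed score gives the weighting-algorithm variant.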
Index Merge • Example: Basic/Basket/Basketball • Stores only the differences between successive words, so shared prefixes are not repeated • Periodically merges indexes • Allows new documents to be added easily
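One way to read "only keeps track of the differences between words" is prefix (front) coding of a sorted term list: each term is stored as the number of leading characters it shares with the previous term plus the remaining suffix. The sketch below assumes that reading; `FrontCodingSketch` and `Coded` are hypothetical names, and this is not Lucene's actual file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FrontCodingSketch {
    // One encoded term: shared-prefix length with the previous term, plus the rest.
    static final class Coded {
        final int prefixLen;
        final String suffix;
        Coded(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    // Encode a lexicographically sorted term list.
    static List<Coded> encode(List<String> sortedTerms) {
        List<Coded> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int p = 0;
            while (p < max && prev.charAt(p) == term.charAt(p)) p++;
            out.add(new Coded(p, term.substring(p)));
            prev = term;
        }
        return out;
    }

    // Rebuild the original terms by reusing the prefix of the previously decoded term.
    static List<String> decode(List<Coded> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Coded e : entries) {
            String term = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Coded c : encode(Arrays.asList("basic", "basket", "basketball"))) {
            System.out.println(c.prefixLen + " + \"" + c.suffix + "\"");
        }
    }
}
```

With the slide's three words, "basket" keeps only "ket" after the shared "bas", and "basketball" keeps only "ball" after the shared "basket".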
Query • Boolean Search • Only searches documents with at least 1 term in the query • “Boolean Search Engine” • Parallel Search • Each term in the query is searched in parallel • Partial scores are added to a queue of docs
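The partial-score idea can be sketched as follows: each query term contributes its weight to every document in its postings list, so only documents containing at least one query term are ever touched. This sketch walks the terms one after another rather than in parallel, and the names (`PartialScoreSketch`, `score`) and the toy weights are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PartialScoreSketch {
    // postings: term -> (docId -> weight). Returns docId -> sum of matching term weights.
    static Map<Integer, Double> score(Map<String, Map<Integer, Double>> postings,
                                      List<String> query) {
        Map<Integer, Double> accumulators = new HashMap<>();
        for (String term : query) {
            Map<Integer, Double> list = postings.get(term);
            if (list == null) continue; // term not in the index
            for (Map.Entry<Integer, Double> p : list.entrySet()) {
                // Add this term's partial score to the document's running total.
                accumulators.merge(p.getKey(), p.getValue(), Double::sum);
            }
        }
        return accumulators;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Double>> postings = new HashMap<>();
        Map<Integer, Double> bass = new HashMap<>();
        bass.put(1, 0.5);
        bass.put(2, 0.25);
        Map<Integer, Double> fishing = new HashMap<>();
        fishing.put(1, 0.5);
        postings.put("bass", bass);
        postings.put("fishing", fishing);
        System.out.println(score(postings, Arrays.asList("bass", "fishing")));
    }
}
```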
Query II • Threshold • If a document's partial score is too low for it to make the N-best list, the document is ignored even before the search is complete • Example • Potential New Doc [0,0,0,0,0,0,i] • Document ranked 14 [233,202,109,100,i] • Potential New Doc is ignored • A small loss of recall greatly increases the speed of search
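A simplified sketch of the threshold test: keep the N best scores in a min-heap, and skip any candidate whose partial score plus the most it could still gain cannot beat the current N-th best. The upper bound `maxRemaining` is an assumption added for this example; the heuristic described on the slide may be more aggressive, which is where its small loss of recall would come from. `NBestSketch` is a hypothetical name.

```java
import java.util.PriorityQueue;

final class NBestSketch {
    private final int n;
    // Min-heap: once full, its head is the current N-th best score.
    private final PriorityQueue<Double> best = new PriorityQueue<>();

    NBestSketch(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be at least 1");
        this.n = n;
    }

    // Score a candidate must beat to enter the list; no threshold until the list is full.
    double threshold() {
        return best.size() < n ? Double.NEGATIVE_INFINITY : best.peek();
    }

    // Could a document with this partial score still reach the N-best list?
    boolean worthScoring(double partial, double maxRemaining) {
        return partial + maxRemaining > threshold();
    }

    // Record a finished score, evicting the current N-th best if this one is higher.
    void offer(double finalScore) {
        if (best.size() < n) {
            best.add(finalScore);
        } else if (finalScore > best.peek()) {
            best.poll();
            best.add(finalScore);
        }
    }

    public static void main(String[] args) {
        NBestSketch top = new NBestSketch(2);
        for (double s : new double[] {233, 202, 109, 100}) top.offer(s);
        System.out.println(top.worthScoring(0, 50));
    }
}
```

In the slide's example, a new document whose partial scores are all zero cannot catch a document already scoring in the hundreds, so it is dropped without finishing its evaluation.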
Evaluation of Lucene • Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering • Tellex et al, MIT AI Lab 2003 • Compared Prise to Lucene on question answering tasks • Question & Answer • <Who is the president?> <George W. Bush .76>
Evaluation II • Prise • An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques • Findings • Found Prise was better than Lucene, since “Boolean” query engines are considered old school and Prise's answers to questions were better
Eval III • Lucene • Found that although Prise returned more correct answers, Lucene found more documents containing relevant information
Eval: Conclusion • External Knowledge Sources for Question Answering • http://people.csail.mit.edu/gremio/publications/TREC2005.ps. • Katz et al, MIT Lab 2005 • MIT used Lucene, not Prise, in their 2005 TREC submission
Users • Lucene is used widely • TREC • Document Retrieval Enterprise Systems • Part of Database/Web engine • Part of Nutch • Used by academics for large projects • MIT, AI Lab • Know-It-All Project (UW)
Conclusions • Lucene is a good set of classes • Designed to allow customization without having to “reinvent the wheel” • Robust • Fast • Large development groups • Used widely in academia and industry
Questions? • Feel free to ask questions, make comments, tell jokes.