Lucene in Action • For ITCS 6265 • Professor: Wensheng Wu • Presented by TA: Xu Fei
What is Lucene • “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.” • A high-performance, scalable Information Retrieval (IR) library • A project of the Apache Software Foundation • Mature, free, open source • Implemented in Java
Full-text indexing and searching • “In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.” • “Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval.”
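To make the first definition concrete, here is a minimal sketch of an index-free full-text search that examines the words of every stored document for each query. It is plain Python, not Lucene code; the function name `full_text_scan`, the sample documents, and the choice to require every query word to be present are assumptions made for illustration.

```python
def full_text_scan(docs, query):
    """Return ids of documents that contain every word of the query.

    Hypothetical helper: scans all words of all documents on each call,
    which is the cost an index is meant to avoid.
    """
    query_words = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        doc_words = set(text.lower().split())
        if all(word in doc_words for word in query_words):
            hits.append(doc_id)
    return sorted(hits)


# Sample documents (made up for this sketch).
docs = {
    1: "Penn State Football",
    2: "Football players",
    3: "Apache Lucene search library",
}
football_hits = full_text_scan(docs, "football")
```

Every query rereads every document here, so the cost grows with the size of the collection. Indexing pays that parsing cost once, up front.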
Lucene is popular • a number of ports or integrations to other programming languages • C/C++, C#, Ruby, Perl, Python, PHP, etc. • 1500+ installations: • HP, FedEx, Iron Mountain, Akamai, DSpace, IBM/Yahoo, Healthline, Webmail, CNET, Lookout (acquired by Microsoft), webshots.com (100M docs, 4M queries/day), Siderean, Monster….
Lucene is just a hammer! • NOT a ready-to-use search application, like Google • a software library, a toolkit • a single compact JAR file (less than 1 MB!) • A number of full-featured search applications have been built on top of Lucene.
What Lucene can do for you • add search capabilities to your application • index and make searchable any data that you can extract text from • Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it. • You can even index data stored in your databases, indirectly!
Search Application • Components for indexing: Acquire Content, Build Document, Analyze Document, Index Document • Components for searching: Search User Interface, Build Query, Search Query, Render Results • Others: Administration Interface, Analytics Interface, Scaleout • Figure 1. Typical components of search application; the shaded components show which parts Lucene handles.
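The "Analyze Document" step can be sketched roughly as follows: raw text is split into tokens, lowercased, and filtered before it reaches the index. This is a plain-Python illustration, not Lucene's analyzer API; the function name `analyze`, the tokenization rule, and the tiny stop-word list are all assumptions.

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def analyze(text):
    """Turn raw text into a list of index terms.

    Steps: lowercase, split on anything that is not a letter or digit,
    then drop stop words.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


terms = analyze("The Penn State Football team")
```

A search application would normally run query text through the same analysis, so that query terms and indexed terms can match.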
Ranking formula • score(Q,D) = coord(Q,D) · queryNorm(Q) · ∑_{t in Q} ( tf(t in D) · idf(t)^2 · t.getBoost() · norm(D) ) • tf–idf weight (term frequency–inverse document frequency)
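A small worked sketch of the ranking formula may help. The factor definitions below (square-root tf, logarithmic idf, coord as the fraction of query terms matched, norm as 1/sqrt of document length with all boosts defaulting to 1) are loosely modeled on the defaults described in the Lucene 2.4 scoring documentation listed in the references. They should be read as assumptions of this plain-Python example, not as Lucene's implementation.

```python
import math
from collections import Counter


def score(query_terms, doc_tokens, corpus, boosts=None):
    """Score one document against a query with a tf-idf style formula.

    corpus is a list of token lists, used only for document frequencies.
    boosts maps a term to its boost (assumed 1.0 when absent).
    """
    boosts = boosts or {}
    counts = Counter(doc_tokens)
    num_docs = len(corpus)

    def idf(term):
        # Assumed form: 1 + ln(numDocs / (docFreq + 1)).
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    matched = [t for t in query_terms if counts[t] > 0]
    if not matched:
        return 0.0

    coord = len(matched) / len(query_terms)      # coord(Q,D)
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq)         # queryNorm(Q)
    norm = 1.0 / math.sqrt(len(doc_tokens))      # norm(D), length only

    total = sum(
        math.sqrt(counts[t]) * idf(t) ** 2 * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return coord * query_norm * total


# Sample corpus (made up for this sketch).
corpus = [
    ["penn", "state", "football"],
    ["football", "players", "state"],
    ["penn", "players"],
]
query = ["penn", "football"]
scores = [score(query, tokens, corpus) for tokens in corpus]
```

With this corpus, the first document matches both query terms and so ranks above the second, which matches only one. The coord factor and the extra term in the sum both push in that direction.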
Key index files in Lucene • Segments file • Fields information file • Text information file • Frequency file • Position file
Inverted Index Example • Doc 1: Penn State Football … football • Doc 2: Football players … State • Posting Table (figure not reproduced)
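The idea behind the posting table can be sketched in a few lines: each term maps to the list of documents that contain it. The document texts below are shortened stand-ins for the two slide documents (the elided parts are not reconstructed), and `build_inverted_index` is a hypothetical helper, not Lucene's on-disk index format.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to a sorted list of document ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Shortened stand-ins for Doc 1 and Doc 2.
docs = {
    1: "Penn State Football",
    2: "Football players State",
}
index = build_inverted_index(docs)
```

Looking up a term is now a single dictionary access (`index["football"]`) instead of a scan over every document.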
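Demo • How to install Lucene and run the demo • Boolean retrieval example • apache -lucene (documents that contain apache but not lucene) • apache +lucene (lucene is required, apache is optional) • apache lucene (documents that contain either term) • Luke: http://www.getopt.org/luke/ • An online demo (PHP + Lucene): http://tiny.cc/JCA9K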
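The three Boolean queries in the demo can be mimicked with set operations over posting lists. This is a minimal sketch, assuming a `+` prefix marks a required term, a `-` prefix marks a prohibited term, and unprefixed terms are optional. The function `boolean_search` and the sample postings are hypothetical, and the sketch ignores scoring entirely.

```python
def boolean_search(postings, query):
    """Evaluate a query of +required, -prohibited and optional terms.

    postings maps a term to the set of document ids that contain it.
    """
    required, optional, prohibited = [], [], []
    for token in query.lower().split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            prohibited.append(token[1:])
        else:
            optional.append(token)

    def lookup(term):
        return postings.get(term, set())

    if required:
        # Every required term must match; optional terms do not filter.
        hits = set.intersection(*(lookup(t) for t in required))
    else:
        # No required terms: any optional term may match.
        hits = set().union(*(lookup(t) for t in optional))

    for term in prohibited:
        hits = hits - lookup(term)
    return sorted(hits)


# Sample postings (made up for this sketch).
postings = {
    "apache": {1, 2, 3},
    "lucene": {2, 3, 4},
}
only_apache = boolean_search(postings, "apache -lucene")
must_lucene = boolean_search(postings, "apache +lucene")
either = boolean_search(postings, "apache lucene")
```

In a real engine, optional terms still influence ranking even when they do not change which documents match. That part is left out here.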
Reference: • Lucene: http://lucene.apache.org/ • Apache: http://www.apache.org/ • “Lucene in Action” Chapter 1 and code: Link • Lucene index: http://www.ibm.com/developerworks/library/wa-lucene/ • http://lucene.apache.org/java/2_4_0/scoring.html • http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html • http://en.wikipedia.org/wiki/Full_text_search • http://en.wikipedia.org/wiki/Index_%28search_engine%29 • http://en.wikipedia.org/wiki/Tf-idf