Searching with Lucene Chapter 2
For discussion • Information retrieval • What is Lucene? • Code for an indexer using Lucene • PageRank algorithm
Information Retrieval • Consider a collection of documents. • You want to know which words are in each document. • Given a word, you want to know which documents it occurs in. • You want to know how many times a word occurs in each document. • You want to rank the documents according to that count.
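The questions above can be sketched with a small inverted index. This is a minimal illustration in plain Python, not Lucene's implementation; the names `build_index`, `words_in`, and `search`, the sample documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build a hypothetical inverted index: word -> {doc_id: occurrence count}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        # Naive tokenization (lowercase, split on whitespace); an assumption.
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def words_in(index, doc_id):
    """Return the sorted list of words that occur in one document."""
    return sorted(w for w, postings in index.items() if doc_id in postings)


def search(index, word):
    """Return (doc_id, count) pairs for a word, highest count first."""
    postings = index.get(word.lower(), Counter())
    # Sort by descending count, then by doc_id to keep ties deterministic.
    return sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample collection for illustration only.
docs = {
    "d1": "lucene indexes text and lucene searches text",
    "d2": "hadoop grew out of lucene experience",
    "d3": "pagerank ranks pages by links",
}
index = build_index(docs)
ranked = search(index, "lucene")
```

Here `search(index, "lucene")` answers "which documents contain this word, and how often", ordered by raw count. Ranking by raw count is the simplest possible scoring; real search libraries typically use more refined scoring.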
What is Lucene? • Apache Lucene is a free, open-source information retrieval software library, originally written in Java by Doug Cutting. • It is supported by the Apache Software Foundation and released under the Apache License. • Its indexing is very fast. • Experience with Lucene led to the development of Hadoop (also by Doug Cutting).
Why do we need to study it? • Search is more than indexing: link analysis, click analysis, natural language processing, latent Dirichlet allocation (LDA), PageRank, … • We are interested in data-intensive computing: algorithms such as MapReduce and data structures such as the Google File System. • The algorithms we discuss in the context of Lucene could all be converted to data-intensive methods to improve performance and scalability.