Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll
The How Many Game • How many of you: • Have taken a class in Information Retrieval (IR)? • Are doing work/research in IR? • Have heard of or are using Lucene? • Have heard of or are using Solr? • Are doing work on core IR algorithms such as compression techniques or scoring? • Are doing UI/Application work/research as they relate to search?
Topics • Brief Bio • Search 101 (skip?) • What is: • Apache Lucene • Apache Solr • What can they do? • Features and functionality • Intangibles • What’s new in Lucene and Solr? • How can they help my research/work/____?
Brief Bio • Apache Lucene/Solr Committer • Apache Mahout co-founder • Scalable Machine Learning • Co-founder of Lucid Imagination • http://www.lucidimagination.com • Previously worked at Center for Natural Lang. Processing at Syracuse Univ. with Dr. Liddy • Co-Author of upcoming “Taming Text” (Manning Publications) • http://www.manning.com/ingersoll
Search 101 • Search tools are designed for dealing with fuzzy data/questions • Work well with structured and unstructured data • Perform well when dealing with large volumes of data • Many apps don’t need the limits that databases place on content • Search fits well alongside a DB too • Given a user’s information need (query), find and, optionally, score content relevant to that need • Many different ways to solve this problem, each with tradeoffs • What does “relevant” mean?
Search 101 • Indexing • Finds and maps terms and documents • Conceptually similar to a book index • At the heart of fast search/retrieve • Relevance • Vector Space Model (VSM) for relevance • Common across many search engines • Apache Lucene is a highly optimized implementation of the VSM
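To make the book-index analogy and the Vector Space Model concrete, here is a minimal sketch in Python. It is not Lucene's implementation: the corpus, the whitespace tokenizer, and the plain TF-IDF cosine scoring are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; the IDs and text are made up for illustration.
DOCS = {
    1: "lucene is a search library",
    2: "solr is a search server built on lucene",
    3: "mahout is a machine learning library",
}

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: term frequency}, much like a book index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, num_docs):
    """Inverse document frequency: rarer terms carry more weight."""
    df = len(index.get(term, {}))
    return math.log(num_docs / df) if df else 0.0

def doc_norm(index, docs, doc_id):
    """Euclidean length of a document's TF-IDF vector."""
    n = len(docs)
    counts = Counter(tokenize(docs[doc_id]))
    return math.sqrt(sum((tf * idf(index, t, n)) ** 2 for t, tf in counts.items()))

def search(index, docs, query):
    """Rank documents by cosine similarity between query and document vectors."""
    n = len(docs)
    q_weights = {t: tf * idf(index, t, n)
                 for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    # Walk only the postings of the query terms; this is why the index is fast.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_w * tf * idf(index, term, n)
    results = []
    for doc_id, dot in dots.items():
        d_norm = doc_norm(index, docs, doc_id)
        if q_norm and d_norm:
            results.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(results, key=lambda r: r[1], reverse=True)

index = build_index(DOCS)
ranked = search(index, DOCS, "lucene")
```

In this toy corpus the shorter document about Lucene ranks above the longer one, because cosine normalization penalizes documents whose vectors contain many other weighted terms.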
Apache Lucene in a Nutshell • http://lucene.apache.org/java • Java-based Application Programming Interface (API) for adding search and indexing functionality to applications • Fast and efficient scoring and indexing algorithms • Lots of contributions to make common tasks easier: • Highlighting, spatial, Query Parsers, Benchmarking tools, etc. • Most widely deployed search library on the planet
Lucene Basics • Content is modeled via Documents and Fields • Content can be text, integers, floats, dates, custom • Analysis can be employed to alter content before indexing • Searches are supported through a wide range of Query options • Keyword • Terms • Phrases • Wildcards • Many, many more
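The idea of modeling content as documents with fields, and altering text through analysis before indexing, can be sketched as follows. This is an illustration in Python rather than Lucene's API: the stop-word list, the crude stemmer, and the field names are all hypothetical.

```python
import re

# Hypothetical stop-word list; real analyzers ship larger, per-language lists.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that rarely help retrieval."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(tokens):
    """Strip a trailing 's' from longer tokens (a crude stand-in for a stemmer)."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze(text, filters=(remove_stop_words, naive_stem)):
    """Run the tokenizer, then each filter in order, like an analysis chain."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

# A document modeled as named fields; only text fields go through analysis.
doc = {"title": "The Indexing of Documents", "year": 2011}
indexed_terms = {
    name: analyze(value) if isinstance(value, str) else [value]
    for name, value in doc.items()
}
```

Because the same chain would normally be applied to queries as well, a search for "documents" and a search for "document" would both reduce to the same indexed term in this sketch.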
Apache Solr in a Nutshell • http://lucene.apache.org/solr • Lucene-based Search Server + other features and functionality • Access Lucene over HTTP: • Java, XML, Ruby, Python, .NET, JSON, PHP, etc. • Most programming tasks in Lucene are configuration tasks in Solr • Faceting (guided navigation, filters, etc.) • Replication and distributed search support • Lucene Best Practices
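Since Solr exposes search over HTTP, a client in any language mostly needs to build a URL. The sketch below assembles a query with faceting using Python's standard library. The `q`, `wt`, `rows`, `facet`, and `facet.field` parameters are standard Solr request parameters; the helper name, the `/select` path on a single-core server at `localhost:8983`, and the `cat` field are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr search URL; base_url is assumed to point at a Solr instance."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting is switched on once, then one facet.field is added per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical query against a locally running example server.
url = build_select_url("http://localhost:8983/solr", "ipod", facet_fields=["cat"])
```

The resulting URL could be fetched with any HTTP client. The response format depends on the `wt` parameter, which is how the same server can serve clients written in different languages.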
Quick Solr/Lucene Demo • Pre-reqs: • Apache Ant 1.7.x, Subversion (SVN) • Command Line 1: • svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk • cd solr-trunk/solr/ • ant example • cd example • java -Dsolr.clustering.enabled=true -jar start.jar • Command Line 2: • cd exampledocs; java -jar post.jar *.xml • http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Other Features • Data Import Handler • Database, Mail, RSS, etc. • Rich document support via Apache Tika • PDF, MS Office, Images, etc. • Replication for high query volume • Distributed search for large indexes • Production systems with 1B+ documents • Configurable Analysis chain and other extension points • Total control over tokenization, stemming, etc.
Intangibles • Open Source • Flexible, non-restrictive license • Apache License v2 – non-viral • “Do what you want with the software, just don’t claim you wrote it” • Large community willing to help • Great place to learn about real world IR systems • Many books and other documentation • Lucene in Action by Hatcher, McCandless and Gospodnetic
What’s New? • https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/CHANGES.txt • https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/CHANGES.txt • Codecs • Pluggable index formats • Provide different index compression techniques • Stats to enable alternate scoring approaches • BM25, Lang. Modeling, etc. -- More work to be done here • Faster • Java Strings are slow; convert to use byte arrays
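To show why extra index statistics matter for alternate scoring, here is a sketch of the textbook Okapi BM25 score for a single term. It is a common formulation, not Lucene's code, and the default values `k1=1.2` and `b=0.75` are conventional choices assumed here. Note the inputs it needs beyond term frequency: document length, average document length, and document frequency.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with a common BM25 formulation.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length across the collection
    num_docs    -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Smoothed IDF; adding 1 inside the log keeps the value non-negative.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

# Hypothetical statistics for illustration.
short_doc = bm25_term_score(tf=2, doc_len=50, avg_doc_len=100, num_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=400, avg_doc_len=100, num_docs=1000, doc_freq=10)
```

Unlike raw term frequency, the BM25 contribution saturates as `tf` grows, so repeating a term many times yields diminishing returns.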
Other New Items • Many new Analyzers (tokenizers, etc.) • Richer Language support (Hindi, Indonesian, Arabic, …) • Richer Geospatial (Local) Search capabilities • Score, filter, sort by distance • http://wiki.apache.org/solr/SpatialSearch • Results Grouping • Group Related Results • http://wiki.apache.org/solr/FieldCollapsing • More Faceting Capabilities • Pivot • New underlying algorithms
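Filtering and sorting by distance can be sketched with the haversine great-circle formula. This is an illustration of the idea only, not how Solr's spatial search is implemented; the place names, approximate coordinates, and the `nearby` helper are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby(docs, lat, lon, max_km):
    """Keep documents within max_km of the point, nearest first."""
    hits = [(haversine_km(lat, lon, d["lat"], d["lon"]), d) for d in docs]
    hits = [(dist, d) for dist, d in hits if dist <= max_km]
    return sorted(hits, key=lambda hit: hit[0])

# Hypothetical documents with approximate city coordinates.
PLACES = [
    {"name": "San Francisco", "lat": 37.77, "lon": -122.42},
    {"name": "New York", "lat": 40.71, "lon": -74.01},
    {"name": "Syracuse", "lat": 43.05, "lon": -76.15},
]

results = nearby(PLACES, lat=43.05, lon=-76.15, max_km=500)
names = [d["name"] for _, d in results]
```

The same distance value could also be folded into a relevance score, for example by boosting closer documents, which corresponds to the "score by distance" option.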
Job Trends http://www.indeed.com
Other Things that Can Help • Nutch • Crawling • http://nutch.apache.org • Mahout • Machine learning (clustering, classification, others) • http://mahout.apache.org • OpenNLP • Part of Speech, Parsers, Named Entity Recognition • http://incubator.apache.org/opennlp • Open Relevance Project • Relevance Judgments • http://lucene.apache.org/openrelevance
Resources • http://lucene.apache.org • http://www.lucidimagination.com • {java-user|solr-user}@lucene.apache.org • @gsingers • http://www.slideshare.net/gsingers • grant@lucidimagination.com