1. Advanced Lucene Grant Ingersoll
Ozgur Yilmazel
Niranjan Balasubramanian
Steve Rowe
Svetlana Symonenko
Center for Natural Language Processing
ApacheCon 2005
December 12, 2005
2. Overview CNLP Intro
Quick Lucene Refresher
Term Vectors
Span Queries
Building an IR Framework
Case Studies from CNLP
Collection analysis for domain specialization
Crosslingual/multilingual retrieval in Arabic, English and Dutch
Sublanguage analysis for commercial trouble ticket analysis
Passage retrieval and analysis for Question Answering application
Resources
3. Center for Natural Language Processing Self-supporting university research and development center
Experienced, multidisciplinary team of information scientists, computer scientists, and linguists
Many with substantial commercial software experience
Research, development & licensing of NLP-based technology for government & industry
ARDA, DARPA, NSA, CIA, DHS, DOJ, NIH
Raytheon, SAIC, Boeing, ConEd, MySentient, Unilever
For numerous applications:
Document Retrieval
Question-Answering
Information Extraction / Text Mining
Automatic Metadata Generation
Cross-Language Information Retrieval
60+ projects in past 6 years
4. Lucene Refresher Indexing
A Document is a collection of Fields
A Field is free text, keywords, dates, etc.
A Field can have several characteristics
indexed, tokenized, stored, term vectors
Apply Analyzer to alter Tokens during indexing
Stemming
Stopword removal
Phrase identification
For more details on Lucene, go to http://lucene.apache.org
5. Lucene Refresher Searching
Uses a modified Vector Space Model (VSM)
We convert a user’s query into an internal representation that can be searched against the Index
Queries are usually analyzed in the same manner as the index
Get back Hits and use in an application
To learn more about how Lucene scores documents, see the Similarity class in the Lucene javadocs
6. Lucene Demo++ Based on original Lucene demo
Sample, working code for:
Search
Relevance Feedback
Span Queries
Collection Analysis
Candidate Identification for QA
Requires: Lucene 1.9 RC1-dev
Available at http://www.cnlp.org/apachecon2005
In the demo, we took the code from the original demo and set it up as a Servlet that forwards to a JSP, instead of having all the code in the JSP.
Highlights of changes and additions:
Indexing:
Extended HTMLDocument called TVHTMLDocument
The contents field stores the Term Vector, including the position and offset information
The title field stores the Term Vector without position or offset info.
Searching:
Added checkboxes and submit button to the results page to allow the user to select relevant documents for use in relevance feedback
Added code to do basic relevance feedback.
Term Vectors
See TermVectorTest for simple TermVector usages
See IndexAnalysis for simple term co-occurrence calculations
Span Query
See SpanQueryTest for some simple examples
QA
Basic Candidate identification is done in the QAService class
All examples are written against the Lucene Subversion trunk as of October. Some code may be subject to change due to new patches.
7. Term Vectors Relevance Feedback and “More Like This”
Domain specialization
Clustering
Cosine similarity between two documents
Highlighter
Needs offset info
Cosine Similarity
People on the mailing list are often wondering how to get the cosine similarity between two documents
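That mailing-list question has a short answer once you have the parallel term and frequency arrays that a TermFreqVector exposes. A minimal sketch in Python (the parallel-array shape mirrors what getTerms() and getTermFrequencies() return; this is an illustration, not code from the demo):

```python
import math

def cosine_similarity(terms_a, freqs_a, terms_b, freqs_b):
    """Cosine similarity between two term-frequency vectors given as
    parallel arrays, the shape Lucene's TermFreqVector exposes."""
    a = dict(zip(terms_a, freqs_a))
    b = dict(zip(terms_b, freqs_b))
    # Dot product over the terms the two documents share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A raw-frequency weighting is used here for brevity; in practice you would usually weight by TF-IDF before taking the cosine.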
8. In Lucene, a TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance
As a tuple:
termFreq = <term, term count_D>
<fieldName, <…, termFreq_i, termFreq_i+1, …>>
As Java:
public String getField();
public String[] getTerms();
public int[] getTermFrequencies();
Lucene Term Vectors (TV)
Term Vectors were officially added in Lucene 1.4
Look at the TermFreqVector interface definition
getTerms() and getTermFrequencies() are parallel arrays: the term getTerms()[i] occurs getTermFrequencies()[i] times in the field
9. Creating Term Vectors During indexing, create a Field that stores Term Vectors:
new Field("title", parser.getTitle(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
Options are:
Field.TermVector.YES
Field.TermVector.NO
Field.TermVector.WITH_POSITIONS – Token Position
Field.TermVector.WITH_OFFSETS – Character offsets
Field.TermVector.WITH_POSITIONS_OFFSETS
See the IndexHTML and TVHTMLDocument classes in org.cnlp.apachecon.index for samples
Token positions in Lucene are increments relative to previous tokens. Two tokens can have the same position, and Analyzers can set the position as desired, so two tokens that are adjacent in the original text need not differ in position by exactly one.
Field constructors that don't accept a TermVector parameter assume Field.TermVector.NO
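The increment behavior described above can be sketched as follows. The token stream here is hypothetical, but the accumulation shows how per-token increments become the absolute positions stored in the index (an increment of 0 stacks a synonym on the previous token; an increment greater than 1 leaves a gap where a stopword was removed):

```python
def assign_positions(tokens_with_increments):
    """Turn (term, positionIncrement) pairs into absolute positions,
    the way Lucene accumulates increments during indexing."""
    positions = []
    pos = -1  # position before the first token
    for term, incr in tokens_with_increments:
        pos += incr
        positions.append((term, pos))
    return positions

# Hypothetical stream: "fast" is a synonym injected at the same position
# as "quick" (increment 0); a stopword dropped before "fox" leaves a gap
# (increment 2).
stream = [("quick", 1), ("fast", 0), ("fox", 2)]
```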
10. Accessing Term Vectors Term Vectors are acquired from the IndexReader using:
TermFreqVector getTermFreqVector(int docNumber,
String field)
TermFreqVector[] getTermFreqVectors(int docNumber)
Can be cast to TermPositionVector if the vector was created with offset or position information
TermPositionVector API:
int[] getTermPositions(int index);
TermVectorOffsetInfo[] getOffsets(int index);
See TermVectorTest in the example code
See SearchServlet in org.cnlp.apachecon.search for examples
TermPositionVector
Depending on construction options, positions or offsets may be null
11. Relevance Feedback Expand the original query using terms from documents
Manual Feedback:
User selects which documents are most relevant and, optionally, non-relevant
Get the terms from the term vector for each of the documents and construct a new query
Can apply boosts based on term frequencies
Automatic Feedback
Application assumes the top X documents are relevant and the bottom Y are non-relevant and constructs a new query based on the terms in those documents
See Modern Information Retrieval by Baeza-Yates, et al. for an in-depth discussion of feedback
See the performFeedback() method in SearchServlet
It is not possible to do "pure" negative feedback in Lucene, as it does not support negative boost values for terms. You can, however, decrement the boost of a term, or remove it entirely if it appears in one of the negative examples.
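The manual-feedback steps above (take the top terms from the relevant documents' term vectors, boost by frequency, and drop terms seen in negative examples since Lucene has no negative boosts) can be sketched as follows. The weighting is illustrative, not the demo's actual performFeedback() code:

```python
from collections import Counter

def expand_query(query_terms, relevant_vectors, negative_vectors, top_n=10):
    """Rocchio-style expansion sketch. Each vector is a (terms, freqs)
    pair of parallel arrays, as a TermFreqVector would supply."""
    pos = Counter()
    for terms, freqs in relevant_vectors:
        pos.update(dict(zip(terms, freqs)))
    neg = set()
    for terms, _ in negative_vectors:
        neg.update(terms)
    # Original query terms keep a neutral boost of 1.0
    expanded = {t: 1.0 for t in query_terms}
    for term, freq in pos.most_common():
        if len(expanded) >= len(query_terms) + top_n:
            break
        if term in neg or term in expanded:
            continue  # drop negative-example terms instead of negative-boosting
        expanded[term] = float(freq)  # boost by frequency in relevant docs
    return expanded
```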
12. Example From Demo, SearchServlet.java
Code to get the top X terms from a TV
protected Collection getTopTerms(TermFreqVector tfv, int numTermsToReturn)
{
String[] terms = tfv.getTerms();//get the terms
int [] freqs = tfv.getTermFrequencies();//get the frequencies
List result = new ArrayList(terms.length);
for (int i = 0; i < terms.length; i++)
{
//create a container for the Term and Frequency information
result.add(new TermFreq(terms[i], freqs[i]));
}
Collections.sort(result, comparator);//sort by frequency
if (numTermsToReturn < result.size())
{
result = result.subList(0, numTermsToReturn);
}
return result;
}
Taken from org.cnlp.apachecon.servlets.SearchServlet
Examples of using Position and Offset info are in org.cnlp.apachecon.index.IndexAnalysis
Take a look at the running demo
13. Span Queries Provide info about where a match took place within a document
SpanTermQuery is the building block for more complicated queries
Other SpanQuery classes:
SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery, SpanRegexQuery
Take a look at the javadocs for SpanQuery, Spans, and the derived classes
In the demo code, see SpanQueryTest. There are also many good test cases in the Lucene code
14. Spans The Spans object provides document and position info about the current match
From SpanQuery:
Spans getSpans(IndexReader reader)
Interface definitions:
boolean next() //Move to the next match
int doc()//the doc id of the match
int start()//The match start position
int end() //The match end position
boolean skipTo(int target) //skip to a doc
From Spans.java in the Lucene source
skipTo() skips to the next match with a doc id greater than or equal to target
15. Phrase Matching SpanNearQuery provides functionality similar to PhraseQuery
Use position distance instead of edit distance
Advantages:
Less slop is required to match terms
Can be built up of other SpanQuery instances
Edit distance is the number of moves it takes to get from one term to the other
Position distance is the difference between the positions of the two terms
Less slop allows for more precision
PhraseQuery requires at least two extra positions of slop in order to capture reversed terms
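A simplified sketch of the position-distance idea for the two-term case. Real SpanNearQuery slop accounting over multiple clauses is more involved, so treat this as an approximation of the semantics, not the implementation:

```python
def span_near_two(pos_a, pos_b, slop, in_order):
    """Do any occurrences of term a and term b fall within `slop`
    intervening token positions of each other? pos_a/pos_b are lists of
    token positions; adjacent tokens have a gap of 0, so the phrase
    "New York" matches with slop 0."""
    for a in pos_a:
        for b in pos_b:
            if in_order:
                gap = b - a - 1          # tokens between a and b, a first
                if 0 <= gap <= slop:
                    return True
            else:
                gap = abs(b - a) - 1     # either order allowed
                if gap <= slop:
                    return True
    return False
```

Note the contrast with PhraseQuery's edit-distance slop: here a reversed adjacent pair matches an unordered query with slop 0, while PhraseQuery would need slop 2 to capture the reversal.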
16. Example SpanTermQuery section = new SpanTermQuery(new Term("test", "section"));
SpanTermQuery lucene = new SpanTermQuery(new Term("test", "Lucene"));
SpanTermQuery dev = new SpanTermQuery(new Term("test", "developers"));
SpanFirstQuery first = new SpanFirstQuery(section, 2);
Spans spans = first.getSpans(indexReader); //do something with the spans
SpanQuery[] clauses = {lucene, dev};
SpanNearQuery near = new SpanNearQuery(clauses, 2, true);
spans = near.getSpans(indexReader); //do something with the spans
clauses = new SpanQuery[]{section, near};
SpanNearQuery allNear = new SpanNearQuery(clauses, 3, false);
spans = allNear.getSpans(indexReader); //do something with the spans
See SpanQueryTest.java in the demo for other alternatives
17. What is QA? Return an answer, not just a list of hits that the user has to sift through
Often built on top of IR Systems:
IR: Query usually has the terms you are interested in
QA: Query usually does not have the terms you are interested in
Question ambiguity:
Who is the President?
Future of search?!?
IR Systems:
Take the user's question, do a straight IR search along with some question analysis, then post-process the IR hits to determine if they satisfy the needs of the question
Question Ambiguity:
Who is the president of what? Of the United States? President of some company?
Questions can be phrased many different ways but all be looking for the same kind of thing:
Who is the founder of Lucene? -- Person
What is the name of the person who founded Lucene? -- Person
Also depends on what the indexed collection contains
Future of Search:
Most users only look at the first page of results, and of those, will probably only click through to a hit if the relevant section(s) look good. So why not try to build a system that determines whether the relevant section is good?
18. QA Needs Candidate Identification
Determine Question types and focus
Person, place, yes/no, number, etc.
Document Analysis
Determine if candidates align with the question
Answering non-collection based questions
Results display
We will focus on Candidate Identification in the context of Lucene, but the other items above are things to think about in the context of a real QA system
Answer Types:
How many committers are there on the Lucene project? -- Looking for a number
Who are the committers on the Lucene project? -- List of people
Is Doug Cutting the founder of Lucene? -- Yes/no
Document Analysis can be quite in-depth, involving extracting pertinent bits from a document and seeing if they align with the type of answer we are looking for.
Results Display -- do you simply return the results, or do you try to generate yes/no answers, etc.?
See "Finding Answers to Complex Questions" by Diekema, et al. in New Directions in Question Answering, edited by Mark T. Maybury, for background
19. Candidate Identification for QA A candidate is a potential answer for a question
Can be as short as a word or as large as multiple documents, based on system goals
Example:
Q: How many Lucene properties can be tuned?
Candidate:
Lucene has a number of properties that can be tuned.
20. Span Queries and QA Retrieve candidates using SpanNearQuery, built out of the keywords of the user’s question
Expand to include window larger than span
Score the sections based on system criteria
Keywords, distances, ordering, others
Sample code in QAService.java in the demo
See QAService.getCandidates in the demo code. In this code we execute the span query, determine the relevant sections by reconstructing the document, and then score the results. We use a SpanNearQuery with a fairly large slop factor.
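The candidate pipeline above (span match, expand to a surrounding window, score by keywords and distances) can be sketched as follows. The scoring weights are hypothetical, not the formula in QAService.getCandidates:

```python
def score_candidates(doc_tokens, spans, query_terms, window=10):
    """Sketch of QA candidate scoring. `spans` are (start, end) token
    positions with `end` exclusive, as a Spans iterator would report.
    Each span is widened by `window` tokens on each side, then scored
    by keyword coverage plus match density (tighter spans score higher)."""
    query = set(query_terms)
    candidates = []
    for start, end in spans:
        lo = max(0, start - window)
        hi = min(len(doc_tokens), end + window)
        text = doc_tokens[lo:hi]
        hits = len(query & set(text))
        coverage = hits / len(query)       # fraction of keywords covered
        density = hits / (end - start)     # keywords per token of span width
        candidates.append((coverage + density, " ".join(text)))
    return sorted(candidates, reverse=True)
```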
21. Example QueryTermVector queryVec = new QueryTermVector(question, analyzer);
String [] terms = queryVec.getTerms();
int [] freqs = queryVec.getTermFrequencies();
SpanQuery [] clauses = new SpanQuery[terms.length];
for (int i = 0; i < terms.length; i++)
{
String term = terms[i];
SpanTermQuery termQuery = new SpanTermQuery(new Term("contents", term));
termQuery.setBoost(freqs[i]);
clauses[i] = termQuery;
}
SpanQuery query = new SpanNearQuery(clauses, 15, false);
Spans spans = query.getSpans(indexReader);
//Now loop through the spans and analyze the documents
From QAService.java in the demo
In this example, we take in the question and convert it to a QueryTermVector so that we can build an array of SpanQuerys out of the individual terms.
The SpanNearQuery has a window of 15 positions, which is arbitrary; depending on your goals, you may want to change it.
22. Tying It All Together To motivate the next slide (Building an IR Framework), here is some context on the things we do at CNLP and why we need an IR framework
Collection Manager – Provides information about the original files, where they are stored, status of different stages of processing, etc.
TextTagger – Provides rich NLP features across the seven levels of linguistic analysis (Phonological, Morphological, Lexical, Syntactic, Semantic, Discourse, and Pragmatic)
Extractor – Provides contextual info, co-occurrence information, etc. for a collection
QA – Provides question answering capabilities
L2L – Language to Logic – converts natural language questions into logical representation
Authoring -- Provides tools for customizing systems for domains
IRTools – discussed in next slide
23. Building an IR Framework Goals:
Support rapid experimentation and deployment
Configurable by non-programmers
Pluggable search engines (Lucene, Lucene LM, others)
Indexing and searching
Relevance Feedback (manual and automatic)
Multilingual support
Common search API across projects CNLP is a research-based organization with the need to be able to experiment with many different settings, etc. and to be able to rapidly build system to test out research hypotheses. Furthermore, we also have commercial clients who demand performance (both speed and quality)
Lucene LM – Language Modeling for Lucene -- http://ilps.science.uva.nl/Resources/
Needs to be configurable so Analysts can change properties, re-index and test new hypotheses.
24. Lucene IR Tools Write implementations for Interfaces
Indexing
Searching
Analysis
Wrap Filters and Tokenizers, providing Bean properties for configuration
Explanations
Use Digester to create
Uses reflection, but only during startup
In order to fit Lucene into IRTools, we had to wrap a few key items to make them easier to configure
During runtime, we take the wrapper and call an appropriate method to instantiate the real Filter or Tokenizer. Wrappers are reusable; the underlying TokenStream is not.
25. Sample Configuration Declare a Tokenizer:
<tokenizer name="standardTokenizer"
class="StandardTokenizerWrapper"/>
Declare a Token Filter:
<filter name="stop" class="StopFilterWrapper" ignoreCase="true" stopFile="stopwords.dat"/>
Declare an Analyzer:
<analyzer class="ConfigurableAnalyzer">
<name>test</name>
<tokenizer>standardTokenizer</tokenizer>
<filter>stop</filter>
</analyzer>
Can also use existing Lucene Analyzers
Sample Digester configuration. Class package names have been removed for formatting reasons.
If an analyst wants a different filter, they need only declare it and then include it in the appropriate Analyzer.
26. Case Studies Domain Specialization
AIR (Arabic Information Retrieval)
Commercial QA
Trouble Ticket Analysis
27. Domain Specialization At CNLP, we often work within specific domains
28. Domain Specialization We provide the SME (subject matter expert) with tools to tailor our out-of-the-box processing for their domain
Users often want to see NLP features like:
Phrase and entity identification
Phrases and entities in context of other terms
Entity categorizations
Occurrences and co-occurrences
Upon examination, users can choose to alter (or remove) how terms were processed
Benefits
Less ambiguity yields better results for searches and extractions
SME has a better understanding of the collection
SMEs will often change how categorizations are applied. For instance, if in their collection the word "bank" should always be categorized as "an area of land" instead of a "financial institution", they can tell our system to do so and then re-index, and all appropriate instances will be treated as an area of land.
29. KBB
KBB – Knowledge Base Builder
Uses Lucene to calculate much of the co-reference data. The SME inputs a collection, which is then run through TextTagger (our NLP tool). After it has been processed, we use Lucene to index the TextTagger output and to calculate the co-reference data.
30. Lucene and Domains Build an index of NLP features
Several Alternatives for domain analysis:
Store Term Vectors (including position and offset) and loop through selected documents
Custom Analyzer/filter produces Lucene tokens and calculates counts, etc.
Span Queries for certain features such as term co-occurrence
Index can contain mappings like:
Term -> co-occurrences w/ other terms
Part of Speech -> terms
Categorizations -> phrases
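The first mapping in the list, term to co-occurrences, can be computed directly from per-document term vectors. A sketch of the idea, where doc_term_vectors stands in for the term arrays an IndexReader's getTermFreqVector() would return for each document (co-occurrence here means "appear in the same document", the simplest definition):

```python
from collections import defaultdict

def cooccurrence_map(doc_term_vectors):
    """Build term -> {co-occurring term -> count} from per-document
    term lists. Counts accumulate across the collection."""
    cooc = defaultdict(lambda: defaultdict(int))
    for terms in doc_term_vectors:
        unique = sorted(set(terms))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                cooc[a][b] += 1  # symmetric relation, counted both ways
                cooc[b][a] += 1
    return cooc
```

A position-aware variant would restrict counting to terms within some window, which is where the stored position information (or Span Queries) comes in.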
31. Arabic Info. Retrieval (AIR)
32. AIR and Lucene AIR uses IR Tools
Wrote several different filters for Arabic and English to provide stemming and phrase identification
One index per language
Do query analysis on both the Source and Target language
Easily extendable to other languages
Wrote several Arabic stemmers to experiment with both light and aggressive stemming, as well as for looking up whether a token is a phrase
One Index Per Language
Often on the mailing list there are questions about whether to store everything in one index or in multiple indexes. We chose multiple because IRTools makes the simplifying assumption that each index is monolingual, which allows for easier management.
34. QA for CRM Provide NLP based QA for use in CRM applications
Index knowledge based documents, plus FAQs and PAQs
Uses Lucene to identify passages which are then processed by QA system to determine answers
Index contains rich NLP features
Set boosts on Query according to importance of term in query. Boosts determined by our query analysis system (L2L)
We use the score from Lucene as a factor in scoring answers
L2L
http://www.cnlp.org/cnlp.asp?m=4&sm=24
35. Trouble Ticket Analysis Client has Trouble Tickets (TT) which are a combination of:
Fixed domain fields such as checkboxes
Free text, often written using shorthand
Objective:
Use NLP to discover patterns and trends across TT collection enabling better categorization
36. Using Lucene for Trouble Tickets Write custom Filters and Tokenizers to process shorthand
Tokenizer:
Certain Symbols and whitespace delineate tokens
Filters
Standardize shorthand
Addresses, dates, equipment and structure tags
Regular expression stopword removal
Use high frequency terms to identify lexical clues for use in trend identification by TextTagger
Browsing collection to gain an understanding of domain
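The tokenizer-plus-filters pipeline described above might look like the following sketch. All patterns here are hypothetical stand-ins for the client's actual shorthand conventions, which are not public:

```python
import re

# Hypothetical shorthand patterns
DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
EQUIPMENT = re.compile(r"^[A-Z]{2,4}-\d+$")         # e.g. "XFMR-102"
REGEX_STOPWORDS = re.compile(r"^(re|fwd|tkt)$", re.IGNORECASE)

def normalize_tokens(text):
    """Sketch of the trouble-ticket pipeline: split on symbols and
    whitespace, replace recognizable shorthand with standard tags, and
    drop tokens matching regular-expression stopwords."""
    tokens = [t for t in re.split(r"[\s;:,]+", text) if t]
    out = []
    for tok in tokens:
        if REGEX_STOPWORDS.match(tok):
            continue                       # regex stopword removal
        if DATE.match(tok):
            out.append("<DATE>")           # standardize dates
        elif EQUIPMENT.match(tok):
            out.append("<EQUIP>")          # standardize equipment tags
        else:
            out.append(tok.lower())
    return out
```

In the actual system these steps would be implemented as a Lucene Tokenizer and TokenFilter chain so the same normalization applies at index and query time.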
37. Resources Demos available
http://lucene.apache.org
Mailing List
java-user@lucene.apache.org
CNLP
http://www.cnlp.org
http://www.cnlp.org/apachecon2005
Lucene In Action
http://www.lucenebook.com
Special thanks to:
Ozgur Yilmazel, Niranjan Balasubramanian, Steve Rowe and Svetlana Symonenko from CNLP!