1. Advanced Lucene Grant Ingersoll
Ozgur Yilmazel
Niranjan Balasubramanian
Steve Rowe
Svetlana Symonenko
Center for Natural Language Processing
ApacheCon 2005
December 12, 2005
2. Overview CNLP Intro
Quick Lucene Refresher
Term Vectors
Span Queries
Building an IR Framework
Case Studies from CNLP
Collection analysis for domain specialization
Crosslingual/multilingual retrieval in Arabic, English and Dutch
Sublanguage analysis for commercial trouble ticket analysis
Passage retrieval and analysis for Question Answering application
Resources
3. Center for Natural Language Processing Self-supporting university research and development center
Experienced, multidisciplinary team of information scientists, computer scientists, and linguists
Many with substantial commercial software experience
Research, development & licensing of NLP-based technology for government & industry
ARDA, DARPA, NSA, CIA, DHS, DOJ, NIH
Raytheon, SAIC, Boeing, ConEd, MySentient, Unilever
For numerous applications:
Document Retrieval
Question-Answering
Information Extraction / Text Mining
Automatic Metadata Generation
Cross-Language Information Retrieval
60+ projects in past 6 years
4. Lucene Refresher Indexing
A Document is a collection of Fields
A Field is free text, keywords, dates, etc.
A Field can have several characteristics
indexed, tokenized, stored, term vectors
Apply Analyzer to alter Tokens during indexing
Stemming
Stopword removal
Phrase identification
For more details on Lucene, go to http://lucene.apache.org
5. Lucene Refresher Searching
Uses a modified Vector Space Model (VSM)
We convert a user’s query into an internal representation that can be searched against the Index
Queries are usually analyzed in the same manner as the index
Get back Hits and use in an application
To learn more about how Lucene scores documents, see the Similarity class in the Lucene javadocs
6. Lucene Demo++ Based on original Lucene demo
Sample, working code for:
Search
Relevance Feedback
Span Queries
Collection Analysis
Candidate Identification for QA
Requires: Lucene 1.9 RC1-dev
Available at http://www.cnlp.org/apachecon2005
In the demo, we took the code from the original demo and set it up as a Servlet that forwards to a JSP, instead of having all the code in the JSP.
Highlights of changes and additions:
Indexing:
Extended HTMLDocument called TVHTMLDocument
The contents field stores the Term Vector, including the position and offset information
The title field stores the Term Vector without position or offset info.
Searching:
Added checkboxes and submit button to the results page to allow the user to select relevant documents for use in relevance feedback
Added code to do basic relevance feedback.
Term Vectors
See TermVectorTest for simple TermVector usages
See IndexAnalysis for simple term co-occurrence calculations
Span Query
See SpanQueryTest for some simple examples
QA
Basic Candidate identification is done in the QAService class
All examples are written against the Lucene Subversion trunk as of October. Some code may be subject to change due to new patches.
7. Term Vectors Relevance Feedback and “More Like This”
Domain specialization
Clustering
Cosine similarity between two documents
Highlighter
Needs offset info
Cosine Similarity
People on the mailing list are often wondering how to get the cosine similarity between two documents
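That mailing-list question has a short answer once you have the parallel term and frequency arrays that a TermFreqVector exposes. A minimal sketch in Python (the parallel-array shape mirrors what getTerms() and getTermFrequencies() return; this is an illustration, not code from the demo):

```python
import math

def cosine_similarity(terms_a, freqs_a, terms_b, freqs_b):
    """Cosine similarity between two term-frequency vectors given as
    parallel arrays, the shape Lucene's TermFreqVector exposes."""
    a = dict(zip(terms_a, freqs_a))
    b = dict(zip(terms_b, freqs_b))
    # Dot product over the terms the two documents share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A raw-frequency weighting is used here for brevity; in practice you would usually weight by TF-IDF before taking the cosine.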
8. In Lucene, a TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance
As a tuple:
termFreq = <term, term count_D>
<fieldName, <…, termFreq_i, termFreq_i+1, …>>
As Java:
public String getField();
public String[] getTerms();
public int[] getTermFrequencies();
Lucene Term Vectors (TV)
Term Vectors were officially added in Lucene 1.4
Look at the TermFreqVector interface definition
getTerms() and getTermFrequencies() are parallel arrays: the term getTerms()[i] occurs getTermFrequencies()[i] times in the field
9. Creating Term Vectors During indexing, create a Field that stores Term Vectors:
new Field("title", parser.getTitle(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES);
Options are:
Field.TermVector.YES
Field.TermVector.NO
Field.TermVector.WITH_POSITIONS – Token Position
Field.TermVector.WITH_OFFSETS – Character offsets
Field.TermVector.WITH_POSITIONS_OFFSETS
See the IndexHTML and TVHTMLDocument classes in org.cnlp.apachecon.index for samples
Token positions in Lucene are increments relative to previous tokens. Two tokens can have the same position, and Analyzers can set the position as desired, so two tokens that are adjacent in the original text need not differ in position by exactly one.
Field constructors that don't accept a TermVector parameter assume Field.TermVector.NO
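The increment behavior described above can be sketched as follows. The token stream here is hypothetical, but the accumulation shows how per-token increments become the absolute positions stored in the index (an increment of 0 stacks a synonym on the previous token; an increment greater than 1 leaves a gap where a stopword was removed):

```python
def assign_positions(tokens_with_increments):
    """Turn (term, positionIncrement) pairs into absolute positions,
    the way Lucene accumulates increments during indexing."""
    positions = []
    pos = -1  # position before the first token
    for term, incr in tokens_with_increments:
        pos += incr
        positions.append((term, pos))
    return positions

# Hypothetical stream: "fast" is a synonym injected at the same position
# as "quick" (increment 0); a stopword dropped before "fox" leaves a gap
# (increment 2).
stream = [("quick", 1), ("fast", 0), ("fox", 2)]
```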
10. Accessing Term Vectors Term Vectors are acquired from the IndexReader using:
TermFreqVector getTermFreqVector(int docNumber,
String field)
TermFreqVector[] getTermFreqVectors(int docNumber)
Can be cast to TermPositionVector if the vector was created with offset or position information
TermPositionVector API:
int[] getTermPositions(int index);
TermVectorOffsetInfo[] getOffsets(int index);
See TermVectorTest in the example code
See SearchServlet in org.cnlp.apachecon.search for examples
TermPositionVector
Depending on construction options, positions or offsets may be null
11. Relevance Feedback Expand the original query using terms from documents
Manual Feedback:
User selects which documents are most relevant and, optionally, non-relevant
Get the terms from the term vector for each of the documents and construct a new query
Can apply boosts based on term frequencies
Automatic Feedback
Application assumes the top X documents are relevant and the bottom Y are non-relevant and constructs a new query based on the terms in those documents
See Modern Information Retrieval by Baeza-Yates, et al. for an in-depth discussion of feedback
See the performFeedback() method in SearchServlet
It is not possible to do "pure" negative feedback in Lucene, as it does not support negative boost values for terms. You can, however, decrement the boost of a term, or remove it entirely if it appears in one of the negative examples.
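The manual-feedback steps above (take the top terms from the relevant documents' term vectors, boost by frequency, and drop terms seen in negative examples since Lucene has no negative boosts) can be sketched as follows. The weighting is illustrative, not the demo's actual performFeedback() code:

```python
from collections import Counter

def expand_query(query_terms, relevant_vectors, negative_vectors, top_n=10):
    """Rocchio-style expansion sketch. Each vector is a (terms, freqs)
    pair of parallel arrays, as a TermFreqVector would supply."""
    pos = Counter()
    for terms, freqs in relevant_vectors:
        pos.update(dict(zip(terms, freqs)))
    neg = set()
    for terms, _ in negative_vectors:
        neg.update(terms)
    # Original query terms keep a neutral boost of 1.0
    expanded = {t: 1.0 for t in query_terms}
    for term, freq in pos.most_common():
        if len(expanded) >= len(query_terms) + top_n:
            break
        if term in neg or term in expanded:
            continue  # drop negative-example terms instead of negative-boosting
        expanded[term] = float(freq)  # boost by frequency in relevant docs
    return expanded
```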
12. Example From Demo, SearchServlet.java
Code to get the top X terms from a TV
protected Collection getTopTerms(TermFreqVector tfv, int numTermsToReturn)
{
String[] terms = tfv.getTerms();//get the terms
int [] freqs = tfv.getTermFrequencies();//get the frequencies
List result = new ArrayList(terms.length);
for (int i = 0; i < terms.length; i++)
{
//create a container for the Term and Frequency information
result.add(new TermFreq(terms[i], freqs[i]));
}
Collections.sort(result, comparator);//sort by frequency
if (numTermsToReturn < result.size())
{
result = result.subList(0, numTermsToReturn);
}
return result;
}
Taken from org.cnlp.apachecon.servlets.SearchServlet
Examples of using Position and Offset info are in org.cnlp.apachecon.index.IndexAnalysis
Take a look at the running demo
13. Span Queries Provide info about where a match took place within a document
SpanTermQuery is the building block for more complicated queries
Other SpanQuery classes:
SpanFirstQuery, SpanNearQuery, SpanNotQuery, SpanOrQuery, SpanRegexQuery
Take a look at the javadocs for SpanQuery, Spans, and the derived classes
In the demo code, see SpanQueryTest. There are also many good test cases in the Lucene code
14. Spans The Spans object provides document and position info about the current match
From SpanQuery:
Spans getSpans(IndexReader reader)
Interface definitions:
boolean next() //Move to the next match
int doc()//the doc id of the match
int start()//The match start position
int end() //The match end position
boolean skipTo(int target) //skip to a doc
From Spans.java in the Lucene source
skipTo() skips to the next match with a doc id greater than or equal to target
15. Phrase Matching SpanNearQuery provides functionality similar to PhraseQuery
Use position distance instead of edit distance
Advantages:
Less slop is required to match terms
Can be built up of other SpanQuery instances
Edit distance is the number of moves it takes to get from one term to the other
Position distance is the difference between the positions of the two terms
Less slop allows for more precision
PhraseQuery requires at least two extra positions of slop in order to capture reversed terms
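A simplified sketch of the position-distance idea for the two-term case. Real SpanNearQuery slop accounting over multiple clauses is more involved, so treat this as an approximation of the semantics, not the implementation:

```python
def span_near_two(pos_a, pos_b, slop, in_order):
    """Do any occurrences of term a and term b fall within `slop`
    intervening token positions of each other? pos_a/pos_b are lists of
    token positions; adjacent tokens have a gap of 0, so the phrase
    "New York" matches with slop 0."""
    for a in pos_a:
        for b in pos_b:
            if in_order:
                gap = b - a - 1          # tokens between a and b, a first
                if 0 <= gap <= slop:
                    return True
            else:
                gap = abs(b - a) - 1     # either order allowed
                if gap <= slop:
                    return True
    return False
```

Note the contrast with PhraseQuery's edit-distance slop: here a reversed adjacent pair matches an unordered query with slop 0, while PhraseQuery would need slop 2 to capture the reversal.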
16. Example SpanTermQuery section = new SpanTermQuery(new Term("test", "section"));
SpanTermQuery lucene = new SpanTermQuery(new Term("test", "Lucene"));
SpanTermQuery dev = new SpanTermQuery(new Term("test", "developers"));
SpanFirstQuery first = new SpanFirstQuery(section, 2);
Spans spans = first.getSpans(indexReader); //do something with the spans
SpanQuery[] clauses = {lucene, dev};
SpanNearQuery near = new SpanNearQuery(clauses, 2, true);
spans = near.getSpans(indexReader); //do something with the spans
clauses = new SpanQuery[]{section, near};
SpanNearQuery allNear = new SpanNearQuery(clauses, 3, false);
spans = allNear.getSpans(indexReader); //do something with the spans
See SpanQueryTest.java in the demo for other alternatives
17. What is QA? Return an answer, not just a list of hits that the user has to sift through
Often built on top of IR Systems:
IR: Query usually has the terms you are interested in
QA: Query usually does not have the terms you are interested in
Question ambiguity:
Who is the President?
Future of search?!?
IR Systems:
Take the user's question, do a straight IR search along with some question analysis, then post-process the IR hits to determine if they satisfy the needs of the question
Question Ambiguity:
Who is the president of what? Of the United States? President of some company?
Questions can be phrased many different ways but all be looking for the same kind of thing:
Who is the founder of Lucene? -- Person
What is the name of the person who founded Lucene? -- Person
Also depends on what the indexed collection contains
Future of Search:
Most users only look at the first page of results, and of those, will probably only click through to a hit if the relevant section(s) look good. So why not try to build a system that determines whether the relevant section is good?
18. QA Needs Candidate Identification
Determine Question types and focus
Person, place, yes/no, number, etc.
Document Analysis
Determine if candidates align with the question
Answering non-collection based questions
Results display
We will focus on Candidate Identification in the context of Lucene, but the other items above are things to think about in the context of a real QA system
Answer Types:
How many committers are there on the Lucene project? -- Looking for a number
Who are the committers on the Lucene project? -- List of people
Is Doug Cutting the founder of Lucene? -- Yes/no
Document Analysis can be quite in-depth, involving extracting pertinent bits from a document and seeing if they align with the type of answer we are looking for.
Results Display -- do you simply return the results, or do you try to generate yes/no answers, etc.?
See "Finding Answers to Complex Questions" by Diekema, et al. in New Directions in Question Answering, edited by Mark T. Maybury, for background
19. Candidate Identification for QA A candidate is a potential answer for a question
Can be as short as a word or as large as multiple documents, based on system goals
Example:
Q: How many Lucene properties can be tuned?
Candidate:
Lucene has a number of properties that can be tuned.
20. Span Queries and QA Retrieve candidates using SpanNearQuery, built out of the keywords of the user’s question
Expand to include window larger than span
Score the sections based on system criteria
Keywords, distances, ordering, others
Sample code in QAService.java in the demo
See QAService.getCandidates in the demo code. In this code we execute the span query, determine the relevant sections by reconstructing the document, and then score the results. We use a SpanNearQuery with a fairly large slop factor.
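The candidate pipeline above (span match, expand to a surrounding window, score by keywords and distances) can be sketched as follows. The scoring weights are hypothetical, not the formula in QAService.getCandidates:

```python
def score_candidates(doc_tokens, spans, query_terms, window=10):
    """Sketch of QA candidate scoring. `spans` are (start, end) token
    positions with `end` exclusive, as a Spans iterator would report.
    Each span is widened by `window` tokens on each side, then scored
    by keyword coverage plus match density (tighter spans score higher)."""
    query = set(query_terms)
    candidates = []
    for start, end in spans:
        lo = max(0, start - window)
        hi = min(len(doc_tokens), end + window)
        text = doc_tokens[lo:hi]
        hits = len(query & set(text))
        coverage = hits / len(query)       # fraction of keywords covered
        density = hits / (end - start)     # keywords per token of span width
        candidates.append((coverage + density, " ".join(text)))
    return sorted(candidates, reverse=True)
```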
21. Example QueryTermVector queryVec = new QueryTermVector(question, analyzer);
String [] terms = queryVec.getTerms();
int [] freqs = queryVec.getTermFrequencies();
SpanQuery [] clauses = new SpanQuery[terms.length];
for (int i = 0; i < terms.length; i++)
{
String term = terms[i];
SpanTermQuery termQuery = new SpanTermQuery(new Term("contents", term));
termQuery.setBoost(freqs[i]);
clauses[i] = termQuery;
}
SpanQuery query = new SpanNearQuery(clauses, 15, false);
Spans spans = query.getSpans(indexReader);
//Now loop through the spans and analyze the documents
From QAService.java in the demo
In this example, we take in the question and convert it to a QueryTermVector so that we can build an array of SpanQuerys out of the individual terms.
The SpanNearQuery has a window of 15 positions, which is arbitrary; depending on your goals, you may want to change it.
22. Tying It All Together To motivate the next slide (Building an IR Framework), here is some context on the things we do at CNLP and why we need an IR framework
Collection Manager – Provides information about the original files, where they are stored, status of different stages of processing, etc.
TextTagger – Provides rich NLP features across the seven levels of linguistic analysis (Phonological, Morphological, Lexical, Syntactic, Semantic, Discourse, and Pragmatic)
Extractor – Provides contextual info, co-occurrence information, etc. for a collection
QA – Provides question answering capabilities
L2L – Language to Logic – converts natural language questions into logical representation
Authoring -- Provides tools for customizing systems for domains
IRTools – discussed in next slide
23. Building an IR Framework Goals:
Support rapid experimentation and deployment
Configurable by non-programmers
Pluggable search engines (Lucene, Lucene LM, others)
Indexing and searching
Relevance Feedback (manual and automatic)
Multilingual support
Common search API across projects CNLP is a research-based organization with the need to be able to experiment with many different settings, etc. and to be able to rapidly build system to test out research hypotheses. Furthermore, we also have commercial clients who demand performance (both speed and quality)
Lucene LM – Language Modeling for Lucene -- http://ilps.science.uva.nl/Resources/
Needs to be configurable so Analysts can change properties, re-index and test new hypotheses.
24. Lucene IR Tools Write implementations for Interfaces
Indexing
Searching
Analysis
Wrap Filters and Tokenizers, providing Bean properties for configuration
Explanations
Use Digester to create
Uses reflection, but only during startup
In order to fit Lucene into IRTools, we had to wrap a few key items to make them easier to configure
During runtime, we take the wrapper and call an appropriate method to instantiate the real Filter or Tokenizer. Wrappers are reusable; the underlying TokenStream is not.
25. Sample Configuration Declare a Tokenizer:
<tokenizer name="standardTokenizer"
class="StandardTokenizerWrapper"/>
Declare a Token Filter:
<filter name="stop" class="StopFilterWrapper" ignoreCase="true" stopFile="stopwords.dat"/>
Declare an Analyzer:
<analyzer class="ConfigurableAnalyzer">
<name>test</name>
<tokenizer>standardTokenizer</tokenizer>
<filter>stop</filter>
</analyzer>
Can also use existing Lucene Analyzers
Sample Digester configuration. Class package names have been removed for formatting reasons.
If an analyst wants a different filter, they need only declare it and then include it in the appropriate Analyzer.
26. Case Studies Domain Specialization
AIR (Arabic Information Retrieval)
Commercial QA
Trouble Ticket Analysis
27. Domain Specialization At CNLP, we often work within specific domains
28. Domain Specialization We provide the SME (subject matter expert) with tools to tailor our out-of-the-box processing for their domain
Users often want to see NLP features like:
Phrase and entity identification
Phrases and entities in context of other terms
Entity categorizations
Occurrences and co-occurrences
Upon examination, users can choose to alter (or remove) how terms were processed
Benefits
Less ambiguity yields better results for searches and extractions
SME has a better understanding of the collection
SMEs will often change how categorizations are applied. For instance, if in their collection the word "bank" should always be categorized as "an area of land" instead of a "financial institution", they can tell our system to do so and then re-index, and all appropriate instances will be treated as an area of land.
29. KBB
KBB – Knowledge Base Builder
Uses Lucene to calculate much of the co-reference data. The SME inputs a collection, which is then run through TextTagger (our NLP tool). After it has been processed, we use Lucene to index the TextTagger output and to calculate the co-reference data.
30. Lucene and Domains Build an index of NLP features
Several Alternatives for domain analysis:
Store Term Vectors (including position and offset) and loop through selected documents
Custom Analyzer/filter produces Lucene tokens and calculates counts, etc.
Span Queries for certain features such as term co-occurrence
Index can contain mappings like:
Term -> co-occurrences w/ other terms
Part of Speech -> terms
Categorizations -> phrases
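The first mapping in the list, term to co-occurrences, can be computed directly from per-document term vectors. A sketch of the idea, where doc_term_vectors stands in for the term arrays an IndexReader's getTermFreqVector() would return for each document (co-occurrence here means "appear in the same document", the simplest definition):

```python
from collections import defaultdict

def cooccurrence_map(doc_term_vectors):
    """Build term -> {co-occurring term -> count} from per-document
    term lists. Counts accumulate across the collection."""
    cooc = defaultdict(lambda: defaultdict(int))
    for terms in doc_term_vectors:
        unique = sorted(set(terms))
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                cooc[a][b] += 1  # symmetric relation, counted both ways
                cooc[b][a] += 1
    return cooc
```

A position-aware variant would restrict counting to terms within some window, which is where the stored position information (or Span Queries) comes in.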
31. Arabic Info. Retrieval (AIR)
32. AIR and Lucene AIR uses IR Tools
Wrote several different filters for Arabic and English to provide stemming and phrase identification
One index per language
Do query analysis on both the Source and Target language
Easily extendable to other languages
Wrote several Arabic stemmers to experiment with both light and aggressive stemming, as well as for looking up whether a token is a phrase
One Index Per Language
Often on the mailing list there are questions about whether to store everything in one index or in multiple indexes. We chose multiple because IRTools makes the simplifying assumption that each index is monolingual, which allows for easier management.
34. QA for CRM Provide NLP based QA for use in CRM applications
Index knowledge based documents, plus FAQs and PAQs
Uses Lucene to identify passages which are then processed by QA system to determine answers
Index contains rich NLP features
Set boosts on Query according to importance of term in query. Boosts determined by our query analysis system (L2L)
We use the score from Lucene as a factor in scoring answers
L2L
http://www.cnlp.org/cnlp.asp?m=4&sm=24
35. Trouble Ticket Analysis Client has Trouble Tickets (TT) which are a combination of:
Fixed domain fields such as checkboxes
Free text, often written using shorthand
Objective:
Use NLP to discover patterns and trends across TT collection enabling better categorization
36. Using Lucene for Trouble Tickets Write custom Filters and Tokenizers to process shorthand
Tokenizer:
Certain Symbols and whitespace delineate tokens
Filters
Standardize shorthand
Addresses, dates, equipment and structure tags
Regular expression stopword removal
Use high frequency terms to identify lexical clues for use in trend identification by TextTagger
Browsing collection to gain an understanding of domain
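The tokenizer-plus-filters pipeline described above might look like the following sketch. All patterns here are hypothetical stand-ins for the client's actual shorthand conventions, which are not public:

```python
import re

# Hypothetical shorthand patterns
DATE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
EQUIPMENT = re.compile(r"^[A-Z]{2,4}-\d+$")         # e.g. "XFMR-102"
REGEX_STOPWORDS = re.compile(r"^(re|fwd|tkt)$", re.IGNORECASE)

def normalize_tokens(text):
    """Sketch of the trouble-ticket pipeline: split on symbols and
    whitespace, replace recognizable shorthand with standard tags, and
    drop tokens matching regular-expression stopwords."""
    tokens = [t for t in re.split(r"[\s;:,]+", text) if t]
    out = []
    for tok in tokens:
        if REGEX_STOPWORDS.match(tok):
            continue                       # regex stopword removal
        if DATE.match(tok):
            out.append("<DATE>")           # standardize dates
        elif EQUIPMENT.match(tok):
            out.append("<EQUIP>")          # standardize equipment tags
        else:
            out.append(tok.lower())
    return out
```

In the actual system these steps would be implemented as a Lucene Tokenizer and TokenFilter chain so the same normalization applies at index and query time.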
37. Resources Demos available
http://lucene.apache.org
Mailing List
java-user@lucene.apache.org
CNLP
http://www.cnlp.org
http://www.cnlp.org/apachecon2005
Lucene In Action
http://www.lucenebook.com
Special thanks to:
Ozgur Yilmazel, Niranjan Balasubramanian, Steve Rowe and Svetlana Symonenko from CNLP!