Day 14 Information Retrieval Question Answering
TREC • TREC – Text REtrieval Conference • Administered by the National Institute of Standards and Technology (NIST) • Held annually since 1992 • First conference included leading text retrieval groups at UMass, City University London, Cornell, and a smattering of industry groups.
The Aquaint Corpus • The corpus used by TREC from which answers are drawn. • LDC2002T31 (on patas) • Newswire from three sources: • Xinhua News Service (People's Republic of China) • New York Times News Service • Associated Press Worldstream News Service • Not current: years 1996-2000 for Xinhua, 1998-2000 for NYT and AP • For TREC competition: assumed current
TREC QA track • Three types of questions in TREC QA track: • Factoid • List • Other • All clustered into topics
Question Answering (QA) • Uses IR and IE techniques (and more…) • Questions posed in Natural Language • Who was Genghis Khan? • What songs did Barry Manilow compose? • What countries fly the F-16? • When was James Dean born? • What does Park Jae-sang sing? • Answers retrieved from a collection of documents (or a database, or the Web)
Designing a QA System • Start with a question: • Who won the Nobel Peace Prize in 1991? • Assume you have a search engine API at your disposal • Need to return the answer: • Aung San Suu Kyi • Aung San Suu Kyi won the Nobel Peace Prize in 1991 • What do you do?
Designing a QA System Who won the Nobel Peace Prize in 1991? But many foreign investors remain sceptical, and western governments are withholding aid because of the Slorc's dismal human rights record and the continued detention of Ms Aung San Suu Kyi, the opposition leader who won the Nobel Peace Prize in 1991. The military junta took power in 1988 as pro-democracy demonstrations were sweeping the country. It held elections in 1990, but has ignored their result. It has kept the 1991 Nobel peace prize winner, Aung San Suu Kyi - leader of the opposition party which won a landslide victory in the poll - under house arrest since July 1989. The regime, which is also engaged in a battle with insurgents near its eastern border with Thailand, ignored a 1990 election victory by an opposition party and is detaining its leader, Ms Aung San Suu Kyi, who was awarded the 1991 Nobel Peace Prize. According to the British Red Cross, 5,000 or more refugees, mainly the elderly and women and children, are crossing into Bangladesh each day.
Designing a QA System • Assume: • The search engine returns • Snippets, and, • Documents • Documents • Are in English • Contain passages of interest • Not all documents will have the answer • The sky’s the limit with respect to tools, resources, time, etc.
An Example Who won the Nobel Peace Prize in 1991? But many foreign investors remain sceptical, and western governments are withholding aid because of the Slorc's dismal human rights record and the continued detention of Ms Aung San Suu Kyi, the opposition leader who won the Nobel Peace Prize in 1991. The military junta took power in 1988 as pro-democracy demonstrations were sweeping the country. It held elections in 1990, but has ignored their result. It has kept the 1991 Nobel peace prize winner, Aung San Suu Kyi - leader of the opposition party which won a landslide victory in the poll - under house arrest since July 1989. The regime, which is also engaged in a battle with insurgents near its eastern border with Thailand, ignored a 1990 election victory by an opposition party and is detaining its leader, Ms Aung San Suu Kyi, who was awarded the 1991 Nobel Peace Prize. According to the British Red Cross, 5,000 or more refugees, mainly the elderly and women and children, are crossing into Bangladesh each day.
Question Answering (QA) • For a QA system to work, we need to • Find documents that may contain the answer • Form a search engine query from the original question • Find passages within the documents that may contain the answer • What is the “type” of the answer? • Determine what kind of answer is expected (query classification) • Extract the answer from the relevant passage(s) • Repeated occurrences of a candidate may reinforce it • Return the answer
A Generic QA Framework • Passage extractor needed too
Steps for the UWCLMAQA System • Query Analysis • Query Processing (some additional steps) • Document Selection • Passage Extraction & Ranking • Answer Extraction • “Unit” evaluation done at each step
Query Analysis • Grouped questions into types • Purpose: Determine what the answer will look like • Categorized by enhanced UIUC scheme (UIUC: http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/definition.html): • Abbreviation • Description • Entity • Human • Location – Country, State, City • Numeric – Date, Measure
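To make the idea of answer typing concrete, here is a minimal rule-based sketch in Python. It is not the classifier used in the system described here; the patterns, the `RULES` table, and the `classify_question` name are all hypothetical, and only the category labels come from the list above.

```python
import re

# Hypothetical keyword rules mapping question patterns to coarse answer types.
# The labels follow the categories listed above; the patterns are illustrative only.
RULES = [
    (re.compile(r"^\s*who\b", re.I), "Human"),
    (re.compile(r"^\s*when\b", re.I), "Numeric:Date"),
    (re.compile(r"^\s*where\b", re.I), "Location"),
    (re.compile(r"^\s*how\s+(many|much|long|far|tall|old)\b", re.I), "Numeric:Measure"),
    (re.compile(r"^\s*what\s+(country|countries)\b", re.I), "Location:Country"),
    (re.compile(r"^\s*what\s+(city|cities)\b", re.I), "Location:City"),
    (re.compile(r"\bstand for\b", re.I), "Abbreviation"),
]


def classify_question(question: str) -> str:
    """Return the label of the first matching rule, or 'Entity' as a fallback."""
    for pattern, label in RULES:
        if pattern.search(question):
            return label
    return "Entity"


if __name__ == "__main__":
    for q in ["Who won the Nobel Peace Prize in 1991?",
              "When was James Dean born?",
              "What countries fly the F-16?"]:
        print(q, "->", classify_question(q))
```

A handful of surface patterns like these will misfire on many questions (for example, "Who was Genghis Khan?" arguably wants a description rather than a name), which is one reason a trained classifier is usually preferred.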
Alternative Strategy: Query Analysis and Rewrite • Intuition: The user’s question is often syntactically quite close to sentences that contain the answer • Where is the Louvre Museum located? • The Louvre Museum is located in Paris • Who created the character of Scrooge? • Charles Dickens created the character of Scrooge.
Alternative Strategy: Query Analysis and Rewrite • Hand-craft category-specific transformation rules, e.g.: • “Where is the Louvre Museum located?” becomes • “is the Louvre Museum located” • “the is Louvre Museum located” • “the Louvre is Museum located” • “the Louvre Museum is located” • “the Louvre Museum located is” • Search for all permutations
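The "Where is ..." rule above can be sketched as a small Python function that drops the question word and slides the verb through every position of the remaining words. The function name `rewrite_where_is` is hypothetical, and the sketch covers only this single rule, not a full set of category-specific transformations.

```python
from typing import List


def rewrite_where_is(question: str) -> List[str]:
    """Generate rewrites of a 'Where is X ...?' question by moving 'is'.

    Returns an empty list if the question does not start with 'Where is'.
    """
    tokens = question.rstrip("?").split()
    if len(tokens) < 3 or tokens[0].lower() != "where" or tokens[1].lower() != "is":
        return []
    verb, rest = tokens[1], tokens[2:]
    # Insert the verb before each remaining word, and finally after the last one.
    return [" ".join(rest[:i] + [verb] + rest[i:]) for i in range(len(rest) + 1)]


if __name__ == "__main__":
    for candidate in rewrite_where_is("Where is the Louvre Museum located?"):
        print(candidate)
```

Most of the generated strings are ungrammatical, but each one is cheap to issue as a phrase query, and the grammatical variant ("the Louvre Museum is located") is the one likely to match answer-bearing sentences.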
Query Processing • Basic process: • Extracted the question • Appended the topic • “Web boosted” the query • Ran the query against Lucene
Query Processing • Web boosting strategy • Supplied question and topic to Google API • Results were • Stop-worded, query terms removed • Ranked by frequency • 5 most frequent terms added to Lucene query
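The term-selection part of the web boosting step can be sketched as follows, assuming the web results are already available as a list of snippet strings. The stop-word list, the tokenizer, and the `boost_terms` name are assumptions for illustration; the original system's stop list and Google API calls are not shown in the slides.

```python
import re
from collections import Counter
from typing import Iterable, List

# A tiny illustrative stop-word list (an assumption, not the system's list).
STOPWORDS = {
    "the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "on", "by",
    "with", "who", "what", "when", "where", "which", "did", "does", "that",
    "it", "as", "at",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def boost_terms(question: str, topic: str, snippets: Iterable[str], k: int = 5) -> List[str]:
    """Return the k most frequent snippet terms that are neither stop words
    nor already present in the question or topic."""
    exclude = STOPWORDS | set(tokenize(question)) | set(tokenize(topic))
    counts = Counter(
        token
        for snippet in snippets
        for token in tokenize(snippet)
        if token not in exclude
    )
    return [term for term, _ in counts.most_common(k)]


if __name__ == "__main__":
    snippets = [
        "Aung San Suu Kyi won the Nobel Peace Prize",
        "Suu Kyi was awarded the prize in 1991",
        "Burma opposition leader Suu Kyi",
    ]
    print(boost_terms("Who won the Nobel Peace Prize in 1991?", "Nobel Peace Prize", snippets))
```

The returned terms would then be appended to the Lucene query alongside the question and topic terms.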
Document Selection • Lucene returned top 1,000 documents • Took top 3 for Factoid, Top 25 for List • (Hook for reranking provided, but not implemented.) • Our doc retrieval performance for 2005 Qs: • F-measure - .3517 n=3, .3620 n=1 • Mean 2005: .2958 • Max (LCC) 2005: .7920
Passage Extraction & Ranking • From top documents, extracted relevant paragraphs • Paragraphs ranked by tf/idf: • tf = 1+log(word frequency in paragraph) • idf = log(total doc count/# docs containing word) • total doc count = # docs by day by news source • tf/idf score normalized by paragraph length
Passage Ranking • tf/idf multiplied by count of query terms in paragraph (giving them more weight) • 10 paragraphs returned for factoids • 45 paragraphs returned for lists
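The paragraph scoring described above can be sketched in Python. The slides do not say exactly which words are scored or how query-term matches are counted, so this sketch assumes that tf/idf is summed over the query terms found in the paragraph, that the multiplier is the number of distinct query terms matched, and that document frequencies are supplied as a precomputed dictionary. All function and parameter names are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_paragraph(paragraph: str, query_terms: Sequence[str],
                    doc_freq: Dict[str, int], total_docs: int) -> float:
    """Length-normalized tf/idf over matched query terms, multiplied by the
    number of distinct query terms that occur in the paragraph."""
    tokens = tokenize(paragraph)
    if not tokens:
        return 0.0
    score = 0.0
    matched = 0
    for term in set(query_terms):
        freq = tokens.count(term)
        df = doc_freq.get(term, 0)
        if freq == 0 or df == 0:
            continue
        tf = 1 + math.log(freq)            # tf = 1 + log(word frequency in paragraph)
        idf = math.log(total_docs / df)    # idf = log(total doc count / docs containing word)
        score += tf * idf
        matched += 1
    return matched * score / len(tokens)


def rank_paragraphs(paragraphs: Sequence[str], query_terms: Sequence[str],
                    doc_freq: Dict[str, int], total_docs: int,
                    top_n: int = 10) -> List[Tuple[float, str]]:
    """Return the top_n (score, paragraph) pairs, best first."""
    scored = [(score_paragraph(p, query_terms, doc_freq, total_docs), p) for p in paragraphs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

With `top_n=10` this mirrors the factoid setting, and `top_n=45` the list setting.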
Passage Extraction & Ranking Who won the Nobel Peace Prize in 1991? But many foreign investors remain sceptical, and western governments are withholding aid because of the Slorc's dismal human rights record and the continued detention of Ms Aung San Suu Kyi, the opposition leader who won the Nobel Peace Prize in 1991. The military junta took power in 1988 as pro-democracy demonstrations were sweeping the country. It held elections in 1990, but has ignored their result. It has kept the 1991 Nobel peace prize winner, Aung San Suu Kyi - leader of the opposition party which won a landslide victory in the poll - under house arrest since July 1989. The regime, which is also engaged in a battle with insurgents near its eastern border with Thailand, ignored a 1990 election victory by an opposition party and is detaining its leader, Ms Aung San Suu Kyi, who was awarded the 1991 Nobel Peace Prize. According to the British Red Cross, 5,000 or more refugees, mainly the elderly and women and children, are crossing into Bangladesh each day.
Answer Extraction • Most factoids need an NP answer (e.g., most are NEs, such as countries, cities, dates, people’s names, company names, …) • All NPs considered as possible answers • For passages • Used Lingua::EN::Sentence to find sentences (sentence breaking) • POS tagged (Stanford POS Tagger) • Chunked using the fnTBLChunker (to identify NP chunks) • Prior query classification used to identify kind of NP answer expected • Some other heuristics (e.g., most likely place NP would occur)
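One simple way to choose among candidate NPs is to let repeated occurrences across passages act as votes. The Python sketch below assumes the NP chunks for each passage have already been produced by a tagger and chunker, and it is not the heuristic set used by the system described here; the `rank_candidates` name and the filtering rule are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def rank_candidates(candidates_per_passage: Sequence[Sequence[str]], question: str) -> List[str]:
    """Rank candidate NPs by the number of passages they occur in.

    Candidates made up entirely of question words are dropped, since they
    restate the question rather than answer it.
    """
    question_words = set(question.lower().rstrip("?").split())
    votes: Counter = Counter()
    for candidates in candidates_per_passage:
        # Each passage contributes at most one vote per distinct candidate.
        for noun_phrase in sorted(set(candidates)):
            words = set(noun_phrase.lower().split())
            if words <= question_words:
                continue
            votes[noun_phrase] += 1
    return [noun_phrase for noun_phrase, _ in votes.most_common()]


if __name__ == "__main__":
    passages = [
        ["Aung San Suu Kyi", "the Nobel Peace Prize", "the Slorc"],
        ["Aung San Suu Kyi", "the opposition party"],
        ["Aung San Suu Kyi", "the 1991 Nobel Peace Prize"],
    ]
    print(rank_candidates(passages, "Who won the Nobel Peace Prize in 1991?"))
```

In a fuller system the expected answer type from query classification would filter the candidates before voting, so that a "Human" question only considers person-like NPs.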
An Example Who won the Nobel Peace Prize in 1991? But many foreign investors remain sceptical, and western governments are withholding aid because of the Slorc's dismal human rights record and the continued detention of Ms Aung San Suu Kyi, the opposition leader who won the Nobel Peace Prize in 1991. The military junta took power in 1988 as pro-democracy demonstrations were sweeping the country. It held elections in 1990, but has ignored their result. It has kept the 1991 Nobel peace prize winner, Aung San Suu Kyi - leader of the opposition party which won a landslide victory in the poll - under house arrest since July 1989. The regime, which is also engaged in a battle with insurgents near its eastern border with Thailand, ignored a 1990 election victory by an opposition party and is detaining its leader, Ms Aung San Suu Kyi, who was awarded the 1991 Nobel Peace Prize. According to the British Red Cross, 5,000 or more refugees, mainly the elderly and women and children, are crossing into Bangladesh each day.
Answer Extraction • For lists: • Question topic appeared the most important • Heavily weighted topic terms for Lucene • Similar process to Factoids (tagging, chunking) for finding answers • Cut-off determined by 2005 data
Answer Extraction • For others: • Anything left over that might be answer bearing • Top 15 returned
How’d we do? • Before answering the question: • Mean Reciprocal Rank (MRR)
Mean Reciprocal Rank (MRR) • Assumes: test set of questions with human-labeled answers • Assumes: system returns a short ranked list of answers or passages with answers • Each question is scored with the reciprocal of the rank of its first correct answer (0 if no correct answer is returned) • MRR is the mean of these scores over all N questions
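In its standard form, MRR gives each question the reciprocal of the rank of its first correct answer (0 if none is returned) and averages over the questions. A minimal Python sketch follows; it assumes exact string matching against a set of gold answers per question, which is a simplification of human judgment, and the function name is hypothetical.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(ranked_answers: Sequence[Sequence[str]],
                         gold: Sequence[Set[str]]) -> float:
    """Average, over questions, of 1/rank of the first correct answer.

    ranked_answers[i] is the system's ranked list for question i and
    gold[i] is the set of acceptable answers for that question.
    """
    if not ranked_answers:
        return 0.0
    total = 0.0
    for returned, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(returned, start=1):
            if answer in correct:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers)


if __name__ == "__main__":
    system_output = [["a", "b"], ["x", "y", "z"], ["q"]]
    gold_answers = [{"a"}, {"z"}, {"other"}]
    # Reciprocal ranks are 1, 1/3 and 0, so the mean is 4/9.
    print(mean_reciprocal_rank(system_output, gold_answers))
```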
How’d we do? • Factoid • UWCLMAQA: .112 and .109 • Median: .186, Best: .578, Worst: .040 • List • UWCLMAQA: .051 and .046 • Median: .087, Best: .433, Worst: .000 • Other • UWCLMAQA: .164 and .153 • Median: .125, Best: .250, Worst: .000
Full List of Tools Used • SGML::Parser::OpenSP http://search.cpan.org/~bjoern/SGML-Parser-OpenSP-0.98/ • OpenSP http://openjade.sourceforge.net/ • UIUC Question Classification http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/ • Lucene http://lucene.apache.org/java/docs/index.html • SAX (Simple API for XML) http://www.saxproject.org/ • Maxent Toolkit http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html • PyGoogle http://pygoogle.sourceforge.net/ • SOAPy http://soapy.sourceforge.net/ • Google API http://www.google.com/apis/ • Lingua::Stem http://search.cpan.org/~snowhare/Lingua-Stem-0.82/lib/Lingua/Stem/En.pm • Lingua::EN::Sentence http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm • Stanford POS Tagger http://nlp.stanford.edu/software/tagger.shtml • fnTBLChunker http://nlp.cs.jhu.edu/~rflorian/fntbl/ • Lingpipe http://www.alias-i.com/lingpipe/ • LevenshteinXS.pm http://search.cpan.org/~jgoldberg/Text-LevenshteinXS-0.03/