Exploring the Performance of Boolean Retrieval Strategies for Open Domain Question Answering
IR4QA: Information Retrieval for Question Answering, SIGIR 2004 Workshop
Horacio Saggion, Robert Gaizauskas, Mark Hepple, Ian Roberts, Mark A. Greenwood (University of Sheffield, UK)
Abstract
• Exploring and evaluating Boolean retrieval strategies for QA
• Evaluation metrics: coverage and redundancy
• A series of possible Boolean retrieval strategies for use in QA
• Evaluation of their performance
• Understanding which query formulations are better suited to QA
Introduction (1 of 2)
• Open domain question answering
• QA performance is bounded by the IR system
• Pipeline: Question → IR → Answer Extraction → Answer
Introduction (2 of 2)
• Present results obtained with off-the-shelf ranked retrieval engines (baseline: no answer extraction)
• Describe previous work that has used a Boolean search strategy in QA
• Experiments
• Results
• Discussion
Evaluation Measures (1 of 4)
• Coverage
• The proportion of the question set for which a correct answer can be found within the top n passages retrieved for each question
• Answer redundancy
• The average number of passages within the top n ranks retrieved which contain a correct answer, per question
• i.e. how many chances, on average, an answer extraction component has to find an answer for each question
Evaluation Measures (2 of 4)
• D: the document (or passage) collection
• n: the top n ranks
• A_{D,q}: the subset of D which contains correct answers for question q ∈ Q
• R^S_{D,q,n}: the n top-ranked documents (or passages) in D retrieved by retrieval system S given question q
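Written out with these symbols, the two measures can be expressed as follows (a reconstruction from the definitions above; the notation may differ slightly from the slide):

```latex
\mathrm{coverage}(S,D,Q,n) =
  \frac{\bigl|\{\, q \in Q \;:\; R^{S}_{D,q,n} \cap A_{D,q} \neq \emptyset \,\}\bigr|}{|Q|}
\qquad
\mathrm{redundancy}(S,D,Q,n) =
  \frac{\sum_{q \in Q} \bigl| R^{S}_{D,q,n} \cap A_{D,q} \bigr|}{|Q|}
```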
Evaluation Measures (3 of 4)
[Diagram: the corpus D, the retrieved set R^S_{D,q,n}, and the answer-bearing subset A_{D,q}; passages in their intersection are retrieved passages that contain an answer, the rest are retrieved without an answer.]
Evaluation Measures (4 of 4)
• The actual redundancy is the upper bound on the answer redundancy achievable by any QA system
• Comparing answer redundancy with actual redundancy captures the same information that recall traditionally supplies
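Spelled out in the same notation (our sketch, consistent with the definitions on the previous slides): the actual redundancy is the average number of answer-bearing texts per question, so for any system S and rank cut-off n the answer redundancy cannot exceed it.

```latex
\mathrm{actual\ redundancy}(D,Q) \;=\; \frac{\sum_{q \in Q} \bigl|A_{D,q}\bigr|}{|Q|}
\;\;\geq\;\;
\mathrm{redundancy}(S,D,Q,n)
```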
Evaluation Dataset (1 of 3)
• Text collection: the AQUAINT corpus, roughly 1,000,000 newswire documents from 1998–2000 (about 3.2 GB)
• TREC 2003 question set: 500 questions; only the 413 factoid questions are used (list and definition questions are excluded)
• 51 of these questions were judged by human assessors to have no answer in the collection
• Excluding them leaves a question set of 362 questions
• NIST provides regular-expression answer patterns for each question, which match strings containing the answer
Evaluation Dataset (2 of 3)
• There are two criteria for correctness
• Lenient: any string drawn from the test collection which matches an answer pattern for the question
• Strict: a string matching an answer pattern that is drawn from a text judged by a human assessor to support the answer
Evaluation Dataset (3 of 3)
• We estimate the actual redundancy of this text collection and question set to be 3.68, based on the average number of texts per question judged by human assessors to support an answer
• Some supporting documents may contain more than one occurrence of the answer
• Not every document that supports an answer is likely to have been found by the assessors
Okapi
• Representative of the state of the art in ranked retrieval
• Passage splitting is done at search time, not at index time
• Passages are based on paragraph boundaries and are about 4 sentences long
Lucene
• Open-source IR engine: Boolean queries, ranked retrieval, standard tf.idf weighting, cosine similarity measure
• The corpus is split into passages at paragraph boundaries; average passage length is about 1.5 sentences
• Stopwords are removed
• Terms are stemmed using the Porter stemmer
• Queries consist of all the question words
Z-PRISE
• Vector space retrieval system freely available from the National Institute of Standards and Technology (NIST)
• Documents are not split into passages; they average about 24 sentences each
• Any rank cut-off therefore returns a greater amount of text
• Coverage and redundancy should be better than for Okapi and Lucene
• But passing more text to answer extraction brings a risk of lower overall performance
Boolean retrieval for QA
• With ranked retrieval we can simply take the words of the question as a query and get ranked answer-candidate passages
• With Boolean retrieval, if the terms do not appear together in any passage of the entire collection, we must 'weaken' the query
• Weakening may mean deleting terms, generalising terms, or doing so dynamically (iteratively), as in the sketch below
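A minimal sketch of this retrieve-or-weaken loop; `boolean_search` and `weaken` are hypothetical placeholders, not functions from the paper.

```python
def retrieve_with_weakening(question_terms, boolean_search, weaken, max_rounds=5):
    """Run a conjunctive Boolean query; if nothing matches, weaken the
    query and retry, up to max_rounds times."""
    query = set(question_terms)           # all non-stoplist question terms
    for _ in range(max_rounds):
        passages = boolean_search(query)  # conjunction (AND) of the terms in `query`
        if passages:                      # at least one passage matched
            return passages
        query = weaken(query)             # e.g. drop a term, or expand a term into a disjunction
        if not query:                     # nothing left to search with
            break
    return []
```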
MURAX
• Kupiec's MURAX; knowledge base: Grolier's on-line encyclopedia
• Analyses the question to locate noun phrases and main verbs, and forms a query
• As passages are returned, new queries are created to reduce the number of hits (narrowing) or increase them (broadening)
• Passages are ranked by their overlap with the initial question
Falcon
• Uses the SMART retrieval engine
• The initial query is formulated from keywords in the question
• A query term may be joined with alternatives: w1 -> w1 or w2 or w3
• Morphological (invent, inventor, invented)
• Lexical (assassin, killer)
• Semantic (prefer, like)
Sheffield
• In-house Boolean search engine, MadCow
• Matching is performed within a fixed window size
• Query formulation uses name expressions, e.g. Bill Clinton: (Bill & Clinton) | (President & Clinton) (see the sketch below)
• If the query fails or returns too many passages:
• extend an overly weak name condition, or
• substitute an alternative in place of a name condition
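A rough illustration of how such a name condition could be assembled; this is a hypothetical helper, not the actual MadCow code.

```python
def name_condition(name_tokens, title=None):
    """Boolean condition for a person name, optionally adding a
    title + surname alternative, e.g.
    ['Bill', 'Clinton'] with title 'President'
      -> '(Bill & Clinton) | (President & Clinton)'."""
    clauses = ["(" + " & ".join(name_tokens) + ")"]
    if title and len(name_tokens) > 1:
        clauses.append(f"({title} & {name_tokens[-1]})")
    return " | ".join(clauses)

# name_condition(["Bill", "Clinton"], title="President")
#   -> '(Bill & Clinton) | (President & Clinton)'
```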
Experiments (1 of 8)
• Understanding of query formulation
• Question analysis
• Term expansion
• Query broadening
• Passage and matching-window size
• Ranking
Experiments (2 of 8)
• Minimal strategy (lower bound)
• Simplest approach, AllTerms: use the conjunction of the question terms to formulate the query (see the sketch below)
• Example: How far is it from Earth to Mars? -> (Mars & Earth)
• Q and P denote the sets of non-stoplist terms in the question and in a passage, respectively
• 178 of 362 questions return a non-empty result
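A minimal sketch of the AllTerms formulation; the stoplist here is a small illustrative assumption, not the one used in the paper.

```python
# Hypothetical stoplist; the paper's actual stoplist is not given on the slide.
STOPLIST = {"how", "far", "is", "it", "from", "to", "what", "the", "a", "an", "of", "on"}

def all_terms_query(question):
    """AllTerms: conjunction of the non-stoplist question terms,
    e.g. 'How far is it from Earth to Mars?' -> 'Earth & Mars'."""
    tokens = [w.strip("?.,!\"'") for w in question.split()]
    content = [t for t in tokens if t and t.lower() not in STOPLIST]
    return " & ".join(content)

# all_terms_query("How far is it from Earth to Mars?")  ->  'Earth & Mars'
```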
Experiments (3 of 8)
• Simple term expansion
• The WordNet configuration explores synonymy expansion of the query keywords
• When AllTerms has no match, each term is replaced by the disjunction of its synonyms
• 202 of 362 questions have at least one match
Experiments (4 of 8)
• The MorphVar configuration explores the use of morphological variants of query terms
• A variant is any corpus word that yields the same stem when the Porter stemmer is applied
• When AllTerms has no match, each term is replaced by the disjunction of its morphological variants (both expansions are sketched below)
• 203 of 362 questions have at least one match
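A sketch of both expansion steps using NLTK (assuming NLTK and its WordNet data are installed; how expanded terms are recombined into the Boolean query is an assumption).

```python
from nltk.corpus import wordnet as wn        # requires the NLTK WordNet data
from nltk.stem import PorterStemmer

def synonym_disjunct(term):
    """WordNet configuration: replace a term by the disjunction of its
    WordNet synonyms (plus the term itself)."""
    lemmas = {l.name().replace("_", " ")
              for synset in wn.synsets(term)
              for l in synset.lemmas()}
    lemmas.add(term)
    return "(" + " | ".join(sorted(lemmas)) + ")"

def morph_disjunct(term, vocabulary, stemmer=PorterStemmer()):
    """MorphVar configuration: replace a term by the disjunction of all
    corpus words (vocabulary) that reduce to the same Porter stem."""
    stem = stemmer.stem(term)
    variants = {w for w in vocabulary if stemmer.stem(w) == stem} | {term}
    return "(" + " | ".join(sorted(variants)) + ")"
```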
Experiments (5 of 8)
• Document frequency broadening
• The DropBig configuration discards from the initial query the question term with the highest document frequency (273 of 362)
• The DropSmall configuration discards from the initial query the question term with the lowest document frequency (288 of 362)
• Iterative deletion keeps dropping the highest- or lowest-frequency term until at least one passage matches: DropBig -> BigIte and DropSmall -> SmallIte (sketched below)
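A sketch of the iterative document-frequency broadening; `doc_freq` and `boolean_search` are assumed helpers (document-frequency lookup and conjunctive passage search).

```python
def broaden_by_df(terms, doc_freq, boolean_search, drop_big=True):
    """BigIte / SmallIte: repeatedly discard the term with the highest
    (drop_big=True) or lowest (drop_big=False) document frequency until
    the conjunction of the remaining terms matches at least one passage."""
    query = list(terms)
    while query:
        passages = boolean_search(query)   # AND of the remaining terms
        if passages:
            return passages, query
        victim = max(query, key=doc_freq) if drop_big else min(query, key=doc_freq)
        query.remove(victim)
    return [], []
```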
Experiments (6 of 8)
• Structure analysis: StrIte
• Distinguishes proper names and quoted expressions from the other terms in a question
• POS tagging is used to identify proper nouns
• Example: What is Richie's surname on "Happy Days"?
• Name term: Richie
• Quote term: Happy Days
• Common terms: what, is, 's, surname, on
Experiments (7 of 8)
• AllTerms -> StrIte: iteratively drops terms until at least one matching sentence is returned (see the sketch below)
• Drop order:
• common terms first
• then name terms
• then quote terms
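A sketch of the StrIte loop under the drop order above; the term classification and the `boolean_search` function are assumed inputs.

```python
DROP_ORDER = ["common", "name", "quote"]   # drop common terms first, quoted expressions last

def strite_retrieve(classified_terms, boolean_search):
    """StrIte: classified_terms maps 'common' / 'name' / 'quote' to term lists.
    Search with the conjunction of all remaining terms; on failure, drop one
    term of the least valued kind and retry."""
    remaining = {kind: list(ts) for kind, ts in classified_terms.items()}
    while True:
        query = [t for ts in remaining.values() for t in ts]
        if not query:
            return []                       # nothing left to search with
        sentences = boolean_search(query)   # conjunction of all remaining terms
        if sentences:
            return sentences
        for kind in DROP_ORDER:             # common -> name -> quote
            if remaining.get(kind):
                remaining[kind].pop()
                break
```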
Experiments (8 of 8)
• StrIte -> StrIteMorph -> StrIteMorph20
• StrIteMorph: as StrIte, but each term is expanded with its morphological variants
• StrIteMorph20: as StrIteMorph, but broadening continues until at least 20 sentences per question are retrieved
• In these three configurations terms are weighted: w(t) = 1/6 for common terms, 2/6 for name terms, 3/6 for quote terms (see the scoring sketch below)
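A plausible reading of this weighting (the slide does not spell out the scoring function, so this is an assumption): retrieved sentences are ranked by the summed weights of the question terms they contain.

```latex
\mathrm{score}(p) \;=\; \sum_{t \,\in\, Q \cap p} w(t),
\qquad
w(t) \;=\;
\begin{cases}
1/6 & \text{if } t \text{ is a common term}\\
2/6 & \text{if } t \text{ is a name term}\\
3/6 & \text{if } t \text{ is a quote term}
\end{cases}
```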
Results (1 of 2)
[Figure: mean number of sentences an answer extraction system would have to process at rank k, per configuration.]
Discussion (1 of 2)
• At rank 200, StrIteMorph20 achieves 62.15% coverage, compared to 72.9% for Lucene, 78.2% for Okapi, and 80.4% for Z-PRISE
• At rank 200, StrIteMorph20 returns on average around 137 sentences, Lucene around 300, Okapi around 800, and Z-PRISE around 4600
Discussion (2 of 2)
• A downstream AE component is less likely to be distracted when it has a smaller volume of text to process
• Synonym expansion (WordNet) offers negligible advantage
• Expanding all terms with their morphological variants (MorphVar) does not offer a major improvement either
Future work
• The post-retrieval ranking of results needs to be explored in more detail, and other ranking methods should be investigated
• The most effective window size
• Query refinement
• Term expansion methods