An Analysis of the AskMSR Question-Answering System
Eric Brill, Susan Dumais, and Michelle Banko, Microsoft Research
From Proceedings of the EMNLP Conference, 2002
Goals • Evaluate contributions of components • Explore strategies for predicting when answers are incorrect
AskMSR – What Sets It Apart • Reliance on data redundancy • No sophisticated linguistic analysis of either questions or answers
TREC Question Answering Track • Fact-based, short-answer questions • How many calories are there in a Big Mac? (562, in case you’re wondering) • Who killed Abraham Lincoln? • How tall is Mount Everest? • Motivation for much of the recent work in QA
Other Approaches • POS tagging • Parsing • Named Entity extraction • Semantic relations • Dictionaries • WordNet
AskMSR Approach • Web as a “gigantic data repository” • Differs from other systems that use the web in its simplicity and efficiency • No complex parsing or entity extraction, either for queries or for the best-matching web pages • No local caching • Claim: the techniques used in this approach to short-answer tasks are more broadly applicable
Some QA Difficulties • Single, small information source • Likely only 1 answer exists • Source with small # of answer formulations • Complex relations between Q & A • Lexical, syntactic, semantic relations • Anaphora, synonymy, alternate syntactic formulations, indirect answers make this difficult
Answer Redundancy • Greater answer redundancy in source • More likely: simple relation between Q & A exists • Less likely: need to deal with difficulties facing NLP systems
Query Reformulation • Rewrite the question as a substring of a declarative answer • Rewrites are weighted • “when was the paper clip invented?” → “the paper clip was invented” • Also produce less precise rewrites with a greater chance of matching • Backoff to simple ANDing of the non-stop words
Query Reformulation (cont.) • String-based manipulations only • No parser, no POS tagging • Small lexicon for possible POS and morphological variants • Rewrite rules created by hand • Associated weights chosen by hand • A minimal sketch of the idea follows
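To make the rewrite step concrete, here is a minimal Python sketch under stated assumptions: the two regex rewrite rules, their weights, and the stop-word list are illustrative inventions, not the paper's actual hand-written rule set.

```python
import re

# Hypothetical hand-written rewrite patterns: (question regex, declarative template, weight).
# AskMSR's real rules and weights were hand-tuned; these two are purely illustrative.
REWRITE_RULES = [
    (re.compile(r"^when was (?P<np>.+) (?P<verb>\w+ed)\??$", re.I), "{np} was {verb}", 5),
    (re.compile(r"^who (?P<verb>\w+ed) (?P<np>.+?)\??$", re.I), "{np} was {verb} by", 5),
]

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "when", "what", "how"}

def reformulate(question):
    """Return (query string, weight) rewrites, most precise first."""
    rewrites = []
    for pattern, template, weight in REWRITE_RULES:
        match = pattern.match(question.strip())
        if match:
            rewrites.append((template.format(**match.groupdict()), weight))
    # Backoff: simple ANDing of the non-stop words, with the lowest weight.
    content = [w for w in re.findall(r"\w+", question.lower()) if w not in STOP_WORDS]
    rewrites.append((" AND ".join(content), 1))
    return rewrites

print(reformulate("when was the paper clip invented?"))
# [('the paper clip was invented', 5), ('paper AND clip AND invented', 1)]
```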
N-gram Mining • Formulate rewrite for search engine • Collect and analyze page summaries • Why use summaries? • Efficiency • Contain search terms, plus some context • N-grams collected from summaries
N-gram Mining (Cont.) • Extract 1-, 2-, 3-grams from each summary • Score each n-gram by the weight of the rewrite that retrieved the summary • Sum scores across all summaries containing the n-gram • Frequency within a single summary is ignored • Final n-gram score thus reflects the rewrite-rule weights and the number of unique summaries it appears in • A sketch of this scoring follows
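A minimal sketch of this scoring, assuming each retrieved summary arrives as a (summary text, rewrite weight) pair; the function and variable names are hypothetical.

```python
from collections import defaultdict

def mine_ngrams(summaries):
    """summaries: list of (summary_text, rewrite_weight) pairs.
    Each summary contributes its rewrite weight once per distinct n-gram it
    contains (within-summary frequency is ignored)."""
    scores = defaultdict(float)
    for text, weight in summaries:
        tokens = text.split()
        seen = set()
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        for ngram in seen:
            scores[ngram] += weight
    return dict(scores)
```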
N-gram Filtering • Use handwritten filter rules • Question type assignment • e.g. who, what, how • Choose set of filters based on q-type • Rescore n-grams based on presence of features relevant to filters
N-gram Filtering (Cont.) • 15 simple filters • Based on human knowledge of question types and likely answer domains • Surface string features: capitalization, digits • Handcrafted regular expression patterns • A sketch of type-based filtering follows
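A sketch of type-based rescoring, assuming just two illustrative filters (digits for “how many” questions, capitalized tokens for “who” questions); the paper's 15 hand-written filters are not reproduced here.

```python
import re

# Illustrative filters only, keyed by question type.
FILTERS = {
    "how-many": re.compile(r"\d"),               # prefer n-grams containing a digit
    "who":      re.compile(r"\b[A-Z][a-z]+\b"),  # prefer capitalized tokens
}

def question_type(question):
    q = question.lower()
    if q.startswith("how many"):
        return "how-many"
    if q.startswith("who"):
        return "who"
    return "other"

def filter_ngrams(question, scored_ngrams, boost=2.0):
    """Rescore n-grams: boost those exhibiting the features the question type
    prefers, leave the rest unchanged."""
    pattern = FILTERS.get(question_type(question))
    if pattern is None:
        return scored_ngrams
    return {ng: score * boost if pattern.search(ng) else score
            for ng, score in scored_ngrams.items()}
```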
N-gram Tiling • Merge similar answers • Create longer answers from overlapping shorter answer fragments • “A B C” + “B C D” → “A B C D” • Greedy algorithm: start with the top-scoring n-gram and check lower-scoring n-grams for tiling potential • If an n-gram can be tiled in, replace the higher-scoring n-gram with the tiled n-gram and remove the lower-scoring one • Stop when no further tiling is possible • A sketch follows
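A sketch of greedy tiling over the scored n-grams from the previous steps; the overlap test and the summed score are simplifying assumptions rather than the paper's exact procedure.

```python
def tile_pair(a, b):
    """Merge two fragments if one's suffix overlaps the other's prefix,
    e.g. 'A B C' + 'B C D' -> 'A B C D'.  Returns None if they do not overlap."""
    ta, tb = a.split(), b.split()
    for k in range(min(len(ta), len(tb)), 0, -1):
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
        if tb[-k:] == ta[:k]:
            return " ".join(tb + ta[k:])
    return None

def tile_answers(scored_ngrams):
    """Greedily merge the top-scoring answer with any lower-scoring fragment
    that overlaps it, combining their scores, until nothing more can be tiled."""
    answers = sorted(scored_ngrams.items(), key=lambda kv: kv[1], reverse=True)
    changed = True
    while changed and answers:
        changed = False
        best, best_score = answers[0]
        for other, other_score in answers[1:]:
            merged = tile_pair(best, other)
            if merged is not None:
                answers.remove((other, other_score))
                answers[0] = (merged, best_score + other_score)
                changed = True
                break
    return answers

print(tile_answers({"A B C": 3.0, "B C D": 2.0}))  # [('A B C D', 5.0)]
```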
Experiments • First 500 TREC-9 queries • Use scoring patterns provided by NIST • Modified some patterns to accommodate web answers not in TREC • More specific answers allowed (Edward J. Smith vs. Edward Smith) • More general answers not allowed (Smith vs. Edward Smith) • Simple substitutions allowed (9 months vs. nine months)
Experiments (cont.) • Temporal differences between the Web and TREC, e.g. “Who is the president of Bolivia?” • Did NOT modify the answer key • Doing so would make comparison with earlier TREC results impossible (instead of merely difficult?) • Such changes influence absolute scores, not relative performance
Experiments (cont.) • Automatic runs • Start with the queries, generate a ranked list of 5 answers • Use Google as the search engine • Query-relevant summaries used for n-gram mining efficiency • Answers are at most 50 bytes long, typically shorter
“Basic” System Performance • A somewhat backwards notion of “basic”: the current system with all modules implemented, default settings • Mean Reciprocal Rank (MRR) – 0.507 (sketched below) • 61% of questions answered correctly • Average answer length – 12 bytes • Impossible to compare precisely with the TREC-9 groups, but still very good performance
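For reference, a small sketch of how MRR could be computed over the ranked answer lists; `is_correct` stands in for a match against the NIST answer patterns and is an assumed callback, not part of the paper.

```python
def mean_reciprocal_rank(ranked_answers, is_correct):
    """ranked_answers: one ranked list of candidate answers per question.
    is_correct(question_index, answer) -> bool, e.g. a regex match against the
    NIST answer patterns.  Each question contributes 1/rank of its first
    correct answer, or 0 if none of its answers is correct."""
    total = 0.0
    for qi, answers in enumerate(ranked_answers):
        for rank, answer in enumerate(answers, start=1):
            if is_correct(qi, answer):
                total += 1.0 / rank
                break
    return total / len(ranked_answers)
```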
Query Rewrite Contribution • More precise queries – higher weights • All rewrites equal – MRR drops 3.6% • Only backoff AND – MRR drops 11.2% • Rewrites capitalize on web redundancy • Could use more specific regular expression matching
N-gram Filtering Contribution • 1-, 2-, 3-grams from the 100 best-matching summaries • Filter by question type • “How many dogs pull a sled in the Iditarod?” – the question prefers a number • Candidates like “run”, “Alaskan”, “dog racing”, and “mush” are ranked lower than “pool of 16 dogs” (the correct answer) • No filtering – MRR drops 17.9%
N-gram Tiling Contribution • Benefits of tiling • Overlapping substrings are merged so they take up only one answer slot (e.g. “San”, “Francisco”, “San Francisco”) • Answers longer than three words could never be found from tri-grams alone (e.g. “light amplification by [stimulated] emission of radiation”) • No tiling – MRR drops 14.2%
Component Combinations • Using only a weighted sum of occurrences of 1-, 2-, 3-grams – MRR drops 47.5% • A simple statistical system: no linguistic knowledge or processing, only AND queries, no filtering, (statistical) tiling only – MRR drops 33%, to 0.338
Component Combinations • Is the statistical system's performance good? Reasonable on an absolute scale? • One TREC-9 50-byte run performed better • All components contribute to accuracy • The precise weights of the rewrites are unimportant • N-gram tiling acts as a “poor man's named-entity recognizer” • Biggest contribution comes from the filters/answer selection
Component Combinations • Claim: “Because of the effectiveness of our tiling algorithm…we do not need to use any named entity recognition components.” • But by having filters with capitalization information (section 2.3, 2nd paragraph), aren’t they doing some NE recognition?
Component Problems (cont.) • Errors: no correct answer in the top 5 hypotheses • 23% of errors – not knowing units (“How fast can Bill’s Corvette go?” – mph or km/h?) • 34% (categories “Time”, “Correct”) – time-sensitive questions or answers not in the TREC-9 answer key • 16% – shortcomings in n-gram tiling • 5% – number retrieval, a query limitation
Component Problems (cont.) • 12% – beyond the current system paradigm; can’t be fixed with minor enhancements • Is this really so, or have they been easy on themselves in error attribution? • 9% – no discussion
Knowing When… • There is some cost to answering incorrectly • The system can choose not to answer instead of giving an incorrect answer • How likely is it that a hypothesis is correct? • TREC – no distinction between a wrong answer and no answer • A deployed real system faces a trade-off between precision & recall
Knowing When…(cont.) • The answer score is an ad-hoc combination of hand-tuned weights • Is it possible to induce a useful precision-recall (ROC) curve when answers don’t have meaningful probabilities? • What is an ROC (Receiver Operating Characteristic) curve?
ROC • From http://www-csli.stanford.edu/~schuetze/roc.html (Hinrich Schütze, co-author of Foundations of Statistical Natural Language Processing)
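The slide refers to an external figure; as a stand-in, here is a small sketch of how the points of an ROC curve could be computed by sweeping a threshold over answer-confidence scores (it assumes both correct and incorrect answers are present).

```python
def roc_points(scores, labels):
    """scores: confidence score per answered question; labels[i] is True if
    that answer was correct.  Returns (false positive rate, true positive rate)
    points obtained by lowering the acceptance threshold one answer at a time."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    pos = sum(1 for _, lab in ranked if lab)
    neg = len(ranked) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, lab in ranked:
        if lab:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```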
Determining Likelihood • Ideal – determine the likelihood of a correct answer based only on the question • If possible, the system could then skip questions it is likely to get wrong • Use a decision tree over a set of features drawn from the question string • 1-, 2-grams, question type • Sentence length, longest word length • # capitalized words, # stop words • Ratio of stop words to non-stop words
Decision Tree/Diagnostic Tool • Performs worst on “how” questions • Performs best on short “who” questions with many stop words • Induce an ROC curve from the decision tree • Sort leaf nodes from the highest probability of being correct to the lowest • Gain precision by not answering the questions with the highest probability of error • A sketch follows
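A sketch of inducing this precision/coverage trade-off from leaf probabilities, using scikit-learn as a stand-in for whatever tool the authors used; the feature matrix X, the labels y (whether the top-5 answers contained a correct one), and min_samples_leaf are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def precision_coverage_curve(X_train, y_train, X_test, y_test):
    """Train a decision tree on question features (NumPy arrays assumed), then
    answer only the test questions whose leaf probability of being correct is
    highest, sweeping coverage from the most confident question to all of them."""
    tree = DecisionTreeClassifier(min_samples_leaf=20).fit(X_train, y_train)
    p_correct = tree.predict_proba(X_test)[:, 1]  # leaf P(correct); assumes binary labels
    order = np.argsort(-p_correct)                # most confident questions first
    curve = []
    for k in range(1, len(order) + 1):
        answered = order[:k]                      # answer only the top-k questions
        curve.append((k / len(order), y_test[answered].mean()))  # (coverage, precision)
    return curve
```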
Decision Tree–Query Results • Decision Tree trained on TREC-9 • Tested on TREC-10 • Overfits training data – insufficient generalization
Answer Correctness/Score • Ad-hoc score based on • # of retrieved passages n-gram occurs in • weight of rewrite used to retrieve passage • what filters apply • effects of n-gram tiling • Correlation between whether answer appears in top 5 output and…
Correct Answer In Top 5 • …and the score of the system’s first-ranked answer • Correlation coefficient: 0.363 (computation sketched below) • Excluding time-sensitive questions: 0.401 • …and the score of the first-ranked answer minus that of the second • Correlation coefficient: 0.270
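A one-line sketch of the reported correlation, assuming NumPy arrays of first-ranked-answer scores and 0/1 indicators of whether a correct answer appeared in the top five; the function name is hypothetical.

```python
import numpy as np

def score_correctness_correlation(top_scores, correct_in_top5):
    """Pearson correlation between the first-ranked answer's score and a 0/1
    correctness indicator."""
    x = np.asarray(top_scores, dtype=float)
    y = np.asarray(correct_in_top5, dtype=float)
    return np.corrcoef(x, y)[0, 1]
```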
Other Likelihood Indicators • Snippets gathered for each question, from AND queries and from the more refined exact-string-match rewrites • MRR as a function of the snippets retrieved: all snippets from AND queries only – 0.238; 11 to 100 snippets from non-AND rewrites – 0.612; 100 to 400 from non-AND rewrites – 0.628 • But wasn’t the MRR for the “basic” system 0.507?
Another Decision Tree • Features of first DT, plus • Score of #1 answer • State of system in processing • Total # of matching passages • # of non-AND matching passages • Filters applied • Weight of best rewrite rule yielding matching passages • Others