Question Answering Techniques and Systems. TALP Research Center Dep. Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya surdeanu@lsi.upc.es. Mihai Surdeanu (TALP) Marius Paşca (Google - Research) *.
Question Answering Techniques and Systems TALP Research Center Dep. Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya surdeanu@lsi.upc.es Mihai Surdeanu (TALP) Marius Paşca (Google - Research)* *The work by Marius Pasca (currently mars@google.com) was performed as part of his PhD work at Southern Methodist University in Dallas, Texas.
Overview • What is Question Answering? • A “traditional” system • Other relevant approaches • Distributed Question Answering
Problem of Question Answering When was the San Francisco fire? … were driven over it. After the ceremonial tie was removed - it burned in the San Francisco fire of 1906 – historians believe an unknown Chinese worker probably drove the last steel spike into a wooden tie. If so, it was only… What is the nationality of Pope John Paul II? … stabilize the country with its help, the Catholic hierarchy stoutly held out for pluralism, in large part at the urging of Polish-born Pope John Paul II. When the Pope emphatically defended the Solidarity trade union during a 1987 tour of the… Where is the Taj Mahal? … list of more than 360 cities around the world includes the Great Reef in Australia, the Taj Mahal in India, Chartre’s Cathedral in France, and Serengeti National Park in Tanzania. The four sites Japan has listed include…
Problem of Question Answering Natural language question, not keyword queries What is the nationality of Pope John Paul II? … stabilize the country with its help, the Catholic hierarchy stoutly held out for pluralism, in large part at the urging of Polish-born Pope John Paul II. When the Pope emphatically defended the Solidarity trade union during a 1987 tour of the… Short text fragment, not URL list
From the Caledonian Star in the Mediterranean – September 23, 1990 (www.expeditions.com): On a beautiful early morning the Caledonian Star approaches Naxos, situated on the east coast of Sicily. As we anchored and put the Zodiacs into the sea we enjoyed the great scenery. Under Mount Etna, the highest volcano in Europe, perches the fabulous town of Taormina. This is the goal for our morning. After a short Zodiac ride we embarked our buses with local guides and went up into the hills to reach the town of Taormina. Naxos was the first Greek settlement at Sicily. Soon a harbor was established but the town was later destroyed by invaders.[...] Compare keyword searches over a document collection with questions: • Searching for: Etna vs. What is the highest volcano in Europe? • Searching for: Naxos vs. Where is Naxos? • Searching for: Taormina vs. What continent is Taormina in?
Beyond Document Retrieval • Document Retrieval • Users submit queries corresponding to their information needs. • System returns (voluminous) list of full-length documents. • It is the responsibility of the users to find information of interest within the returned documents. • Open-Domain Question Answering (QA) • Users ask questions in natural language. What is the highest volcano in Europe? • System returns list of short answers. … Under Mount Etna, the highest volcano in Europe, perches the fabulous town… • Often more useful for specific information needs.
Evaluating QA Systems • The National Institute of Standards and Technology (NIST) organizes the yearly Text REtrieval Conference (TREC), which has had a QA track for the past 5 years: from TREC-8 in 1999 to TREC-12 in 2003. • The document set • Newswire textual documents from LA Times, San Jose Mercury News, Wall Street Journal, NY Times, etc.: over 1M documents now. • Well-formed lexically, syntactically and semantically (reviewed by professional editors). • The questions • Hundreds of new questions every year; the total is close to 2000 across all TRECs. • Task • Initially extract at most 5 answers: long (250 bytes) and short (50 bytes). • Now extract only one exact answer. • Several other sub-tasks added later: definition, list, context. • Metrics • Mean Reciprocal Rank (MRR): each question is assigned the reciprocal rank of its first correct answer. If the first correct answer is at position k, the score is 1/k.
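The MRR metric above can be sketched in a few lines. This is a minimal illustration, not the official TREC scorer; the function name and the convention that a question with no correct answer contributes a score of 0 are assumptions made here.

```python
from typing import List, Optional


def mean_reciprocal_rank(first_correct_ranks: List[Optional[int]]) -> float:
    """Average of 1/k over all questions.

    Each entry is the 1-based rank k of the first correct answer for one
    question, or None when no returned answer was correct (assumed to
    contribute 0 to the sum).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Four questions: correct at rank 1, rank 2, never, and rank 5.
ranks = [1, 2, None, 5]
mrr = mean_reciprocal_rank(ranks)  # (1 + 0.5 + 0 + 0.2) / 4
```

With at most 5 answers per question, as in the early TREC tasks, each per-question score is one of 1, 1/2, 1/3, 1/4, 1/5 or 0.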
Overview • What is Question Answering? • A “traditional” system • SMU ranked first at TREC-8 and TREC-9 • The foundation of LCC’s PowerAnswer system (http://www.languagecomputer.com) • Other relevant approaches • Distributed Question Answering
QA Block Architecture • Question Processing (input: question Q; outputs: question semantics, keywords): captures the semantics of the question and selects keywords for passage retrieval (PR) • Passage Retrieval (input: keywords; output: passages; built on Document Retrieval): extracts and ranks passages using surface-text techniques • Answer Extraction (inputs: passages, question semantics; output: answer A): extracts and ranks answers using NL techniques • Resources shown in the diagram: WordNet, Parser, NER
Question Processing Flow • Q → question parsing → construction of the question representation → question semantic representation • Answer type detection → AT category • Keyword selection → keywords
Lexical Terms Examples • Questions approximated by sets of unrelated words (lexical terms) • Similar to bag-of-word IR models
Question Stems and Answer Type Examples • Identify the semantic category of expected answers • Other question stems: Who, Which, Name, How hot... • Other answer types: Country, Number, Product...
Building the Question Representation [published in COLING 2000] • Built from the question parse tree by a bottom-up traversal with a set of propagation rules: • assign labels to non-skip leaf nodes • propagate the label of the head child node to the parent node • link the head child node to the other children nodes • Q006: Why did David Koresh ask the FBI for a word processor? • Tagged words: Why/WRB did/VBD David/NNP Koresh/NNP ask/VB the/DT FBI/NNP for/IN a/DT word/NN processor/NN • Phrase nodes in the parse tree: SBARQ, SQ, VP, PP, WHADVP, NP, NP, NP
Building the Question Representation (result) • Q006: Why did David Koresh ask the FBI for a word processor? • Labels propagated to the phrase nodes: ask (SBARQ, SQ, VP), processor (PP and its NP), REASON (WHADVP), Koresh (NP), FBI (NP) • Question representation (links): ask – REASON, ask – Koresh, ask – FBI, ask – processor, Koresh – David, processor – word
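The three propagation rules can be tried out on Q006 with a small sketch. The `Node` class, the hand-marked head children and skip leaves, and the choice to label the stem "Why" directly with REASON are all assumptions made for illustration; the original system derives heads and skip words with its own rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    tag: str
    word: Optional[str] = None            # set for leaves only
    children: List["Node"] = field(default_factory=list)
    head: int = 0                          # index of the head child (given by hand here)
    skip: bool = False                     # leaf ignored by the rules


def leaf(tag: str, word: str, skip: bool = False) -> Node:
    return Node(tag, word=word, skip=skip)


def propagate(node: Node, links: List[Tuple[str, str]]) -> Optional[str]:
    """Bottom-up traversal: returns the label of `node`, records head links."""
    if not node.children:
        return None if node.skip else node.word      # rule 1: label non-skip leaves
    labels = [propagate(child, links) for child in node.children]
    head_label = labels[node.head]                   # rule 2: head label goes up
    for i, label in enumerate(labels):
        if i != node.head and label is not None and head_label is not None:
            links.append((head_label, label))        # rule 3: link head to siblings
    return head_label


tree = Node("SBARQ", head=1, children=[
    Node("WHADVP", children=[leaf("WRB", "REASON")]),   # "Why" labeled REASON
    Node("SQ", head=2, children=[
        leaf("VBD", "did", skip=True),
        Node("NP", head=1, children=[leaf("NNP", "David"), leaf("NNP", "Koresh")]),
        Node("VP", head=0, children=[
            leaf("VB", "ask"),
            Node("NP", head=1, children=[leaf("DT", "the", skip=True), leaf("NNP", "FBI")]),
            Node("PP", head=1, children=[
                leaf("IN", "for", skip=True),
                Node("NP", head=2, children=[
                    leaf("DT", "a", skip=True), leaf("NN", "word"), leaf("NN", "processor"),
                ]),
            ]),
        ]),
    ]),
])

links: List[Tuple[str, str]] = []
root_label = propagate(tree, links)
```

After the traversal, `links` holds (head, dependent) pairs such as `("ask", "Koresh")` and `("processor", "word")`, and the root carries the label of the main verb.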
Detecting the Expected Answer Type • In some cases, the question stem is sufficient to indicate the answer type (AT) • Why → REASON • When → DATE • In many cases, the question stem is ambiguous • Examples • What was the name of Titanic’s captain? • What U.S. Government agency registers trademarks? • What is the capital of Kosovo? • Solution: select additional question concepts (AT words) that help disambiguate the expected answer type • Examples • captain • agency • capital
AT Detection Algorithm • Select the answer type word from the question representation. • Select the word(s) connected to the question stem. Some content-free words are skipped (e.g. “name”). • From the previous set, select the word with the highest connectivity in the question representation. • Map the AT word onto a previously built AT hierarchy • The AT hierarchy is based on WordNet, with some concepts associated with semantic categories, e.g. “writer” → PERSON. • Select the AT(s) from the first hypernym(s) associated with a semantic category.
Answer Type Hierarchy • PERSON • scientist, man of science: researcher, chemist, oceanographer • inhabitant, dweller, denizen: American, westerner, islander (island-dweller) • performer, performing artist: dancer (ballet dancer), actor (tragedian, actress) • What researcher discovered the vaccine against Hepatitis-B? AT word: researcher → PERSON • What is the name of the French oceanographer who owned Calypso? AT word: oceanographer → PERSON
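The last step of AT detection, walking up the hypernyms of the AT word until a concept tagged with a semantic category is reached, might look like the sketch below. The two dictionaries are a hand-built toy stand-in for the WordNet-based hierarchy, loosely following the PERSON subtree above; they allow a single hypernym per concept, which real WordNet does not guarantee.

```python
from typing import Dict, Optional

# Toy hypernym links (child -> parent); an illustrative assumption, not WordNet data.
HYPERNYM: Dict[str, str] = {
    "researcher": "scientist",
    "chemist": "scientist",
    "oceanographer": "scientist",
    "ballet dancer": "dancer",
    "dancer": "performer",
    "tragedian": "actor",
    "actress": "actor",
    "actor": "performer",
    "islander": "inhabitant",
}

# Concepts that carry a semantic category in the AT hierarchy.
CATEGORY: Dict[str, str] = {
    "scientist": "PERSON",
    "performer": "PERSON",
    "inhabitant": "PERSON",
}


def answer_type(at_word: str) -> Optional[str]:
    """Return the category of the first hypernym that has one, else None."""
    concept: Optional[str] = at_word
    seen = set()
    while concept is not None and concept not in seen:
        if concept in CATEGORY:
            return CATEGORY[concept]
        seen.add(concept)               # guard against accidental cycles
        concept = HYPERNYM.get(concept)
    return None
```

For example, `answer_type("oceanographer")` climbs one level to "scientist" and returns its category, while a word missing from the toy hierarchy yields `None`, which corresponds to the unrecoverable error case.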
Evaluation of Answer Type Hierarchy • Controlled variation of the number of WordNet synsets included in the answer type hierarchy. • Test on 800 TREC questions.
Hierarchy coverage | Precision score (50-byte answers)
0% | 0.296
3% | 0.404
10% | 0.437
25% | 0.451
50% | 0.461
• The derivation of the answer type is the main source of unrecoverable errors in the QA system
Keyword Selection • The AT indicates what the question is looking for, but provides insufficient context to locate the answer in a very large document collection • Lexical terms (keywords) from the question, possibly expanded with lexical/semantic variations, provide the required context
Keyword Selection Algorithm • Select all non-stop words in quotations • Select all NNP words in recognized named entities • Select all complex nominals with their adjectival modifiers • Select all other complex nominals • Select all nouns with adjectival modifiers • Select all other nouns • Select all verbs • Select the AT word (which was skipped in all previous steps)
Keyword Selection Examples • What researcher discovered the vaccine against Hepatitis-B? • Hepatitis-B, vaccine, discover, researcher • What is the name of the French oceanographer who owned Calypso? • Calypso, French, own, oceanographer • What U.S. government agency registers trademarks? • U.S., government, trademarks, register, agency • What is the capital of Kosovo? • Kosovo, capital
Passage Retrieval • Extracts and ranks passages using surface-text techniques
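Passage Retrieval Architecture • Keywords → Document Retrieval → Documents → Passage Extraction → Passages → Passage Quality check • Passage quality not acceptable (No): Keyword Adjustment, then retrieval is repeated with the adjusted keywords • Passage quality acceptable (Yes): Passage Scoring → Passage Ordering → Ranked Passages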
Passage Extraction Loop • Passage Extraction Component • Extracts passages that contain all selected keywords • Passage size dynamic • Start position dynamic • Passage quality and keyword adjustment • In the first iteration use the first 6 keyword selection heuristics • If the number of passages is lower than a threshold → query is too strict → drop a keyword • If the number of passages is higher than a threshold → query is too relaxed → add a keyword
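The feedback loop above can be sketched as follows. This is a simplified illustration: passages are modeled as sets of words, keywords arrive already ordered by the selection heuristics, and the decision to drop the lowest-priority active keyword (or add the next one in priority order) is an assumption, since the slide does not say which keyword is dropped or added.

```python
from typing import List, Set, Tuple


def matching(passages: List[Set[str]], active: List[str]) -> List[Set[str]]:
    """Passages that contain all currently active keywords."""
    return [p for p in passages if all(k in p for k in active)]


def passage_loop(ranked_keywords: List[str], passages: List[Set[str]],
                 start: int, low: int, high: int) -> Tuple[List[str], List[Set[str]]]:
    """Adjust the number of active keywords until the hit count is acceptable."""
    n = start
    tried = set()
    while True:
        tried.add(n)
        hits = matching(passages, ranked_keywords[:n])
        if len(hits) < low and n > 1 and (n - 1) not in tried:
            n -= 1          # query too strict: drop a keyword
        elif len(hits) > high and n < len(ranked_keywords) and (n + 1) not in tried:
            n += 1          # query too relaxed: add a keyword
        else:
            return ranked_keywords[:n], hits


ranked = ["hepatitis-b", "vaccine", "discover", "researcher"]
collection = [
    {"hepatitis-b", "vaccine", "discover"},
    {"hepatitis-b", "vaccine"},
    {"hepatitis-b"},
    {"vaccine", "researcher"},
]
active, hits = passage_loop(ranked, collection, start=4, low=2, high=3)
```

The `tried` set keeps the loop from oscillating between two keyword counts when no setting satisfies both thresholds.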
Passage Scoring (1/2) • Passages are scored based on keyword windows • For example, if a question has the keyword set {k1, k2, k3, k4}, and in a passage k1 and k2 are each matched twice, k3 is matched once, and k4 is not matched, the matches appear in the order k1 k2 k3 k2 k1 and four windows are built (Window 1 to Window 4), one for each combination of a k1 match and a k2 match together with the k3 match
Passage Scoring (2/2) • Passage ordering is performed using a radix sort that involves three scores: largest SameWordSequenceScore, largest DistanceScore, smallest MissingKeywordScore. • SameWordSequenceScore • Computes the number of words from the question that are recognized in the same sequence in the window • DistanceScore • The number of words that separate the most distant keywords in the window • MissingKeywordScore • The number of unmatched keywords in the window
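A sort on the three scores can be sketched as below. The score definitions are one possible reading of the slide: SameWordSequenceScore is taken as the longest common subsequence between the question keywords and the window, and the three keys are applied in the order listed (largest, largest, smallest). Python's sort on a tuple key gives the same ordering as a radix sort over the three scores.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def window_scores(keywords: Sequence[str], window: Sequence[str]) -> Tuple[int, int, int]:
    positions = [i for i, w in enumerate(window) if w in keywords]
    same_sequence = lcs_length(keywords, window)
    # Words separating the two most distant keyword matches in the window.
    distance = positions[-1] - positions[0] - 1 if len(positions) > 1 else 0
    missing = len(set(keywords) - set(window))
    return same_sequence, distance, missing


def order_windows(keywords: Sequence[str], windows: List[List[str]]) -> List[List[str]]:
    def key(window: List[str]) -> Tuple[int, int, int]:
        same_sequence, distance, missing = window_scores(keywords, window)
        return (-same_sequence, -distance, missing)
    return sorted(windows, key=key)


question = ["k1", "k2", "k3", "k4"]
window_a = ["k1", "x", "k2", "k3"]
window_b = ["k2", "k1", "k3"]
ranked_windows = order_windows(question, [window_b, window_a])
```

Here `window_a` keeps three keywords in question order and therefore sorts ahead of `window_b`, which keeps only two.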
Answer Extraction • Extracts and ranks answers using NL techniques
Ranking Candidate Answers • Candidate answers are ordered by an answer ranking scheme built on a set of ranking features • Q066: Name the first private citizen to fly in space. • Answer type: Person • Text passage: “Among them was Christa McAuliffe, the first private citizen to fly in space. Karen Allen, best known for her starring role in “Raiders of the Lost Ark”, plays McAuliffe. Brian Kerwin is featured as shuttle pilot Mike Smith...” • Best candidate answer: Christa McAuliffe
Features for Answer Ranking • relNMW – number of question terms matched in the answer passage • relSP – number of question terms matched in the same phrase as the candidate answer • relSS – number of question terms matched in the same sentence as the candidate answer • relFP – flag set to 1 if the candidate answer is followed by a punctuation sign • relOCTW – number of question terms matched, separated from the candidate answer by at most three words and one comma • relSWS – number of terms occurring in the same order in the answer passage as in the question • relDTW – average distance from candidate answer to question term matches • Robust heuristics that work on unrestricted text!
Answer Ranking based on Machine Learning • A relative relevance score is computed for each pair of candidates (answer windows): $rel_{PAIR} = w_{SWS}\,rel_{SWS} + w_{FP}\,rel_{FP} + w_{OCTW}\,rel_{OCTW} + w_{SP}\,rel_{SP} + w_{SS}\,rel_{SS} + w_{NMW}\,rel_{NMW} + w_{DTW}\,rel_{DTW} + threshold$ • If $rel_{PAIR}$ is positive, the first candidate of the pair is more relevant • A perceptron model is used to learn the weights • Published in SIGIR 2001 • MRR scores in the 50% range for short answers and in the 60% range for long answers
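A pairwise perceptron of this shape can be sketched as follows. One assumption is made loudly here: each relative feature of a pair is taken to be the difference between the two candidates' feature values, which the slide does not spell out. The training pairs are toy values, not data from the SIGIR 2001 experiments.

```python
from typing import Dict, List, Tuple

FEATURES = ["SWS", "FP", "OCTW", "SP", "SS", "NMW", "DTW"]
Candidate = Dict[str, float]


def features(**values: float) -> Candidate:
    """Feature vector with every unspecified feature set to 0."""
    return {name: float(values.get(name, 0.0)) for name in FEATURES}


def rel_pair(weights: Candidate, threshold: float,
             first: Candidate, second: Candidate) -> float:
    """Weighted sum of relative features plus threshold (difference form assumed)."""
    return sum(weights[f] * (first[f] - second[f]) for f in FEATURES) + threshold


def train_perceptron(pairs: List[Tuple[Candidate, Candidate, int]],
                     epochs: int = 10) -> Tuple[Candidate, float]:
    """Label is +1 when the first candidate is more relevant, -1 otherwise."""
    weights = features()
    threshold = 0.0
    for _ in range(epochs):
        for first, second, label in pairs:
            if label * rel_pair(weights, threshold, first, second) <= 0:
                for f in FEATURES:          # standard perceptron update on a mistake
                    weights[f] += label * (first[f] - second[f])
                threshold += label
    return weights, threshold


# Toy training data: the better candidate shares a sentence and phrase with question terms.
good = features(SS=2, SP=1)
bad = features(SS=0, SP=0)
weights, threshold = train_perceptron([(good, bad, 1), (bad, good, -1)])
```

Once trained, candidates can be ranked by comparing them pairwise and preferring the first element whenever `rel_pair` is positive.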
Evaluation on the Web • test on 350 questions from TREC (Q250-Q600) • extract 250-byte answers
System Extension: Answer Justification • Experiments with Open-Domain Textual Question Answering. Sanda Harabagiu, Marius Paşca and Steve Maiorano. • Answer justification using unnamed relations extracted from the question representation and the answer representation (constructed through a similar process).
System Extension: Definition Questions • Definition questions ask about the definition or description of a concept: • Who is John Galt? • What is anorexia nervosa? • Many “information nuggets” are acceptable answers • Who is George W. Bush? • … George W. Bush, the 43rd President of the United States… • George W. Bush defeated Democratic incumbent Ann Richards to become the 46th Governor of the State of Texas… • Scoring • Any information nugget is acceptable • Precision score over all information nuggets
Answer Detection with Pattern Matching • For Definition questions • Question patterns: • What <be> a <QP> ? • Who <be> <QP> ? (example: “Who is Zebulon Pike?”) • Answer patterns: • <QP>, the <AP> • <QP> (a <AP>) • <AP HumanConcept> <QP> (example: “explorer Zebulon Pike”)
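The question and answer patterns can be approximated with regular expressions, as in the sketch below. The regexes, the list of verb forms standing in for `<be>`, and the tiny `human_concepts` list are illustrative assumptions; the original system matches patterns over linguistic annotations rather than raw strings.

```python
import re
from typing import List, Optional, Sequence

# <be> is approximated by a few verb forms; <QP> is the phrase to define.
QUESTION_PATTERNS = [
    re.compile(r"^What (?:is|are|was|were) (?:a|an) (?P<qp>.+?)\s*\?$", re.I),
    re.compile(r"^Who (?:is|are|was|were) (?P<qp>.+?)\s*\?$", re.I),
]


def question_phrase(question: str) -> Optional[str]:
    """Extract <QP> from a definition question, or None if no pattern applies."""
    for pattern in QUESTION_PATTERNS:
        match = pattern.match(question.strip())
        if match:
            return match.group("qp")
    return None


def find_definitions(qp: str, text: str,
                     human_concepts: Sequence[str] = ("explorer",)) -> List[str]:
    """Collect <AP> candidates for the three answer patterns."""
    q = re.escape(qp)
    found: List[str] = []
    found += re.findall(q + r", the ([^,.;]+)", text)              # <QP>, the <AP>
    found += re.findall(q + r" \((?:a|an) ([^)]+)\)", text)        # <QP> (a <AP>)
    concepts = "|".join(re.escape(c) for c in human_concepts)
    found += re.findall(r"\b(" + concepts + r") " + q, text)       # <AP HumanConcept> <QP>
    return found


qp = question_phrase("Who is Zebulon Pike?")
hits_concept = find_definitions(qp, "The route taken by explorer Zebulon Pike is shown.")
hits_apposition = find_definitions(qp, "Zebulon Pike, the explorer, is mentioned here.")
```

The two example sentences are made up to exercise the patterns; in the slides, the list of human concepts would come from WordNet rather than a hard-coded tuple.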
Answer Detection with Concept Expansion • Enhancement for Definition questions • Identify terms that are semantically related to the phrase to define • WordNet hypernyms (more general concepts) • [published in AAAI Spring Symposium 2002]
Evaluation on Definition Questions • Determine the impact of answer type detection with pattern matching and concept expansion • test on the Definition questions from TREC-9 and TREC-10 (approx. 200 questions) • extract 50-byte answers • Results • precision score: 0.56 • questions with a correct answer among top 5 returned answers: 0.67
References • Marius Paşca. High-Performance, Open-Domain Question Answering from Large Text Collections, Ph.D. Thesis, Computer Science and Engineering Department, Southern Methodist University, Defended September 2001, Dallas, Texas • Marius Paşca. Open-Domain Question Answering from Large Text Collections, Center for the Study of Language and Information (CSLI Publications, series: Studies in Computational Linguistics), Stanford, California, Distributed by the University of Chicago Press, ISBN (Paperback): 1575864282, ISBN (Cloth): 1575864274. 2003
Overview • What is Question Answering? • A “traditional” system • Other relevant approaches • LCC’s PowerAnswer + COGEX • IBM’s PIQUANT • CMU’s Javelin • ISI’s TextMap • BBN’s AQUA • Distributed Question Answering
PowerAnswer + COGEX (1/2) • Automated reasoning for QA: prove A → Q using a logic prover. Facilitates both answer validation and answer extraction. • Both question and answer(s) are transformed into logic forms. Example: • Heavy selling of Standard & Poor’s 500-stock index futures in Chicago relentlessly beat stocks downwards. • Heavy_JJ(x1) & selling_NN(x1) & of_IN(x1,x6) & Standard_NN(x2) & &_CC(x13,x2,x3) & Poor(x3) & ‘s_POS(x6,x13) & 500-stock_JJ(x6) & index_NN(x4) & futures(x5) & nn_NNC(x6,x4,x5) & in_IN(x1,x8) & Chicago_NNP(x8) & relentlessly_RB(e12) & beat_VB(e12,x1,x9) & stocks_NN(x9) & downward_RB(e12)
PowerAnswer + COGEX (2/2) • World knowledge from: • WordNet glosses converted to logic forms in the eXtended WordNet (XWN) project (http://www.utdallas.edu/~moldovan) • Lexical chains • game:n#3 → HYPERNYM → recreation:n#1 → HYPONYM → sport:n#1 • Argentine:a#1 → GLOSS → Argentina:n#1 • NLP axioms to handle complex NPs, coordinations, appositions, equivalence classes for prepositions, etc. • Named-entity recognizer • John Galt → HUMAN • A relaxation mechanism is used to iteratively uncouple predicates and remove terms from LFs. The proofs are penalized based on the amount of relaxation involved.
IBM’s Piquant • Question processing conceptually similar to SMU, but a series of different strategies (“agents”) available for answer extraction. For each question type, multiple agents might run in parallel. • Reasoning engine and general-purpose ontology from Cyc used as sanity checker. • Answer resolution: remaining answers are normalized and a voting strategy is used to select the “correct” (meaning most redundant) answer.
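The voting step in answer resolution can be illustrated with a very small sketch: normalize each agent's answer and pick the one proposed most often. The normalization (lowercasing and whitespace collapsing), the agent names and the candidate answers are illustrative assumptions; this is not PIQUANT's actual resolution code.

```python
from collections import Counter
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Crude canonical form: lowercase with single spaces."""
    return " ".join(answer.lower().split())


def vote(candidates: List[Tuple[str, str]]) -> Tuple[str, int]:
    """Return the most redundant normalized answer and its vote count.

    `candidates` holds (agent name, answer string) pairs and must be non-empty.
    """
    counts = Counter(normalize(answer) for _agent, answer in candidates)
    return counts.most_common(1)[0]


proposals = [
    ("predictive-annotation", "Mount Etna"),
    ("statistical", "mount  etna"),
    ("pattern-based", "Vesuvius"),
]
winner = vote(proposals)
```

As the slide notes, the winner is only the most redundant answer, which is why a separate sanity check and confidence model are still useful.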
Piquant QA Agents • Predictive annotation agent • “Predictive annotation” = the technique of indexing named entities and other NL constructs along with lexical terms. Lemur has built-in support for this now. • General-purpose agent, used for almost all question types. • Statistical Query Agent • Derivation from a probabilistic IR model, also developed at IBM. • Also general-purpose. • Description Query • Generic descriptions: appositions, parenthetical expressions. • Applied mostly to definition questions. • Structured Knowledge Agent • Answers from WordNet/Cyc. • Applied whenever possible. • Pattern-Based Agent • Looks for specific syntactic patterns based on the question form. • Applied when the answer is expected in a well-structured form. • Dossier Agent • For “Who is X?” questions. • A dynamic set of factual questions used to learn “information nuggets” about persons.
Pattern-based Agent • Motivation: some questions (with or without AT) indicate that the answer might be in a structured form • What does Knight Rider publish? → transitive verb, missing object → Knight Rider publishes X. • Patterns generated: • From a static pattern repository, e.g. birth and death dates recognition. • Dynamically from the question structure. • Matching of the expected answer pattern with the actual answer text is not at word level, but at a higher linguistic level based on full parse trees (see IE lecture).
Dossier Agent • Addresses “Who is X?” questions. • Initially generates a series of generic questions: • When was X born? • What was X’s profession? • Future iterations are decided dynamically based on the previous answers. • If X’s profession is “writer” the next question is: What did X write? • A static ontology of biographical questions is used.
Cyc Sanity Checker • Post-processing component that • Rejects insane answers • “How much does a grey wolf weigh?” • “300 tons” • A grey wolf IS-A wolf. Weight of a wolf known in Cyc. • Cyc returns: SANE, INSANE, or DON’T KNOW. • Boosts answer confidence when the answer is SANE. • Typically called for numerical answer types: • What is the population of Maryland? • How much does a grey wolf weigh? • How high is Mt. Hood?
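The three-valued behaviour of the sanity checker can be sketched with a toy knowledge base. Nothing below is the Cyc API: the `IS_A` links, the unit table and especially the weight range are placeholder values invented for the illustration.

```python
from typing import Dict, Optional, Tuple

IS_A: Dict[str, str] = {"grey wolf": "wolf"}
# Placeholder plausibility range in kilograms; not taken from Cyc.
WEIGHT_RANGE_KG: Dict[str, Tuple[float, float]] = {"wolf": (10.0, 100.0)}
TO_KG: Dict[str, float] = {"kg": 1.0, "tons": 1000.0}


def check_weight(entity: str, value: float, unit: str) -> str:
    """Return SANE, INSANE, or DON'T KNOW for a candidate weight answer."""
    concept: Optional[str] = entity
    # Climb IS-A links until a concept with a known weight range is found.
    while concept is not None and concept not in WEIGHT_RANGE_KG:
        concept = IS_A.get(concept)
    if concept is None or unit not in TO_KG:
        return "DON'T KNOW"
    low, high = WEIGHT_RANGE_KG[concept]
    kilograms = value * TO_KG[unit]
    return "SANE" if low <= kilograms <= high else "INSANE"
```

With these placeholder values, "300 tons" for a grey wolf is rejected, a weight inside the range is accepted, and an entity unknown to the toy knowledge base yields DON'T KNOW.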
Answer Resolution • Called when multiple agents are applied for the same question. Distribution of agents: the predictive-annotation and the statistical agent by far the most common. • Each agent provides a canonical answer (e.g. normalized named entity) and a confidence score. • Final confidence for each candidate answer computed using a ML model with SVM.
CMU’s Javelin • Architecture combines SMU’s and IBM’s approaches. • Question processing close to SMU’s approach. • Passage retrieval loop conceptually similar to SMU’s, but with an elegant implementation. • Multiple answer strategies similar to IBM’s system. All of them are based on ML models (K nearest neighbours, decision trees) that use shallow-text features (close to SMU’s). • Answer voting, similar to IBM’s, used to exploit answer redundancy.
Javelin’s Retrieval Strategist • Implements passage retrieval, including the passage retrieval loop. • Uses the INQUERY IR system, probably Lemur by now. • The retrieval loop initially requires all keywords in close proximity of each other (stricter than SMU). • Subsequent iterations relax the following query terms • Proximity for all question keywords: 20, 100, 250, AND • Phrase proximity for phrase operators: less than 3 words or PHRASE • Phrase proximity for named entities: less than 3 words or PHRASE • Inclusion/exclusion of AT word • Accuracy for TREC-11 queries, measured as the share of questions with at least one correct document in the top N documents: • Top 30 docs: 80% • Top 60 docs: 85% • Top 120 docs: 86%
ISI’s TextMap: Pattern-Based QA • Examples • Who invented the cotton gin? • <who> invented the cotton gin • <who>'s invention of the cotton gin • <who> received a patent for the cotton gin • How did Mahatma Gandhi die? • Mahatma Gandhi died <how> • Mahatma Gandhi drowned • <who> assassinated Mahatma Gandhi • Patterns generated from the question form (similar to IBM), learned using a pattern discovery mechanism, or added manually to a pattern repository • The pattern discovery mechanism performs a series of generalizations from annotated examples: • Babe Ruth was born in Baltimore, on February 6, 1895. • PERSON was born *g* in DATE
TextMap: QA as Machine Translation • In machine translation, one collects translation pairs (s, d) and learns a model of how to transform the source s into the destination d. • QA is redefined in a similar way: collect question-answer pairs (a, q) and learn a model that computes the probability that a question is generated from the given answer: p(q | parsetree(a)). The correct answer maximizes this probability. • Only the subsets of answer parse trees where the answer lies are used for training (not the whole sentence). • An off-the-shelf machine translation package (GIZA) is used to train the model.
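Written as a formula, the selection rule stated above amounts to the following, where $\hat{a}$ (the chosen answer) and $\mathcal{A}(q)$ (the set of candidate answers considered for question $q$) are notation introduced here for illustration:

```latex
\hat{a} \;=\; \arg\max_{a \,\in\, \mathcal{A}(q)} \; p\bigl(q \mid \mathrm{parsetree}(a)\bigr)
```

The model is trained in the translation direction from answer to question, so at answer time every candidate is scored by how well it "translates" into the observed question.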