High Precision Interactive Question Answering (HIREQA-ML) Language Computer Corporation Sanda Harabagiu, PI John Lehmann, John Williams, Finley Lacatusu, Andrew Hickl, Robert Hawes, Paul Aarseth, Luke Nezda, Jeremy Bensley, Patrick Wang, Seth Cleveland and Ricardo Ruiz
Innovative Claims • Multi-Strategy Q/A √ • Innovations in Question Decomposition and Answer Fusion • Bootstrapping Q/A √ • Predictive Interactive QA √ • Recognition of Incorrect and Missing Answers • Processing Negation in Q/A
Multi-Strategy Q/A • Our fundamental premise is that progress in Q/A cannot be achieved by enhancing the processing components alone; it also requires selecting the best strategy for processing each individual question. • Current pipeline approach: • Question processing • Passage retrieval • Answer selection • Complex questions require processing that also considers: • The context/topic of the questioning • The previous interactions
Multi-Strategy Question Answering • Strategies based on question type, question focus, question topic, paraphrases • Strategies that impose passage retrieval by normalized keyword selection and web relevance • Strategies that extract and fuse answers based on kernel methods/counter-training
Multi-strategy architecture (diagram): an English question flows through Question Processing Strategies (Question Analysis 1 … n, supported by Question Decomposition), Passage Retrieval Strategies (Passage Retrieval 1 … n), and Answer Resolution Strategies (Answer Selection 1 … n, supported by Counter-Training for Answer Extraction); Answer Fusion combines the results into an English answer. Interactive Question Answering draws on the User Background and a collection of 1 million question/answer pairs.
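A minimal sketch of how such a multi-strategy pipeline could be composed. The types and strategy functions below are placeholders for components the slides only name, and the naive score-based fusion stands in for the kernel-method/counter-training fusion described above:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Question:
    text: str

@dataclass
class Answer:
    text: str
    score: float

@dataclass
class Strategy:
    # One full path through the pipeline: question analysis, passage
    # retrieval, and answer selection, each swappable per question.
    analyze: Callable[[Question], Dict]
    retrieve: Callable[[Dict], List[str]]
    select: Callable[[Dict, List[str]], List[Answer]]

def answer_question(question: Question, strategies: List[Strategy]) -> List[Answer]:
    """Run every strategy and fuse the candidate answers by score."""
    candidates: List[Answer] = []
    for strategy in strategies:
        analysis = strategy.analyze(question)
        passages = strategy.retrieve(analysis)
        candidates.extend(strategy.select(analysis, passages))
    # Naive score-based fusion; the real system fuses answers with kernel
    # methods and counter-training.
    return sorted(candidates, key=lambda a: a.score, reverse=True)
```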
Problems Addressed • Flexible architecture • Enables the implementation of multiple strategies • Palantir • Predictive Q/A – the Ferret dialogue architecture • Large set of question/answer pairs: the 1 million mark • Evaluations: • TREC • QA with Relations • Dialogue evaluations in the ARDA Challenge Workshop
Palantir • Rewritten in Java – before the project started • ~70,000 lines of code • 139 named entities • 150 answer types • Allows for: • Multiple forms of question analysis • Multiple passage retrieval strategies • Multiple answer selection methods • Incorporation of context in question processing and answer selection • Modeling of Topic for Q/A in Restricted Domains
What is a topic? • A topic represents an information need that can be cast into a template representation. • Inherently similar to information extraction tasks: • Management Succession • Natural Disasters • Market Change • Movement of People
What is a scenario? • A scenario is a description of a problem that requires a brief 1-2 page report that clearly identifies findings, justifications, and gaps in the information. • It includes a number of subject areas that together form a meta-template: • Country Profile • Government: Type, Leadership, Relations • Military Organization and Operations: Army, Navy, Air Force, Leaders, Capabilities, Intentions • Allies/Partners: Countries, Coalition Members, Forces • Weapons: Conventional, Chemical, Biological, Materials, Facilities, Stockpiles, Access, Research Efforts, Scientists
Scenario Example: As terrorist activity in Egypt increases, the Commander of the United States Army believes a better understanding of Egypt’s military capabilities is needed. Egypt’s biological weapons database needs to be updated to correspond with the Commander’s request. Focus your investigation on Egypt’s access to old technology, assistance received from the Soviet Union for development of their pharmaceutical infrastructure, production of toxins and BW agents, stockpiles, exportation of these materials and development technology to Middle Eastern countries, and the effect that this information will have on the United States and Coalition Forces in the Middle East. Please incorporate any other related information to your report.
Examples of Questions Egypt’s Biological Weapons Stockpiles What biological weapons agents may be included in Egypt’s BW stockpiles? Where did Egypt obtain its first stockpiles of chemical weapons? Is there evidence that Egypt has dismantled its stockpiles of chemical and biological weapons? Did Egypt get rid of its stockpiles of chemical and biological weapons? Will Egypt dismantle its CBW stockpiles?
Relation Extraction and Q/A (example templette: Egypt CBW Stockpile) • A slot in a template/templette represents an implicit relation to a topic. • These relations encode a set of predications that are relevant to the relation: • Dismantle: <Arg0= Egypt, Arg1= Stockpile (CBW)> • Inherit: <Arg0= Egypt, Arg1= Stockpile, Arg2= British> • These predications need to be discovered and associated with question-answer pairs.
Creating Question/Answer Pairs How do we create question/answer pairs? • Convert each template slot to a topic theme. • Use incremental topic representations to discover all available predications. • Select the text surrounding each predication as the answer of the pair. • Generate questions by automatically paraphrasing the selected answer.
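A sketch of these four steps. The callables `find_predications` and `paraphrase` are assumptions: they stand in for the incremental topic-representation discovery and the automatic paraphraser, which the slide names but does not specify:

```python
def create_qa_pairs(template_slots, documents, find_predications, paraphrase):
    """Turn template slots into question/answer pairs (sketch of steps 1-4)."""
    pairs = []
    for slot in template_slots:
        theme = slot.lower().replace("_", " ")            # step 1: slot -> topic theme
        for predication, context in find_predications(theme, documents):  # step 2
            answer = context                              # step 3: surrounding text is the answer
            for question in paraphrase(answer, predication):              # step 4
                pairs.append((question, answer))
    return pairs
```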
Large Set of Question-Answer Pairs • Used human-generated question/answer pairs for the CNS data • Results: 5,134 pairs • Harvesting of question/answer pairs from funtrivia.com • Over 600,000 pairs • Created manual paraphrases of all questions evaluated in TREC until now
Efficient Topic Representation as Topic Signatures • Topics may be characterized through a lexically determined topic signature (Lin and Hovy 2000) • TS = {topic,<(t1,w1),(t2,w2),…,(tn,wn)>} • Where each of the terms ti is tightly correlated with the topic, with an associated weight wi • Example • TS = {“terrorism”, <(bomb, 0.92),(attack, 0.89), (killing,0.83),…)} • The terms and the weights are determined by using the likelihood ratio
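A hedged sketch of building such a signature, scoring each term with Dunning's log-likelihood ratio against a background collection. The whitespace tokenization and the split into "relevant" versus "background" documents are illustrative simplifications, not the authors' exact setup:

```python
import math
from collections import Counter

def log_likelihood_ratio(k1, n1, k2, n2):
    """Dunning's -2 log lambda for a term seen k1 times in n1 topic-relevant
    tokens and k2 times in n2 background tokens."""
    def ll(k, n, p):
        p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p)
    p, p1, p2 = (k1 + k2) / (n1 + n2), k1 / n1, k2 / n2
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

def topic_signature(relevant_docs, background_docs, top_n=20):
    """Return the top_n (term, weight) pairs most associated with the topic."""
    rel = Counter(w for d in relevant_docs for w in d.split())
    bg = Counter(w for d in background_docs for w in d.split())
    n1, n2 = sum(rel.values()), sum(bg.values())
    scores = {t: log_likelihood_ratio(rel[t], n1, bg[t], n2) for t in rel}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```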
Our Findings • Topics are not characterized only by terms; there are also relations between topics and concepts that need to be identified. • Assumption (Harabagiu 2004) • The vast majority of topic-relevant relations take place between verbs/nominalizations and other nouns • Topic representations can be produced in two iterations: • TS1 = {topic,<(t1,w1),(t2,w2),…,(tn,wn)>} • TS2 = {topic,<(r1,w1),(r2,w2),…,(rm,wm)>} where ri is a binary relation between two topic concepts • How do we determine the relations ri? • The idea: • Start with a seed relation rs • Discover new relations relevant to the topic
Selecting the seed relation • Three-step procedure: • Morphological expansion of all lexemes relevant to the topic • Semantic normalization (based on ontological resources) • Selection of the predominant [predicate-argument] relation that is syntactically expressed as a V-Subject, V-Object, or V-PP (attachment) relation
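As a sketch of the final step, assuming the relation tuples have already been morphologically expanded and semantically normalized by steps 1 and 2, the seed can simply be the most frequent remaining verb-argument relation:

```python
from collections import Counter

def select_seed_relation(relation_tuples):
    """Pick the predominant (verb, role, argument) relation as the seed.

    `relation_tuples` is assumed to be a list of (verb_lemma, role, arg_lemma)
    triples produced by the earlier normalization steps; only the
    subject/object/PP-attachment roles are considered.
    """
    counts = Counter(
        (verb, role, arg)
        for verb, role, arg in relation_tuples
        if role in {"subject", "object", "pp"}
    )
    seed, _ = counts.most_common(1)[0]
    return seed   # e.g. ("explode", "subject", "truck")
```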
Topic Relations • Two forms of topic relations are considered • Syntax based relations between the VP and its subject, object, or prepositional attachment • C-relations representing relations between events and entities that cannot be identified by syntactic constraints • C-relations are motivated by • Frequent collocations of certain nouns with the topic verbs or normalizations • An approximation of intra-sentential centering introduced in (Kameyama, 1997)
The Model of Discovering Topic Relations • Step 1: Retrieve relevant passages • Use the seed relation and any newly discovered relations • Step 2: Generate candidate relations • Two types of relations: • Syntax-based relations • Salience-based relations • Step 3: Rank the relevance of the relations • Step 4: Add the relevant relations to the topic representation • Step 5: Determine the continuation/stop condition (counter-training, Yangarber 2003)
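A skeleton of this discovery loop. The callables `retrieve`, `generate_candidates`, and `rank` are assumptions standing in for the passage-retrieval, candidate-generation, and ranking components, and the simple stop condition only approximates the counter-training criterion:

```python
def discover_topic_relations(seed, retrieve, generate_candidates, rank,
                             max_iterations=10):
    """Iteratively grow the set of topic relations from a seed (steps 1-5)."""
    accepted = {seed}
    for _ in range(max_iterations):
        passages = retrieve(accepted)                  # step 1
        candidates = generate_candidates(passages)     # step 2
        ranked = rank(candidates, accepted, passages)  # step 3
        new = {r for r, score in ranked if score > 0 and r not in accepted}
        if not new:                                    # step 5: stop when nothing new
            break
        accepted |= new                                # step 4
    return accepted
```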
Syntax-Based Relations • From every document in Dq, extract all: • Verb-subject, verb-object, and verb-PP-attachment relations • Recognized by an FSA-based parser • Expand syntax-based relations: • Replace each word with its root form • “wounded” -> “wound” • “trucks” -> “truck” • Replace each word with any of the concepts that subsume it in a hand-crafted ontology • “truck” -> VEHICLE • “truck” -> ARTIFACT • Replace each named entity with its name class • “Bank of America” -> ORGANIZATION
Expansion of Relations • “exploded truck” -> 1. explode truck; 2. explode VEHICLE; 3. explode ARTIFACT; 4. explode OBJECT; 5. EXPLODE_WORD truck; 6. EXPLODE_WORD VEHICLE; 7. EXPLODE_WORD ARTIFACT; 8. EXPLODE_WORD OBJECT • “exploded Colombo” -> 1. explode Colombo; 2. explode CITY_NAME; 3. explode LOCATION_NAME; 4. EXPLODE_WORD Colombo; 5. EXPLODE_WORD CITY_NAME; 6. EXPLODE_WORD LOCATION_NAME
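A small sketch of this expansion, using toy dictionaries in place of the real lemmatizer, ontology, and named-entity recognizer; the entries mirror the example above:

```python
# Toy resources (assumptions) standing in for the real lexical resources.
LEMMAS = {"exploded": "explode", "trucks": "truck"}
ONTOLOGY = {"truck": ["VEHICLE", "ARTIFACT", "OBJECT"]}
NAME_CLASSES = {"Colombo": ["CITY_NAME", "LOCATION_NAME"]}
VERB_CLASSES = {"explode": "EXPLODE_WORD"}

def expand_relation(verb, arg):
    """Generate all (verb, argument) variants: lemmatize each side, then
    replace it with every class that subsumes it."""
    verb = LEMMAS.get(verb, verb)
    arg = LEMMAS.get(arg, arg)
    verb_forms = [verb, VERB_CLASSES.get(verb, verb)]
    arg_forms = [arg] + ONTOLOGY.get(arg, []) + NAME_CLASSES.get(arg, [])
    return {(v, a) for v in verb_forms for a in arg_forms}

# expand_relation("exploded", "trucks") yields the 8 variants above, e.g.
# ("explode", "VEHICLE") and ("EXPLODE_WORD", "ARTIFACT").
```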
Salience-Based C-relations • Additional topic relations may be discovered within a salience window for each verb: a window of k=2 sentences preceding and succeeding the verb. • The NPs of each salience window are extracted and ordered. • Basic hypothesis: a C-relation between a verb and an entity from its domain of relevance is similar to an anaphoric relation between entities in texts.
Candidate C-relations • In each salience window, [Trigger Verb -> NPi] relations are: • Created • Expanded in the same way as syntax-based relations • Caveat: when considering the expansion of [Trigger Verb -> NPj], an expansion is allowed only if it was not already introduced by the expansion of some other [Trigger Verb -> NPk] with k < j
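A sketch of candidate C-relation generation over a salience window, with the caveat implemented as a `seen` set. Here `expand` is the relation-expansion function, and `sentences` is assumed to be a list of NP lists, one per sentence:

```python
def candidate_c_relations(trigger_verb, sentences, verb_index, expand, k=2):
    """Pair a trigger verb with NPs from k sentences on either side of it,
    keeping only expansions not already introduced by an earlier NP."""
    seen = set()
    relations = []
    window = sentences[max(0, verb_index - k): verb_index + k + 1]
    for nps in window:
        for np in nps:
            for variant in expand(trigger_verb, np):
                if variant not in seen:   # caveat: skip duplicate expansions
                    seen.add(variant)
                    relations.append(variant)
    return relations
```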
Step 3: Rank Candidate Relations • Following the method introduced in AutoSlog (Riloff 1996), each relation is ranked based on its: • Relevance rate • Frequency: the number of times the relation is identified in R • In a single document, one relation may be identified multiple times; Count is the maximum number of times the relation is recognized in any given document. • Relations whose relevance rate falls below a threshold are discarded; only relations whose frequency exceeds a threshold are considered.
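A hedged sketch of this ranking step. The relevance-rate definition and the threshold values below are illustrative stand-ins for the formulas that appeared on the original slide, not LCC's exact method; documents are assumed to be lists of recognized relations:

```python
def rank_relations(candidates, relevant_docs, all_docs,
                   min_relevance=0.5, min_frequency=2):
    """Score candidate relations by relevance rate and frequency.

    Relevance rate: share of a relation's occurrences that fall in
    topic-relevant documents. Frequency: how often it appears there.
    Relations below either (illustrative) threshold are discarded.
    """
    def count_in(relation, docs):
        # A relation may be identified multiple times per document.
        return sum(doc.count(relation) for doc in docs)

    ranked = []
    for relation in candidates:
        freq_rel = count_in(relation, relevant_docs)
        freq_all = count_in(relation, all_docs)
        if freq_all == 0:
            continue
        relevance = freq_rel / freq_all
        if relevance >= min_relevance and freq_rel >= min_frequency:
            ranked.append((relation, relevance * freq_rel))
    return sorted(ranked, key=lambda x: x[1], reverse=True)
```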
FERRET Overview • LCC’s Ferret Dialogue System was designed to study analysts’ extended interaction with: • Current “best practices” in Automatic Question/Answering • Production quality Q/A system • Extensive domain-specific knowledge • Human-created question/answer pairs from document collection • User-interface that mimicked common web-browsing applications • Reduced time-to-learn, overall complexity of task
The story so far • Experts need access to high-precision Q/A systems. • Novices need access to sources of domain-specific knowledge. • Systems need to seamlessly integrate multiple sources of information. • Systems need to account for varying levels of expertise “on the fly”. • We hypothesize that these four goals can be addressed through a dialogue interface. • Systems should provide sophisticated information and be easy to use. • Systems need not be overprovisioned with functionality to be effective. • Systems should make use of analysts’ existing computer and research skills.
Designing Ferret Three overarching design principles for Ferret: • High-Precision Q/A (Palantir): • Offer users full-time use of the state-of-the-art Palantir automatic Q/A system. • Return both automatically-generated answers and full document contexts. • Question-Answer Base (QUAB): • Provide users with an extensive domain-specific knowledge base. • Organize information into human-crafted question-answer pairs. • Return both identified answers and full document contexts. • Simple Integrated Interface: • Provide both automatically-generated and QUAB answers simultaneously. • Mimic the functionality of browser applications that users will be familiar with. • Reduce confusion and time to learn by providing a minimum of “extra” tools.
Two Kinds of Answers • Ferret provides two kinds of answers to questions: • Answers derived from LCC’s Palantir automatic question-answering system (e.g., for “Does Al-Qaeda have biological or chemical weapons?”) • Answers extracted from a human-generated “question-answer base” (QUAB) containing over 5,000 domain-specific question-answer pairs created from the workshop document collection.
Potential Challenges for QUAB …and some challenges as well: • Information Content: Will developers be able to identify information that will be consistently useful to analysts? • Scope and Coverage: How many questions does it take to get adequate coverage? How much time does it take to create such a collection? • Relevance: How do you determine which QUAB pairs should be returned for a user’s query? • Adoption: Will analysts accept information provided by non-expert users? • Integration: How do you add this new source of information to an existing interactive Q/A architecture?
Building the QUAB Collection • A team of 6 developers (with no particular expertise in the domain topics) was tasked with creating the QUAB collection. • For each scenario, developers were asked to identify passages in the document collection which might prove useful to someone conducting research on the domain. • Once these snippets were extracted, developers created a question which could be answered by the text passage. • Example (Topic: Libya’s CBW Programs) – Passage: “In this volatile region, the proliferation of NBC weapons and the means to deliver them poses a significant challenge to our ability to achieve these goals. Iran, Iraq, and Libya are aggressively seeking NBC weapons and missile capabilities, constituting the most pressing threats to regional stability.” – Question: “Is Libya seeking CBW weapons?”
Distribution of the QUAB Collection • 5147 hand-crafted question-answer pairs in QUAB • 3210 pairs for 8 “Testing” domains • 342 pairs for 6 “Training” domains • 1595 terrorism-related pairs added to augment the training data • Approximately 180 person-hours needed to build the QUAB collection
Selecting QUAB Pairs • Ferret employs a complex concept-matching system to identify the QUAB questions that are most appropriate for a user’s particular query. • Example (actual dialogue: Day 4, CH8 – Question 1) – User query: What is North Korea’s current CW weapons capability? – Matched QUAB questions: Where does North Korea weaponize its CW? What are North Korea’s CW capabilities? What chemical weapons capabilities did North Korea have prior to 1980? When was North Korea first capable of producing chemical weapons in large quantities? What could motivate North Korea to make a chemical weapons attack? How many tons of CW is North Korea expected to be able to produce annually? Which countries have or are pursuing CW capabilities? Does North Korea have the ability to rapidly prepare CW and BW? How has the unavailability of CW precursors affected North Korea’s ability to produce certain kinds of CW? How much did Iran pay North Korea to develop a ballistic missile capable of carrying a chemical weapons payload?
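The slides do not detail the concept-matching system; as an illustrative stand-in, a simple bag-of-words cosine similarity can rank QUAB questions against the user's query:

```python
import math
from collections import Counter

def similarity(question_a, question_b):
    """Cosine similarity over bag-of-words term vectors (an assumption, not
    Ferret's actual concept matcher)."""
    a = Counter(question_a.lower().split())
    b = Counter(question_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_quab_pairs(user_question, quab, n=10):
    """Return the n QUAB (question, answer) pairs whose questions best match
    the user's query, mirroring the 10 pairs Ferret displays."""
    return sorted(quab, key=lambda qa: similarity(user_question, qa[0]),
                  reverse=True)[:n]
```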
The Ferret Interface • The control bar found at the top of every screen replicates basic browser functions such as navigation (Back, Forward, History), text searching (Find), copy-and-paste, and query submission (Ask). • Basic on-line help is also provided.
The Ferret Interface • Answers are presented simultaneously on the browser’s screen: • Palantir answers are listed on the left-hand side of the screen. • QUAB pairs are found on the right-hand side of the screen.
Number of Answers • The top 150 Palantir answers (ranked in order of relevance) are returned for each query. • Keywords are presented in bold; document links in blue. • The top 10 QUAB pairs (also ranked in order of relevance) are returned for each question submitted to Ferret. • Only a short snippet of each answer is presented on the main screen.
Getting Full Docs from Palantir Answers • Only short snippets of Palantir answers and QUAB pairs are presented on the main screen. • Users can click on links associated with each Palantir snippet to view the full text of the document that the answer was extracted from. (The actual answer text is highlighted in yellow.)
Getting Full Docs from QUAB Pairs • With QUAB question-answer pairs, users can click on the link on the main screen to receive the full text of the answer identified by the annotator. • Users can click a link on the screen displaying the full answer text to view the entire text of the document. (Again, the actual answer text is highlighted in yellow.)
Using QUAB as a Source for Palantir • Users can also re-submit a QUAB question to Ferret’s Palantir automatic Q/A system by clicking the “Find more answers like this one” link on the QUAB answer page. • This function allows users to check the answer provided in QUAB against other potential answers found in the document collection.
2004 ARDA AQUAINT Dialogue Workshop • This summer’s ARDA-sponsored Dialogue Workshop provided us with an opportunity to test the effectiveness of Ferret in an extensive real-world experiment. • 3 weeks at Pacific Northwest National Laboratory • 8 intelligence analysts (7 Navy, 1 Army) • 16 “real” research scenarios (AFRL, Rome Labs) • 4 participating systems (Ferret, GINKO (Cyc), HITIQA (Albany), GNIST (NIST)) • Workshop opportunities: • First chance to gather extensive data on these kinds of dialogues • Feedback from actual end-users of interactive systems • Interaction in real time • Dialogues produced by “real” analysts in “real” scenarios • Opportunity to team with other system developers • Chance to demo systems at October’s AQUAINT meeting
Types of User Interactions • 500 questions were either asked or selected across 16 sessions • (Average: 31.25 questions/session)
Type of Questions: User Comparison (chart: per-user counts of QUAB questions, user questions, and “Find More” requests) • Significant difference in terms of the numbers of QUAB questions selected by users (p > 0.05).
Dialogue Preferences • Users typically consult at least one QUAB pair for almost every question submitted to Palantir. • User Dialogues: Streaks of User Questions • Average: 1.84 intervening non-user questions • Minimum: 0.66 questions; Maximum: 2.75 questions • System Dialogues: Streaks of QUAB Selections • Average: 2.68 non-user Qs • Minimum: 1.74 questions; Maximum: 3.67 questions;
How deep do analysts search for answers? • Analysts tended to search through almost all of the QUAB answers returned, but only looked at about 1/5 of the Palantir answers. • Average depth of search: • QUAB full answer: 88.8% (8.8th answer) • QUAB document: 75.9% (7.6th answer) • Palantir document: 18.34% (27.5th answer)
Lessons Learned • Results from the workshop allow us to see the first examples of extensive unsupervised interactions with Q/A systems. • Analysts’ dialogues are markedly different from human dialogues: • Topics shift (and are returned to) repeatedly • Analysts can lapse into “search engine” strategies • They are not constrained by a need to establish topics, a domain of interest, or a question under discussion • Ultimately, the best systems will need to be flexible and reactive: • Anticipate potential information needs • Refrain from imposing particular information-seeking discourse structures on users • Be able to track multiple topics simultaneously
Questions? Comments? Thank you!