JAVELIN II: Scenarios and Variable Precision Reasoning for Advanced QA from Multilingual, Distributed Sources
Eric Nyberg, Teruko Mitamura, Jamie Callan, Jaime Carbonell, Bob Frederking
Language Technologies Institute, Carnegie Mellon University
AQUAINT 6-month Meeting 10/08/04
JAVELIN II Research Areas
Example dialog:
User: “I’m focusing on the new Iraqi minister Al Tikriti. What can you tell me about his family and business associates?”
System: <displays instantiated scenario> (diagram labels e1-e5, r1-r4)
User: “Can you find more about his brother-in-law’s business associates?”
Research areas:
1. Scenario Dialog
2. Scenario Representation
3. Distributed, Multilingual Retrieval
4. Multi-Strategy Information Gathering
5. Variable-Precision Knowledge Representation & Reasoning
6. Answer Visualization and Scenario Refinement
Other diagram labels: Scenario Reasoning, Search Guidance, Answer Justification, Belief Revision, NL Parsing, Pattern Matching, Statistical Extraction, Emerging Fact Base, Relevant Documents
Recent Highlights
• Multi-Strategy Information Gathering
  • Participation in Relationship Pilot
  • Training Extractors with MinorThird
• Variable-Precision KR and Reasoning
  • Text Processor Module (1st version complete)
  • Fact Base (1st prototype complete)
• Distributed, Multilingual QA
  • Keyword Translation for CLQA (English to Chinese)
Relationship Pilot
• 50 sample scenarios, e.g.: The analyst is interested in knowing if a particular country is a member of an international organization. Is Chechnya a member of the United Nations?
• Phase I JAVELIN system was used with manual tweaking
  • Output of Question Analyzer module was manually corrected
  • Decompose into subquestions (17 of 50 scenarios)
  • Gather key terms from background text
NIST Evaluation Methodology
• Two categories of information “nuggets”
  • vital: must be present
  • okay: relevant but not necessary
• Each item could match more than one nugget
• Recall determined by vital nuggets
• Precision based on answer length
• Computed F-scores with recall 3 times as important as precision
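The weighting in the last bullet is commonly expressed with the F-beta measure. The sketch below assumes that "recall 3 times as important as precision" corresponds to beta = 3; the slide does not give the exact formula NIST used, nor how the length-based precision was computed, so the function simply takes precision and recall as inputs. The function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta = 3 is an assumption
    matching "recall 3 times as important as precision".
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0  # also avoids division by zero when both are 0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 3 the formula reduces to 10*P*R / (9*P + R).
perfect_recall_half_precision = f_beta(0.5, 1.0)
half_recall_perfect_precision = f_beta(1.0, 0.5)
```

With these toy inputs, perfect recall at half precision scores much higher than the reverse, which is the intended effect of weighting recall.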
JAVELIN Performance Statistics
• Average F-score computed by NIST: 0.298
• Average F-score with recall based on both vital and okay nuggets: 0.322
• Total scenarios with F=1: 0
• Total scenarios with all vital information correct: 9
  • 1/1: scenarios 18, 19, 36, 38
  • 2/2: scenarios 4, 16, 34, 37
  • 3/3: scenario 33
• Total scenarios with F=0: 19
• Total scenarios without any (vital or okay) correct answers: 10
  • no answer found: scenarios 3, 5
  • bad answers: scenarios 6, 8, 10, 11, 13, 27, 29, 30
JAVELIN Performance Statistics (continued)
• Average recall (vital): 0.344
• Average precision: 0.261
• Matches per answer item:
  • no nuggets matched: 206
  • 1 nugget matched: 57
  • 2 nuggets matched: 10
  • 3 nuggets matched: 6
  • 4 nuggets matched: 1
• Not done (but potentially useful?): determine which decomposed questions we provided relevant information for
General Observations
• Nugget quality and assessment vary considerably (e.g., questions #3, #8)
  • Nuggets overlap, repeat given information, and sometimes represent cues rather than answers; other relevant information is not counted if it was not in the assessors’ original set
• Difficult to assess retrieval performance
  • No document IDs provided in the nugget file
• Difficult to reproduce the precision scores
  • Relevant text spans appear to have been manually determined and are not noted in the annotated file
MinorThird (http://minorthird.sourceforge.net/)
• A standardized testbed for building and evaluating machine learning algorithms that work on text
• Includes a pattern language (Mixup) for building taggers (compiles to FSTs)
• Can we use MinorThird as a factory to build new information extractors for the QA task?
Initial Training Experiments
• Can MinorThird train new taggers for specific tags and corpora, based on bootstrap information from existing tagger(s)?
• Setup:
  • Use Identifinder to annotate 101 messages (focus: ORGANIZATION)
  • Manually fix incorrect tags
  • Training set: 81; Test set: 20
• Experiments:
  • Vary training set size: 40, 61, 81 messages
  • Vary history size and window size parameters used by the MinorThird Learner class
Varying Size of Training Set
The Text Processor (TP)
• A server capable of processing text annotation requests (batch or run-time)
• Receives a text stream input and assigns multiple levels of tags or features
• Application can specify which processors to run on a text, and in what order
• Provides a single API for a variety of processors:
  • Brill Tagger
  • BBN Identifinder
  • MXTerminator
  • Link Parser
  • RASP
  • WordNet
  • CLAWS
  • FrameNet
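The idea of one API over several processors, with the application choosing which to run and in what order, can be sketched as a small registry. This is only an illustration: the class names (`TextProcessor`, `Annotation`, `AnnotatedText`) and the two toy processors are hypothetical and do not reflect the actual TP object model or wrap any of the tools listed above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    layer: str   # annotation level, e.g. "token" or "entity"
    label: str


@dataclass
class AnnotatedText:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


Processor = Callable[[AnnotatedText], None]


class TextProcessor:
    """Hypothetical single entry point over pluggable processors."""

    def __init__(self) -> None:
        self._registry: Dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        self._registry[name] = processor

    def process(self, text: str, pipeline: List[str]) -> AnnotatedText:
        doc = AnnotatedText(text)
        for name in pipeline:          # caller decides which and in what order
            self._registry[name](doc)
        return doc


def toy_tokenizer(doc: AnnotatedText) -> None:
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation(m.start(), m.end(), "token", m.group()))


def toy_capital_tagger(doc: AnnotatedText) -> None:
    # Depends on the "token" layer, so it must run after the tokenizer.
    tokens = [a for a in doc.annotations if a.layer == "token"]
    for a in tokens:
        if a.label[0].isupper():
            doc.annotations.append(Annotation(a.start, a.end, "entity", "CAPITALIZED"))


tp = TextProcessor()
tp.register("tokenize", toy_tokenizer)
tp.register("capitalized", toy_capital_tagger)
result = tp.process("Is Chechnya a member of the United Nations?",
                    ["tokenize", "capitalized"])
entities = [result.text[a.start:a.end]
            for a in result.annotations if a.layer == "entity"]
```

Because the second processor reads the layer written by the first, the order given in `pipeline` matters, which is the reason an application-specified ordering is useful.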
TP Object Model
Fact Base
• Relational data model containing:
  • Documents and metadata
  • Standoff annotations for:
    • Linguistic analysis (segmentation, POS, parsing, predicate extraction)
    • Semantic interpretation (frame filling -> facts/events/etc.)
    • Reasoning (reference resolution, inference)
Fact Base [2]
1. Relevant documents or passages (Text) are processed by the TP modules through the Text Processor API: Segmenter, Taggers, Parsers, Framers
2. Results are stored as features on text spans (Features)
3. Extracted frames are stored as possible facts, events, etc. (Facts)
* All derived information is directly linked to input source(s) at each level
* Persistent storage in RDBMS supports:
  - training/learning on any combination of features
  - reuse of results across sessions, analysts, etc. when appropriate
  - use of relational querying for association chains (cf. G. Bhalotia, et al., Keyword searching and browsing in databases using BANKS. In ICDE, San Jose, CA, 2002.)
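The three levels above (text, features on spans, facts) map naturally onto relational tables. The following is a minimal sketch using an in-memory SQLite database; the table and column names are assumptions made for illustration, not the actual Fact Base schema, and the sentence is a made-up toy example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (doc_id INTEGER PRIMARY KEY, source TEXT, body TEXT);
-- Standoff annotation: offsets point into document.body, text is not copied.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    doc_id     INTEGER NOT NULL REFERENCES document(doc_id),
    start_off  INTEGER NOT NULL,
    end_off    INTEGER NOT NULL,
    layer      TEXT NOT NULL,
    value      TEXT NOT NULL
);
-- A possible fact links back to the features (and thus the source text).
CREATE TABLE fact (
    fact_id         INTEGER PRIMARY KEY,
    predicate       TEXT NOT NULL,
    subject_feature INTEGER NOT NULL REFERENCES feature(feature_id),
    object_feature  INTEGER NOT NULL REFERENCES feature(feature_id)
);
""")


def add_feature(doc_id, body, surface, layer, value):
    """Store a span annotation for the first occurrence of `surface`."""
    start = body.index(surface)
    cur = conn.execute(
        "INSERT INTO feature (doc_id, start_off, end_off, layer, value) "
        "VALUES (?, ?, ?, ?, ?)",
        (doc_id, start, start + len(surface), layer, value))
    return cur.lastrowid


body = "Acme Corp hired Jane Doe."  # toy sentence, not from any corpus
doc_id = conn.execute("INSERT INTO document (source, body) VALUES (?, ?)",
                      ("toy-doc-1", body)).lastrowid
subj = add_feature(doc_id, body, "Acme Corp", "entity", "ORGANIZATION")
obj = add_feature(doc_id, body, "Jane Doe", "entity", "PERSON")
conn.execute("INSERT INTO fact (predicate, subject_feature, object_feature) "
             "VALUES (?, ?, ?)", ("hire", subj, obj))

# Follow the chain fact -> features -> document to recover the source spans.
row = conn.execute("""
SELECT f.predicate,
       substr(d.body, s.start_off + 1, s.end_off - s.start_off),
       substr(d.body, o.start_off + 1, o.end_off - o.start_off),
       d.source
FROM fact f
JOIN feature s ON s.feature_id = f.subject_feature
JOIN feature o ON o.feature_id = f.object_feature
JOIN document d ON d.doc_id = s.doc_id
""").fetchone()
```

Keeping offsets rather than copied strings is what lets every derived item be traced back to its input source with an ordinary join.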
CLQA: The Keyword Translation Problem
• Given keywords extracted from the question, how do we correctly translate them into the languages of the information sources?
Diagram: Keywords in Language A -> Keyword Translator -> Keywords in Language B, Keywords in Language C
Tools For Query/Keyword Translation
• Machine Readable Dictionaries (MRD)
  • Pros:
    • Easily obtained for high-density languages
    • Domain-specific dictionaries provide good coverage in-domain
  • Cons:
    • Publicly available general dictionaries usually have low coverage
    • Cannot translate sentences
• MT Systems
  • Pros:
    • Usually provide more coverage than publicly available MRD
    • Translate whole sentences
  • Cons:
    • Translation quality varies
    • Low language-pair coverage compared to MRD
• Parallel Corpora
  • Pros: Good for domain-specific translation
  • Cons: Poor for open-domain translation
Research Questions
• Can we improve keyword translation correctness by building a keyword selection model that selects one translation from translations produced by multiple MT systems?
• Can we improve keyword translation correctness by using the question sentence?
The Translation Selection Problem
• Given a set of translation candidates and the question sentence, how do we select the translation that is most likely a correct translation of the keyword?
Diagram: Source Keyword and Source Question are sent to each of MT System 1, 2, and 3; MT System i produces Target Keyword i and Target Question i; the Selection Model assigns a Score for Target Keyword i
Keyword Selection Model
• A set of scoring metrics:
  • A translation candidate is assigned an initial base score of 0
  • Each scoring metric adds to or subtracts from the running total of the score
  • After all candidates go through the model, the translation candidate with the highest score is selected as the most likely correct translation
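The additive scheme described above can be sketched as follows. The two metrics and their weights are hypothetical stand-ins, loosely inspired by the full-sentence word-matching and untranslated-keyword penalty ideas; they are not the metrics or scores used in the experiment. The keywords are opaque placeholder tokens rather than real translations.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    system: str    # which MT system produced this candidate
    keyword: str   # translated keyword
    question: str  # the same system's translation of the full question


# A metric returns a positive or negative adjustment for one candidate.
Metric = Callable[[Candidate, str, Sequence[Candidate]], float]


def full_sentence_match(c: Candidate, source_kw: str,
                        all_c: Sequence[Candidate]) -> float:
    # Hypothetical: +1 for every translated question containing the keyword.
    return float(sum(1 for o in all_c if c.keyword in o.question))


def untranslated_penalty(c: Candidate, source_kw: str,
                         all_c: Sequence[Candidate]) -> float:
    # Hypothetical: -2 if the source keyword survives in the output.
    return -2.0 if source_kw.lower() in c.keyword.lower() else 0.0


def score(c: Candidate, source_kw: str, all_c: Sequence[Candidate],
          metrics: List[Metric]) -> float:
    total = 0.0  # every candidate starts from a base score of 0
    for metric in metrics:
        total += metric(c, source_kw, all_c)
    return total


def select(source_kw: str, all_c: Sequence[Candidate],
           metrics: List[Metric]) -> Candidate:
    # Highest total wins; ties go to the first candidate listed.
    return max(all_c, key=lambda c: score(c, source_kw, all_c, metrics))


candidates = [
    Candidate("A", "kw_x", "q1 kw_x q2"),
    Candidate("B", "kw_y", "q1 kw_x q3"),
    Candidate("C", "minister", "q1 minister q2"),  # left untranslated
]
metrics = [full_sentence_match, untranslated_penalty]
best = select("minister", candidates, metrics)
scores = {c.system: score(c, "minister", candidates, metrics) for c in candidates}
```

In this toy run, system A's keyword appears in two translated questions and wins, while system C is penalized for leaving the keyword untranslated.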
The Experiment
• Language pair: English to Chinese
• Uses three free web-based MT systems
  • www.systranbox.com
  • www.freetranslation.com
  • www.amikai.com
• Training data: 50 input questions (125 keywords) from TREC-8, TREC-9, and TREC-10
• Testing data: 50 input questions (147 keywords) from TREC-8, TREC-9, and TREC-10
• Evaluation: translation correctness
Scoring Metrics
• In this experiment, we constructed different selection models, each using a combination of the following five scoring metrics:
  • Baseline
  • Segmented Word-Matching and Partial Word-Matching
  • Full Sentence Word-Matching without Fall Back to Partial Word-Matching
  • Full Sentence Word-Matching with Fall Back to Partial Word-Matching
  • Penalty for Partially Translated or Un-Translated Keywords
Scoring Metrics Summary
• Description of Scoring Metrics:
• Scoring Legend:
Results
Keyword Translation Accuracy of Different Models on the Test Set
Improvement of Different Models over the Base Model
[Lin, F. and T. Mitamura, “Keyword Translation from English to Chinese for Multilingual QA”, Proceedings of AMTA 2004, Georgetown.]
Results
Keyword Translation Accuracy of Different Models on the Test Set
• Best single MT system performance: 78.23%
• Best multiple MT model performance: 85.71%
• Best possible result if the correct keywords are selected every time they are produced: 92.52%
[Lin, F. and T. Mitamura, “Keyword Translation from English to Chinese for Multilingual QA”, Proceedings of AMTA 2004, Georgetown.]
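As a rough check on what these percentages mean in absolute terms, the sketch below converts them to keyword counts. It assumes each percentage is a fraction of the 147 test-set keywords reported for the experiment; the counts are derived here, not quoted from the slides, and the dictionary labels are shorthand introduced for this example.

```python
TEST_KEYWORDS = 147  # test-set size given in the experiment description

reported_pct = {
    "best_single_mt": 78.23,
    "best_multi_mt_model": 85.71,
    "upper_bound": 92.52,  # correct keyword chosen whenever any system produced it
}

# Convert each percentage to the nearest whole number of keywords.
counts = {name: round(pct / 100 * TEST_KEYWORDS)
          for name, pct in reported_pct.items()}

# Sanity check: each count reproduces the reported percentage to 2 decimals.
round_trip = {name: round(100 * n / TEST_KEYWORDS, 2)
              for name, n in counts.items()}

gain_over_single = counts["best_multi_mt_model"] - counts["best_single_mt"]
remaining_headroom = counts["upper_bound"] - counts["best_multi_mt_model"]
never_produced = TEST_KEYWORDS - counts["upper_bound"]
```

Under that assumption, the selection model recovers about as many additional keywords over the best single system as it still leaves on the table relative to the upper bound.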
Observations
• Models which include scoring metrics that require segmentation did poorly
• Using more MT systems improves translation correctness
• Using the translated question improves keyword translation accuracy
• There is still room for improvement (85.71% to 92.52%)
More to Do…
• Use statistical/machine learning techniques
  • Treat the result of each scoring metric as a feature in a classification problem (SVM, MaxEnt)
  • Train weights for each scoring metric (EM)
• Use additional / improved scoring metrics
  • Validate translation using search engines
  • Use better segmentation tools
• Compare with other evaluation methods
  • retrieval performance
  • end-to-end system (QA) performance
Questions?