AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
Goals and Approach
• Analysts are often looking for limited, concrete information within the text; full MT may not be necessary
• Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
• But how do we extract just the relevant parts in the source language?
• AMTEXT approach:
  • Learn extraction patterns and their translations from small amounts of human-translated and aligned data
  • Combine with broad-coverage Named-Entity translation lexicons
• System output: translation of extracted information + a structured representation
DoD KDL Visit
AMTEXT Extraction-based MT
[Architecture diagram. Component labels: Word-aligned elicited data; Learning Module; Transfer Rules; Source Text; Run Time Extract Transfer System; Partial Parser & Transfer Engine; NE Translation Lexicon; Word Translation Lexicon; Post-processor; Extractor; Extracted Target Text; Filled Template]
Example transfer rule shown in the diagram:
S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
( (X1::Y1) (X4::Y4) (X5::Y5) )
Elicitation Example
Learning Extraction Translation Patterns
• Elicited example:
  Sharon nifgash hayom im bush
  Sharon met with Bush today
• After generalization:
  <PERSON> <MEET-V> <TE> im <PERSON>
  <PERSON> <MEET-V> with <PERSON> <TE>
• Resulting learned pattern rule:
  S::S : [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
  ( (X1::Y1) (X2::Y2) (X3::Y5) (X5::Y4) )
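The generalization step above can be illustrated with a minimal sketch. It assumes a tiny hand-built category lexicon (`SRC_CATS`, `TGT_CATS`) and a word alignment given as 1-based index pairs; these names and the function `generalize` are hypothetical and are not taken from the AMTEXT learning module.

```python
# Hypothetical category lexicons; the system's real lexicons are not shown in the slides.
SRC_CATS = {"Sharon": "PERSON", "bush": "PERSON", "nifgash": "MEET-V", "hayom": "TE"}
TGT_CATS = {"Sharon": "PERSON", "Bush": "PERSON", "met": "MEET-V", "today": "TE"}

def generalize(src_tokens, tgt_tokens, word_alignment):
    """Replace known words by category labels and keep category-level alignments.

    word_alignment is a list of 1-based (source_index, target_index) pairs.
    Words missing from the lexicons stay as literals and lose their alignment.
    """
    src_pat = [SRC_CATS.get(t, t) for t in src_tokens]
    tgt_pat = [TGT_CATS.get(t, t) for t in tgt_tokens]
    aligns = [(x, y) for x, y in word_alignment
              if src_tokens[x - 1] in SRC_CATS and tgt_tokens[y - 1] in TGT_CATS]
    return src_pat, tgt_pat, aligns

def format_rule(src_pat, tgt_pat, aligns):
    """Render the pattern in the S::S notation used on the slide."""
    links = " ".join(f"(X{x}::Y{y})" for x, y in aligns)
    return f"S::S : [{' '.join(src_pat)}] -> [{' '.join(tgt_pat)}] ( {links} )"

src = "Sharon nifgash hayom im bush".split()
tgt = "Sharon met with Bush today".split()
# Assumed informant alignment; (4, 3) links the literals "im" and "with".
alignment = [(1, 1), (2, 2), (3, 5), (4, 3), (5, 4)]
rule = format_rule(*generalize(src, tgt, alignment))
print(rule)
```

Under these assumptions the sketch reproduces the slide's rule, with the literal pair im/with left unaligned.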
Transfer Rule Formalism
Rule components: type information; part-of-speech/constituent information; alignments; x-side constraints; y-side constraints; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
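To make the role of the alignments concrete, here is a minimal sketch that applies the NP rule above to "the old man". It covers only the constituent sequence and the X::Y alignments; the feature constraints (AGR, DEF, COUNT, GENDER) are deliberately ignored. The dictionary layout, the toy `LEXICON`, and `apply_rule` are assumptions made for illustration, not the transfer engine's actual data structures.

```python
# Hypothetical toy lexicon for the slide's example; not the system's lexicon.
LEXICON = {"the": "ha-", "old": "zaqen", "man": "ish"}

RULE = {
    "source": ["DET", "ADJ", "N"],
    "target": ["DET", "N", "DET", "ADJ"],
    # 1-based (source, target) positions: (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],
}

def apply_rule(rule, tagged_source):
    """Translate a list of (word, category) pairs, or return None if the rule does not match.

    Feature constraints from the formalism are not checked in this sketch.
    """
    if [cat for _, cat in tagged_source] != rule["source"]:
        return None
    target = [None] * len(rule["target"])
    for x, y in rule["alignments"]:
        word, _ = tagged_source[x - 1]
        target[y - 1] = LEXICON[word]  # one source word may fill several target slots
    return target

out = apply_rule(RULE, [("the", "DET"), ("old", "ADJ"), ("man", "N")])
# Attach the prefix "ha-" to the following word when printing.
text = "".join(w if w.endswith("-") else w + " " for w in out).strip()
print(text)
```

Note how the single source determiner (X1) is aligned to two target positions (Y1 and Y3), which is what yields the doubled definite marker in the target phrase.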
The Transfer Engine
Partial Parsing
• Input: full text in the foreign language
• Output: translation of extracted/matched text
• Goal: extract by effectively matching transfer rules with the full text
• Identify/parse NEs and words in the restricted vocabulary
• Identify transfer-rule (source-side) patterns
• Transfer Engine produces a complete lattice of transfer translations
Example:
  Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  Tags: NE-P NE-P NE-P TE
  Sharon will meet with Bush today
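The idea of matching a source-side pattern against full text while skipping unmatched material can be sketched as below. This is a simplified greedy matcher, not the transfer engine's parser: the lexicon contents, the pattern `[NE-P MEET-V NE-P TE]`, the tokenization "ha-xuc shalom", and the function `partial_match` are all assumptions for illustration.

```python
# Hypothetical restricted-vocabulary lexicon: source word -> (category, translation).
LEXICON = {
    "Sharon": ("NE-P", "Sharon"),
    "shalom": ("NE-P", "Shalom"),
    "bush": ("NE-P", "Bush"),
    "yipagesh": ("MEET-V", "will meet with"),
    "hayom": ("TE", "today"),
}
PATTERN = ["NE-P", "MEET-V", "NE-P", "TE"]

def partial_match(tokens, pattern):
    """Greedy left-to-right match of the pattern categories.

    Tokens that are unknown, or whose category is not the one expected next,
    are skipped. Returns the translations of the matched tokens, or None if
    the pattern is not completed.
    """
    matched, want = [], 0
    for tok in tokens:
        cat, trans = LEXICON.get(tok.strip(","), (None, None))
        if want < len(pattern) and cat == pattern[want]:
            matched.append(trans)
            want += 1
    return matched if want == len(pattern) else None

sentence = "Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom".split()
result = partial_match(sentence, PATTERN)
print(" ".join(result))
```

In this sketch the intervening phrase, including the second person name, is skipped because the matcher is waiting for a meeting verb at that point. A real system would produce several competing matches in a lattice rather than a single greedy one.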
Post Processing
• Translation Selection Module:
  • Select the most complete and coherent translation from the lattice based on scoring heuristics
• Structure Extraction:
  • Extract translated entities from the pattern and display them in a structured table format
• Output Display:
  • Perl scripts construct an HTML page for displaying complete translation results
Translation Selection Module: Features
• Goal: a scoring function that can identify the most likely best match
• Lattice arc features from the transfer engine:
  • matched range of source
  • matched parts of target
  • transfer score
  • partial parse
Lattice Example
Arafat to meet Peres in Brussels on Monday
ErfAt yltqy byryz msAA AlAvnyn fy brwksl
(1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
(2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
(3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
(1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
(4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
(5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
(4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
(1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
(6 6 "fy" 3 "fy" "(UNK,2 "fy")")
(7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
(6 7 "in Brussels" 2.9 "fy brwksl" "((LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) )")
(1 7 "Arafat will meet with Peres in Brussels on Monday" 3.4 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels Monday" 3.3 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,5 (DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels" 3.2 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,8 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) ) )")
Example: Extracting Features
• 1 5: length (tokens) of source segment (ar) (1)
• "Arafat will meet with Peres Monday": length of translated segment (2)
• 3.1: transfer engine score (3)
• "ErfAt yltqy byryz msAA AlAvnyn" (tokens 1 2 3 4 5): length of source segment (4)
• "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )": transfer structure, full frame (S) or not? (5)
• Secondary feature (6): relative length of (2) over (4); the smaller, the more concise the source-language match (less extraneous material, i.e. less chance of mistranslation)
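The feature list above can be turned into a small sketch that reads one lattice arc and computes features (1) to (6). The slide does not state the units of the length features, so this sketch assumes whitespace-separated tokens; the `Arc` type, the field names, and the test for a full frame (structure rooted in `S`) are likewise assumptions.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    """One lattice arc, with fields in the order they appear in the lattice dump."""
    start: int
    end: int
    translation: str
    score: float
    source: str
    structure: str

def extract_features(arc: Arc) -> dict:
    trans_len = len(arc.translation.split())   # (2), assumed to be counted in tokens
    source_len = len(arc.source.split())       # (4), assumed to be counted in tokens
    return {
        "span_tokens": arc.end - arc.start + 1,                       # (1)
        "trans_len": trans_len,                                       # (2)
        "transfer_score": arc.score,                                  # (3)
        "source_len": source_len,                                     # (4)
        "full_frame": arc.structure.lstrip("( ").startswith("S,"),    # (5)
        "length_ratio": trans_len / source_len,                       # (6)
    }

arc = Arc(1, 5, "Arafat will meet with Peres Monday", 3.1,
          "ErfAt yltqy byryz msAA AlAvnyn",
          '((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
          '(PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )')
features = extract_features(arc)
print(features)
```

For this arc the sketch yields a 5-token source span, a 6-token translation, and a full-frame structure.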
Selecting Best Translation
• For each parse Pj in the lattice, calculate a score Sj based on features fi with weight coefficients wi [scoring equation not preserved in this transcript]
• Weights wi trained by hill climbing (training set / manual reference parse)
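Since the scoring equation itself is not available here, the following sketch assumes the simplest form consistent with the description: a weighted linear sum, S_j = sum_i w_i * f_i(P_j). The hill-climbing loop is a generic greedy coordinate search, and the training data are made-up toy values; none of this is confirmed as the system's actual procedure.

```python
def score(features, weights):
    """Assumed linear form: S_j = sum_i w_i * f_i(P_j)."""
    return sum(w * f for w, f in zip(weights, features))

def best_index(candidates, weights):
    """Index of the highest-scoring candidate parse (first one wins ties)."""
    return max(range(len(candidates)), key=lambda j: score(candidates[j], weights))

def accuracy(training_set, weights):
    """Fraction of items where the top-scoring parse is the manual reference parse."""
    hits = sum(best_index(cands, weights) == ref for cands, ref in training_set)
    return hits / len(training_set)

def hill_climb(training_set, weights, step=0.5, rounds=20):
    """Greedy coordinate search: keep a +/- step change only if accuracy improves."""
    weights = list(weights)
    best = accuracy(training_set, weights)
    for _ in range(rounds):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = weights[:]
                trial[i] += delta
                acc = accuracy(training_set, trial)
                if acc > best:
                    weights, best, improved = trial, acc, True
        if not improved:
            break
    return weights, best

# Toy data (made up): (candidate feature vectors, index of the reference parse).
# Assumed feature order: [source span in tokens, transfer score, full frame (1/0)].
training = [
    ([[3, 3.0, 1], [5, 3.2, 1], [2, 2.9, 0]], 1),
    ([[7, 3.4, 1], [1, 3.0, 0]], 0),
    ([[2, 3.5, 0], [4, 3.1, 1]], 1),
]
weights, acc = hill_climb(training, [0.0, 1.0, 0.0])
print(weights, acc)
```

Starting from the transfer score alone, the search in this toy setting adds weight to the source-span feature, which favors parses that cover more of the input.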
“Proof-of-Concept” System
• Arabic-to-English
• Newswire text (available from TIDES)
• Very limited set of actions: (X meet Y)
• Limited collection of translation patterns:
  • <Person-NE> <meet-verb> <Person-NE> <LOC>* <TE>*
• Limited vocabulary and NE lexicon
System Development
• Training corpus of 535 short sentences translated and aligned by a bilingual informant:
  • 258 simple meeting sentences
  • 120 Temporal Expressions
  • 105 Location Expressions
  • 52 Title Expressions
• Translation lexicon of Named Entities (person names, organizations and locations) converted from Fei Huang’s NE translation/transliteration work
• Pattern generalizations semi-automatically “learned” from the training data
• Patterns manually enhanced with “skipping markers”
• Initial system integrated
• Development with informant on a 74-sentence development set
Resulting System
• Transfer Grammar contains:
  • 21 transfer pattern rules
  • 12 Meet Verb rules
  • 4/17/11/17 Person/TE/LOC/PTitle “high-level” rules
• Transfer Lexicon contains 3070 entries (mostly names and locations)
• Estimated development effort/time:
  • ~20 hours with informant
  • ~50 hours of lexical and rule development
Evaluation
• Development set of 74 sentences
• Test set of 76 unseen sentences with meeting information
• Identified the subset of each set on which meeting patterns could potentially apply (“Good”):
  • 53 development sentences
  • 44 test sentences
Evaluation
• Translation-based:
  • Unigram token-based retrieval metrics: precision / recall / F1
• Entity-based:
  • Recall for each role in the meeting frame (V, P1, P2, LOC and TE)
  • Partial recall credit for partial matches
  • Partial credit (50%) for P1/P2 role interchange
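Both kinds of metric can be sketched in a few lines. The unigram precision/recall/F1 computation is the standard bag-of-tokens definition. For the entity-based metric, the slide does not say how partial matches are credited, so this sketch assumes the fraction of reference tokens recovered, and applies the 50% factor when a person filler appears in the other person role. The function names and the dictionary representation of the meeting frame are hypothetical.

```python
from collections import Counter

def unigram_prf(hypothesis: str, reference: str):
    """Unigram token-based precision, recall and F1 (tokens counted with multiplicity)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def role_credit(hyp: dict, ref: dict, role: str) -> float:
    """Recall credit for one role of the meeting frame (reference filler must be non-empty).

    Assumptions: a partial match earns the fraction of reference tokens found;
    a P1/P2 interchange earns 50% of what the swapped filler would have earned.
    """
    def overlap(h: str, r: str) -> float:
        h_tok, r_tok = Counter(h.split()), Counter(r.split())
        return sum((h_tok & r_tok).values()) / sum(r_tok.values())

    direct = overlap(hyp.get(role, ""), ref[role])
    if role in ("P1", "P2"):
        other = "P2" if role == "P1" else "P1"
        return max(direct, 0.5 * overlap(hyp.get(other, ""), ref[role]))
    return direct

p, r, f1 = unigram_prf("Arafat will meet with Peres Monday",
                       "Arafat to meet Peres in Brussels on Monday")
ref_frame = {"P1": "Arafat", "P2": "Peres", "LOC": "Brussels", "TE": "on Monday"}
hyp_frame = {"P1": "Peres", "P2": "Arafat", "TE": "Monday"}
credits = {role: role_credit(hyp_frame, ref_frame, role) for role in ref_frame}
print(p, r, f1, credits)
```

In the toy frame, the swapped person roles each receive half credit, the missing location receives none, and the temporal expression receives partial credit for recovering one of its two reference tokens.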
Evaluation Results
Demonstration
http://www-2.cs.cmu.edu/afs/cs/user/alavie/Avenue/tmp/demo20sep/met.dev.htm
Conclusions
• Attractive methodology for joint extraction + translation of Essential Elements of Information (EEIs) from full foreign-language texts
• Rapid development: circumvents the need to develop high-quality full MT or high-quality IE technology for the foreign source language
• Effective use of bilingual informants
• Main open question: scalability
  • Can this methodology be effective with much broader and more complex types of extracted EEIs?
  • Is automatic learning of generalized patterns feasible and effective in these more complex scenarios?
  • Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?