AMTEXT: Extraction-based MT for Arabic

Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi

Goals and Approach
• Analysts are often looking for limited, concrete information within the text → full MT may not be necessary
• Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
• But: how do we extract just the relevant parts in the source language?
• AMTEXT approach:
  • Learn extraction patterns and their translations from small amounts of human-translated and aligned data
  • Combine with broad-coverage Named-Entity translation lexicons
• System output: translation of extracted information + a structured representation
DoD KDL Visit

AMTEXT Extraction-based MT
[Architecture diagram; recoverable labels listed below]
• Learning Module: Word-aligned elicited data, Transfer Rules
• Run Time Extract Transfer System: Source Text, Partial Parser & Transfer Engine, Post-processor, Extractor, Extracted Target Text, Filled Template
• Lexicons: NE Translation Lexicon, Word Translation Lexicon
• Example transfer rule:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
  ((X1::Y1) (X4::Y4) (X5::Y5))

Elicitation Example

Learning Extraction Translation Patterns
• Elicited example:
  Sharon nifgash hayom im bush
  Sharon met with Bush today
• After generalization:
  <PERSON> <MEET-V> <TE> im <PERSON>
  <PERSON> <MEET-V> with <PERSON> <TE>
• Resulting learned pattern rule:
  S::S : [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
  ((X1::Y1) (X2::Y2) (X3::Y5) (X5::Y4))
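The alignments in the learned rule determine how source constituents are reordered on the target side. The following is a minimal Python sketch of that reordering, assuming each source constituent has already been translated on its own. The dictionary layout and the function name `apply_rule` are illustrative assumptions, not the interface of the AMTEXT transfer engine.

```python
# Hypothetical encoding of the learned rule from the slide.
# Positions are 1-based, matching the X1..X5 / Y1..Y5 notation.
RULE = {
    "source": ["PERSON", "MEET-V", "TE", "im", "PERSON"],
    "target": ["PERSON", "MEET-V", "with", "PERSON", "TE"],
    "alignments": [(1, 1), (2, 2), (3, 5), (5, 4)],  # (X, Y) pairs
}


def apply_rule(rule, translated):
    """Build the target string from per-constituent translations.

    `translated[i]` is the translation of source position i+1, or None
    for source literals (such as "im") that have no aligned target slot.
    """
    if len(translated) != len(rule["source"]):
        raise ValueError("one translation slot is needed per source position")
    y_to_x = {y: x for x, y in rule["alignments"]}
    output = []
    for y, symbol in enumerate(rule["target"], start=1):
        if y in y_to_x:
            # Aligned slot: copy the translation of the aligned source item.
            output.append(translated[y_to_x[y] - 1])
        else:
            # Unaligned target symbol: emitted as a literal word.
            output.append(symbol)
    return " ".join(output)


result = apply_rule(RULE, ["Sharon", "met", "today", None, "Bush"])
```

Here the temporal expression moves from source position 3 to target position 5, and the unaligned target symbol "with" is inserted as a literal.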

Transfer Rule Formalism
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
Example rule:
  ;SL: the old man, TL: ha-ish ha-zaqen
  NP::NP [DET ADJ N] -> [DET N DET ADJ]
  (
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
  )

The Transfer Engine

Partial Parsing
• Input: full text in the foreign language
• Output: translation of extracted/matched text
• Goal: extract by effectively matching transfer rules against the full text
• Identify/parse NEs and words in the restricted vocabulary
• Identify transfer-rule (source-side) patterns
• Transfer Engine produces a complete lattice of transfer translations
• Example:
  Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  Tags: NE-P NE-P NE-P TE
  Sharon will meet with Bush today
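The idea of matching a source-side pattern while ignoring unrelated material can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `LEXICON`, the `MEET-V` tag assignment, and the unrestricted skipping strategy are all hypothetical, and the real system matches transfer rules inside its transfer engine rather than with a loop like this.

```python
# Hypothetical restricted-vocabulary lexicon for the slide's example.
LEXICON = {
    "sharon": "NE-P",
    "shalom": "NE-P",
    "bush": "NE-P",
    "yipagesh": "MEET-V",
    "hayom": "TE",
}


def tag(text):
    """Tag each token with its lexicon category, or UNK if not listed."""
    tokens = [t.strip(",.") for t in text.split()]
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens if t]


def match_with_skipping(tagged, pattern):
    """Match `pattern` as a subsequence of the tag sequence.

    Tokens that do not fit the next expected tag are skipped, so
    material between the pattern elements does not block the match.
    Returns the matched tokens, or None if the pattern is incomplete.
    """
    matched = []
    want = 0
    for token, label in tagged:
        if want < len(pattern) and label == pattern[want]:
            matched.append(token)
            want += 1
    return matched if want == len(pattern) else None


sentence = "Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom"
extracted = match_with_skipping(tag(sentence), ["NE-P", "MEET-V", "NE-P", "TE"])
```

With this pattern the parenthetical phrase about the accompanying minister is skipped, and only the elements of the meeting frame are extracted.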

Post Processing
• Translation Selection Module:
  • Select the most complete and coherent translation from the lattice based on scoring heuristics
• Structure Extraction:
  • Extract translated entities from the pattern and display them in a structured table format
• Output Display:
  • Perl scripts construct an HTML page for displaying complete translation results

Translation Selection Module: Features
• Goal: scoring function that can identify the most likely best match
• Lattice arc features from the transfer engine:
  • matched range of source
  • matched parts of target
  • transfer score
  • partial parse

Lattice Example
Arafat to meet Peres in Brussels on Monday
ErfAt yltqy byryz msAA AlAvnyn fy brwksl
(1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
(2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
(3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
(1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
(4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
(5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
(4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
(1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
(6 6 "fy" 3 "fy" "(UNK,2 "fy")")
(7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
(6 7 "in Brussels" 2.9 "fy brwksl" "((LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) )")
(1 7 "Arafat will meet with Peres in Brussels on Monday" 3.4 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels Monday" 3.3 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,5 (DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels" 3.2 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,8 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) ) )")
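Each lattice arc above has the shape (start end "target" score "source" "structure"). A minimal Python sketch for reading one arc into named fields might look as follows. The `Arc` type, the field names, and the regular expression are assumptions made for illustration; the actual output format of the transfer engine may have cases this pattern does not cover.

```python
import re
from typing import NamedTuple


class Arc(NamedTuple):
    start: int        # first source token covered (1-based)
    end: int          # last source token covered
    target: str       # English translation of the span
    score: float      # transfer engine score
    source: str       # source-language text of the span
    structure: str    # bracketed transfer structure


# The structure field contains nested quotes, so it is matched greedily
# up to the final closing quote and parenthesis of the arc.
ARC_RE = re.compile(
    r'\((\d+) (\d+) "(.*?)" (\d+(?:\.\d+)?) "(.*?)" "(.*)"\)'
)


def parse_arc(line):
    """Parse one lattice arc; return None if the line does not match."""
    m = ARC_RE.fullmatch(line.strip())
    if m is None:
        return None
    start, end, target, score, source, structure = m.groups()
    return Arc(int(start), int(end), target, float(score), source, structure)


arc = parse_arc(
    '(4 5 "on Monday" 2.9 "msAA AlAvnyn" '
    '"((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")'
)
```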

Example: Extracting Features
• 1 5 → length (tokens) of source segment (ar) (1)
• "Arafat will meet with Peres Monday" → length of translated segment (2)
• 3.1 → transfer engine score (3)
• "ErfAt yltqy byryz msAA AlAvnyn" (tokens 1 2 3 4 5) → length of source segment (4)
• "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )" → transfer structure: full frame (S) or not? (5)
• Secondary feature (6): relative length of (2) over (4); the smaller, the more concise the source language match (less extraneous material, i.e. less chance of mistranslation)
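The six features listed above can be computed from the fields of a single lattice arc. The sketch below is one possible reading: it assumes that lengths (2) and (4) are counted in whitespace-separated tokens and that a "full frame" is a structure whose top-level label is S. Neither detail is stated on the slide, and the function and key names are hypothetical.

```python
def extract_features(start, end, target, score, source, structure):
    """Compute the six selection features for one lattice arc."""
    span_tokens = end - start + 1            # (1) matched source range
    target_len = len(target.split())         # (2) length of translation
    source_len = len(source.split())         # (4) length of source text
    # (5) assumed test: the structure's outermost constituent is S
    full_frame = structure.lstrip().startswith("((S,")
    # (6) secondary feature: length of (2) relative to (4)
    ratio = target_len / source_len if source_len else 0.0
    return {
        "span_tokens": span_tokens,
        "target_len": target_len,
        "engine_score": score,               # (3) transfer engine score
        "source_len": source_len,
        "full_frame": full_frame,
        "length_ratio": ratio,
    }


features = extract_features(
    1, 5,
    "Arafat will meet with Peres Monday",
    3.1,
    "ErfAt yltqy byryz msAA AlAvnyn",
    '((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
    '(PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )',
)
```

For this arc the source span covers 5 tokens, the translation has 6 tokens, and the structure is a full S frame.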

Selecting Best Translation
• For each parse Pj in the lattice, calculate a score Sj based on features fi with weight coefficients wi
• Weights wi trained by hill climbing (training set / manual reference parse)
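The scoring formula itself does not survive in this transcript. As a hedged illustration only, the sketch below assumes the score is a plain weighted sum of the features, and trains the weights with a simple random hill climber against reference choices. The linear form, the step size, the acceptance rule, and all function names are assumptions rather than the system's documented method.

```python
import random


def score(features, weights):
    """Assumed linear score: sum of w_i * f_i over all features."""
    return sum(w * f for w, f in zip(weights, features))


def select_best(candidates, weights):
    """Return the index of the highest-scoring candidate parse."""
    return max(range(len(candidates)),
               key=lambda j: score(candidates[j], weights))


def accuracy(training_set, weights):
    """Fraction of examples where the selected parse is the reference."""
    hits = sum(1 for candidates, ref in training_set
               if select_best(candidates, weights) == ref)
    return hits / len(training_set)


def hill_climb(training_set, weights, step=0.1, iterations=200, seed=0):
    """Perturb one weight at a time; keep changes that improve accuracy."""
    rng = random.Random(seed)
    best = list(weights)
    best_acc = accuracy(training_set, best)
    for _ in range(iterations):
        trial = list(best)
        i = rng.randrange(len(trial))
        trial[i] += rng.choice((-step, step))
        trial_acc = accuracy(training_set, trial)
        if trial_acc > best_acc:
            best, best_acc = trial, trial_acc
    return best, best_acc


# Toy training data: each item is (feature vectors of candidate parses,
# index of the manually chosen reference parse). Values are made up.
TRAINING = [
    ([[5, 3.1, 1], [7, 3.4, 1], [3, 3.0, 1]], 1),
    ([[5, 3.2, 1], [2, 2.9, 0]], 0),
]
initial = [0.0, 0.0, 0.0]
trained, trained_acc = hill_climb(TRAINING, initial)
```

Because a change is kept only when accuracy strictly improves, the trained weights never score worse on the training set than the starting weights.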

“Proof-of-Concept” System
• Arabic-to-English
• Newswire text (available from TIDES)
• Very limited set of actions: (X meet Y)
• Limited collection of translation patterns:
  • <Person-NE> <meet-verb> <Person-NE> <LOC>* <TE>*
• Limited vocabulary and NE lexicon

System Development
• Training corpus of 535 short sentences translated and aligned by a bilingual informant
  • 258 simple meeting sentences
  • 120 Temporal Expressions
  • 105 Location Expressions
  • 52 Title Expressions
• Translation lexicon of Named Entities (person names, organizations and locations) converted from Fei Huang’s NE translation/transliteration work
• Pattern generalizations semi-automatically “learned” from the training data
• Patterns manually enhanced with “skipping markers”
• Initial system integrated
• Development with informant on 74-sentence dev data

Resulting System
• Transfer Grammar contains:
  • 21 transfer pattern rules
  • 12 Meet Verb rules
  • 4/17/11/17 Person/TE/LOC/PTitle “high-level” rules
• Transfer Lexicon contains 3070 entries (mostly names and locations)
• Estimated development effort/time:
  • ~20 hours with informant
  • ~50 hours of lexical and rule development

Evaluation
• Development set of 74 sentences
• Test set of 76 unseen sentences with meeting information
• Identified subset of each set on which meeting patterns could potentially apply (“Good”)
  • 53 development sentences
  • 44 test sentences

Evaluation
• Translation-based:
  • Unigram token-based retrieval metrics: precision / recall / F1
• Entity-based:
  • Recall for each role in the meeting frame (V, P1, P2, LOC and TE)
  • Partial recall credit for partial matches
  • Partial credit (50%) for P1/P2 role interchange
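Both kinds of metric can be sketched briefly in Python. The unigram precision/recall/F1 computation below is the standard bag-of-tokens overlap. The role-level scoring is only one plausible reading of the slide: it assumes partial credit is the fraction of reference tokens recovered, and that the 50% interchange credit applies when a person filler appears in the other person role. The function names and frame layout are hypothetical.

```python
from collections import Counter


def unigram_prf(hypothesis, reference):
    """Unigram precision, recall and F1 between two token strings."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    overlap = sum((hyp & ref).values())   # clipped token matches
    p = overlap / sum(hyp.values()) if hyp else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def token_recall(hyp_filler, ref_filler):
    """Fraction of reference filler tokens found in the hypothesis filler."""
    ref = Counter(ref_filler.split())
    hyp = Counter((hyp_filler or "").split())
    return sum((hyp & ref).values()) / sum(ref.values())


def role_recall(hyp_frame, ref_frame):
    """Per-role recall with assumed partial and interchange credit.

    Reference fillers are assumed to be non-empty strings.
    """
    scores = {role: token_recall(hyp_frame.get(role), filler)
              for role, filler in ref_frame.items()}
    for role, other in (("P1", "P2"), ("P2", "P1")):
        if role in ref_frame and scores[role] == 0.0:
            # Half credit if the filler shows up in the other person role.
            scores[role] = 0.5 * token_recall(hyp_frame.get(other),
                                              ref_frame[role])
    return scores


p, r, f1 = unigram_prf("Arafat will meet with Peres Monday",
                       "Arafat will meet with Peres on Monday")
roles = role_recall(
    {"V": "will meet with", "P1": "Peres", "P2": "Arafat", "TE": "Monday"},
    {"V": "will meet with", "P1": "Arafat", "P2": "Peres", "TE": "on Monday"},
)
```

In the usage example the hypothesis omits "on" and swaps the two participants, so the verb role gets full credit while P1, P2 and TE each get half credit under these assumptions.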

Evaluation Results

Demonstration
http://www-2.cs.cmu.edu/afs/cs/user/alavie/Avenue/tmp/demo20sep/met.dev.htm

Conclusions
• Attractive methodology for joint extraction + translation of Essential Elements of Information (EEIs) from full foreign-language texts
• Rapid development: circumvents the need for developing high-quality full MT or high-quality IE technology for the foreign source language
• Effective use of bilingual informants
• Main open question: scalability
  • Can this methodology be effective with much broader and more complex types of extracted EEIs?
  • Is automatic learning of generalized patterns feasible and effective in such more complex scenarios?
  • Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?
