AMTEXT: Extraction-based MT for Arabic

Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi

Background and Objectives
• Full MT of text is problematic:
  • Requires large amounts of resources and long development time
  • Quality of output varies
• Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
• Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
• Text extraction technology has made much progress in the past decade [TIPSTER, TREC, EELD]
• Research question: Can extraction-based MT result in improved accuracy and utility of information for analysts?
ITIC Site Visit

Extraction-based MT
• “Traditional” approach:
  • Develop information extraction capability for the source language
  • Runtime extractor produces a template of extracted feature-value information
  • If desired, an English generator can render the information in the form of text
• Drawback: adapting extraction technology to a new foreign language is difficult
  • Requires significant expertise in the foreign language
  • Requires significant amounts of human development time
• Not clear that it is an attractive solution

AMTEXT Approach
• Attempt to leverage our work on automatic learning of MT transfer rules
• Develop an elicitation corpus specifically designed for targeted extraction patterns
• Learn generalized transfer rules for targeted extraction patterns from the elicitation corpus
• Acquire a high-accuracy Named-Entity translation lexicon plus a limited translation lexicon for targeted vocabulary
• Runtime: use a partial parser plus transfer rules to translate only the matched portions of the SL text

AMTEXT Extraction-based MT
[System diagram; labels recovered from the figure]
• Learning Module: Word-aligned elicited data, Transfer Rules
• Run Time Transfer System: Source Text, Partial Parser, Transfer Engine, NE Translation Lexicon, Word Translation Lexicon, Extracted Target Text
• Example transfer rule:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE] ((X1::Y1) (X4::Y4) (X5::Y5))

Elicitation Example

Learning Transfer Rules
• Different notion of rule generalization than in our full XFER approach
• Generalize from examples to NEs that play specific roles in the target extraction pattern
• Verbs and function words may not be generalized
• Example:
  Sharon will meet with Bush today
  sharon yipagesh &im bush hayom
• Goal rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE] ((X1::Y1) (X4::Y5) (X5::Y6))
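The goal rule on this slide can be read as a source pattern, a target pattern, and a set of 1-based constituent alignments (X4::Y5 means source position 4 fills target position 5). The following is a minimal Python sketch of that reading, not the project's Transfer Engine: the `TransferRule` class, `parse_rule`, `apply_rule`, and the `(surface, tag, english)` token format are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class TransferRule:
    source: list[str]           # source-side pattern elements
    target: list[str]           # target-side pattern elements
    alignments: dict[int, int]  # 1-based source position -> 1-based target position

# Hypothetical parser for the rule notation shown on the slide.
RULE_RE = re.compile(r"^\s*\S+\s*\[(.*?)\]\s*->\s*\[(.*?)\]\s*\((.*)\)\s*$")

def parse_rule(text: str) -> TransferRule:
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"not a transfer rule: {text!r}")
    src, tgt, align = m.groups()
    pairs = re.findall(r"X(\d+)::Y(\d+)", align)
    return TransferRule(src.split(), tgt.split(), {int(x): int(y) for x, y in pairs})

def apply_rule(rule: TransferRule, tokens):
    """tokens: list of (surface, tag, english); tag and english are None for
    literal words. Returns the English string, or None if the rule does not match."""
    if len(tokens) != len(rule.source):
        return None
    for elem, (surface, tag, _) in zip(rule.source, tokens):
        if elem != tag and elem != surface:
            return None
    out = list(rule.target)
    for x, y in rule.alignments.items():
        out[y - 1] = tokens[x - 1][2]  # copy the translated constituent across
    return " ".join(out)

rule = parse_rule("S::S [NE-P yipagesh &im NE-P TE] -> "
                  "[NE-P will meet with NE-P TE]((X1::Y1) (X4::Y5) (X5::Y6))")
tokens = [("sharon", "NE-P", "Sharon"), ("yipagesh", None, None),
          ("&im", None, None), ("bush", "NE-P", "Bush"), ("hayom", "TE", "today")]
result = apply_rule(rule, tokens)  # "Sharon will meet with Bush today"
```

Only the NE-P and TE slots are variables here; the verb and function words stay literal, in line with the slide's note that they are not generalized.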

Acquisition of Named Entity Translation Lexicon
• Utilize Fei Huang’s work on building Named Entity translation lexicons based on transliteration models
• NE lexicon will be split into meaningful sub-categories: PNs, Organizations, Locations, etc.
• NE translation lexicon augmented with NEs from elicited data
• Goal: high-coverage, high-accuracy identification of NEs that play a part in the transfer rules

Named Entity Translation Lexicon
• English-Arabic lexicon from Fei:
  • Trained on TIDES Newswire data
  • 7522 entries sorted by transliteration score
• Example:
  4.51948528108464 # XXX # # Israel # AsrAAyl
  4.05498190544419 # XXX # # Kabul # kAbwl
  3.66368346525326 # XXX # # Paris # bArys
  3.65527347080481 # XXX # # Afghanistan # AfgAnstAn
  3.47030997281853 # XXX # # Pakistan # bAkstAn
  3.23199522148251 # XXX # # Moscow # mwskw
  3.20392400497002 # XXX # # Arafat # ErfAt
  3.13060360328543 # XXX # # Beirut # byrwt
  3.06872591580516 # XXX # # Russia # rwsyA
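A lexicon in the format shown in the example could be loaded roughly as follows. This is a sketch under assumptions: the five `#`-separated fields are taken to be score, a placeholder (`XXX`), an empty field, the English form, and the romanized Arabic form; the meaning of the second and third fields is not stated on the slide, so they are ignored. `NEEntry` and `parse_lexicon` are hypothetical names.

```python
from typing import NamedTuple

class NEEntry(NamedTuple):
    score: float   # transliteration score (first field)
    english: str   # English form (fourth field)
    arabic: str    # romanized Arabic form (fifth field)

def parse_lexicon(lines):
    """Index entries by romanized Arabic form, keeping the best-scoring entry."""
    lexicon = {}
    for line in lines:
        fields = [f.strip() for f in line.split("#")]
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        score, _, _, english, arabic = fields
        entry = NEEntry(float(score), english, arabic)
        best = lexicon.get(arabic)
        if best is None or entry.score > best.score:
            lexicon[arabic] = entry
    return lexicon

sample = [
    "4.51948528108464 # XXX # # Israel # AsrAAyl",
    "4.05498190544419 # XXX # # Kabul # kAbwl",
    "3.66368346525326 # XXX # # Paris # bArys",
]
lexicon = parse_lexicon(sample)
```

Indexing by the Arabic side suits Arabic-to-English lookup at runtime; an English-keyed index would be built the same way.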

Named Entity Identification
• NE Identifinder for English:
  • Available from BBN
  • Will be used for identifying English NEs within the elicited data; the Arabic NEs then follow from the word alignments
• NE Identifinder for Arabic:
  • Requested from BBN, so far no response
  • Will use it if available, but can manage without it (naïve identification based on the NE translation lexicon)
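Carrying English NE tags over to the source side through word alignments might look like the sketch below. It assumes NE spans are given as half-open token ranges with a label and that alignments are `(english_index, source_index)` pairs; `project_spans` is a hypothetical helper, and the indices use the Sharon/Bush sentence pair from the transfer-rule example.

```python
def project_spans(english_spans, alignment):
    """Map English NE spans (start, end, label) onto source-side token spans
    using word alignments given as (english_index, source_index) pairs."""
    projected = []
    for start, end, label in english_spans:
        src = sorted(s for e, s in alignment if start <= e < end)
        if src:  # unaligned NEs are dropped
            projected.append((src[0], src[-1] + 1, label))
    return projected

# English: Sharon will meet with Bush today   (tokens 0..5)
# Source:  sharon yipagesh &im bush hayom     (tokens 0..4)
alignment = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 4)}
english_nes = [(0, 1, "NE-P"), (4, 5, "NE-P")]
source_nes = project_spans(english_nes, alignment)
```

Taking the minimum and maximum aligned positions yields a contiguous source span, which is a simplification when alignments are discontinuous.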

Acquisition of Limited Word Translation Lexicon
• Vocabulary of interest is limited to the specific actions and objects that are of interest, so it can be scoped on the English side
• Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
• A statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign-language side
• If time and resources permit, experiment with incorporating the expanded vocabulary into transfer rules
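One simple way to pull a limited lexicon out of a word-aligned elicitation corpus is to count aligned word pairs whose English side falls in the vocabulary of interest. The sketch below assumes a corpus of `(english_tokens, source_tokens, alignment)` triples and keeps the most frequent English word per source word; `build_lexicon` and the data layout are illustrative assumptions, not the project's procedure.

```python
from collections import Counter, defaultdict

def build_lexicon(corpus, english_vocab):
    """corpus: iterable of (english_tokens, source_tokens, alignment), where
    alignment is a set of (english_index, source_index) pairs.
    Returns {source_word: most frequently aligned in-vocabulary English word}."""
    counts = defaultdict(Counter)
    for eng, src, alignment in corpus:
        for e, s in alignment:
            word = eng[e].lower()
            if word in english_vocab:  # scope the lexicon on the English side
                counts[src[s]][word] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

corpus = [(
    "Sharon will meet with Bush today".split(),
    "sharon yipagesh &im bush hayom".split(),
    {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 4)},
)]
word_lexicon = build_lexicon(corpus, {"meet", "with", "today"})
```

Names such as "sharon" and "bush" are deliberately left out of the vocabulary here, on the assumption that they are handled by the NE translation lexicon instead.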

Partial Parsing
• Input: Full text in the foreign language
• Output: Translation of extracted/matched text
• Goal: Extract by effectively matching transfer rules with the full text
  • Identify/parse NEs and words in the restricted vocabulary
  • Identify transfer-rule (source-side) patterns
  • Handle expected high levels of ambiguity
• Example:
  Source: Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  Identified elements: NE-P NE-P NE-P TE
  Output: Sharon will meet with Bush today
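The Sharon example shows why matching is ambiguous: an intervening phrase separates the subject from the verb, and it contains another NE. A minimal sketch of one possible matching strategy follows. It enumerates every in-order match of a source-side pattern while allowing up to `max_gap` skipped tokens between consecutive pattern elements. The gap-tolerant search, the `(surface, tag)` token format, the tokenization (with "shalom" as a separate NE-P token), and the names `find_matches` and `max_gap` are all assumptions; the slides do not specify how the partial parser resolves this.

```python
def element_matches(elem, token):
    surface, tag = token
    return elem == tag or elem == surface

def find_matches(pattern, tokens, max_gap):
    """Return all tuples of token indices that match the pattern in order,
    skipping at most max_gap tokens between consecutive pattern elements."""
    results = []

    def extend(p, start, chosen):
        if p == len(pattern):
            results.append(tuple(chosen))
            return
        # The first element may start anywhere; later ones must stay within the gap.
        limit = len(tokens) if not chosen else min(len(tokens), chosen[-1] + max_gap + 2)
        for i in range(start, limit):
            if element_matches(pattern[p], tokens[i]):
                extend(p + 1, i + 1, chosen + [i])

    extend(0, 0, [])
    return results

tokens = [("Sharon", "NE-P"), (",", None), ("meluve", None), ("b-sar", None),
          ("ha-xuc", None), ("shalom", "NE-P"), (",", None), ("yipagesh", None),
          ("im", None), ("bush", "NE-P"), ("hayom", "TE")]
pattern = ["NE-P", "yipagesh", "im", "NE-P", "TE"]

wide = find_matches(pattern, tokens, max_gap=6)    # two candidate subjects
narrow = find_matches(pattern, tokens, max_gap=1)  # only the nearer NE survives
```

With a wide gap both "Sharon" and "shalom" are returned as candidate subjects; with a narrow gap only "shalom" is, which is not the reading shown on the slide. Choosing among candidates is the ambiguity-handling step the slide calls out.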

Scope of Pilot System
• Arabic-to-English
• Newswire text (available from TIDES)
• Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
• Limited translation patterns:
  • <subj-NE> <verb> <obj> <LOC>* <TE>*
• Limited vocabulary
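The translation pattern above reads like a regular expression over constituent tags: a subject NE, a verb, an object, then zero or more locations and zero or more time expressions. A small sketch of a scope check along those lines is given below. The tag strings, the `in_scope` function, and the use of verb lemmas are assumptions, and `ACTIONS` lists only the five actions named on the slide, whose list is open-ended.

```python
import re

# Only the actions named on the slide; the original list ends with an ellipsis.
ACTIONS = {"meet", "attend", "hold", "kill", "announce"}

# <subj-NE> <verb> <obj> <LOC>* <TE>* over a space-joined tag sequence.
SCOPE_RE = re.compile(r"subj-NE verb obj( LOC)*( TE)*")

def in_scope(tags, verb_lemma):
    """True if the tag sequence fits the pilot pattern and the verb is a covered action."""
    return verb_lemma in ACTIONS and SCOPE_RE.fullmatch(" ".join(tags)) is not None

covered = in_scope(["subj-NE", "verb", "obj", "LOC", "TE"], "meet")
missing_object = in_scope(["subj-NE", "verb", "TE"], "meet")
uncovered_verb = in_scope(["subj-NE", "verb", "obj"], "visit")
```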

Evaluation Plan
• Compare the AMTEXT approach to full-text Arabic-to-English SMT on a limited task: translation of relations within the scope of coverage
• Establish a test set for evaluation
• Define an appropriate metric: Precision/Recall/F1 of relations and entities
• Compare performance
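Precision, recall, and F1 over extracted relations can be computed by treating system output and the reference as sets of tuples. The sketch below assumes exact-match scoring of `(subject, action, object)` triples; the triples are invented for illustration and are not evaluation data, and the exact matching criterion is still to be defined per the plan.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two sets of relation tuples."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical relations, for illustration only.
gold = {("Sharon", "meet", "Bush"), ("A", "attend", "B"),
        ("C", "hold", "D"), ("E", "announce", "F")}
predicted = {("Sharon", "meet", "Bush"), ("A", "attend", "B"), ("C", "kill", "D")}

p, r, f1 = precision_recall_f1(predicted, gold)  # 2 of 3 predicted, 2 of 4 gold
```

The same function applies to entities by passing sets of entity strings instead of relation triples.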

Current Status
• Initial small elicitation corpus translated and aligned
• Extraction of elicitation phrases from Penn-TB in advanced stages
• Identifying scope of coverage: relations, actions, translation patterns
• Preliminary NE translation lexicon available

Work Plan
• Creation of full elicitation corpus: Nov-03
• Translation/alignment of elicitation corpus: Nov/Dec-03
• Install and integrate BBN English Identifinder: Dec-03
• Acquire initial NE translation lexicon: Dec-03
• Acquire initial word translation lexicon: Dec-03
• Develop and integrate partial parser: Dec-03/Feb-04
• Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
• Integration of preliminary complete system: Feb-04
• Design of evaluation: Feb-04
• System testing and modifications: Feb/Apr-04
• Test-set evaluation: Apr-04
