AMTEXT: Extraction-based MT for Arabic
Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University
Email: {alavie,jgc}@cs.cmu.edu
Project Members: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi

Objective
• Develop a framework for high-accuracy MT of extracted entities, objects, and their relationships, which is:
  • Rapidly portable and adaptable to new source languages
  • Easily expandable to new types of entities and relationships
ITIC MT Integration Meeting

AMTEXT Approach
• Develop an elicitation corpus specifically designed for targeted extraction patterns
• Learn generalized transfer rules for the targeted extraction patterns from the elicitation corpus
• Acquire a high-accuracy Named-Entity translation lexicon, plus a limited translation lexicon for the targeted vocabulary
• Runtime: use a partial parser and the transfer rules to translate only the matched portions of the source-language (SL) text

Elicitation Example
[Slide figure: elicitation example; content not preserved in this transcript]

Learning Transfer Rules
• Different notion of rule generalization than in our full XFER approach
• Generalize from examples to NEs that play specific roles in the target extraction pattern
• Verbs and function words may not be generalized
• Example:
  Peres will meet with Bush today
  peres yipagesh &im bush hayom
• Goal Rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]
  ((X1::Y1) (X4::Y5) (X5::Y6))

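The goal rule above can be read as a source pattern, a target pattern, and a set of 1-based alignments between their positions. The following is a minimal sketch of that reading, not the project's implementation: the `TransferRule` class, the `apply_rule` helper, the hand-assigned tags, and the toy `LEXICON` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Tags treated as generalized slots; every other pattern item is a literal word.
SLOT_TAGS = {"NE-P", "TE"}


@dataclass(frozen=True)
class TransferRule:
    source: tuple      # source-side pattern
    target: tuple      # target-side pattern
    alignments: tuple  # (x, y) pairs meaning X_x::Y_y, 1-based


# The goal rule from the slide, transcribed as data.
GOAL_RULE = TransferRule(
    source=("NE-P", "yipagesh", "&im", "NE-P", "TE"),
    target=("NE-P", "will", "meet", "with", "NE-P", "TE"),
    alignments=((1, 1), (4, 5), (5, 6)),
)


def apply_rule(rule, tokens, translate):
    """Apply a rule to (word, tag) tokens; return target words or None."""
    if len(tokens) != len(rule.source):
        return None
    for (word, tag), item in zip(tokens, rule.source):
        if item in SLOT_TAGS:
            if tag != item:       # slot: the token's tag must agree
                return None
        elif word != item:        # literal: the word itself must agree
            return None
    output = list(rule.target)
    for x, y in rule.alignments:  # fill each target slot from its source slot
        output[y - 1] = translate(tokens[x - 1][0])
    return output


# Toy lexicon and hand-tagged input (assumptions, for illustration only).
LEXICON = {"peres": "Peres", "bush": "Bush", "hayom": "today"}
tokens = [("peres", "NE-P"), ("yipagesh", None), ("&im", None),
          ("bush", "NE-P"), ("hayom", "TE")]

print(" ".join(apply_rule(GOAL_RULE, tokens, LEXICON.get)))
# prints: Peres will meet with Bush today
```

Because the verb and preposition are literals in the pattern, the sketch only fires on this exact verb; the NE and TE slots are the only generalized positions, which mirrors the slide's notion of generalization.
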
Partial Parsing
• Input: Full text in the foreign language
• Output: Translation of extracted/matched text
• Goal: Extract by effectively matching transfer rules with the full text
  • Identify/parse NEs and words in restricted vocabulary
  • Identify transfer-rule (source-side) patterns
  • Handle expected high levels of ambiguity
• Example:
  Source: Peres, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  Tags: NE-P (Peres), NE-P (shalom), NE-P (bush), TE (hayom)
  Output: Peres will meet with Bush today

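One way to picture matching a source-side pattern against full text is as an ordered subsequence match that is allowed to skip unmatched material. The sketch below assumes that policy; the slides do not say how the partial parser handles gaps. The `partial_match` function, the `max_gap` parameter, the tokenization, and the tags are all hypothetical.

```python
# Tags treated as slots; other pattern items are literal words.
SLOT_TAGS = {"NE-P", "TE"}


def item_matches(item, word, tag):
    """A slot matches on the token's tag, a literal on the word itself."""
    return tag == item if item in SLOT_TAGS else word == item


def partial_match(pattern, tokens, max_gap=5):
    """Return token indices matching `pattern` in order, or None.

    Up to `max_gap` tokens may be skipped between consecutive pattern
    items (an assumed policy). The leftmost match is returned.
    """
    def search(p_idx, start, limit):
        if p_idx == len(pattern):
            return []
        for i in range(start, min(limit, len(tokens))):
            word, tag = tokens[i]
            if item_matches(pattern[p_idx], word, tag):
                rest = search(p_idx + 1, i + 1, i + 2 + max_gap)
                if rest is not None:
                    return [i] + rest
        return None

    # The first pattern item may start anywhere in the text.
    return search(0, 0, len(tokens))


SOURCE_PATTERN = ("NE-P", "yipagesh", "im", "NE-P", "TE")

# Hand-tagged version of the slide's sentence (punctuation dropped).
TOKENS = [("peres", "NE-P"), ("meluve", None), ("b-sar", None),
          ("ha-xuc", None), ("shalom", "NE-P"), ("yipagesh", None),
          ("im", None), ("bush", "NE-P"), ("hayom", "TE")]

indices = partial_match(SOURCE_PATTERN, TOKENS)
print([TOKENS[i][0] for i in indices])
# prints: ['peres', 'yipagesh', 'im', 'bush', 'hayom']
```

With the default gap, the match skips the parenthetical and picks "peres" as the first NE, as in the slide. Calling `partial_match(SOURCE_PATTERN, TOKENS, max_gap=2)` instead selects "shalom" as the first NE, a small illustration of the ambiguity the partial parser is expected to handle.
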
Input/Output
• Input:
  • Full text in source language (Arabic)
• Output:
  • English translation of extracted entities and relationships
  • (Possibly also a structured representation)
• Example input (Arabic):
  أعلنت صحيفة القدس العربي ومقرها لندن أنها تلقت الأحد بيانا يتبنى فيه تنظيم القاعدة بزعامة أسامة بن لادن الهجومين اللذين استهدفا كنيسين يهوديين في إسطنبول واللذين أسفرا عن مقتل 23 شخصا وإصابة 300 آخرين. وهدد البيان بتوجيه مزيد من الضربات للولايات المتحدة وحلفائها في جميع أنحاء العالم.
• AMTEXT System
• Example output (English):
  The Abu Hafz al-Masri Brigades - al-Qaida warned car bombs killed 23 people injured 300 others

Scope of Pilot System
• Arabic-to-English
• Newswire text (available from TIDES)
• Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
• Limited translation patterns:
  • <subj-NE> <verb> <obj> <LOC>* <TE>*
• Limited vocabulary

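The translation pattern above uses `*` in the usual "zero or more" sense. As a rough sketch of that shape, a sequence of constituent tags can be checked with a regular expression; the tag names `NE`, `V`, `OBJ`, `LOC`, `TE` and the `fits_pilot_pattern` helper are assumptions made for this illustration, not names from the pilot system.

```python
import re

# <subj-NE> <verb> <obj> <LOC>* <TE>* over space-separated tags.
PILOT_PATTERN = re.compile(r"NE V OBJ( LOC)*( TE)*")


def fits_pilot_pattern(tags):
    """True if the whole tag sequence has the pilot pattern's shape."""
    return PILOT_PATTERN.fullmatch(" ".join(tags)) is not None


print(fits_pilot_pattern(["NE", "V", "OBJ"]))                # True
print(fits_pilot_pattern(["NE", "V", "OBJ", "LOC", "TE"]))   # True
print(fits_pilot_pattern(["NE", "V", "OBJ", "TE", "LOC"]))   # False
```

The last call fails because, in this reading, any locations must precede any time expressions.
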
Evaluation Plan
• Compare the AMTEXT approach to full-text Arabic-to-English SMT on a limited task: translation of relations within the scope of coverage
• Establish a test set for evaluation
• Define an appropriate metric: Precision/Recall/F1 of relations and entities
• Compare performance
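Precision/Recall/F1 over relations can be computed by treating each relation as a tuple and comparing the system's set against a reference set. This is a minimal sketch assuming exact-match scoring; the slides do not specify the matching criterion, and the relation tuples below are made-up toy data, not evaluation results.

```python
def precision_recall_f1(predicted, reference):
    """Exact-match precision, recall, and F1 between two collections."""
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy (subject, action, object) relations, for illustration only.
reference = {("Peres", "meet", "Bush"), ("A", "attend", "B")}
predicted = {("Peres", "meet", "Bush"), ("A", "hold", "B")}

print(precision_recall_f1(predicted, reference))
# prints: (0.5, 0.5, 0.5)
```

The same function applies to entities by passing sets of entity strings instead of relation tuples.
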