AMTEXT: Extraction-based MT for Arabic

Explore the use of extraction-based machine translation (MT) for improved accuracy and utility of information in Arabic text. This approach involves first extracting relevant information and then translating only the extracted information, minimizing the need for full MT.

Presentation Transcript


  1. AMTEXT: Extraction-based MT for Arabic
     Faculty: Alon Lavie, Jaime Carbonell
     Students and Staff: Laura Kieras, Peter Jansen
     Informant: Loubna El Abadi

  2. Background and Objectives
     • Full MT of text is problematic:
        • Requires large amounts of resources and long development time
        • Quality of output varies
     • Analysts are often looking for limited, concrete information within the text → full MT may not be necessary
     • Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
     • Text extraction technology has made much progress in the past decade [TIPSTER, TREC, EELD]
     • Research question: Can extraction-based MT result in improved accuracy and utility of information for analysts?
     ITIC Site Visit

  3. Extraction-based MT
     • “Traditional” approach:
        • Develop information extraction capability for the source language
        • Runtime Extractor produces a template of extracted feature-value information
        • If desired, an English Generator can render the information in the form of text
     • Drawback: adapting extraction technology to a new foreign language is difficult
        • Requires significant expertise in the foreign language
        • Requires significant amounts of human development time
     • Not clear that it is an attractive solution

  4. AMTEXT Approach
     • Attempt to leverage our work on automatic learning of MT transfer rules
     • Develop an elicitation corpus specifically designed for targeted extraction patterns
     • Learn generalized transfer rules for targeted extraction patterns from the elicitation corpus
     • Acquire a high-accuracy Named-Entity translation lexicon + a limited translation lexicon for targeted vocabulary
     • Runtime: use partial parser + transfer rules to translate only the matched portions of the SL text

  5. AMTEXT Extraction-based MT
     [Architecture diagram. Labeled components: Word-aligned elicited data, Learning Module, Transfer Rules, Source Text, Run Time Transfer System, Partial Parser, Transfer Engine, NE Translation Lexicon, Word Translation Lexicon, Extracted Target Text]
     Example transfer rule shown in the diagram:
     S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]((X1::Y1) (X4::Y4) (X5::Y5))
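
The rule notation on this slide can be read as a source-side pattern, a target-side pattern, and a list of position alignments between them. The following is a minimal sketch of reading that notation into a data structure. The `TransferRule` type and `parse_rule` function are illustrative names, not part of the AMTEXT system, and the reading of `Xi::Yj` as 1-based positions is an assumption based on the two rules shown in the slides.

```python
import re
from typing import NamedTuple


class TransferRule(NamedTuple):
    """Hypothetical container for one rule in the slide's notation."""
    source: list[str]                  # source-side pattern tokens
    target: list[str]                  # target-side pattern tokens
    alignments: list[tuple[int, int]]  # (X index, Y index), assumed 1-based


# "S::S [source pattern] -> [target pattern]((Xi::Yj) ...)"
RULE_RE = re.compile(r"^\s*\w+::\w+\s*\[(.*?)\]\s*->\s*\[(.*?)\]\s*\((.*)\)\s*$")


def parse_rule(text: str) -> TransferRule:
    """Split a rule string into its patterns and alignment constraints."""
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"not a transfer rule: {text!r}")
    source, target, constraints = m.groups()
    pairs = [(int(x), int(y)) for x, y in re.findall(r"X(\d+)::Y(\d+)", constraints)]
    return TransferRule(source.split(), target.split(), pairs)


rule = parse_rule(
    "S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]"
    "((X1::Y1) (X4::Y4) (X5::Y5))"
)
print(rule.source)      # source-side tokens
print(rule.alignments)  # position pairs linking source slots to target slots
```

Under this reading, only the aligned positions (the two NE-P slots and the TE slot) are filled from the input text, while the remaining target tokens are emitted literally.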

  6. Elicitation Example

  7. Elicitation Example

  8. Elicitation Example

  9. Elicitation Example

  10. Learning Transfer Rules
     • Different notion of rule generalization than in our full XFER approach
     • Generalize from examples to NEs that play specific roles in the target extraction pattern
     • Verbs and function words may not be generalized
     • Example:
        Sharon will meet with Bush today
        sharon yipagesh &im bush hayom
     • Goal rule:
        S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]((X1::Y1) (X4::Y5) (X5::Y6))
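
One way to picture the generalization step on this slide: given a word-aligned example in which some source tokens are already tagged as NE-P or TE, replace those tokens with their category on both sides and record the slot alignments, leaving verbs and function words literal. The sketch below does only that. The `generalize` function, its arguments, and the hand-written alignment are assumptions for illustration, not the project's learning module.

```python
def generalize(src_tokens, tgt_tokens, alignment, src_tags):
    """Turn one aligned example into a rule string in the slide's notation.

    alignment: list of (source index, target index) pairs, 0-based.
    src_tags:  dict mapping 0-based source index -> category ("NE-P", "TE").
    Tagged tokens are assumed to align to exactly one target token.
    """
    # dict() keeps the last pair per source index; only tagged slots are looked up.
    src_to_tgt = dict(alignment)
    source = [src_tags.get(i, tok) for i, tok in enumerate(src_tokens)]
    target = list(tgt_tokens)
    constraints = []
    for i in sorted(src_tags):
        j = src_to_tgt[i]
        target[j] = src_tags[i]                      # generalize the target slot
        constraints.append(f"(X{i + 1}::Y{j + 1})")  # 1-based, as on the slide
    return (
        f"S::S [{' '.join(source)}] -> [{' '.join(target)}]"
        f"({' '.join(constraints)})"
    )


src = "sharon yipagesh &im bush hayom".split()
tgt = "Sharon will meet with Bush today".split()
# Hand-written word alignment for the slide's example (an assumption).
alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
tags = {0: "NE-P", 3: "NE-P", 4: "TE"}

learned = generalize(src, tgt, alignment, tags)
print(learned)
```

For the slide's example this reproduces the goal rule, including the shifted alignments X4::Y5 and X5::Y6 that result from the two-word English verb "will meet".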

  11. Acquisition of Named Entity Translation Lexicon
     • Utilize Fei Huang’s work on building Named Entity translation lexicons based on transliteration models
     • NE lexicon will be split into meaningful sub-categories: PNs, Organizations, Locations, etc.
     • NE translation lexicon augmented with NEs from elicited data
     • Goal: high-coverage and high-accuracy identification of NEs that play a part in the transfer rules

  12. Named Entity Translation Lexicon
     • English-Arabic lexicon from Fei:
        • Trained on TIDES Newswire data
        • 7522 entries sorted by transliteration score
        • Example:
           4.51948528108464 # XXX # # Israel # AsrAAyl
           4.05498190544419 # XXX # # Kabul # kAbwl
           3.66368346525326 # XXX # # Paris # bArys
           3.65527347080481 # XXX # # Afghanistan # AfgAnstAn
           3.47030997281853 # XXX # # Pakistan # bAkstAn
           3.23199522148251 # XXX # # Moscow # mwskw
           3.20392400497002 # XXX # # Arafat # ErfAt
           3.13060360328543 # XXX # # Beirut # byrwt
           3.06872591580516 # XXX # # Russia # rwsyA
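
The sample entries above look like `#`-separated records with the score first and the English and transliterated Arabic forms last. A minimal sketch of loading such lines follows, assuming that field layout. The meaning of the `XXX` and empty fields is not stated in the slides, so they are ignored here, and the score cutoff of 3.5 is an arbitrary illustrative value.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """Hypothetical record for one lexicon line."""
    score: float
    english: str
    arabic: str


def parse_entry(line: str) -> LexEntry:
    """Parse 'score # XXX # # English # Arabic' into a LexEntry."""
    fields = [f.strip() for f in line.split("#")]
    # First field is the score; the last two are the English and Arabic forms.
    return LexEntry(float(fields[0]), fields[-2], fields[-1])


SAMPLE = """\
4.51948528108464 # XXX # # Israel # AsrAAyl
4.05498190544419 # XXX # # Kabul # kAbwl
3.13060360328543 # XXX # # Beirut # byrwt
"""

entries = [parse_entry(line) for line in SAMPLE.splitlines() if line.strip()]

# Keep only entries above an illustrative cutoff, indexed by the Arabic form.
MIN_SCORE = 3.5  # assumption, not a value from the slides
ar_to_en = {e.arabic: e.english for e in entries if e.score >= MIN_SCORE}
print(ar_to_en)
```

Indexing by the Arabic form suits the runtime direction (Arabic to English); a cutoff trades coverage for accuracy, which matches the stated goal of a high-accuracy lexicon.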

  13. Named Entity Identification
     • NE Identifinder for English:
        • Available from BBN
        • Will be used for identifying English NEs within elicited data → Arabic NEs from word alignments
     • NE Identifinder for Arabic:
        • Requested from BBN, so far no response
        • Will use if available; can manage without it (naïve identification based on NE translation lexicon)
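
The first bullet describes obtaining source-language NEs by projecting English NE spans through the word alignments of the elicited data. A minimal sketch of that projection is below. The span and alignment representations are assumptions, and the aligned pair is the transliterated example from slide 10, used here only as a stand-in for real elicited data.

```python
def project_entities(en_spans, alignment):
    """Project English NE spans onto the source side via word alignments.

    en_spans:  list of (start, end_exclusive, label) over English tokens.
    alignment: set of (english index, source index) pairs, 0-based.
    Returns a list of (start, end_exclusive, label) over source tokens.
    """
    projected = []
    for start, end, label in en_spans:
        # Source positions aligned to any English token inside the span.
        src = sorted(s for e, s in alignment if start <= e < end)
        if src:  # unaligned spans are dropped
            projected.append((src[0], src[-1] + 1, label))
    return projected


english = "Sharon will meet with Bush today".split()
source = "sharon yipagesh &im bush hayom".split()
# Hand-written alignment (an assumption for illustration).
alignment = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3), (5, 4)}
# Spans as an English NE tagger might return them (hypothetical output).
en_spans = [(0, 1, "NE-P"), (4, 5, "NE-P")]

src_spans = project_entities(en_spans, alignment)
print([(source[a:b], label) for a, b, label in src_spans])
```

The sketch takes the smallest contiguous source span covering all aligned positions, which is a simplification; discontinuous alignments would need extra handling.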

  14. Acquisition of Limited Word Translation Lexicon
     • Vocabulary of interest is limited based on specific actions and objects that are of interest → scopeable on the English side
     • Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
     • Statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign language side
     • If time and resources permit, experiment with incorporating expanded vocabulary into transfer rules

  15. Partial Parsing
     • Input: full text in the foreign language
     • Output: translation of extracted/matched text
     • Goal: extract by effectively matching transfer rules with the full text
     • Identify/parse NEs and words in restricted vocabulary
     • Identify transfer-rule (source-side) patterns
     • Handle expected high levels of ambiguity
     • Example:
        Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
        Tags: NE-P NE-P NE-P TE
        Sharon will meet with Bush today
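
A minimal sketch of the matching idea: tag each token using the restricted lexicons, slide the source side of a rule over the tagged text, and translate only the spans that match. The lexicon contents, the `x1`/`x2`/`x3` filler tokens (placeholders for out-of-vocabulary words), and the function names are assumptions. The rule is the goal rule from slide 10. This sketch matches contiguous spans only, so it would not handle intervening material between the subject and the verb as in the slide's example, which is part of the ambiguity the slide flags.

```python
# Hypothetical restricted lexicons (source form -> English form).
NE_LEXICON = {"sharon": "Sharon", "bush": "Bush"}
TE_LEXICON = {"hayom": "today"}

# Rule from slide 10: source pattern, target pattern, 0-based slot alignment.
RULE = (
    ["NE-P", "yipagesh", "&im", "NE-P", "TE"],
    ["NE-P", "will", "meet", "with", "NE-P", "TE"],
    {0: 0, 3: 4, 4: 5},
)


def tag(token):
    """Map a token to its category, or leave it as a literal word."""
    if token in NE_LEXICON:
        return "NE-P"
    if token in TE_LEXICON:
        return "TE"
    return token


def translate_matches(tokens, rule):
    """Return translations for every contiguous span matching the rule."""
    src, tgt, align = rule
    tags = [tag(t) for t in tokens]
    results = []
    for i in range(len(tokens) - len(src) + 1):
        if tags[i:i + len(src)] == src:
            out = list(tgt)
            for s, t in align.items():
                word = tokens[i + s]
                # Fill each slot from whichever lexicon knows the word.
                out[t] = NE_LEXICON.get(word) or TE_LEXICON[word]
            results.append(" ".join(out))
    return results


text = "x1 x2 sharon yipagesh &im bush hayom x3".split()
translations = translate_matches(text, RULE)
print(translations)
```

Tokens outside the matched span are never translated, which is the point of the extract-then-translate design described in the earlier slides.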

  16. Scope of Pilot System
     • Arabic-to-English
     • Newswire text (available from TIDES)
     • Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
     • Limited translation patterns:
        • <subj-NE> <verb> <obj> <LOC>* <TE>*
     • Limited vocabulary
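
The translation pattern above reads like a regular expression over category tags: a subject NE, a verb, an object, then zero or more locations and zero or more temporal expressions. As a rough illustration, assuming a sentence has already been reduced to a tag sequence, the pattern can be checked with an ordinary regex. The short tag names used here are hypothetical.

```python
import re

# <subj-NE> <verb> <obj> <LOC>* <TE>* over space-joined tags (names assumed).
PATTERN = re.compile(r"NE VERB OBJ( LOC)*( TE)*")


def in_scope(tags):
    """True if the whole tag sequence fits the pilot translation pattern."""
    return PATTERN.fullmatch(" ".join(tags)) is not None


print(in_scope(["NE", "VERB", "OBJ"]))
print(in_scope(["NE", "VERB", "OBJ", "LOC", "TE"]))
print(in_scope(["NE", "VERB", "OBJ", "TE", "LOC"]))
```

Read literally, the pattern requires locations to precede temporal expressions, so the third sequence falls outside it.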

  17. Evaluation Plan
     • Compare the AMTEXT approach to full-text Arabic-to-English SMT on a limited task: translation of relations within the scope of coverage
     • Establish a test set for evaluation
     • Define an appropriate metric: Precision/Recall/F1 of relations and entities
     • Compare performance
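
For reference, precision, recall, and F1 over extracted relations can be computed by comparing the set of system relations against a gold set. The sketch below assumes exact-match scoring on (action, X, Y) tuples, which is only one possible definition; the slides do not specify how matches are counted, and the tuples are toy values.

```python
def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy relation tuples (action, X, Y); illustrative values only.
gold = [("meet", "Sharon", "Bush"), ("attend", "A", "B")]
predicted = [("meet", "Sharon", "Bush"), ("hold", "C", "D")]

p, r, f = prf1(predicted, gold)
print(p, r, f)
```

The same function applies to entities by passing entity strings instead of relation tuples.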

  18. Current Status
     • Initial small elicitation corpus translated and aligned
     • Extraction of elicitation phrases from Penn-TB in advanced stages
     • Identifying scope of coverage: relations, actions, translation patterns
     • Preliminary NE translation lexicon available

  19. Work Plan
     • Creation of full elicitation corpus: Nov-03
     • Translation/alignment of elicitation corpus: Nov/Dec-03
     • Install and integrate BBN English Identifinder: Dec-03
     • Acquire initial NE translation lexicon: Dec-03
     • Acquire initial word translation lexicon: Dec-03
     • Develop and integrate partial parser: Dec-03/Feb-04
     • Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
     • Integration of preliminary complete system: Feb-04
     • Design of evaluation: Feb-04
     • System testing and modifications: Feb/Apr-04
     • Test-set evaluation: Apr-04
