Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with: Shuly Wintner, Danny Shacham, Nurit Melnik, Yuval Krymolowski - University of Haifa Erik Peterson - Carnegie Mellon University
Outline • Context of this Work • CMU Statistical Transfer MT Framework • Hebrew and its Challenges for MT • Hebrew-to-English System • Morphological Analysis and Generation • MT Resources: lexicon and grammar • Translation Examples • Performance Evaluation • Conclusions, Current and Future Work ISCOL/BISFAI-2007
Current State-of-the-art in Machine Translation • MT underwent a major paradigm shift over the past 15 years: • From manually crafted rule-based systems with manually designed knowledge resources • To search-based approaches founded on automatic extraction of translation models/units from large sentence-parallel corpora • Current Dominant Approach: Phrase-based Statistical MT: • Extract and statistically model large volumes of phrase-to-phrase correspondences from automatically word-aligned parallel corpora • “Decode” new input by searching for the most likely sequence of phrase matches, using a statistical Language Model for the target language
Current State-of-the-art in Machine Translation • Phrase-based MT State-of-the-art: • Requires at least several million words of parallel text for adequate training • Limited to language pairs for which such data exists: major European languages, Chinese, Japanese, a few others… • Linguistically shallow and highly lexicalized models result in weak generalization • Best performance levels (BLEU=~0.6) on Arabic-to-English provide understandable but often still somewhat disfluent translations • Ill suited for Hebrew and most of the world’s minor languages
CMU’s Statistical-Transfer (XFER) Approach • Framework: Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts • Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences • Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages • XFER + Decoder: • XFER engine produces a lattice of possible transferred structures at all levels • Decoder searches and selects the best scoring combination • Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants • Word and Phrase bilingual lexicon acquisition
Hebrew Input: בשורה הבאה
Pipeline: Preprocessing → Morphology → Transfer Engine (uses Transfer Rules and Translation Lexicon) → Translation Output Lattice → Decoder (uses English Language Model) → English Output
Transfer Rules (example):
  {NP1,3}
  NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
  (
  (X3::Y1) (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1)
  )
Translation Lexicon (examples):
  N::N |: ["$WR"] -> ["BULL"]
  ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
  N::N |: ["$WRH"] -> ["LINE"]
  ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))
Translation Output Lattice:
  (0 1 "IN" @PREP)
  (1 1 "THE" @DET)
  (2 2 "LINE" @N)
  (1 2 "THE LINE" @NP)
  (0 2 "IN LINE" @PP)
  (0 4 "IN THE NEXT LINE" @PP)
English Output: in the next line
Transfer Rule Formalism
Rule components: type information, part-of-speech/constituent information, alignments, x-side constraints, y-side constraints, xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
  ;SL: the old man, TL: ha-ish ha-zaqen
  NP::NP [DET ADJ N] -> [DET N DET ADJ]
  (
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
  )
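To make the role of the alignments in the rule above concrete, here is a minimal Python sketch. It only illustrates how the X::Y pairs reorder and duplicate constituents; the dictionary encoding, the toy lexicon, and the function names are assumptions made for this example and are not the XFER engine's data structures. Constraints are ignored entirely.

```python
# Hypothetical encoding of the slide's NP::NP rule (illustration only).
RULE = {
    "source": ["DET", "ADJ", "N"],
    "target": ["DET", "N", "DET", "ADJ"],
    # (X index, Y index) pairs, 1-based as on the slide.
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],
}

# Toy word-level lexicon using the slide's transliterations.
LEXICON = {"the": "ha-", "old": "zaqen", "man": "ish"}


def apply_rule(rule, words):
    """Place the translation of each source word into its aligned target slots."""
    out = [None] * len(rule["target"])
    for x, y in rule["alignments"]:
        out[y - 1] = LEXICON[words[x - 1]]
    return out


def render(tokens):
    """Join tokens, attaching the prefix written as 'ha-' to the next word."""
    return " ".join(tokens).replace("- ", "-")


tokens = apply_rule(RULE, ["the", "old", "man"])
print(render(tokens))
```

Because X1 is aligned to both Y1 and Y3, the determiner is copied onto both the noun and the adjective on the target side, which is the structural difference the rule captures.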
The Transfer Engine • Main algorithm: chart-style bottom-up integrated parsing+transfer with beam pruning • Seeded by word-to-word translations • Driven by transfer rules • Generates a lattice of transferred translation segments at all levels • Some Unique Features: • Works with either learned or manually-developed transfer grammars • Handles rules with or without unification constraints • Supports interfacing with servers for morphological analysis and generation • Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures
XFER Output Lattice
  (28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
  (29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
  (29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
  (29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
  (30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
  (30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
  (30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
  (30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
  (30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
  (30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
  (30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
  (30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
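For readers who want to inspect arcs like the ones listed above, the following Python sketch parses one arc into named fields. The field interpretation (start position, end position, English text, score, source word, derivation tree) is inferred from the examples on this slide and is an assumption, as are the `Arc` type and `parse_arc` function; this is not the engine's own reader.

```python
import re
from typing import NamedTuple, Optional


class Arc(NamedTuple):
    start: int      # first source position covered (assumed meaning)
    end: int        # last source position covered (assumed meaning)
    target: str     # English text of the arc
    score: float    # score attached to the arc
    source: str     # transliterated Hebrew source word(s)
    tree: str       # bracketed derivation


# One arc: (start end "target" score "source" "tree")
ARC_RE = re.compile(
    r'\((\d+) (\d+) "([^"]*)" (-?\d+(?:\.\d+)?) "([^"]*)" "([^"]*)"\)'
)


def parse_arc(line: str) -> Optional[Arc]:
    """Return an Arc for a well-formed lattice line, or None otherwise."""
    m = ARC_RE.fullmatch(line.strip())
    if m is None:
        return None
    start, end, target, score, source, tree = m.groups()
    return Arc(int(start), int(end), target, float(score),
               source.strip(), tree.strip())


arc = parse_arc('(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 \'WORKED\')) ")')
print(arc)
```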
The Lattice Decoder • Simple Stack Decoder, similar in principle to simple Statistical MT decoders • Searches for best-scoring path of non-overlapping lattice arcs • No reordering during decoding • Scoring based on log-linear combination of scoring components, with weights trained using MERT • Scoring components: • Statistical Language Model • Fragmentation: how many arcs to cover the entire translation? • Length Penalty • Rule Scores • Lexical Probabilities (not fully integrated)
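The search for a best-scoring path of non-overlapping arcs can be sketched as follows. This is a deliberately simplified stand-in: it uses plain dynamic programming over positions rather than a stack decoder, and it combines only a per-arc score with a fragmentation penalty, with no language model, length penalty, or rule scores. The arcs, their scores, the half-open span convention, and the `frag_weight` value are all made up for the example.

```python
from collections import defaultdict

# Hypothetical lattice: (start, end_exclusive, English text, arc score).
ARCS = [
    (0, 1, "IN", -2.0),
    (1, 2, "THE", -1.5),
    (2, 3, "NEXT", -3.0),
    (3, 4, "LINE", -3.0),
    (1, 4, "THE NEXT LINE", -6.0),
    (0, 4, "IN THE NEXT LINE", -9.0),
]


def decode(arcs, n, frag_weight=0.5):
    """Best monotone path covering positions 0..n with non-overlapping arcs.

    Path score = sum of arc scores - frag_weight * number of arcs,
    so paths built from fewer, longer arcs are preferred.
    """
    by_start = defaultdict(list)
    for start, end, text, score in arcs:
        by_start[start].append((end, text, score))

    best = {0: (0.0, [])}  # position -> (score, texts of arcs used so far)
    for pos in range(n):
        if pos not in best:
            continue  # no path reaches this position
        score_so_far, path = best[pos]
        for end, text, score in by_start[pos]:
            cand = score_so_far + score - frag_weight
            if end not in best or cand > best[end][0]:
                best[end] = (cand, path + [text])
    return best.get(n)


score, path = decode(ARCS, 4)
print(score, " ".join(path))
```

Since every arc moves strictly forward, processing positions left to right guarantees that the best score for a position is final before it is extended, which also reflects the "no reordering during decoding" property.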
XFER Lattice Decoder
  0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
  Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
  235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
  918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
  584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>
XFER MT Prototypes • General XFER framework under development for past five years • Prototype systems so far: • German-to-English • Dutch-to-English • Chinese-to-English • Hindi-to-English • Hebrew-to-English • In progress or planned: • Mapudungun-to-Spanish • Quechua-to-Spanish • Brazilian Portuguese-to-English • Native-Brazilian languages to Brazilian Portuguese • Hebrew-to-Arabic
Challenges for Hebrew MT • Paucity of existing language resources for Hebrew • No publicly available broad-coverage morphological analyzer • No publicly available bilingual lexicons or dictionaries • No POS-tagged corpus or parse tree-bank corpus for Hebrew • No large Hebrew/English parallel corpus • Scenario well suited for CMU transfer-based MT framework for languages with limited resources
Modern Hebrew Spelling • Two main spelling variants • “KTIV XASER” (deficient): spelling that relies on the vowel diacritics; words are left consonant-only when the diacritics are removed • “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels that include a letter • KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications, resulting in inconsistent spelling • Example: • niqud (spelling): NIQWD, NQWD, NQD • When written as NQD, could also be niqed, naqed, nuqad
Morphological Analyzer • We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system • Coverage is reasonable (for nouns, verbs and adjectives) • Produces all analyses or a disambiguated analysis for each word • Output format includes lexeme (base form), POS, morphological features • Output was adapted to our representation needs (POS and feature mappings)
Morphology Example • Input word: B$WRH
  0     1     2     3     4
  |---------B$WRH---------|
  |-----B-----|-$WR-|--H--|
  |--B--|--H--|---$WRH----|
Morphology Example
  Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
  Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
  Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
  Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
  Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
  Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
  Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
  Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
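As a rough illustration of what it means to hand the transfer engine a lattice of segmentations, the Python sketch below enumerates every path through the spans drawn in the B$WRH span diagram. The tuple encoding and the function name are assumptions for this example. Note that free path enumeration also yields mixtures of the drawn rows (for instance B followed by $WRH); whether the actual system admits all such combinations is not stated in the slides.

```python
# Arcs read off the B$WRH span diagram: (start, end, surface segment).
ARCS = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]


def segmentations(arcs, start, end):
    """Return every sequence of adjacent arcs that covers start..end."""
    if start == end:
        return [[]]
    paths = []
    for a_start, a_end, lex in arcs:
        if a_start == start:
            for rest in segmentations(arcs, a_end, end):
                paths.append([lex] + rest)
    return paths


paths = segmentations(ARCS, 0, 4)
for p in paths:
    print(" + ".join(p))
```

Each arc would additionally carry the feature structure shown for the corresponding analysis (POS, gender, number, status), which the transfer rules can then test.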
Translation Lexicon • Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us, augmented by other public sources • Coverage is limited but adequate as a starting point • Dahan H-to-E is about 15K translation pairs • Dahan E-to-H is about 7K translation pairs • Base forms, POS information on both sides • Converted Dahan into our representation, added entries for missing closed-class items (pronouns, prepositions, etc.) • Had to deal with spelling conventions • Recently augmented with ~50K translation pairs extracted from Wikipedia (mostly proper names and named entities)
Manual Transfer Grammar (human-developed) • Initially developed by Alon in a couple of days, extended and revised by Nurit over time • Current grammar has 36 rules: • 21 NP rules • one PP rule • 6 verb complexes and VP rules • 8 higher-phrase and sentence-level rules • Captures the most common (mostly local) structural differences between Hebrew and English
Transfer Grammar Example Rules
  {NP1,2}
  ;;SL: $MLH ADWMH
  ;;TL: A RED DRESS
  NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
  (
  (X2::Y1) (X1::Y2)
  ((X1 def) = -)
  ((X1 status) =c absolute)
  ((X1 num) = (X2 num))
  ((X1 gen) = (X2 gen))
  (X0 = X1)
  )
  {NP1,3}
  ;;SL: H $MLWT H ADWMWT
  ;;TL: THE RED DRESSES
  NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
  (
  (X3::Y1) (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1)
  )
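The agreement constraints in rule {NP1,2} can be illustrated with a small Python sketch. It is a simplification under stated assumptions: every constraint, including the `=c` one, is treated as a plain equality test rather than unification, feature structures are plain dictionaries, and the feature values and the `apply_np1_2` name are invented for the example.

```python
def apply_np1_2(np1, adj):
    """Sketch of {NP1,2}: [NP1 ADJ] -> [ADJ NP1], applied only if constraints hold.

    Returns the English words in target order, or None if the rule does not apply.
    """
    if np1["def"] != "-":                 # ((X1 def) = -)
        return None
    if np1["status"] != "absolute":       # ((X1 status) =c absolute)
        return None
    if np1["num"] != adj["num"]:          # ((X1 num) = (X2 num))
        return None
    if np1["gen"] != adj["gen"]:          # ((X1 gen) = (X2 gen))
        return None
    return [adj["en"], np1["en"]]         # (X2::Y1) (X1::Y2)


# Hypothetical feature structures for $MLH ADWMH.
dress = {"en": "DRESS", "def": "-", "status": "absolute", "num": "s", "gen": "f"}
red_f = {"en": "RED", "num": "s", "gen": "f"}
red_m = {"en": "RED", "num": "s", "gen": "m"}

print(apply_np1_2(dress, red_f))
print(apply_np1_2(dress, red_m))
```

The sketch yields only the reordered adjective and noun; in the lattice examples elsewhere in the deck, the article "A" appears as a literal introduced by a higher-level NP rule.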
Hebrew-to-English MT Prototype • Initial prototype developed within a two-month intensive effort • Accomplished: • Adapted available morphological analyzer • Constructed a preliminary translation lexicon • Translated and aligned Elicitation Corpus • Learned XFER rules • Developed (small) manual XFER grammar • System debugging and development • Evaluated performance on unseen test data using automatic evaluation metrics
Example Translation • Input: • לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה • After debates many decided the government to hold referendum in issue the withdrawal • Output: • AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL
Noun Phrases – Construct State
  החלטת הנשיא הראשון
  HXL@T [HNSIA HRA$WN]
  decision.3SF-CS the-president.3SM the-first.3SM
  THE DECISION OF THE FIRST PRESIDENT

  החלטת הנשיא הראשונה
  [HXL@T HNSIA] HRA$WNH
  decision.3SF-CS the-president.3SM the-first.3SF
  THE FIRST DECISION OF THE PRESIDENT
Noun Phrases - Possessives
  הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו
  HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH
  the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
  LMCWA PTRWN LSKSWK BAZWRNW
  to-find solution to-the-conflict in-region-POSS.1P
  Without transfer grammar: THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR
  With transfer grammar: THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION
Subject-Verb Inversion
  אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא
  ATMWL HWDI&H HMM$LH
  yesterday announced.3SF the-government.3SF
  $T&RKNH BXIRWT BXWD$ HBA
  that-will-be-held.3PF elections.3PF in-the-month the-next
  Without transfer grammar: YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE MONTH THE NEXT
  With transfer grammar: YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT MONTH
Subject-Verb Inversion
  לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה
  LPNI KMH $BW&WT HWDI&H HNHLT HMLWN
  before several weeks announced.3SF management.3SF.CS the-hotel
  $HMLWN ISGR BSWF H$NH
  that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year
  Without transfer grammar: IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
  With transfer grammar: SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR
Evaluation Results • Test set of 62 sentences from Haaretz newspaper, 2 reference translations
Current and Future Work • Issues specific to the Hebrew-to-English system: • Coverage: further improvements in the translation lexicon and morphological analyzer • Manual Grammar development • Acquiring/training of word-to-word translation probabilities • Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation • General Issues related to XFER framework: • Discriminative Language Modeling for MT • Effective models for assigning scores to transfer rules • Improved grammar learning • Merging/integration of manual and acquired grammars
Conclusions • Test case for the CMU XFER framework for rapid MT prototyping • Preliminary system was a two-month, three-person effort – we were quite happy with the outcome • Core concept of XFER + Decoding is very powerful and promising for MT • We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...
Questions?