
Hybrid Data-Driven Models of Machine Translation



  1. Hybrid Data-Driven Models of Machine Translation Andy Way (& Declan Groves) National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland away@computing.dcu.ie Andy Way, IGK Summer School, Edinburgh, Sept. 2006

  2. Outline • Motivations • Example-Based Machine Translation • Marker-Based EBMT • Statistical Machine Translation • Experiments: • Language Pairs & Corpora Used • EBMT and PBSMT baseline systems • Hybrid System Experiments • Making use of merged data sets • ‘Phrases’, ‘Chunks’ and Training-Test Corpora • Conclusions • Future Work

  3. Motivations • Most MT research carried out today is corpus-based: • Example-Based Machine Translation (EBMT) • Statistical Machine Translation (SMT) • Lack of comparative research: • Relative unavailability of EBMT systems • Lack of participation of EBMT researchers in competitive evaluations • Dominance of the SMT approach

  4. Example-Based Machine Translation • As with SMT, EBMT makes use of information extracted from sententially-aligned bilingual corpora. In general: • SMT only uses parameters, throws away data • EBMT makes use of linguistic units directly • During Translation: • Source side of bitext is searched for close matches • Source-target subsentential links are determined • Relevant target fragments retrieved and recombined to derive final translation.

  5–6. EBMT: An Example • Assumes an aligned bilingual corpus of examples against which input text is matched • Best match is found using a similarity metric based on word co-occurrence, POS, generalized templates and bilingual dictionaries (exact and fuzzy matching)

  7–8. EBMT: An Example • Identify useful fragments • Recombination depends on nature of examples used: on Monday → lundi; John went to → Jean est allé à; the baker’s → la boulangerie

  9. Marker-Based EBMT at DCU • Gaijin: [Veale & Way], RANLP ‘97 • [Gough et al.], AMTA ‘02 • wEBMT: [Way & Gough], Comp. Linguistics ‘03 • [Gough & Way], EAMT ‘04 • [Way & Gough], TMI ‘04 • [Gough], PhD Thesis ‘05 • [Way & Gough], Natural Language Engineering ‘05 • [Way & Gough], Machine Translation ‘05 • [Groves & Way], ACL w/shop on Data-Driven MT ‘05 • [Groves & Way], Machine Translation & EAMT ‘06 • MaTrEx: [Armstrong et al.], TC-STAR OpenLab ‘06 • [Stroppa et al.], NIST MT-Eval ‘06, AMTA ’06, IWSLT-06

  10–12. System Development (diagram slides)

  13. Marker-Based EBMT “The Marker Hypothesis states that all natural languages have a closed set of specific words or morphemes which appear in a limited set of grammatical contexts and which signal that context.” [Green, 1979] • Universal psycholinguistic constraint: languages are marked for syntactic structure at surface level by a closed set of lexemes or morphemes The Dearborn, Mich., energy company stopped paying a dividend in the third quarter of 1984 because of troubles at its Midland nuclear plant.

  14. Marker-Based EBMT • In the example sentence, three NPs start with determiners, one with a possessive pronoun • A nominal element will appear soon to the right • The sets of determiners and possessive pronouns are small and finite

  15. Marker-Based EBMT • Four prepositional phrases, each with a prepositional head • An NP object will appear soon to the right • The set of prepositions is small and finite

  16. Marker-Based EBMT: Chunking • Use a set of closed-class marker words to segment aligned source and target sentences during a pre-processing stage • <PUNC> now used as end-of-chunk marker • English marker words extracted from CELEX

  17–18. Marker-Based EBMT: Chunking (2) • Enables the use of basic syntactic markup for extraction of translation resources • Source-target sentence pairs are tagged with marker categories in a pre-processing stage:
  EN: <PRON> you click apply <PREP> to view <DET> the effect <PREP> of <DET> the selection
  FR: <PRON> vous cliquez <PRON> sur appliquer <PREP> pour visualiser <DET> l’ effet <PREP> de <DET> la sélection
  • Aligned source-target chunks created by segmenting sentences at these marker tags, along with cognate and word co-occurrence information:
  <PRON> you click apply : <PRON> vous cliquez sur appliquer
  <PREP> to view : <PREP> pour visualiser
  <DET> the effect : <DET> l’ effet
  <PREP> of the selection : <PREP> de la sélection
  • Chunks must contain at least one non-marker word—this ensures chunks contain useful contextual information
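
To make the chunking procedure concrete, here is a minimal Python sketch. The marker lexicon below is a tiny illustrative stand-in (the real system drew its English marker words from CELEX), and the cognate/co-occurrence alignment of source and target chunks is not shown:

    # Minimal sketch of Marker-Based chunking (illustrative marker lexicon,
    # not the CELEX-derived one; chunk alignment is not shown).
    MARKERS = {"the": "DET", "a": "DET", "of": "PREP", "to": "PREP",
               "in": "PREP", "at": "PREP", "you": "PRON", "we": "PRON"}
    PUNCT = {".", ",", ";", ":", "!", "?"}

    def chunk(tokens):
        """Segment a token list at marker words; punctuation (<PUNC>) ends
        a chunk; every chunk must keep at least one non-marker word."""
        chunks, cur, has_content = [], [], False
        for tok in tokens:
            low = tok.lower()
            if low in PUNCT:                      # <PUNC>: end-of-chunk
                if cur and has_content:
                    chunks.append(cur)
                cur, has_content = [], False
            elif low in MARKERS:
                # a marker opens a new chunk, unless the open chunk would
                # be left without a non-marker word (the marker then joins it)
                if cur and has_content:
                    chunks.append(cur)
                    cur, has_content = [], False
                if not cur:
                    cur = ["<%s>" % MARKERS[low]]
                cur.append(tok)
            else:
                cur.append(tok)
                has_content = True
        if cur and has_content:
            chunks.append(cur)
        return [" ".join(c) for c in chunks]

    print(chunk("you click apply to view the effect of the selection".split()))
    # ['<PRON> you click apply', '<PREP> to view',
    #  '<DET> the effect', '<PREP> of the selection']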

  19. Marker-Based EBMT: Lexicon & Template Extraction • Chunks containing only one non-marker word in both source and target languages can then be used to extract a word-level lexicon:
  <PREP> to : <PREP> pour
  <LEX> view : <LEX> visualiser
  <LEX> effect : <LEX> effet
  <DET> the : <DET> l
  <PREP> of : <PREP> de
  • In a final pre-processing stage, we produce a set of generalized marker templates by replacing marker words with their tags:
  <PRON> click apply : <PRON> cliquez sur appliquer
  <PREP> view : <PREP> visualiser
  <DET> effect : <DET> effet
  <PREP> the selection : <PREP> la sélection
  • Any marker word pair can now be inserted at the appropriate tag location • More general examples add flexibility to the matching process and improve coverage (and quality)
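
Both extraction steps can be sketched in a few lines of Python, assuming chunk pairs have already been aligned; the small English and French marker lexicons here are hypothetical stand-ins:

    # Sketch of lexicon and template extraction from aligned chunk pairs.
    EN_MARKERS = {"to": "PREP", "of": "PREP", "the": "DET", "you": "PRON"}
    FR_MARKERS = {"pour": "PREP", "de": "PREP", "la": "DET", "l": "DET",
                  "vous": "PRON"}

    PAIRS = [("<PREP> to view", "<PREP> pour visualiser"),
             ("<DET> the effect", "<DET> l effet"),
             ("<PREP> of the selection", "<PREP> de la sélection")]

    def split_chunk(chunk_str, markers):
        toks = [t for t in chunk_str.split() if not t.startswith("<")]
        content = [t for t in toks if t.lower() not in markers]
        marks = [t for t in toks if t.lower() in markers]
        return content, marks

    # 1. Word-level lexicon from chunk pairs with exactly one non-marker
    #    word on each side; marker heads pair off as well.
    lexicon = {}
    for src, tgt in PAIRS:
        s_c, s_m = split_chunk(src, EN_MARKERS)
        t_c, t_m = split_chunk(tgt, FR_MARKERS)
        if len(s_c) == len(t_c) == 1:
            lexicon[s_c[0]] = t_c[0]            # view : visualiser
            if len(s_m) == len(t_m) == 1:
                lexicon[s_m[0]] = t_m[0]        # to : pour

    # 2. Generalized templates: drop the marker word that follows its tag,
    #    keeping the tag as a slot for any marker of that category.
    def generalize(chunk_str, markers):
        toks, out, i = chunk_str.split(), [], 0
        while i < len(toks):
            out.append(toks[i])
            if (toks[i].startswith("<") and i + 1 < len(toks)
                    and toks[i + 1].lower() in markers):
                i += 2                           # skip the marker word
            else:
                i += 1
        return " ".join(out)

    templates = [(generalize(s, EN_MARKERS), generalize(t, FR_MARKERS))
                 for s, t in PAIRS]
    # e.g. ('<PREP> view', '<PREP> visualiser'),
    #      ('<PREP> the selection', '<PREP> la sélection')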

  20. Marker-Based EBMT • During translation: • Resources are searched from maximal context (specific source-target sentence pairs) to minimal context (word-for-word translation) • Retrieved example translation candidates are recombined, along with their weights, based on source sentence order • System outputs an n-best list of translations
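
A minimal sketch of that back-off search, reusing the chunk() function from the chunking sketch above; the resource dictionaries and the flat source-order recombination are illustrative simplifications (the real system also consults generalized templates and ranks an n-best list by weights):

    # Sketch of translation-time back-off: whole sentence, then chunks,
    # then word-for-word as the last resort (templates omitted here).
    def translate(sentence, sent_db, chunk_db, lexicon):
        if sentence in sent_db:                   # maximal context
            return sent_db[sentence]
        out = []
        for c in chunk(sentence.split()):         # marker chunks
            if c in chunk_db:
                out.append(chunk_db[c])
            else:                                 # minimal context: W2W
                words = [t for t in c.split() if not t.startswith("<")]
                out.append(" ".join(lexicon.get(w, w) for w in words))
        return " ".join(out)                      # recombine in source order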

  21. Phrase-Based SMT • SMT systems now make use of phrase translations in the translation model, along with word correspondences, to improve translation output • Better modelling of syntax and local word reordering • Phrase extraction heuristics based on word alignments have been shown to be better than more syntactically motivated approaches [Koehn et al., 2003]: • Perform word alignment in both source-target and target-source directions • Take the intersection of the unidirectional alignments • Extend the intersection iteratively into the union by adding adjacent alignments within the alignment space [Och & Ney 2003, Koehn et al., 2003] • Extract all phrase pairs from sentence pairs which are consistent with these alignments • Phrase probabilities can be calculated from relative frequencies
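
The extraction and scoring steps can be sketched as follows. This is a simplified rendering of the standard consistency criterion (the grow-diag extension and unaligned boundary words are omitted), with an invented toy alignment:

    # Sketch of phrase-pair extraction from a symmetrized word alignment,
    # plus relative-frequency probability estimation (simplified).
    from collections import defaultdict

    def extract_phrases(src, tgt, links, max_len=7):
        """links: set of (i, j) alignment points after merging both
        directions. A source span is paired with the minimal target span
        covering its links, if no covered target word links outside it."""
        pairs = []
        for i1 in range(len(src)):
            for i2 in range(i1, min(len(src), i1 + max_len)):
                js = [j for (i, j) in links if i1 <= i <= i2]
                if not js:
                    continue
                j1, j2 = min(js), max(js)
                consistent = all(i1 <= i <= i2
                                 for (i, j) in links if j1 <= j <= j2)
                if consistent and j2 - j1 < max_len:
                    pairs.append((" ".join(src[i1:i2 + 1]),
                                  " ".join(tgt[j1:j2 + 1])))
        return pairs

    # p(t|s) = count(s, t) / count(s), over all extracted pairs
    counts, totals = defaultdict(int), defaultdict(int)
    links = {(0, 0), (1, 1), (1, 2), (2, 5)}          # toy alignment
    for s, t in extract_phrases("john went home".split(),
                                "jean est allé à la maison".split(), links):
        counts[(s, t)] += 1
        totals[s] += 1
    probs = {(s, t): c / totals[s] for (s, t), c in counts.items()}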

  22. Outline: Recap • Motivations • Example-Based Machine Translation • Marker-Based EBMT • Statistical Machine Translation • Experiments: • Language Pairs & Corpora Used • EBMT and PBSMT baseline systems • Hybrid System Experiments • Making use of merged data sets • ‘Phrases’, ‘Chunks’ and Training-Test Corpora • Conclusions • Future Work

  23. Experiments

  24. EBMT vs. WB-SMT • [Way & Gough, 05] (cf. talk here in May 05): on the 203K-sentence Sun TM (4.8M words) and a 4K-sentence test set (ave. sentence length 13.1 words EN, 15.2 words FR), EBMT > vanilla WB-SMT (GIZA++, CMU-Cambridge statistical toolkit, ISI ReWrite Decoder) for FR→EN • Best BLEU scores: • EN→FR: .453 EBMT, .338 WB-SMT • FR→EN: .461 EBMT, .446 WB-SMT

  25. EBMT & PB-SMT (on Sun TM): English→French • The Phrase-Based system using GIZA++ data outperforms the same system seeded with EBMT data on all metrics, bar Precision (0.6598 vs. 0.6661) • The Marker-Based EBMT system beats both Phrase-Based SMT systems, particularly for BLEU (0.4409 vs. 0.3758) and Recall (0.6877 vs. 0.5759).

  26. EBMT & PB-SMT (on Sun TM): French→English • Scores for all systems are better for FR→EN than for EN→FR • Again, the Phrase-Based system using GIZA++ data outperforms the same system seeded with EBMT data. • As for EN→FR, the Marker-Based EBMT system significantly outperforms both Phrase-Based SMT systems for FR→EN.

  27. Towards Hybridity • Decided to merge data sources • Combine parts of EBMT sub-sentential alignments with parts of the data induced using GIZA++ • Performed a number of experiments using: • EBMT Phrases + GIZA++ Words (SEMI-HYBRID): investigate whether the quality of EBMT phrases is better than that of GIZA++ phrases • All Data (HYBRID): GIZA++ Words & Phrases + EBMT Words & Phrases • EBMT phrases will be used instead of SMT n-grams • EBMT phrases should add extra probability to ‘more useful’ SMT phrases; i.e. the probabilities of the phrases in the intersection of these two sets are boosted (Venn diagram: GIZA++ phrases ∩ EBMT phrases)
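
A sketch of the merging idea on toy (invented) phrase pairs: pooling the two sets and re-estimating relative frequencies raises the probability of translations found by both methods:

    # Sketch: pool GIZA++ and EBMT phrase pairs, re-estimate p(t|s).
    from collections import Counter

    giza = [("the effect", "l effet"), ("the effect", "effet"),
            ("to view", "pour visualiser")]
    ebmt = [("the effect", "l effet"), ("of the selection", "de la sélection")]

    counts = Counter(giza) + Counter(ebmt)
    totals = Counter()
    for (s, t), c in counts.items():
        totals[s] += c
    probs = {(s, t): c / totals[s] for (s, t), c in counts.items()}
    # p('l effet' | 'the effect') rises from 1/2 (GIZA++ alone) to 2/3,
    # because that pair also occurs in the EBMT set.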

  28. Merging Data Sources: ENFR Results • Using EBMT phrases + GIZA words improves significantly on using EBMT data alone • Merging all the EBMT and GIZA data improves on all metrics, most significantly for BLEU score (0.4259 vs. 0.3643 SEMI-HYBRID). • EBMT system still wins out for BLEU score, Recall and WER Andy Way, IGK Summer School, Edinburgh, Sept. 2006

  29. Merging Data Sources: FREN Results • Using EBMT phrases + GIZA words shows improvements on PBSMT system seeded with EBMT data, but improves only on the GIZA seeded system’s BLEU score (0.4888 vs. 0.4198). • However, merging all data improves on both PBSMT systems on all metrics • EBMT system beats Hybrid system only on Recall and WER Andy Way, IGK Summer School, Edinburgh, Sept. 2006

  30. Results: Discussion • PBSMT • Best PBSMT BLEU scores (with Giza++ data only): 0.375 (E-F), 0.420 (F-E); • Seeding PBSMT with EBMT data gets good scores: for BLEU, 0.364 (E-F), 0.395 (F-E); note differences in data size (1.73M vs. 403K) • PBSMT loses out to EBMT system • Semi-Hybrid System • Seeding Pharaoh with SMT words and EBMT phrases improves over baseline Giza++ seeded system; • Data size diminishes considerably (430K vs. 1.73M); • Worse results than for EBMT system. • Fully-Hybrid System • Better results than for ‘semi-hybrid’ system: E-F 0.426 (0.396), F-E 0.489 (0.427); • Data size increases to 2.04M phrase table entries • For F-E, Hybrid system beats EBMT on BLEU (0.4888 vs. 0.4611) & Precision (0.6927 vs. 0.6782); EBMT ahead for Recall & WER.

  31. EBMT & PB-SMT (on Europarl) • [Groves & Way, 06a/b] • Added SMT chunks to the EBMT system → hybrid ‘statistical EBMT’ system • New domain: Europarl (FR↔EN, 322K sentences) [Koehn, 05] • Extracted training data from the designated training sets, filtering on sentence length and relative sentence length (ratio of 1.5 used) • Allowed us to extract high-quality training sets • For testing, randomly extracted 5000 sentences from the Europarl common test set. Avg. sentence lengths: 20.5 words (French), 19.0 words (English)

  32. EBMT vs. PBSMT • Compared the performance of our Marker-Based EBMT system against that of a PB-SMT system built using: • Pharaoh Phrase-Based Decoder [Koehn, 04] • SRI LM toolkit [Stolcke, 02] • Refined alignment strategy [Och & Ney, 03] • Trained on incremental data sets, tested on the 5000-sentence test set • Effect of increasing training data on translation quality • Performed translation for FR→EN and EN→FR • Evaluated translation quality automatically using BLEU [Papineni et al., 02], Precision & Recall (GTM toolkit [Turian et al., 03]) and Word-Error Rate (WER)
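
Of these metrics, WER is compact enough to show inline; a self-contained sketch (word-level Levenshtein distance normalized by reference length):

    # Word-error rate: edit distance over words / reference length, in %.
    def wer(hyp, ref):
        h, r = hyp.split(), ref.split()
        d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
        for i in range(len(h) + 1):
            d[i][0] = i
        for j in range(len(r) + 1):
            d[0][j] = j
        for i in range(1, len(h) + 1):
            for j in range(1, len(r) + 1):
                d[i][j] = min(d[i - 1][j - 1] + (h[i - 1] != r[j - 1]),
                              d[i - 1][j] + 1,      # deletion
                              d[i][j - 1] + 1)      # insertion
        return 100.0 * d[-1][-1] / len(r)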

  33. EBMT vs. PBSMT: French→English • Doubling the amount of data improves performance across the board for both EBMT and PBSMT • The PBSMT system clearly outperforms the EBMT system, on average achieving a 0.07 higher BLEU score • PBSMT achieves a significantly lower WER (e.g. 68.55 vs. 82.43 for the 322K data set) • Increasing the amount of training data results in: • 3-5% relative BLEU increase for PBSMT • 6.2% to 10.3% relative BLEU improvement for EBMT (results shown for the 78K, 156K and 322K training sets)

  34. EBMT vs. PBSMT: English→French • PBSMT continues to outperform the EBMT system by some distance • e.g. 0.1933 vs. 0.1488 BLEU score, 0.518 vs. 0.4578 Recall for the 322K data set • The difference between the systems is somewhat less for EN→FR than for FR→EN • EBMT system performance is much more consistent across both directions • The PBSMT system performs 2% BLEU worse (10% relative) for EN→FR than for FR→EN • French→English is ‘easier’: fewer agreement errors and problems with boundary friction, e.g. le → the (FR→EN) vs. the → le, la, les, l’ (EN→FR) • EBMT scores higher for EN→FR than for FR→EN in terms of BLEU • Cf. [Callison-Burch et al., 06] on BLEU for evaluating non-n-gram-based systems (results shown for the 78K, 156K and 322K training sets)

  35. Hybrid System Experiments • Decided to merge elements of EBMT marker-based alignments with PBSMT phrases and words induced via GIZA++ • A number of hybrid systems: • LEX-EBMT: replaced the EBMT lexicon with higher-quality PBSMT word alignments, to lower WER • H-EBMT vs. H-PBSMT: merged PBSMT words and phrases with EBMT data (words and phrases) and passed the resulting data to the baseline EBMT and baseline PBSMT systems • H-EBMT-LM: reranked the output of the H-EBMT systems using the PBSMT system’s equivalent language model
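
A sketch of the H-EBMT-LM reranking step; the interpolation weight alpha and the lm_logprob scoring function are stand-ins, not the actual SRI LM interface:

    # Rerank an EBMT n-best list with a language-model score.
    import math

    def rerank(nbest, lm_logprob, alpha=0.5):
        """nbest: [(translation, ebmt_weight)], with ebmt_weight > 0.
        Combine log EBMT weight with LM log-probability, best first."""
        scored = sorted(((alpha * math.log(w)
                          + (1 - alpha) * lm_logprob(t), t)
                         for t, w in nbest), reverse=True)
        return [t for _, t in scored]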

  36–38. Hybrid Experiments: French→English (results tables)

  39. Hybrid Experiments: French→English • Use of the improved lexicon (LEX-EBMT) leads to only slight improvements (average relative increase of 2.9% BLEU) • Adding hybrid data improves on the baselines, for both EBMT (H-EBMT) and PBSMT (H-PBSMT) • The H-PBSMT system trained on 78K and 156K achieves a higher BLEU score than the baseline PBSMT system trained on twice as much data • The addition of the language model to the H-EBMT system helps guide word order after lexical selection and thus improves results further

  40. Hybrid Experiments: English→French • We see similar results for EN→FR as for FR→EN • The more SMT-like the EBMT system becomes, the more its BLEU scores fall in line with the other metrics, i.e. higher for FR→EN than for EN→FR • Using the hybrid data set we get a 15% average relative increase in BLEU score for the EBMT system, and 6.2% for the H-PBSMT system over its baseline • The H-PBSMT system performs almost as well as the baseline system trained on over 4 times the amount of data

  41. SMT ‘phrases’ vs. EBMT ‘chunks’ • Many more SMT phrases are derived than EBMT chunks • Not reflected in the scores • Doubling the amount of data doubles the number of sub-sentential alignments for both systems • Indicates the heterogeneous nature of the Europarl corpus • Taking the 322K training set: • 93.0% of SMT chunks are found only once; 99.4% occur < 10 times • 96.6% of EBMT chunks are found only once; 99.8% occur < 10 times • Of the top 10 most frequent chunks in the SMT-only set, 7 are made up solely of marker words:
  du → of the
  de la → of the
  union européenne → union
  états membres → member states
  de l → of the
  dans le → in the
  n est → is
  parlement européen → parliament
  que nous → that we
  que la → that the
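
The singleton figures above come from straightforward frequency counting over the extracted sub-sentential alignments; a sketch:

    # Frequency profile of extracted chunks/phrases.
    from collections import Counter

    def profile(chunks):
        freq = Counter(chunks)
        types = len(freq)
        singletons = sum(1 for c in freq.values() if c == 1)
        rare = sum(1 for c in freq.values() if c < 10)
        # % of chunk types seen once, and seen fewer than 10 times
        return 100.0 * singletons / types, 100.0 * rare / types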

  42. Remarks • [Groves & Way, 05] showed how an EBMT system outperforms a PBSMT system when trained on the Sun Microsystems data set • This time around, the baseline PBSMT system achieves higher quality than all variants of the EBMT system • Heterogeneous Europarl vs. homogeneous Sun data • Chunk coverage is lower on the Europarl data set: 6% of translations produced using chunks alone on Sun vs. 1% on Europarl • The EBMT system considered 13 words on average for direct translation (vs. 7 for the Sun data) • Significant improvements seen when using a higher-quality lexicon • Improvements also seen when the LM is introduced • H-PBSMT system able to outperform the baseline PBSMT system • Further gains to be made from hybrid corpus-based approaches • Small overlap between chunks extracted via EBMT and SMT methods

  43. Hybrid ‘Example-Based SMT’: The MaTrEx system

  44. Hybrid Example-Based SMT • [Armstrong et al., 06]: OpenLab MT-EVAL (March 06)—adding EBMT chunks to a ‘vanilla Pharaoh’ PB-SMT system adds about 4 BLEU points for ES→EN • [Stroppa et al., 06]: adding EBMT chunks to a ‘vanilla Pharaoh’ PB-SMT system adds about 5 BLEU points for Basque→EN • Good performance in IWSLT-06

  45. Outline: Recap • Motivations • Example-Based Machine Translation • Marker-Based EBMT • Statistical Machine Translation • Experiments: • Language Pairs & Corpora Used • EBMT and PBSMT baseline systems • Hybrid System Experiments • Making use of merged data sets • ‘Phrases’, ‘Chunks’ and Training-Test Corpora • Conclusions • Future Work

  46. ‘Phrases’, ‘Chunks’ and Training-Test Corpora • SMT ‘phrases’ are contiguous word sequences (n-grams) • Typically, EBMT performance is comparable with PB-SMT with fewer sub-sentential alignments • As EBMT chunks are different from SMT ‘phrases’, use them if available in your PB-SMT systems (cf. OpenLab ES→EN and AMTA Basque→EN results). They: • Provide longer sequences of context → better translations • Reinforce the probability of good but infrequent SMT ‘phrases’ • As SMT ‘phrases’ are different from EBMT chunks, use them if available in your EBMT systems • SMT ‘phrases’ are typically shorter than EBMT chunks, so more useful where training/test material is more heterogeneous—where EBMT chunks are ‘too long’ to cover the input data, SMT n-grams can fill in before we need to resort to W2W translation (always the last resort) • cf. CMU findings in the recent NIST MT-Eval …

  47. ‘Phrases’, ‘Chunks’ and Training-Test Corpora • Looks like EBMT is better on homogeneous training data: • EBMT > PB-SMT on Sun TM (EN↔FR) • EBMT > PB-SMT on EF TM (Basque→EN) • SMT is better on (more) heterogeneous data: • PB-SMT > EBMT on Europarl (EN↔FR) • Predictors of usefulness of an approach given text type: • Chunk coverage • Amount of W2W translation
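
Both predictors are cheap to compute before committing to an approach; a sketch reusing the chunk() function from the chunking sketch earlier (chunk_db stands for the chunk set extracted from the training data):

    # Chunk coverage and expected word-for-word share on a test set.
    def predictors(test_sentences, chunk_db):
        seen = total = w2w_words = all_words = 0
        for sent in test_sentences:
            for c in chunk(sent.split()):
                total += 1
                words = [t for t in c.split() if not t.startswith("<")]
                all_words += len(words)
                if c in chunk_db:
                    seen += 1
                else:
                    w2w_words += len(words)
        return 100.0 * seen / total, 100.0 * w2w_words / all_words
    # High chunk coverage / little W2W favours EBMT (homogeneous data);
    # low coverage favours PB-SMT's shorter phrases.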

  48. Conclusions • Combining SMT ‘phrases’ and EBMT chunks in a hybrid ‘statistical EBMT’ or ‘example-based SMT’ system will improve your system output • Blind adherence to one approach will guarantee that your performance is less than it could otherwise be • John Hutchins: “EBMT is Hybrid MT” • Joe Olive: “Need combination of ‘rules’ and statistics”

  49. Ongoing & Future Work • Automatic detection of marker words • Most common SMT phrases consist mainly of marker words • Plan to increase levels of hybridity • Code a simple EBMT decoder, factoring in the Marker-Based recombination approach along with probabilities • Use exact sentence matching in PBSMT, as in EBMT • Integration of generalized templates into the PBSMT system (and reintegrate them into the EBMT system) • Integrate marker tag information into SMT language and translation models • Hybrid EBMT-EBMT system (with CMU)?! • What’s the contribution of EBMT chunks if an SMT system is allowed as much training data as it likes?

  50. Thank you for your attention.
