METIS (Machine Translation for Low-Resource Languages)
Maite Melero (GLiCom – BM)
Seminari NLP-UPC
Roadmap
• METIS II (2004-2007)
• ES-EN approach (GLiCom)
• METIS II evaluation results
• Rapid deployment of the METIS CA-EN pair
Current approaches to MT
• In industry: mainly rule-based
  • requires a lot of expensive manual labour
• In academia: mostly data-driven (statistical and example-based MT)
  • requires large parallel corpora
• What happens with smaller languages?
METIS II (2004-2007): the aims
• Construct free-text translations by
  • relying on hybrid techniques
  • employing basic resources
  • retrieving the basic stock for translations from large monolingual corpora of the target language only
Similar approach: MATADOR
• MATADOR (Habash and Dorr, 2002, 2003; Habash, 2003, 2004)
• Main difference:
  • MATADOR targets language pairs with resource asymmetry: low resources for the source language, high resources for the target language
  • METIS assumes low resources on both sides
METIS II: the main ideas
• Hybrid approach: a strong data-driven component plus a limited number of rules
• Simple, readily available resources
• Weights associated with the resources and the search algorithm
• TL corpus: processed off-line to construct the TL model
• Language-specific components independent from the core search engine
• Special data format for the core engine input (UDF)
• Several language pairs test the feasibility of the approach: Dutch, German, Greek and Spanish into English
METIS II architecture
What are basic NLP resources?
• Part-of-speech taggers
• Lemmatizers
• A manually corrected POS-tagged corpus (can be used to train a statistical tagger such as TnT (Brants, 2000))
• (optionally) Chunkers
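To illustrate what the manually corrected corpus buys you, here is a minimal sketch of a statistical tagger trained from hand-tagged data. TnT (Brants, 2000) uses a trigram HMM; this toy version just memorizes each word's most frequent tag and falls back to a default tag for unseen words. The mini-corpus and tag names are invented for illustration.

```python
# Toy statistical POS tagger trained on a hand-corrected tagged corpus.
# (A stand-in for TnT's trigram HMM; tags and corpus are illustrative.)
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Count (word, tag) frequencies and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, tokens, default="NN"):
    """Tag a sentence, falling back to a default tag for unknown words."""
    return [(t, model.get(t.lower(), default)) for t in tokens]

corpus = [
    [("la", "DA"), ("casa", "NC"), ("es", "VS"), ("grande", "AQ")],
    [("veo", "VM"), ("la", "DA"), ("casa", "NC")],
]
model = train_unigram_tagger(corpus)
```

A real tagger would also model tag context and unknown-word morphology; the point is only that a modest hand-tagged corpus is enough to bootstrap one.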
METIS II: fields of experimentation
• SL analysis
  • depth & richness of the syntactic structure
• Transfer
  • which pieces / structures of information
• Generation
  • re-ordering of chunks and words
METIS II: SL Analysis (Morphology)
All language pairs provide:
• Lemmatisation: abstraction from inflection
• POS tagging: verbs, nouns, adjectives, articles, pronouns, etc., with subclasses according to properties of the SL
• Nominal inflection: number, gender, case
• Verbal inflection: number, person, tense, mood, type (ptc, fin, inf, etc.)
METIS II: SL Analysis (Syntax)
• No syntactic SL analysis: Spanish
• Phrase detection (nominal, prepositional, verbal groups) and clause detection (main and subordinate clauses): Dutch, German & Greek
• Recursive embedding of phrases and clauses:
  • one level, no embedding: German
  • two-level embedding: Greek
  • full recursivity: Dutch
• Detection of phrase & clause heads: Dutch & Greek
• Subject detection: German & Greek
• Topological field analysis: German
METIS II: Source Language Analysis
• Provides generalization:
  • smaller lexicon
  • less data sparsity in the TL corpus
METIS II: Transfer (Mapping of SL features to TL)
METIS II: TL Generation (Reordering)
• Reordering of the transferred items into the TL structure is conceived as a process of hypothesis generation and filtering, according to the most likely TL pattern (from the TL model).
• Mostly pattern-based, using only information from the TL, but
• can also be partly rule-based and use information from the SL (Dutch and German)
METIS II: TL Generation (Reordering)
• Information to be matched in the TL model
  • shallow syntactic information: all except Spanish
  • n-gram patterns of mapped POS tags & lemmata: Spanish
• Matching procedure
  • top-down: Greek
  • bottom-up: all except Greek
METIS II: reordering mechanism for TL word-order generation
METIS II Spanish-English Translation Paradigm
Spanish sentence → Spanish Preprocessing (POS tagger and lemmatizer) → Translation Model (bilingual flat lexicon, no structure-transfer rules) → English Generation (search over n-gram models extracted from an English corpus) → English sentence
Main Translation Problems
• Lexical selection: picking the right translation for a given word
  • escribir una carta → write a letter
  • jugar una carta → play a card
• Translation divergences: whenever word-by-word translation does not work
  • ver a Juan → see (to) Juan
  • cruzar nadando → cross swimming (swim across)
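The "carta" ambiguity above can be resolved with target-language evidence alone, which is the core of the METIS idea. A hedged sketch, with invented counts standing in for real TL corpus statistics: score each candidate noun by how often it co-occurs with the translated verb, and keep the winner.

```python
# Toy lexical selection using only target-language bigram counts.
# The counts below are invented, standing in for statistics from a
# large English corpus such as the BNC.
BIGRAM_COUNTS = {
    ("write", "letter"): 950, ("write", "card"): 40,
    ("play", "card"): 870, ("play", "letter"): 2,
}

def pick_translation(verb_en, noun_candidates):
    """Choose the noun translation that maximizes the TL bigram count."""
    return max(noun_candidates,
               key=lambda n: BIGRAM_COUNTS.get((verb_en, n), 0))

# "escribir una carta": the verb maps to "write", so "letter" wins;
# "jugar una carta": the verb maps to "play", so "card" wins.
```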
Translation Divergences: how MT has addressed them
• Linguistically based MT systems devise data representations that minimize translation divergences:
  • [head] ver / [head] see; [arg2] Juan / [arg2] Juan
• Remaining divergences need to be solved in the translation module:
  • hand-written bilingual mapping rules (transfer MT)
  • mappings automatically extracted from a parallel corpus (example-based MT)
Translation Divergences: our constraints
• Very basic resources, for both source and target languages: only a lemmatizer-POS tagger and a (TL) chunker
• No deep linguistic analysis to minimize divergences
• No parallel corpus, only a target-language corpus
• Keep the translation model very simple: only a bilingual lexicon
• No mapping rules, either hand-written or automatically learned
Translation Divergences: our approach
• Handle structure modifications in the TL Generation component
• Treatment independent of the SL, i.e. much more general and reusable
SL Preprocessing (Spanish)
Tagger (CastCG) → Statistical disambiguation → SL normalization
Spanish Tagger: CastCG
Me alojo en la casa de huéspedes. ("I am staying at the boarding house.")
SL Normalization: Tag Mapping
SL Normalization: e.g. Pronoun Insertion in Pro-drop
Translation Model: Spanish-English Lexicon Look-up
Spanish lemmas → look-up in the Sp-Eng lexicon (lexmetis, Oxford) → list of pseudo-English candidates (UDF)
Translation Model: Compound Detection
Lexicon (Sp-Eng lexmetis, Oxford):
  casa => house
  casa de huéspedes => boarding house
UDF output:
  <trans-unit id="6">
    <option id="1">
      <token-trans id="1">
        <lemma>boarding</lemma>
        <pos>VVG</pos>
      </token-trans>
      <token-trans id="2">
        <lemma>house</lemma>
        <pos>NN1</pos>
      </token-trans>
    </option>
  </trans-unit>
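One simple way to realize compound detection of this kind is greedy longest-match look-up: prefer the longest lemma sequence that has a lexicon entry, so "casa de huéspedes" maps to "boarding house" rather than word-by-word. A sketch with an illustrative mini-lexicon (the actual METIS look-up may differ):

```python
# Greedy longest-match multiword look-up in a bilingual lexicon.
# The mini-lexicon is illustrative, not the real lexmetis/Oxford data.
LEXICON = {
    ("casa",): ["house"],
    ("casa", "de", "huéspedes"): ["boarding house"],
    ("huésped",): ["guest"],
}

def lookup(lemmas):
    """Translate a lemma sequence, preferring the longest matching span."""
    out, i = [], 0
    while i < len(lemmas):
        for j in range(len(lemmas), i, -1):   # try the longest span first
            entry = LEXICON.get(tuple(lemmas[i:j]))
            if entry:
                out.append(entry[0])
                i = j
                break
        else:
            out.append(lemmas[i])             # pass unknown lemmas through
            i += 1
    return out
```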
Translation Model: Unfound Words
• Past participles, e.g. "denominado" > denominar (VM) > designate (VV) > designated (AJ0)
• Adverbs, e.g. "técnicamente" > técnico (AQ) > technical (AJ0) > technically (AV0)
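The adverb fall-back above can be sketched as: strip the derivational suffix to recover a form that is in the lexicon, translate it, then regenerate the English derivative. This is a deliberately crude illustration of the idea (the stemming rule and the adjective entry are assumptions, not the real METIS rules):

```python
# Fall-back for lexicon misses, shown for "-mente" adverbs:
# técnicamente -> técnico (AQ) -> technical (AJ0) -> technically (AV0).
ADJ_LEXICON = {"técnico": "technical"}   # illustrative entry

def translate_adverb(word):
    """Translate an unfound Spanish -mente adverb via its adjective."""
    if word.endswith("amente"):
        adj = word[:-len("amente")] + "o"   # crude stem back to the adjective
        adj_en = ADJ_LEXICON.get(adj)
        if adj_en:
            return adj_en + "ly"            # regenerate as an English adverb
    return None                             # give up; leave the word unfound
```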
TL Generation (English)
Pseudo-English UDF → Search Engine (TL models) → English lemmatized sentence → Token generation → English translation
Search Engine (1st version)
Lexical preselection → Candidate expansion → Candidate scoring (against the TL n-gram models)
Search Engine (2nd version): beam search decoding
Lexical pre-selection → Candidate expansion → Scoring (against the TL n-gram models)
[Figure: decoding "the worker must carry … helmet", expanding lexical candidates (wear, drive; bottle, headphones, helmet) and scoring them against the TL models]
Target Language Models
• 1- to 5-gram models extracted from the BNC (6 M sentences)
  • e.g. stay|VV in|PRP the|AT0 house|NN
• TL-model back-off: substitution of 1 position (for n > 2)
  • e.g. stay|VV in|PRP the|AT0 NN
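A sketch of this look-up: an n-gram of lemma|POS tokens matches either exactly or, for n > 2, with one position backed off to its bare POS tag, so a rare noun can still be supported by the POS-generalized counts. The counts below are invented.

```python
# TL n-gram look-up with one-position POS back-off (for n > 2).
# Tokens are lemma|POS strings as in the slides; counts are invented.
NGRAM_COUNTS = {
    ("stay|VV", "in|PRP", "the|AT0", "NN"): 120,        # POS-generalized
    ("stay|VV", "in|PRP", "the|AT0", "house|NN"): 7,    # fully lexicalized
}

def ngram_score(tokens):
    """Best count over the exact n-gram and its 1-position POS back-offs."""
    best = NGRAM_COUNTS.get(tuple(tokens), 0)
    if len(tokens) > 2:
        for i, tok in enumerate(tokens):
            backed = list(tokens)
            backed[i] = tok.split("|")[1]   # replace lemma|POS by POS alone
            best = max(best, NGRAM_COUNTS.get(tuple(backed), 0))
    return best
```

Note how "stay in the cottage" gets the same support as "stay in the house" through the generalized entry, which is exactly the data-sparsity relief the slide describes.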
Handling Structure Divergences in TL Generation: Local Structure Modifications
• Insertion of functional words: want|VV to|TO0 go|VV
• Deletion of functional words: at|PRP (the|AT0) home|NN
• Permutation of content words: a|AT0 {day|NN happy|AJ0}
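The three local edit operations can be sketched as a candidate-expansion function: each hypothesis spawns variants with one functional word inserted, one deleted, or two adjacent content words swapped. The functional-word list here is a small illustrative stand-in.

```python
# Local edit operations used when expanding translation candidates:
# insert / delete a functional word, or swap adjacent content words.
FUNCTIONAL = {"to|TO0", "the|AT0", "a|AT0"}   # illustrative closed class

def expand(tokens):
    """All single-edit variants of a candidate lemma|POS sequence."""
    variants = []
    for i in range(len(tokens) + 1):          # insertion of a functional word
        for f in FUNCTIONAL:
            variants.append(tokens[:i] + [f] + tokens[i:])
    for i, t in enumerate(tokens):            # deletion of a functional word
        if t in FUNCTIONAL:
            variants.append(tokens[:i] + tokens[i + 1:])
    for i in range(len(tokens) - 1):          # permutation of content words
        if tokens[i] not in FUNCTIONAL and tokens[i + 1] not in FUNCTIONAL:
            swapped = list(tokens)
            swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
            variants.append(swapped)
    return variants
```

Each variant would then be scored against the TL n-gram models, so only edits the target language actually licenses (want **to** go, at (~~the~~) home, a happy day) survive.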
Search Engine: beam search decoding
• Performance problems:
  • Combinatorial explosion in the expansion step: a source sentence of 35 words, each with at least two English translations, already yields 2^35 ≈ 3.4 × 10^10 candidate combinations. The search space of candidates must be pruned.
  • Combinatorial explosion in the scoring computation step.
Search Engine: beam search decoding
• Solution: incrementally build the search space (following Philipp Koehn's Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models, 2004):
  • w1,…,wk are pushed onto the first stack; the stack is ranked and pruned to a given stack depth.
  • Each candidate on the (i-1)-th stack is expanded via the dictionary and the edit operations; the resulting candidates are again ranked and pruned to the given stack depth.
  • The score of each partial translation is computed incrementally, reusing the already computed and stored scores.
  • At the N-th step (the source sentence contains N tokens) the decoding process stops, leaving a ranked stack of translation candidates.
Handling Structure Divergences in TL Generation: Non-local Movements
• Syntactic model: chunk boundaries and chunk-tag patterns extracted from the normalized, chunked BNC, e.g.
  • PRP AT0 NN | VV | AT0 NN
  • AT0 NN | PRP AT0 NN | VV
  • AT0 NN | VV | PRP AT0 NN
• Chunked corpus, e.g. [The man] [sleeps] [at the park]
Evaluation of the final METIS prototype
• Comparison with SYSTRAN:
  • widely used
  • available for all language pairs
  • rule-based, with many person-years of development
• Goal: get an estimate of what has been achieved
Methodology: test sets
• Two test sets:
  • 200 sentences manually chosen from Europarl
  • 200 sentences from the balanced test suite used to validate system development, covering a variety of domains:
    • 25% grammatical phenomena
    • 25% newspapers
    • 25% technical
    • 25% scientific
Methodology: metrics
• BLEU & NIST: measure n-gram overlap between the MT output and reference translations
• TER (Translation Error Rate): measures the amount of editing that a human would have to perform to turn the MT output into a reference
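For concreteness, here is the core ingredient of BLEU: modified (clipped) n-gram precision against one or more references. Full BLEU combines precisions for n = 1..4 with a brevity penalty; only the clipped precision is sketched here.

```python
# Clipped n-gram precision, the building block of BLEU.
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in a reference, with clipping."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    best_ref = Counter()                      # max count of each n-gram
    for ref in references:                    # over all references
        for g, c in ngrams(ref).items():
            best_ref[g] = max(best_ref[g], c)
    clipped = sum(min(c, best_ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))
```

Clipping is what stops a degenerate output like "the the the" from scoring highly: each candidate n-gram is credited at most as often as it appears in some reference.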
Methodology: references
• All metrics compare the MT output with human-created references.
• Europarl: 5 references (4 produced by humans translating each SL version into English + the original English one)
• Development: 3 references
Results on the Europarl test set (BLEU)

          METIS-II   SYSTRAN   difference   % of SYSTRAN
  NL-EN    0.1925    0.3828     0.1903          50%
  DE-EN    0.2816    0.3958     0.1142          71%
  EL-EN    0.1861    0.3132     0.1271          59%
  ES-EN    0.2784    0.4638     0.1854          60%
Results on the development test suite (BLEU)

          METIS-II   SYSTRAN   difference   % of SYSTRAN
  NL-EN    0.2369    0.3777     0.1408          70%
  DE-EN    0.2231    0.3133     0.0902          71%
  EL-EN    0.3661    0.3946     0.0285          92%
  ES-EN    0.2941    0.4634     0.1693          63%
METIS-II on both test sets (BLEU)

          Europarl     Dev     difference
  NL-EN    0.1925    0.2369      0.0444
  DE-EN    0.2816    0.2231     -0.0585
  EL-EN    0.1861    0.3661      0.1800
  ES-EN    0.2784    0.2941      0.0157
Results by text type (ES-EN on the development test suite, BLEU)

             Grammar   News   Science   Tech
  METIS-II    0.22     0.33    0.29     0.26
  SYSTRAN     0.48     0.46    0.47     0.45