Patent documentation - comparison of two MT strategies
Lene Offersgaard, Claus Povlsen
Center for Sprogteknologi, University of Copenhagen
loff@cst.dk, claus@cst.dk
A comparison of two different MT strategies
• RBMT and SMT: similarities and differences in a patent documentation context
• What requirements should be met in order to develop an SMT production system within the area of patent documentation?
• The two strategies:
  • PaTrans: a transfer- and rule-based translation system, used for the last 15 years at Lingtech A/S (Ørsnes, 1996).
  • SpaTrans: an SMT system based on the Pharaoh framework (Koehn, 2004). Investigations supported by the Danish Research Council.
• Subdomain: chemical patents
MT-Summit, Sep 2007
A comparison of two different MT strategies - 2
• PaTrans: transfer- and rule-based
  • En-Da, linguistic development
  • Grammatical coverage tailored to the text type of patents
  • Tools for terminology selection and coding
  • Handling of formulas and references
• SpaTrans: an SMT system based on the Pharaoh framework
  • En-Da, research version
  • Word and grammatical coverage determined by the training corpus
  • No terminology handling yet
  • Simple handling of formulas and references
Translation Workflow
[Workflow diagram: English patent → preprocessing → translation engine → postprocessing → Danish patent → proofreading]
• PaTrans (linguistic resources): PaTrans engine with lexicon, grammar and termbases
• SpaTrans (statistical resources): Pharaoh decoder with phrase table and language model (srilm 3)
BLEU Evaluation
• Reference translations are two post-edited PaTrans translations
• The PaTrans system is favoured: term bases, wording and sentence structure
• Some SpaTrans errors are caused by incomplete treatment of formulas and references
• BLEU differs for the two patents
• Very promising results for the SpaTrans system
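
As a reminder of what a BLEU score measures, here is a minimal sketch of the metric: clipped n-gram precision against one or more references, combined with a brevity penalty. It is not the evaluation script used for the slides. The function names `ngrams` and `bleu`, the whitespace tokenisation, and the absence of smoothing are assumptions made for illustration; the Danish strings are taken from the agreement examples later in the deck.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=4):
    """Unsmoothed BLEU for one segment (hypothetical helper, whitespace tokens)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis.
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = ["kontrol af det fulde spektrum"]
print(bleu("kontrol af det fulde spektrum", reference))
print(bleu("kontrol af den fulde spektrum", reference, max_n=2))
```

Because the score only counts overlapping n-grams, a system whose wording resembles the references scores higher regardless of why it resembles them, which is the bias the slide points out for PaTrans.
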
Human evaluation of the SMT system
• Limited resources for manual evaluation
• Proofreaders have post-edited SMT output and focused on:
  • Post-editing time
  • Quality of output
  • Intelligibility (understandable?)
  • Fidelity (same meaning?)
  • Fluency (fluent Danish?)
• Conclusions:
  • Usable translation quality
  • Both intelligibility and fidelity scores are best without reordering
  • Annoying agreement errors
  • New terms have to be easy to include in the SMT system
SpaTrans translation results
• A dominant error pattern is the frequent occurrence of agreement errors in nominal phrases
• Examples
• Gender disagreement (lit: … control of the full spectrum):
  SpaTrans output: … kontrol af den fulde spektrum
  Tagged: … kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]
  Corrected output: … kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]
SpaTrans translation results - 2
• Number disagreement (lit: … the active ingredients):
  SpaTrans output: … den aktive bestanddele
  Tagged: … den[DET_common_sing] aktive bestanddele[N_common_plur]
  Corrected output: … de[DET_common_plur] aktive bestanddele[N_common_plur]
• Definiteness disagreement (lit: ... this constant erosion):
  SpaTrans output: ... denne konstant erosion
  Tagged: ... denne[DET_definite] konstant[ADJ_indefinite] erosion
  Corrected output: ... denne[DET_definite] konstante[ADJ_definite] erosion
Let's give linguistic information a try!
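
The gender and number errors above can be detected mechanically once the tokens carry tags. The following is a small sketch that reads the bracket notation used on these slides (`word[POS_gender_number]`) and reports determiner/noun mismatches. The function names and the naive pairing of the n-th determiner with the n-th noun are assumptions for illustration only; this is not part of PaTrans or SpaTrans, and it does not cover the definiteness example.

```python
import re

# Matches tokens written as word[TAG], e.g. den[DET_common_sing].
TAGGED = re.compile(r"(\S+?)\[(\w+)\]")


def parse(phrase):
    """Return (word, pos, features) for every bracket-tagged token."""
    tokens = []
    for word, tag in TAGGED.findall(phrase):
        pos, *features = tag.split("_")
        tokens.append((word, pos, features))
    return tokens


def det_noun_mismatches(phrase):
    """List (feature, determiner, noun) triples where the two disagree.

    Assumption: the n-th determiner belongs to the n-th noun, and the tag
    features are ordered gender, number as in the slide examples.
    """
    tokens = parse(phrase)
    dets = [t for t in tokens if t[1] == "DET"]
    nouns = [t for t in tokens if t[1] == "N"]
    problems = []
    for (d_word, _, d_feat), (n_word, _, n_feat) in zip(dets, nouns):
        for name, d_value, n_value in zip(("gender", "number"), d_feat, n_feat):
            if d_value != n_value:
                problems.append((name, d_word, n_word))
    return problems


print(det_noun_mismatches("kontrol af den[DET_common_sing] fulde spektrum[N_neuter_sing]"))
print(det_noun_mismatches("den[DET_common_sing] aktive bestanddele[N_common_plur]"))
print(det_noun_mismatches("kontrol af det[DET_neuter_sing] fulde spektrum[N_neuter_sing]"))
```
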
Adding linguistic information to SMT: MOSES
• MOSES
  • Open source system replacing Pharaoh (Koehn et al. 2007)
  • State-of-the-art phrase-based approach
  • Using factored translation models
• Comparison of SpaTrans-Pharaoh and the Moses decoder
  • Reuse of statistical resources
  • Pharaoh parameters for monotonic setup optimised based on development tests
Adding linguistic information using MOSES
• Using factored translation models
  • Makes it possible to build translation models based on surface forms, part-of-speech, morphology etc.
• We use:
  • Translation model: word->word, pos->pos
  • Generation model determines the output
[Diagram: input factors (word, pos+morf) mapped to output factors (word, pos+morf)]
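
To make the factored setup more concrete, the sketch below shows the token representation and a toy version of the idea that word and tag translations are produced separately and then reconciled. Factored corpora for Moses conventionally separate the factors of a token with `|`. Everything else here is an assumption for illustration: the helper names, the tag strings, and the `generation` dictionary, which only stands in for a generation model and is not how the Moses decoder searches.

```python
from itertools import product


def to_factored(words, tags, sep="|"):
    """Join parallel word and tag lists into one factored line."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return " ".join(f"{w}{sep}{t}" for w, t in zip(words, tags))


def from_factored(line, sep="|"):
    """Split a factored line back into (word, tag) tuples."""
    return [tuple(token.split(sep)) for token in line.split()]


def expand_options(word_options, tag_options, generation):
    """Pair independent word and tag translations, keeping only the
    (word, tag) pairs that the toy generation table licenses."""
    return [
        (word, tag)
        for word, tag in product(word_options, tag_options)
        if tag in generation.get(word, ())
    ]


line = to_factored(
    ["det", "fulde", "spektrum"],
    ["DET_neuter_sing", "ADJ", "N_neuter_sing"],
)
print(line)
print(from_factored(line))

# Hypothetical generation table: which tags each Danish surface form can carry.
generation = {"den": {"DET_common_sing"}, "det": {"DET_neuter_sing"}}
print(expand_options(["den", "det"], ["DET_neuter_sing"], generation))
```

In this toy setting the tag factor filters out `den` when a neuter determiner is required, which is the kind of effect the agreement examples call for.
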
Adding POS-tags and morphology
• POS-tagging of the training material:
  • Brill tagger used
  • Different tagsets for Danish and English text
• Experiments with language model (lm) order
  • order 3 or 5
  • Results not significant:
    • Test Patent A: +0.1% BLEU
    • Test Patent B: -0.1% BLEU
  • Perhaps the training material is too small for lm order experiments
• Training parameters kept: phrase-length 3, lm order 3
• No tuning of parameters, just training.
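
The "order" of a language model is the n-gram length it conditions on. As a rough sketch of why the order matters for agreement, the toy model below is trained on POS-tag sequences with plain relative frequencies. It is not SRILM and has no smoothing; the function names, the two training sequences and the tag strings are all made up for illustration.

```python
from collections import Counter


def pad(tags, order):
    """Add sentence-boundary symbols so every tag has a full context."""
    return ["<s>"] * (order - 1) + list(tags) + ["</s>"]


def train_counts(tag_sequences, order):
    """Count n-grams and their contexts for an unsmoothed n-gram model."""
    ngram_counts, context_counts = Counter(), Counter()
    for tags in tag_sequences:
        padded = pad(tags, order)
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def probability(tags, counts, order):
    """Relative-frequency probability of a tag sequence (0.0 if unseen)."""
    ngram_counts, context_counts = counts
    padded = pad(tags, order)
    p = 1.0
    for i in range(order - 1, len(padded)):
        context = tuple(padded[i - order + 1:i])
        if context_counts[context] == 0:
            return 0.0
        p *= ngram_counts[context + (padded[i],)] / context_counts[context]
    return p


# Toy training data: determiner, adjective, noun with matching gender.
training = [
    ["DET_neuter_sing", "ADJ", "N_neuter_sing"],
    ["DET_common_sing", "ADJ", "N_common_sing"],
]
mismatch = ["DET_common_sing", "ADJ", "N_neuter_sing"]

for order in (2, 3):
    counts = train_counts(training, order)
    print(order, probability(mismatch, counts, order))
```

With order 2 the noun is predicted from the adjective alone, so the mismatched sequence still receives some probability; with order 3 the determiner is part of the context and the toy model rejects it. Real data would need smoothing and far more training material.
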
Results adding pos-tags – by inspection
• With inclusion of morpho-syntactic information:
  • (lit: … control of the full spectrum) ... kontrol af det fulde spektrum (gender agreement)
  • (lit: … the active ingredients) ... de aktive bestanddele (number agreement)
  • (lit: ... this constant erosion) ... denne konstante erosion (definiteness agreement)
Results using pos-tags - BLEU
• BLEU is not designed to test linguistic improvement, but the improvement is nevertheless significant!
Conclusions MOSES
• En-Da patents: best results when no reordering
• Agreement errors can be reduced by applying factored training using pos+morphology
• Experiments using a "language" model order > 3 for POS-tags might give even better results
Conclusions SMT test results for patent text
• Usable
  • translation quality comparable with RBMT systems in production
  • low-cost development for a new domain
  • possible to have SMT systems tailored to different domains of patents - if training data are available
• Patent texts always contain new terms/concepts
  • Therefore new terms have to be handled in SMT production systems
• Agreement errors can be reduced by applying factored training with pos-information - BLEU score improved!
Acknowledgements
• Thanks!
• The work was partly financed by the Danish Research Council.
• Special thanks to Lingtech A/S and Ploughmann & Vingtoft for providing us with training material and proofread patents.