Post-Editing of MT Output in a Production Setting: Experiences at the Pan American Health Organization
Julia Aymerich & Hermes Camelo
Automated Post-Editing Workshop, Cambridge, MA – 12 August 2006
MT Post-Editing
• “correction of machine translation output by human linguists/editors” (Dorothy Senez, 1998)
Can it be automated?
Automated Post-Editing
• Detection of intelligibility errors
• Categorization of intelligibility errors
• Automated correction of intelligibility errors
• Techniques for automated post-editing
• Techniques for evaluating post-edited improvement, including human acceptance ratings of automatically post-edited output
PAHO Translation Services
• Provide translation services into English, Spanish, Portuguese, and French for HQ units
• 8 staff members:
  • 1 chief
  • 1 Spanish translator
  • 1 English translator
  • 2 computational linguists
  • 3 office assistants
• Roster of over 100 freelance translators
• MT licensed to over 180 sites throughout the world
• MT developed and used in-house for over 25 years
PAHOMTS® History
1976  First contract: SPANAM
1979  In-house development began
1980  SPANAM on mainframe
      Post-editing macros for the Wang word processor
      Feedback provided in writing on side-by-side printout
1985  ATN parser created
      ENGSPAN on mainframe
PAHOMTS® History
1991  Post-editing macros for WordPerfect
1992  From mainframe to PC on the PAHO LAN
1996  Post-editing macros for Microsoft Word
2000  32-bit Windows version
2003  Portuguese-English; English-Portuguese
2004  Portuguese-Spanish; Spanish-Portuguese
      Feedback provided electronically on side-by-side file
PAHOMTS® History
2004  Aligned corpus started
2005  Post-editing macros for PowerPoint
      Editing macros from WHO
2006  Synchronized post-editing
      Enhanced corpus alignment
2007  Automatic feedback from tri-column side-by-side
Workflow (flowchart)
• Steps shown: Translation request from HQ unit; Job assignment; MT processing; Post-editing; Feedback/Synchronization; In-house revision; Delivery
• Decision points: "Processable with MT?" and "TM available?" (the YES/NO branches lead to either "Human translation" or "MT/TM translation")
• Also shown: MT enhancements; Bilingual corpus; Translation Tracking System
Processing: Is MT Appropriate?
• Appropriate in 90% of cases
  • manuals, reports, proposals, abstracts, scientific articles, position papers, PowerPoint presentations
• Not appropriate if:
  • Target language is French
  • Source document cannot be converted (e.g. PDF file with graphics only)
  • Hard copy only and quality not good enough for OCR
  • Too idiomatic: personal correspondence, dialogue, scripts, poetry, personal chat
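The exclusion criteria on this slide can be read as a simple routing check. The following is only a sketch of that reading: the `Job` fields, the language codes, and the function names are hypothetical and are not part of PAHOMTS® or of the PAHO tracking system.

```python
from dataclasses import dataclass

# Text types the slide lists as too idiomatic for MT.
IDIOMATIC_TYPES = {
    "personal correspondence", "dialogue", "script", "poetry", "personal chat",
}


@dataclass
class Job:
    """Hypothetical description of one translation request."""
    target_language: str           # assumed codes: "en", "es", "pt", "fr"
    text_type: str                 # e.g. "report", "manual", "poetry"
    convertible: bool = True       # False for e.g. a PDF with graphics only
    hard_copy_only: bool = False
    ocr_quality_ok: bool = True    # only relevant when hard_copy_only is True


def mt_rejection_reasons(job):
    """Return the slide's exclusion criteria that apply to this job."""
    reasons = []
    if job.target_language == "fr":
        reasons.append("target language is French")
    if not job.convertible:
        reasons.append("source document cannot be converted")
    if job.hard_copy_only and not job.ocr_quality_ok:
        reasons.append("hard copy only and quality not good enough for OCR")
    if job.text_type in IDIOMATIC_TYPES:
        reasons.append("text is too idiomatic")
    return reasons


def is_mt_appropriate(job):
    """A job goes to MT only if no exclusion criterion applies."""
    return not mt_rejection_reasons(job)
```

Returning the list of reasons rather than a bare yes/no keeps the decision explainable to whoever assigns the job.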
MT Processing: Checklist
• Perform a spelling check
  • A macro spellchecks the document using the PAHOMTS® dictionaries. If too many words are not in the dictionaries, they are added before the job is run.
  • For large documents, before running the job, we extract terminology from the bilingual corpus and feed it into the dictionaries.
• Check language code
• Incorrect hard returns
• Punctuation in lists
• Text boxes, embedded objects, and drawings
No Pre-Editing!
MT Processing: Checklist
• Sections blocked for translation
  • bibliographic references
  • text in another language
  • lists of names/addresses
  • tables with numbers only
  • independent words (The SMILE program)
• Check consistency: if the document mixes two different styles or vocabularies, divide it into two jobs or activate different grammars
Machine Translation Processing
• Rules for particular domains or styles are selected before the MT process begins
• Type of Grammar:
  • Abstract, letter, manual, report, resolution, survey, speech, post description, news article, summary record
  • One type of grammar only
• Specialized Vocabulary:
  • Medical research, financial, environment, equipment, agriculture, patient education, legal, computer science, United Nations, radiation, pharmaceutical, statistics, European variety
  • Several microglossaries may be activated, in order of priority
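One way to picture "several microglossaries, in order of priority" is a lookup that consults each active microglossary in turn. This is a minimal sketch under assumptions: the fallback to a general dictionary, the function name, and the sample entries are all hypothetical and do not describe the actual PAHOMTS® dictionary format.

```python
def lookup(term, microglossaries, main_dictionary):
    """Return the gloss from the first active microglossary containing the
    term; fall back to the general dictionary (an assumed behavior)."""
    for glossary in microglossaries:      # already ordered by priority
        if term in glossary:
            return glossary[term]
    return main_dictionary.get(term)      # None when the word is not found


# Hypothetical Spanish-English entries, for illustration only.
main = {"cuadro": "table", "informe": "report"}
medical = {"cuadro": "clinical picture"}
financial = {"cuadro": "chart", "tasa": "rate"}

first_choice = lookup("cuadro", [medical, financial], main)
```

With `medical` listed first, `first_choice` comes from the medical microglossary; reversing the order of the list changes which gloss wins, which is the point of activating glossaries by priority.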
MT Processing (cont.)
• Assistant verifies that MT did an acceptable job
  • If there are too many not-found words, they are added to the dictionaries and the document is retranslated.
  • If the percentage of complete parses is too low (less than 60%), the document is rechecked for formatting and/or spelling errors.
  • Occasionally, very poorly written or formatted documents are returned to the requesting unit.
• If the output is acceptable, post-edit it!
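The assistant's check can be sketched as a small decision function. Only the 60% parse threshold comes from the slide; the not-found-word limit is a placeholder (the slide says only "too many"), and the function and constant names are hypothetical.

```python
MIN_COMPLETE_PARSE_RATE = 0.60   # from the slide: "less than 60%" is too low
MAX_NOT_FOUND_WORDS = 20         # hypothetical cutoff; the slide gives no number


def next_step(complete_parses, total_sentences, not_found_words):
    """Decide what happens to a raw MT job after the assistant's check."""
    if not_found_words > MAX_NOT_FOUND_WORDS:
        return "add words to dictionaries and retranslate"
    if total_sentences and complete_parses / total_sentences < MIN_COMPLETE_PARSE_RATE:
        return "recheck formatting and spelling"
    return "post-edit"
```

For example, `next_step(50, 100, 3)` sends the document back for a formatting and spelling check, while `next_step(90, 100, 3)` passes it on to post-editing. Returning a document to the requesting unit is left out because the slide describes it as an occasional human judgment.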
Post-Editing
• Get the big picture first on screen
• Freelancers don’t have access to PAHOMTS®; only the output files.
• Done directly on the RAW file or on the side-by-side file
• Using the side-by-side file as a reference and to provide feedback
• Making use of the MS Word / PowerPoint post-editing macros
Post-Editing: Training Translators
• Post-edit on screen, not on paper
• Insist on referring to original document
• Use the side-by-side and mark:
  • errors
  • not-found words
  • preferred translations
• When a translation has been researched, mark it as highly reliable
Post-Editing: Non-Translators
• Teach them what to fix in the source text. Give them a list of trouble spots to look for:
  • text that should not be translated
  • acronyms
  • formatting problems
• Insist on careful post-editing:
  • non-translators tend to trust the raw output too much and overlook errors
  • Give them examples of embarrassing MT errors left uncorrected
• Insist on providing terminological feedback
Feedback
• Stylistic preferences should only be entered on the actual translation.
• Post-editors work with dictionary coders
• Provide feedback in the side-by-side file
• Good feedback:
  • All new words
  • Official names of organizations (with reference)
  • Erroneous dictionary entries
  • Preferred alternate glosses
• Wordfast: Translators provide feedback via e-mail using the source and target segments.
Seriousness of Errors
• Easily fixed lexical errors
• Some target constructions not generated correctly
• Some source constructions not parsed correctly
• Total lack of parsing
Techniques for Automated Post-Editing
• Post-editing macros
• Complete on-screen editing
• Linked source and target segments
• Editing and style macros from WHO
PAHOMTS® Editing Tools
• Microsoft Word / PowerPoint macros:
  • Width adjustment
  • Search & Replace
  • Browse PAHOMTS® dictionary
  • Move word left
  • Move word right
  • Delete word
  • Lowercase
  • Uppercase
  • Document cleanup
PAHOMTS® Editing Tools
• Microsoft Word / PowerPoint macros:
  • Delete definite article
  • Change next found “its” to “their”
  • Create possessive
  • Delete and switch
  • Create Noun-Noun compound
  • Undelete “of”
  • Serial comma
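To give a feel for what such one-keystroke edits do, here is a sketch of two of the listed operations as plain string functions. The real tools are Word / PowerPoint macros; this Python version, its function names, and its exact behavior are illustrative assumptions, not the PAHOMTS® implementation.

```python
import re

ITS_PATTERN = re.compile(r"\b[Ii]ts\b")


def its_to_their(text, cursor=0):
    """Replace the next standalone 'its' at or after `cursor` with 'their',
    preserving an initial capital."""
    match = ITS_PATTERN.search(text, cursor)
    if match is None:
        return text
    replacement = "Their" if match.group()[0].isupper() else "their"
    return text[:match.start()] + replacement + text[match.end():]


def add_serial_comma(selection):
    """Insert a comma before the 'and'/'or' that closes a comma-separated
    list. Meant to be applied to a list the post-editor has selected, not
    to arbitrary running text."""
    return re.sub(r"(, [^,]+?) (and|or) ", r"\1, \2 ", selection, count=1)
```

For instance, `add_serial_comma("water, sanitation and hygiene")` yields `"water, sanitation, and hygiene"`, and a list that already has the serial comma is returned unchanged. Keeping each operation tied to the cursor or the current selection is what makes it safe: the post-editor decides where it applies.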
PAHOMTS® Editing Tools
• Microsoft Word / PowerPoint macros:
  • Pluralize and go to next
  • Singularize and go to next
  • Feminine/Masculine
  • Smart delete of next definite article
  • Adjective > Adverb in -mente
  • Add diacritic
  • Clitic movement
PAHOMTS® Editing Tools
• Microsoft Word / PowerPoint macros:
  • Display source segment
  • Provide feedback
  • Transfer segments from translated document into side-by-side (and vice versa)
  • Clean up synchronization marks
  • Create tri-column side-by-side
Synchronized Post-Editing
• Source and target segments are linked
• Post-editor can easily provide feedback after each segment, if appropriate
• Parallel tri-column side-by-side
  • source text
  • MT output
  • post-edited text
Synchronized Post-Editing
• Side effects:
  • Parallel text can be used to extract editing rules to train post-editors
  • RAW and Final columns can be used to extract grammar fixes and dictionary entries to enhance the MT output
  • Source and Final columns are perfectly aligned matches that can be automatically exported into any translation memory / bilingual corpus
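The three side effects can be illustrated with a small data structure for the tri-column file. This is a sketch under assumptions: the `Segment` class, the function names, and the use of a word-level diff from Python's `difflib` are illustrative choices, not the algorithm PAHO uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Segment:
    source: str   # source text
    raw: str      # raw MT output
    final: str    # post-edited text


def changed_share(segments):
    """Fraction of segments in which the post-editor changed the MT output."""
    if not segments:
        return 0.0
    return sum(s.raw != s.final for s in segments) / len(segments)


def word_edits(segment):
    """Word-level edits between the RAW and Final columns, as
    (operation, raw words, final words) tuples."""
    raw, final = segment.raw.split(), segment.final.split()
    opcodes = SequenceMatcher(a=raw, b=final, autojunk=False).get_opcodes()
    return [
        (tag, " ".join(raw[i1:i2]), " ".join(final[j1:j2]))
        for tag, i1, i2, j1, j2 in opcodes
        if tag != "equal"
    ]


def tm_pairs(segments):
    """Source/Final pairs, ready to export to a translation memory."""
    return [(s.source, s.final) for s in segments]


# Hypothetical segment, for illustration only.
example = Segment(
    source="La comisión aprobó el informe",
    raw="The commission approved the report",
    final="The committee approved the report",
)
```

Here `word_edits(example)` isolates the single substitution of "commission" by "committee", the kind of lexical change that could be reviewed as a candidate dictionary update. Classifying such edits into categories is a separate and harder step.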
Tri-Column Side-by-Side handouts
Statistics for Document SE0403 – Agenda de salud
• Preliminary classification of changes
  • Done manually
  • 65% of the segments were changed
  • 73% of these changes can be fixed in the MT dictionaries and/or MT program
• Automatic classification of changes
  • We’re still working on it
Tri-Column Side-by-Side handouts
Preliminary Classification of Changes (Document SE0403)
Can be fixed (73%):
• Lexical Changes (45%)
• Articles (13%)
• NN Compounds (9%)
• Word order (5%)
• Verb Tense (1%)
Cannot be fixed (27%):
• Stylistic Changes (11%)
• Deletions (5%)
• POS Changes (4%)
• Punctuation (3%)
• Phrase Order (3%)
• Sentence Split (1%)
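The flattened slide does not say explicitly which categories belong to which group. The grouping sketched below is an assumption, chosen because it is the split whose subtotals match the 73% and 27% headings; the variable names are hypothetical.

```python
# Percentages as printed on the slide; the assignment of each category
# to a group is inferred from the subtotals, not stated in the source.
can_be_fixed = {
    "Lexical changes": 45,
    "Articles": 13,
    "NN compounds": 9,
    "Word order": 5,
    "Verb tense": 1,
}
cannot_be_fixed = {
    "Stylistic changes": 11,
    "Deletions": 5,
    "POS changes": 4,
    "Punctuation": 3,
    "Phrase order": 3,
    "Sentence split": 1,
}

fixable_total = sum(can_be_fixed.values())
unfixable_total = sum(cannot_be_fixed.values())
```

The two subtotals come to 73 and 27, which add up to 100 and agree with the group headings.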
Editing Macros from WHO
• Editing is not only for MT output
• Styleguides for authors and translators
• Search for terms or expressions on the Internet (Google, WHOLIS) and desktop applications (Oxford SuperLex, dtSearch) from Word
• Search for synonyms or related terms recorded in an institutional database
• Provide feedback to dictionary coders
• Detect strings repeated in the surroundings
• Fast word find
Improving MT Quality
• Where can the problem be solved?
  • SOURCE TEXT: Author, typist, staff
  • DICTIONARY: Subject-area specialists, dictionary coders
  • POST-EDITING: Translator, subject-area specialist, TM, SMT on the translation
  • ALGORITHM: Computational linguists
MT Enhancements
• Daily
  • Lists of not-found words
  • Lexical issues pointed out by translators (SBS file)
  • SBS files examined by computational linguists to improve parsing/synthesis
• Occasionally
  • Incorporation of bilingual glossaries using the Import or Merge utilities
  • Research into specific linguistic issues, always using the bilingual corpus (e.g. definite articles, questions)
  • Automatic terminology extraction from aligned corpus using MultiTrans
So, Does Post-Editing Work?
• It works for us!
• Post-editing (= improving MT output) happens at many different stages of machine translation
  • input document
  • translation options that change MT output
  • automated tools for post-editing
  • human feedback
  • automatically extracted feedback
• It never ends!
What’s Next?
• Fine-tuning of the algorithm that makes full use of the tri-column side-by-side
• Automatic terminology extraction from the bilingual corpus