Enhanced Infrastructure for Creation & Collection of Translation Resources

Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda. Introduction: LDC develops large-scale parallel text corpora for sponsored research programs. Manual creation of parallel text by human translators.


Presentation Transcript


  1. Enhanced Infrastructure for Creation & Collection of Translation Resources
     Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda

  2. Introduction
     • LDC develops large-scale parallel text corpora for sponsored research programs
       • Manual creation of parallel text by human translators
       • Harvesting, aligning potential parallel documents from known repositories and the web
     • Recent expansion in scope and variety
       • Requiring improvements in quality, efficiency and cost-effectiveness

  3. Context for Resource Creation
     • Previous focus primarily Chinese, Arabic newswire (NW)
     • Current focus on "unstructured" data
       • Broadcast News (BN) and Broadcast Conversation (BC)
       • Weblogs, Newsgroups (WB)
       • Handwritten document images of many types (VAR)
     • New linguistic varieties
       • Eight language pairs in the LCTL program
       • Colloquial Arabic varieties for some projects
     • New evaluation requirements
       • Multiple human translations, adjudication of multiple translations
       • Translation alternatives for ambiguous source text
       • Translation post-editing

  4. Recent translation efforts

  5. Manual Translation Pipeline
     [Flow diagram. Stage labels as extracted: data pool; select audio; transcription and segmentation; select text; selected web data; source text; segment into sentence units; convert to translator-friendly format; translation; translated text; validate; QC; convert to release format; release package]

  6. Manual Translation
     • Commercial agencies vetted, trained by LDC
     • Required to use LDC's project-specific guidelines
       • Accuracy and fidelity over fluency
       • General principles, language-specific requirements
       • Rules for named entities, disfluencies, emoticons, etc.
       • Requirements for formatting and validation
       • Multiple examples of preferred translation
     • Separate guidelines for specialized tasks
       • Post-editing machine translation output
       • Translation alternatives
       • Translation of novel single sentences
       • Translation of handwritten document images

  7. Translation QC
     • All translations undergo additional QC at LDC
     • Typically 10% of training data, 100% of evaluation data reviewed
     • Standardized QC rating system deducts points for each type of error
     • QC report, including score and examples, sent to translators
     • Failing score requires re-translation of full data set
     • QC process facilitated by customized QCTrans GUI
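
The deduction-based rating and the review sampling described on this slide can be sketched roughly as follows. This is a minimal illustration only: the slides do not list the actual error categories, point values, or passing threshold, so `DEDUCTIONS`, `PASSING_SCORE`, and the choice to sample by document are all hypothetical.

```python
import random

# Hypothetical error categories and point deductions (not from the slides).
DEDUCTIONS = {"mistranslation": 4, "omission": 3, "grammar": 1, "formatting": 1}
PASSING_SCORE = 80  # hypothetical pass/fail threshold


def qc_score(error_counts, start=100):
    """Start from a full score and deduct points for each error found."""
    score = start
    for error_type, count in error_counts.items():
        score -= DEDUCTIONS[error_type] * count
    return max(score, 0)


def needs_retranslation(error_counts):
    """A failing score means the full data set goes back for re-translation."""
    return qc_score(error_counts) < PASSING_SCORE


def select_for_review(doc_ids, partition, rate=0.10, seed=0):
    """Review all evaluation data, but only a sample of training data."""
    doc_ids = list(doc_ids)
    if partition == "evaluation":
        return doc_ids
    if not doc_ids:
        return []
    rng = random.Random(seed)
    k = min(len(doc_ids), max(1, round(len(doc_ids) * rate)))
    return rng.sample(doc_ids, k)
```

For example, two mistranslations and three grammar errors would cost 11 points under these made-up weights, which still passes; seven omissions would not.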

  8. QCTrans GUI

  9. Translation Project Management
     • Translation database is core management tool
       • Document ID, language, genre, token count, LDC file server path
       • Data set information including project, phase, partition, restrictions
       • Translator assignment, due date, status, QC score, payment info
     • Backend to LDC Translator Extranet
       • Translators access and submit assignments, validate submissions, view QC reports, generate invoices, check payment status
     • Queries support status tracking but also assignment generation, data selection, cross-project coordination
       • What translation assignments are pending delivery this week?
       • What is average QC score for this translator on Chinese BC?
       • List Arabic NW files from 2007 that have never been released as GALE training data and are not part of any project's eval set
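
As a rough illustration of the kind of queries listed on this slide, the sketch below builds a tiny in-memory SQLite database and answers the first two questions. The table layout, column names, translator names, dates, and scores are all invented for the example; the slides name the fields the database tracks but not its actual schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (
            doc_id TEXT PRIMARY KEY,
            language TEXT,
            genre TEXT,
            token_count INTEGER
        );
        CREATE TABLE assignments (
            doc_id TEXT REFERENCES documents(doc_id),
            translator TEXT,
            due_date TEXT,      -- ISO dates, so string comparison orders correctly
            status TEXT,
            qc_score REAL
        );
    """)
    # Made-up demo rows, not real LDC data.
    conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", [
        ("doc001", "Chinese", "BC", 1200),
        ("doc002", "Chinese", "BC", 900),
        ("doc003", "Arabic", "NW", 450),
    ])
    conn.executemany("INSERT INTO assignments VALUES (?, ?, ?, ?, ?)", [
        ("doc001", "agency_a", "2008-05-26", "delivered", 90.0),
        ("doc002", "agency_a", "2008-05-28", "delivered", 80.0),
        ("doc003", "agency_b", "2008-05-30", "pending", None),
    ])
    return conn


def pending_between(conn, start, end):
    """Which assignments are still pending with a due date in [start, end]?"""
    rows = conn.execute(
        "SELECT doc_id FROM assignments "
        "WHERE status = 'pending' AND due_date BETWEEN ? AND ? "
        "ORDER BY doc_id",
        (start, end),
    ).fetchall()
    return [row[0] for row in rows]


def average_qc(conn, translator, language, genre):
    """Average QC score for one translator on one language and genre."""
    row = conn.execute(
        "SELECT AVG(a.qc_score) FROM assignments a "
        "JOIN documents d ON d.doc_id = a.doc_id "
        "WHERE a.translator = ? AND d.language = ? AND d.genre = ?",
        (translator, language, genre),
    ).fetchone()
    return row[0]
```

The third question on the slide (files never released and not in any eval set) would additionally need tables recording releases and eval-set membership, which are left out here to keep the sketch short.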

  10. LDC Translation Database

  11. Parallel text harvesting
     • Manual translation supplemented by harvesting and alignment of potential parallel text
     • Harvest text from multilingual sites
       • E.g. newswire providers
     • Standardize markup format
     • Use BITS document mapping module to find likely parallel documents
     • Use Champollion to find sentence alignments
     • High yields in GALE program
       • 82,000 Arabic-English document pairs
       • 67,000 Chinese-English document pairs
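
To make the document-mapping step concrete, here is a deliberately simple stand-in heuristic that pairs documents by publication date and length ratio. It is not the BITS method and does not reproduce Champollion's sentence alignment; the function name, dictionary fields, and thresholds are hypothetical and chosen only to show the shape of the task.

```python
def map_documents(source_docs, target_docs, expected_ratio=1.0, tolerance=0.3):
    """Greedily pair each source document with the unused same-date target
    document whose token-length ratio is closest to expected_ratio.

    Each document is a dict with hypothetical keys: "id", "date", "tokens".
    """
    pairs = []
    used = set()
    for src in source_docs:
        if src["tokens"] == 0:
            continue  # cannot compute a length ratio for an empty document
        best, best_diff = None, None
        for tgt in target_docs:
            if tgt["id"] in used or tgt["date"] != src["date"]:
                continue
            diff = abs(tgt["tokens"] / src["tokens"] - expected_ratio)
            if diff <= tolerance and (best is None or diff < best_diff):
                best, best_diff = tgt, diff
        if best is not None:
            used.add(best["id"])
            pairs.append((src["id"], best["id"]))
    return pairs
```

A real mapping module would rely on much richer evidence, such as translation lexicons or content overlap, but the input and output have the same general form: two pools of documents in, a list of likely parallel pairs out, which a sentence aligner then processes pair by pair.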

  12. Conclusion
     • Robust, flexible translation infrastructure to support multiple, distinct, concurrent projects
     • Much of this infrastructure freely available from LDC
       • Task specifications, guidelines available for all projects
         • http://projects.ldc.upenn.edu/gale/Translation/
       • QCTrans GUI slated for free, open-source distribution
     • Many resulting parallel text corpora already in LDC Catalog
       • Newly emerging data sets to be added over time

  13. Recent corpora

  14. Acknowledgements • This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
