Enhanced Infrastructure for Creation & Collection of Translation Resources
Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda
Introduction
• LDC develops large-scale parallel text corpora for sponsored research programs
  • Manual creation of parallel text by human translators
  • Harvesting and aligning potential parallel documents from known repositories and the web
• Recent expansion in scope and variety
  • Requires improvements in quality, efficiency and cost-effectiveness
Context for Resource Creation
• Previous focus primarily Chinese, Arabic newswire (NW)
• Current focus on "unstructured" data
  • Broadcast News (BN) and Broadcast Conversation (BC)
  • Weblogs, Newsgroups (WB)
  • Handwritten document images of many types (VAR)
• New linguistic varieties
  • Eight language pairs in the LCTL program
  • Colloquial Arabic varieties for some projects
• New evaluation requirements
  • Multiple human translations, adjudication of multiple translations
  • Translation alternatives for ambiguous source text
  • Translation post-editing
Manual Translation Pipeline
[Flow diagram. Labels recoverable from the figure: datapool; select audio; transcription and segmentation; select text; selected web data; segment into sentence units; source text; convert to translator-friendly format; translation; translated text; validate; QC; convert to release format; release package]
Manual Translation
• Commercial agencies vetted, trained by LDC
• Required to use LDC's project-specific guidelines
  • Accuracy and fidelity over fluency
  • General principles, language-specific requirements
  • Rules for named entities, disfluencies, emoticons, etc.
  • Requirements for formatting and validation
  • Multiple examples of preferred translation
• Separate guidelines for specialized tasks
  • Post-editing machine translation output
  • Translation alternatives
  • Translation of novel single sentences
  • Translation of handwritten document images
Translation QC
• All translations undergo additional QC at LDC
• Typically 10% of training data, 100% of evaluation data reviewed
• Standardized QC rating system deducts points for each type of error
• QC report including score, examples sent to translators
• Failing score requires re-translation of full data set
• QC process facilitated by customized TransQC GUI
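The review-and-score procedure on this slide can be sketched in a few lines of Python. This is only an illustration of the idea: the slides do not give the actual error types, point values or passing threshold, so `DEDUCTIONS`, `PASS_THRESHOLD` and the function names below are all hypothetical.

```python
import random

# Hypothetical deduction per error type; the slides do not give actual values.
DEDUCTIONS = {"mistranslation": 4, "omission": 3, "named_entity": 2, "formatting": 1}
PASS_THRESHOLD = 90  # hypothetical passing score out of 100


def select_for_review(doc_ids, partition, training_percent=10, seed=0):
    """Return documents to review: every evaluation doc, a sample of training docs."""
    doc_ids = list(doc_ids)
    if partition == "evaluation":
        return doc_ids
    # Integer ceiling of len * percent / 100, so small sets still get reviewed.
    sample_size = (len(doc_ids) * training_percent + 99) // 100
    return random.Random(seed).sample(doc_ids, sample_size)


def qc_score(error_counts):
    """Start from 100 and deduct points for each error found, floored at 0."""
    lost = sum(DEDUCTIONS[kind] * count for kind, count in error_counts.items())
    return max(0, 100 - lost)


def needs_retranslation(error_counts):
    """A failing score sends the full data set back for re-translation."""
    return qc_score(error_counts) < PASS_THRESHOLD
```

For example, `select_for_review(docs, "training")` on 50 documents picks 5 of them, while the same call with `"evaluation"` returns all 50.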
Translation Project Management
• Translation database is core management tool
  • Document ID, language, genre, token count, LDC file server path
  • Data set information including project, phase, partition, restrictions
  • Translator assignment, due date, status, QC score, payment info
• Backend to LDC Translator Extranet
  • Translators access and submit assignments, validate submissions, view QC reports, generate invoices, check payment status
• Queries support status tracking but also assignment generation, data selection, cross-project coordination
  • What translation assignments are pending delivery this week?
  • What is average QC score for this translator on Chinese BC?
  • List Arabic NW files from 2007 that have never been released as GALE training data and are not part of any project's eval set
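Two of the example queries above can be sketched against a toy table. The schema, column names, translator names and rows here are invented for illustration (the slides list the kinds of fields stored, not the real database layout), and SQLite stands in for whatever database LDC actually uses.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical assignments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE assignments (
               doc_id TEXT, language TEXT, genre TEXT, token_count INTEGER,
               translator TEXT, due_date TEXT, status TEXT, qc_score REAL)"""
    )
    rows = [  # made-up sample rows, not real LDC data
        ("doc001", "Chinese", "BC", 1200, "agency_a", "2009-05-20", "delivered", 88.0),
        ("doc002", "Chinese", "BC", 900, "agency_a", "2009-05-27", "delivered", 92.0),
        ("doc003", "Arabic", "NW", 700, "agency_b", "2009-06-03", "pending", None),
        ("doc004", "Chinese", "NW", 650, "agency_a", "2009-06-20", "pending", None),
    ]
    conn.executemany("INSERT INTO assignments VALUES (?,?,?,?,?,?,?,?)", rows)
    return conn


def average_qc_score(conn, translator, language, genre):
    """Average QC score for one translator on one language and genre."""
    row = conn.execute(
        "SELECT AVG(qc_score) FROM assignments "
        "WHERE translator = ? AND language = ? AND genre = ?",
        (translator, language, genre),
    ).fetchone()
    return row[0]  # None if no scored assignments match


def pending_between(conn, start, end):
    """Document IDs of pending assignments due in [start, end] (ISO dates)."""
    cur = conn.execute(
        "SELECT doc_id FROM assignments "
        "WHERE status = 'pending' AND due_date BETWEEN ? AND ? ORDER BY due_date",
        (start, end),
    )
    return [doc_id for (doc_id,) in cur]
```

Storing dates as ISO strings keeps the "pending this week" query a plain range comparison.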
Parallel Text Harvesting
• Manual translation supplemented by harvesting and alignment of potential parallel text
• Harvest text from multilingual sites
  • E.g. newswire providers
• Standardize markup format
• Use BITS document mapping module to find likely parallel documents
• Use Champollion to find sentence alignments
• High yields in GALE program
  • 82,000 Arabic-English document pairs
  • 67,000 Chinese-English document pairs
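To make the sentence-alignment step concrete, here is a toy aligner. It does not call BITS or Champollion, whose interfaces are not described in these slides, and it is not their algorithm: it swaps in a simple character-length cost with dynamic programming, purely to show what "finding sentence alignments" produces. The function name and penalty values are assumptions.

```python
def align_sentences(src, tgt, skip_penalty=10.0, merge_penalty=2.0):
    """Toy length-based aligner: returns a list of (src_indices, tgt_indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # Allowed alignment shapes: (source sentences used, target sentences used).
    shapes = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di == 0 or dj == 0:
                    step = skip_penalty  # sentence left unaligned
                else:
                    s_len = sum(len(s) for s in src[i:ni])
                    t_len = sum(len(t) for t in tgt[j:nj])
                    step = abs(s_len - t_len)
                    if di + dj > 2:
                        step += merge_penalty  # mild preference for 1-1 pairs
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the cheapest path.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    path.reverse()
    return path
```

Calling `align_sentences(["aaaa", "bbbbbbbbbb", "cc"], ["AAAA", "BBBBB", "BBBBB", "CC"])` pairs the long second source sentence with the two middle target sentences. Raw length is a weak signal across languages such as Chinese and English, so a production aligner needs a stronger similarity measure.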
Conclusion
• Robust, flexible translation infrastructure to support multiple, distinct, concurrent projects
• Much of this infrastructure freely available from LDC
  • Task specifications, guidelines available for all projects
    • http://projects.ldc.upenn.edu/gale/Translation/
  • QCTrans GUI slated for free, open-source distribution
• Many resulting parallel text corpora already in LDC Catalog
  • Newly emerging data sets to be added over time
Acknowledgements
• This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.