
Machine Translation Post-Editing Study Project Kent State Project Meeting

Alon Lavie, Language Technologies Institute, Carnegie Mellon University. 8 June 2011.


Presentation Transcript


  1. Alon Lavie, Language Technologies Institute, Carnegie Mellon University, 8 June 2011. Machine Translation Post-Editing Study Project, Kent State Project Meeting

  2. Meeting Goals
  • Work out details of a summer pilot project on MT post-editing involving CMU and Kent State
  • Discuss long-term research goals and possible funding opportunities
  • Identify a concrete target program for a research grant proposal

  3. Long-Term Project Goals
  • Technology goal: design MT systems that are most useful and productive for human translators as a CAT tool
  • Project goals:
    • Develop an in-depth understanding of the characteristics of MT post-editing within commercially relevant settings
    • Develop measures for quantifying the suitability of MT systems for the task of MT post-editing
    • Explore advanced methods for optimizing MT for post-editing and for integrating MT into CAT environments

  4. Research Questions
  • What types of MT errors are easy for human translators to correct, and what types are difficult? Can we create a taxonomy of such errors?
  • How do these error characteristics vary across different MT approaches and technologies (e.g., "rule-based" vs. "statistical" systems)?
  • How do these error characteristics vary across different target languages and language pairs?
  • How do these error characteristics differ between "generic" MT systems (such as Google) and MT systems that are directly adapted to domain and client data?
  • How should translations produced by MT be presented and displayed to translators most effectively for post-editing? Should poor MT translations be filtered out so as not to confuse translators?
  • Can we design measures that better capture the post-editing "difficulty" of MT output? If so, can we use these measures to produce MT output that is easier for translators to post-edit?
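A common starting point for the last question, measuring post-editing difficulty, is an edit-distance-based score in the spirit of HTER: the number of word-level edits needed to turn the MT output into its post-edited version, normalized by the length of the post-edit. The sketch below is a simplification for illustration (whitespace tokenization, plain Levenshtein distance without TER's block-shift edits); the example sentences are hypothetical:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def post_edit_effort(mt_output, post_edited):
    """HTER-style score: word edits from MT output to its post-edit,
    normalized by the length of the post-edited version."""
    hyp = mt_output.split()
    ref = post_edited.split()
    if not ref:
        return 0.0
    return word_edit_distance(hyp, ref) / len(ref)

# Hypothetical segment: two word substitutions out of five reference words
print(post_edit_effort("the house blue is big", "the blue house is big"))  # 0.4
```

A score of 0 means the MT output needed no edits (the "perfect" category below), while higher scores suggest heavier post-editing; whether such surface edit counts track perceived difficulty is exactly what the study would need to test.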

  5. Pilot Project Goals
  • Collect preliminary data that supports developing a solid scientific research agenda for a long-term research project
  • Become familiar with the task and the challenges involved
  • Develop an effective working relationship between the MT research team at CMU and the translation studies research team at Kent State

  6. Commercially Relevant Setting
  • Research should be framed in a commercially relevant setting, where MT has been shown to produce significant gains in translator productivity, so that outcomes have an immediate impact on the translation industry
  • Main characteristics of such settings:
    • Commercially relevant domain and data
    • MT and TMs integrated within a common CAT editing environment for human translators (e.g., TRADOS)
    • Domain- and/or client-adapted MT, as opposed to "generic" MT engines (e.g., Google)
  • It is probably too complex and difficult to create a complete commercial setup for the summer pilot project, so simplify to the minimum required in order to collect meaningful data

  7. Proposed Setting for Pilot
  • Domain: computer hardware and software documentation and software localization
  • Language pair: English-Spanish
    • In which direction? English-to-Spanish? Spanish-to-English? Both?
  • No translation memories or integration of MT with TMs
  • Simple GUI for MT error classification and MT post-editing

  8. Proposed MT Systems
  • Domain-specific statistical MT system can be developed by Safaba Translation Solutions
    • Data: about 4 million TUs (60 million words) of domain-specific training data that Safaba has acquired from the TAUS Data Association (TDA)
    • System can be trained and ready for use within a couple of weeks
    • Will be made available online for remote access and connection
  • Use two other MT systems as comparisons for the study:
    • Google: "generic" (unadapted) high-quality SMT system
    • BabelFish/SYSTRAN: "generic" (unadapted) rule-based MT system
  • Is this too much?

  9. Proposed Pilot Study
  • Task-1: collect data on high-level classification of MT utility for post-editing
    • Translators classify MT-translated segments into one of three categories:
      • MT translation does not require any post-editing (perfect)
      • MT translation requires post-editing and can be post-edited
      • MT translation is unintelligible and cannot be effectively post-edited
  • Task-2: analysis of the collected data
    • Inter- and intra-coder agreement levels
    • Distributional analysis
    • Variation across type of MT and other controlled variables
  • Task-3: perform a more detailed classification of the category-2 data into types of error and their difficulty
  • Task-4: perform actual post-editing of the category-2 data, with time and end-quality measurements
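For Task-2, agreement between pairs of coders on the three-way classification is usually reported with a chance-corrected statistic such as Cohen's kappa. A minimal sketch, with hypothetical labels for six segments (the labels and counts are made up for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same segments:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both coders pick the same category
    # if each labeled at random according to their own marginal distribution
    chance = sum((counts_a[c] / n) * (counts_b[c] / n)
                 for c in set(counts_a) | set(counts_b))
    if chance == 1.0:
        return 1.0
    return (observed - chance) / (1 - chance)

coder_a = ["perfect", "editable", "editable", "unintelligible", "perfect", "editable"]
coder_b = ["perfect", "editable", "unintelligible", "unintelligible", "perfect", "editable"]
print(cohens_kappa(coder_a, coder_b))  # 0.75
```

Kappa is 1.0 for perfect agreement and near 0 when agreement is what chance alone would predict; intra-coder agreement can be computed the same way by comparing one coder's labels across two passes over the same segments.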

  10. Preliminary Tasks
  • Selection of documents for the pilot study
    • Domain-relevant data from online resources
    • Preferably with target human translations
    • Controlling for document and segment difficulty (and length)?
    • Who does this, and how soon?
  • Creation of the required user interfaces
    • Design and develop simple online interfaces
    • Who does this, and how soon?
    • Testing
  • Identifying and selecting translator subjects
    • Do you have students, and are they available?
  • IRB

  11. Discussion…

  12. Grant Opportunities
  • NSF:
    • NSF Information and Intelligent Systems (IIS) Core Programs:
      • http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=13707&org=IIS&sel_org=IIS&from=fund
      • Medium-size projects: proposals due 9/15/2011
    • Cyber-Enabled Discovery and Innovation (CDI) program:
      • http://www.nsf.gov/publications/pub_summ.jsp?WT.z_pims_id=503163&ods_key=nsf11502
      • Next deadline is unclear
      • Highly competitive
    • Grant Opportunities for Academic Liaison with Industry (GOALI) program:
      • http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=13706&org=IIS&sel_org=IIS&from=fund
      • This program accepts proposals at any time, but the funding level is unclear
  • Other US government funding sources, such as NSA and NVTC
