Explore the challenges of translating Chinese novels into English with Neural Machine Translation when parallel corpus data is limited, and see how different corpus mixing techniques affect translation quality. The slides describe the Casia2015 Chinese-English parallel corpus and the Chinese Novel Corpus, compare models trained on pure and mixed corpora by BLEU score, and discuss possible issues and next steps.
Domain Mixing for Chinese-English Translation
Chris Leege
The Project
• Goal
  • Translate Chinese novels into English using Neural Machine Translation
• Challenges
  • Chinese to English translation requires a lot of data
  • There aren’t many Chinese and English parallel corpora
  • The majority of Chinese and English parallel corpora are in domains other than novels, mostly news, UN reports, or subtitles
Data
• Casia2015 Chinese-English parallel corpus
  • One million parallel sentences from around the web
• Chinese Novel Corpus
  • Manually aligned
  • 45,000 parallel sentences
  • 2,096,000 characters
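To make the size imbalance between the two corpora concrete, the short sketch below works out how small the novel corpus is next to Casia2015. It assumes "one million" means exactly 1,000,000 sentence pairs and that a naive mix simply concatenates both corpora; the function names are illustrative, and the slides do not say which side of the corpus the character count covers.

```python
# Sizes as stated on the Data slide. "One million" is taken as exactly
# 1,000,000 here, which is an assumption.
CASIA_SENTENCES = 1_000_000
NOVEL_SENTENCES = 45_000
NOVEL_CHARACTERS = 2_096_000


def novel_share(novel: int, general: int) -> float:
    """Fraction of a concatenated corpus that comes from the novel data."""
    return novel / (novel + general)


def mean_chars_per_sentence(characters: int, sentences: int) -> float:
    """Average number of characters per sentence."""
    return characters / sentences


if __name__ == "__main__":
    share = novel_share(NOVEL_SENTENCES, CASIA_SENTENCES)
    length = mean_chars_per_sentence(NOVEL_CHARACTERS, NOVEL_SENTENCES)
    print(f"Novel share of a concatenated corpus: {share:.1%}")
    print(f"Mean characters per novel sentence: {length:.1f}")
```

Under these assumptions the novel data makes up only a few percent of a concatenated corpus, which is the imbalance the mixing strategies have to cope with.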
Models
• Pure Casia2015 corpus
• Pure novel corpus
• Naïve mixed corpus
• Mixed corpus with target tokens
  • Effective Domain Mixing for Neural Machine Translation
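The difference between the two mixed models can be sketched in a few lines of data preparation. This is a minimal illustration, not the author's pipeline: it assumes the naive mix is a shuffled concatenation and that the target-token variant prepends a domain tag to each English target sentence. The tag strings `<web>` and `<novel>` and all function names are hypothetical; the slides do not specify where the token is placed.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (Chinese source sentence, English target sentence)


def naive_mix(general: List[Pair], novel: List[Pair], seed: int = 0) -> List[Pair]:
    """Concatenate both corpora and shuffle, with no domain information."""
    mixed = list(general) + list(novel)
    random.Random(seed).shuffle(mixed)
    return mixed


def tag_target(pairs: List[Pair], domain_token: str) -> List[Pair]:
    """Prepend a domain token to every target sentence (assumed placement)."""
    return [(src, f"{domain_token} {tgt}") for src, tgt in pairs]


def token_mix(general: List[Pair], novel: List[Pair], seed: int = 0) -> List[Pair]:
    """Mix the corpora after tagging each target with its domain."""
    return naive_mix(tag_target(general, "<web>"), tag_target(novel, "<novel>"), seed)


def strip_domain_token(hypothesis: str,
                       tokens: Sequence[str] = ("<web>", "<novel>")) -> str:
    """Remove a leading domain token from model output before scoring."""
    first, _, rest = hypothesis.partition(" ")
    return rest if first in tokens else hypothesis
```

With this layout the model has to predict the domain before it produces the translation, and the tag is stripped from the output again before evaluation.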
Results
Figure 2. BLEU scores for the four models. Models on the y-axis, test data on the x-axis.
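For readers unfamiliar with the metric, here is a minimal sketch of corpus-level BLEU with one reference per sentence, whitespace tokenization, n-grams up to length 4, and no smoothing. The slides do not say which BLEU implementation or tokenization produced the reported scores, so this only illustrates the metric and will not reproduce them.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale, single reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

A grid like the one in the figure would come from calling `corpus_bleu` once per model and test set pair, for example each of the four models scored on a Casia2015 test set and on a novel test set.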
Results
Figure 3. BLEU scores for the second set of four models. Models on the y-axis, test data on the x-axis.
Conclusion
• Possible Issues
  • Casia2015 too heterogeneous
  • Not enough data
• Next Steps
  • Try again with a larger, more homogeneous corpus, such as the UN corpus