Knowledge-Rich MT • Chris Dyer • Kevin Gimpel • Waleed Ammar • Noah Smith • November 4, 2011
Outline • Where are we starting with end-to-end MT? • Adapting SMT for low-resource scenarios • What progress have we been making? • What does Year 2 hold?
The SMT baseline • [Diagram: an English–français parallel corpus feeds the TM (translation model) learner, English monolingual text feeds the LM (language model) learner, and the decoder turns “S'il vous plaît traduire...” into “Please translate...”]
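The core of such a baseline is noisy-channel scoring: pick the English hypothesis e that maximizes P(f | e) · P(e), with the TM supplying the first term and the LM the second. A minimal illustrative sketch in Python follows; the toy phrase table, bigram LM, and probabilities are invented for illustration, and a real decoder also searches over segmentations and reorderings rather than a fixed candidate list.

    # Sketch of noisy-channel SMT scoring: e* = argmax_e P(f | e) * P(e).
    import math

    phrase_table = {  # P(french phrase | english phrase), toy numbers
        ("please translate", "s'il vous plaît traduire"): 0.7,
        ("translate please", "s'il vous plaît traduire"): 0.2,
    }
    bigram_lm = {("<s>", "please"): 0.3, ("please", "translate"): 0.4,
                 ("<s>", "translate"): 0.1, ("translate", "please"): 0.05}

    def score(english, french):
        tm = math.log(phrase_table.get((english, french), 1e-9))
        words = ["<s>"] + english.split()
        lm = sum(math.log(bigram_lm.get((a, b), 1e-9))
                 for a, b in zip(words, words[1:]))
        return tm + lm  # log P(f | e) + log P(e)

    french = "s'il vous plaît traduire"
    candidates = ["please translate", "translate please"]
    print(max(candidates, key=lambda e: score(e, french)))  # please translate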
The problem? • [Same diagram: the English–français parallel corpus and English monolingual text feed the TM and LM learners]
Low-resource! • [Diagram: the English–Malagasy parallel corpus feeding the TM learner is small and out of domain, so extra knowledge sources are added: Malagasy verbal morphology, “partial” language models, dependency parses, unsupervised model outputs, and word clusters, e.g.]
36:dieny,fara,fiompiny,hamoaka,handehanany
37:adinina,aforeto,ahevahevao,akaiky,alao,
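The cluster lines above look like Brown-style word clusters in a “<id>:<word>,<word>,...” format; assuming that format, here is a small illustrative sketch of how such clusters let sparse word features back off to denser cluster features.

    # Sketch: map Malagasy words to cluster IDs so rare-word features
    # can back off to cluster features. Assumes "<id>:<w>,<w>,..." lines.
    def load_clusters(lines):
        word2cluster = {}
        for line in lines:
            cid, words = line.strip().split(":", 1)
            for w in words.split(","):
                if w:
                    word2cluster[w] = cid
        return word2cluster

    clusters = load_clusters([
        "36:dieny,fara,fiompiny,hamoaka,handehanany",
        "37:adinina,aforeto,ahevahevao,akaiky,alao",
    ])

    def cluster_features(sentence):
        # One feature per token: its cluster ID if known, else UNK.
        return ["C=" + clusters.get(w, "UNK") for w in sentence.split()]

    print(cluster_features("dieny akaiky tsy"))  # ['C=36', 'C=37', 'C=UNK']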
Year 1 MT Challenge • [Diagram: the English–Malagasy data plus the knowledge sources above (verbal morphology, word clusters, dependency parses) feed the Translation Model, which must turn “henemana no hana ...” into “... something intelligible ...”]
Accomplishments • Better alignments, better translations • Feature-rich translation • 10s of millions of features • Diverse knowledge sources • Phrase dependency translation model • phrase ordering with a dependency model
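A hedged sketch of what “feature-rich” means concretely: each translation rule application fires many sparse, string-named features, and a linear model scores their dot product with a learned weight vector. The feature templates below are invented for illustration, not the system's actual feature set.

    # Sketch of sparse feature-rich scoring for one rule application.
    from collections import Counter

    def features(src_phrase, tgt_phrase):
        f = Counter()
        f["rule:%s=>%s" % (src_phrase, tgt_phrase)] += 1   # lexicalized rule id
        f["len_diff=%d" % (len(tgt_phrase.split())
                           - len(src_phrase.split()))] += 1
        for s in src_phrase.split():
            for t in tgt_phrase.split():
                f["pair:%s_%s" % (s, t)] += 1              # word-pair features
        return f

    def dot(f, weights):
        return sum(weights.get(k, 0.0) * v for k, v in f.items())

    feats = features("veloma", "goodbye")
    print(dot(feats, {"pair:veloma_goodbye": 0.5}))  # 0.5

Across a training corpus, templates like these easily yield tens of millions of distinct features, far more than a small development set alone could tune.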
Cross-site system comparison • [Charts: translation quality with baseline Model 4 alignments vs. the CMU alignment model] • Similar pattern of improvements, no language-specific features (yet).
Malagasy–English, version 1.0
What improvements?
• the sons of simeon were jemoela , jamin , jakin , and ohada zohara saul , the son of a canaanite woman .
• the sons of simeon were jemuel , jamin , ohada , jakin , zohar , and shaul , the son of a canaanite woman .
• the sons of simeon : jemuel , jamin , ohad , jakin , zohar , and shaul ( the son of a canaanite woman ) .
What improvements?
• then the woman said to the serpent , “ no ! you will not die .
• now the serpent said to the woman , “ you will not die .
• the serpent said to the woman , “ surely you will not die ,
Feature-rich translation • Discriminative learning on the full training data • Supports many more, sparser features than can be tuned on a development set alone • Update weights to improve translation probability • Final tuning pass on a development set to optimize translation metrics (BLEU, METEOR, etc.)
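One simple instantiation of “update weights to improve translation probability” is a structured-perceptron-style step: shift weight mass toward features of the reference (or best reachable hypothesis) and away from the model's current 1-best. This is a sketch of the idea, not necessarily the exact objective the system used.

    def perceptron_update(weights, gold_feats, model_feats, lr=1.0):
        # Reward features of the gold / reachable translation...
        for k, v in gold_feats.items():
            weights[k] = weights.get(k, 0.0) + lr * v
        # ...and penalize features of the model's current 1-best.
        for k, v in model_feats.items():
            weights[k] = weights.get(k, 0.0) - lr * v
        return weights

    w = perceptron_update({}, {"pair:veloma_goodbye": 1},
                          {"pair:veloma_hello": 1})
    print(w)  # {'pair:veloma_goodbye': 1.0, 'pair:veloma_hello': -1.0}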
Phrase-based output vs. our system • [Side-by-side example translations; our system uses features from the source-side parse]
% BLEU • [Charts: Target Syntax Only vs. Target Syntax + String-to-Tree Rules vs. Target Syntax + String-to-Tree Rules + Tree-to-Tree Features]
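For reference, the metric behind these charts: BLEU is a brevity-penalized geometric mean of modified n-gram precisions. A minimal single-reference sketch follows (real evaluations use smoothed, corpus-level BLEU).

    # Minimal BLEU sketch: single reference, BLEU-4 with brevity penalty.
    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    def bleu(hyp, ref, max_n=4):
        hyp, ref = hyp.split(), ref.split()
        precisions = []
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            overlap = sum((h & r).values())       # clipped n-gram matches
            total = max(sum(h.values()), 1)
            precisions.append(max(overlap, 1e-9) / total)
        bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(round(100 * bleu("the serpent said to the woman",
                           "the serpent said to the woman"), 1))  # 100.0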
Our best results use supervised parsers for both source and target languages • What about unsupervised parsing? • We use the dependency model with valence (Klein & Manning, 2004) • With careful initialization, it gives state-of-the-art results (Gimpel & Smith, 2011): 53.1% attachment accuracy on the Penn Treebank, 44.4% on the Chinese Treebank
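A sketch of the generative story behind the dependency model with valence: each head decides, per direction, whether to stop or attach another dependent, with the stop probability conditioned on valence (whether it has already taken a dependent on that side). The toy parameters and the omitted root-attachment term are simplifications; the real model is trained with EM, where initialization matters a great deal.

    import math

    def tree_logprob(words, heads, p_attach, p_stop):
        # heads[i] = parent index of word i, or -1 for the root.
        # p_attach(head, dir, child), p_stop(head, dir, adjacent) are
        # DMV-style parameters; root attachment is omitted for brevity.
        lp = 0.0
        for h, w in enumerate(words):
            for d in ("L", "R"):
                deps = [c for c, p in enumerate(heads)
                        if p == h and ((c < h) == (d == "L"))]
                adjacent = True  # no dependent taken yet on this side
                for c in deps:
                    lp += math.log(1 - p_stop(w, d, adjacent))  # continue
                    lp += math.log(p_attach(w, d, words[c]))    # attach
                    adjacent = False
                lp += math.log(p_stop(w, d, adjacent))          # then stop
        return lp

    words = ["the", "serpent", "said"]
    heads = [1, 2, -1]  # "the" <- "serpent" <- "said" (root)
    print(tree_logprob(words, heads,
                       lambda h, d, c: 1.0 / len(words),  # toy uniform attach
                       lambda h, d, adj: 0.5))            # toy stop prob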
Year 2 “Into other languages” • Tackle target-side morphological complexity • Generate novel word forms • Leverage morphological resources and machine learning • Need better language models, not just translation models
Year 2 Challenges • Generating new word forms means a much larger search space than is usual in MT • Inference is expensive • Use “high-recall” linguistic tools to constrain search • Statistics do the rest
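A toy illustration of “high-recall tools constrain search, statistics do the rest”: a morphological generator over-generates candidate surface forms for a lemma, so the decoder scores only those candidates rather than all possible strings. The paradigm table and bigram counts below are invented for illustration.

    # Sketch: morphology proposes candidates, statistics pick the winner.
    def candidate_forms(lemma, paradigms):
        return paradigms.get(lemma, [lemma])  # high recall: keep everything

    paradigms = {"go": ["go", "goes", "going", "went", "gone"]}  # toy table

    def lm_score(context, word, bigrams):
        return bigrams.get((context[-1], word), 1e-6)

    bigrams = {("she", "goes"): 0.3, ("she", "go"): 0.01,
               ("she", "went"): 0.2}
    context = ["she"]
    best = max(candidate_forms("go", paradigms),
               key=lambda w: lm_score(context, w, bigrams))
    print(best)  # 'goes' under these toy statistics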
Year 2 • Data requirements • Large non-English monolingual corpora • Test sets for focus languages