Chris Dyer - Kevin Gimpel Waleed Ammar - Noah Smith

Knowledge-Rich MT. Chris Dyer - Kevin Gimpel Waleed Ammar - Noah Smith. November 4, 2011. Outline. Where are we starting with end-to-end MT? Adapting SMT for low-resource scenarios What progress have we been making? What does Year 2 hold?. Cross-site system comparison.

Presentation Transcript


  1. Knowledge-Rich MT • Chris Dyer - Kevin Gimpel Waleed Ammar - Noah Smith November 4, 2011

  2. Outline • Where are we starting with end-to-end MT? • Adapting SMT for low-resource scenarios • What progress have we been making? • What does Year 2 hold?

  3. Cross-site system comparison

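  4. The SMT baseline • [Diagram: one learner builds the LM from English text, another builds the TM from English-français parallel text; the decoder uses both to turn “S'il vous plaît traduire...” into “Please translate...”]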

  5. SMT Baselines

  6. SMT Baselines

  7. Let’s make things better.

  8. The problem? • [Diagram: the same training pipeline, with the LM learned from English text and the TM learned from English-français text]

  9. Low-resource! • [Diagram: LM learned from English text; TM learned from English-Malagasy text]

  10. Low-resource! • [Diagram as on slide 9, annotated “Small, out of domain”]

  11. Low-resource! • [Diagram as on slide 9] • Malagasy verbal morphology • “Partial” language models

  12. Low-resource! • [Diagram as on slide 9] • Malagasy verbal morphology • Dependency parses • Unsupervised model outputs

  13. Low-resource! • [Diagram as on slide 9] • Malagasy verbal morphology • Dependency parses • Unsupervised model outputs • Word clusters: 36:dieny,fara,fiompiny,hamoaka,handehanany 37:adinina,aforeto,ahevahevao,akaiky,alao,
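The word-cluster lines on this slide appear to follow an `ID:word,word,...` layout. As a minimal sketch, assuming that layout holds for the whole cluster file (the slides do not say so), the following Python reads such lines into a word-to-cluster map and turns tokens into cluster features. The function names and the `cluster=` feature naming are hypothetical, not taken from the talk.

```python
def parse_cluster_lines(lines):
    """Map each word to its cluster ID from lines like '36:dieny,fara,...'."""
    word_to_cluster = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cluster_id, _, members = line.partition(":")
        for word in members.split(","):
            word = word.strip()
            if word:  # skip the empty entry left by a trailing comma
                word_to_cluster[word] = int(cluster_id)
    return word_to_cluster


def cluster_features(tokens, word_to_cluster, unknown=-1):
    """Return one hypothetical 'cluster=ID' feature string per token."""
    return [f"cluster={word_to_cluster.get(t, unknown)}" for t in tokens]


# The two (truncated) cluster lines shown on the slide.
clusters = parse_cluster_lines([
    "36:dieny,fara,fiompiny,hamoaka,handehanany",
    "37:adinina,aforeto,ahevahevao,akaiky,alao,",
])
features = cluster_features(["fara", "alao", "unseen"], clusters)
```

Backing off from words to cluster IDs is one common way to share statistics across rare word forms when parallel data is small; whether the system used clusters in exactly this way is not stated in the slides.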

  14. Year 1 MT Challenge

  15. Year 1 MT Challenge • English, Malagasy • Malagasy verbal morphology • Dependency parses • Word clusters: 36:dieny,fara,fiompiny,hamoaka,handehanany 37:adinina,aforeto,ahevahevao,akaiky,alao,

  16. Year 1 MT Challenge • English, Malagasy • Malagasy verbal morphology • Dependency parses • Word clusters: 36:dieny,fara,fiompiny,hamoaka,handehanany 37:adinina,aforeto,ahevahevao,akaiky,alao, • Translation Model

  17. Year 1 MT Challenge • English, Malagasy • Malagasy verbal morphology • Dependency parses • Word clusters: 36:dieny,fara,fiompiny,hamoaka,handehanany 37:adinina,aforeto,ahevahevao,akaiky,alao, • Translation Model • henemana no hana ... • something intelligible ...

  18. Accomplishments • Better alignments, better translations • Feature-rich translation • Tens of millions of features • Diverse knowledge sources • Phrase dependency translation model • Phrase ordering with a dependency model

  19. Model 4 CMU

  20. Model 4 CMU

  21. Model 4 CMU

  22. Model 4 CMU • Similar pattern of improvements, no language-specific features (yet).

  23. Malagasy - English Malagasy - English version 1.0

  24. What improvements? • the sons of simeon were jemoela , jamin , jakin , and ohada zohara saul , the son of a canaanite woman . • the sons of simeon were jemuel , jamin , ohada , jakin , zohar , and shaul , the son of a canaanite woman . • the sons of simeon : jemuel , jamin , ohad , jakin , zohar , and shaul ( the son of a canaanite woman ) .

  25. What improvements? • the sons of simeon were jemoela , jamin , jakin , and ohada zohara saul , the son of a canaanite woman . • the sons of simeon were jemuel , jamin , ohada , jakin , zohar , and shaul , the son of a canaanite woman . • the sons of simeon : jemuel , jamin , ohad , jakin , zohar , and shaul ( the son of a canaanite woman ) .

  26. What improvements? • then the woman said to the serpent , “ no ! you will not die . • now the serpent said to the woman , “ you will not die . • the serpent said to the woman , “ surely you will not die ,

  27. What improvements? • then the woman said to the serpent , “ no ! you will not die . • now the serpent said to the woman , “ you will not die . • the serpent said to the woman , “ surely you will not die ,

  28. Feature-rich translation • Discriminative learning on training data • Learn much sparser features than possible with just a development set • Update weights to improve translation probability • Final tuning pass on development set to optimize translation metrics (BLEU, METEOR, etc.)
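Slide 28 does not name the objective or the optimizer. As one hedged reading of "update weights to improve translation probability", the sketch below takes a single stochastic gradient step on the conditional log-likelihood of a reference candidate under a log-linear model with sparse features. The candidate list, the feature names, and the learning rate are illustrative assumptions, not details from the talk.

```python
import math
from collections import defaultdict


def score(weights, feats):
    """Dot product between the weight vector and one sparse feature dict."""
    return sum(weights[name] * value for name, value in feats.items())


def sgd_step(weights, candidates, gold_index, learning_rate=0.1):
    """One gradient step on log p(gold | source); returns p(gold) before the step.

    candidates: list of sparse feature dicts, one per candidate translation.
    gold_index: position of the candidate treated as the reference.
    """
    scores = [score(weights, feats) for feats in candidates]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Gradient = observed features of the reference - expected features.
    gradient = defaultdict(float)
    for name, value in candidates[gold_index].items():
        gradient[name] += value
    for p, feats in zip(probs, candidates):
        for name, value in feats.items():
            gradient[name] -= p * value

    for name, g in gradient.items():
        weights[name] += learning_rate * g
    return probs[gold_index]


# Hypothetical feature names; only features that fire are ever stored.
weights = defaultdict(float)
candidates = [
    {"tm:maison->house": 1.0, "lm:the_house": 1.0},
    {"tm:maison->home": 1.0},
]
p_before = sgd_step(weights, candidates, gold_index=0)
p_after = sgd_step(weights, candidates, gold_index=0)
```

Because weights live in a dictionary keyed by feature name, the model only pays for features that actually occur, which is what makes very large sparse feature sets practical. A real system would draw candidates from a k-best list or lattice rather than a hand-written list, and the final pass toward BLEU or METEOR mentioned on the slide is a separate step not shown here.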

  29. What features?

  30. Contexts give clues to constituents

  31. Contexts give clues to constituents

  32. German - English

  33. Phrasal dependency translation model

  34. Phrase-based output:

  35. Phrase-based output: Our System:

  36. Phrase-based output: Use features from source-side parse Our System:

  37. [Chart: % BLEU] • Target Syntax Only

  38. [Chart: % BLEU] • Target Syntax Only • Target Syntax + String-to-Tree Rules

  39. [Chart: % BLEU] • Target Syntax Only • Target Syntax + String-to-Tree Rules • Target Syntax + String-to-Tree Rules + Tree-to-Tree Features

  40. Our best results use supervised parsers for both source and target languages • What about unsupervised parsing?

  41. Our best results use supervised parsers for both source and target languages • What about unsupervised parsing? • We use the dependency model with valence (Klein & Manning, 2004) • With careful initialization, it gives state-of-the-art results (Gimpel & Smith, 2011): • 53.1% attachment accuracy on Penn Treebank • 44.4% on Chinese Treebank
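The attachment accuracy figures on this slide can be read as the share of tokens whose predicted head matches the gold head. The slides do not say whether the scores are directed or undirected, or how punctuation and sentence length are handled, so the sketch below assumes directed, unlabeled attachment over all tokens. The head-index encoding (0 for the root) is also an assumption.

```python
def attachment_accuracy(gold_heads, predicted_heads):
    """Fraction of tokens whose predicted head index equals the gold head.

    Each argument is a list of sentences; each sentence is a list of head
    indices, one per token, with 0 standing for the artificial root.
    """
    correct = 0
    total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total if total else 0.0


# Toy example: one three-token sentence with one wrong attachment.
gold = [[2, 0, 2]]
predicted = [[2, 0, 1]]
accuracy = attachment_accuracy(gold, predicted)
```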

  42. [Chart: % BLEU]

  43. Year 2 “Into other languages” • Target morphological complexity • Generate novel word forms • Leverage morphological resources and machine learning • Need better language models, not just translation models

  44. Year 2 Challenges • Generating new word forms means a much larger search space than is usual in MT • Inference is expensive • Use “high-recall” linguistic tools to constrain search • Statistics do the rest

  45. Year 2 • Data requirements • Large non-English monolingual corpora • Test sets for focus languages
