
Learning to Generate Complex Morphology for Machine Translation

This paper explores the challenges of translating from English into morphology-rich languages such as Russian and Arabic. The authors propose a log-linear model for inflection prediction and evaluate its accuracy on Russian and Arabic datasets.



Presentation Transcript


  1. Learning to Generate Complex Morphology for Machine Translation Einat Minkov†, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research †Carnegie Mellon University

  2. Motivation
     • English: I would like to meet this nice woman.
     • Arabic: اود ان مواجهه هذا جيد امراه.
     • Gloss of the last three Arabic words (read right to left): woman [fem], nice [masc], this [masc]

  3. Motivation

  4. Motivation System guess (Quirk et al, 05)

  5. Motivation System guess (Quirk et al, 05) Correct

  6. Motivation System guess (Quirk et al, 05) Correct

  7. SMT challenges for English → morphology-rich language
     • Information ‘missing’ on source side
     • Data sparsity
     • Morphological agreement in the target language

  8. Related work
     • Translation from morphology-rich languages to English
       • Preprocessing of the inputs, to improve alignments: Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al., 05), Czech (Goldwater and McClosky, 05)
     • Translation from English to morphology-rich languages
       • Preprocessing and postprocessing: Turkish (El-Kahlout and Oflazer, 06), Spanish and Catalan (Ueffing and Ney, 03)
     • Our approach
       • Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)

  9. Morphology Prediction
     • Morphology generation as classification: classify each stem into an inflected form
     • Source: Eliminate a primary key constraint
     • System guess: eliminare un vincolo di chiave primario
     • Possible inflections:
       • eliminare, elimino, elimini, eliminiamo, …
       • un, una
       • vincolo, vincoli
       • di, del, dei, della, …
       • chiave, chiavi
       • primario, primaria, primari, primarie

  10. Outline • Morphology • Russian, Arabic • Lexicon operations • The task of inflection prediction • A log-linear model • Features • Lexical, Syntactic and Morphological • Experiments

  11. Russian Morphology
      • 3 genders, 2 numbers, 6 cases (nom, acc, location, …)
      • Nouns have gender, and inflect for number and case
      • Adjectives agree with nouns in number, gender, and case
      • Verbs agree with the subject in person and number (past tense agrees in gender and number)
      • Example: У меня есть синий карандаш
        • Gloss: at me is blue pencil
        • меня (me): Pers1, Gen
        • есть (is): Pres
        • синий (blue): Nom, Masc, Sing
        • карандаш (pencil): Nom, Masc, Sing

  12. Arabic morphology
      • Arabic: inflection + clitics
        • Prefixes: Conj/Prep/Det (in strict order)
        • Suffixes: Object pronouns/Possessive pronouns
      • Agreement:
        • In person, number, gender and definiteness (from Bar-Haim et al)
      • Example: فقلناها /faqulnāhā/
        • Segmentation: ف+ قال+ نا+ ها
        • Transliteration: fa+qul+na+hā
        • Gloss: so+said+we+it
        • Translation: so we said it
      • Example: وللمكتبات /walilmaktabāt/
        • Segmentation: و+ل+ال+مكتبة+ات
        • Transliteration: wa+li+al+maktabāt
        • Gloss: and+for+the+libraries
        • Translation: and for the libraries
      (from Nizar Habash)

  13. Lexicon Operations
      • Input: a surface word (here то), looked up in the lexicon
      • Stemming: set of possible lemmas
        • то, тот
      • Inflection: set of possible morphological variants
        • того, тому, тем, том, те, тех, теми, то
      • Analysis: set of possible morphological analyses
        • тот+PronAdj+DemPron+Neut+Sg+NomAcc (that)
        • то+Pron+Neut+Inanim+Sg+NomAcc (it)
        • то+Conj (then)
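The three lexicon operations on slide 13 can be illustrated with a toy lookup table. This is only a sketch: the `ANALYSES` and `VARIANTS` tables and the function names `stem`, `inflect` and `analyze` are assumptions made for illustration, not the lexicon interface used by the authors. The entries are the ones shown on the slide for то.

```python
from typing import Dict, List, Tuple

# Toy lexicon (hypothetical structure): surface word -> (lemma, tag string).
# Entries are copied from the slide's example for the surface word "то".
ANALYSES: Dict[str, List[Tuple[str, str]]] = {
    "то": [
        ("тот", "PronAdj+DemPron+Neut+Sg+NomAcc"),  # "that"
        ("то", "Pron+Neut+Inanim+Sg+NomAcc"),       # "it"
        ("то", "Conj"),                             # "then"
    ],
}

# Surface word -> all morphological variants listed on the slide.
VARIANTS: Dict[str, List[str]] = {
    "то": ["того", "тому", "тем", "том", "те", "тех", "теми", "то"],
}


def stem(word: str) -> List[str]:
    """Stemming: the set of possible lemmas of a surface word."""
    return sorted({lemma for lemma, _ in ANALYSES.get(word, [])})


def inflect(word: str) -> List[str]:
    """Inflection: the set of possible morphological variants."""
    return list(VARIANTS.get(word, []))


def analyze(word: str) -> List[str]:
    """Analysis: the set of possible morphological analyses."""
    return [f"{lemma}+{tags}" for lemma, tags in ANALYSES.get(word, [])]


print(stem("то"))
print(inflect("то"))
print(analyze("то"))
```

A word missing from the toy tables simply yields empty lists, which mirrors the later restriction of the evaluation to words found in the lexicon.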

  14. Inflection Prediction Model
      [Figure: label sequence y1 y2 y3 y4]
      • Given a sentence, predict the inflection of each word.
      • Conditional Markov Model
      • Sentence processed left-to-right (can be applied top-down)
      • Features: pairs of target and context predicates
      • Can model agreement: POS(yi-2)=DT & Number(yi-1)=sg & Number(yi)=sg
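The model on slide 14 can be sketched as a log-linear classifier applied left to right, where each decision sees the two previous predictions. The code below is a minimal illustration under stated assumptions: the predicate names, the toy weights, the candidate sets (borrowed from the Italian example on slide 9) and the greedy search are all hypothetical choices made here. The slides do not say which search procedure or which weights the authors used.

```python
import math
from typing import Dict, List, Tuple

Candidate = Tuple[str, str]         # (inflected form, number tag)
Feature = Tuple[str, str]           # (context predicate, target predicate)
START: Candidate = ("<s>", "none")  # padding before the first word


def context_predicates(stem: str, prev1: Candidate, prev2: Candidate) -> List[str]:
    """Predicates over the stem and the two previous predictions."""
    return [f"stem={stem}", f"number(y-1)={prev1[1]}", f"number(y-2)={prev2[1]}"]


def target_predicates(y: Candidate) -> List[str]:
    """Predicates over the candidate inflection itself."""
    return [f"form={y[0]}", f"number={y[1]}"]


def local_probs(candidates: List[Candidate], context: List[str],
                weights: Dict[Feature, float]) -> Dict[Candidate, float]:
    """Log-linear distribution: softmax over summed feature weights."""
    scores = [sum(weights.get((c, t), 0.0)
                  for c in context for t in target_predicates(y))
              for y in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {y: e / total for y, e in zip(candidates, exps)}


def greedy_decode(stems: List[str], candidates_for: Dict[str, List[Candidate]],
                  weights: Dict[Feature, float]) -> List[Candidate]:
    """Process the sentence left to right, keeping the best label each time."""
    prev1, prev2 = START, START
    output: List[Candidate] = []
    for stem in stems:
        context = context_predicates(stem, prev1, prev2)
        probs = local_probs(candidates_for[stem], context, weights)
        best = max(probs, key=probs.get)
        output.append(best)
        prev2, prev1 = prev1, best
    return output


# Toy data (hypothetical weights; forms taken from the Italian example).
CANDIDATES = {
    "un": [("un", "sg"), ("una", "sg")],
    "vincolo": [("vincolo", "sg"), ("vincoli", "pl")],
}
WEIGHTS = {
    ("stem=un", "form=un"): 1.0,
    ("number(y-1)=sg", "number=sg"): 2.0,  # number agreement with previous word
}

print([form for form, _ in greedy_decode(["un", "vincolo"], CANDIDATES, WEIGHTS)])
# prints ['un', 'vincolo']
```

The agreement weight pairs a predicate on the previous prediction with a predicate on the current one, which is how a singular determiner can pull the following noun toward its singular form.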

  15. Linguistic annotations
      • Annotations used in the Quirk et al (05) system
        • Source dependency tree
        • POS & morphological features
        • Surface features
        • POS & morphological features
        • Projected dependency tree

  16. Features
      • Feature types: Lexical, Syntax, Morph.
      • Monolingual: stem, left stem, right stem, yi-1, yi-2, parent stem, …; POS(yi-1), number(yi-1), person(yi-1), tense(yi-1), …
      • Bilingual: aligned words ai, parent(ai), left sister(ai), right sister(ai), POS(ai), number(ai), person(ai), tense(ai), det*(ai), prep*(ai), pron*(ai), …
      • Inflection: inflection(yi), POS(yi), tense(yi), number(yi), …

  17. Russian
      • [PrevStem=X, Case_Inflection=y]
      • [AlignedWords=will, Tense_Inflection=future]
      • [AlignedWords=been, Tense_Inflection=past]
      • [AlignedWords=click, Tense_Inflection=imperative]
      Arabic
      • [Prev.Stem=qam~-u_qam~, Prep_Inflection=bi]
      • [Aligned_Number=Plur, Number_Inflection=pl]
      • [AlignedWords=and, Conj_Inflection=true]
      • [PrevStem=fiy_y, Prep_Inflection=none]
      • [AlignedWords=applications, Gender_Inflection=fem]
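Each feature on slide 17 pairs one context predicate with one target predicate. A minimal sketch of how such pairs could be generated is given below. The function name `feature_pairs` and the choice to take the full cross product are assumptions for illustration. The slides do not state which pairs the authors actually instantiated.

```python
from itertools import product
from typing import Dict, List


def feature_pairs(context: Dict[str, str], target: Dict[str, str]) -> List[str]:
    """Conjoin every context predicate with every target predicate."""
    return [f"[{c}={cv}, {t}={tv}]"
            for (c, cv), (t, tv) in product(context.items(), target.items())]


# Context predicates describe the surroundings of the word being predicted;
# target predicates describe the candidate inflection (names from the slide).
context = {"AlignedWords": "will", "PrevStem": "X"}
target = {"Tense_Inflection": "future"}

for feature in feature_pairs(context, target):
    print(feature)
# prints:
# [AlignedWords=will, Tense_Inflection=future]
# [PrevStem=X, Tense_Inflection=future]
```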

  18. Reference Experiments
      • Baselines
        • Random baseline (pick a label at random)
        • Word-trigram language model baseline
          • Trained using the CMU toolkit on the same training dataset
      • Models
        • Monolingual word / all, Bilingual word / all
      • Lexicons:
        • Russian: dictionary; Arabic: Buckwalter analyzer
        • Evaluated only on words in the lexicon
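The random baseline and the lexicon-restricted evaluation on slide 18 can be sketched as follows. The function names and the toy data are hypothetical; the sketch assumes that "pick a label at random" means a uniform choice among the inflections the lexicon offers for a stem, which the slide does not spell out.

```python
import random
from typing import Dict, List, Optional


def random_baseline(stems: List[str], candidates_for: Dict[str, List[str]],
                    rng: random.Random) -> List[Optional[str]]:
    """Pick one candidate inflection uniformly at random for each stem."""
    return [rng.choice(candidates_for[s]) if s in candidates_for else None
            for s in stems]


def accuracy_in_lexicon(stems: List[str], predicted: List[Optional[str]],
                        gold: List[str],
                        candidates_for: Dict[str, List[str]]) -> float:
    """Accuracy counted only over words that the lexicon covers."""
    scored = [(p, g) for s, p, g in zip(stems, predicted, gold)
              if s in candidates_for]
    if not scored:
        return 0.0
    return sum(p == g for p, g in scored) / len(scored)


# Toy data: "xyz" is out of the lexicon and is therefore not scored.
lexicon = {"vincolo": ["vincolo", "vincoli"], "chiave": ["chiave", "chiavi"]}
stems = ["vincolo", "xyz", "chiave"]
gold = ["vincolo", "xyz", "chiave"]

guess = random_baseline(stems, lexicon, random.Random(0))
print(accuracy_in_lexicon(stems, guess, gold, lexicon))
```

With uniform sampling, the expected accuracy on a word is one over its number of candidate inflections, so the baseline gets weaker as the paradigm grows.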

  19. Russian inflection prediction: accuracy
      • The proposed model is more accurate than the language model baseline
      • Syntactic and morphological features are informative

  20. Arabic inflection prediction: accuracy

  21. Accuracy vs. training data size

  22. Error Analysis
      • Russian
        • Gender of pronoun (it ~ he/she/it)
        • Case/Gender in coordinate construction
        • Morphological analysis ambiguity
      • Arabic
        • Gender/Number of pronoun
        • Definiteness in noun phrases

  23. Summary
      • Proposed a general framework for improving SMT into morphology-rich languages
      • Showed that morpho-syntactic features and source sentence information, derived from an aligned sentence pair and a lexicon, are effective
      • Achieved good results even with little training data

  24. Future Directions
      • Integration with the MT system
        • Initial results for Russian: 1.7 BLEU improvement
      • Improvements to the model and features
        • Morphological disambiguation
        • Semantic role labeling
        • Longer distance agreements (e.g. pronoun coreference)
      • More languages

  25. Thanks! Questions?
