This paper explores the challenges of translating from English into morphology-rich languages such as Russian and Arabic. The authors propose a log-linear model for inflection prediction and evaluate its accuracy on Russian and Arabic datasets.
Learning to Generate Complex Morphology for Machine Translation Einat Minkov†, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research †Carnegie Mellon University
Motivation • Source: I would like to meet this nice woman. • System output: اود ان مواجهه هذا جيد امراه. • Gender of the last three words: woman (fem), nice (masc), this (masc)
Motivation • System guess (Quirk et al, 05) • Correct
SMT challenges for English → morphology-rich language • Information ‘missing’ on source side • Data sparsity • Morphological agreement in the target language
Related work • Translation from morphology-rich languages to English • Preprocessing of the inputs, to improve alignments: Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al. 05), Czech (Goldwater and McClosky 05) • Translation from English to morphology-rich languages • Preprocessing and postprocessing: Turkish (El-Kahlout and Oflazer 06), Spanish and Catalan (Ueffing and Ney, 03) • Our approach • Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)
Morphology Prediction • Morphology generation as classification: classify each stem into an inflected form • Source: Eliminate a primary key constraint • System guess: eliminare un vincolo di chiave primario • Possible inflections: eliminare, elimino, elimini, eliminiamo, … | un, una | vincolo, vincoli | di, del, dei, della, … | chiave, chiavi | primario, primaria, primari, primarie
Outline • Morphology • Russian, Arabic • Lexicon operations • The task of inflection prediction • A log-linear model • Features • Lexical, Syntactic and Morphological • Experiments
Russian Morphology • 3 genders, 2 numbers, 6 cases (nom, acc, locative, …) • Nouns have gender, and inflect for number and case • Adjectives agree with nouns in number, gender, and case • Verbs agree with the subject in person and number (past tense agrees in gender and number) • Example: У меня есть синий карандаш (gloss: at me is blue pencil) • меня: Pers1, Gen • есть: Pres • синий: Nom, Masc, Sing • карандаш: Nom, Masc, Sing
Arabic morphology • Arabic: inflection + clitics • Prefixes: Conj/Prep/Det (in strict order) • Suffixes: Object pronouns/Possessive pronouns • Agreement: in person, number, gender and definiteness (from Bar-Haim et al) • Example: فقلناها /faqulnāhā/ = ف+ قال+ نا+ ها = fa+qul+na+hā = so+said+we+it, "so we said it" • Example: وللمكتبات /walilmaktabāt/ = و+ل+ال+مكتبة+ات = wa+li+al+maktabāt = and+for+the+libraries, "and for the libraries" (from Nizar Habash)
Lexicon Operations (surface word → lexicon) • Stemming: set of possible lemmas, e.g. то → то, тот • Inflection: set of possible morphological variants, e.g. то → того, тому, тем, том, те, тех, теми, то • Analysis: set of possible morphological analyses, e.g. то → тот+PronAdj+DemPron+Neut+Sg+NomAcc (that); то+Pron+Neut+Inanim+Sg+NomAcc (it); то+Conj (then)
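The three lexicon operations can be sketched as lookups over a small table of (surface form, lemma, analysis) entries. This is only an illustration: the class name `ToyLexicon` and its method names are hypothetical, and the two entries marked below use illustrative tags that do not appear on the slide.

```python
from collections import defaultdict


class ToyLexicon:
    """Hypothetical in-memory lexicon built from (surface, lemma, analysis) triples."""

    def __init__(self, entries):
        self._by_surface = defaultdict(list)  # surface -> [(lemma, analysis), ...]
        self._by_lemma = defaultdict(set)     # lemma -> {surface, ...}
        for surface, lemma, analysis in entries:
            self._by_surface[surface].append((lemma, analysis))
            self._by_lemma[lemma].add(surface)

    def stem(self, surface):
        """Stemming: set of possible lemmas for a surface word."""
        return sorted({lemma for lemma, _ in self._by_surface.get(surface, [])})

    def analyze(self, surface):
        """Analysis: all morphological analyses of a surface word."""
        return [analysis for _, analysis in self._by_surface.get(surface, [])]

    def inflect(self, lemma):
        """Inflection: set of surface variants sharing a lemma."""
        return sorted(self._by_lemma.get(lemma, set()))


entries = [
    # The three analyses of "то" shown on the slide.
    ("то", "тот", "тот+PronAdj+DemPron+Neut+Sg+NomAcc"),
    ("то", "то", "то+Pron+Neut+Inanim+Sg+NomAcc"),
    ("то", "то", "то+Conj"),
    # Illustrative tags (assumed, not from the slide).
    ("того", "тот", "тот+PronAdj+DemPron+Neut+Sg+Gen"),
    ("тому", "тот", "тот+PronAdj+DemPron+Neut+Sg+Dat"),
]
lexicon = ToyLexicon(entries)
lemmas = lexicon.stem("то")
analyses = lexicon.analyze("то")
variants = lexicon.inflect("тот")
```

Because a surface word such as то maps to several lemmas, stemming returns a set rather than a single stem, which is why the prediction task has to consider candidates from every possible lemma.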
Inflection Prediction Model • Given a sentence, predict the inflection y_i of each word (y_1, y_2, y_3, y_4, …) • Conditional Markov Model • Sentence processed left-to-right (can be applied top-down) • Features: pairs of target and context predicates • Can model agreement: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
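A minimal sketch of a left-to-right conditional Markov model with log-linear local distributions follows. It is not the authors' implementation: the feature templates, the hand-set weights, and the greedy search are all assumptions made for illustration (a real system would learn the weights and would likely use a wider search). The candidate sets reuse the Italian example from the Morphology Prediction slide.

```python
import math


def features(stem, label, prev, prev2):
    """Hypothetical templates pairing context predicates with the target label."""
    return [
        f"stem={stem}&y={label}",
        f"y-1={prev}&y={label}",
        f"y-2={prev2}&y-1={prev}&y={label}",
    ]


def local_distribution(stem, candidates, prev, prev2, weights):
    """Log-linear (softmax) distribution over the candidate inflections of one stem."""
    scores = [
        sum(weights.get(f, 0.0) for f in features(stem, c, prev, prev2))
        for c in candidates
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {c: e / total for c, e in zip(candidates, exps)}


def greedy_decode(stems, candidate_sets, weights):
    """Process the sentence left to right, conditioning on the two previous predictions."""
    output = []
    for stem in stems:
        prev = output[-1] if output else "<s>"
        prev2 = output[-2] if len(output) > 1 else "<s>"
        dist = local_distribution(stem, candidate_sets[stem], prev, prev2, weights)
        output.append(max(dist, key=dist.get))
    return output


# Candidate inflections from the slide example; weights are hand-set for the demo.
candidate_sets = {"un": ["un", "una"], "vincolo": ["vincolo", "vincoli"]}
weights = {"stem=un&y=un": 0.5, "y-1=un&y=vincolo": 1.0}
prediction = greedy_decode(["un", "vincolo"], candidate_sets, weights)
dist = local_distribution("vincolo", candidate_sets["vincolo"], "un", "<s>", weights)
```

Restricting the softmax to the candidate set of each stem keeps the label space small, and conditioning on the previous two predictions is what lets agreement features like the one above fire.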
Linguistic annotations • Annotations used in the Quirk et al (05) system • Source dependency tree • POS & morphological features • Surface features • POS & morphological features • Projected dependency tree
Features
Target (inflection) predicates: inflection(y_i), POS(y_i), tense(y_i), number(y_i), …
Lexical • Monolingual: stem, left stem, right stem, y_{i-1}, y_{i-2} • Bilingual: aligned words a_i
Syntax • Monolingual: parent stem, … • Bilingual: parent(a_i), left sister(a_i), right sister(a_i)
Morph. • Monolingual: POS(y_{i-1}), number(y_{i-1}), person(y_{i-1}), tense(y_{i-1}), … • Bilingual: POS(a_i), number(a_i), person(a_i), tense(a_i), det*(a_i), prep*(a_i), pron*(a_i), …
Example features • Russian: [PrevStem=X, Case_Inflection=y] [AlignedWords=will, Tense_Inflection=future] [AlignedWords=been, Tense_Inflection=past] [AlignedWords=click, Tense_Inflection=imperative] • Arabic: [Prev.Stem=qam~-u_qam~, Prep_Inflection=bi] [Aligned_Number=Plur, Number_Inflection=pl] [AlignedWords=and, Conj_Inflection=true] [PrevStem=fiy_y, Prep_Inflection=none] [AlignedWords=applications, Gender_Inflection=fem]
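Features of this shape can be generated by crossing every context predicate with every target predicate of a candidate inflection. The sketch below assumes that reading; the function names and the dictionary representation of an inflection are hypothetical, and only two context templates (previous stem, aligned words) are shown.

```python
def context_predicates(prev_stem, aligned_words):
    """Context side: previous stem (monolingual) and aligned source words (bilingual)."""
    predicates = [f"PrevStem={prev_stem}"]
    predicates += [f"AlignedWords={word}" for word in aligned_words]
    return predicates


def target_predicates(inflection):
    """Target side: one predicate per morphological attribute of the candidate."""
    return [f"{attr}_Inflection={value}" for attr, value in sorted(inflection.items())]


def feature_pairs(prev_stem, aligned_words, inflection):
    """Cross every context predicate with every target predicate."""
    return [
        f"[{context}, {target}]"
        for context in context_predicates(prev_stem, aligned_words)
        for target in target_predicates(inflection)
    ]


# A candidate Russian verb form marked as future tense, aligned to English "will".
pairs = feature_pairs("X", ["will"], {"Tense": "future"})
```

Each generated string would then index one weight in the log-linear model, so a pair such as aligned "will" with future tense can be rewarded directly.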
Reference Experiments • Baselines • Random baseline (pick a label at random) • Word-trigram language model baseline, trained using the CMU toolkit on the same training dataset • Models • Monolingual word / all, Bilingual word / all • Lexicons • Russian: dictionary; Arabic: Buckwalter analyzer • Evaluated only on words in the lexicon
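The random baseline and the evaluation restricted to in-lexicon words can be sketched as follows. The function names and the toy data are assumptions for illustration; the trigram language model baseline is not reproduced here.

```python
import random


def random_baseline(stems, candidate_sets, rng):
    """Pick one candidate inflection at random for each stem found in the lexicon."""
    return [
        rng.choice(candidate_sets[stem]) if stem in candidate_sets else None
        for stem in stems
    ]


def accuracy(predicted, gold, stems, candidate_sets):
    """Fraction of correct predictions, counting only stems covered by the lexicon."""
    scored = [
        (p, g) for p, g, stem in zip(predicted, gold, stems) if stem in candidate_sets
    ]
    if not scored:
        return 0.0
    return sum(p == g for p, g in scored) / len(scored)


# Toy data: "xyz" is out of lexicon and therefore excluded from scoring.
candidate_sets = {"un": ["un", "una"], "vincolo": ["vincolo", "vincoli"]}
stems = ["un", "vincolo", "xyz"]
gold = ["un", "vincolo", "xyz"]

guesses = random_baseline(stems, candidate_sets, random.Random(0))
score = accuracy(["un", "vincoli", None], gold, stems, candidate_sets)
```

Skipping out-of-lexicon words means the reported accuracy measures how well the model chooses among known candidates, not lexicon coverage.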
Russian inflection prediction: accuracy • The proposed model is more accurate than the language model baseline • Syntactic and morphological features are informative
Error Analysis • Russian • Gender of pronoun (it ~ he/she/it) • Case/Gender in coordinate construction • Morphological analysis ambiguity • Arabic • Gender/Number of pronoun • Definiteness in noun phrases
Summary • Proposed a general framework for improving SMT into morphology-rich languages • Showed that morpho-syntactic features and source sentence information, derived from an aligned sentence pair and a lexicon, are effective • Achieved good results even with little training data
Future Directions • Integration with the MT system • Initial results for Russian: 1.7 BLEU improvement • Improvements to the model and features • Morphological disambiguation • Semantic role labeling • Longer distance agreements (e.g. pronoun coreference) • More languages