Learning-based MT Approaches for Languages with Limited Resources

Alon Lavie
Language Technologies Institute, Carnegie Mellon University
Joint work with: Jaime Carbonell, Lori Levin, Kathrin Probst, Erik Peterson, Christian Monson, Ariadna Font-Llitjos, Alison Alvarez, Roberto Aranovich
Learning-based MT

Outline
• Rationale for learning-based MT
• Roadmap for learning-based MT
• Framework overview
• Elicitation
• Learning transfer rules
• Automatic rule refinement
• Learning morphology
• Example prototypes
• Implications for MT with vast parallel data
• Conclusions and future directions

Machine Translation: Where are we today?
• Age of Internet and Globalization – great demand for MT:
  • Multiple official languages of UN, EU, Canada, etc.
  • Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
• Economic incentive is still primarily within a small number of language pairs
• Some fairly good commercial products in the market for these language pairs
  • Primarily a product of rule-based systems after many years of development
• Pervasive MT between most language pairs still non-existent and not on the immediate horizon

Approaches to MT: Vauquois MT Triangle
[Figure: Vauquois triangle with Analysis, Transfer, and Generation paths]
• Interlingua: Give-information+personal-data (name=alon_lavie)
• Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
• Direct: Mi chiamo Alon Lavie → My name is Alon Lavie

Progression of MT
• Started with rule-based systems
  • Very large expert human effort to construct language-specific resources (grammars, lexicons)
  • High-quality MT extremely expensive → only for a handful of language pairs
• Along came EBMT and then SMT…
  • Replaced human effort with extremely large volumes of parallel text data
  • Less expensive, but still only feasible for a small number of language pairs
  • We “traded” human labor for data
• Where does this take us in 5-10 years?
  • Large parallel corpora for maybe 25-50 language pairs
• What about all the other languages?
• Is all this data (with very shallow representation of language structure) really necessary?
• Can we build MT approaches that learn deeper levels of language structure and how they map from one language to another?

Why Machine Translation for Languages with Limited Resources?
• We are in the age of information explosion
  • The internet+web+Google → anyone can get the information they want anytime…
• But what about the text in all those other languages?
  • How do they read all this English stuff?
  • How do we read all the stuff that they put online?
• MT for these languages would enable:
  • Better government access to native indigenous and minority communities
  • Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
  • Civilian and military applications (disaster relief)
  • Language preservation

The Roadmap to Learning-based MT
• Automatic acquisition of necessary language resources and knowledge using machine learning methodologies:
  • Learning morphology (analysis/generation)
  • Rapid acquisition of broad coverage word-to-word and phrase-to-phrase translation lexicons
  • Learning of syntactic structural mappings
    • Tree-to-tree structure transformations [Knight et al], [Eisner], [Melamed] require parse trees for both languages
    • Learning syntactic transfer rules with resources (grammar, parses) for just one of the two languages
  • Automatic rule refinement and/or post-editing
• A framework for integrating the acquired MT resources into effective MT prototype systems
• Effective integration of acquired knowledge with statistical/distributional information

CMU’s AVENUE Approach
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
  • Building Elicitation corpora from feature structures
  • Feature Detection and Navigation
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
  • Learn from major language to minor language
  • Translate from minor language to major language
• XFER + Decoder:
  • XFER engine produces a lattice of possible transferred structures at all levels
  • Decoder searches and selects the best scoring combination
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• Morphology Learning
• Word and Phrase bilingual lexicon acquisition

AVENUE Architecture
[Figure: architecture diagram. Components shown: Word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run Time Transfer System, Lattice, Decoder, English Language Model, Word-to-Word Translation Probabilities]
• Sample transfer rule shown in the diagram:
  {PP,4894}
  ;;Score:0.0470
  PP::PP [NP POSTP] -> [PREP NP]
  ((X2::Y1) (X1::Y2))

Data Elicitation for Languages with Limited Resources
• Rationale:
  • Large volumes of parallel text not available → create a small maximally-diverse parallel corpus that directly supports the learning task
  • Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  • Elicitation corpus designed to be typologically and structurally comprehensive and compositional
  • Transfer-rule engine and new learning approach support acquisition of generalized transfer rules from the data

Elicitation Tool: English-Chinese Example

Elicitation Tool: English-Hindi Example

Elicitation Tool: English-Arabic Example

Elicitation Tool: Spanish-Mapudungun Example

Designing Elicitation Corpora
• What do we want to elicit?
  • Diversity of linguistic phenomena and constructions
  • Syntactic structural diversity
• How do we construct an elicitation corpus?
  • Typological Elicitation Corpus based on elicitation and documentation work of field linguists (e.g. Comrie 1977, Bouquiaux 1992): initial corpus size ~1000 examples
  • Structural Elicitation Corpus based on representative sample of English phrase structures: ~120 examples
• Organized compositionally: elicit simple structures first, then use them as building blocks
• Goal: minimize size, maximize linguistic coverage

Typological Elicitation Corpus
• Feature Detection
  • Discover what features exist in the language and where/how they are marked
  • Example: does the language mark gender of nouns? How and where is it marked?
  • Method: compare translations of minimal pairs, i.e. sentences that differ in only ONE feature
• Elicit translations/alignments for detected features and their combinations
• Dynamic corpus navigation based on feature detection: no need to elicit for combinations involving non-existent features

Typological Elicitation Corpus
• Initial typological corpus of about 1000 sentences was manually constructed
• New construction methodology for building an elicitation corpus using:
  • A feature specification: lists inventory of available features and their values
  • A definition of the set of desired feature structures
    • Schemas define sets of desired combinations of features and values
    • Multiplier algorithm generates the comprehensive set of feature structures
  • A generation grammar and lexicon: NLG generator generates NL sentences from the feature structures
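As a rough illustration of the multiplier step, the sketch below expands a schema into every combination of its feature values. The feature names, values, and the `multiply` function are hypothetical and not taken from the AVENUE tools; it only shows the cross-product idea.

```python
from itertools import product

# Hypothetical feature specification: feature name -> inventory of values.
FEATURE_SPEC = {
    "np-number": ["sg", "pl"],
    "np-gender": ["m", "f"],
    "tense": ["past", "present", "future"],
}

def multiply(schema, spec=FEATURE_SPEC):
    """Expand a schema (a list of feature names to vary) into all
    feature structures, one dict per combination of values."""
    names = list(schema)
    value_lists = [spec[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

# Vary number and tense only: 2 x 3 feature structures.
structures = multiply(["np-number", "tense"])
for fs in structures:
    print(fs)
```

Each resulting feature structure would then be handed to the generation grammar to produce one elicitation sentence.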

Structural Elicitation Corpus
• Goal: create a compact diverse sample corpus of syntactic phrase structures in English in order to elicit how these map into the elicited language
• Methodology:
  • Extracted all CFG “rules” from Brown section of Penn TreeBank (122K sentences)
  • Simplified POS tag set
  • Constructed frequency histogram of extracted rules
  • Pulled out simplest phrases for most frequent rules for NPs, PPs, ADJPs, ADVPs, SBARs and Sentences
  • Some manual inspection and refinement
• Resulting corpus of about 120 phrases/sentences representing common structures
• See [Probst and Lavie, 2004]
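The rule-extraction and histogram steps can be sketched as follows. This is a minimal illustration on two made-up bracketed trees, not the actual pipeline run on the Penn TreeBank; the function names and the toy trees are assumptions, and POS simplification is omitted.

```python
from collections import Counter

def parse(tokens):
    """Parse a tokenized bracketed tree into (label, children) tuples."""
    tokens.pop(0)                      # consume "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))  # a word (leaf)
    tokens.pop(0)                      # consume ")"
    return (label, children)

def count_rules(tree, counts):
    """Count one CFG rule per non-preterminal node, recursively."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:  # preterminals (POS over a word) are skipped
        counts[(label, tuple(k[0] for k in kids))] += 1
        for k in kids:
            count_rules(k, counts)

def rule_histogram(bracketed_trees):
    counts = Counter()
    for text in bracketed_trees:
        tokens = text.replace("(", " ( ").replace(")", " ) ").split()
        count_rules(parse(tokens), counts)
    return counts

# Toy trees (hypothetical, for illustration only).
TREES = [
    "(S (NP (DT the) (NN man)) (VP (VBD slept)))",
    "(S (NP (DT an) (JJ old) (NN dog)) (VP (VBD barked)))",
]
hist = rule_histogram(TREES)
for (lhs, rhs), n in hist.most_common():
    print(n, lhs, "->", " ".join(rhs))
```

Sorting by frequency, as `most_common()` does here, corresponds to reading off the most frequent rules per phrase type before picking example phrases for them.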

Transfer Rule Formalism

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
)

Parts of a rule:
• Type information: NP::NP
• Part-of-speech/constituent information: [DET ADJ N] -> [DET N DET ADJ]
• Alignments: (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
• x-side constraints: constraints on X variables
• y-side constraints: constraints on Y variables
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

Transfer Rule Formalism (II)
• Value constraints, e.g. ((X1 AGR) = *3-SING)
• Agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))
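To make the role of the constituent sequences and alignments concrete, here is a minimal sketch that applies the NP rule above to "the old man". It is not the XFER engine: feature constraints are ignored, and the `TransferRule` class, `apply_rule` function, and the three-word lexicon are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class TransferRule:
    src_type: str
    tgt_type: str
    src_seq: list      # x-side constituent sequence
    tgt_seq: list      # y-side constituent sequence
    alignments: list   # 1-based (x, y) pairs, as in (X1::Y1)

NP_RULE = TransferRule(
    "NP", "NP",
    ["DET", "ADJ", "N"],
    ["DET", "N", "DET", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)

# Toy lexicon (assumption), using the transliterations from the slide.
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}

def apply_rule(rule, words, pos_tags, lexicon):
    """Reorder and translate words according to the rule's alignments.
    Returns None when the x-side sequence does not match."""
    if pos_tags != rule.src_seq:
        return None
    source_of = {y: x for x, y in rule.alignments}
    return [lexicon[words[source_of[y] - 1]]
            for y in range(1, len(rule.tgt_seq) + 1)]

print(apply_rule(NP_RULE, ["the", "old", "man"], ["DET", "ADJ", "N"], LEXICON))
```

Because X1 is aligned to both Y1 and Y3, the determiner is produced twice on the target side, which matches the doubled "ha" in the example translation.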

Rule Learning - Overview
• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the source side (grammatical structure)
• Three steps:
  • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  • Compositionality Learning: use previously learned rules to learn hierarchical structure
  • Constraint Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
• Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
  • Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
  • Words that have complex alignments (or not the same POS) remain lexicalized
• One seed rule for each translation example
• No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
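The generalization criterion above can be sketched in a few lines. This is an illustrative reading of the bullets, not the AVENUE learner: the `seed_rule` function and its input format are assumptions, and the POS tags for the target side are supplied by hand.

```python
from collections import Counter

def seed_rule(src, tgt, alignments):
    """Build a flat seed rule from one aligned translation example.
    src, tgt: lists of (word, pos); alignments: 1-based (i, j) pairs.
    One-to-one aligned words with matching POS are abstracted to POS;
    everything else stays lexicalized."""
    src_links = Counter(i for i, _ in alignments)
    tgt_links = Counter(j for _, j in alignments)
    x_side = [word for word, _ in src]
    y_side = [word for word, _ in tgt]
    for i, j in alignments:
        one_to_one = src_links[i] == 1 and tgt_links[j] == 1
        if one_to_one and src[i - 1][1] == tgt[j - 1][1]:
            x_side[i - 1] = src[i - 1][1]
            y_side[j - 1] = tgt[j - 1][1]
    return x_side, y_side, sorted(alignments)

src = [("the", "DET"), ("old", "ADJ"), ("man", "N")]
tgt = [("ha", "DET"), ("ish", "N"), ("ha", "DET"), ("zaqen", "ADJ")]
x_side, y_side, links = seed_rule(src, tgt, [(1, 1), (1, 3), (2, 4), (3, 2)])
print(x_side, "->", y_side, links)
```

In this example "the" is aligned to two target words, so it counts as a complex alignment and stays lexicalized, while "old" and "man" are abstracted to ADJ and N.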

Compositionality Learning
• Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
• Generalization: adjust constituent sequences and alignments
• Two implemented variants:
  • Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
  • Maximal Compositionality: generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent

Constraint Learning
• Goal: add appropriate feature constraints to the acquired rules
• Methodology:
  • Preserve general structural transfer
  • Learn specific feature constraints from example set
• Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
• Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
• The seed rules in a group form the specific boundary of a version space
• The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints

Constraint Learning: Generalization
• The partial order of the version space:
  Definition: A transfer rule tr1 is strictly more general than another transfer rule tr2 if all f-structures that are satisfied by tr2 are also satisfied by tr1.
• Generalize rules by merging them:
  • Deletion of constraint
  • Raising two value constraints to an agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) → ((x1 num) = (x3 num))
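The raising operation can be illustrated with a small sketch. The tuple encoding of constraints and the `raise_to_agreement` function are assumptions made for this example; the actual learner works over version spaces and is not reproduced here.

```python
from itertools import combinations

def raise_to_agreement(value_constraints):
    """Replace pairs of value constraints that assign the same value to
    the same feature on different variables with an agreement constraint.
    A value constraint is (variable, feature, value), e.g. ("x1", "num", "pl").
    Returns (remaining value constraints, agreement constraints)."""
    remaining = list(value_constraints)
    agreements = []
    for a, b in combinations(value_constraints, 2):
        (var_a, feat_a, val_a), (var_b, feat_b, val_b) = a, b
        same = feat_a == feat_b and val_a == val_b and var_a != var_b
        if same and a in remaining and b in remaining:
            remaining.remove(a)
            remaining.remove(b)
            agreements.append(((var_a, feat_a), (var_b, feat_b)))
    return remaining, agreements

constraints = [("x1", "num", "pl"), ("x3", "num", "pl"), ("x1", "def", "def")]
remaining, agreements = raise_to_agreement(constraints)
print(remaining)   # value constraints left untouched
print(agreements)  # ((x1 num) = (x3 num)) in tuple form
```

The resulting rule is more general than the original: it accepts any pair of matching number values rather than plural only.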

Automated Rule Refinement
• Bilingual informants can identify translation errors and pinpoint the errors
• A sophisticated trace of the translation path can identify likely sources for the error and do “Blame Assignment”
• Rule Refinement operators can be developed to modify the underlying translation grammar (and lexicon) based on characteristics of the error source:
  • Add or delete feature constraints from a rule
  • Bifurcate a rule into two rules (general and specific)
  • Add or correct lexical entries
• See [Font-Llitjos, Carbonell & Lavie, 2005]

Morphology Learning
• Goal: Unsupervised learning of morphemes and their function from raw monolingual data
  • Segmentation of words into morphemes
  • Identification of morphological paradigms (inflections and derivations)
  • Learning association between morphemes and their function in the language
• Organize the raw data in the form of a network of paradigm candidate schemes
• Search the network for a collection of schemes that represent true morphology paradigms of the language
• Learn mappings between the schemes and features/functions using minimal pairs of elicited data
• Construct analyzer based on the collection of schemes and the acquired function mappings

Example Vocabulary
blame blamed blames roamed roaming roams solve solves solving

Sample candidate schemes (c-suffixes: c-stems)
• Ø.s.d: blame
• e.es: blam, solv
• Ø.s: blame, solve
• me.mes: bla
• s: blame, roam, solve

Scheme network (c-suffixes: c-stems)
• Three c-suffixes:
  • me.mes.med: bla
  • e.es.ed: blam
  • Ø.s.d: blame
• Two c-suffixes:
  • e.es: blam, solv
  • Ø.s: blame, solve
  • me.mes: bla
  • me.med: bla
  • e.ed: blam
  • Ø.d: blame
  • s.d: blame
  • mes.med: bla
  • es.ed: blam
• One c-suffix:
  • e: blam, solv
  • Ø: blame, blames, blamed, roams, roamed, roaming, solve, solves, solving
  • me: bla
  • s: blame, roam, solve
  • es: blam, solv
  • mes: bla
  • med: bla, roa
  • ed: blam, roam
  • d: blame, roame
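The schemes for the example vocabulary can be reproduced with a short sketch: split every word at every position into a candidate stem and candidate suffix, then collect the stems that occur with all suffixes of a scheme. The function names are assumptions, and this only builds individual schemes; it does not implement the network search.

```python
from collections import defaultdict

VOCAB = ["blame", "blamed", "blames", "roamed", "roaming",
         "roams", "solve", "solves", "solving"]

def suffixes_by_stem(vocab):
    """Map each candidate stem to the candidate suffixes it occurs with.
    The empty suffix is written as "Ø"."""
    table = defaultdict(set)
    for word in vocab:
        for cut in range(1, len(word) + 1):
            table[word[:cut]].add(word[cut:] or "Ø")
    return table

def scheme(c_suffixes, table):
    """Return the c-stems that occur with every c-suffix in the scheme."""
    wanted = set(c_suffixes)
    return sorted(stem for stem, sufs in table.items() if wanted <= sufs)

table = suffixes_by_stem(VOCAB)
for c_suffixes in (["Ø", "s", "d"], ["e", "es"], ["me", "mes"], ["s"]):
    print(".".join(c_suffixes), scheme(c_suffixes, table))
```

Note that the procedure is deliberately naive about linguistic plausibility: it produces me.mes with stem "bla" just as readily as Ø.s.d with stem "blame", which is why a search over the network is needed to pick out true paradigms.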

Scheme Network: Spanish Example
• Spanish Newswire Corpus
  • 40,011 Tokens
  • 6,975 Types
• Network level = number of c-suffixes (e.g. Level 5 = 5 C-suffixes)

C-Suffixes       C-Stem Type Count   C-Stems
a.as.o.os.tro    1                   cas
a.as.o.os        43                  african, cas, jurídic, l, ...
a.as.o           59                  cas, citad, jurídic, l, ...
a.as.os          50                  afectad, cas, jurídic, l, ...
a.o.os           105                 impuest, indonesi, italian, jurídic, ...
as.o.os          54                  cas, implicad, jurídic, l, ...
a.as             199                 huelg, incluid, industri, inundad, ...
as.o             85                  intern, jurídic, just, l, ...
o.os             268                 human, implicad, indici, indocumentad, ...
a.tro            2                   cas, cen
a.o              214                 id, indi, indonesi, inmediat, ...
a.os             134                 impedid, impuest, indonesi, inundad, ...
as.os            68                  cas, implicad, inundad, jurídic, ...
tro              16                  catas, ce, cen, cua, ...
a                1237                huelg, ib, id, iglesi, ...
as               404                 huelg, huelguist, incluid, industri, ...
o                1139                hub, hug, human, huyend, ...
os               534                 humorístic, human, hígad, impedid, ...

• Adjective Inflection Class: a.as.o.os
• Schemes arising from the spurious c-suffix “tro”: a.as.o.os.tro, a.tro, tro
• Basic Search Procedure: moving up the network, the c-suffix count increases and the c-stem count decreases

AVENUE Prototypes
• General XFER framework under development for past three years
• Prototype systems so far:
  • German-to-English, Dutch-to-English
  • Chinese-to-English
  • Hindi-to-English
  • Hebrew-to-English
• In progress or planned:
  • Mapudungun-to-Spanish
  • Quechua-to-Spanish
  • Arabic-to-English
  • Native-Brazilian languages to Brazilian Portuguese

Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
  • No publicly available broad coverage morphological analyzer
  • No publicly available bilingual lexicons or dictionaries
  • No POS-tagged corpus or parse tree-bank corpus for Hebrew
  • No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based MT framework for languages with limited resources

Hebrew-to-English MT Prototype
• Initial prototype developed within a two-month intensive effort
• Accomplished:
  • Adapted available morphological analyzer
  • Constructed a preliminary translation lexicon
  • Translated and aligned Elicitation Corpus
  • Learned XFER rules
  • Developed (small) manual XFER grammar as a point of comparison
  • System debugging and development
  • Evaluated performance on unseen test data using automatic evaluation metrics

Morphology Example
• Input word: B$WRH

0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Morphology Example

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
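A lattice like this can be read as a set of arcs over span positions, and candidate segmentations are the paths from the first position to the last. The sketch below enumerates them for B$WRH. The arc positions are an assumption read off the lattice diagram (they are not copied from the SPANSTART/SPANEND values), and `segmentations` is a hypothetical helper, not part of the XFER engine.

```python
# Arcs as (start, end, surface form), read off the lattice diagram (assumption).
ARCS = [
    (0, 4, "B$WRH"),
    (0, 2, "B"), (2, 3, "$WR"), (3, 4, "H"),
    (0, 1, "B"), (1, 2, "H"), (2, 4, "$WRH"),
]

def segmentations(arcs, start, end):
    """Enumerate every sequence of arcs that connects start to end."""
    if start == end:
        return [[]]
    paths = []
    for arc_start, arc_end, lex in arcs:
        if arc_start == start:
            for rest in segmentations(arcs, arc_end, end):
                paths.append([lex] + rest)
    return paths

for path in segmentations(ARCS, 0, 4):
    print(" + ".join(path))
```

Because arcs from different analyses share positions, the enumeration also yields combinations that are not drawn as a single row in the diagram; choosing among the candidates is left to later stages such as the decoder.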

Sample Output (dev-data)

maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, 2 reference translations

Implications for MT with Vast Amounts of Parallel Data
• Learning word/short-phrase translations vs. learning long phrase-to-phrase translations
• Phrase-to-phrase MT ill suited for long-range reorderings → ungrammatical output
• Recent work on hierarchical Stat-MT [Chiang, 2005] and parsing-based MT [Melamed et al, 2005]
• Learning general tree-to-tree syntactic mappings is equally problematic:
  • Meaning is a hybrid of complex, non-compositional phrases embedded within a syntactic structure
  • Some constituents can be translated in isolation, others require contextual mappings
