Language Divergences and Solutions Advanced Machine Translation Seminar Alison Alvarez
Overview • Introduction • Morphology Primer • Translation Mismatches • Types • Solutions • Translation Divergences • Types • Solutions • Different MT Systems • Generation-Heavy Machine Translation • DUSTer
Source ≠ Target • Languages don’t encode the same information in the same way • Makes MT complicated • Keeps all of us employed
Morphology in a Nutshell • Morphemes are word parts • Work +er • Iki +ta +ku +na +ku +na +ri +ma +shi +ta (Japanese ikitakunakunarimashita, roughly '(I) came to not want to go') • Types of morphemes • Derivational: creates a new word • Inflectional: adds grammatical information to an existing word
Morphology in a Nutshell • Analytic/Isolating • Little or no inflectional morphology; information is carried by separate words • Vietnamese, Chinese • English: I was made to go • Synthetic • Lots of inflectional morphology • Fusional vs. agglutinating • Fusional: Romance languages (cf. English he need +s it, where -s marks 3rd person, singular, and present tense at once) • Agglutinating: Finnish, Japanese, Mapudungun; Ika (to go) +se (causative) +rare (passive) +ta (past) --> ikaserareta, 'was made to go' • (A toy morpheme-segmentation sketch follows this slide.)
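As an illustration of agglutinating morphology, here is a minimal sketch that peels suffixes off the Japanese example above, assuming a toy hand-written suffix table; the suffixes, glosses, and greedy right-to-left stripping are illustrative assumptions, not a real morphological analyzer.

    # Toy illustration of agglutinating morphology: peel known suffixes off a
    # Japanese-style verb form, right to left.  Suffix table and glosses are
    # rough and for illustration only.
    SUFFIXES = {
        "ta": "PAST",
        "mashi": "POLITE",
        "ri": "VERBALIZER (become)",
        "na": "NEGATIVE",
        "ku": "ADVERBIAL",
        "taku": "DESIDERATIVE (want to)",
    }

    def segment(word):
        """Greedily strip the longest known suffix from the right end."""
        morphemes = []
        rest = word
        while True:
            match = max((s for s in SUFFIXES if rest.endswith(s)), key=len, default=None)
            if match is None:
                break
            morphemes.append((match, SUFFIXES[match]))
            rest = rest[:-len(match)]
        morphemes.append((rest, "STEM"))
        return list(reversed(morphemes))

    print(segment("ikitakunakunarimashita"))
    # -> [('iki', 'STEM'), ('taku', 'DESIDERATIVE (want to)'), ('na', 'NEGATIVE'), ...]

A real analyzer would also handle stem changes and ambiguity; the point here is simply how much grammatical information one synthetic word form can carry.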
Translation Differences • Types • Translation Mismatches • Different information from source to target • Translation Divergences • Same information from source to target, but the meaning is distributed differently in each language
Translation Mismatches • “…the information that is conveyed is different in the source and target languages” • Types: • Lexical level • Typological level
Lexical Mismatches • A lexical item in one language may make fewer distinctions than its counterparts in another • English brother vs. Japanese 弟 (otouto) 'younger brother' and 兄 (ani) 'older brother'
Typological Mismatches • Mismatch between languages with different degrees of grammaticalization • One language may be more structurally complex • Examples: source (evidential) marking, obligatory subjects
Typological Mismatches • Source marking: Quechua vs. English • (they say) s/he was singing --> takisharansi • taki (sing) +sha (progressive) +ra (past) +n (3rd sg) +si (reportative) • Obligatory arguments: Japanese vs. English • Kusuri wo nonda --> (I, you, etc.) took medicine • Makasemasu! --> (I'll) leave (it) to (you)
Translation Mismatch Solutions • More information --> Less information (easy) • Less information --> More information (hard) • Context clues • Language Models • Generalization • Formal representations
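One way to picture the hard "less information --> more information" direction is the pro-drop case from the previous slide: the target language must supply a subject the source never expressed. A minimal sketch, assuming a hypothetical scoring function that stands in for a real target-language model:

    # Sketch: recovering a dropped subject when the target language requires one.
    # lm_score is a stand-in heuristic; a real system would use an n-gram or
    # neural language model over the target language.
    CANDIDATE_SUBJECTS = ["I", "you", "he", "she", "we", "they"]

    def lm_score(sentence):
        """Stand-in for a target-language model score (higher = more fluent)."""
        return 1.0 if sentence.startswith("I ") else 0.5   # toy heuristic

    def fill_subject(predicate):
        """Choose the subject whose completed sentence scores best."""
        candidates = ["{} {}".format(subj, predicate) for subj in CANDIDATE_SUBJECTS]
        return max(candidates, key=lm_score)

    # Japanese "kusuri wo nonda" carries no subject; English must supply one.
    print(fill_subject("took the medicine."))   # -> "I took the medicine."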
Translation Divergences • "…the same information is conveyed in source and target texts" • Divergences are quite common • They occur in about one out of every three sentences in the TREC El Norte newspaper corpus (Spanish-English) • Sentences can have multiple kinds of divergences
Translation Divergence Types • Categorial Divergence • Conflational Divergence • Structural Divergence • Head Swapping Divergence • Thematic Divergence
Categorial Divergence • Translation that uses different parts of speech • Tener hambre (have hunger) --> be hungry • Noun --> adjective
Conflational Divergence • The translation of two words using a single word that combines their meaning • Can also be called a lexical gap • X stab Z --> X dar puñaladas a Z (X give stabs to Z) • glastuinbouw --> cultivation under glass
Structural Divergence • A difference in the realization of incorporated arguments • PP to Object • X entrar en Y (X enter in Y) --> X enter Y • X ask for a referendum --> X pedir un referendum (ask-for a referendum)
Head Swapping Divergence • Involves the demotion of a head verb and the promotion of a modifier verb to head position • I ran into the room --> Yo entro en el cuarto corriendo • English tree: [S [NP I] [VP [V ran] [PP into the room]]]; Spanish tree: [S [NP Yo] [VP [V entro] [PP en el cuarto] [VP corriendo]]]
Thematic Divergence • This divergence occurs when sentence arguments switch argument roles from one language to another • X gustar a Y (X please to Y) --> Y like X
Divergence Solutions and Statistical/EBMT Systems • Not really addressed explicitly in SMT • Handled in EBMT only if the divergence is covered extensively in the data
Divergence Solutions and Transfer Systems • Hand-written transfer rules • Automatic extraction of transfer rules from bi-texts • Problematic with multiple divergences
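For concreteness, a minimal sketch of what a hand-written transfer rule for the thematic divergence (X gustar a Y --> Y like X) might look like; the dependency format and rule shape are assumptions for illustration, not the notation of any particular transfer system.

    # Sketch of a hand-written transfer rule for the thematic divergence
    # "X gustar a Y" -> "Y like X".  The dependency representation (a dict with
    # a predicate and named arguments) is an illustrative assumption.
    def transfer_gustar(sl_node):
        """Map Spanish 'gustar' to English 'like', swapping the argument roles."""
        if sl_node["pred"] != "gustar":
            return sl_node
        return {
            "pred": "like",
            "subj": sl_node["obj"],    # the experiencer (Y) becomes the subject
            "obj":  sl_node["subj"],   # the theme (X) becomes the object
        }

    spanish = {"pred": "gustar", "subj": "las manzanas", "obj": "Juan"}
    print(transfer_gustar(spanish))
    # -> {'pred': 'like', 'subj': 'Juan', 'obj': 'las manzanas'}

With several divergences stacked in one sentence, many such rules have to interact, which is exactly why multiple divergences are problematic for transfer systems.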
Divergence Solutions and Interlingua Systems • Mel’čuk’s Deep Syntactic Structure • Jackendoff’s Lexical Conceptual Structure • Both require “explicit symmetric knowledge” of both the source and the target language • Expensive
Divergence Solutions and Interlingua Systems • John swam across a river <--> Juan cruza el río nadando • Shared interlingua (LCS-style) representation: [event CAUSE JOHN [event GO JOHN [path ACROSS JOHN [position AT JOHN RIVER]]] [manner SWIM+INGLY]]
Generation-Heavy MT • Built to address language divergences • Designed for source-poor/target-rich translation • Non-Interlingual • Non-Transfer • Uses symbolic overgeneration to account for different translation divergences
Generation-Heavy MT • Source language • syntactic parser • translation lexicon • Target language • lexical semantics, categorial variations & subcategorization frames for overgeneration • Statistical language model
Analysis Stage • Independent of the target language • Creates a deep syntactic dependency representation • Keeps only argument structure, top-level conceptual nodes & thematic-role information • Should normalize over syntactic & morphological phenomena
Translation Stage • Converts SL lexemes to TL lexemes • Maintains dependency structure
Analysis/Translation Stage • Example dependency after lexical translation: head GIVE (v) [cause go]; dependents: I (agent), STAB (n) (theme), JOHN (goal)
Generation Stage • Lexical & structural selection • Conversion to a thematic dependency • Uses a syntactic-thematic linking map • “Loose” linking • Structural expansion • Addresses conflation & head-swapping divergences • Turns the thematic dependency into a TL syntactic dependency • Addresses categorial divergence (a toy sketch of the overgeneration idea follows below)
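A minimal sketch of the overgeneration idea, assuming a toy categorial-variation table and light-verb list (both invented for illustration): from one predicate, emit both a conflated single-verb realization and an expanded light-verb + noun realization, leaving the choice to the later ranking step.

    # Sketch of symbolic overgeneration: from one input predicate, emit several
    # candidate target structures, including a conflated (single verb) form and
    # an expanded light-verb + noun form.  The tables below are toy assumptions.
    CATVAR = {"stab": "stab(N)"}          # verb -> related nominal form
    LIGHT_VERB = {"stab(N)": "give"}      # nominal -> a light verb that supports it

    def overgenerate(pred, agent, goal):
        """Produce alternative realizations of one predicate."""
        candidates = ["{} {} {}".format(agent, pred, goal)]              # conflated form
        noun = CATVAR.get(pred)
        if noun:
            lv = LIGHT_VERB.get(noun, "have")
            candidates.append("{} {} {} to {}".format(agent, lv, noun, goal))  # expanded form
        return candidates

    print(overgenerate("stab", "I", "John"))
    # -> ['I stab John', 'I give stab(N) to John']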
Generation Stage • Linearization step • Creates a word lattice to encode the different possible realizations • Implemented using the oxyGen engine • Sentences are ranked & extracted with Nitrogen’s statistical extractor
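A minimal sketch of that ranking step, with a toy unigram model standing in for Nitrogen's statistical extractor; the counts and candidates are made up for illustration.

    # Sketch of ranking/extraction: score each overgenerated candidate with a
    # (toy) unigram language model and keep the best.  The counts are invented;
    # a real system uses a trained statistical model.
    import math

    UNIGRAM_COUNTS = {"i": 100, "stab": 2, "john": 30, "give": 50, "stabs": 1, "to": 120}
    TOTAL = sum(UNIGRAM_COUNTS.values()) + 1

    def lm_logprob(sentence):
        return sum(math.log((UNIGRAM_COUNTS.get(w, 0) + 1) / TOTAL)
                   for w in sentence.lower().split())

    candidates = ["I stab John", "I give stabs to John"]
    print(max(candidates, key=lm_logprob))   # the model's preferred realization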
GHMT Results • 4 of 5 Spanish-English divergences “can be generated using structural expansion & categorial variations” • The remaining 1 in 5 needed more world knowledge or idiom handling • An SL syntactic parser can still be hard to come by
Divergences and DUSTer • Helps overcome divergences for word alignment & improves coder agreement • Changes an English sentence’s structure to resemble that of another language • Enables more accurate alignment and projection of dependency trees without training on dependency-tree data
DUSTer • Motivation for the development of automatic correction of divergences • “Every Language Pair has translation divergences that are easy to recognize” • “Knowing what they are and how to accommodate them provides the basis for refined word level alignment” • “Refined word-level” alignment results in improved projection of structural information from English to another language
DUSTer • Bi-text parsed on the English side only • “Linguistically motivated” & common search terms • Conducted on Spanish & Arabic (and later Chinese & Hindi) • Uses all of the divergences mentioned before, plus a “light verb” divergence • try --> put to trying --> poner a prueba
DUSTer Rule Development Methods • Identify canonical transformations for each divergence type • Categorize English sentences by divergence type or “none” • Apply the appropriate transformations to produce E' • Humans align E and E' with the foreign-language sentence
DUSTer Rules
# "kill" => "LightVB kill(N)"   (LightVB = light verb)
# Presumably, this will work for:
#   "kill"   => "give death to"
#   "borrow" => "take lent (thing) to"
#   "hurt"   => "make harm to"
#   "fear"   => "have fear of"
#   "desire" => "have interest in"
#   "rest"   => "have repose on"
#   "envy"   => "have envy of"
type1.B.X [English{2 1 3} Spanish{2 1 3 4 5}]
  [ Verb<1,i,CatVar:V_N> [ Noun<2,j,Subj> ] [ Noun<3,k,Obj> ] ]
  <-->
  [ LightVB<1,Verb> [ Noun<2,j,Subj> ] [ Noun<3,i,Obj> ] [ Oblique<4,Pred,Prep> [ Noun<5,k,PObj> ] ] ]
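To make the effect of such a light-verb rule concrete, here is a minimal sketch of the E --> E' rewriting it describes, assuming an invented rewrite table; this is not DUSTer's rule set or implementation.

    # Sketch of the E -> E' rewriting a light-verb rule describes: replace a
    # single English verb with a light-verb + noun + preposition sequence so
    # the English structure mirrors the foreign-language one.  The table is an
    # illustrative assumption.
    LIGHT_VERB_REWRITES = {
        "fears":  ("has", "fear", "of"),
        "envies": ("has", "envy", "of"),
        "hurts":  ("makes", "harm", "to"),
    }

    def rewrite(tokens):
        """Rewrite a token list into E' form wherever a light-verb rule applies."""
        out = []
        for tok in tokens:
            if tok in LIGHT_VERB_REWRITES:
                out.extend(LIGHT_VERB_REWRITES[tok])
            else:
                out.append(tok)
        return out

    print(" ".join(rewrite(["John", "fears", "dogs"])))
    # -> "John has fear of dogs"

The transformed E' then aligns more directly, word by word, with the foreign-language light-verb construction.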
Conclusion • Divergences are common • They are not handled well by most MT systems • GHMT can account for divergences, but still needs development • DUSTer can handle divergences through structure transformations, but requires a great deal of linguistic knowledge
The End • Questions?
References
Dorr, Bonnie J. "Machine Translation Divergences: A Formal Description and Proposed Solution." Computational Linguistics, 20:4, pp. 597-633, 1994.
Dorr, Bonnie J. and Nizar Habash. "Interlingua Approximation: A Generation-Heavy Approach." Proceedings of the Workshop on Interlingua Reliability, Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 1-6, 2002.
Dorr, Bonnie J., Clare R. Voss, Eric Peterson, and Michael Kiker. "Concept Based Lexical Selection." Proceedings of the AAAI-94 Fall Symposium on Knowledge Representation for Natural Language Processing in Implemented Systems, New Orleans, LA, pp. 21-30, 1994.
Dorr, Bonnie J., Lisa Pearl, Rebecca Hwa, and Nizar Habash. "DUSTer: A Method for Unraveling Cross-Language Divergences for Statistical Word-Level Alignment." Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 31-43, 2002.
Habash, Nizar and Bonnie J. Dorr. "Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation." Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA-2002, Tiburon, CA, pp. 84-93, 2002.
Haspelmath, Martin. Understanding Morphology. Oxford University Press, 2002.
Kameyama, Megumi, Ryo Ochitani, and Stanley Peters. "Resolving Translation Mismatches With Information Flow." Annual Meeting of the Association for Computational Linguistics, 1991.