Machine Translation Overview

Alon Lavie
Language Technologies Institute, Carnegie Mellon University
August 25, 2004

Machine Translation: History
• MT started in the 1940s, as one of the first conceived applications of computers
• Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems
• ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the 1960s
• MT revival started in earnest in the 1980s (US, Japan)
• Field dominated by rule-based approaches, requiring 100s of person-years of manual development
• Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)
LTI Immigration Course

Machine Translation: Where are we today?
• Age of Internet and Globalization – great demand for MT:
  • Multiple official languages of UN, EU, Canada, etc.
  • Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
• Economic incentive is still primarily within a small number of language pairs
• Some fairly good commercial products on the market for these language pairs
  • Primarily a product of rule-based systems after many years of development
• Pervasive MT between most language pairs still non-existent and not on the immediate horizon

Best Current General-purpose MT
PAHO’s Spanam system:
• Source (Spanish): Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.
• MT output (English): Through petition received by the Inter-American Commission on Human Rights (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the American Convention on Human Rights (hereinafter …”), as a consequence of judgments initiated against it.

Core Challenges of MT
• Ambiguity:
  • Human languages are highly ambiguous, and differently so in different languages
  • Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
• Amount of required knowledge:
  • At least several 100k words, about as many phrases, plus syntactic knowledge (i.e. translation rules). How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

How to Tackle the Core Challenges
• Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
• Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
• Learning Approaches: learn translation rules automatically from small amounts of human-translated and word-aligned data. Example: AVENUE’s XFER approach.
• Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!

State-of-the-Art in MT
• What users want:
  • General purpose (any text)
  • High quality (human level)
  • Fully automatic (no user intervention)
• We can meet any 2 of these 3 goals today, but not all three at once:
  • FA HQ: Knowledge-Based MT (KBMT)
  • FA GP: Corpus-Based (Example-Based) MT
  • GP HQ: Human-in-the-loop (efficiency tool)

Types of MT Applications:
• Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
• Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
• Communication: lower quality may be okay, but input is degraded and real-time operation is required.

Approaches to MT: Vauquois MT Triangle
• Interlingua: Give-information+personal-data (name=alon_lavie)
• Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
• Direct: Mi chiamo Alon Lavie → My name is Alon Lavie
• Analysis climbs the source-language side of the triangle; Generation descends the target-language side

Knowledge-based Interlingual MT
• The “obvious” deep Artificial Intelligence approach:
  • Analyze the source language into a detailed symbolic representation of its meaning
  • Generate this meaning in the target language
• “Interlingua”: one single meaning representation for all languages
• Nice in theory, but extremely difficult in practice

The Interlingua KBMT approach
• With interlingua, need only N parsers/generators instead of N² transfer systems
[Figure: languages L1 to L6 connected pairwise by transfer systems, versus L1 to L6 each connected to a central interlingua]
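The scaling argument can be checked with a few lines of arithmetic. This is a minimal sketch under two counting assumptions that the slide does not spell out: transfer systems are counted per directed language pair, and the interlingua design is counted as one analyzer plus one generator per language. The function names are illustrative only.

```python
def transfer_systems(n: int) -> int:
    """Pairwise transfer: one system per directed (source, target) pair."""
    return n * (n - 1)


def interlingua_components(n: int) -> int:
    """Interlingua: one analyzer and one generator per language."""
    return 2 * n


if __name__ == "__main__":
    # Quadratic versus linear growth as languages are added.
    for n in (3, 6, 20):
        print(n, transfer_systems(n), interlingua_components(n))
```

For the six languages in the figure this gives 30 directed transfer systems against 12 interlingua components, and the gap widens quadratically as N grows.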

Statistical MT (SMT)
• Proposed by IBM in the early 1990s: a direct, purely statistical model for MT
• Statistical translation models are trained on a sentence-aligned translation corpus
• Attractive: completely automatic, no manual rules, much reduced manual labor
• Main drawbacks:
  • Effective only with huge volumes (several mega-words) of parallel text
  • Very domain-sensitive
  • Still viable only for a small number of language pairs!
• Impressive progress in the last 3-4 years due to large DARPA funding program (TIDES)

EBMT Paradigm
• New Sentence (Source): Yesterday, 200 delegates met with President Clinton.
• Matches to Source Found:
  • Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  • Difficulties with President Clinton… / Schwierigkeiten mit Praesident Clinton…
• Alignment (Sub-sentential):
  • Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  • Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
• Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
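The match, align, and recombine steps above can be sketched as a greedy longest-match lookup. This is only a toy illustration, not the PANGLOSS or GEBMT implementation: the `FRAGMENTS` table is a hypothetical stand-in for sub-sentential alignments that a real system would extract from its example base, and `tokenize` and `translate` are names invented here.

```python
import re

# Hypothetical aligned fragments, as if already extracted from the matched examples.
FRAGMENTS = {
    ("yesterday", ",", "200", "delegates", "met"): "Gestern trafen sich 200 Abgeordnete",
    ("with", "president", "clinton"): "mit Praesident Clinton",
}


def tokenize(text):
    """Lowercase and split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def translate(sentence):
    """Cover the input left to right with the longest known source fragments."""
    tokens = tokenize(sentence)
    output, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            target = FRAGMENTS.get(tuple(tokens[i:j]))
            if target is not None:
                output.append(target)
                i = j
                break
        else:
            # No example covers this token; copy it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


if __name__ == "__main__":
    print(translate("Yesterday, 200 delegates met with President Clinton."))
```

The greedy cover is a simplification chosen for brevity. It ignores target-side reordering and competing overlapping matches, which is where a lattice and a decoder would come in.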

GEBMT vs. Statistical MT
Generalized-EBMT (GEBMT) uses examples at run time, rather than training a parameterized model. Thus:
• GEBMT can work with a smaller parallel corpus than Stat MT
• Large target language corpus still useful for generating target language model
• Much faster to “train” (index examples) than Stat MT; until recently was much faster at run time as well
• Generalizes in a different way than Stat MT (whether this is better or worse depends on match between statistical model and reality):
  • Stat MT can fail on a training sentence, while GEBMT never will
  • GEBMT generalizations based on linguistic knowledge, rather than statistical model design

Multi-Engine MT
• Apply several MT engines to each input; use statistical language modeller to select best combination of outputs.
• Goal is to combine strengths and avoid weaknesses.
• Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
• Used in various projects
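The selection step can be sketched with a toy language model. This is a simplified illustration under several assumptions: it chooses among whole-sentence hypotheses rather than combining segments from a chart, the add-one-smoothed bigram model and its three-sentence training corpus are invented for the example, and the engine outputs are made up.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a scorer giving the average add-one-smoothed bigram log-probability."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))

    def score(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        pairs = list(zip(tokens, tokens[1:]))
        total = sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + len(vocab)))
            for a, b in pairs
        )
        # Normalize by length so that shorter hypotheses are not favored automatically.
        return total / len(pairs)

    return score


def select_best(hypotheses, score):
    """Pick the hypothesis that the target language model likes best."""
    return max(hypotheses, key=score)


if __name__ == "__main__":
    lm = train_bigram_lm(["my name is john", "her name is mary", "what is your name"])
    engine_outputs = ["me call alon lavie", "my name is alon lavie"]
    print(select_best(engine_outputs, lm))
```

Length normalization matters in this sketch: without it the shorter, less fluent hypothesis would score higher simply because it contains fewer bigrams.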

MEMT chart example

Speech-to-Speech MT
• Speech just makes MT (much) more difficult:
  • Spoken language is messier
  • False starts, filled pauses, repetitions, out-of-vocabulary words
  • Lack of punctuation and explicit sentence boundaries
  • Current speech technology is far from perfect
• Need for speech recognition and synthesis in foreign languages
• Robustness: MT quality degradation should be proportional to SR quality
• Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

MT at the LTI
• LTI originated as the Center for Machine Translation (CMT) in 1985
• MT continues to be a prominent sub-discipline of research within the LTI
  • More MT faculty than any of the other areas
  • More MT faculty than anywhere else
• Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
• Leader in the area of speech-to-speech MT

KBMT: KANT, KANTOO, CATALYST
• Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  • Syntactic and semantic analysis into an unambiguous, detailed symbolic representation of meaning using unification grammars and transformation mappers
  • Generation into the target language using unification grammars and transformation mappers
• First large-scale multi-lingual interlingua-based MT system deployed commercially:
  • CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
• Limited domains and controlled English input
• Minor amounts of post-editing
• Active follow-on projects
• Contact Faculty: Eric Nyberg and Teruko Mitamura

EBMT
• Developed originally for the PANGLOSS system in the early 1990s
  • Translation between English and Spanish
• Generalized EBMT under development for the past several years
• Currently one of the two MT approaches developed at CMU for the DARPA/TIDES program
  • Chinese-to-English, large and very large amounts of sentence-aligned parallel data
• Active research work on improving alignment and indexing, decoding from a lattice
• Contact Faculty: Ralf Brown and Jaime Carbonell

Statistical MT
• Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
• Extracted and trained from very large amounts of sentence-aligned parallel text
  • Word alignment algorithms
  • Phrase detection algorithms
  • Translation model probability estimation
• Main approach pursued in CMU systems in the DARPA/TIDES program:
  • Chinese-to-English and Arabic-to-English
• Most active work is on phrase detection and on advanced lattice decoding
• Contact Faculty: Stephan Vogel and Alex Waibel
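To make "translation model probability estimation" concrete, here is a minimal sketch of expectation-maximization training in the style of IBM Model 1, the simplest of the IBM word-based models. It is not the CMU system: the three-sentence German-English corpus is a made-up toy, there is no NULL word, and `train_model1` is a name chosen here. `t[(s, e)]` estimates the probability of source word `s` given target word `e`.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate word translation probabilities t[(s, e)] = P(s | e) with EM."""
    source_vocab = {s for source, _ in pairs for s in source}
    uniform = 1.0 / len(source_vocab)
    t = defaultdict(lambda: uniform)  # start from a uniform table
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per target word
        for source, target in pairs:
            for s in source:
                # E-step: split one unit of count for s across the target words.
                norm = sum(t[(s, e)] for e in target)
                for e in target:
                    fraction = t[(s, e)] / norm
                    count[(s, e)] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(s, e): c / total[e] for (s, e), c in count.items()})
    return t


if __name__ == "__main__":
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]
    table = train_model1(corpus)
    for key in [("das", "the"), ("haus", "house"), ("buch", "book")]:
        print(key, round(table[key], 3))
```

Even on three sentence pairs, co-occurrence statistics alone pull probability mass toward the intuitive pairings. Real systems need far more data, which is the drawback noted for SMT earlier in the deck.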

Speech-to-Speech MT
• Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON
• Early 1990s: first prototype system that fully performed sp-to-sp (very limited domain)
• Interlingua-based, but with shallow task-oriented representations:
  • “we have single and double rooms available” [give-information+availability] (room-type={single, double})
• Semantic Grammars for analysis and generation
• Multiple languages: English, German, French, Italian, Japanese, Korean, and others
• Most active work on portable speech translation on small devices: Arabic/English and Thai/English
• Contact Faculty: Alan Black, Tanja Schultz and Alex Waibel (also Alon Lavie and Lori Levin)
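The shallow task-oriented representation in the example can be sketched as a small data structure. This is an illustrative guess at a convenient encoding, not the actual JANUS or NESPOLE! interchange format code: the class `InterlinguaFrame`, its field names, and the single English generation template are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InterlinguaFrame:
    speech_act: str                                 # e.g. "give-information"
    concepts: list                                  # e.g. ["availability"]
    arguments: dict = field(default_factory=dict)   # e.g. {"room-type": [...]}

    def render(self):
        """Serialize in the bracketed notation used on the slide."""
        head = "+".join([self.speech_act] + self.concepts)
        parts = []
        for name, values in self.arguments.items():
            parts.append(name + "={" + ", ".join(values) + "}")
        return "[" + head + "] (" + ", ".join(parts) + ")"


def generate_english(frame):
    """Toy generator with one hard-coded template for the availability frame."""
    if (frame.speech_act, tuple(frame.concepts)) == ("give-information", ("availability",)):
        rooms = " and ".join(frame.arguments.get("room-type", []))
        return "we have " + rooms + " rooms available"
    raise ValueError("no generation template for this frame")


if __name__ == "__main__":
    frame = InterlinguaFrame("give-information", ["availability"],
                             {"room-type": ["single", "double"]})
    print(frame.render())
    print(generate_english(frame))
```

Because the frame records only the task-level content, a generator for any other language needs just its own template for the same frame, which is what makes the representation attractive for limited domains.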

AVENUE: Transfer-based MT
• Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data
• Specifically designed to bootstrap MT for languages for which only limited amounts of electronic resources are available (particularly indigenous minority languages)
• Use machine learning techniques to generalize transfer rules from specific translated examples
• Combine with decoding techniques from SMT for producing the best translation of new input from a lattice of translation segments
• Languages: Hebrew, Hindi, Mapudungun, Quechua
• Most active work on designing a typologically comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction
• Contact Faculty: Alon Lavie, Lori Levin and Jaime Carbonell

Transfer with Strong Decoding
[Figure: system diagram with components Word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run Time Transfer System, Lattice, Decoder, English Language Model, Word-to-Word Translation Probabilities]
Example transfer rule:
{PP,4894}
;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1)(X1::Y2))
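The example rule reads as: a source PP made of an NP followed by a postposition becomes a target PP made of a preposition followed by an NP, with source constituent 2 aligned to target constituent 1 and source 1 to target 2. A minimal sketch of applying such a reordering rule follows. It is not the AVENUE transfer engine: `TransferRule` and `apply_rule` are hypothetical names, and the child constituents are assumed to have been translated already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    source_type: str
    target_type: str
    source_rhs: tuple   # source constituent sequence
    target_rhs: tuple   # target constituent sequence
    alignments: tuple   # (x, y) pairs, 1-based as in X2::Y1


# PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1)(X1::Y2))
PP_RULE = TransferRule("PP", "PP", ("NP", "POSTP"), ("PREP", "NP"), ((2, 1), (1, 2)))


def apply_rule(rule, children):
    """Reorder translated children according to the rule's alignments.

    children is a list of (category, translated_text) pairs for the source
    constituents. Returns None when the rule does not match.
    """
    categories = tuple(category for category, _ in children)
    if categories != rule.source_rhs:
        return None
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        # Source slot x fills target slot y and takes the target-side category.
        target[y - 1] = (rule.target_rhs[y - 1], children[x - 1][1])
    return target


if __name__ == "__main__":
    # Postpositional source order "the house" + "in" becomes "in" + "the house".
    print(apply_rule(PP_RULE, [("NP", "the house"), ("POSTP", "in")]))
```

In a full system many such rules would fire over the input and contribute scored segments to the lattice, leaving the final choice to the decoder.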

MT for Minority and Indigenous Languages: Challenges
• Minimal amount of parallel text
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost

Learning Transfer-Rules for Languages with Limited Resources
• Rationale:
  • Large bilingual corpora not available
  • Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
  • Elicitation corpus designed to be typologically comprehensive and compositional
  • Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

English-Hindi Example

Questions…
