Machine Translation Introduction

Machine Translation: Introduction • Jan Odijk • LOT Winterschool, Amsterdam, January 2011

Overview • MT: What is it? • MT: What is not possible (yet?) • MT: Why is it so difficult? • MT: Can we make it possible? • MT: Evaluation • MT: What is (perhaps) possible • Conclusions

MT: What is it? • Input: text in source language • Output: text in target language that is a translation of the input text

MT: What is it? (Diagram: translation pyramid) • Top: interlingua • Middle: analyzed input → transfer → analyzed output • Bottom: input → direct translation → output

MT: System Types • Direct: • Earliest systems (1950s) • Direct word-to-word translation • Recent statistical MT systems • Transfer • Almost all research and commercial systems <= 1990 • Interlingual

MT: System Types • Interlingual • A few research systems in the 1980s • Rosetta (Philips), based on Montague Grammar • Semantic derivation trees of attuned grammars • Distributed Translation (BSO) • (enriched) Esperanto • Sometimes logical representations • Hybrid Interlingual/Transfer • Transfer for lexicons; IL for rules

Rule-Based Systems • Most systems • explicit source language grammar • parser yields analysis of source language input • transfer component turns it into target language structure • no explicit grammar of target language (except morphology)

Rule-Based Systems • Some systems (Eurotra) • explicit source and target language grammar • sometimes reversible • parser yields analysis of source language input • transfer component turns it into target language structure • generation of translation by target language grammar

Rule-Based Systems • Some systems (Rosetta, DLT) • explicit source and target language grammar • in some cases reversible • parser yields interlingual representation • generation of translation by target language grammar from interlingual representation

MT: Is it difficult? • FAHQT: Fully Automatic High Quality Translation • Fully Automatic: no human intervention • High Quality: close or equal to human translation • Even acceptable quality is difficult to achieve

MT: Why is it so difficult? • Ambiguity • Real • Temporary • Computational Complexity • Complexity of language • Divergences • Language Competence v. Language Use • Require large and rich lexicons

MT: Why is it so difficult? • De jongen sloeg het meisje met de gitaar • Hij heeft boeken gelezen • Hij heeft uren gelezen • He has been reading books • *He has been reading for books • *He has been reading hours • He has been reading for hours

MT: Why is it so difficult? • Not only uren, but also • dagen, de hele dag, weken, … • (words expressing units of time) • And also: • de hele vergadering, meeting, bijeenkomst, les, … • (words expressing events)

MT: Why is it so difficult? • Hij draagt een bruin pak • Dragen: wear or carry • Pak: suit or package • Hij draagt een bruin pak en zwarte schoenen • Hij draagt een bruin pak onder zijn arm

MT: Why is it so difficult? • Voert uw bedrijf sloten uit? • Uitvoeren: execute, or export? • Bedrijf: act, or company? • Sloten: ditches, or locks?

MT: Why is it so difficult? • Temporary Ambiguity • Hij heeft boeken gelezen • Heeft: main or auxiliary verb? • Boeken: noun or verb? • Voert uw bedrijf sloten uit? • Voert: form of voeren or of uitvoeren? • Bedrijf: noun or verb form? • Sloten uit: noun + particle, or PP (out of ditches/locks)?

Why is MT difficult? Summary • Ambiguity of natural language • requires modeling of knowledge of the world/situation • by rule systems, and/or • by statistics

MT: Why is it so difficult? • Computational Complexity • High demands of processing capacity • High demands on memory • Complexity of language • Many different construction types • All interacting with each other

Why is MT difficult? • Divergences between languages • require deep syntactic analysis • or very sophisticated statistical techniques

Divergences: Category mismatches • Simple category mismatches • woonachtig (zijn) v. reside (Adj – Verb) • zich ergeren v. (be) annoyed (Verb-Adj) • verliefd v. in love (Adj- Prep+Noun) • kunnen v. (be) able • kunnen v. (be) possible • door- v. continue (to)

Divergences: Category mismatches • More complex category mismatches • graag vs. like (Adv vs. Verb) • hij zwemt graag vs. he likes to swim • toevallig vs. happen • hij viel toevallig vs. he happened to fall

Divergences: Category mismatches • Phrasal category mismatches • de zieke vrouw • the woman who is ill (* the ill woman) • I expect her to leave • ik verwacht dat zij vertrekt • She is likely to come • het is waarschijnlijk dat zij komt

Conflational Divergences: • prepositional complements • houden van vs. love • existential er vs. Ø • er passeerde een auto vs. • a car passed • verbal particles • blow (something) up vs. volar

Conflational Divergences: • reflexive verbs • zich scheren vs. shave • composed vs. simple tense forms • he will do it vs. lo hará • split negatives vs. composed negatives • he does not see anyone vs. • hij ziet niemand

Functional Divergences: • I like these apples • me gustan estas manzanas • se venden manzanas aqui • hier verkoopt men appels • er werd door de toeschouwers gejuicht • the spectators were cheering

Divergences: MWEs • semi-fixed MWEs • nuclear power plant vs. kerncentrale • flexible idioms • de plaat poetsen vs. bolt • de pijp uit gaan v. to kick the bucket

Divergences: MWEs • semi-idioms (collocations) • zware shag vs. strong tobacco • semi-idioms (support verbs) • aandacht besteden aan • pay attention to

MT: Why is it so difficult? • Language Competence v. Language Use • Earlier systems implemented idealized reality • But not the really occurring language use • In some cases • focus on theoretically interesting difficult constructions • That do occur in reality • But other constructions are more important to deal with in practical systems

MT: Why is it so difficult? • Large and rich lexicons • Existing human-oriented dictionaries are not suited as such • All information must be available in a formalized way • Much more information is needed than in a traditional dictionary

MT: Why is it so difficult? • Multi-word Expressions (MWEs) • Appear in current dictionaries only in a very informal way • No standards on how to represent them lexically • Many different types requiring different treatment in the grammar • Huge numbers!! • Domain- and company-specific terms are often MWEs

MT: Can we make it possible? • Probably not, • but we can still improve significantly • Lexicons • Selection restrictions • Approximating analyses • Statistical MT

MT: Can we make it possible? • Large and rich lexicons • widely accepted and used (de facto) standards • Methods and tools to quickly adapt to domain or company specific vocabulary • Better treatment of MWEs and standards for lexical representation of MWEs

MT: Can we make it possible? • Selection restrictions with a type system, to approximate modeling of world knowledge • Requires sophisticated syntactic analysis • Boek: info (legible) • Uur: time unit → duration • Vergadering: event → duration • Lezen: subject=human; object=info (legible) • Durational adjunct must be a duration phrase

MT: Can we make it possible? • Selection restrictions • Pak (1) (suit): clothes • Pak (2) (package): entity • Dragen (1) (wear): subj=animate; object=clothes • Dragen (2) (carry): subj=animate; object=entity • Schoen: clothes • Entity > clothes • Identity preferred over subsumption • Homogeneous object preferred over heterogeneous one

MT: Can we make it possible? • Selection restrictions • Hij draagt een bruin pak • He wears a brown suit (1: clothes=clothes) • He carries a brown package (1: entity=entity) • He carries a brown suit (2: entity > clothes) • *He wears a brown package (clothes ¬> entity) • Hij draagt een bruin pak en zwarte schoenen • He wears a brown suit and black shoes (1: homogeneous and clothes=clothes) • He carries a brown suit and black shoes (2: homogeneous but entity > clothes) • He carries a brown package and black shoes (2: heterogeneous but entity=entity) • *He wears a brown package and black shoes (clothes ¬> entity)
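The ranking of readings for "Hij draagt een bruin pak" can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the dictionaries PARENT, NOUNS and VERBS and the function names are hypothetical, the type hierarchy contains just the two types from the slides, and the preference for homogeneous coordinated objects is not modeled.

```python
# Hypothetical toy type hierarchy: each type maps to its parent (None = root).
PARENT = {"entity": None, "clothes": "entity"}

# Hypothetical toy lexicon: (English translation, semantic type of the noun).
NOUNS = {"pak": [("suit", "clothes"), ("package", "entity")]}
# (English translation, type required of the object).
VERBS = {"dragen": [("wear", "clothes"), ("carry", "entity")]}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT[specific]
    return False


def rank_readings(verb, noun):
    """Return (rank, verb translation, noun translation) tuples, best first.

    Rank 1: the object's type is identical to the required type.
    Rank 2: the required type strictly subsumes the object's type.
    Readings that violate the restriction are dropped.
    """
    readings = []
    for verb_en, required in VERBS[verb]:
        for noun_en, actual in NOUNS[noun]:
            if required == actual:
                readings.append((1, verb_en, noun_en))
            elif subsumes(required, actual):
                readings.append((2, verb_en, noun_en))
    return sorted(readings)


for rank, verb_en, noun_en in rank_readings("dragen", "pak"):
    print(rank, "he", verb_en + "s" if verb_en == "wear" else "carries", "a brown", noun_en)
```

The reading "wear a package" never appears in the result, because the required type (clothes) neither equals nor subsumes the object's type (entity).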

MT: Can we make it possible? • Approximating analyses • Ignore certain ambiguities to begin with • Use only limited amount of relevant information • Cut off analysis when there are too many alternatives • This is currently actually done in all practical systems • Need new ways of doing this without affecting quality too seriously
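One simple way to cut off an analysis when there are too many alternatives is to keep only the best-scoring few at each step. The sketch below assumes that each partial analysis already carries a numeric score; the function name prune and the (score, analysis) pair format are hypothetical, and real systems use more refined criteria.

```python
import heapq


def prune(analyses, max_alternatives):
    """Keep only the `max_alternatives` highest-scoring analyses.

    `analyses` is a list of (score, analysis) pairs; higher scores are better.
    Everything below the cut-off is discarded, so a correct but low-scoring
    analysis can be lost: this is the quality risk of approximation.
    """
    return heapq.nlargest(max_alternatives, analyses, key=lambda pair: pair[0])


candidates = [(0.9, "analysis A"), (0.1, "analysis B"), (0.5, "analysis C")]
print(prune(candidates, 2))
```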

MT: Can we make it possible? • Statistical MT • Derives MT system automatically • From statistics taken from • Aligned parallel corpora (→ translation model) • Monolingual target language corpora (→ language model) • Being worked on since the early 1990s
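A common way to combine the two models is the noisy-channel formulation: choose the target sentence e that maximizes P(f | e) · P(e), where f is the source sentence, P(f | e) comes from the translation model and P(e) from the language model. The slides do not say which combination is meant, so the sketch below is only one possibility, and all probabilities in it are invented for illustration.

```python
import math

# Hypothetical toy translation model: P(source | target). Values are invented.
TRANSLATION_MODEL = {
    ("hij draagt een pak", "he wears a suit"): 0.4,
    ("hij draagt een pak", "he carries a package"): 0.4,
    ("hij draagt een pak", "he wears a package"): 0.2,
}

# Hypothetical toy language model: P(target). Values are invented.
LANGUAGE_MODEL = {
    "he wears a suit": 0.05,
    "he carries a package": 0.03,
    "he wears a package": 0.001,
}


def score(source, target):
    """Log of P(source | target) * P(target)."""
    return math.log(TRANSLATION_MODEL[(source, target)]) + math.log(LANGUAGE_MODEL[target])


def best_translation(source):
    """Return the candidate target sentence with the highest combined score."""
    candidates = [t for (s, t) in TRANSLATION_MODEL if s == source]
    return max(candidates, key=lambda target: score(source, target))


print(best_translation("hij draagt een pak"))
```

In this toy setting the translation model alone cannot decide between "wears a suit" and "carries a package"; the language model tips the balance, which is how world knowledge enters implicitly.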

MT: Can we make it possible? • Plus: • No or very limited grammar development • Includes language and world knowledge automatically (but implicitly) • Based on actually occurring data • Currently many experimental and commercial systems • Minus: • Requires large aligned parallel corpora • Unclear how much linguistics will be needed anyway • Probably restricted to very limited domains only

MT: Can we make it possible? • Google Translate (statistical MT) • Hij draagt een pak. → √ He wears a suit. • Hij draagt schoenen. → √ He wears shoes. • Hij draagt bruine schoenen en een pak. → √ He wears a suit and brown shoes. (!!) • Hij draagt het pakket → √ He carries the package • Hij heeft een pak aan. → * He has a suit. • Voert uw bedrijf sloten uit? → * Does your company locks out?

MT: Can we make it possible? • Euromatrix esp. “the Euromatrix” • Lists data and tools for European language pairs • Goals • Translation systems for all pairs of EU languages • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation • The provision of open source machine translation technology including research tools, software and data • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs • Efficient inclusion of linguistic knowledge into statistical machine translation • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches • Successor project EuromatrixPlus

MT: Can we make it possible? • META-NET 2010-2013 (EU-funding) • Building a community with shared vision and strategic research agenda • Building META-SHARE, an open resource exchange facility • Building bridges to neighbouring technology fields • Bringing more Semantics into Translation • Optimising the Division of Labour in Hybrid MT • Exploiting the Context for Translation • Empirical Base for Machine Translation

MT: Can we make it possible? • PACO-MT 2008-2011 • Investigates hybrid approach to MT • Rule-based and statistical • Uses existing parser for source language analysis • Uses statistical n-gram language models for generation • Uses statistical approach to transfer
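To illustrate how an n-gram language model can help in generation, the sketch below trains a bigram model on a tiny corpus and uses it to prefer a fluent word order. It is not PACO-MT's implementation: the function names, the two-sentence corpus and the choice of add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    return contexts, bigrams


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


corpus = ["he likes to swim", "he happened to fall"]  # invented toy corpus
contexts, bigrams = train_bigram_counts(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

candidates = ["he swim likes to", "he likes to swim"]
print(max(candidates, key=lambda c: log_prob(c, contexts, bigrams, vocab_size)))
```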

MT Evaluation • Evaluation depends on purpose of MT and how it is used • application, domain, controlled language • Many aspects can be evaluated • functionality, efficiency, usability, reliability, maintainability, portability • translation quality • embedding in work flow • post-editing options/tools

MT Evaluation • Focus here: • does the system yield good translations according to human judgement • in the context of developing a system • Again, many aspects: • fidelity (how close), correctness, adequacy, informativeness, intelligibility, fluency • and many ways to measure these aspects

MT Evaluation • Test suite • Reference = • list of (carefully selected) sentences • with their translations (ordered by score) • translations judged correct by human (usually developer) • upon every update of the system, the output of the new system is compared to the reference • if different: system has to be adapted, or reference has to be adapted • Advantages • focus on specific translation problems possible • excellent for regression testing • Manual judgement needed only once for each new output • other comparisons are automatic • Disadvantages • not really independent • particularly suited for pure rule-based systems • human judgement needed if output differs from reference
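The comparison step of such a test suite is easy to automate. The sketch below assumes the reference is a mapping from source sentence to approved translation and that the system is available as a function; both the name regression_check and the translate parameter are hypothetical.

```python
def regression_check(reference, translate):
    """Return the test items whose current output differs from the reference.

    `reference` maps each source sentence to its approved translation.
    `translate` is the system under test (any function from str to str).
    Each returned item still needs a human decision: fix the system, or
    accept the new output and update the reference.
    """
    differences = []
    for source, approved in reference.items():
        output = translate(source)
        if output != approved:
            differences.append((source, approved, output))
    return differences


reference = {"Hij draagt een pak.": "He wears a suit."}
# Stand-in for a real system, for illustration only.
print(regression_check(reference, lambda sentence: "He carries a suit."))
```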

MT Evaluation • Comparison against • translation corpus • independently created by human translators • possibly multiple equivalently correct translations of a sentence • Advantages • truly independent • also suited for data-driven systems • Disadvantage • requires human judgement (every time there is a system update) • high effort by highly skilled people, high costs, requires a lot of time • human judgement is not easy (unless there is a perfect match) • Useful • for a one-time evaluation of a stable system • not for evaluation during development

MT Evaluation • Edit-Distance (Word Accuracy) • metric to determine closeness of translations automatically • the least number of edit operations to turn the translated sentence into the reference sentence • Alshawi et al. 1998

MT Evaluation • WA = 1 - ((d + s + i) / max(r, c)) • d = number of deletions • s = number of substitutions • i = number of insertions • r = reference sentence length • c = candidate sentence length • easy to calculate using Levenshtein distance algorithm (dynamic programming) • various extensions have been proposed
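The Word Accuracy formula can be computed with a standard word-level Levenshtein distance. The sketch below assumes whitespace tokenization and exact, case-sensitive word matching; the function names are hypothetical, and the extensions mentioned on the slide are not covered.

```python
def edit_operations(candidate, reference):
    """Minimum number of word deletions, substitutions and insertions (d + s + i)
    needed to turn the candidate sentence into the reference sentence."""
    cand, ref = candidate.split(), reference.split()
    # prev[j]: distance between the candidate words seen so far (minus the
    # current one) and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, cand_word in enumerate(cand, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if cand_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete the candidate word
                            curr[j - 1] + 1,      # insert the reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def word_accuracy(candidate, reference):
    """WA = 1 - (d + s + i) / max(r, c), with lengths counted in words."""
    longest = max(len(reference.split()), len(candidate.split()))
    if longest == 0:
        return 1.0  # two empty sentences are identical
    return 1.0 - edit_operations(candidate, reference) / longest


print(word_accuracy("he carries a brown suit", "he wears a brown suit"))
print(word_accuracy("a suit he wears", "he wears a suit"))
```

The first pair differs by one substitution in five words, so WA is 0.8. The second pair contains exactly the same words in a different block order, yet needs four edit operations on four words and so gets WA = 0.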

MT Evaluation • Advantages • fully automatic given a reference set • Disadvantages • penalizes candidates if a synonym is used • penalizes swaps of words and blocks of words too much
