
Machine Translation Introduction

Jan Odijk, LOT Winterschool, Amsterdam, January 2011.



  1. Machine Translation Introduction • Jan Odijk • LOT Winterschool, Amsterdam, January 2011

  2. Overview • MT: What is it • MT: What is not possible (yet?) • MT: Why is it so difficult? • MT: Can we make it possible? • MT: Evaluation • MT: What is (perhaps) possible • Conclusions

  3. MT: What is it? • Input: text in source language • Output: text in target language that is a translation of the input text

  4. MT: What is it? [Diagram: translation pyramid. Top: Interlingua. Middle: Analyzed input → transfer → Analyzed output. Bottom: Input → direct translation → Output.]

  5. MT: System Types • Direct: • Earliest systems (1950s) • Direct word-to-word translation • Recent statistical MT systems • Transfer • Almost all research and commercial systems <= 1990 • Interlingual

  6. MT: System Types • Interlingual • A few research systems in the 1980s • Rosetta (Philips), based on Montague Grammar • Semantic derivation trees of attuned grammars • Distributed Translation (BSO) • (enriched) Esperanto • Sometimes logical representations • Hybrid Interlingual/Transfer • Transfer for lexicons; IL for rules

  7. Rule-Based Systems • Most systems • explicit source language grammar • parser yields analysis of source language input • transfer component turns it into target language structure • no explicit grammar of target language (except morphology)

  8. Rule-Based Systems • Some systems (Eurotra) • explicit source and target language grammar • sometimes reversible • parser yields analysis of source language input • transfer component turns it into target language structure • generation of translation by target language grammar

  9. Rule-Based Systems • Some systems (Rosetta, DLT) • explicit source and target language grammar • in some cases reversible • parser yields interlingual representation • generation of translation by target language grammar from interlingual representation

  10. MT: Is it difficult? • FAHQT: Fully Automatic High Quality Translation • Fully Automatic: no human intervention • High Quality: close or equal to human translation • Even acceptable quality is difficult to achieve

  11. MT: Why is it so difficult? • Ambiguity • Real • Temporary • Computational Complexity • Complexity of language • Divergences • Language Competence v. Language Use • Require large and rich lexicons

  12. MT: Why is it so difficult? • De jongen sloeg het meisje met de gitaar • Hij heeft boeken gelezen • Hij heeft uren gelezen • He has been reading books • *He has been reading for books • *He has been reading hours • He has been reading for hours

  13. MT: Why is it so difficult? • Not only uren, but also: • dagen, de hele dag, weken, … • (words expressing units of time) • And also: • de hele vergadering, meeting, bijeenkomst, les, … • (words expressing events)

  14. MT: Why is it so difficult? • Hij draagt een bruin pak • Dragen: wear or carry • Pak: suit or package • Hij draagt een bruin pak en zwarte schoenen • Hij draagt een bruin pak onder zijn arm

  15. MT: Why is it so difficult? • Voert uw bedrijf sloten uit? • Uitvoeren: execute or export? • Bedrijf: act or company? • Sloten: ditches or locks?

  16. MT: Why is it so difficult? • Temporary Ambiguity • Hij heeft boeken gelezen • Heeft: main or auxiliary verb? • Boeken: noun or verb • Voert uw bedrijf sloten uit? • Voert: form of voeren or of uitvoeren, • Bedrijf: noun or verb form? • Sloten uit: noun+particle or PP: out of ditches/locks

  17. Why is MT difficult? • Ambiguity of natural language Summary • requires modeling of knowledge of the world /situation • by rule systems, and/or • by statistics

  18. MT: Why is it so difficult? • Computational Complexity • High demands of processing capacity • High demands on memory • Complexity of language • Many different construction types • All interacting with each other

  19. Why is MT difficult? • Divergences between languages • require deep syntactic analysis • or very sophisticated statistical techniques

  20. Divergences: Category mismatches • Simple category mismatches • woonachtig (zijn) v. reside (Adj – Verb) • zich ergeren v. (be) annoyed (Verb-Adj) • verliefd v. in love (Adj- Prep+Noun) • kunnen v. (be) able • kunnen v. (be) possible • door- v. continue (to)

  21. Divergences: Category mismatches • More complex category mismatches • graag vs. like (Adv vs. Verb) • hij zwemt graag vs. he likes to swim • toevallig vs. happen • hij viel toevallig vs. he happened to fall

  22. Divergences: Category mismatches • Phrasal category mismatches • de zieke vrouw • the woman who is ill (* the ill woman) • I expect her to leave • ik verwacht dat zij vertrekt • She is likely to come • het is waarschijnlijk dat zij komt

  23. Conflational Divergences: • prepositional complements • houden van vs. love • existential er vs. Ø • er passeerde een auto vs. a car passed • verbal particles • blow (something) up vs. volar

  24. Conflational Divergences: • reflexive verbs • zich scheren vs. shave • composed vs. simple tense forms • he will do it vs. lo hará • split negatives vs. composed negatives • he does not see anyone vs. • hij ziet niemand

  25. Functional Divergences: • I like these apples • me gustan estas manzanas • se venden manzanas aqui • hier verkoopt men appels • er werd door de toeschouwers gejuicht • the spectators were cheering

  26. Divergences: MWEs • semi-fixed MWEs • nuclear power plant vs. kerncentrale • flexible idioms • de plaat poetsen vs. bolt • de pijp uit gaan v. to kick the bucket

  27. Divergences: MWEs • semi-idioms (collocations) • zware shag vs. strong tobacco • semi-idioms (support verbs) • aandacht besteden aan • pay attention to

  28. MT: Why is it so difficult? • Language Competence v. Language Use • Earlier systems implemented idealized reality • But not the really occurring language use • In some cases • focus on theoretically interesting difficult constructions • That do occur in reality • But other constructions are more important to deal with in practical systems

  29. MT: Why is it so difficult? • Large and rich lexicons • Existing human-oriented dictionaries are not suited as such • All information must be available in a formalized way • Much more information is needed than in a traditional dictionary

  30. MT: Why is it so difficult? • Multi-word Expressions (MWEs) • Are in current dictionaries only in a very informal way • No standards on how to represent them lexically • Many different types requiring different treatment in the grammar • Huge numbers!! • Domain and company-specific terminology are often MWEs

  31. MT: Can we make it possible? • Probably not, • but we can still improve significantly • Lexicons • Selection restrictions • Approximating analyses • Statistical MT

  32. MT: Can we make it possible? • Large and rich lexicons • widely accepted and used (de facto) standards • Methods and tools to quickly adapt to domain or company specific vocabulary • Better treatment of MWEs and standards for lexical representation of MWEs

  33. MT: Can we make it possible? • Selection restrictions with type system to approach modeling of world knowledge • Requires sophisticated syntactic analysis • Boek: info (legible) • Uur: time unit → duration • Vergadering: event → duration • Lezen: subject=human; object=info (legible) • Durational adjunct must be a duration phrase

  34. MT: Can we make it possible? • Selection restrictions • Pak (1) (suit): clothes • Pak (2) (package): entity • Dragen (1) (wear): subj=animate; object=clothes • Dragen (2) (carry): subj=animate; object=entity • Schoen: clothes • Entity > clothes • Identity preferred over subsumption • Homogeneous object preferred over heterogeneous one

  35. MT: Can we make it possible? • Selection restrictions • Hij draagt een bruin pak • He wears a brown suit (1: clothes=clothes) • He carries a brown package (1: entity=entity) • He carries a brown suit (2: entity > clothes) • *He wears a brown package (clothes ¬> entity) • Hij draagt een bruin pak en zwarte schoenen • He wears a brown suit and black shoes (1: homogeneous and clothes=clothes) • He carries a brown suit and black shoes (2: homogeneous but entity > clothes) • He carries a brown package and black shoes (2: heterogeneous but entity=entity) • *He wears a brown package and black shoes (clothes ¬> entity)
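
The ranking on the two slides above can be sketched in a few lines of Python. This is a toy illustration, not the mechanism of any actual system: the names `PARENT`, `NOUNS`, `VERBS`, `subsumes`, and `rank_readings` are hypothetical, the hierarchy has only the two types from the slides, and the homogeneity preference for coordinated objects is left out.

```python
# Hypothetical toy type hierarchy: each type maps to its parent type.
PARENT = {"clothes": "entity", "entity": None}

# Hypothetical toy lexicon: Dutch word -> list of (English reading, type).
NOUNS = {
    "pak": [("suit", "clothes"), ("package", "entity")],
    "schoen": [("shoe", "clothes")],
}
# For verbs, the type is the one required of the object.
VERBS = {"dragen": [("wear", "clothes"), ("carry", "entity")]}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT[specific]
    return False


def rank_readings(verb, noun):
    """Return admissible (rank, verb reading, noun reading) triples.

    Rank 1: object type is identical to the required type.
    Rank 2: object type is merely subsumed by the required type.
    Readings that violate the restriction are dropped.
    """
    readings = []
    for verb_en, required in VERBS[verb]:
        for noun_en, actual in NOUNS[noun]:
            if actual == required:
                rank = 1
            elif subsumes(required, actual):
                rank = 2
            else:
                continue  # e.g. *wear a package
            readings.append((rank, verb_en, noun_en))
    return sorted(readings)


for rank, verb_en, noun_en in rank_readings("dragen", "pak"):
    print(rank, verb_en, noun_en)
```

For "dragen" + "pak" this keeps wear/suit and carry/package as first choices, carry/suit as a second choice, and rejects wear/package, matching the judgements listed on the slide.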

  36. MT: Can we make it possible? • Approximating analyses • Ignore certain ambiguities to begin with • Use only limited amount of relevant information • Cut off analysis when there are too many alternatives • This is currently actually done in all practical systems • Need new ways of doing this without affecting quality too seriously

  37. MT: Can we make it possible? • Statistical MT • Derives MT system automatically • From statistics taken from • Aligned parallel corpora (→ translation model) • Monolingual target language corpora (→ language model) • Being worked on since the early 1990s
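
The slides do not say how the two models are combined. A common formulation in statistical MT of that period is the noisy-channel model, given here as an assumption about what is meant; f is the source sentence, e a candidate target sentence, P(f | e) the translation model, and P(e) the language model:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The second equality follows from Bayes' rule, since P(f) is constant for a given input sentence.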

  38. MT: Can we make it possible? • Plus: • No or very limited grammar development • Includes language and world knowledge automatically (but implicitly) • Based on actually occurring data • Currently many experimental and commercial systems • Minus: • Requires large aligned parallel corpora • Unclear how much linguistics will be needed anyway • Probably restricted to very limited domains only

  39. MT: Can we make it possible? • Google Translate (statistical MT) • Hij draagt een pak. → √ He wears a suit. • Hij draagt schoenen. → √ He wears shoes. • Hij draagt bruine schoenen en een pak. → √ He wears a suit and brown shoes. (!!) • Hij draagt het pakket → √ He carries the package • Hij heeft een pak aan. → *He has a suit. • Voert uw bedrijf sloten uit? → *Does your company locks out?


  41. MT: Can we make it possible? • Euromatrix esp. “the Euromatrix” • Lists data and tools for European language pairs • Goals • Translation systems for all pairs of EU languages • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation • The provision of open source machine translation technology including research tools, software and data • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs • Efficient inclusion of linguistic knowledge into statistical machine translation • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches • Successor project EuromatrixPlus

  42. MT: Can we make it possible? • META-NET 2010-2013 (EU-funding) • Building a community with shared vision and strategic research agenda • Building META-SHARE, an open resource exchange facility • Building bridges to neighbouring technology fields • Bringing more Semantics into Translation • Optimising the Division of Labour in Hybrid MT • Exploiting the Context for Translation • Empirical Base for Machine Translation

  43. MT: Can we make it possible? • PACO-MT 2008-2011 • Investigates hybrid approach to MT • Rule-based and statistical • Uses existing parser for source language analysis • Uses statistical n-gram language models for generation • Uses statistical approach to transfer

  44. MT Evaluation • Evaluation depends on purpose of MT and how it is used • application, domain, controlled language • Many aspects can be evaluated • functionality, efficiency, usability, reliability, maintainability, portability • translation quality • embedding in work flow • post-editing options/tools

  45. MT Evaluation • Focus here: • does the system yield good translations according to human judgement • in the context of developing a system • Again, many aspects: • fidelity (how close), correctness, adequacy, informativeness, intelligibility, fluency • and many ways to measure these aspects

  46. MT Evaluation • Test suite • Reference = • list of (carefully selected) sentences • with their translations (ordered by score) • translations judged correct by human (usually developer) • upon every update of the system, output of the new system is compared to the reference • if different: system has to be adapted, or reference has to be adapted • Advantages • focus on specific translation problems possible • excellent for regression testing • manual judgement needed only once for each new output • other comparisons are automatic • Disadvantages • not really independent • particularly suited for pure rule-based systems • human judgement needed if output differs from reference
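
The comparison step of such a test suite can be sketched as follows. This is a minimal illustration under stated assumptions: `regression_check`, `REFERENCE`, and `toy_system` are hypothetical names, the reference holds one approved translation per sentence, and the stand-in system is a lookup table rather than a real MT system.

```python
def regression_check(translate, reference):
    """Return (source, approved, output) for every sentence whose
    current output differs from the approved reference translation."""
    differences = []
    for source, approved in reference.items():
        output = translate(source)
        if output != approved:
            differences.append((source, approved, output))
    return differences


# Hypothetical reference: translations previously judged correct by a human.
REFERENCE = {
    "Hij draagt een pak.": "He wears a suit.",
    "Hij draagt het pakket.": "He carries the package.",
}


def toy_system(sentence):
    """Stand-in for a system update that broke one translation."""
    outputs = {
        "Hij draagt een pak.": "He wears a suit.",
        "Hij draagt het pakket.": "He wears the package.",
    }
    return outputs.get(sentence, "")


for source, approved, output in regression_check(toy_system, REFERENCE):
    print(f"DIFF {source!r}: expected {approved!r}, got {output!r}")
```

Only the sentences reported as differences need a human decision: either the system is adapted or the new output is accepted into the reference.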

  47. MT Evaluation • Comparison against • translation corpus • independently created by human translators • possibly multiple equivalently correct translations of a sentence • Advantages • truly independent • also suited for data-driven systems • Disadvantages • requires human judgement (every time there is a system update) • high effort by highly skilled people, high costs, requires a lot of time • human judgement is not easy (unless there is a perfect match) • Useful • for a one-time evaluation of a stable system • not for evaluation during development

  48. MT Evaluation • Edit-Distance (Word Accuracy) • metric to determine closeness of translations automatically • the least number of edit operations to turn the translated sentence into the reference sentence • Alshawi et al. 1998

  49. MT Evaluation • WA = 1 - (d + s + i) / max(r, c) • d = number of deletions • s = number of substitutions • i = number of insertions • r = reference sentence length • c = candidate sentence length • easy to calculate using Levenshtein distance algorithm (dynamic programming) • various extensions have been proposed
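
The formula above can be sketched in Python. Assumptions not taken from the slides: sentences are tokenized by whitespace, every edit operation costs 1, and two empty sentences score 1.0; the function names `edit_operations` and `word_accuracy` are illustrative.

```python
def edit_operations(candidate_words, reference_words):
    """Minimal number of deletions, substitutions and insertions
    (Levenshtein distance over words, computed row by row)."""
    previous = list(range(len(reference_words) + 1))
    for i, cand in enumerate(candidate_words, start=1):
        current = [i]
        for j, ref in enumerate(reference_words, start=1):
            substitution = previous[j - 1] + (0 if cand == ref else 1)
            deletion = previous[j] + 1      # drop a candidate word
            insertion = current[j - 1] + 1  # add a reference word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(candidate, reference):
    """WA = 1 - (d + s + i) / max(r, c), with whitespace tokenization."""
    cand_words = candidate.split()
    ref_words = reference.split()
    longest = max(len(ref_words), len(cand_words))
    if longest == 0:
        return 1.0  # assumption: two empty sentences match perfectly
    return 1 - edit_operations(cand_words, ref_words) / longest


print(word_accuracy("he has a suit", "he wears a suit"))  # one substitution in four words: 0.75
```

Because only word-level insertions, deletions and substitutions are counted, a candidate that reorders a correct phrase is scored as several separate errors.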

  50. MT Evaluation • Advantages • fully automatic given a reference set • Disadvantages • penalizes candidates if a synonym is used • penalizes swaps of words and blocks of words too much
