Chapter 21: Machine Translation. Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran. What is MT?. Machine Translation (MT) means translation using computers. Machine-aided human translation (MAHT) Human-aided machine translation (HAMT) Fully automated machine translation (FAMT)
Chapter 21: Machine Translation Heshaam Faili hfaili@ece.ut.ac.ir University of Tehran
What is MT? • Machine Translation (MT) means translation using computers. • Machine-aided human translation (MAHT) • Human-aided machine translation (HAMT) • Fully automated machine translation (FAMT) • Fully human translation
Some definitions • “Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another.” EAMT • “…Machine Translation (MT) as it is generally known --- the attempt to automate all, or part of the process of translating from one human language to another.” Arnold D J. MACHINE TRANSLATION: An Introductory Guide • “…presumably means going by algorithm from machine-readable source text to useful target text, without recourse to human translation or editing." ALPAC report, 1966
Different tasks with MT • Tasks for which a rough translation is adequate • Tasks where a human post-editor is used • Tasks limited to small sublanguage domains in which fully automatic high quality translation (FAHQT) is still achievable • Software localization tasks …
Machine Translation History • 1946-1954: Optimistic attitude towards the new technologies in MT • 1949: Informal Memorandum • Word-for-word translation, especially Russian-English • 1954: The Georgetown University demonstration • Vocabulary: 250 words, Grammar: 6 rules, Corpus: a few simple Russian sentences
Machine Translation History • 1954-1966: Criticism on the subject of MT • 1966 ALPAC-Report (Automatic Language Processing Advisory Committee) • MT is slower, not very reliable and twice as expensive as human translation
Machine Translation History • 1966-1975: Revision of the aims and goals of MT • Definition of more realistic goals • Limitation of the research to technical languages • Syntactical analysis of the source text • Development of different translation strategies
Machine Translation History • 1975-1989 ±: Increasing interest in and promotion of MT • Rapid increase in the demand for translations • Improvements in hardware and software • The use of artificial intelligence methods is now possible
Machine Translation History • 1990-2000 • Development of commercial products based on personal computers • Specialized supplementary information (medicine, law, economics...) • Translation of spoken language (VERBMOBIL)
Machine Translation History • 2000-Now • Statistical Approaches and Hybrid Models • Google Translation Engine ( http://translate.google.com ) • Yearly official MT evaluation campaign ( http://www.nist.gov ) • Automated MT Evaluation (NIST, BLEU)
What happened between ALPAC and Now? • Need for MT and other NLP applications confirmed • Change in expectations • Computers have become faster, more powerful • WWW • Political state of the world • Maturation of Linguistics • Development of hybrid statistical/symbolic approaches
Language Similarities or Differences • Universals: aspects that are true of every language • Every language has words referring to people, and every language has nouns and verbs • Typology: the study of systematic cross-linguistic similarities and differences • Morphological Aspects: • Isolating vs. polysynthetic • Agglutinative vs. fusional • Syntactic Aspects: • SVO, SOV or VSO • Syntactic-Morphological Aspects: • Head-marking vs. dependent-marking • Specific differences: date formats and standards, verb tense differences • Lexical Differences: different scenes
Lexical Differences • English: leg, foot, paw • French: étape, patte, jambe, pied
Different Machine Translation Systems • Rule-based • Statistical Approaches • Hybrid Systems (using a statistical approach in a rule-based architecture or … )
Machine Translation Architectures • Direct architecture • Direct architecture was used for most MT systems of the first generation • there are no intermediate stages in the process of translation
Machine Translation Architectures • Characteristics of direct MT systems: • no complex linguistic theories or parsing strategies • make use of syntactic, semantic and lexical similarities between the source and target languages • based on a single language pair • direct MT systems are "robust": they even translate sentences with incomplete information • dictionaries are the most important components of direct MT systems
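The dictionary-driven character of a direct system can be sketched in a few lines. This is only a toy illustration, not any historical system: the dictionary entries and the function name `direct_translate` are assumptions made up for the example.

```python
# Toy bilingual dictionary (hypothetical entries, Spanish -> English).
DICTIONARY = {"la": "the", "bruja": "witch", "verde": "green"}


def direct_translate(sentence, dictionary=DICTIONARY):
    """Translate word for word; unknown words are passed through unchanged."""
    words = sentence.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)


print(direct_translate("la bruja verde"))
```

The sketch never fails on unknown words, which mirrors the robustness noted above, but it also keeps the source word order, which is the kind of error that the later architectures try to address.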
Machine Translation Architectures • Transfer architecture • It consists of three separate stages: • analysis • transfer (syntactic or lexical) • synthesis/generation
Transfer Example: English -> Spanish • Mary did not slap the green witch
Persian Example • I ate the apple: من سیب را خوردم • VP -> V NP becomes VP -> NP RA V • I asked the man: من از مرد پرسیدم • VP -> V NP becomes VP -> AZ NP V
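The three transfer stages can be sketched on the noun phrase from the example sentence. This is a minimal sketch under stated assumptions: the tag table, the lexicon, and the single adjective-noun reordering rule are hypothetical, and gender agreement is ignored.

```python
# Hypothetical part-of-speech tags and English -> Spanish lexicon.
TAGS = {"the": "Det", "green": "Adj", "witch": "N"}
LEXICON = {"the": "la", "green": "verde", "witch": "bruja"}


def analyze(sentence):
    """Analysis: attach a part-of-speech tag to every source word."""
    return [(word, TAGS.get(word, "UNK")) for word in sentence.lower().split()]


def transfer(tagged):
    """Syntactic transfer: rewrite 'Adj N' as 'N Adj'."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "Adj" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def generate(tagged):
    """Generation with lexical transfer: emit target-language words."""
    return " ".join(LEXICON.get(word, word) for word, _ in tagged)


def translate(sentence):
    return generate(transfer(analyze(sentence)))


print(translate("the green witch"))
```

Keeping `analyze` and `generate` separate from `transfer` is what would allow them to be reused for another language pair.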
Machine Translation Architectures • Characteristics of transfer MT systems: • consist of complete linguistic conceptions, not only single grammatical or syntactic rules • the analysis and generation components can be used again for further language pairs, if the components are exactly separated • the dictionaries of the transfer MT systems are also separated
Machine Translation Architectures • Interlingua architecture • The interlingua system consists of two stages: • The source text is analysed into an interlingual representation from which the text of the target language will be directly generated • Semantic Analyzer
Machine Translation Architectures • Interlingua architecture: • Advantage: • The interlingua representation can be used for any other language • Disadvantage: • It is difficult to create language-independent representations
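The advantage can be made concrete by counting components for n languages. The sketch below assumes every ordered language pair must be covered, one analyzer and one generator per language, and one transfer module per ordered pair; the function name `component_counts` is made up for the example.

```python
def component_counts(n_languages):
    """Count modules needed to translate between all ordered language pairs."""
    n = n_languages
    analyzers_and_generators = 2 * n      # one analyzer + one generator per language
    transfer_modules = n * (n - 1)        # one per ordered (source, target) pair
    return {
        "transfer": analyzers_and_generators + transfer_modules,
        "interlingua": analyzers_and_generators,
    }


for n in (2, 5, 10):
    print(n, component_counts(n))
```

Under these assumptions the interlingua design grows linearly with the number of languages, while the transfer design grows quadratically.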
Statistical Approach • 3 stages: • Language model P(E) • Translation model P(F|E) • Decoder
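The three components combine as a search for the target sentence E that maximizes P(E) · P(F|E) for a given source sentence F. The sketch below uses invented probability tables and a hypothetical `decode` function that simply enumerates a fixed candidate list; a real decoder searches a much larger space.

```python
import math

# Hypothetical language model P(E): fluent English gets more probability.
LANGUAGE_MODEL = {
    "the green witch": 0.05,
    "the witch green": 0.001,
    "green the witch": 0.0005,
}

# Hypothetical translation model P(F|E) for one Spanish source sentence.
TRANSLATION_MODEL = {
    ("la bruja verde", "the green witch"): 0.2,
    ("la bruja verde", "the witch green"): 0.3,
    ("la bruja verde", "green the witch"): 0.2,
}


def decode(source, candidates):
    """Return the candidate E maximizing log P(E) + log P(F|E)."""
    def score(candidate):
        return (math.log(LANGUAGE_MODEL[candidate])
                + math.log(TRANSLATION_MODEL[(source, candidate)]))
    return max(candidates, key=score)


print(decode("la bruja verde", list(LANGUAGE_MODEL)))
```

Here the translation model slightly prefers the word-for-word order, and the language model overrides it in favour of the fluent ordering.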
SYSTRAN • Developed in the late 1950s by Peter Toma • Initial system for Russian-English translations • Later adapted for US Air Force and NASA • Adaptation for other languages • Important because it had a big influence on many Japanese MT systems
SYSTRAN • Rule-based System • Using finite state grammar (ATN) • Using a large knowledge-base • Working on 23 languages, especially EU languages • Customers: AltaVista, Lycos, AOL, Compuserve, Terra, Google, Apple, ...
AppTek TranSphere ® • Rule-based System • Using LFG (Lexical Functional Grammar) • Analyzes the semantic, morphological and syntactic structures in English and produces their equivalents in the target language • Uses a general-purpose lexicon in addition to special domain micro-dictionaries • Translates English to Arabic, Korean, Chinese, Turkish, Persian/Dari, and Pashto to English • Bidirectional translation for French, German, Italian, Portuguese, Russian, Spanish, Ukrainian, Hebrew and Dutch
MÉTÉO • Development of an English-French translation system by the TAUM Group to cope with the bilingual policy of the Canadian government • 1975 Contract to develop a system to translate public weather forecasts • 1984 Development of Météo 2 • This program proved to be more reliable, faster and more cost-effective • 1989 Development of a French-English version
Sakhr Enterprise Machine Translation • Using transfer Architecture • analysis on all linguistic levels: morphological, lexical, syntactic and semantic • Arabic - English
CiyaTran MT • English - Arabic-scripts languages : Arabic-Persian-Pashto • Analyzing the semantic, morphological and syntactical structure of input text • Utilizing Fuzzy Logic and Statistical Analysis • Using a general-purpose lexicon, as well as 85 domain-specific databases with over 3,000,000 words and phrases
ARIANE (GETA) • 1960-1970: Development of CETA System for three language pairs • Change of the name to ARIANE (GETA) as the system was changed into a ‘Transfer’ system
EUROTRA • Developed for the translation requirements within the European Community • A system designed to replace the Systran system because of its limitations • 3 phases in the development of the program • One of the biggest MT projects in terms of expenditure, organizations and people involved
Google Translation • Launched in 2004 • Beta version for English-Arabic and English-Chinese • Fully Statistical • Commercial use: no technical documentation found • In 2005, became the best translator for these two language pairs: http://www.nist.gov
Shiraz Project • This project involved the creation of an extensible research prototype of a Persian to English machine translation system • Persian to English • Transfer Based Translation • Syntactic Analysis • Unification Based context free grammar • Stopped …
Moses statistical MT • Open source, written in C++ • Allows you to automatically train translation models for any language pair • All you need is a collection of translated texts (parallel corpus) • Beam-search • Phrase-based
PSMT (Prolog Statistical Machine Translation) • Used Prolog to Translate simple structures • 3 sections: • Language Model Learner • Dictionary Learner • Search Program
Phramer Statistical Machine Translation • Phrase-based • Open source, written in Java • Using a Bayesian model
EGYPT • Statistical MT • French-English • Academic • Some workshops related to EGYPT established
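MT Challenges: Ambiguity • Syntactic Ambiguity: I saw the man with the telescope • Parse 1 (PP attached to the verb phrase): [S [NP I] [VP [VP [V saw] [NP the man]] [PP with the telescope]]] • Parse 2 (PP attached to the noun phrase): [S [NP I] [VP [V saw] [NP [NP the man] [PP with the telescope]]]]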
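The attachment ambiguity can be checked mechanically by counting parses with a CKY-style chart. The sketch below is a toy: the grammar rules and lexicon are assumptions chosen to cover just this sentence, and `count_parses` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP modifies the verb phrase
    ("NP", "NP", "PP"),   # PP modifies the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
LEXICON = {
    "i": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "hill": ["N"], "telescope": ["N"],
    "with": ["P"], "on": ["P"],
}


def count_parses(sentence):
    """Count distinct parse trees rooted in S, CKY style."""
    words = sentence.lower().split()
    n = len(words)
    chart = defaultdict(int)  # (start, end, symbol) -> number of trees
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[(i, i + 1, tag)] += 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, "S")]


print(count_parses("I saw the man with the telescope"))
```

With this grammar the sentence receives two parses, one per attachment site, and each additional prepositional phrase multiplies the number of readings a translation system has to choose between.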
MT Challenges: Ambiguity • Syntactic Ambiguity: I saw the man on the hill with the telescope • Lexical Ambiguity: book (E) • Semantic Ambiguity • Homography: ball (E) = pelota, baile (S) • Polysemy: kill (E) = matar, acabar (S) • Semantic granularity: esperar (S) = wait, expect, hope (E); be (E) = ser, estar (S); fish (E) = pez, pescado (S)
How do we evaluate MT? • Human-based Metrics • Semantic Invariance • Pragmatic Invariance • Lexical Invariance • Structural Invariance • Spatial Invariance • Fluency • Accuracy • “Do you get it?” • Automatic Metrics: BLEU
BiLingual Evaluation Understudy (BLEU; Papineni et al., 2001) • Automatic technique, but it requires pre-existing human (reference) translations • Produce a corpus of high-quality human translations • Judge “closeness” numerically (in the spirit of word-error rate in speech recognition) • Compare n-gram matches between the candidate translation and one or more reference translations
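The n-gram comparison can be sketched as clipped ("modified") n-gram precision combined with a brevity penalty. This is a simplified sentence-level sketch with uniform weights and no smoothing, not a reference implementation; the function names are assumptions for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram is credited at most as often
    as it occurs in the single reference where it is most frequent."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Geometric mean of modified precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    c = len(candidate)
    # Reference length closest to the candidate length (shorter wins ties).
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))
print(bleu("the cat is on the mat".split(), refs))
```

Clipping is what stops a degenerate candidate that repeats a common word from scoring well: "the" appears seven times in the first candidate but is credited only twice.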