
Machine Translation

Machine Translation. Dai Xinyu 2006-10-27. Outline. Introduction Architecture of MT Rule-Based MT vs. Data-Driven MT Evaluation of MT Development of MT MT problems in general Some Thinking about MT from recognition.


Presentation Transcript


  1. Machine Translation Dai Xinyu 2006-10-27

  2. Outline • Introduction • Architecture of MT • Rule-Based MT vs. Data-Driven MT • Evaluation of MT • Development of MT • MT problems in general • Some Thinking about MT from recognition

  3. Introduction • "I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need do is strip off the code in order to retrieve the information contained in the text" • machine translation - the use of computers to translate from one language to another • The classic acid test for natural language processing. • Requires capabilities in both interpretation and generation. • About $10 billion spent annually on human translation. • http://www.google.com/language_tools?hl=en

  4. Introduction - MT past and present • mid-1950's - 1965: • Great expectations • The dark ages for MT: • Academic research projects • 1980's - 1990's: • Successful specialized applications • 1990's: • Human-machine cooperative translation • 1990's - now: • Statistics-based MT • Hybrid-strategy MT • Future prospects: • ???

  5. Interest in MT • Commercial interest: • U.S. has invested in MT for intelligence purposes • MT is popular on the web—it is the most used of Google’s special features • EU spends more than $1 billion on translation costs each year. • (Semi-)automated translation could lead to huge savings

  6. Interest in MT • Academic interest: • One of the most challenging problems in NLP research • Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling,… • Being able to establish links between two languages allows for transferring resources from one language to another

  7. Areas Related to MT • Linguistics • Computer Science • AI • Compilers • Formal Semantics • … • Mathematics • Probability • Statistics • … • Informatics • Recognition

  8. Architecture of MT -- (Levels of Transfer)

  9. Rule-Based MT vs. Data-Driven MT • Rule-Based MT • Data-Driven MT • Example-Based MT • Statistics-Based MT

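  10. Rule-Based MT (diagram labels) • Linguistics, semantics, cognitive science, artificial intelligence • Writing rules • Rules • Natural-language input • Translation system • Translation result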

  11. Rule-Based MT

  12. Hmm, every time he sees “banco”, he either types “bank” or “bench” … but if he sees “banco de…”, he always types “bank”, never “bench”… Man, this is so boring. Translated documents

  13. Example-Based MT • origins: Nagao (1981) • first motivation: collocations, bilingual differences in syntactic structures • basic idea: • human translators search for analogies (similar phrases) in previous translations • MT should seek matching fragments in a bilingual database and extract their translations • aim to have less complex dictionaries, grammars, and procedures • improved generation (using actual examples of target-language (TL) sentences)
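The matching step described above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical example base and uses word-level similarity from Python's `difflib` as the matching criterion; the slides do not say which similarity measure EBMT systems use, so this is only one possible choice.

```python
from difflib import SequenceMatcher

# Hypothetical toy example base of (source, target) pairs; not from the slides.
EXAMPLES = [
    ("good morning", "bonjour"),
    ("good evening", "bonsoir"),
    ("thank you very much", "merci beaucoup"),
]

def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio between two phrases, in [0, 1]."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def translate_by_example(source: str, examples=EXAMPLES):
    """Return (matched source, stored translation, score) for the closest example."""
    matched, translation = max(examples, key=lambda pair: similarity(source, pair[0]))
    return matched, translation, similarity(source, matched)

if __name__ == "__main__":
    print(translate_by_example("good morning everyone"))
```

A real system would also have to recombine several matched fragments and adapt the parts of the input that no stored example covers; the sketch only shows the retrieval step.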

  14. EBMT still going • Bilingual corpus collection • Storage • Searching and matching • …

  15. Statistical MT Basics • Based on the assumption that translations exhibit statistical regularities • origins: Warren Weaver (1949) • Shannon’s information theory • core process is the probabilistic ‘translation model’, taking source-language (SL) words or phrases as input and producing target-language (TL) words or phrases as output • succeeding stage involves a probabilistic ‘language model’, which synthesizes TL words into ‘meaningful’ TL sentences

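  16. Statistical MT (diagram labels) • Statistical learning • Model building • Natural-language input • Probabilistic model • Learning system • Prediction • Natural-language input • Prediction system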

  17. Statistical MT schema

  18. Statistical MT processes • Bilingual corpora: original and translation • little or no linguistic ‘knowledge’; based on word co-occurrences in SL and TL texts (of a corpus), relative positions of words within sentences, and sentence lengths • Alignment: sentences aligned statistically (according to sentence length and position) • Decoding: compute the probability that a TL string is the translation of an SL string (‘translation model’), based on: • frequency of co-occurrence in aligned texts of the corpus • position of SL words in the SL string • Adjustment: compute the probability that a TL string is a valid TL sentence (based on a ‘language model’ of allowable bigrams and trigrams) • search for the TL string that maximizes these probabilities: $\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)$, where f is the SL string and e a candidate TL string (the denominator P(f) from Bayes' rule does not depend on e and drops out)
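The argmax above can be made concrete with a small sketch. The probability tables are invented for illustration (they are not estimated from any corpus), and the "search" is a plain loop over a given candidate list, which is an assumption; real decoders search a far larger hypothesis space.

```python
import math

# Hypothetical toy probabilities for the ambiguous Spanish word "banco".
TRANSLATION_MODEL = {          # P(f | e), invented values
    ("banco", "bank"): 0.6,
    ("banco", "bench"): 0.7,
}
LANGUAGE_MODEL = {             # P(e), invented values
    "bank": 0.02,
    "bench": 0.005,
}

def noisy_channel_score(f: str, e: str) -> float:
    """log P(f | e) + log P(e); log space avoids underflow on long strings."""
    return math.log(TRANSLATION_MODEL[(f, e)]) + math.log(LANGUAGE_MODEL[e])

def decode(f: str, candidates):
    """Return the candidate e that maximizes P(f | e) * P(e)."""
    return max(candidates, key=lambda e: noisy_channel_score(f, e))

if __name__ == "__main__":
    print(decode("banco", ["bank", "bench"]))
```

With these made-up numbers the translation model alone slightly prefers "bench", but the language model prior tips the product toward "bank", which is the point of combining the two factors.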

  19. Language Modeling • Determines the probability of some English sequence $e_1 \ldots e_l$ of length l • P(e) is normally approximated as: $P(e_1 \ldots e_l) \approx \prod_{i=1}^{l} P(e_i \mid e_{i-m}, \ldots, e_{i-1})$ where m is the size of the context, i.e. the number of previous words that are considered: m=1 gives a bi-gram language model, m=2 a tri-gram language model
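As a rough illustration of the m=1 (bi-gram) case, the sketch below estimates the conditional probabilities by relative frequency from a tiny, made-up corpus. The sentence-boundary markers `<s>` and `</s>` and the absence of smoothing are simplifying assumptions, not something the slides specify.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams, padding each sentence with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def sentence_probability(sentence, contexts, bigrams):
    """Product of P(word | previous word), estimated by relative frequency."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:                    # unseen context: no estimate
            return 0.0
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

if __name__ == "__main__":
    corpus = ["the bank is open", "the bench is wet", "the bank is closed"]
    contexts, bigrams = train_bigram_lm(corpus)
    print(sentence_probability("the bank is open", contexts, bigrams))
```

Any sentence containing an unseen bigram gets probability zero here, which is why practical language models add smoothing or back-off.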

  20. Translation Modeling • Determines the probability that the foreign word f is a translation of the English word e • How to compute P(f | e) from a parallel corpus? • Statistical approaches rely on the co-occurrence of e and f in the parallel data: If e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another
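One way to turn the co-occurrence idea into numbers is the expectation-maximization procedure of IBM Model 1 (described in Brown et al., 1993, listed under Further Reading). The sketch below is a simplified version under stated assumptions: no NULL word, a fixed number of iterations, and a three-sentence toy corpus that is not from the slides.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t[(f, e)] = P(f | e) by EM.

    `pairs` is a list of (foreign sentence, English sentence) strings.
    """
    pairs = [(f.split(), e.split()) for f, e in pairs]
    f_vocab = {w for f_sent, _ in pairs for w in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))        # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                      # expected count of (f, e) links
        total = defaultdict(float)                      # expected count of e
        for f_sent, e_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm             # E-step: soft alignment weight
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():                 # M-step: renormalize per e
            t[(f, e)] = c / total[e]
    return t

if __name__ == "__main__":
    toy = [("das haus", "the house"), ("das buch", "the book"), ("ein buch", "a book")]
    t = train_ibm_model1(toy)
    for pair in [("haus", "house"), ("das", "house"), ("buch", "book")]:
        print(pair, round(t[pair], 3))
```

Although "das" and "haus" both co-occur with "house" exactly once, "das" is better explained by "the" elsewhere in the corpus, so the iterations shift probability mass toward the pairing of "haus" with "house".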

  21. SMT issues • ignores previous MT research (new start, new ‘paradigm’) • basically a ‘direct’ approach: • replaces each SL word by the most probable TL word, • reorders TL words • decoding is effectively a kind of ‘back translation’ • originally wholly word-based (IBM ‘Candide’ 1988); now predominantly phrase-based (i.e. alignment of word groups); some research on syntax-based models • mathematically simple, but needs a huge amount of training data (large databases) • problems for SMT: • translation is not just selecting the most frequent ‘equivalent’ (wider context matters) • no quality control of corpora • lack of monolingual data for some languages • insufficient bilingual data (Internet as resource) • lack of structural information about language • merit of SMT: evaluation as an integral part of system development

  22. Rule-Based MT & SMT • SMT is a black box: no way of finding out how it works in particular cases, or why it succeeds sometimes and not others • RBMT: rules and procedures can be examined • RBMT and SMT are apparent polar opposites, but ‘rules’ are gradually being incorporated into SMT models • first, morphology (even in versions of the first IBM model) • then, ‘phrases’ (with some similarity to linguistic phrases) • now also, syntactic parsing

  23. Rule-Based MT & SMT • Comparison from the following perspectives: • Theoretical background • Knowledge representation • Knowledge discovery • Robustness • Extensibility • Development cycle

  24. Evaluation of MT • Manual: • Accuracy / fluency / completeness • 信 达 雅 (faithfulness, expressiveness, elegance) • Automatic evaluation: • BLEU: percentage of word sequences (n-grams) in the MT output that also occur in reference texts • NIST
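The BLEU idea of counting n-grams shared with reference texts can be sketched as follows. This is a simplified, sentence-level version written for illustration: it uses clipped n-gram precision for n = 1..4, a brevity penalty, and no smoothing, whereas the metric is normally computed over a whole test corpus.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of the n-gram precisions times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = [modified_precision(cand, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                                   # no smoothing in this sketch
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

if __name__ == "__main__":
    print(bleu("the cat is on the mat", ["the cat is on the mat"]))
    print(modified_precision("the the the the the the the".split(),
                             ["the cat is on the mat".split()], 1))
```

Clipping is what stops a degenerate output such as "the the the ..." from scoring well: each n-gram is credited at most as often as it appears in a single reference.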

  25. Development of MT - MT System

  26. Development of MT - Research (figure: MT approaches plotted on two axes) • Knowledge Acquisition Strategy axis, from all manual to fully automated: hand-built by experts; hand-built by non-experts; learn from annotated data; learn from un-annotated data • Knowledge Representation Strategy axis, from shallow/simple to deep/complex: word-based only; phrase tables; syntactic constituent structure; semantic analysis; interlingua • Approaches placed in the figure: original statistical MT; electronic dictionaries; example-based MT; original direct approach; typical transfer system; classic interlingual system • Region marked "New Research Goes Here!"

  27. MT problems in general • Characteristics of language • Ambiguous • Dynamic • Flexible • Knowledge • How to express it • How to discover it • How to use it

  28. Some Thinking about MT from recognition • Human brain • Memory • Progress - Learning • Model • Pattern • Translation by human… • Translation by machine…

  29. Further Reading • Arturo Trujillo, Translation Engines: Techniques for Machine Translation, Springer-Verlag London Limited, 1999 • P.F. Brown, et al., A Statistical Approach to Machine Translation, Computational Linguistics, 1990, 16(2) • P.F. Brown, et al., The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993, 19(2) • Bonnie J. Dorr, et al., Survey of Current Paradigms in Machine Translation • Makoto Nagao, A Framework of a Mechanical Translation between Japanese and English by Analogy Principle, in A. Elithorn and R. Banerji (Eds.), Artificial and Human Intelligence, NATO Publications, 1984 • W.J. Hutchins, Machine Translation: Past, Present, Future, Chichester: Ellis Horwood, 1986 • Daniel Jurafsky & James H. Martin, Speech and Language Processing, Prentice-Hall, 2000 • Christopher D. Manning & Hinrich Schütze, Foundations of Statistical Natural Language Processing, Massachusetts Institute of Technology, 1999 • James Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company, Inc., 1987
