Machine Translation Dai Xinyu 2006-10-27
Outline • Introduction • Architecture of MT • Rule-Based MT vs. Data-Driven MT • Evaluation of MT • Development of MT • MT problems in general • Some thoughts on MT from a cognitive perspective
"I have a text in front of me which is written in Russian but I am going to pretend that it is really written in English and that it has been coded in some strange symbols. All I need do is strip off the code in order to retrieve the information contained in the text" Introduction • machine translation - the use of computers to translate from one language to another • The classic acid test for natural language processing. • Requires capabilities in both interpretation and generation. • About $10 billion spent annually on human translation. • http://www.google.com/language_tools?hl=en
Introduction - MT past and present • mid-1950s - 1965: great expectations • The dark ages for MT: academic research projects • 1980s - 1990s: successful specialized applications • 1990s: human-machine cooperative translation • 1990s - now: statistics-based MT, hybrid-strategy MT • Future prospects: ???
Interest in MT • Commercial interest: • U.S. has invested in MT for intelligence purposes • MT is popular on the web—it is the most used of Google’s special features • EU spends more than $1 billion on translation costs each year. • (Semi-)automated translation could lead to huge savings
Interest in MT • Academic interest: • One of the most challenging problems in NLP research • Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling,… • Being able to establish links between two languages allows for transferring resources from one language to another
Related Areas of MT • Linguistics • Computer Science • AI • Compilation • Formal semantics • … • Mathematics • Probability • Statistics • … • Informatics • Cognitive science
Rule-Based MT vs. Data-Driven MT • Rule-Based MT • Data-Driven MT • Example-Based MT • Statistics-Based MT
Rule-Based MT • Diagram: experts in linguistics, semantics, cognitive science, and artificial intelligence write rules; natural language input is fed to a translation system driven by those rules, which produces the translation output
Data-Driven MT • A system "learns" by observing translated documents: "Hmm, every time he sees 'banco', he either types 'bank' or 'bench' … but if he sees 'banco de…', he always types 'bank', never 'bench'… Man, this is so boring."
Example-Based MT • Origins: Nagao (1981) • First motivation: collocations and bilingual differences in syntactic structure • Basic idea: • human translators search for analogies (similar phrases) in previous translations • MT should seek matching fragments in a bilingual database and extract their translations (see the sketch below) • Aims for less complex dictionaries, grammars, and procedures • Improved generation (using actual examples of TL sentences)
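As a concrete illustration of the matching step, here is a minimal sketch that is not from the original slides: it looks up the stored source fragment most similar to the input and returns its stored translation. The tiny example database and the use of difflib's SequenceMatcher as the similarity measure are assumptions for illustration only.

```python
# Minimal EBMT matching sketch: find the stored source fragment most similar to the
# query and return its stored translation. Database and similarity measure are
# illustrative assumptions, not the slides' actual system.

from difflib import SequenceMatcher

# Tiny bilingual example database: (source fragment, target translation)
examples = [
    ("he opened a bank account", "abrió una cuenta bancaria"),
    ("she sat on the bench", "se sentó en el banco"),
]

def best_match(query: str) -> tuple[str, str, float]:
    """Return the (source, translation, similarity) of the closest stored example."""
    scored = [(src, tgt, SequenceMatcher(None, query, src).ratio()) for src, tgt in examples]
    return max(scored, key=lambda x: x[2])

print(best_match("he opened a new bank account"))
```

A real EBMT system would match at the fragment level and recombine the retrieved target fragments, but the retrieve-by-similarity step is the core idea.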
EBMT still going • Bilingual corpus collection • Storage of examples • Searching and matching • …
Statistical MT Basics • Based on the assumption that translations exhibit statistical regularities • Origins: Warren Weaver (1949), Shannon's information theory • The core process is a probabilistic 'translation model' that takes SL words or phrases as input and produces TL words or phrases as output • A succeeding stage applies a probabilistic 'language model' that assembles TL words into 'meaningful' TL sentences
Statistical MT • Diagram: statistical learning builds a model; natural language input is fed to a learning system, which produces a probabilistic model; new natural language input is then fed to a prediction system that uses the model to make predictions
Statistical MT processes • Bilingual corpora: originals and their translations • Little or no linguistic 'knowledge': based on word co-occurrences in SL and TL texts (of a corpus), relative positions of words within sentences, and sentence length • Alignment: sentences aligned statistically (according to sentence length and position) • Decoding: compute the probability that a TL string is the translation of an SL string (the 'translation model'), based on: • frequency of co-occurrence in aligned texts of the corpus • position of SL words in the SL string • Adjustment: compute the probability that a TL string is a valid TL sentence (based on a 'language model' of allowable bigrams and trigrams) • Search for the TL string e that maximizes these probabilities (see the sketch below): ê = argmax_e P(e | f) = argmax_e P(f | e) P(e)
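A minimal sketch of that final search step, assuming a toy candidate list and made-up probability tables: it scores each candidate English string e by log P(f | e) + log P(e) and returns the argmax, the quantity on the slide. Real decoders search a huge hypothesis space rather than an explicit list.

```python
# Minimal noisy-channel scoring sketch (hypothetical toy probabilities, not a real decoder).
# Picks the English candidate e maximizing P(f | e) * P(e).

import math

# Toy language model P(e) and translation model P(f | e); values are illustrative only.
lm = {"the bank": 0.004, "the bench": 0.001}
tm = {("banco", "the bank"): 0.6, ("banco", "the bench"): 0.3}

def score(f: str, e: str) -> float:
    """log P(f | e) + log P(e) for a foreign string f and candidate translation e."""
    return math.log(tm.get((f, e), 1e-9)) + math.log(lm.get(e, 1e-9))

def decode(f: str, candidates: list[str]) -> str:
    """argmax_e P(f | e) * P(e) over an explicit candidate list."""
    return max(candidates, key=lambda e: score(f, e))

print(decode("banco", ["the bank", "the bench"]))  # picks "the bank" with these toy numbers
```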
Language Modeling • Determines the probability of an English sequence of length l • P(e) is normally approximated as P(e) ≈ ∏_i P(w_i | w_{i-m}, …, w_{i-1}), where m is the size of the context, i.e. the number of previous words that are considered • m = 1: bigram language model • m = 2: trigram language model • (see the sketch below)
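A minimal bigram (m = 1) language-model sketch over a hypothetical two-sentence corpus; the add-one smoothing is an illustrative choice, not something the slides prescribe.

```python
# Minimal bigram language model: maximum-likelihood counts with add-one smoothing.
# Corpus is hypothetical and tiny, purely for illustration.

from collections import Counter
from itertools import chain

corpus = [["the", "bank", "is", "open"], ["the", "bench", "is", "wet"]]

unigrams = Counter(chain.from_iterable(corpus))
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab_size = len(unigrams)

def p_bigram(w_prev: str, w: str) -> float:
    """P(w | w_prev) with add-one smoothing."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def p_sentence(words: list[str]) -> float:
    """Approximate P(e) as the product of conditional bigram probabilities."""
    p = 1.0
    for i in range(1, len(words)):
        p *= p_bigram(words[i - 1], words[i])
    return p

print(p_sentence(["the", "bank", "is", "open"]))
```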
Translation Modeling • Determines the probability that a foreign word f is a translation of an English word e • How do we compute P(f | e) from a parallel corpus? • Statistical approaches rely on the co-occurrence of e and f in the parallel data: if e and f tend to co-occur in parallel sentence pairs, they are likely to be translations of one another (a minimal estimation sketch follows)
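One standard way to turn co-occurrence into translation probabilities is IBM Model 1 style EM. The following is a minimal sketch over a hypothetical two-pair corpus (no NULL word, fixed number of iterations), not the full training procedure described by Brown et al.

```python
# Minimal IBM Model 1 style EM sketch for estimating t(f | e) from a parallel corpus.
# The corpus and iteration count are hypothetical; real training adds NULL alignment,
# much larger data, and convergence checks.

from collections import defaultdict

parallel = [
    (["la", "casa"], ["the", "house"]),
    (["la", "banca"], ["the", "bank"]),
]

e_vocab = {e for _, es in parallel for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialization of t(f | e)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in parallel:
        for f in fs:
            z = sum(t[(f, e)] for e in es)   # normalization over possible alignments
            for e in es:
                c = t[(f, e)] / z            # expected count of f aligning to e
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():          # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

print(round(t[("banca", "bank")], 3), round(t[("banca", "the")], 3))
```

Even on this toy corpus, "la" co-occurring with "the" in both pairs pushes the probability mass for "banca" toward "bank", which is exactly the co-occurrence intuition on the slide.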
SMT issues • Ignores previous MT research (a new start, a new 'paradigm') • Basically a 'direct' approach: • replaces each SL word by the most probable TL word • reorders the TL words • decoding is effectively a kind of 'back translation' • Originally wholly word-based (IBM 'Candide', 1988); now predominantly phrase-based (i.e. alignment of word groups); some research on syntax-based models • Mathematically simple, but requires a huge amount of training data (large databases) • Problems for SMT: • translation is not just selecting the most frequent 'equivalent' (wider context matters) • no quality control of corpora • lack of monolingual data for some languages • insufficient bilingual data (the Internet as a resource) • lack of structural information about language • Merit of SMT: evaluation is an integral part of system development
Rule-Based MT & SMT • SMT black box: no way of finding how it works in particular cases, why it succeeds sometimes and not others • RBMT: rules and procedures can be examined • RBMT and SMT are apparent polar opposites, but gradually ‘rules’ incorporated in SMT models • first, morphology (even in versions of first IBM model) • then, ‘phrases’ (with some similarity to linguistic phrases) • now also, syntactic parsing
Rule-Based MT & SMT • Comparison from following perspectives: • Theory background • Knowledge expression • Knowledge discovery • Robust • Extension • Development Cycle
Evaluation of MT • Manual evaluation: • accuracy / fluency / completeness • 信 达 雅 (faithfulness, expressiveness, elegance) • Automatic evaluation (see the sketch below): • BLEU: percentage of word sequences (n-grams) occurring in reference texts • NIST
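A minimal sketch of the n-gram matching behind BLEU, assuming a single reference and showing only the modified (clipped) n-gram precision; real BLEU combines n = 1..4 with a geometric mean and a brevity penalty. The example sentences are made up.

```python
# Minimal sketch of BLEU-style modified n-gram precision against a single reference.
# Not full BLEU: no geometric mean over n, no brevity penalty, no multiple references.

from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Fraction of candidate n-grams that also appear in the reference, clipped by reference counts."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

cand = "the bank is open today".split()
ref = "the bank opens today".split()
print(modified_precision(cand, ref, 1), modified_precision(cand, ref, 2))
```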
MT Development - Research • Chart: MT approaches plotted along two axes • Knowledge acquisition strategy: from all manual (hand-built by experts, hand-built by non-experts) to fully automated (learned from annotated data, learned from un-annotated data) • Knowledge representation strategy: from shallow/simple (original direct approach, word-based original statistical MT, electronic dictionaries, example-based MT, phrase tables) to deep/complex (syntactic constituent structure, typical transfer systems, semantic analysis, classic interlingual systems, interlingua) • "New Research Goes Here!" marked on the chart
MT problems in general • Characteristics of language: • ambiguity • dynamism • flexibility • Knowledge: • how to represent it • how to discover it • how to use it
Some thoughts on MT from a cognitive perspective • The human brain • memory • progress: learning • model • pattern • Translation by humans… • Translation by machines…
Further Reading • Arturo Trujillo, Translation Engines: Techniques for Machine Translation, Springer-Verlag London, 1999 • P. F. Brown et al., A Statistical Approach to Machine Translation, Computational Linguistics, 1990, 16(2) • P. F. Brown et al., The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993, 19(2) • Bonnie J. Dorr et al., Survey of Current Paradigms in Machine Translation • Makoto Nagao, A Framework of a Mechanical Translation between Japanese and English by Analogy Principle, in A. Elithorn and R. Banerji (eds.), Artificial and Human Intelligence, NATO Publications, 1984 • W. J. Hutchins, Machine Translation: Past, Present, Future, Chichester: Ellis Horwood, 1986 • Daniel Jurafsky & James H. Martin, Speech and Language Processing, Prentice Hall, 2000 • Christopher D. Manning & Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999 • James Allen, Natural Language Understanding, Benjamin/Cummings, 1987