1. Introduction to Machine Translation
Mitch Marcus
CIS 530
Some slides adapted from slides by John Hutchins, Bonnie Dorr, Martha Palmer, Language Weaver, and Kevin Knight
2. Why use computers in translation?
Too much translation for humans
Technical materials too boring for humans
Greater consistency required
Need results more quickly
Not everything needs to be top quality
Reduce costs
Any one of these may justify machine translation or computer aids
3. The Early History of NLP (Hutchins): MT in the 1950s and 1960s
Sponsored by government bodies in the USA and USSR (also CIA and KGB)
assumed goal was fully automatic quality output (i.e. of publishable quality) [dissemination]
actual need was translation for information gathering [assimilation]
Survey by Bar-Hillel of MT research:
criticised assumption of FAHQT as goal
demonstrated ‘non-feasibility’ of FAHQT (without ‘unrealisable’ encyclopedic knowledge bases)
advocated “man-machine symbiosis”, i.e. HAMT and MAHT
ALPAC 1966, set up by disillusioned funding agencies
compared latest systems with early unedited MT output (IBM-GU demo, 1954), criticised for still needing post-editing
advocated machine aids, and no further support of MT research
but failed to identify the actual needs of funders [assimilation]
therefore failed to see that output of IBM-USAF Translator and Georgetown systems were used and appreciated
4. Consequences of ALPAC
MT research virtually ended in the US
identification of actual needs
assimilation vs. dissemination
recognition that ‘perfectionism’ (FAHQT) had neglected:
operational factors and requirements
expertise of translators
machine aids for translators
henceforth three strands of MT:
translation tools (HAMT, MAHT)
operational systems (post-editing, controlled languages, domain-specific systems)
research (new approaches, new methods)
computational linguistics born in the aftermath
5. Review from Introductory Lectures
6. [figure]
7. How A Statistical MT System Learns
8. Translating a New Document
9. Latest Methods: Language Weaver
Chinese Source Text, Sample 1: [Chinese characters lost in this transcription]
Language Weaver Experimental Syntax MT, Sample 1: The motor show, has always been the 'barometer' of a national car consumption and market potential. The Beijing International Auto Show has more than 1,200 exhibitors from 24 countries and 8 days of receiving more than 40 million visitors, setting a new record in China's auto show, are deeply aware of the automobile market signals. "China is one of the largest automobile market in the world. Over the years, this phrase implies more auto businessmen. But now, more and more facts indicates that it is to become a reality. Data from the Motor Show is very convincing. The Beijing Qingnian Bao Report on-the-spot investigation showed that about 35 percent of 35-year-old visitors, 62.1 percent of the respondents said that the truck was mainly to buy a car in the near future to collect information, even at the exhibition may purchase or suitable products; 76% of respondents indicated in the past two years to buy private cars. Since the beginning of this year, the strong growth of the domestic car market. According to the figures released by the National Bureau of Statistics, in the first four months, the country produced 267,900 vehicles, up 27.6 percent; in particular, in April, the production of 90,000 vehicles, an increase of 50.5% over the same period last year, setting a record high for the monthly output growth over the past 10-odd years.
In terms of sales in the first quarter, manufacturing enterprises in the country sold 188,000 cars, up 22 percent over the same period of last year, up 10.5 percent; 11,000 vehicles, dropping by nearly 25 percent lower than the beginning of the year.
10. [figure]
11. Introduction: Approaches & Difficulties
12. MT Challenges: Ambiguity
Syntactic Ambiguity: I saw the man on the hill with the telescope
Lexical Ambiguity
E: book
S: libro, reservar
Semantic Ambiguity
Homography: ball (E) = pelota, baile (S)
Polysemy: kill (E) = matar, acabar (S)
Semantic granularity:
esperar (S) = wait, expect, hope (E)
be (E) = ser, estar (S)
fish (E) = pez, pescado (S)
13. MT Challenges: Divergences
14. [figure]
15. Divergence Frequency
32% of sentences in UN Spanish/English Corpus (5K)
35% of sentences in TREC El Norte Corpus (19K)
Divergence Types
Categorial (X tener hambre → X have hunger) [98%]
Conflational (X dar puñaladas a Z → X stab Z) [83%]
Structural (X entrar en Y → X enter Y) [35%]
Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
Thematic (X gustar a Y → Y like X) [6%]
16. MT Lexical Choice: WSD
Iraq lost the battle.
Ilakuka centwey ciessta.
[Iraq ] [battle] [lost].
John lost his computer.
John-i computer-lul ilepelyessta.
[John] [computer] [misplaced].
17. WSD with Source Language Semantic Class Constraints
18. Lexical Gaps: English to Chinese
break, smash, shatter, snap (E) map to:
da po - irregular pieces
da sui - small pieces
pie duan - line segments
19. Three MT Approaches: Direct, Transfer, Interlingual (Vauquois triangle)
20. Examples of Three Approaches
Direct:
I checked his answers against those of the teacher →
Yo comparé sus respuestas a las de la profesora
Rule: [check X against Y] → [comparar X a Y]
Transfer:
Ich habe ihn gesehen → I have seen him
Rule: [clause agt aux obj pred] → [clause agt aux pred obj]
Interlingual:
I like Mary → Mary me gusta a mí
Rep: [BeIdent (I [ATIdent (I, Mary)] Like+ingly)]
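The Direct rule above can be sketched as simple pattern substitution. The lexicon entries and the regular expression below are illustrative assumptions, not part of the original system; note that the output gets Spanish agreement wrong ("Yo comparó" instead of "Yo comparé"), which is exactly the kind of restructuring problem the next slide lists among the cons of Direct MT.

```python
import re

# A toy "direct" MT rule: match the English pattern, translate each slot
# word-for-phrase from a tiny hand-built lexicon, emit the Spanish frame.
LEX = {"I": "Yo",
       "his answers": "sus respuestas",
       "those of the teacher": "las de la profesora"}

def check_against_rule(sentence):
    """[X checked Y against Z] -> [X comparo Y a Z], translating each slot."""
    m = re.fullmatch(r"(.+) checked (.+) against (.+)", sentence)
    if not m:
        return sentence
    x, y, z = (LEX.get(g, g) for g in m.groups())
    return f"{x} comparó {y} a {z}"

print(check_against_rule("I checked his answers against those of the teacher"))
# → Yo comparó sus respuestas a las de la profesora
```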
21. Direct MT: Pros and Cons
Pros
Fast
Simple
Inexpensive
Cons
Unreliable
Not powerful
Rule proliferation
Requires too much context
Major restructuring after lexical substitution
22. Transfer MT: Pros and Cons
Pros
Don’t need to find language-neutral rep
No translation rules hidden in lexicon
Relatively fast
Cons
N² sets of transfer rules: difficult to extend
Proliferation of language-specific rules in lexicon and syntax
Cross-language generalizations lost
23. Interlingual MT: Pros and Cons
Pros
Portable (avoids the N² problem)
Lexical rules and structural transformations stated more simply on normalized representation
Explanatory Adequacy
Cons
Difficult to deal with terms on primitive level: universals?
Must decompose and reassemble concepts
Useful information lost (paraphrase)
(Is thought really language neutral??)
24. A Gentle Introduction to Statistical MT: Core Ideas
25. Warren Weaver – 1949 Memorandum I
Proposes Local Word Sense Disambiguation!
‘If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which.
But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning. . .’
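Weaver's sliding-window idea can be sketched as a toy disambiguator: pick the sense whose hand-listed cue words overlap most with the N words on either side. The senses, cue words, and window size below are illustrative assumptions, not from the memorandum.

```python
# Toy sketch of Weaver's window idea for the ambiguous word "fast".
SENSE_CUES = {
    "fast": {
        "rapid": {"run", "car", "sped", "speed", "quick"},
        "motionless": {"stuck", "held", "tied", "firm"},
    }
}

def disambiguate(tokens, i, n=3):
    """Pick the sense of tokens[i] whose cues best overlap the n-word window."""
    window = set(tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n])
    senses = SENSE_CUES[tokens[i]]
    return max(senses, key=lambda s: len(senses[s] & window))

tokens = "the car sped fast down the road".split()
print(disambiguate(tokens, tokens.index("fast")))  # → rapid
```

Lengthening the slit (a larger n) lets more cue words fall inside the window, which is exactly Weaver's point.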
26. Warren Weaver – 1949 Memorandum II
Proposes Interlingua for Machine Translation!
‘Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and—then re-emerge by whatever particular route is convenient.’
27. Warren Weaver – 1949 Memorandum III
Proposes Machine Translation using Information Theory!
‘It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?’
Weaver, W. (1949): ‘Translation’. Repr. in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23.
28. IBM Adopts Statistical MT Approach I (early 1990s)
‘In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.’
29. IBM Adopts Statistical MT Approach II
‘The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.
As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.’
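The Candide decision rule E* = argmax(E') p(F|E') p(E') can be sketched over a toy candidate list. Every probability below is invented for illustration; a real system estimates the language model from monolingual text and the channel model from bilingual text.

```python
import math

# Toy noisy-channel decoder: pick the English sentence E maximizing
# log P(F|E) + log P(E) for an observed French sentence F.
LM = {"I saw him": 3e-3, "him saw I": 1e-7}   # P(E), language model
CHANNEL = {("je l'ai vu", "I saw him"): 0.2,  # P(F|E), channel model
           ("je l'ai vu", "him saw I"): 0.4}

def decode(f, candidates):
    """Return the MAP-optimal English candidate for French sentence f."""
    return max(candidates,
               key=lambda e: math.log(CHANNEL[(f, e)]) + math.log(LM[e]))

print(decode("je l'ai vu", list(LM)))  # → I saw him
```

Note that the fluent word order wins even though its channel score is lower, because the language model vetoes the ungrammatical candidate.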
30. The Channel Model for Machine Translation
31. Noisy Channel - Why useful?
Word reordering in translation handled by P(S)
The P(S) factor frees P(T|S) from worrying about word order in the “Source” language
Word choice in translation handled by P(T|S)
The P(T|S) factor frees P(S) from worrying about picking the right translation
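This division of labour is just Bayes' rule applied to the channel model, where T is the observed text and S is the source sentence we want to recover; P(T) is constant for the observed T, so it drops out of the argmax:

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid T)
       \;=\; \arg\max_{S} \frac{P(T \mid S)\, P(S)}{P(T)}
       \;=\; \arg\max_{S} \underbrace{P(T \mid S)}_{\text{word choice}} \;
             \underbrace{P(S)}_{\text{word order}}
```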
32. An Alignment
33. Fertilities and Lexical Probabilities for not
34. Fertilities and Lexical Probabilities for hear
35. Schematic of Translation Model
36. How do we evaluate MT?
Human-based Metrics
Semantic Invariance
Pragmatic Invariance
Lexical Invariance
Structural Invariance
Spatial Invariance
Fluency
Accuracy: Number of Human Edits required
HTER: Human-targeted Translation Edit Rate
“Do you get it?”
Automatic Metrics: Bleu
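HTER counts the human edits needed to turn system output into an acceptable translation, normalised by translation length. Ignoring phrase shifts (which full HTER also allows), a word-level edit-distance sketch looks like this; the example sentences are illustrative:

```python
def hter(hyp, ref):
    """Word-level edit distance (insert/delete/substitute, no shifts)
    divided by reference length -- a simplified stand-in for HTER."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits to turn the first i hyp words into the first j ref words
    dp = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(r) + 1)]
          for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                      # delete
                           dp[i][j - 1] + 1,                      # insert
                           dp[i - 1][j - 1] + (h[i - 1] != r[j - 1]))  # subst
    return dp[len(h)][len(r)] / len(r)

print(hter("Iraq lost battle", "Iraq lost the battle"))  # → 0.25 (1 edit / 4 words)
```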
37. BiLingual Evaluation Understudy (BLEU — Papineni, 2001)
Automatic technique, but …
Requires the pre-existence of Human (Reference) Translations
Compare n-gram matches between candidate translation and 1 or more reference translations
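The n-gram matching at BLEU's core is modified (clipped) precision: a candidate n-gram only counts as many times as it appears in some single reference, which blocks the degenerate candidate that repeats one common word. A minimal sketch (full BLEU combines clipped precisions for n = 1..4 with a geometric mean and a brevity penalty):

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Modified n-gram precision from BLEU: candidate n-gram counts are
    clipped by the maximum count seen in any single reference."""
    def ngrams(words):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    cand = ngrams(candidate.split())
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref.split()).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

# The classic pathological candidate: every unigram "matches", but clipping
# caps the credit at the reference's two occurrences of "the".
print(clipped_precision("the the the the the the the",
                        ["the cat is on the mat"]))  # → 2/7
```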
38. Bleu Metric
39. Bleu Metric
40. Thanks!