An overview of empirical approaches to translation – Example-based MT and Statistical MT – contrasting data-driven methods with rule-driven systems, and showing how methods based on raw textual data translate by identifying patterns and reusing translation examples.
Machine Translation III
Empirical approaches to MT: Example-based MT, Statistical MT
http://personalpages.manchester.ac.uk/staff/harold.somers/LELA30431/chapter50.pdf
http://www.statmt.org/
Introduction
• Empirical approaches: what does that mean?
• Empirical vs rationalist
• Data-driven vs rule-driven
• Pure empiricism: statistical MT
• Hybrid empiricism: Example-based MT
Empirical approaches
• Approaches based on pure data
• Contrast with “rationalist” approach: rule-based systems of the “2nd generation”
• Larger storage, faster processors, and the availability of textual data in huge quantities suggest a data-driven approach may be possible
• “Data” here means just raw text
Flashback
• Early thoughts on MT (Warren Weaver 1949) included the possibility that translation was like code-breaking (cryptanalysis)
• Weaver co-authored, with Claude Shannon, the foundational book on “information theory”
• Given enough data, patterns could be identified and applied to new text
Back to the future
• Data-driven approach encouraged by the availability of machine-readable parallel text, notably at first the Canadian and Hong Kong Hansards, then EU documents and dual-language web pages
• Two basic approaches:
  • Statistical MT
  • Example-based MT
Example-based MT
• “Translation by analogy”
• First proposed by Nagao (1984) but not implemented until the early 1990s
• Very intuitive: translate text on the basis of recognising bits that have been previously translated, and sticking them together
• Cf. the tourist phrasebook approach
Example-based MT
• Like an extension of Translation Memory
• Based on a database of translation examples
• The system finds closely matching previous example(s)
• (Unlike TM) it identifies the corresponding fragments in the target text(s) (alignment)
• And recombines them to give the target text
Example (Sato & Nagao 1990)
Input:    He buys a book on international politics.
Matches:  He buys a notebook.                       ~ Kare wa nōto o kau.
          I read a book on international politics.  ~ Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
Result:   Kare wa kokusai seiji nitsuite kakareta hon o kau.
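A minimal Python sketch of the recombination step for this example. The template and fragment pair are hand-crafted stand-ins for what a real EBMT system would retrieve from its example database; all names are illustrative.

```python
# Toy EBMT recombination for the Sato & Nagao example above.
# Template learned from the match "He buys a notebook ~ Kare wa nōto o kau":
# a source pattern with slot "X" paired with a target pattern.
TEMPLATE = ("he buys X", "kare wa X o kau")

# Fragment pair found in the second match.
FRAGMENT = ("a book on international politics",
            "kokusai seiji nitsuite kakareta hon")

def translate(sentence, template, fragment):
    """Fill the template's slot with the fragment if the input matches."""
    src_pat, tgt_pat = template
    src_frag, tgt_frag = fragment
    if src_pat.replace("X", src_frag) == sentence:
        return tgt_pat.replace("X", tgt_frag)
    return None

print(translate("he buys a book on international politics",
                TEMPLATE, FRAGMENT))
# -> kare wa kokusai seiji nitsuite kakareta hon o kau
```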
Learning templates
The monkey ate a peach.    saru wa momo o tabeta.
The man ate a peach.       hito wa momo o tabeta.
⇒ monkey ~ saru, man ~ hito
⇒ The … ate a peach.       … wa momo o tabeta.
The dog ate a rabbit.      inu wa usagi o tabeta.
⇒ dog ~ inu, rabbit ~ usagi
⇒ The … ate a … .          … wa … o tabeta.
The dog ate a peach.   →   inu wa momo o tabeta.
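The learning step itself can be sketched the same way: given two example pairs that differ in exactly one word on each side, extract the word correspondences and generalise the rest into a template. A toy Python version, illustrative only:

```python
# Learn a template from two example pairs that differ in one word on each
# side (toy version of the monkey/man example above).

def diff_one(a, b):
    """If token lists a and b differ in exactly one position, return it."""
    if len(a) != len(b):
        return None
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return diffs[0] if len(diffs) == 1 else None

def learn(pair1, pair2):
    """From two (source, target) pairs, learn a word pair and a template."""
    s1, t1 = pair1[0].split(), pair1[1].split()
    s2, t2 = pair2[0].split(), pair2[1].split()
    i, j = diff_one(s1, s2), diff_one(t1, t2)
    if i is None or j is None:
        return None
    words = {s1[i]: t1[j], s2[i]: t2[j]}           # monkey ~ saru, man ~ hito
    src_tpl = " ".join(s1[:i] + ["X"] + s1[i+1:])  # "the X ate a peach"
    tgt_tpl = " ".join(t1[:j] + ["X"] + t1[j+1:])  # "X wa momo o tabeta"
    return words, (src_tpl, tgt_tpl)

print(learn(("the monkey ate a peach", "saru wa momo o tabeta"),
            ("the man ate a peach",    "hito wa momo o tabeta")))
```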
Some problems include …
• Source of examples: genuine text or hand-crafted?
• Identifying matching fragments
  • Preprocessed: storage implications; prejudges what will be useful
  • “On the fly”: needs a dictionary
• Partial matching
• Sticking fragments together (boundary friction)
• Conflicting/multiple examples
Partial matching
Input: The operation was interrupted because the file was hidden.
a. The operation was interrupted because the Ctrl-c key was pressed.
b. The specified method failed because the file is hidden.
c. The operation was interrupted by the application.
d. The requested operation cannot be completed because the disk is full.
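One simple way to score partial matches is word overlap, here via the Dice coefficient over the candidates above. Real matchers use richer similarity measures (edit distance, thesaurus-based word similarity, POS tags), so treat this as an illustrative baseline:

```python
# Rank stored examples by word overlap (Dice coefficient) with the input.

def dice(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return 2 * len(a & b) / (len(a) + len(b))

query = "The operation was interrupted because the file was hidden."
examples = [
    "The operation was interrupted because the Ctrl-c key was pressed.",
    "The specified method failed because the file is hidden.",
    "The operation was interrupted by the application.",
    "The requested operation cannot be completed because the disk is full.",
]
for score, ex in sorted(((dice(query, ex), ex) for ex in examples),
                        reverse=True):
    print(f"{score:.2f}  {ex}")
# Note: plain overlap ranks (a) highest even though (b) preserves the key
# clause -- choosing a good similarity measure is part of the problem.
```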
Boundary friction (1)
Consider again the input: He buys a book on politics.
Matches:  He buys a notebook.             ~ Kare wa nōto o kau.
          He buys a pen.                  ~ Kare wa pen o kau.
          I read a book on politics.      ~ Watashi wa seiji nitsuite kakareta hon o yomu.
          She wrote a book on politics.   ~ Kanojo wa seiji nitsuite kakareta hon o kaita.
Result:   Kare wa + wa seiji nitsuite kakareta hon o + o kau
          → *Kare wa wa seiji nitsuite kakareta hon o o kau
The fragments overlap at wa and o, so naive recombination duplicates them.
Boundary friction (2)
Input: The handsome boy entered the room.
Matches:  The handsome boy ate his breakfast.   ~ Der schöne Junge aß sein Frühstück.
          I saw the handsome boy.               ~ Ich sah den schönen Jungen.
The two matches render the same English fragment in different German cases (nominative der schöne Junge vs accusative den schönen Jungen), so picking the wrong fragment breaks agreement in the result.
Competing examples
In closing, I will say that I am sad for workers in the airline industry.
    En terminant, je dirai que c’est triste pour les travailleurs et les travailleuses du secteur de l’aviation.
My colleague spoke about the airline industry.
    Mon collègue a parlé de l’industrie du transport aérien.
People in the airline industry have become unemployed.
    Des gens de l’industrie aérienne sont devenus chômeurs.
This tax will cripple some of the small companies in the airline industry.
    Cette surtaxe va nuire aux petits transporteurs aériens.
Results from the Canadian Hansard using TransSearch – four different renderings of “the airline industry”.
Statistical MT
• Pioneered by IBM in the early 1990s
• Spurred on by the greater success in speech recognition of statistical over linguistic rule-based approaches
• The idea that translation can be modelled as a statistical process
• Seems to work best in a limited domain where the given data is a good model of future translations
Translation as a probabilistic problem
• For a given SL sentence Si, there are a number of “translations” Tj of varying probability
• The task is to find, for Si, the sentence Tj for which the probability P(Tj | Si) is highest
Two models
• P(Tj | Si) is a function of two models:
  • The probabilities of the individual words that make up Tj given the individual words in Si – the “translation model”
  • The probability that the individual words that make up Tj are in the appropriate order – the “language model”
Expressed in mathematical terms (by Bayes’ rule):
    argmax_T P(T | S) = argmax_T P(T) · P(S | T) / P(S)
Since S is given, P(S) is constant, so this can be simplified to:
    argmax_T P(T) · P(S | T)
where P(T) is the language model and P(S | T) is the translation model.
So how do we translate?
• For a given input sentence Si we need a practical way to find the Tj that maximizes the formula
• We have to start somewhere, so we start with the translation model: which words look most likely to help us?
• In a systematic way we keep trying different combinations, together with the language model, until we stop getting improvements
[Diagram: input sentence → translation model proposes a “bag” of possible words → language model orders them into a candidate translation → seek improvement by trying other combinations → most probable translation]
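A toy Python decoder in that spirit: the translation model has proposed a bag of target words, and we pick the ordering the bigram language model likes best. All probabilities are invented, and the exhaustive search over permutations is only viable for toy inputs; real decoders use heuristic search.

```python
# Toy decoder: order the "bag" of target words proposed by the translation
# model so as to maximize the bigram language-model score.
import itertools, math

def lm_score(words, bigram_p, floor=1e-6):
    """Log-probability of a word sequence under a bigram model."""
    return sum(math.log(bigram_p.get((a, b), floor))
               for a, b in zip(words, words[1:]))

def decode(bag, bigram_p):
    # Exhaustive over permutations: fine for 3 words, explodes in general.
    return max(itertools.permutations(bag),
               key=lambda w: lm_score(w, bigram_p))

bigram_p = {("he", "buys"): 0.4, ("buys", "books"): 0.3}
print(decode(["books", "buys", "he"], bigram_p))
# -> ('he', 'buys', 'books')
```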
Where do the models come from?
• All the statistical parameters are pre-computed, based on a parallel corpus
• The language model is probabilities of word sequences (n-grams)
• The translation model is derived from an aligned parallel corpus
The translation model
• Take a sentence-aligned parallel corpus
• Extract the entire vocabulary for both languages
• For every word pair, calculate the probability that they correspond – e.g. by comparing distributions
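A compact sketch of how such correspondence probabilities can be estimated: this is the expectation-maximisation recipe of IBM Model 1, run on a three-sentence toy corpus. The data and iteration count are illustrative.

```python
# IBM Model 1 word-translation probabilities by EM on a toy corpus.
from collections import defaultdict

corpus = [("the house".split(), "la maison".split()),
          ("the book".split(),  "le livre".split()),
          ("a book".split(),    "un livre".split())]

t = defaultdict(lambda: 0.25)   # t[(f, e)] = P(f | e), uniform start

for _ in range(10):             # a few EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sent, f_sent in corpus:
        for f in f_sent:
            # E-step: distribute each French word's count over the English
            # words it could align to, in proportion to the current t(f|e).
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / norm
                count[(f, e)] += c
                total[e] += c
    # M-step: re-normalise the expected counts into probabilities.
    for (f, e) in count:
        t[(f, e)] = count[(f, e)] / total[e]

print(round(t[("livre", "book")], 2))   # high -- tends towards 1.0
```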
Some obvious problems
• “Fertility”: not all word correspondences are 1:1
• Some words have multiple possible translations, e.g. the ~ {le, la, l’, les}
• Some words have no translation, e.g. se in il se rase ‘he shaves’
• Some words are translated by several words, e.g. cheap ~ peu cher
• It is not always obvious how to align
many:many is not allowed; only 1:n (n ≥ 0), and in practice n ≤ 3
The proposal will not now be implemented
Les propositions ne seront pas mises en application maintenant
The ~ Les
proposal ~ propositions
will ~ seront    } will not ~ ne seront pas
not ~ ne … pas
now ~ maintenant
be ~ (nothing)
implemented ~ mises en application
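In data-structure terms such a 1:n alignment is just a map from each source word to zero or more target words. A hypothetical Python rendering of the alignment above (simplified: a real representation would key on word positions, not word forms, to handle repeated words):

```python
# The alignment above as a 1:n mapping; fertility is the list length.
alignment = {
    "The":         ["Les"],
    "proposal":    ["propositions"],
    "will":        ["seront"],
    "not":         ["ne", "pas"],           # discontinuous: ne ... pas
    "now":         ["maintenant"],
    "be":          [],                       # fertility 0: no translation
    "implemented": ["mises", "en", "application"],
}
fertility = {e: len(fs) for e, fs in alignment.items()}
print(fertility["not"], fertility["be"])    # -> 2 0
```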
Some word-pair probabilities from the Canadian Hansard

‘the’:   French P:    le .610, la .178, l’ .083, les .023, ce .013, il .012, de .009, à .007, que .007
         fertility P: 1 .871, 0 .124, 2 .004

‘not’:   French P:    pas .469, ne .460, non .024, faux .006, plus .002, ce .002, que .002, jamais .002
         fertility P: 2 .758, 0 .133, 1 .106

‘hear’:  French P:    bravo .992, entendre .005, entendu .002, entende .001
         fertility P: 0 .584, 1 .416
Another problem: distortion
• Notice that corresponding words do not appear in the same order
• The translation model includes probabilities for “distortion”
• e.g. P(5|2): the probability that the source word in position 2 will produce a target word in position 5
• This can be more complex: P(5|2,4,6): the probability that the source word in position 2 will produce a target word in position 5 when S has 4 words and T has 6
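The distortion parameters are just another lookup table, indexed by positions and lengths. A hypothetical sketch with invented values:

```python
# Distortion as a table d(j | i, l, m): probability that the source word in
# position i produces the target word in position j, given source length l
# and target length m. Values are invented for illustration.
distortion = {(5, 2, 4, 6): 0.15,   # d(5 | 2, 4, 6)
              (2, 2, 4, 6): 0.40}   # d(2 | 2, 4, 6): staying put is likelier

def d(j, i, l, m, table=distortion, floor=1e-4):
    """Look up a distortion probability, with a small floor for unseen cases."""
    return table.get((j, i, l, m), floor)

print(d(5, 2, 4, 6))   # -> 0.15
```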
The language model
• It is impractical to calculate the probability of every word sequence:
  • Many will be very improbable …
  • because they are ungrammatical
  • or because they happen not to occur in the data
• Probabilities of sequences of n words (“n-grams”) are more practical
• Bigram model: P(wi | wi–1) ≈ f(wi–1, wi) / f(wi–1)
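A few lines of Python make the bigram estimate concrete, using the relative-frequency formula in the bullet above on a toy corpus:

```python
# Bigram language-model estimation: P(w | prev) = f(prev, w) / f(prev).
from collections import Counter

corpus = "he buys a book . he reads a book . she buys a pen .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

print(p("book", "a"))   # f(a, book) / f(a) = 2/3
```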
Sparse data
• Relying on n-grams with a large n risks zero probabilities
• Bigrams are less risky but sometimes not discriminatory enough
  • e.g. *I hire men who is good pilots – every bigram in it is plausible, but the sentence is ungrammatical
• 3- or 4-grams allow a nice compromise, and if a 3-gram is previously unseen, we can give it a score based on the component bigrams (“smoothing”)
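A sketch of one common smoothing recipe: linear interpolation of trigram, bigram, and unigram estimates, so an unseen trigram still gets a non-zero score. The weights and probabilities here are invented:

```python
# Interpolated trigram probability: fall back gracefully to bigram and
# unigram estimates when the trigram was never seen in the training data.
def smoothed_p(w, prev2, prev1, tri_p, bi_p, uni_p,
               l3=0.6, l2=0.3, l1=0.1):
    return (l3 * tri_p.get((prev2, prev1, w), 0.0)
            + l2 * bi_p.get((prev1, w), 0.0)
            + l1 * uni_p.get(w, 0.0))

# e.g. "who is good": unseen as a trigram, but the bigram "is good" is common
print(smoothed_p("good", "who", "is",
                 tri_p={}, bi_p={("is", "good"): 0.2}, uni_p={"good": 0.01}))
# -> 0.061
```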
Put it all together and …?
• To build a statistical MT system we need:
  • an aligned bilingual corpus
  • “training programs” which extract from the corpus all the statistical data for the models
  • a “decoder” which takes a given input and seeks the output that maximizes the argmax formula – based on a heuristic search algorithm
• Software for this purpose is freely available, e.g. via http://www.statmt.org/
• The claim is that an MT system for a new language pair can be built in a matter of hours
SMT: latest developments
• Nevertheless, quality is limited
• SMT researchers quickly learned (just as in the 1960s) that this crude approach can get them so far (quite far, actually), but that to go the extra distance you need linguistic knowledge (e.g. morphology, “phrases”, constituents)
• The latest developments aim to incorporate this
• The big difference is that it too can be LEARNED (automatically) from corpora
• So SMT still contrasts with traditional RBMT, where rules are “hand-coded” by linguists