Duluth Word Alignment System
Bridget Thomson McInnes and Ted Pedersen
University of Minnesota Duluth, Computer Science Department
31 May 2003
Duluth Word Alignment System • Perl implementation of IBM Model 2 • Learns a probabilistic model from sentence-aligned parallel corpora • The parallel text consists of a source-language text and its translation into some target language • Determines the word alignments of the sentence pairs • Missing data problem • No examples of word alignments in the training data • Use the Expectation Maximization (EM) algorithm (a rough sketch follows)
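The EM loop can be pictured with a minimal sketch (not the authors' code): one iteration re-estimating the translation probabilities t(f|e), shown here in the style of the simpler IBM Model 1; Model 2 additionally collects counts for the position (alignment) probabilities. The function name and data structures are illustrative assumptions.

    use strict;
    use warnings;

    # @$pairs holds sentence pairs: each element is [ \@source_words, \@target_words ].
    # %$t holds the current translation probabilities t(f|e) as $t->{$f}{$e};
    # it must be initialized (e.g., uniformly) before the first iteration.
    sub em_iteration {
        my ($pairs, $t) = @_;
        my (%count, %total);

        # E-step: collect expected counts under the current model
        for my $pair (@$pairs) {
            my ($src, $tgt) = @$pair;
            for my $f (@$tgt) {
                my $norm = 0;
                $norm += ($t->{$f}{$_} // 0) for @$src;
                next unless $norm > 0;
                for my $e (@$src) {
                    my $frac = ($t->{$f}{$e} // 0) / $norm;
                    $count{$f}{$e} += $frac;
                    $total{$e}     += $frac;
                }
            }
        }

        # M-step: re-estimate t(f|e) from the expected counts
        for my $f (keys %count) {
            for my $e (keys %{ $count{$f} }) {
                $t->{$f}{$e} = $count{$f}{$e} / $total{$e};
            }
        }
    }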
IBM Model 2 • Takes into account • the probability that two words are translations of each other • how likely it is for words at particular positions in a sentence pair to align to each other • [Example diagram: words at positions 2 1 3 in one sentence aligned to positions 1 2 3 in the other]
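For reference, the Model 2 likelihood of a target sentence and an alignment given a source sentence has the standard form, where t is the translation probability, a the position (alignment) probability, l and m the source and target sentence lengths, and epsilon a normalization constant:

    \[ P(f, a \mid e) \;=\; \epsilon \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \, a(a_j \mid j, m, l) \]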
Distortion Factor • How far from its original (source) position a word is allowed to move when aligned • [Example diagram: source sentence and target sentence]
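As an illustration (a hypothetical helper, not the system's actual code), a distortion factor d can be applied by considering only source positions within d of a given target position:

    # For target position $j, return the source positions considered as
    # candidate alignments when word movement is limited to $d positions.
    sub candidate_positions {
        my ($j, $src_len, $d) = @_;
        my $lo = $j - $d < 1        ? 1        : $j - $d;
        my $hi = $j + $d > $src_len ? $src_len : $j + $d;
        return ($lo .. $hi);
    }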
Types of Alignments • Sure and Probable alignments • Sure: an alignment judged to be very likely • Probable: an alignment judged to be less certain • Our system does not make this distinction; it takes the highest-scoring alignment regardless of the value • No-null and Null alignments • Null alignments: source words that do not align to any word in the target sentence • Our system does not include null alignments • One-to-One and One-to-Many alignments • Our system includes one-to-many as well as one-to-one alignments (see the sketch after the diagram below)
Alignments • [Diagram: one-to-one, one-to-many, and many-to-one links between source words S1–S5 and target words T1–T5]
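Under this model each target word links to exactly one source word, so a single source word may end up linked to several target words (one-to-many) and no null links are produced. The most probable alignment can be read off position by position; this is a minimal sketch assuming trained tables %t and %a, not the system's actual code:

    # Return the highest-probability link for each target position under Model 2.
    sub best_alignment {
        my ($src, $tgt, $t, $a) = @_;
        my ($l, $m) = (scalar @$src, scalar @$tgt);
        my @align;
        for my $j (1 .. $m) {
            my ($best_i, $best_p) = (1, -1);
            for my $i (1 .. $l) {
                my $p = ($t->{ $tgt->[$j-1] }{ $src->[$i-1] } // 0)
                      * ($a->{$i}{$j}{$m}{$l} // 0);
                ($best_i, $best_p) = ($i, $p) if $p > $best_p;
            }
            push @align, [$j, $best_i];   # target position j -> source position best_i
        }
        return @align;
    }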
Data • English–French • Trained on a 5% subset of the Aligned Hansards of the 36th Parliament of Canada • Approximately 50,000 of the 1,200,000 given sentence pairs • Mixture of the House and Senate debates • We wanted to train the models on comparably sized data sets • Tested on 447 manually word-aligned sentence pairs • Romanian–English • Trained on all available training data (49,284 sentence pairs) • Tested on 248 manually word-aligned sentence pairs
Precision and Recall Results • Precision for the two language pairs was similar • This may reflect the fact that we used approximately the same amount of training data for each of the models • Recall for the English-French data was low • The system does not find alignments in which many English words align to one French word • This reduced the number of alignments made by the system in comparison to the number of alignments in the gold standard
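For reference, precision and recall are computed in the usual way over alignment links, where A is the set of links proposed by the system and G the set of links in the gold standard (the sure/probable distinction is ignored here, as it is by our system):

    \[ \text{precision} = \frac{|A \cap G|}{|A|}, \qquad \text{recall} = \frac{|A \cap G|}{|G|} \]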
Distortion Results • Precision and recall were not significantly affected by the distortion factor • A distortion factor of 0 resulted in lower precision and recall than a distortion factor of 2, 4, or 6 • Distortion factors of 2, 4, and 6 resulted in approximately the same precision and recall values for each of the language pairs • Distortion factors of 4 and 6 do not contribute any more information than a distortion factor of 2 • This suggests that word movement is limited
Conclusions on Training Data • Small amount of training data • We wanted to compare the Romanian-English and English-French results • Although the data for Romanian-English was different from the data for English-French, the results were comparable • We would like to increase the amount of training data to determine how much the results could be improved
Conclusions • Considering modifying the existing Perl implementation to allow for larger amounts of training data • Database approach • Berkeley DB • NDBM • Re-implementing the algorithm in the Perl Data Language (PDL) • A Perl module optimized for matrix and scientific computing • A sketch of the database approach follows
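A minimal sketch of the database approach, using Perl's standard DB_File tie interface to keep the translation-probability table in an on-disk Berkeley DB file rather than in memory (the file name and key format are illustrative assumptions):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie the translation-probability table to an on-disk Berkeley DB file.
    my %t;
    tie %t, 'DB_File', 'tprob.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie tprob.db: $!";

    # Keys must be flat strings, so a (target, source) word pair is packed
    # into a single tab-separated key.
    $t{ join("\t", "maison", "house") } = 0.25;          # store t(maison | house)
    my $p = $t{ join("\t", "maison", "house") } // 0;    # look it up later

    untie %t;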