480 likes | 513 Views
This paper explores methods for training translation models between languages A and C through a bridge language B. It introduces a lexicon based on transduction models of cognate pairs and bridge languages. The approach involves mapping words from a source language to a target language by leveraging cognate pairs and cognate distance. The study focuses on Romance language family pairs like English-Spanish, Portuguese-Italian, and French-Romanian, using various distance measures including Levenshtein distance and stochastic transducers. The method's limitation includes assuming cognate pairs with an edit distance of less than 3. Evaluation on Romance languages and other language pairs shows promising results. The algorithm involves selecting word pairs for testing and adapting metrics based on edit distance. The study introduces diverse similarity measures and bridge languages to enhance the translation lexicons.
E N D
Using Pivot/Bridge Languages Matthias Eck
General Problem • Resources are available between languages A and B… and between languages B and C… but not C and A • How to train translation models between C and A? A C B
1st paper Multipath Translation Lexicon Induction via Bridge Languages • Gideon S. Mann and David Yarowsky • NAACL 2001 • Method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages
Lexicon via Cognate pairs Lexicon: • Mapping of word in source language to words in target language Here: • Lexicon is built between arbitrary languages using models of cognate pairs and cognate distance
General idea dictionary cognate model Romance Family English Spanish Portuguese Italian French Romanian source bridge target
Translation pairs • Cognate pairs can make up significant portion of lexicon if languages are in the same family and close
Cognate string edit distance • Obvious condition for a good distance D • So we choose …as the translation for s
Used distance measures • L: Levenshtein distance • Minimum sum of the costs of edit operations required to transform one string into another • Deletion, Substitution, Insertion – traditional cost 1 • S: Stochastic transducers • Probabilistic costs for each possible edit operation • H: Hidden Markov Model • Each character has separate edit operation parameters
Distance Measures Variants of Levenshtein distance: • L-V: vowel substitution cost only: 0.5 • L-S/L-A: Filter probabilities obtained by S into 3 classes 0.5, 0.75, 1 • L-S: Each pair separately trained • L-A: Collectively trained for all Romance languages Limitation • Method cannot discover translation pairs with having no surface form relationship • Assumed cognate pairs: Levenshtein edit distance < 3 • Few false positives
Intra Family Translation Lexicon Induction • Family: Romance languages • Available: dictionary (English/Bridge language) • General evaluation algorithm: • Select 100 word pairs from dictionary for testing • For adaptive metrics: Select hypothesized word pairs (Edit distance < 3) as cognate pairs and train on them • For each word in source language select closest word from the 100 target words
Results Source Languages: • Spanish, French, Italian, Romanian Target Language: • Portuguese • 1000 word pairs in dictionary for Spanish/Portuguese • 900 for other language pairs
Results • Pure Levenshtein distance works surprisingly well • S gives boost on French-Portuguese • Reason could be that Spanish-Portuguese are closer than French-Portuguese • L-S usually best
Consonant-to-consonant • Consonant-to-consonant edit operations • Most probable for French – Portuguese
Multiple bridge languages dictionary cognate model Slavic Family English Czech Russian Ukrainian Polish Serbian source bridge target
Translation Lexicon Induction Algorithm (One or more bridge languages) For each word s S For each bridge language B Translate s → b B t T, Calculate D(b,t) Rank t by D(b,t) Score t using information from all bridges Select highest scored t Map s → t
Results • One bridge languages, but multiple pathes
Different Pathways • English to Portuguese (via Romance languages) • English to Norwegian (via Germanic languages) • English to Ukrainian (via Slavic languages) • Portuguese to English (via Germanic languages, French)
2nd Paper Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages • Charles Schafer and David Yarowsky • COLING 2002 • Improves results of first paper by introducing additional similarity scores between candidate translations
Basic Idea • Decompose: • P(English|Serbian) = P(English|Czech) x P(Czech|Serbian) • For any language L close to Czech: • P(English|L) = P(English|Czech) x P(Czech|L) • P (Czech|L) as presented was done using similarity on cognate pairs
Covered Languages Serbian Ukrainian English Slovene Czech Slovak Bulgarian Polish Punjabi Nepali Hindi Gujarati Bengali Marathi
Serbian – Czech – English Czech – English dictionary: 171k word pairs Corpora:English: 192M wordsSerbian: 12M(News data from web) Gujarati – Hindi – English Hindi – English dictionary:74k word pairs Corpora:Gujarati: 2M Resources
Problem with Cognate Pairs Serbian Czech English favor not correct prazan prizen grace pazen patronage blank prazdny correct empty
Idea Introduce additional similarity models • Weighted Levenshtein Similarity • Context Similarity • Date distributional Similarity • Relative frequency Similarity • Burstiness Similarity and Inverse Document Frequency • Use of Additional Bridge Languages • Combine them with weighted string distance
Weighted Levenshtein Similarity • 1. Iteration: Vowel cluster operations have half the cost of single consonant substitutions, insertions and deletions • dist(vowel+, vowel+) • Use highest weighted of the top 2000 to re-estimate edit weights • Some high probability substitutions:
Context Similarity Compare narrow and wide contexts for candidates Context: bag of words (Narrow: radius 1/ Wide: radius 10) • Calculate Context on Source Language (Serbian) • Translate to English using current estimations • Compare with English Contexts via Cosine Similarity
Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 Context in Serbian Corpus with frequencies
2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration justice sovereignty country ornamental Translate with Initial Lexicon
2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration 0 0 justice sovereignty country ornamental Independence 3 1 10 0 479 836 191 0 Freedom 681 184 104 0 21 4 141 0 expression Context of Candidates in English Corpus religion
2 1.5 1.5 1.5 4 1.5 Context Similarity - Example Nezavisnost pravo: 2 suvereniteti: 3 deklaracije: 3 pokrajina: 4 majesty declaration 0 0 justice sovereignty country ornamental COS Independence 3 1 10 0 479 836 191 0 Freedom 681 184 104 0 21 4 141 0 expression Cosine Similarity finds correct candidate (Independence) religion
Date distributional Similarity • News Data: • Events are reported in parallel in multiple languages (+/- 2 days) • Construct term frequency vectors over time and compare candidates
Relative Frequencies • Word and translation are likely to have similar relative frequencies • Modest frequency variations are expected • Useful to rule out pairings with several orders of magnitude difference in relative frequency • Ratio of logs of frequencies correlates well with translational compatibility
Relative Frequency Similarity • Correct translation “laud” has higher RF Score than higher ranked incorrect candidates “calibre”, “quarter” and “class”
Burstiness Similarity • Define Burstiness to measure differences
Burstiness Similarity • Burstiness matches better for correct translations “laud” and “praise”
Combine the different measures • Weighted Levenshtein distance to get initial candidate pairs • Calculate 8 similarity measures • Weighted Levenshtein • Wide bag-of-words context similarity • Narrow bag of words context similarity • Local News date distribution similarity • All News date distribution similarity • IDF similarity • Burstiness similarity
Combine the different measures • Integrate similarity measures into a single similarity function: • POS SimilarityBias in favor of compatible parts of speech (N, V, ADJ)Penalty for non-matching candidates • Sort candidates for each score in decreasing orderAssign Ranks 0,1,… and normalize by count • Scoring: Similarity models have associated weights
Evaluation 3 Evaluation Criteria • Exact Match Accuracy • Percentage of correct English in the top k ranks • Median Position of the per word highest ranked correct translation
Results • Improvements with second bridge language
Additional Bridge Language Work Interlingua based Statistical Machine Translation • Manuel Kauers, Stephan Vogel, Christian Fügen, Alex Waibel • ICSLP 2002 • Paper covers SMT from Text to a structured Interlingua format (IF) • Corpus English/IF is available…but we also want to translate other languages into IF? English IF
Generalized problem • Assume we have translation model F to E and G to F… but we want G to E? • Decompose: • Because: E G F
And just translating… • Experiments done during PF-STAR project 2003/2004 • Training data: 48k lines of BTEC data • Test data: 506 lines, Test set for CSTAR 2003 • Translating Chinese → Italian • Also via a bridge language Chinese → English → Italian