A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups • Pascale Fung, Kathleen McKeown • Machine Translation, 1997 • Advisor: Dr. Hsu • Student: Sheng-Hsuan Wang • Department of Information Management
Outline • Motivation • Objective • Introduction • Related work • Noisy parallel corpora across language groups • Algorithm overview • Experiments • Conclusion
Motivation • Technical term translation is a difficult task. • Translators need high-quality, domain-specific terminology, • which is not adequately covered by printed dictionaries, • especially for terms that must be drawn from noisy parallel corpora. • Examples: • Hong Kong Governor / 香港總督 • Basic Law / 基本法 • Green Paper / 綠皮書
Objective • This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.
1. Introduction • Technical terms • often cannot be translated on a word-by-word basis; • the individual words of a term may have many possible translations. • Example: Governor • 總督, 主管 (top manager), 總裁 (chief), 州長 (governor of a state) • Hong Kong Governor – 香港總督 • Domain-specific terms • Basic Law / 基本法 • Green Paper / 綠皮書
1. Introduction • An algorithm for translating technical terms given a noisy parallel corpus as input • Notion • A word and its translation will not occur at exactly the same positions in the two halves of the corpus, • but the distances between instances of the same word will be similar across languages. • Method • Find word correlations, then build technical term translations. • Dynamic time warping (DTW) algorithm. • Reliable anchor points.
2. Related work • Sentence alignment • Segment alignment • Word and term translation • Word alignment • Phrase translation
2.1. Sentence alignment • Two main approaches • Text-based: use of lexical information (dictionary) • Use paired lexical indicators across the languages to find matching sentences. • Length-based: use of the total number of characters (words) • Make the assumption that translated sentences in the parallel corpus will be of approximately the same, or constantly related, length.
2.2. Segment alignment • Church (1993) shows that a text can be aligned by using delimiters. • Segment alignment is more appropriate for aligning noisy corpora. • The problem is finding reliable anchor points that can be used for Asian/Romance language pairs.
2.3. Word and term translation • Some algorithms used for alignment produce a small bilingual lexicon. • Some others use sentence-aligned parallel text. • Most of the following algorithms require clean, sentence-aligned parallel text input.
2.4. Word alignment • [Brown et al. 1990, Brown et al. 1993] • [Gale & Church 1991] • [Dagan et al. 1993] • [Wu & Xia 1994] • Various filtering techniques are used to improve the matching.
2.5. Phrase translation • [Kupiec 1993] • [Smadja & McKeown 1993] • [Dagan & Church 1994] • All the work described in this section assumes a clean, parallel corpus as input.
3. Noisy parallel corpora across language groups • Previous approaches lack robustness • against structural noise in parallel corpora, • and against language pairs which do not share etymological roots. • Problems remain with • bilingual texts which are translations of each other but were not translated sentence by sentence, • and with language robustness.
3. Noisy parallel corpora across language groups • Two noisy parallel corpora • The English version of the AWK manual and its Japanese translation. • Parts of the HKUST English-Chinese Bilingual Corpora.
4. Algorithm overview • Treat the domain word translation problem as a pattern matching problem • Each word shares some common features with its counterpart in the translated text. • To find the best representations of these features and the best ways to match them.
Algorithm overview (flow diagram) • Preprocessing: tag the English half of the corpus and form an English word list; tokenize the Japanese/Chinese half and form a word list. • Steps 1–4: 1. compile a primary lexicon, 2. find anchor points for alignment, 3. align the text, 4. compile a secondary lexicon. • 5. Compile non-linear segment boundaries with high-frequency word pairs. • 6. Compile the bilingual word lexicon. • 7. Suggest a word list for each technical term to the translator.
5. Extracting technical terms from English text • To find domain-specific terms, the English part of the corpus is tagged with a modified POS tagger • and the noun phrases which are most likely to be technical terms are extracted. • Translations are then sought only for words which are part of these terms (a sketch of this step follows).
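Below is a minimal sketch of this extraction step, assuming the English half has already been tagged into (word, tag) pairs; the Penn-style tags and the adjective/noun run pattern are illustrative assumptions, not the paper's exact tagger or grammar.

```python
def extract_noun_phrases(tagged_tokens):
    """Collect maximal runs of adjective/noun tags as candidate technical terms.

    tagged_tokens: list of (word, tag) pairs produced by a POS tagger.
    Treating any JJ*/NN* run that contains a noun as a candidate term is an
    illustrative approximation, not the paper's exact extraction grammar.
    """
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith(("NN", "JJ")):
            current.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in current):
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if any(t.startswith("NN") for _, t in current):
        phrases.append(" ".join(w for w, _ in current))
    return phrases

# Example built from the slides' running examples.
tagged = [("the", "DT"), ("Hong", "NNP"), ("Kong", "NNP"), ("Governor", "NNP"),
          ("issued", "VBD"), ("the", "DT"), ("Green", "NNP"), ("Paper", "NNP")]
print(extract_noun_phrases(tagged))  # ['Hong Kong Governor', 'Green Paper']
```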
6. Tokenization of Chinese and Japanese texts • Tokenization of the Chinese text is done by using a statistically augmented dictionary-based tokenizer which is able to recognize frequent domain words. • Example: 基本法/Basic Law • The Japanese text is tokenized by JUMAN without domain word augmentation.
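For illustration, here is a greedy longest-match (maximum matching) tokenizer over a small domain dictionary, showing how an augmented dictionary keeps 基本法 as a single token; the paper's tokenizer is statistically augmented and Japanese is handled by JUMAN, so this is only a sketch of the dictionary-based idea.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

# A domain-augmented dictionary keeps 基本法 ("Basic Law") as one token.
dictionary = {"基本法", "香港", "總督", "綠皮書"}
print(max_match("香港總督與基本法", dictionary))  # ['香港', '總督', '與', '基本法']
```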
7. A rough word pair based alignment • Treat translation as a pattern matching task. • The task is to find a representation and a similarity measure that can identify word pairs to serve as anchor points.
7.1. Dynamic Recency Vectors • Governor • Position vector: <2380, 2390, 2463, …>, of length 212 • Recency vector: <10, 73, 102, …> • 總督 • Position vector: <90, 2021, 2150, …>, of length 254 • Recency vector: <1931, 129, 8, …>
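A minimal sketch of turning a word's position vector into its recency vector (successive position differences); the numbers below are the slide's own Governor / 總督 values.

```python
def recency_vector(positions):
    """Differences between consecutive occurrence positions of a word."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Positions of "Governor" in the English half (first values from the slide).
governor_en = [2380, 2390, 2463]
print(recency_vector(governor_en))   # [10, 73]

# Positions of 總督 in the Chinese half.
governor_ch = [90, 2021, 2150]
print(recency_vector(governor_ch))   # [1931, 129]
```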
Recency vector signals • [Figure: recency vector plots for Governor.ch, Governor.en, Bill.ch, President.en]
7.2. Matching Recency Vectors • Dynamic time warping (DTW) • takes two vectors of lengths N and M and finds an optimal path through the N-by-M trellis, starting from (1,1) and ending at (N,M). • [Figure: DTW alignment between the recency vectors of 總督 and Governor]
DTW algorithm • Initialization • Costs are initialized according to the recency vector values.
DTW algorithm • Recursion • Accumulate the cost of the DTW path.
DTW algorithm • Termination • The final cost of the DTW path is normalized by the length of the path.
DTW algorithm • Path reconstruction • Reconstruct the DTW path and obtain the points on the path, • which are used later for finding anchor points and eliminating noise.
DTW algorithm • For each word vector in language A, the word vector in language B with the lowest DTW score is taken to be its translation. • The bilingual word pairs obtained from the above stages are thresholded, and the more reliable pairs are stored as the primary bilingual lexicon (a DTW sketch follows).
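A compact sketch of the DTW matching described in the last few slides, assuming a simple absolute-difference local cost, the standard three-way recursion, and normalization by path length; the paper's exact cost function and constraints may differ.

```python
def dtw_score(v1, v2):
    """Length-normalized DTW cost between two recency vectors.

    Local cost is the absolute difference; normalizing by (n + m) is an
    assumption standing in for the paper's exact scheme. The path backtrace
    (used later for anchor points) is omitted here.
    """
    n, m = len(v1), len(v2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
    return cost[n][m] / (n + m)

def best_translation(src_vector, candidate_vectors):
    """Pick the candidate word whose recency vector has the lowest DTW score."""
    return min(candidate_vectors,
               key=lambda w: dtw_score(src_vector, candidate_vectors[w]))

# Toy example: 總督 should score better against Governor than an unrelated word.
governor_en = [10, 73, 102, 5, 40]
candidates = {"總督": [12, 70, 110, 7, 38], "綠皮書": [500, 3, 900, 2, 1]}
print(best_translation(governor_en, candidates))  # 總督
```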
7.3. Statistical filters • To reduce the complexity, constraints are incorporated to filter the set of possible pairs (a hedged sketch follows this list): • starting point constraint, i.e., position constraint; • length constraint, i.e., frequency constraint; • means/standard deviation constraint.
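One hedged reading of these filters as code: quick checks on frequency, normalized starting position, and the mean of the recency values prune candidate pairs before the expensive DTW step. All threshold values here are my own illustrative assumptions, not the paper's.

```python
import statistics

def passes_filters(pos_a, pos_b, len_a, len_b,
                   freq_ratio=2.0, start_window=0.2, mean_ratio=2.0):
    """Cheap pre-DTW filters on one candidate word pair.

    pos_a / pos_b: occurrence positions of the two words; len_a / len_b:
    lengths of the two corpus halves. Thresholds are illustrative assumptions.
    """
    if min(len(pos_a), len(pos_b)) < 2:
        return False
    # Length (frequency) constraint: occurrence counts should be comparable.
    if max(len(pos_a), len(pos_b)) > freq_ratio * min(len(pos_a), len(pos_b)):
        return False
    # Starting point (position) constraint: first occurrences should be close
    # once positions are normalized by the length of each half.
    if abs(pos_a[0] / len_a - pos_b[0] / len_b) > start_window:
        return False
    # Mean constraint on the recency values (a standard-deviation check
    # could be added in the same way).
    rec_a = [b - a for a, b in zip(pos_a, pos_a[1:])]
    rec_b = [b - a for a, b in zip(pos_b, pos_b[1:])]
    m_a, m_b = statistics.mean(rec_a), statistics.mean(rec_b)
    return max(m_a, m_b) <= mean_ratio * min(m_a, m_b)
```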
8. Finding anchor points and eliminating noise • The primary lexicon is used for aligning the segments in the corpus: • anchor points are found on the DTW paths which divide the texts into multiple aligned segments for the secondary lexicon. • An anchor point (i, j) is kept only if it satisfies the following (a hedged sketch follows): • slope constraint • continuity constraint • window size constraint • offset constraint
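A hedged sketch of how candidate anchor points (i, j) taken from the DTW paths might be filtered; the "slope" is taken to be the ratio of the two text lengths, and the tolerance values are illustrative assumptions (a window-size check on the spacing of successive anchors could be added in the same style).

```python
def filter_anchor_points(points, len_a, len_b, slope_tol=0.3, max_offset=2000):
    """Keep anchor points (i, j) that look like a plausible alignment path.

    len_a / len_b: lengths of the two corpus halves; slope_tol and
    max_offset are illustrative assumptions, not the paper's thresholds.
    """
    expected_slope = len_b / len_a
    kept, prev = [], (0, 0)
    for i, j in sorted(points):
        if i <= prev[0] or j <= prev[1]:                # continuity constraint
            continue
        if abs(j / i - expected_slope) > slope_tol:     # slope constraint
            continue
        if abs(j - i * expected_slope) > max_offset:    # offset constraint
            continue
        kept.append((i, j))
        prev = (i, j)
    return kept
```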
8. Finding anchor points and eliminating noise • [Figure: text alignment paths for the AWK and HKUST corpora, showing all word pairs versus the path after filtering]
9. Finding bilingual word pair matches • To obtain the secondary and final bilingual word lexicon • A non-linear K segment binary vector representation for each word. • A similarity measure to compute word pair correlations.
9.1. Non-Linear K segments • The anchor points <(i1, j1), (i2, j2), …, (ik, jk)> divide the bilingual corpus into k+1 non-linear segments, where the i's are positions in text1 and the j's are positions in text2. • The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency.
9.2. Non-Linear segment binary vectors • The occurrences of a pair of translated words in the bilingual corpus are used to compute the correlation between the two words, • i.e., the probability Pr(ws, wt) of the pair occurring in the same place in the corpus. • Each word is represented by a binary vector over the K segments, where the i-th bit is set to 1 if the word is found in the i-th segment; Pr(ws, wt) comes from the segments where both words are found (a sketch of the vector construction follows). • [Figure: binary segment vector for governor over the K segments]
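A minimal sketch of building these vectors from the anchor positions of one half of the corpus: k anchors define k+1 segments and bit i records whether the word occurs in segment i (the helper names are mine, not the paper's code).

```python
import bisect

def segment_vector(word_positions, boundaries):
    """Binary vector over the segments defined by sorted anchor positions.

    k anchor positions define k + 1 non-linear segments; bit i is 1 iff
    the word occurs somewhere in segment i.
    """
    bits = [0] * (len(boundaries) + 1)
    for pos in word_positions:
        bits[bisect.bisect_right(boundaries, pos)] = 1
    return bits

# Toy example: 3 anchor positions -> 4 segments.
anchors_en = [100, 250, 400]
print(segment_vector([40, 390], anchors_en))        # [1, 0, 1, 0]
print(segment_vector([120, 260, 520], anchors_en))  # [0, 1, 1, 1]
```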
9.2. Non-Linear segment binary vectors (cont.) • A 2×2 table of presence (T) / absence (F) of the source word and the target word across the K segments is built; cell a counts the segments in which both words occur. • If the source and target words are good translations of one another, then a should be large.
9.3. Binary vector correlation measure • The similarity measure is a weighted mutual information score computed over the binary segment vectors (a hedged sketch follows).
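The formula itself did not survive into the slide text. A common form of weighted mutual information over the two binary segment vectors is W(ws, wt) = Pr(ws, wt) · log2( Pr(ws, wt) / (Pr(ws) Pr(wt)) ), with the probabilities estimated from segment counts (cell a of the previous slide over K). The sketch below assumes that form; the paper's exact estimation and weighting may differ.

```python
import math

def weighted_mutual_information(vec_s, vec_t):
    """Weighted MI between two binary segment vectors.

    Uses a standard weighted-MI form; the paper's exact estimation may differ.
    """
    k = len(vec_s)
    a = sum(1 for s, t in zip(vec_s, vec_t) if s and t)  # segments with both words
    p_st = a / k
    p_s = sum(vec_s) / k
    p_t = sum(vec_t) / k
    if p_st == 0 or p_s == 0 or p_t == 0:
        return 0.0
    return p_st * math.log2(p_st / (p_s * p_t))

# Governor vs. 總督 should score higher than Governor vs. an unrelated word.
governor_en = [1, 1, 0, 1, 0, 1]
governor_ch = [1, 1, 0, 1, 0, 0]
unrelated   = [0, 0, 1, 0, 1, 0]
print(weighted_mutual_information(governor_en, governor_ch))  # positive score
print(weighted_mutual_information(governor_en, unrelated))    # 0.0
```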
Conclusion • A technique to align noisy parallel corpora by segments and to extract a bilingual word lexicon from them. • The sentence alignment step is replaced by a rough segment alignment, • which needs no sentence boundary information and tolerates noise. • Highly reliable anchor points found with DTW serve as segment delimiters.
Personal opinion • Valuable idea • Treating the domain word translation problem as a pattern matching problem. • Contribution • Robustness across language groups and to noisy parallel corpora. • Drawback • The method is long and complex.