A Technical Word and Term Translation Aid using Noisy Parallel Corpora across Language Groups • Pascale Fung, Kathleen McKeown • Machine Translation, 1997 • Advisor: Dr. Hsu • Student: Sheng-Hsuan Wang • Department of Information Management
Outline • Motivation • Objective • Introduction • Related work • Noisy parallel corpora across language groups • Algorithm overview • Experiments • Conclusion
Motivation • Technical term translation is a difficult task. • Translators need high-quality, domain-specific terminology, • which is not adequately covered by printed dictionaries, • especially for terms that must be drawn from noisy parallel corpora. • Examples: • Hong Kong Governor / 香港總督 • Basic Law / 基本法 • Green Paper / 綠皮書
Objective • This paper describes an algorithm for translating technical words and terms from noisy parallel corpora across language groups.
1. Introduction • Technical terms • often cannot be translated on a word-by-word basis; • the individual words of a term may have many possible translations. • Example: Governor • 總督, 主管 (top manager), 總裁 (chief), 州長 (governor of a state) • Hong Kong Governor – 香港總督 • Domain-specific terms • Basic Law / 基本法 • Green Paper / 綠皮書
1. Introduction • An algorithm for translating technical terms given a noisy parallel corpus as input • Notion • A word and its translation will not occur at exactly the same positions in the two halves of the corpus, • but the distances between instances of the same word will be similar across languages. • Method • Find word correlations, then build technical term translations. • Dynamic time warping (DTW) algorithm. • Reliable anchor points.
2. Related work • Sentence alignment • Segment alignment • Word and term translation • Word alignment • Phrase translation
2.1. Sentence alignment • Two main approaches • Text-based: use of lexical information (dictionary) • Use paired lexical indicators across the languages to find matching sentences. • Length-based: use of the total number of characters (words) • Make the assumption that translated sentences in the parallel corpus will be of approximately the same, or constantly related, length.
2.2. Segment alignment • Church (1993) shows that a text can be aligned by using delimiters. • Segment alignment is more appropriate for aligning noisy corpora. • The problem is finding reliable anchor points that can be used for Asian/Romance language pairs.
2.3. Word and term translation • Some algorithms used for alignment produce a small bilingual lexicon. • Some others use sentence-aligned parallel text. • Most of the following algorithms require clean, sentence-aligned parallel text input.
2.4. Word alignment • [Brown et al. 1990, Brown et al. 1993] • [Gale & Church 1991] • [Dagan et al. 1993] • [Wu & Xia 1994] • Various filtering techniques are used to improve the matching.
2.5. Phrase translation • [Kupiec 1993] • [Smadja & McKeown 1993] • [Dagan & Church 1994] • All the work described in this section assumes a clean, parallel corpus as input.
3. Noisy parallel corpora across language groups • Previous approaches lack robustness • against structural noise in parallel corpora, • and against language pairs which do not share etymological roots. • Problems remain with • bilingual texts which are translations of each other but were not translated sentence by sentence, • and with language robustness.
3. Noisy parallel corpora across language groups • Two noisy parallel corpora • The English version of the AWK manual and its Japanese translation. • Parts of the HKUST English-Chinese Bilingual Corpora.
4. Algorithm overview • Treat the domain word translation problem as a pattern matching problem • Each word shares some common features with its counterpart in the translated text. • To find the best representations of these features and the best ways to match them.
Algorithm overview (flow diagram) • Preprocessing: tag the English half of the corpus and form an English word list; tokenize the Japanese/Chinese half and form a word list. • Steps 1–4: 1. compile a primary lexicon, 2. find anchor points for alignment, 3. align the text, 4. compile a secondary lexicon. • 5. Compile non-linear segment boundaries with high-frequency word pairs. • 6. Compile the bilingual word lexicon. • 7. Suggest a word list for each technical term to the translator.
5. Extracting technical terms from English text • To find domain-specific terms, the English part of the corpus is tagged with a modified POS tagger • and the noun phrases which are most likely to be technical terms are extracted. • Translations are then sought only for words which are part of these terms (a sketch of this step follows).
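Below is a minimal sketch of this extraction step, assuming the English half has already been tagged into (word, tag) pairs; the Penn-style tags and the adjective/noun run pattern are illustrative assumptions, not the paper's exact tagger or grammar.

```python
def extract_noun_phrases(tagged_tokens):
    """Collect maximal runs of adjective/noun tags as candidate technical terms.

    tagged_tokens: list of (word, tag) pairs produced by a POS tagger.
    Treating any JJ*/NN* run that contains a noun as a candidate term is an
    illustrative approximation, not the paper's exact extraction grammar.
    """
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith(("NN", "JJ")):
            current.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in current):
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if any(t.startswith("NN") for _, t in current):
        phrases.append(" ".join(w for w, _ in current))
    return phrases

# Example built from the slides' running examples.
tagged = [("the", "DT"), ("Hong", "NNP"), ("Kong", "NNP"), ("Governor", "NNP"),
          ("issued", "VBD"), ("the", "DT"), ("Green", "NNP"), ("Paper", "NNP")]
print(extract_noun_phrases(tagged))  # ['Hong Kong Governor', 'Green Paper']
```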
6. Tokenization of Chinese and Japanese texts • Tokenization of the Chinese text is done by using a statistically augmented dictionary-based tokenizer which is able to recognize frequent domain words. • Example: 基本法/Basic Law • The Japanese text is tokenized by JUMAN without domain word augmentation.
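For illustration, here is a greedy longest-match (maximum matching) tokenizer over a small domain dictionary, showing how an augmented dictionary keeps 基本法 as a single token; the paper's tokenizer is statistically augmented and Japanese is handled by JUMAN, so this is only a sketch of the dictionary-based idea.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

# A domain-augmented dictionary keeps 基本法 ("Basic Law") as one token.
dictionary = {"基本法", "香港", "總督", "綠皮書"}
print(max_match("香港總督與基本法", dictionary))  # ['香港', '總督', '與', '基本法']
```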
7. A rough word pair based alignment • Treat translation as a pattern matching task. • The task is to find a representation and a similarity measure that can identify word pairs to serve as anchor points.
7.1. Dynamic Recency Vectors • Governor • Position vector: <2380, 2390, 2463, …>, of length 212 • Recency vector: <10, 73, 102, …> • 總督 • Position vector: <90, 2021, 2150, …>, of length 254 • Recency vector: <1931, 129, 8, …>
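A minimal sketch of turning a word's position vector into its recency vector (successive position differences); the numbers below are the slide's own Governor / 總督 values.

```python
def recency_vector(positions):
    """Differences between consecutive occurrence positions of a word."""
    return [b - a for a, b in zip(positions, positions[1:])]

# Positions of "Governor" in the English half (first values from the slide).
governor_en = [2380, 2390, 2463]
print(recency_vector(governor_en))   # [10, 73]

# Positions of 總督 in the Chinese half.
governor_ch = [90, 2021, 2150]
print(recency_vector(governor_ch))   # [1931, 129]
```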
Recency vector signals • [Figure: recency vector plots for Governor.ch, Governor.en, Bill.ch, President.en]
7.2. Matching Recency Vectors • Dynamic time warping (DTW) • takes two vectors of lengths N and M and finds an optimal path through the N-by-M trellis, starting from (1,1) and ending at (N,M). • [Figure: DTW alignment between the recency vectors of 總督 and Governor]
DTW algorithm • Initialization • Costs are initialized according to the recency vector values.
DTW algorithm • Recursion • Accumulate the cost of the DTW path.
DTW algorithm • Termination • The final cost of the DTW path is normalized by the length of the path.
DTW algorithm • Path reconstruction • Reconstruct the DTW path and obtain the points on the path, • which are used later for finding anchor points and eliminating noise.
DTW algorithm • For each word vector in language A, the word vector in language B with the lowest DTW score is taken to be its translation. • The bilingual word pairs obtained from the above stages are thresholded, and the more reliable pairs are stored as the primary bilingual lexicon (a DTW sketch follows).
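A compact sketch of the DTW matching described in the last few slides, assuming a simple absolute-difference local cost, the standard three-way recursion, and normalization by path length; the paper's exact cost function and constraints may differ.

```python
def dtw_score(v1, v2):
    """Length-normalized DTW cost between two recency vectors.

    Local cost is the absolute difference; normalizing by (n + m) is an
    assumption standing in for the paper's exact scheme. The path backtrace
    (used later for anchor points) is omitted here.
    """
    n, m = len(v1), len(v2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(v1[i - 1] - v2[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
    return cost[n][m] / (n + m)

def best_translation(src_vector, candidate_vectors):
    """Pick the candidate word whose recency vector has the lowest DTW score."""
    return min(candidate_vectors,
               key=lambda w: dtw_score(src_vector, candidate_vectors[w]))

# Toy example: 總督 should score better against Governor than an unrelated word.
governor_en = [10, 73, 102, 5, 40]
candidates = {"總督": [12, 70, 110, 7, 38], "綠皮書": [500, 3, 900, 2, 1]}
print(best_translation(governor_en, candidates))  # 總督
```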
7.3. Statistical filters • To reduce the complexity, constraints are incorporated to filter the set of possible pairs (a hedged sketch follows this list): • starting point constraint, i.e., position constraint; • length constraint, i.e., frequency constraint; • means/standard deviation constraint.
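One hedged reading of these filters as code: quick checks on frequency, normalized starting position, and the mean of the recency values prune candidate pairs before the expensive DTW step. All threshold values here are my own illustrative assumptions, not the paper's.

```python
import statistics

def passes_filters(pos_a, pos_b, len_a, len_b,
                   freq_ratio=2.0, start_window=0.2, mean_ratio=2.0):
    """Cheap pre-DTW filters on one candidate word pair.

    pos_a / pos_b: occurrence positions of the two words; len_a / len_b:
    lengths of the two corpus halves. Thresholds are illustrative assumptions.
    """
    if min(len(pos_a), len(pos_b)) < 2:
        return False
    # Length (frequency) constraint: occurrence counts should be comparable.
    if max(len(pos_a), len(pos_b)) > freq_ratio * min(len(pos_a), len(pos_b)):
        return False
    # Starting point (position) constraint: first occurrences should be close
    # once positions are normalized by the length of each half.
    if abs(pos_a[0] / len_a - pos_b[0] / len_b) > start_window:
        return False
    # Mean constraint on the recency values (a standard-deviation check
    # could be added in the same way).
    rec_a = [b - a for a, b in zip(pos_a, pos_a[1:])]
    rec_b = [b - a for a, b in zip(pos_b, pos_b[1:])]
    m_a, m_b = statistics.mean(rec_a), statistics.mean(rec_b)
    return max(m_a, m_b) <= mean_ratio * min(m_a, m_b)
```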
8. Finding anchor points and eliminating noise • The primary lexicon is used for aligning the segments in the corpus: • anchor points are found on the DTW paths which divide the texts into multiple aligned segments for the secondary lexicon. • An anchor point (i, j) is kept only if it satisfies the following (a hedged sketch follows): • slope constraint • continuity constraint • window size constraint • offset constraint
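A hedged sketch of how candidate anchor points (i, j) taken from the DTW paths might be filtered; the "slope" is taken to be the ratio of the two text lengths, and the tolerance values are illustrative assumptions (a window-size check on the spacing of successive anchors could be added in the same style).

```python
def filter_anchor_points(points, len_a, len_b, slope_tol=0.3, max_offset=2000):
    """Keep anchor points (i, j) that look like a plausible alignment path.

    len_a / len_b: lengths of the two corpus halves; slope_tol and
    max_offset are illustrative assumptions, not the paper's thresholds.
    """
    expected_slope = len_b / len_a
    kept, prev = [], (0, 0)
    for i, j in sorted(points):
        if i <= prev[0] or j <= prev[1]:                # continuity constraint
            continue
        if abs(j / i - expected_slope) > slope_tol:     # slope constraint
            continue
        if abs(j - i * expected_slope) > max_offset:    # offset constraint
            continue
        kept.append((i, j))
        prev = (i, j)
    return kept
```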
8. Finding anchor points and eliminating noise • [Figure: text alignment paths for the AWK and HKUST corpora, showing all word pairs versus the path after filtering]
9. Finding bilingual word pair matches • To obtain the secondary and final bilingual word lexicon • A non-linear K segment binary vector representation for each word. • A similarity measure to compute word pair correlations.
9.1. Non-Linear K segments • The anchor points <(i1, j1), (i2, j2), …, (ik, jk)> divide the bilingual corpus into k+1 non-linear segments, where the i's are positions in text1 and the j's are positions in text2. • The algorithm then proceeds to obtain a secondary bilingual lexicon, considering words of both high and low frequency.
9.2. Non-Linear segment binary vectors • The occurrences of a pair of translated words in the bilingual corpus are used to compute the correlation between the two words, • i.e., the probability Pr(ws, wt) of the pair occurring in the same place in the corpus. • Each word is represented by a binary vector over the K segments, where the i-th bit is set to 1 if the word is found in the i-th segment; Pr(ws, wt) comes from the segments where both words are found (a sketch of the vector construction follows). • [Figure: binary segment vector for governor over the K segments]
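A minimal sketch of building these vectors from the anchor positions of one half of the corpus: k anchors define k+1 segments and bit i records whether the word occurs in segment i (the helper names are mine, not the paper's code).

```python
import bisect

def segment_vector(word_positions, boundaries):
    """Binary vector over the segments defined by sorted anchor positions.

    k anchor positions define k + 1 non-linear segments; bit i is 1 iff
    the word occurs somewhere in segment i.
    """
    bits = [0] * (len(boundaries) + 1)
    for pos in word_positions:
        bits[bisect.bisect_right(boundaries, pos)] = 1
    return bits

# Toy example: 3 anchor positions -> 4 segments.
anchors_en = [100, 250, 400]
print(segment_vector([40, 390], anchors_en))        # [1, 0, 1, 0]
print(segment_vector([120, 260, 520], anchors_en))  # [0, 1, 1, 1]
```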
9.2. Non-Linear segment binary vectors (cont.) • A 2×2 table of presence (T) / absence (F) of the source word and the target word across the K segments is built; cell a counts the segments in which both words occur. • If the source and target words are good translations of one another, then a should be large.
9.3. Binary vector correlation measure • The similarity measure is a weighted mutual information score computed over the binary segment vectors (a hedged sketch follows).
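The formula itself did not survive into the slide text. A common form of weighted mutual information over the two binary segment vectors is W(ws, wt) = Pr(ws, wt) · log2( Pr(ws, wt) / (Pr(ws) Pr(wt)) ), with the probabilities estimated from segment counts (cell a of the previous slide over K). The sketch below assumes that form; the paper's exact estimation and weighting may differ.

```python
import math

def weighted_mutual_information(vec_s, vec_t):
    """Weighted MI between two binary segment vectors.

    Uses a standard weighted-MI form; the paper's exact estimation may differ.
    """
    k = len(vec_s)
    a = sum(1 for s, t in zip(vec_s, vec_t) if s and t)  # segments with both words
    p_st = a / k
    p_s = sum(vec_s) / k
    p_t = sum(vec_t) / k
    if p_st == 0 or p_s == 0 or p_t == 0:
        return 0.0
    return p_st * math.log2(p_st / (p_s * p_t))

# Governor vs. 總督 should score higher than Governor vs. an unrelated word.
governor_en = [1, 1, 0, 1, 0, 1]
governor_ch = [1, 1, 0, 1, 0, 0]
unrelated   = [0, 0, 1, 0, 1, 0]
print(weighted_mutual_information(governor_en, governor_ch))  # positive score
print(weighted_mutual_information(governor_en, unrelated))    # 0.0
```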
Conclusion • A technique to align noisy parallel corpora by segments and to extract a bilingual word lexicon from them. • The sentence alignment step is replaced by a rough segment alignment, • which needs no sentence boundary information and tolerates noise. • Highly reliable anchor points found with DTW serve as segment delimiters.
Personal opinion • Valuable idea • Treating the domain word translation problem as a pattern matching problem. • Contribution • Robustness across language groups and to noisy parallel corpora. • Drawback • The method is long and complex.