Cognates and Word Alignment in Bitexts

Greg Kondrak, University of Alberta

Outline • Background • Improving LCSR • Cognates vs. word alignment links • Experiments & results

Motivation • Claim: words that are orthographically similar are more likely to be mutual translations than words that are not similar. • Reason: existence of cognates, which are usually orthographically and semantically similar. • Use: Considering cognates can improve word alignment and translation models.

Objective • Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

MT applications • sentence alignment • word alignment • improving translation models • inducing translation lexicons • aid in manual alignment

Cognates • Similar in orthography or pronunciation. • Often mutual translations. • May include genetic cognates, lexical loans, names, numbers, and punctuation.

The task of cognate identification • Input: two words • Output: the likelihood that they are cognate • One method: compute their orthographic/phonetic/semantic similarity

Scope The measures that we consider • are language-independent • are orthography-based • operate on the level of individual letters • compare letters with a binary identity function

Similarity measures • Prefix method • Dice coefficient • Longest Common Subsequence Ratio (LCSR) • Edit distance • Phonetic alignment • Many other methods

IDENT • 1 if two words are identical, 0 otherwise • The simplest similarity measure • e.g. IDENT(colour, couleur) = 0

PREFIX • The ratio of the length of the longest common prefix of two words to the length of the longer word • e.g. PREFIX(colour, couleur) = 2/7 = 0.29
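A minimal Python sketch of PREFIX as defined above. The function name and the choice to return 0 for an empty word are assumptions made for illustration, not part of the talk.

```python
def prefix_similarity(a: str, b: str) -> float:
    """Longest common prefix length divided by the longer word's length."""
    if not a or not b:
        return 0.0  # assumption: an empty word has no similarity to anything
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


# "colour" and "couleur" share the prefix "co", so the ratio is 2/7.
print(prefix_similarity("colour", "couleur"))
```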

DICE coefficient • The ratio of twice the number of shared letter bigrams to the total number of letter bigrams in both words • e.g. DICE(colour, couleur) = 6/11 = 0.55
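The 6/11 in the example is consistent with the usual Dice formulation: twice the shared bigrams over the total bigrams of both words. The sketch below assumes that formulation and counts repeated bigrams as a multiset. Both choices, and the function names, are assumptions.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of adjacent letter pairs in a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over letter bigrams: 2 * shared / total."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0  # assumption: words shorter than two letters score 0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2 * shared / total


# colour has 5 bigrams, couleur has 6, and they share co, ou, ur.
print(dice("colour", "couleur"))
```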

Longest Common Subsequence Ratio (LCSR) • The ratio of the length of the longest common subsequence of two words to the length of the longer word. • e.g. LCSR(colour, couleur) = 5/7 = 0.71
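LCSR can be computed with the standard dynamic-programming recurrence for the longest common subsequence. The following is a sketch with illustrative function names, and it returns 0 when both words are empty (an assumption).

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer word."""
    longer = max(len(a), len(b))
    return lcs_length(a, b) / longer if longer else 0.0


# The longest common subsequence of colour and couleur is "colur" (length 5).
print(lcsr("colour", "couleur"))
```

The same function gives 0.8 for both (walls, allés) and (sanctuary, sanctuaire), which is the length insensitivity discussed on the next slide.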

LCSR • Method of choice in several papers • Weak point: insensitive to word length • Example: a short, unrelated pair scores the same as a long, related pair • LCSR(walls, allés) = 0.8 • LCSR(sanctuary, sanctuaire) = 0.8 • Sometimes a minimum word length is imposed • A principled solution?

The random model • Assumption: strings are generated randomly from a given distribution of letters. • Problem: what is the probability of seeing k matches between two strings of length m and n?

A special case • Assumption: k=0 (no matches) • t – alphabet size • S(n,i) - Stirling number of the second kind
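The slide's formula is not reproduced in this text. As a hedged illustration of how Stirling numbers enter, assume letters are drawn uniformly from an alphabet of size t and that m, n ≥ 1. A string of length n uses exactly i distinct letters in S(n,i) · t!/(t−i)! ways, and the other string must then avoid those i letters, which leaves (t−i)^m choices. Summing over i and dividing by t^(m+n) gives the no-match probability. The uniform assumption and all names below are illustrative, and the result is checked against brute-force enumeration.

```python
from itertools import product
from math import perm


def stirling2(n: int, i: int) -> int:
    """S(n, i): number of ways to partition n items into i non-empty blocks."""
    table = [[0] * (i + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for a in range(1, n + 1):
        for b in range(1, min(a, i) + 1):
            table[a][b] = b * table[a - 1][b] + table[a - 1][b - 1]
    return table[n][i]


def prob_no_match(m: int, n: int, t: int) -> float:
    """P(two uniform random strings of lengths m and n share no letter)."""
    total = 0
    for i in range(1, min(n, t) + 1):
        # strings of length n using exactly i letters, times strings of
        # length m built only from the remaining t - i letters
        total += stirling2(n, i) * perm(t, i) * (t - i) ** m
    return total / t ** (m + n)


def brute_force(m: int, n: int, t: int) -> float:
    """Enumerate all string pairs; only feasible for tiny m, n, t."""
    letters = range(t)
    hits = sum(
        1
        for x in product(letters, repeat=m)
        for y in product(letters, repeat=n)
        if not set(x) & set(y)
    )
    return hits / t ** (m + n)


print(prob_no_match(2, 2, 3), brute_force(2, 2, 3))
```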

The problem • What is the probability of seeing k matches between two strings of length m and n? • An exact analytical formula is unlikely to exist. • A very similar problem has been studied in bioinformatics as statistical significance of alignment scores. • Approximations developed in bioinformatics are not applicable to words because of length differences.

Solutions for the general case • Sampling: not reliable for small probability values; works well for low k/n ratios (uninteresting); depends on a given alphabet size and letter frequencies; gives no insight • Inexact approximation: works well for high k/n ratios (interesting); easy to use
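The sampling option can be sketched as a simple Monte Carlo estimate. This assumes that "k matches" means a longest common subsequence of length at least k and that letters are drawn independently, uniformly unless weights are given. The function names, the default alphabet, and the trial count are illustrative.

```python
import random


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_match_probability(k, m, n, alphabet="abcdefghijklmnopqrstuvwxyz",
                               weights=None, trials=10_000, seed=0):
    """Estimate P(LCS >= k) for random strings of lengths m and n."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    hits = 0
    for _ in range(trials):
        x = rng.choices(alphabet, weights=weights, k=m)
        y = rng.choices(alphabet, weights=weights, k=n)
        if lcs_length(x, y) >= k:
            hits += 1
    return hits / trials


print(estimate_match_probability(2, 5, 5))
```

With a fixed number of trials, rare events (high k/n) may never be sampled at all, which is the reliability problem the slide points out.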

Formula 1 - probability of a match

Formula 1 • Exact for k=m=n • Inexact in general • Reason: implicit independence assumption • Lower bound for the actual probability • Good approximation for high k/n ratios. • Runs into numerical problems for larger n

Formula 2 • Expected number of pairs of k-letter substrings. • Approximates the required probability for high k/n ratios.

Formula 2 • Does not work for low k/n ratios. • Not monotonic. • Simpler than Formula 1. • More robust against numerical underflow for very long words.
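The formula itself is not reproduced in this text, so the following is only one reading of its description. If "pairs of k-letter substrings" is taken to mean pairs of k-letter subsequences, and p is the probability that two random letters match (1/t for a uniform alphabet of size t), then the expected number of matching pairs is C(m,k) · C(n,k) · p^k. That reading and all names below are assumptions.

```python
from math import comb


def expected_matching_pairs(k: int, m: int, n: int, p: float) -> float:
    """Expected number of matching pairs of k-letter subsequences.

    Assumed reading of the slide's description, not a quoted formula.
    """
    return comb(m, k) * comb(n, k) * p ** k


p = 1 / 26  # assumption: uniform 26-letter alphabet
for k in range(0, 11):
    print(k, expected_matching_pairs(k, 10, 10, p))
```

Under this reading the listed properties can be checked directly. For k = m = n the value reduces to p^n. For small k the value rises above 1, so it cannot serve as a probability, and it is not monotonic in k.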

Comparison of both formulas • Both are exact for k=m=n • For k close to max(m,n), both formulas are good approximations and their values are very close • Both can be quickly computed using dynamic programming.

LCSF • A new similarity measure based on Formula 2. • LCSR(X,Y) = k/n • LCSF(X,Y) = [formula not recovered] • LCSF is as fast as LCSR because its values depend only on k and n, so they can be precomputed and stored

Evaluation - motivation • Intrinsic evaluation of orthographic similarity is difficult and subjective. • My idea: extrinsic evaluation on cognates and word-aligned bitexts. • Most cross-language cognates are orthographically similar and vice versa. • Cognation is binary and not subjective

Cognates vs alignment links • Manual identification of cognates is tedious. • Manually word-aligned bitexts are available, but only some of the links are between cognates. • Question #1: can we use manually-constructed word alignment links instead?

Manual vs automatic alignment links • Automatically word-aligned bitexts are easily obtainable, but a good fraction of the links are wrong. • Question #2: can we use machine-generated word alignment links instead?

Evaluation methodology • Assumption: a word-aligned bitext • Treat aligned sentences as bags of words • Compute similarity for all word pairs • Order word pairs by their similarity value • Compute precision against a gold standard (either a cognate list or alignment links)
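The steps above can be sketched as follows. The toy sentence pair, the gold links, the use of sets as bags of words, and precision at a fixed cutoff are all assumptions made for illustration. The talk does not specify the exact precision variant. Any similarity function can be plugged in, and a simple prefix ratio is used here.

```python
from itertools import product


def prefix_similarity(a: str, b: str) -> float:
    """Longest common prefix length over the longer word's length."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longer = max(len(a), len(b))
    return common / longer if longer else 0.0


def rank_word_pairs(sentence_pairs, similarity):
    """Score every cross-language word pair within each aligned sentence pair."""
    scored = []
    for sid, (src, tgt) in enumerate(sentence_pairs):
        for s, t in product(set(src), set(tgt)):  # bags of words, order ignored
            scored.append((similarity(s, t), sid, s, t))
    # highest similarity first; ties broken alphabetically for determinism
    scored.sort(key=lambda item: (-item[0], item[1], item[2], item[3]))
    return scored


def precision_at(ranked, gold, n: int) -> float:
    """Fraction of the top-n ranked pairs that appear in the gold standard."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum((sid, s, t) in gold for _, sid, s, t in top) / len(top)


# Hypothetical data: one aligned sentence pair and its gold links.
bitext = [(["the", "colour", "red"], ["la", "couleur", "rouge"])]
gold = {(0, "the", "la"), (0, "colour", "couleur"), (0, "red", "rouge")}

ranked = rank_word_pairs(bitext, prefix_similarity)
print(precision_at(ranked, gold, 2))
```

Swapping `gold` between a cognate list and a set of alignment links is what lets the two kinds of gold standard be compared.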

Test data • Blinker bitext (French-English): 250 Bible verse pairs; manual word alignment; all cognates manually identified • Hansards (French-English): 500 sentences; manual and automatic word alignment • Romanian-English: 248 sentences; manually aligned

Blinker results

Hansards results

Romanian-English results

Contributions • We showed that word alignment links can be used instead of cognates for evaluating word similarity measures. • We proposed a new similarity measure which outperforms LCSR.

Future work • Extend our length-normalization approach to edit distance and other similarity measures. • Incorporate cognate information into statistical MT models as an additional feature function.

Thank you
