
Cognates and Word Alignment in Bitexts



Presentation Transcript


  1. Cognates and Word Alignment in Bitexts Greg Kondrak University of Alberta

  2. Outline • Background • Improving LCSR • Cognates vs. word alignment links • Experiments & results

  3. Motivation • Claim: words that are orthographically similar are more likely to be mutual translations than words that are not similar. • Reason: existence of cognates, which are usually orthographically and semantically similar. • Use: Considering cognates can improve word alignment and translation models.

  4. Objective • Evaluation of orthographic similarity measures in the context of word alignment in bitexts.

  5. MT applications • sentence alignment • word alignment • improving translation models • inducing translation lexicons • aid in manual alignment

  6. Cognates • Similar in orthography or pronunciation. • Often mutual translations. • May include: • genetic cognates • lexical loans • names • numbers • punctuation

  7. The task of cognate identification • Input: two words • Output: the likelihood that they are cognate • One method: compute their orthographic/phonetic/semantic similarity

  8. Scope • The measures that we consider are: • language-independent • orthography-based • operate on the level of individual letters • use a binary identity function (two letters either match or they do not)

  9. Similarity measures • Prefix method • Dice coefficient • Longest Common Subsequence Ratio (LCSR) • Edit distance • Phonetic alignment • Many other methods

  10. IDENT • 1 if two words are identical, 0 otherwise • The simplest similarity measure • e.g. IDENT(colour, couleur) = 0

  11. PREFIX • The ratio of the length of the longest common prefix of two words to the length of the longer word • e.g. PREFIX(colour, couleur) = 2/7 ≈ 0.29

  12. DICE coefficient • Twice the number of shared letter bigrams divided by the total number of bigrams in both words • e.g. DICE(colour, couleur) = 2·3/(5+6) = 6/11 ≈ 0.55

  13. Longest Common Subsequence Ratio (LCSR) • The ratio of the length of the longest common subsequence of two words to the length of the longer word • e.g. LCSR(colour, couleur) = 5/7 ≈ 0.71 • (a code sketch of these four measures follows)
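
To make the measures on slides 10-13 concrete, here is a minimal Python sketch (the function names are mine, not the paper's); each returns a similarity in [0, 1]:

```python
# Minimal implementations of the measures on slides 10-13.

def ident(x, y):
    # 1 if the two words are identical, 0 otherwise
    return 1.0 if x == y else 0.0

def prefix(x, y):
    # length of the longest common prefix over the longer word's length
    p = 0
    for a, b in zip(x, y):
        if a != b:
            break
        p += 1
    return p / max(len(x), len(y))

def dice(x, y):
    # twice the number of shared letter bigrams over the total bigram count
    bx = [x[i:i + 2] for i in range(len(x) - 1)]
    by = [y[i:i + 2] for i in range(len(y) - 1)]
    rest, shared = list(by), 0
    for b in bx:                     # count shared bigrams with multiplicity
        if b in rest:
            rest.remove(b)
            shared += 1
    return 2 * shared / (len(bx) + len(by))

def lcsr(x, y):
    # longest common subsequence length over the longer word's length,
    # via the classic O(mn) dynamic-programming table
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x[i] == y[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

print(ident("colour", "couleur"))   # 0.0
print(prefix("colour", "couleur"))  # 2/7 ~ 0.29
print(dice("colour", "couleur"))    # 6/11 ~ 0.55
print(lcsr("colour", "couleur"))    # 5/7 ~ 0.71
```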

  14. LCSR • Method of choice in several papers • Weak point: insensitive to word length • Example • LCSR(walls, allés) = 0.8 • LCSR(sanctuary, sanctuaire) = 0.8 • Sometimes a minimum word length is imposed • A principled solution?

  15. The random model • Assumption: strings are generated randomly from a given distribution of letters. • Problem: what is the probability of seeing k matches between two strings of length m and n?

  16. A special case • Assumption: k = 0 (no matches) • t – alphabet size • S(n,i) – Stirling number of the second kind
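
The formula on this slide was an image that did not survive the transcript; the following is a reader's reconstruction from the listed ingredients (alphabet size t, Stirling numbers S(n,i)), assuming letters are drawn uniformly:

```latex
% Reconstruction, not a verbatim copy of the missing formula image:
% sum over the number i of distinct letters used by the first string.
\[
P(k = 0 \mid m, n, t)
  \;=\; \sum_{i=1}^{\min(n,\,t)}
        \frac{\binom{t}{i}\, i!\; S(n,i)}{t^{\,n}}
        \left(\frac{t-i}{t}\right)^{m}
\]
```

There are \(\binom{t}{i}\, i!\, S(n,i)\) strings of length n that use exactly i distinct letters, and the second string, of length m, avoids all i of them with probability \(((t-i)/t)^m\).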

  17. The problem • What is the probability of seeing k matches between two strings of length m and n? • An exact analytical formula is unlikely to exist. • A very similar problem has been studied in bioinformatics as statistical significance of alignment scores. • Approximations developed in bioinformatics are not applicable to words because of length differences.

  18. Solutions for the general case • Sampling • Not reliable for small probability values • Works well for low k/n ratios (uninteresting) • Depends on the alphabet size and letter frequencies • Provides no analytical insight • Inexact approximation • Works well for high k/n ratios (interesting) • Easy to use
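
A minimal Monte Carlo sketch of the sampling option, assuming uniform letters over a 26-letter alphabet (as the slide notes, a real estimate also depends on the alphabet size and letter frequencies):

```python
import random
import string

def lcs_len(x, y):
    # longest common subsequence length (same DP as in the earlier sketch)
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a == b
                                else max(dp[i + 1][j], dp[i][j + 1]))
    return dp[-1][-1]

def estimate_p(m, n, k, trials=100_000, alphabet=string.ascii_lowercase):
    # estimate P(LCS >= k) for random strings of lengths m and n
    hits = 0
    for _ in range(trials):
        x = ''.join(random.choices(alphabet, k=m))
        y = ''.join(random.choices(alphabet, k=n))
        hits += lcs_len(x, y) >= k
    return hits / trials  # unreliable once the true probability << 1/trials
```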

  19. Formula 1 - probability of a match

  20. Formula 1 • Exact for k=m=n • Inexact in general • Reason: implicit independence assumption • Lower bound for the actual probability • Good approximation for high k/n ratios. • Runs into numerical problems for larger n

  21. Formula 2 • Expected number of pairs of identical k-letter subsequences. • Approximates the required probability for high k/n ratios.

  22. Formula 2 • Does not work for low k/n ratios. • Not monotonic. • Simpler than Formula 1. • More robust against numerical underflow for very long words.
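
Reading Formula 2 as the expected number of pairs of identical k-letter subsequences under uniform letter probabilities, E(k, m, n) = C(m,k)·C(n,k)·t⁻ᵏ — a reconstruction, since the slide's formula image is not preserved — it can be evaluated in log space, which matches the slide's point about robustness to underflow for very long words:

```python
import math

def log_comb(a, b):
    # log of the binomial coefficient C(a, b), via log-gamma
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def log_expected_pairs(k, m, n, t=26):
    # log E(k, m, n) = log [ C(m, k) * C(n, k) * t**(-k) ]
    # (reconstructed form of Formula 2, assuming uniform letters)
    return log_comb(m, k) + log_comb(n, k) - k * math.log(t)

# the two LCSR = 0.8 pairs from slide 14 come out very differently here:
print(log_expected_pairs(4, 5, 5))    # walls/allés:           ~ -9.8
print(log_expected_pairs(8, 9, 10))   # sanctuary/sanctuaire:  ~ -20.1
```

The much smaller expected count for the sanctuary/sanctuaire pair captures exactly the length sensitivity that LCSR lacks.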

  23. Comparison of both formulas • Both are exact for k=m=n • For k close to max(m,n) • both formulas are good approximations • their values are very close • Both can be quickly computed using dynamic programming.

  24. LCSF • A new similarity measure based on Formula 2. • LCSR(X,Y) = k/n • LCSF(X,Y) = [formula not preserved in the transcript] • LCSF is as fast as LCSR because its values depend only on k and n and can therefore be pre-computed and stored
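
Since the LCSF definition itself is elided from the transcript, the sketch below only illustrates the precomputation idea from the slide; the scoring form -log E(k, n, n) is my assumption (using the reconstructed Formula 2 with m = n, as the slide says the values depend only on k and n), not necessarily the paper's definition:

```python
import math

def log_comb(a, b):
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def neg_log_e(k, n, t=26):
    # -log of the reconstructed Formula 2 with m = n -- an assumed form
    return -(2 * log_comb(n, k) - k * math.log(t))

# tabulate once for all (k, n) up to some maximum word length
MAX_LEN = 30
LCSF_TABLE = {(k, n): neg_log_e(k, n)
              for n in range(1, MAX_LEN + 1)
              for k in range(n + 1)}
```

Scoring a word pair then costs one LCS computation plus a table lookup, which is why LCSF can match LCSR's speed.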

  25. Evaluation - motivation • Intrinsic evaluation of orthographic similarity is difficult and subjective. • My idea: extrinsic evaluation on cognates and word-aligned bitexts. • Most cross-language cognates are orthographically similar, and vice versa. • Cognation is binary and not subjective.

  26. Cognates vs alignment links • Manual identification of cognates is tedious. • Manually word-aligned bitexts are available, but only some of the links are between cognates. • Question #1: can we use manually-constructed word alignment links instead?

  27. Manual vs automatic alignment links • Automatically word-aligned bitexts are easily obtainable, but a substantial fraction of the links are wrong. • Question #2: can we use machine-generated word alignment links instead?

  28. Evaluation methodology • Assumption: a word-aligned bitext • Treat aligned sentences as bags of words • Compute similarity for all word pairs • Order word pairs by their similarity value • Compute precision against a gold standard • either a cognate list or alignment links
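
The methodology translates into a short evaluation loop; the data shapes here are assumptions (token-list sentence pairs, and a set of gold word pairs drawn from either a cognate list or alignment links):

```python
from itertools import product

def precision_curve(bitext, gold, similarity):
    # bitext: list of (source_tokens, target_tokens) sentence pairs
    # gold:   set of (source_word, target_word) pairs -- a cognate list
    #         or manual/automatic alignment links
    scored = []
    for src, tgt in bitext:
        for s, t in product(src, tgt):       # bags of words: all pairs
            scored.append((similarity(s, t), (s, t)))
    scored.sort(key=lambda st: st[0], reverse=True)  # order by similarity
    hits, curve = 0, []
    for rank, (_, pair) in enumerate(scored, start=1):
        hits += pair in gold
        curve.append(hits / rank)            # precision at each cutoff
    return curve
```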

  29. Test data • Blinker bitext (French-English) • 250 Bible verse pairs • manual word alignment • all cognates manually identified • Hansards (French-English) • 500 sentences • manual and automatic word alignment • Romanian-English • 248 sentences • manually aligned

  30. Blinker results

  31. Hansards results

  32. Romanian-English results

  33. Contributions • We showed that word alignment links can be used instead of cognates for evaluating word similarity measures. • We proposed a new similarity measure which outperforms LCSR.

  34. Future work • Extend our approach to length normalization to edit distance and other similarity measures. • Incorporate cognate information into statistical MT models as an additional feature function.

  35. Thank you
