
Machine Transliteration



Presentation Transcript


  1. Machine Transliteration Joshua Waxman

  2. Overview • Words written in a language with alphabet A → words written in a language with alphabet B • שלום → “shalom” • Importance for MT, for cross-language IR • Forward transliteration, Romanization, back-transliteration

  3. Is there a convergence towards standards? Perhaps for really famous names. But even for such standard names, there are multiple acceptable spellings. Whether anyone regulates such spellings probably depends on the culture. In the meantime, we have a lot of variance, especially on the Web. E.g. the holiday of Succot, סוכות, סֻכּוֹת. There is variance in pronunciation across cultural groups (soo-kot, suh-kes) = dialect, and variance in how one chooses to transliterate different Hebrew letters (kk, cc, gemination). • Sukkot: 7.1 million • Succot: 173 thousand • Succos: 153 thousand • Sukkoth: 113 thousand • Succoth: 199 thousand • Sukos: 112 thousand • Sucos: 927 thousand, but probably almost none related to the holiday • Sucot: 101 thousand; Spanish transliteration of the holiday • Sukkes: 1.4 thousand; Yiddish rendition • Succes: 68 million; misspelling of “success” • Sukket: 45 thousand; but not Yiddish, because Yiddish wouldn’t have the “t” ending Recently in the news: AP: Emad Borat; Arutz Sheva: Imad Muhammad Intisar Boghnat

  4. Can we enforce standards? • Would make the task easier. • News articles, perhaps • However: • Would they listen to us? • Does the standard make sense across the board? Once again, dialectal differences. E.g. ת, ה, vowels. Also, fold-over of the alphabet: ע-א, ק-כ, ח-כ, ת-ט, ת-ס • 2N for N languages

  5. Four Papers • “Cross Linguistic Name Matching in English and Arabic” • For IR – search. Fuzzy string matching. Modification of Soundex to use cross-language mapping, using character equivalence classes • “Machine Transliteration” • For Machine translation. Back transliteration. 5 steps in transliteration. Use Bayes’ rule • “Transliteration of Proper Names in Cross-Language Applications” • Forward transliteration, purely statistical based • “Statistical Transliteration for English-Arabic Cross Language Information Retrieval” • Forward transliteration. For IR, generating every possible transliteration, then evaluate. Using selected n-gram model

  6. Cross Linguistic Name Matching in English and Arabic: A “One to Many Mapping” Extension of the Levenshtein Edit Distance Algorithm Dr. Andrew T. Freeman, Dr. Sherri L. Condon and Christopher M. Ackerman The Mitre Corporation

  7. Cross Linguistic Name Matching • What? • Match personal names in English to the same names in Arabic script. • Why is this not a trivial problem? • There are multiple transcription schemes, so it is not one-to-one • e.g. معمر القذافي can be Muammar Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi, Muammar al Qadhafi • because certain consonants and vowels can be represented multiple ways in English • note: Arabic is just an example of this phenomenon • so standard string comparison is insufficient • For what purpose? • For search on, say, news articles. How do you match all occurrences of “Qadhafi”? • Their solution • Enter the search term in Arabic, use Character Equivalence Classes (CEQ) to generate possible transliterations, and supplement the Levenshtein Edit Distance Algorithm

  8. Elaboration on Multiple Transliteration Schemes • Why? • No standard English phoneme corresponding to Arabic /q/ • Different dialects – in Libya, this is pronounced [g] • note: Similar for Hebrew dialects

  9. Fuzzy string matching • def: matching strings based on similarity rather than identity • Examples: • edit-distance • n-gram matching • normalization procedures like Soundex.

  10. Survey of Fuzzy Matching Methods - Soundex • Soundex • Odell and Russel, 1918 • Some obvious pluses: • (not mentioned explicitly by paper) • we eliminate vowels, so Moammar/Muammar not a problem • Groups of letters will take care of different English letters corresponding to Arabic • Elimination of repetition and of h will remove gemination/fricatives • Some minuses • Perhaps dialects will transgress Soundex phonetic code boundaries. e.g. ת in Hebrew can be t, th, s. ח can be ch or h. Is a ו to be w or v? But could modify algorithm to match. • note al in al-Qadafi • Perhaps would match too many inappropriate results
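The Soundex scheme sketched on this slide fits in a few lines of Python. This is a simplified sketch, not the paper's code: it uses the classic digit classes, treats h/w as transparent to the repeat rule, and keeps the first letter literally — which is exactly why Gaddafi/Qaddafi still fail to match, one of the minuses noted above.

```python
def soundex(name):
    """Simplified classic Soundex: keep the first letter, map the rest to
    digit classes, drop vowels, collapse adjacent repeats, pad to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for c in name[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            result += d
        if c not in "hw":          # h and w do not break a run of repeats
            prev = d
    return (result + "000")[:4]
```

As predicted on the slide, dropping vowels makes Moammar/Muammar a non-issue (both become M560), and eliminating h makes Gaddafi/Gadhafi collapse — but the literal first letter keeps G/Q apart.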

  11. Noisy Channel Model

  12. Levenshtein Edit Distance • AKA Minimum Edit Distance • Minimum number of operations of insertion, deletion, substitution. Cost per operation = 1 • Via dynamic programming • Example taken from Jurafsky and Martin, but with corrections • Minimum of diagonal + subst, or down/left + insertion/deletion cost

  13. Minimum Edit Distance Example (substitution cost = 2)

  14. Minimum Edit Distance Example (substitution cost = 1)

  15. Minimum Edit Distance • Score of 0 = perfect match, since no edit ops • s of len m, t of len n • Fuzzy match: divide the edit score by the length of the shortest (or longest) string, and subtract this from 1. Set a threshold for strings to count as a match. Then longer pairs of strings are more likely to be matched than shorter pairs with the same number of edits, since we get the percentage of chars that need ops. Otherwise, “A” vs. “I” has the same edit distance as “tuning” vs. “turning.” • Good algorithm for fuzzy string comparison – can see that Muammar Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi, Muammar al Qadhafi are relatively close. • But we don’t really want the full substitution cost for G/Q, O/U, DD/DH, or certain insertion/deletion costs. That is why they supplement it with Character Equivalence Classes (CEQ), which we’ll get to a bit later.
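The dynamic program and the normalized fuzzy score described above can be sketched directly; `sub_cost` is a parameter because the two worked examples use costs 2 and 1, and this sketch normalizes by the longer string (the slide allows either).

```python
def edit_distance(s, t, sub_cost=1):
    """Levenshtein distance by dynamic programming: cell (i, j) is the
    minimum of diagonal + substitution, or down/left + insert/delete."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] +         # substitution / match
                          (0 if s[i - 1] == t[j - 1] else sub_cost))
    return d[m][n]

def similarity(s, t):
    """Fuzzy score: 1 - (edit distance / length of the longer string)."""
    return 1 - edit_distance(s, t) / max(len(s), len(t))
```

With this normalization, "A" vs. "I" scores 0.0 while "tuning" vs. "turning" scores about 0.86, even though both pairs are one edit apart.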

  16. Editex • Zobel and Dart (1996) – Soundex + Levenshtein Edit Distance • Replace e(si, tj), which was basically 1 if unequal, 0 if equal (that is, the cost of an op), with r(si, tj), which makes use of Soundex-style equivalences: 0 if identical, 1 if in the same group, 2 if different • Also neutralizes h and w in general. Show example based on the chart from before. When initializing or calculating the cost of insertion/deletion, h and w do not count; otherwise the cost is 1. • Other enhancements to standard Soundex and Edit distance for the purpose of comparison, e.g. tapering (edits count less later in the word); phonometric methods (input strings mapped to phonemic representations, e.g. “rough”) • They report it performed better than Soundex, Min Edit Distance, counting n-gram sequences, and ~10 permutations of tapering and phonometric enhancements to the standard algorithms
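Editex's modification of the DP, as described on this slide, can be sketched like so. The letter groups below are an illustrative, disjoint subset — Zobel and Dart's actual groups are larger and overlapping — and the h/w rule follows the slide's description (insertions/deletions of h and w cost nothing).

```python
# Illustrative Editex-style letter groups (not Zobel & Dart's exact set)
GROUPS = ["aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fv", "sxz"]

def r(a, b):
    """Substitution cost: 0 if identical, 1 if same group, 2 otherwise."""
    if a == b:
        return 0
    return 1 if any(a in g and b in g for g in GROUPS) else 2

def d(a):
    """Insertion/deletion cost: h and w are neutralized and cost nothing."""
    return 0 if a in "hw" else 1

def editex(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + d(s[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + d(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + d(s[i - 1]),      # deletion
                          D[i][j - 1] + d(t[j - 1]),      # insertion
                          D[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return D[m][n]
```

So "moammar"/"muammar" costs 1 (o/u share the vowel group), and "gadafi"/"gadhafi" costs 0 (the inserted h is free).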

  17. SecondString (Tool) • Java based implementation of many of these string matching algorithms. They use this for comparison purposes. Also, SecondString allows hybrid algorithms by mixing and matching, tools for string matching metrics, tools for matching tokens within strings.

  18. Baseline Task (??) • Took 106 Arabic, 105 English texts from newswire articles • Took names from these articles: 408 names from English, 255 names from Arabic • Manual cross-script matching yielded 29 common names (rather than manually coming up with all possible transliterations) • To get a baseline, tried matching all names in Arabic (transliterated using Atrans by Basis – 2004) to all names in English, using algorithms from SecondString. Thus, they have one standard transliteration, and try to match it to all other English transliterations • Empirically set the threshold to a value that yielded good results • R = recall = # correctly matched English names / # available correct English matches in the set; what percentage of the total correct did they get? • P = precision = # correct names returned / total # of names returned; what percentage of their guesses were accurate? • Defined F-score as 2 × (P × R) / (P + R)
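The three evaluation measures defined on this slide are a one-liner each; the counts in the usage comment are hypothetical, not the paper's reported results.

```python
def precision_recall_f(matched_correct, returned, available_correct):
    """P, R and F-score as defined on the slide.
    matched_correct   -- names the matcher got right
    returned          -- total names the matcher returned
    available_correct -- correct matches that exist in the test set"""
    p = matched_correct / returned
    r = matched_correct / available_correct
    f = 2 * p * r / (p + r)          # harmonic mean of P and R
    return p, r, f

# Hypothetical example: 20 correct out of 25 returned, 29 available
p, r, f = precision_recall_f(20, 25, 29)
```

Note that this F-score is the harmonic mean of P and R, so it simplifies to 2 × matched / (returned + available).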

  19. Other Algorithms Used For Comparison • Smith–Waterman = Levenshtein Edit, with some parameterization of the gap score • SLIM = iterative statistical learning algorithm based on a variant of expectation-maximization, in which a Levenshtein edit-distance matrix is iteratively processed to find the statistical probabilities of the overlap between two strings • Jaro = n-gram • Last one is Edit distance

  20. Their Enhancements • Motivation: Arabic letter has more than one possible English letter equivalent. Also, Arabic transliterations of English names not predictable. 6 different ways to represent Milosevic in Arabic.

  21. Some Real World Knowledge

  22. Character Equivalence Classes • Same idea as Editex, except use Ar(si, tj), where s is an Arabic word, so si is an Arabic letter, and t is an English word, so tj is an English letter • So, comparing Arabic to English directly, rather than via a standard transliteration • The sets within Ar handle (modified) Buckwalter transliteration, the default transliteration of Basis’s software • Basis’s software uses English digraphs for certain letters

  23. Buckwalter Transliteration Scheme A “scholarly” transliteration scheme, unlikely to be found in newspaper articles. Wikipedia: The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is an ASCII-only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a waw will be transliterated as w regardless of whether it is realized as a vowel [u:] or a consonant [w]. Only when the waw is modified by a hamza (ؤ) does the transliteration change to &. The unmodified letters are straightforward to read (except for maybe *=dhaal, E=ayin, v=thaa), but the transliteration of letters with diacritics and the harakat takes some time to get used to; for example, the nunated i‘rab -un, -an, -in appear as N, F, K, and the sukun (“no vowel”) as o. Ta marbouta ة is p. • hamza • lone hamza: ' • hamza on alif: > • hamza on waw: & • hamza on ya: } • alif • madda on alif: | • alif al-wasla: { • dagger alif: ` • alif maqsura: Y • harakat • fatha: a • damma: u • kasra: i • fathatayn: F • dammatayn: N • kasratayn: K • shadda: ~ • sukun: o • ta marbouta: p • tatwil: _

  24. The Equivalence Classes

  25. Normalization • They normalize Buckwalter and the English in the newspaper articles. • Thus, $ → sh from Buckwalter, • ph → f in English, eliminate dupes, etc. • Move vowels in each language closer to one another by only retaining matching vowels (that is, where they exist in both)
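A couple of the normalization rules above can be sketched as string rewrites; this shows only the mappings the slide names ($ → sh, ph → f, duplicate elimination), not the paper's full rule set.

```python
import re

def normalize_buckwalter(s):
    """Buckwalter-side normalization sketch: expand symbols that Basis's
    software writes as English digraphs, e.g. $ (sheen) -> sh."""
    return s.replace("$", "sh")

def normalize_english(s):
    """English-side normalization sketch: ph -> f, collapse duplicated
    letters, lowercase."""
    s = s.lower().replace("ph", "f")
    return re.sub(r"(.)\1+", r"\1", s)   # eliminate dupes: dd -> d, etc.
```

After normalization, spellings like "Qaddafi" and "Qadafi" collapse to the same string before the CEQ comparison runs.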

  26. Why different from Soundex and Editex • “What we do here is the opposite of the approach taken by the Soundex and Editex algorithms. They try to reduce the complexity by collapsing groups of characters into a single super-class of characters. The algorithm here does some of that with the steps that normalize the strings. However, the largest boost in performance is with CEQ, which expands the number of allowable cross-language matches for many characters.”

  27. Machine (Back-) Transliteration Kevin Knight and Jonathan Graehl University of Southern California

  28. Machine Transliteration • For translation purposes • Foreign words are commonly transliterated, using approximate phonemic equivalents • “computer” → konpyuuta • Problem: usually we translate by looking words up in dictionaries, but these often don’t show up in dictionaries • Usually not a problem for some language pairs, like Spanish/English, since they have similar alphabets. But non-alphabetic languages, or those with different alphabets, are more problematic (e.g. Japanese, Arabic) • Popular on the Internet: “The Coca-Cola name in China was first read as "Ke-kou-ke-la," meaning "Bite the wax tadpole" or "female horse stuffed with wax," depending on the dialect. Coke then researched 40,000 characters to find a phonetic equivalent to "ko-kou-ko-le," translating into "happiness in the mouth." “ • Solution: backwards transliteration to recover the original word, using a generative model

  29. Machine Transliteration • Japanese transliterates e.g. English in katakana. Foreign names and loan-words. • Compromises: e.g. golfbag • L/R map to the same character • Japanese has an alternating consonant-vowel pattern, so it cannot have the consonant cluster LFB • Syllabary instead of alphabet • Goruhubaggu • Dot separator, but inconsistent, so aisukuriimu can be “I scream” or “ice cream”

  30. Back Transliteration • Going from katakana back to original English word • for translation – katakana not found in bilingual dictionaries, so just generate original English (assuming it is English) • Yamrom 1994 – pattern matching – *** • Arbabi 1994 – neural net/expert system *** • Information loss, so not easy to invert

  31. More Difficult Than • Forward transliteration • several ways to transliterate into katakana, all valid, so you might encounter any of them • But only one English spelling; can’t say “arture” for “archer” • Romanization • we have seen examples of this; the katakana examples above • more difficult because of spelling variations • Certain things cannot be handled by back-transliteration • Onomatopoeia • Shorthand: e.g. waapuro = word processing

  32. Desired Features • Accuracy • Portability to other languages • Robust against OCR errors • Relevant to ASR where speaker has heavy accent • Ability to take context (topical/syntactic) into account, or at least return ranked list of possibilities • Really requires 100% knowledge

  33. Learning Approach – Initial Attempt • Can learn what letters transliterate for what by training on corpus of katakana phrases in bilingual dictionaries • Drawbacks: • with naïve approach, how can we make sure we get a normal transliteration? • E.g. we can get iskrym as back transliteration for aisukuriimu. • Take letter frequency into account! So can get isclim • Restrict to real words! Is crime. • We want ice cream!

  34. Modular Learning Approach Build a generative model of the transliteration process: • English phrase is written • Translator pronounces it in English • Pronunciation modified to fit the Japanese sound inventory • Sounds are converted into katakana • Katakana is written Solve and coordinate solutions to these subproblems; use the generative models in the reverse direction. Use probabilities and Bayes’ Rule

  35. Bayes’ Rule Example Example #1: Conditional probabilities – from Wikipedia. Suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip cookies and 30 plain cookies, while bowl #2 has 20 of each. Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes’s theorem. But first, we can clarify the situation by rephrasing the question to “what’s the probability that Fred picked bowl #1, given that he has a plain cookie?” Thus, to relate to our previous explanation, the event A is that Fred picked bowl #1, and the event B is that Fred picked a plain cookie. To compute Pr(A|B), we first need to know: • Pr(A), the probability that Fred picked bowl #1 regardless of any other information. Since Fred is treating both bowls equally, it is 0.5. • Pr(B), the probability of getting a plain cookie regardless of any information on the bowls. In other words, this is the probability of getting a plain cookie from each of the bowls. It is computed as the sum, over the bowls, of the probability of getting a plain cookie from a bowl multiplied by the probability of selecting that bowl. We know from the problem statement that the probability of getting a plain cookie from bowl #1 is 0.75, and the probability of getting one from bowl #2 is 0.5, and since Fred is treating both bowls equally the probability of selecting either of them is 0.5. Thus, the probability of getting a plain cookie overall is 0.75×0.5 + 0.5×0.5 = 0.625. • Pr(B|A), the probability of getting a plain cookie given that Fred has selected bowl #1. From the problem statement, we know this is 0.75, since 30 out of 40 cookies in bowl #1 are plain. Given all this information, we can compute the probability of Fred having selected bowl #1 given that he got a plain cookie: Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B) = 0.75 × 0.5 / 0.625 = 0.6. As we expected, it is more than half.
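The cookie-bowl arithmetic on this slide can be checked directly; every number below comes straight from the problem statement.

```python
# Priors: Fred picks each bowl with equal probability.
p_bowl1, p_bowl2 = 0.5, 0.5

# Likelihoods: probability of drawing a plain cookie from each bowl.
p_plain_given_1 = 30 / 40   # bowl #1: 30 plain of 40  -> 0.75
p_plain_given_2 = 20 / 40   # bowl #2: 20 plain of 40  -> 0.5

# Total probability of a plain cookie (law of total probability).
p_plain = p_plain_given_1 * p_bowl1 + p_plain_given_2 * p_bowl2   # 0.625

# Bayes' rule: Pr(bowl 1 | plain) = Pr(plain | bowl 1) Pr(bowl 1) / Pr(plain)
p_1_given_plain = p_plain_given_1 * p_bowl1 / p_plain             # 0.6
```

The posterior 0.6 is indeed more than half, as the slide's intuition predicted.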

  36. Application To Task At Hand • English Phrase Generator produces word sequences according to probability distribution P(w) • English Pronouncer probabilistically assigns a set of pronunciations to a word sequence, according to P(p|w) • Given pronunciation p, find the word sequence that maximizes P(w|p) • Based on Bayes’ Rule: P(w|p) = P(p|w) * P(w) / P(p) • But P(p) will be the same regardless of the specific word sequence, so we can just search for the word sequence that maximizes P(p|w) * P(w), which are the two distributions we just modeled
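This noisy-channel search over the two distributions can be sketched with a toy argmax; the probabilities and the pronunciation string below are made up for illustration, not the paper's trained models.

```python
# Toy language model P(w) and channel model P(p|w) -- made-up values.
P_w = {"ice cream": 0.7, "I scream": 0.3}
P_p_given_w = {
    ("AY S K R IY M", "ice cream"): 0.9,
    ("AY S K R IY M", "I scream"): 0.8,
}

def decode(p):
    """Pick the word sequence w maximizing P(w) * P(p|w); the shared
    divisor P(p) is constant over w and can be dropped."""
    return max(P_w, key=lambda w: P_w[w] * P_p_given_w.get((p, w), 0.0))
```

Here the language-model prior tips the balance: even though both word sequences can yield the pronunciation, "ice cream" wins because P(w) favors it.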

  37. Five Probability Distributions Extending this notion, they built 5 probability distributions • P(w) – generates written English word sequences • P(e|w) – pronounces English word sequences • P(j|e) – converts English sounds into Japanese sounds • P(k|j) – converts Japanese sounds into katakana writing • P(o|k) – introduces misspellings caused by OCR Parallels the 5 steps above • English phrase is written • Translator pronounces it in English • Pronunciation modified to fit the Japanese sound inventory • Sounds are converted into katakana • Katakana is written Given a katakana string o observed by OCR, we wish to maximize P(w) * P(e|w) * P(j|e) * P(k|j) * P(o|k) over all e, j, k. Why? Let’s say we have e and want to determine the most probable w given e – that is, P(w|e); we would maximize P(w) * P(e|w) / P(e). Let’s say we had j and want the most probable e given j – that is, P(e|j); we would maximize P(e) * P(j|e). Note that while usually we ignore the divisor, here we maintain it, and the factors cancel: P(e) / P(e) = 1. And so on for each in turn.

  38. Implementation of the probability distributions P(w) as a WFSA (weighted finite-state acceptor), the others as WFSTs (transducers) • WFSA = state transition diagram with both symbols and weights on the transitions, such that some transitions are more likely than others • WFST = the same, but with both input and output symbols • Implemented a composition algorithm to yield P(x|z) from models P(x|y) and P(y|z), treating WFSAs simply as WFSTs with identical input and output • Yields one large WFSA; use Dijkstra’s shortest-path algorithm to extract the most probable path • No pruning; use the Viterbi approximation, searching for the best path through the WFSA rather than the best sequence
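The "most probable path = shortest path" trick above rests on converting arc probabilities to -log p, so products of probabilities become sums of costs. This is a minimal sketch with a toy adjacency-dict graph encoding, not the paper's WFST machinery.

```python
import heapq
import math

def best_path(graph, start, goal):
    """Dijkstra over a toy WFSA: graph maps each state to a list of
    (next_state, symbol, probability) arcs.  Arc cost is -log(prob),
    so the shortest path is the most probable one."""
    heap = [(0.0, start, [])]
    seen = set()
    while heap:
        cost, state, path = heapq.heappop(heap)
        if state == goal:
            return path, math.exp(-cost)   # recover the path probability
        if state in seen:
            continue
        seen.add(state)
        for nxt, sym, p in graph.get(state, []):
            heapq.heappush(heap, (cost - math.log(p), nxt, path + [sym]))
    return None, 0.0
```

On a toy two-arc fork, the higher-probability branch wins, exactly as the composed WFSA's Viterbi path would.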

  39. First Model – Word Sequences • “ice cream” > “ice crème” > “aice kreme” • Unigram scoring mechanism which multiplies the scores of known words and phrases in a sequence • Corpus: WSJ corpus + online English name list + online gazetteer of place names • Should really e.g. ignore auxiliaries and favor surnames. Approximate by removing high-frequency words
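The unigram scoring mechanism is just a product of relative frequencies; the counts below are toy values, not drawn from the WSJ corpus or the name lists.

```python
# Toy unigram counts (hypothetical, not the paper's corpus statistics).
counts = {"ice": 500, "cream": 300, "scream": 40, "i": 2000}
total = sum(counts.values())

def score(words):
    """P(w) under a unigram model: multiply each word's relative
    frequency; unknown words get probability 0."""
    prob = 1.0
    for w in words:
        prob *= counts.get(w, 0) / total
    return prob
```

With these counts, "ice cream" outscores "I scream", illustrating how the word-sequence model helps pick the conventional reading.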

  40. Model 2 – Eng Word Sequences → Eng Sound Sequences • Use the English phoneme inventory from the CMU Pronouncing Dictionary, minus stress marks • 40 sounds: 14 vowel sounds, 25 consonant sounds (e.g. K, HH, R), plus an additional symbol PAUSE • The dictionary has 100,000 (125,000) word pronunciations • Used the top 50,000 words because of memory limitations • Capital letters – Eng sounds; lowercase words – Eng words

  41. Example Second WFST Note: Why not letters instead of phonemes? Because letters don’t capture the Japanese transliterator’s mispronunciation, and that is modeled in the next step.

  42. Model 3: English Sounds → Japanese Sounds • Information-losing process: R, L → r; 14 vowels → 5 Japanese vowels • Identify the Japanese sound inventory • Build a WFST to perform the sequence mapping • The Japanese sound inventory has 39 symbols: 5 vowels, 33 consonants (including doubled kk), and the special symbol pause • (P R OW PAUSE S AA K ER) (pro-soccer) maps to (p u r o pause s a kk a a) • Use machine learning to train the WFST from 8000 pairs of English/Japanese sound sequences (for example, soccer). Created this corpus by modifying an English/katakana dictionary, converting it into these sounds; used the EM (expectation-maximization) algorithm to generate symbol-matching probabilities. See the table on the next page

  43. The EM Algorithm Note: pays no heed to context

  44. Model 4: Japanese Sounds → Katakana • Manually construct two WFSTs. • #1 just merges sequential doubled sounds into a single sound: o o → oo • #2 just does the mapping, accounting for different spelling variation. e.g.
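Step #1 of this model (merging sequential doubled sounds) is simple enough to sketch as a list pass; the paper implements it as a manually built WFST, so this is only an illustration of the rule.

```python
def merge_doubled_vowels(sounds):
    """Merge sequential doubled vowel sounds into one long sound,
    e.g. ['o', 'o'] -> ['oo'], per step #1 of Model 4."""
    out = []
    for s in sounds:
        if out and out[-1] == s and s in "aeiou":
            out[-1] = s + s          # o o -> oo
        else:
            out.append(s)
    return out
```

Applied to the pro-soccer example's tail, (s a kk a a) becomes (s a kk aa), ready for the katakana mapping in step #2.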

  45. Model 5: Katakana → OCR
