Transliteration. CS 626 course seminar by Purva Joshi (08305907), Mugdha Bapat (07305916), Aditya Joshi (08305908) and Manasi Bapat (08305906). Humans transliterate frequently, for different reasons. Can a machine do this? (Why would a machine have to do this?) If yes, how?
Humans transliterate frequently, for different reasons. Can a machine do this? (Why would a machine have to do this?) If yes, how?
Picture courtesy: snapshot of Yahoo! Messenger
Motivation
• An important component of machine translation
• When you cannot translate, transliterate
• Generally used for named entities, technical terms and out-of-vocabulary (OOV) words
• Issues specific to sounds, scripts and accents
• Can a machine do this? If yes, how?
What is transliteration?
• The task of converting a word from one alphabetic script to another
Used for:
• Named entities, e.g. Gandhiji
• Out-of-vocabulary words, e.g. Bank
Linguistic issues
• Accents: Thoda or thora?
• Mapping of sounds: Mahaan, Kahaan
• Back-transliteration
Linguistic issues: mapping of sounds
Arabic
• Arabic b -> English p or b
• The English word Paul transliterates to the Arabic word Baul (an issue in back-transliteration)
Chinese
• The origin of the proper noun determines the symbol used in Chinese
• Chinese symbols are ideographic
Hindi / Japanese
• Several English symbols do not map to any Japanese symbol, so they are often mapped to the closest-sounding symbol: ice cream -> aisukuriimu
• Symbols map to different symbols based on their position: America
• Difference in origin: Restaurant
• constant
Overview
[Diagram: Source String -> Transliteration Units -> Transliteration Units -> Target String]
Contents
[Diagram: Source String -> Transliteration Units -> Transliteration Units -> Target String]
• Phoneme-based
Phoneme-based approach
[Diagram: Word in source language (ws) -> Pronunciation in source language (ps) -> Pronunciation in target language (pt) -> Word in target language (wt). The arrows are labelled P(ps | ws), P(pt | ps) and P(wt | pt); P(wt) is attached to the target word.]
wt* = argmax P(wt) · P(wt | pt) · P(pt | ps) · P(ps | ws)
Note: A phoneme is the smallest linguistically distinctive unit of sound.
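The argmax above can be illustrated with a small Python sketch. All names and probabilities below are invented placeholders (`ps1`, `pt1`, `wA` and so on are hypothetical labels for pronunciations and candidate target words, not real data), and the search simply tries every path through the four stages and keeps the best-scoring target word.

```python
# Toy tables for one fixed source word ws; every value is made up for illustration.
P_PS_GIVEN_WS = {"ps1": 0.6, "ps2": 0.4}                  # P(ps | ws)
P_PT_GIVEN_PS = {"ps1": {"pt1": 0.9, "pt2": 0.1},         # P(pt | ps)
                 "ps2": {"pt1": 0.2, "pt2": 0.8}}
P_WT_GIVEN_PT = {"pt1": {"wA": 0.7, "wB": 0.3},           # P(wt | pt), as on the slide
                 "pt2": {"wA": 0.1, "wB": 0.9}}
P_WT = {"wA": 0.5, "wB": 0.5}                             # P(wt), target language model


def best_target_word():
    """Return (wt, score) for the highest-scoring path ws -> ps -> pt -> wt."""
    best_word, best_score = None, 0.0
    for ps, p_ps in P_PS_GIVEN_WS.items():
        for pt, p_pt in P_PT_GIVEN_PS[ps].items():
            for wt, p_wt_pt in P_WT_GIVEN_PT[pt].items():
                score = P_WT[wt] * p_wt_pt * p_pt * p_ps
                if score > best_score:
                    best_word, best_score = wt, score
    return best_word, best_score


word, score = best_target_word()
print(word, round(score, 3))
```

A real system would replace the exhaustive loops with a search over weighted finite-state transducers or a beam search, since the number of pronunciation candidates grows quickly with word length.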
Phoneme-based approach: transliterating 'BAPAT'
Step I: Consider each character of the word: B A P A T
Step II: Convert to a phoneme sequence (source word to phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, T or t
Step III: Convert to a target phoneme sequence (source phonemes to target phonemes): B, /ə/ or /a:/, P, /ə/ or /a:/, T or t
Phoneme-based approach
Step IV: Convert the phoneme sequence to the target string. Each phoneme (B, /ə/, /a:/, P, T, t) maps to a symbol in the target script, and the selected symbols form the output.
Concerns
[Diagram: Word in source language -> Pronunciation in source language -> Pronunciation in target language -> Word in target language]
• Check if the word is valid in the target language
• Check if the environment is noise-free
Issues in phonetic model
• Unknown pronunciations
• Back-transliteration can be a problem: Johnson / Jonson, sanhita / samhita
Contents
[Diagram: Source String -> Transliteration Units -> Transliteration Units -> Target String]
• Phoneme-based
• Spelling-based
Spelling-based model
• Maps the source word directly to the target word, letter sequence to letter sequence, without going through pronunciations
• The transliteration score includes P(w), the probability of the target word
• A letter trigram model is included in P(w), so words that are not in the dictionary can still be accommodated
[Diagram: Word in source language -> Word in target language directly; the pronunciation layers are bypassed]
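To see why a letter trigram model lets the system score words that are not in the dictionary, here is a minimal Python sketch. The training words, the add-alpha smoothing and the assumed alphabet size of 27 (26 letters plus an end marker) are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter


def train_letter_trigrams(words):
    """Count letter trigrams and their two-letter contexts, with ^^ and $ padding."""
    trigrams, contexts = Counter(), Counter()
    for word in words:
        padded = "^^" + word + "$"
        for i in range(len(padded) - 2):
            trigrams[padded[i:i + 3]] += 1
            contexts[padded[i:i + 2]] += 1
    return trigrams, contexts


def log_prob(word, trigrams, contexts, alphabet_size=27, alpha=1.0):
    """Add-alpha smoothed log P(word) under the letter trigram model."""
    padded = "^^" + word + "$"
    total = 0.0
    for i in range(len(padded) - 2):
        gram, ctx = padded[i:i + 3], padded[i:i + 2]
        total += math.log((trigrams[gram] + alpha) /
                          (contexts[ctx] + alpha * alphabet_size))
    return total


# Hypothetical toy dictionary.
trigrams, contexts = train_letter_trigrams(["bank", "band", "bend", "sand"])
print(log_prob("bant", trigrams, contexts))   # unseen word, still gets a finite score
print(log_prob("xqzv", trigrams, contexts))   # implausible letter sequence scores lower
```

Because the model works on letters rather than whole words, an unseen but plausible spelling such as "bant" still receives a usable probability.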
Contents
[Diagram: Source String -> Transliteration Units -> Transliteration Units -> Target String]
• Phoneme-based
• Spelling-based
• Joint Source Channel
The Third Method - Why?
• Particularly developed for Chinese
• Chinese: highly ideographic
• Example: [image]
• Two main steps: modeling and decoding
Image courtesy: wikimedia-commons
Modeling step
• A bilingual dictionary in the source and target language
• From this dictionary, the character mapping between the source and target language is learnt
• The string "Geo" has two possible mappings; the "context" in which it occurs is important
Modeling step …
• N-gram mapping:
• < Geo, > < rge, >
• < Geo, > < lo, >
• This concludes the modeling step
Decoding step
• Consider the transliteration of the word "George".
• Alignments of George:
• George George
• George George
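One way to picture the candidate alignments of a word such as "George" is to enumerate every way of cutting it into short transliteration units. The sketch below is a hypothetical helper, not the decoder from the cited paper; the maximum unit length of 3 is an assumption chosen for illustration.

```python
def segmentations(word, max_unit_len=3):
    """Return every split of `word` into consecutive units of 1..max_unit_len letters."""
    if not word:
        return [[]]
    result = []
    for k in range(1, min(max_unit_len, len(word)) + 1):
        for rest in segmentations(word[k:], max_unit_len):
            result.append([word[:k]] + rest)
    return result


candidates = segmentations("George")
print(len(candidates))                   # number of candidate segmentations
print(["Geo", "rge"] in candidates)      # the split used in the n-gram mapping example
```

A decoder would then score each candidate with the learnt mappings and keep the best one, rather than listing them all.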
Decoding step …
Decision to be made between ….
• The context mapping is present in the map-dictionary
• Using ……
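The role of context in this decision can be sketched with a bigram model over <source unit, target unit> pairs. Everything below is an illustrative assumption: `C1` to `C5` are placeholder labels standing in for target characters, the two aligned entries are invented, and the smoothing constants are arbitrary.

```python
from collections import Counter

# Hypothetical aligned dictionary entries: lists of (source unit, target unit) pairs.
ALIGNED = [
    [("geo", "C1"), ("rge", "C2")],
    [("geo", "C3"), ("lo", "C4"), ("gy", "C5")],
]
START = ("<s>", "<s>")


def train_bigrams(aligned):
    """Count how often each unit pair follows the previous unit pair."""
    bigrams, contexts = Counter(), Counter()
    for sequence in aligned:
        prev = START
        for pair in sequence:
            bigrams[(prev, pair)] += 1
            contexts[prev] += 1
            prev = pair
    return bigrams, contexts


def score(sequence, bigrams, contexts, smooth=0.01, n_pairs=100):
    """Smoothed joint probability of a sequence of unit pairs."""
    prob, prev = 1.0, START
    for pair in sequence:
        prob *= (bigrams[(prev, pair)] + smooth) / (contexts[prev] + smooth * n_pairs)
        prev = pair
    return prob


bigrams, contexts = train_bigrams(ALIGNED)
seen_context = score([("geo", "C1"), ("rge", "C2")], bigrams, contexts)
unseen_context = score([("geo", "C3"), ("rge", "C2")], bigrams, contexts)
print(seen_context > unseen_context)
```

The two candidates differ only in which target unit "geo" maps to, and the neighbouring unit "rge" is what makes the model prefer one mapping over the other.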
Transliteration Alignment
• Where do the n-gram statistics come from? Ans.: Automatic analysis of the bilingual dictionary
• How to align this dictionary? Ans.: Using the EM algorithm
EM Algorithm
• Bootstrap: initial random alignment
• Expectation: update n-gram statistics to estimate the probability distribution
• Maximization: apply the n-gram TM (transliteration model) to obtain a new alignment
• Expectation and maximization repeat until the alignment stops changing
• Derive a list of transliteration units from the final alignment
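The loop above can be sketched in Python as follows. This is a simplified, hard-EM style illustration under several assumptions: it uses unigram counts of unit pairs instead of the n-gram statistics on the slide, the target side is given as placeholder tokens (`T1`, `T2`), and the toy dictionary is invented. With so little data the final alignment is not guaranteed to be the linguistically expected one.

```python
import math
import random
from collections import Counter


def random_alignment(src, n_units, rng):
    """Bootstrap: cut src into n_units non-empty chunks at random positions."""
    cuts = sorted(rng.sample(range(1, len(src)), n_units - 1))
    bounds = [0] + cuts + [len(src)]
    return [src[a:b] for a, b in zip(bounds, bounds[1:])]


def best_alignment(src, tgt, counts, total, smooth=0.1):
    """Maximization: best-scoring monotone split of src into len(tgt) chunks."""
    n, m = len(src), len(tgt)
    table = {(0, 0): (0.0, [])}   # (letters used, units used) -> (log score, chunks)
    for j in range(1, m + 1):
        for i in range(j, n + 1):
            options = []
            for k in range(j - 1, i):
                if (k, j - 1) not in table:
                    continue
                prev_score, prev_chunks = table[(k, j - 1)]
                chunk = src[k:i]
                p = (counts[(chunk, tgt[j - 1])] + smooth) / (total + smooth)
                options.append((prev_score + math.log(p), prev_chunks + [chunk]))
            if options:
                table[(i, j)] = max(options, key=lambda o: o[0])
    return table[(n, m)][1]


def train(pairs, iterations=5, seed=0):
    rng = random.Random(seed)
    alignments = [random_alignment(s, len(t), rng) for s, t in pairs]
    for _ in range(iterations):
        counts = Counter()                      # Expectation: re-estimate statistics
        for chunks, (_, tgt) in zip(alignments, pairs):
            counts.update(zip(chunks, tgt))
        total = sum(counts.values())
        alignments = [best_alignment(s, t, counts, total) for s, t in pairs]
    units = Counter()                           # transliteration units from final alignment
    for chunks, (_, tgt) in zip(alignments, pairs):
        units.update(zip(chunks, tgt))
    return alignments, units


# Hypothetical dictionary: source word, list of placeholder target tokens.
PAIRS = [("george", ["T1", "T2"]), ("geo", ["T1"]), ("rge", ["T2"])]
alignments, units = train(PAIRS)
print(alignments)
```

A fixed iteration count stands in for a convergence check here; a fuller version would stop once the alignments no longer change between rounds.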
Evaluation
• E2C (English-to-Chinese) error rates for n-gram tests
• E2C vs. C2E (Chinese-to-English) for TM tests
Conclusion
• Transliteration can make use of phonemes as an intermediate layer to move from one script to another
• The spelling-based approach maps the source word directly to the target word, without a phoneme layer
• The joint source channel method integrates optimization of alignment and transliteration
• No pre-alignment is needed
• Development effort is reduced
References
• For all Devnagari transliterations: www.quillpad.in/hindi/
• Phoneme and spelling-based models:
K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.
Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.
• Joint source-channel model:
H. Li, M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.
• www.wikipedia.org