Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th April, 2011

CS460/626 : Natural Language Processing/Speech, NLP and the Web(Lecture 36–syllabification and transliteration) Pushpak BhattacharyyaCSE Dept., IIT Bombay 11thApril, 2011

Access from one language to another Language L1 Language L2 Translation Transliteration Boy लड़का Gandhijiगांधीजी

Cross Lingual Query Query Translation गांधीके जन्मस्थान Birthplace of Gandhi Search Engine English Document Translate Document Output

Sound based transliteration • Transliteration • Script Script • Transliteration using pronunciation • Convert script to sound • Convert sound to script • Similar to interlingua based MT (Gandhiji) (गांधीजी) Meaning Sentence in source Language Sentence in target Language

Two approaches to transliteration Syllabification Machine Learning Based Rule Based WordSyllabified Apple apple Fundamental fun da men tal Use GIZA++ Moses on parallel corpus

Phoneme-based approach Word in Source language Word in Target language P(wt) P( ps | ws) P ( wt | pt ) Pronunciation in Source language Pronunciation In target language P ( pt | ps ) Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) ) Note:Phoneme is the smallest linguistically distinctive unit of sound.

Basics of syllables “Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.” • Vowels are the heart of a syllable (Most Sonorous Element) (svayamraajateitisvaraH) • Consonants act as sounds attached to vowels.

Syllable structure • A syllable consists of 3 major parts:- • Onset (C) • Nucleus (V) • Coda (C) • Vowels sit in the Nucleus of a syllable • Consonants may get attached as Onset or Coda. • Basic structure - CV

Possible syllable structures • The Nucleus is always present • Onset and Coda may be absent • Possible structures • V • CV • VC • CVC

Introduction to sonority theory “The Sonority of a sound is its loudness relative to other sounds with the same length, stress and speech.” • Some sounds are more sonorous • Words in a language can be divided into syllables • Sonority theory distinguishes syllables on the basis of sounds.

Sonority hierarchy • Defined on the basis of amount of sound associated • The sonority hierarchy is as follows:- • Vowels (a, e, i, o, u) • Liquids (y, r, l, v) • Nasals (n, m) • Fricatives (s, z, f,…..sh, th etc.) • Affricates (ch, j) • Stops (b, d, g, p, t, k)

Sonority scale • Obstruents can be further classified into:- • Fricatives • Affricates • Stops

Sonority theory & syllables “A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.” • Represented as waves of sonority or Sonority Profile of that syllable Nucleus Onset Coda

Sonority sequencing principle “The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.” Peak (Nucleus) Onset Coda

Maximal onset principle “The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.” • Determines underlying syllable division • Example • DIPLOMA DIP LO MA & DI PLO MA

Syllable Structure: a more detailed look S ≡ Syllable, O ≡ Onset R ≡ Rhyme, N ≡ Nucleus Co ≡ Coda Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word. Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called “nucleus”. Basic Configuration: (C)V(C). Part of syllable preceding the nucleus is called the onset. Elements coming after the nucleus are called the coda. Nucleus and coda together are referred to as the rhyme.

Syllable Structure: Examples ‘word’ ‘sprint’

Syllable Structure: Examples  No Coda.  No Onset.  No Coda, No Onset. ‘may’ ‘opt’ ‘air’

Syllable Structure • Open Syllable: ends in vowel • Closed syllable: ends in consonant or consonant cluster • Light Syllable: A syllable which is open and ends in a short vowel • General Description – CV. • Example, ‘air’. • Heavy Syllable: Closed syllables or syllables ending in diphthong • Example: ‘opt’ • Example, ‘may’

Syllabification: Determining Syllable Boundaries • Given a string of syllables (word), what is the coda of one and the onset of another? • In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)? • To determine the correct groupings, there are some rules, two of them being the most important and significant: • Maximal Onset Principle, • Sonority Hierarchy

Rule Based Syllabification (This and statistical syllabification and application to Transliteration: DDP work by AnkitAggrwal, 2009)

Phonotactics • Guides construction of onsets and codas • Restricts combination of phonemes. • In general, rules operate around the sonority hierarchy. E.g., • Fricative /s/ is lower on the sonority hierarchy than the lateral /l/, so the combination /sl/ is permitted in onsets and /ls/ is permitted in codas. Opposite is not allowed. • Thus, ‘slips’ and ‘pulse’ are possible English words. • ‘lsips’ and ‘pusl’ are not possible.

Constraints on Onsets1 • One-consonant: Only /ŋ/ cannot occur in syllable-initial position. • Two-consonant: We refer to the scale of sonority. • Sequence ‘rn’ is ruled out since there is a decrease of sonority. • Minimal Sonority Distance: Distance in sonority between the first and the second element in the onset must be of at least 2 degrees. • Thus, only a limited no. of possible two-consonant clusters. • Three-consonant: • Restricted to licensed two-consonant onsets preceded by /s/. • Also, /s/ can only be followed by a voiceless sound. • Therefore, only /spl/, /spr/, /str/, /skr/, /spj/, /stj/, /skj/, /skw/, /skl/, /smj/ will be allowed. (splinter, spray, strong etc.) • While /sbl/, /sbr/, /sdr/, /sgr/, /sθr/ will be ruled out.

Constraints on Onsets2 Possible 2-consonat clusters in an Onset

Constraints on Coda1

Constraints on Coda2

Other Constraints • Nucleus: The following can occur as nucleus: • All vowel sounds (monophthongs as well as diphthongs). • Syllabic: • Both the onset and the coda are optional (as seen previously). • /j/ at the end of an onset (/pj/, /bj/, /tj/, /dj/, /kj/, /fj/, /vj/, /θj/, /sj/, /zj/, /hj/, /mj/, /nj/, /lj/, /spj/, /stj/, /skj/) must be followed by /uɪ/ or /ʊə/. • Long vowels and diphthongs are not followed by /ŋ/. • /ʊ/ is rare in syllable-initial position. • Stop + /w/ before /uɪ, ʊ, ʌ, aʊ/ are excluded.

Implementation1 1. Identify the first nucleus (single or multiple consecutive vowels) in the word. 2. Parse all the preceding consonants as the onset of the first syllable. 3. Find the next nucleus in the word. If unsuccessful in finding another nucleus, we'll simply parse the consonants to the right of the current nucleus as the coda of the first syllable, else we will move to the next step. Now the consonant cluster between these two nuclei have to be divided in two parts, one as coda of 1st syllable and the other onset of 2nd syllable. 4. If the no. of consonants in the cluster is one, it will be the onset of the second nucleus as per the Maximal Onset Principle and Constrains on Onset. 5. If it is two, check whether both of these can go to the onset of the second syllable as per the allowable onsets.

Implementation2 7. If the no. of consonants in the cluster is three, check if they can be the onset of the second syllable, if not, check for the last two, if not, parse only the last consonant as the onset of the second syllable. 8. If the no. of consonants in the cluster is more than three, except the last three consonants, parse all the consonants as the coda of the first syllable. With the remaining 3 consonants, apply step 7. 9. Continue the procedure for the rest of the string. 6. If not , the first consonant will be the coda of the first syllable and the second consonant will be the onset of the second syllable.

Accuracy Accuracy = (no. of words correctly syllabified x 100) % no. of words 10,000 easy words were chosen to establish proof of concept 91 words were found to be incorrectly syllabified. Accuracy: 99.09%. High precision, low recall (a characteristic of knowledge based AI)

Statistical syllabification

Statistical Machine Syllabification Data • Sources of data • Election Commission of India (ECI) Name List : This web source provides native Indian names written in both English and Hindi. • Delhi University (DU) Student List : This web sources provides native Indian names written in English only. These names were manually transliterated for the purposes of training data. • Indian Institute of Technology, Bombay (IITB) Student List: The Academic Office of IITB provided this data of students who graduated in the year 2007. • Named Entities Workshop (NEWS) 2009 English-Hindi parallel names : A list of paired names between English and Hindi of size 11k is provided. • We start with 8,000 randomly chosen names from the ECI Name list and syllabify them manually.

Statistical Machine Syllabification Effect of Training Format • Syllable-separated Format • Syllable-marked Format • Future Work

Statistical Machine Syllabification Top-1 Accuracy: 71.8% (Syllable-separated); 80.5% (Syllable-marked) Top-5 Accuracy: 83.4% (Syllable-separated); 90.4% (Syllable-marked)

Statistical Machine Syllabification Effect of Data Size • 4 different experiments were performed: • 8k: Names from the ECI Name list. • 12k: An additional 4k names were syllabified to increase the data size. • 18k: IITB Student List and the DU Student List was included and syllabified. • 23k: Some more names from ECI Name List and DU Student List were syllabified and this data acts as the final data for us.

Statistical Machine Syllabification Effect of Language Model n-gram Order

Effect of Syllabification on Transliteration

Transliteration using syllabification Input String Syllabified data (source–side) Moses Automatic Syllabifier Learnt syllabifier Learnt transliterator Syllabified String Moses Automatic Transliterator Syllabified data (parallel) Transliterated String

Statistical Machine Transliteration Top-1 Accuracy: 60.1% (Syllable-separated); 50.2% (Syllable-marked) Top-5 Accuracy: 87.2% (Syllable-separated); 79.3% (Syllable-marked)

The Top-5 Accuracy is just 63.0% which is much lower than required. No syllabification: Baseline Performance Additional Additional

Manual, i.e., ideal syllabification: skyline Performance Top-1 Accuracy: 71.8% (Syllable-separated); 80.5% (Syllable-marked) Top-5 Accuracy: 87.4% (Syllable-separated); 90.4% (Syllable-marked)

Obsevations • Transliteration likely to be helped by syllabification • Rule based syllabification: high precision low recall • Statistical syllabification: reasonable accuracy at top-5

References Nasreen AbdulJaleel and Leah S. Larkey. Statistical transliteration for english-arabic cross language information retrieval. In Conference on Information and Knowledge Management, pages 139–146, 2003. Ann K. Farmer Andrian Akmajian, Richard M. Demers and Robert M. Harnish. Linguistics: An Introduction to Language and Communication. MIT Press, 5th edition, 2001. Association for Computer Linguistics. Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration, 2007. Slaven Bilac and Hozumi Tanaka. Direct combination of spelling and pronunciation information for robust back-transliteration. In Conferences on Computational Linguistics and Intelligent Text Processing, pages 413–424, 2005. Ian Lane Bing Zhao, Nguyen Bach and Stephan Vogel. A log-linear block transliteration model based on bi-stream hmms. HLT/NAACL-2007, 2007.

References H. L. Jin and K. F. Wong. A Chinese dictionary construction algorithm for information retrieval. In ACM Transactions on Asian Language Information Processing, pages 281–296, December 2002. K. Knight and J. Graehl. Machine transliteration. In Computational Linguistics, pages 24(4):599–612, Dec. 1998. Lee-Feng Chien Long Jiang, Ming Zhou and Chen Niu. Named entity translation with web mining and transliteration. In International Joint Conference on Artificial Intelligence (IJCAL-07), pages 1629–1634, 2007. Dan Mateescu. English Phonetics and Phonological Theory. 2003. Della Pietra P. Brown and R. Mercer. The mathematics of statistical machine translation: Parameter estimation. In Computational Linguistics, page 19(2):263U˝ 311, 1990. Ganapathiraju Madhavi Prahallad Lavanya and Prahallad Kishore. A simple approach for building transliteration editors for Indian languages. Zhejiang University SCIENCE-2005, 2005.

Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th April, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th April, 2011

Presentation Transcript

Pushpak Bhattacharyya CSE Dept., IIT Bombay 14 th April, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 3 rd and 7 th Feb, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 21 st March, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 31 st March, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 22 nd March, 2011

Pushpak Bhattacharyya CSE Dept . IIT Bombay 1 st Nov, 2012

Pushpak Bhattacharyya CSE Dept., IIT Bombay 12 th April, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 15 th and 18 th Oct, 2012

Pushpak Bhattacharyya CSE Dept., IIT Bombay 7 th April, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 17 th Jan , 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 29 th March, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 15 th Feb, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 8 th , 10 th March , 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 17 th March, 2011

Pushpak Bhattacharyya CSE Dept. IIT Bombay 19 May, 2014

Pushpak Bhattacharyya CSE Dept., IIT Bombay 14 th Feb , 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 15 th March, 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 20 th Jan , 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 10 th Jan , 2011

Pushpak Bhattacharyya CSE Dept., IIT Bombay 11 th Nov, 2012

Pushpak Bhattacharyya CSE Dept., IIT Bombay

Pushpak Bhattacharyya CSE Dept., IIT Bombay 24 th March, 2011