Broadcast News Transcription System for Khmer Language • S. Seng (1,2,3), S. Sam (1,2,3), B. Bigi (1), L. Besacier (1), E. Castelli (2) • (1) LIG, Grenoble, France • (2) MICA, Hanoi, Vietnam • (3) ITC, Phnom Penh, Cambodia • Sopheap.Seng@imag.fr
Outline • ASR for Khmer: the challenges • Language data acquisition • Word/Sub-word Language Modeling • Acoustic Modeling • Experiments & Results • Conclusion & future work S. SENG, Lrec'08 Marrakech
Khmer Language • Official language of Cambodia • Spoken by more than 15 M people • An atonal language • Writing system • 33 Consonants, 23 dependent vowels • 14 independent vowels, 13 diacritics and various signs • No explicit word boundary
ASR for Khmer: the challenges • An under-resourced language • Lack of text and speech data in digital form • No explicit word segmentation • Automatic segmentation is needed to make language modeling feasible • State-of-the-art segmentation methods • Use hand-crafted lexicons, statistics, optimization criteria … • Error-prone • Other under-resourced, unsegmented languages in the region: Burmese, Lao, Vietnamese …
Language data acquisition: text data • Retrieving text from the Web • Well-selected, content-rich websites vs. crawling the Web • Adapting ClipsTextTk, an open-source corpus-creation tool, to the Khmer language • Conversion from legacy character encodings to Unicode • Automatic segmentation • Conversion of special signs and numbers to text • Normalization of word spelling • Text corpus obtained from 5 well-selected sites: • 2.5000 HTML documents retrieved • After processing: 0.5 M sentences, 15 M words • Duration: November 2007 – January 2008
Language data acquisition: text data • An example of segmentation of Khmer text • Word segmentation • Based on an 18k-entry lexicon from the official Chhoun Nat dictionary • Optimization criterion: longest matching • Imperfect segmentation • Syllable segmentation • Rule-based (20 rules) • Imperfect segmentation • Character Cluster (CC) segmentation • A CC is a group of characters with a well-defined structure • CC segmentation is a trivial task
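The longest-matching criterion named on this slide can be sketched as a greedy left-to-right scan. This is a minimal illustration, not the authors' tool: the function name `longest_match_segment`, the single-character fallback for text not covered by the lexicon, and the toy Latin-letter lexicon are all assumptions made for the example.

```python
def longest_match_segment(text, lexicon, max_len=None):
    """Greedy left-to-right longest-match segmentation (illustrative sketch)."""
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Assumption: with no lexicon entry here, emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon (hypothetical); a real run would load the 18k-entry lexicon.
toy_lexicon = {"a", "ab", "abc", "cd", "d"}
print(longest_match_segment("abcd", toy_lexicon))  # ['abc', 'd']
```

The toy input also shows why such segmentation is imperfect: the greedy scan commits to "abc" + "d" even if "ab" + "cd" was the intended reading.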
Language data acquisition: speech signal • Speech data collection • Downloadable Khmer radio programs • Sites: Voice of America, Free Asia, Radio Australia … • Quality: narrowband, poor quality • Recording of local radio broadcast news, Phnom Penh • Manual transcription campaign • Volunteer students contributed the transcriptions • 6 h 30 min of transcribed signal obtained (news read in the studio) • 3000 sentences, 45k words, 8 speakers (3 women)
Statistical Language Modeling • Problems • Very limited quantity of text data • Erroneous word segmentation, high OOV rate • Is the word still an optimal modeling unit? • In the literature, alternative modeling units have been used: • Morphemes for morphologically rich languages [MorphoChallenge 05] • Logographic characters for Japanese [Den 06] • Logographic characters + words for Chinese [Chen 00] • Syllables for Vietnamese [Le 06] • Sub-word units in Khmer: syllable, Character Cluster
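The OOV rate mentioned above can be measured with a few lines of code. The sketch below is an illustration under stated assumptions: the function name `oov_rate` is hypothetical, the vocabulary is taken to be the `vocab_size` most frequent training tokens, and the rate is counted over running test tokens.

```python
from collections import Counter


def oov_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of running test tokens outside the top-`vocab_size` vocabulary."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    # Assumption: the vocabulary is the most frequent training tokens.
    vocab = {tok for tok, _ in Counter(train_tokens).most_common(vocab_size)}
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return oov / len(test_tokens)


# Toy data (hypothetical); tokens may be words, syllables or CCs.
train = ["a", "a", "a", "b", "b", "c"]
test = ["a", "c", "d", "b"]
print(oov_rate(train, test, vocab_size=2))  # 0.5
```

Running the same function on word, syllable and CC tokenizations of one corpus gives a quick way to compare how each unit trades vocabulary size against coverage.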
Statistical Language Modeling • Using words & sub-words in language modeling • Exploit different views of the same data • Deal with the OOV problem • Compensate for the errors introduced by automatic segmentation • Create a hybrid LM by combining words and sub-words
Acoustic modeling • Grapheme-based pronunciation • Each grapheme is directly a modeling unit • Grapheme-to-phoneme rule-based pronunciation • General Khmer syllable structure: C[C]V[CF] • 20 conversion rules based on the rules in [Huffman 70] • Not all words can be phonetized by the rules (especially words from Sanskrit and Pali)
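A grapheme-based pronunciation dictionary can be generated mechanically from a word list, since each written character becomes one modeling unit. The sketch below is only an illustration: treating every Unicode code point as one grapheme, naming units by their code point (e.g. `U1780`), and the function names are assumptions, not the authors' actual unit set. The output follows the plain "word unit unit ..." line layout used by Sphinx-style dictionaries.

```python
def grapheme_pronunciation(word):
    """Map a word to one unit per character, labeled by Unicode code point."""
    # Assumption: one code point = one grapheme unit, with ASCII-safe names.
    return [f"U{ord(ch):04X}" for ch in word]


def dictionary_lines(words):
    """Build 'word unit unit ...' dictionary lines for a set of words."""
    return [
        f"{w} {' '.join(grapheme_pronunciation(w))}"
        for w in sorted(set(words))
    ]


# A two-character Khmer string: consonant KA (U+1780) + vowel sign AA (U+17B6).
for line in dictionary_lines(["\u1780\u17b6"]):
    print(line)
```

No linguistic knowledge is needed for this mapping, which is what makes the grapheme approach attractive for words the conversion rules cannot phonetize.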
Acoustic modeling • Khmer phoneme inventory • Source: [Huffman 70]
Experiments: ASR system • Decoder • Sphinx V3.6 • Test corpus • 172 utterances • Acoustic model HMM training • SphinxTrain • Grapheme-based: 77 modeling units • Phoneme-based: 33 modeling units (single phone) • Context-independent and context-dependent models • 3-gram LMs • Word/sub-word LMs: word, syllable, CC • Hybrid LM: CC + N most frequent words (N varies from 0 to 20k) • Vocabulary: 20k most frequent words, 8800 syllables, 3500 CC • Evaluation metrics • WER (Word Error Rate) • SER (Syllable Error Rate) • CCER (Character Cluster Error Rate)
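WER, SER and CCER are the same computation applied to different token units: the edit distance between reference and hypothesis token sequences, divided by the reference length. The sketch below illustrates that standard definition; the function names are hypothetical and this is not the scoring tool used in the experiments.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Toy example: one substitution and one deletion over four reference tokens.
print(error_rate(["a", "b", "c", "d"], ["a", "x", "c"]))  # 0.5
```

Passing word tokens gives WER; re-tokenizing both sequences into syllables or character clusters before the call gives SER or CCER.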
Experiments: baseline results • Grapheme vs. phoneme • Performance of the grapheme-based and phoneme-based models is comparable • This shows the potential of the grapheme-based approach
Experiments: Word/Sub-Word LM • Comparison of word and sub-word LMs • A Khmer word is composed, on average, of 3.2 syllables and 4.3 CC
Experiments: Hybrid LMs • Hybrid LMs: • Progressively add the N most frequent words to the CC vocabulary to create hybrid vocabularies • The small V5K vocabulary gives performance comparable to the word-based LM
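The hybrid vocabulary construction described on this slide can be sketched as follows. The slide only states that the N most frequent words are added to the CC vocabulary; how the training text is then re-tokenized is an assumption here (frequent words are kept whole, all other words are split into CCs). The names `build_hybrid_vocab`, `hybrid_tokenize` and `split_into_cc` are hypothetical, and `list` stands in for a real CC segmenter in the toy run.

```python
from collections import Counter


def build_hybrid_vocab(word_corpus, cc_vocab, n):
    """Return (hybrid vocabulary, set of the n most frequent words)."""
    counts = Counter(w for sentence in word_corpus for w in sentence)
    top_words = {w for w, _ in counts.most_common(n)}
    return set(cc_vocab) | top_words, top_words


def hybrid_tokenize(sentence, top_words, split_into_cc):
    """Keep frequent words whole; split the rest into sub-word units."""
    tokens = []
    for w in sentence:
        if w in top_words:
            tokens.append(w)
        else:
            # Assumption: words outside the top-N list fall back to CCs.
            tokens.extend(split_into_cc(w))
    return tokens


# Toy corpus (hypothetical); `list` splits a word into single characters.
corpus = [["aa", "bb", "aa"], ["cc", "aa", "bb"]]
vocab, top = build_hybrid_vocab(corpus, cc_vocab={"a", "b", "c"}, n=1)
print(sorted(vocab))                             # ['a', 'aa', 'b', 'c']
print(hybrid_tokenize(["aa", "bb"], top, list))  # ['aa', 'b', 'b']
```

With n = 0 this reduces to a pure CC vocabulary, and increasing n moves the model toward the word-based LM.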
Conclusion & Future work • ASR for Khmer, an under-resourced, unsegmented language • Tools for language data acquisition and processing • Word/sub-word units for language modeling: hybrid LMs • Grapheme-based vs. grapheme-to-phoneme rule-based acoustic modeling • Future work • Discounting illegal sub-word sequences • A tree structure for words/sub-words at the LM level • System combination scheme based on lattice combination
Questions or suggestions?
References [MorphoChallenge 05] M. Kurimo et al. Unsupervised segmentation of words into morphemes - Morpho Challenge 2005: Application to Automatic Speech Recognition. In Proc. Interspeech, pages 1021-1024, Pittsburgh, PA, 2006. [Le 06] Viet-Bac Le. « Reconnaissance automatique de la parole pour les langues peu dotées ». PhD thesis, Université J. Fourier - Grenoble I, France, 2006. [Den 06] E. Denoual and Y. Lepage. The character as an appropriate unit of processing for non-segmenting languages. NLP Annual Meeting, pages 731-734, Tokyo, Japan, 2006. [Huffman 70] F. Huffman. Cambodian System of Writing and Beginning Reader. Yale University Press, 1970.