Broadcast News Transcription System for Khmer Language

S. Seng (1,2,3), S. Sam (1,2,3), B. Bigi (1), L. Besacier (1), E. Castelli (2)
(1) LIG, Grenoble, France; (2) MICA, Hanoi, Vietnam; (3) ITC, Phnom Penh, Cambodia
Sopheap.Seng@imag.fr

Outline • ASR for Khmer: the challenges • Language data acquisition • Word/Sub-word Language Modeling • Acoustic Modeling • Experiments & Results • Conclusion & future work S. SENG, Lrec'08 Marrakech

Khmer Language • Official language of Cambodia • Spoken by more than 15 M people • An atonal language • Writing system • 33 Consonants, 23 dependent vowels • 14 independent vowels, 13 diacritics and various signs • No explicit word boundary

ASR for Khmer: the challenges • An under-resourced language • Lack of text and speech data in digital form • No explicit word segmentation • Automatic segmentation is needed to make language modeling feasible • State-of-the-art segmentation methods • Use hand-crafted lexicons, statistics, optimization criteria … • Error-prone • Other under-resourced, unsegmented languages in the region: Burmese, Lao, Vietnamese …

Language data acquisition: text data • Retrieving text from the Web • Well-selected, content-rich websites vs. crawling the Web • Adapting ClipsTextTk, an open source corpus creation tool, to the Khmer language • Conversion from legacy character encodings to Unicode • Automatic segmentation • Conversion of special signs and numbers to text • Normalization of word spelling • Text corpus obtained from 5 well-selected sites: • 2.5000 html documents retrieved • After processing: 0.5 M sentences, 15 M words • Duration: November 2007 – January 2008

Language data acquisition: text data • An example of segmentation of Khmer text • Word segmentation • Based on an 18k-entry lexicon from the official Chhoun Nat dictionary • Optimization criterion: longest matching • Imperfect segmentation • Syllable segmentation • Rule-based (20 rules) • Imperfect segmentation • Character Cluster (CC) segmentation • A CC is a group of characters with a well-defined structure • CC segmentation is a trivial task
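
The longest-matching criterion named on this slide can be sketched as a greedy left-to-right scan. This is a minimal illustration, not the authors' tool: the function name, the single-character fallback for uncovered text, and the toy Latin-script lexicon are all assumptions made for the example.

```python
def longest_match_segment(text, lexicon, max_len=None):
    """Greedy left-to-right longest-matching segmentation.

    Characters not covered by any lexicon entry are emitted one at a time
    (an assumed fallback; the slides do not say how such cases are handled).
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No lexicon entry starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon (hypothetical, Latin script for readability).
lexicon = {"news", "newspaper", "paper", "radio"}
print(longest_match_segment("newspaperradio", lexicon))  # ['newspaper', 'radio']
```

Because the scan always commits to the longest entry, a long word can swallow the start of the next one; that is one way a lexicon-driven segmenter ends up imperfect.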

Language data acquisition: speech signal • Speech data collection • Downloadable Khmer radio programs • Sites: Voice of America, Free Asia, Radio Australia … • Quality: narrowband, poor quality • Recording of local radio broadcast news, Phnom Penh • Manual transcription campaign • Student volunteers contributed the transcriptions • 6 h 30 min of transcribed speech obtained (news read in the studio) • 3000 sentences, 45k words, 8 speakers (3 women)

Statistical Language Modeling • Problem • Very limited quantity of text data • Erroneous word segmentation, high OOV rate • Is the word still an optimal modeling unit? • In the literature, alternative modeling units have been used: • Morphemes for morphologically rich languages [MorphoChallenge 05] • Logographic characters for Japanese [Den 06] • Logographic characters + words for Chinese [Chen 00] • Syllables for Vietnamese [Le 06] • Sub-word units in Khmer: syllable, character cluster
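
As a rough illustration of the OOV problem raised here, the sketch below measures the out-of-vocabulary rate of a test text against a vocabulary of the N most frequent training units. The helper names and the toy tokens are hypothetical; the same code applies whether the units are words, syllables, or character clusters.

```python
from collections import Counter


def top_n_vocab(tokens, n):
    """Vocabulary made of the n most frequent units in the training tokens."""
    return {unit for unit, _ in Counter(tokens).most_common(n)}


def oov_rate(tokens, vocab):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


# Toy data (hypothetical): letters stand in for modeling units.
train = "a b a c a b d".split()
test = "a b c e".split()
vocab = top_n_vocab(train, 2)   # {'a', 'b'}
print(oov_rate(test, vocab))    # 0.5
```

Smaller units give a smaller closed inventory, so the same amount of text covers a larger share of the test tokens.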

Statistical Language Modeling • Using words & sub-words in language modeling • Exploit different views of the same data • Deal with the OOV problem • Compensate for the errors introduced by automatic segmentation • Create hybrid LMs by combining words and sub-words

Acoustic modeling • Grapheme-based pronunciation • Each grapheme is used directly as a modeling unit • Rule-based grapheme-to-phoneme pronunciation • General Khmer syllable structure: C[C]V[CF] • 20 conversion rules derived from the rules in [Huffman 70] • Not all words can be phonetized by the rules (especially words from Sanskrit and Pali)
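
A grapheme-based pronunciation dictionary can be generated mechanically, since every written character becomes one acoustic unit. The sketch below assumes one unit per Unicode code point, a hypothetical `U<hex>` naming scheme for the units, and a plain "word unit unit ..." line layout; the slides do not specify how the authors named their 77 units or whether signs such as the subscript marker were separate units.

```python
def grapheme_units(word):
    """One modeling unit per code point, named by its hex value (assumed scheme)."""
    return [f"U{ord(ch):04X}" for ch in word]


def build_grapheme_lexicon(words):
    """Map each distinct word to its grapheme unit sequence."""
    return {w: grapheme_units(w) for w in sorted(set(words))}


def format_dictionary(lexicon):
    """Render the lexicon as 'word unit unit ...' lines."""
    return "\n".join(f"{w} {' '.join(units)}" for w, units in lexicon.items())


# The Khmer word for "Khmer", written with escapes so the code points are explicit.
khmer = "\u1781\u17d2\u1798\u17c2\u179a"
lexicon = build_grapheme_lexicon([khmer])
print(format_dictionary(lexicon))
```

No linguistic knowledge is needed here, which is what makes the approach attractive when conversion rules fail on Sanskrit and Pali words.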

Acoustic modeling • Khmer phoneme inventory (source: [Huffman 70])

Experiments: ASR system • Decoder • Sphinx V3.6 • Test corpus • 172 utterances • Acoustic model HMM training • SphinxTrain • Grapheme-based: 77 modeling units • Phoneme-based: 33 modeling units (single phone) • Context-independent and context-dependent models • 3-gram LMs • Word/sub-word LMs: word, syllable, CC • Hybrid LMs: CC + N most frequent words (N varies from 0 to 20k) • Vocabulary: 20k most frequent words, 8800 syllables, 3500 CCs • Evaluation metrics • WER (Word Error Rate) • SER (Syllable Error Rate) • CCER (Character Cluster Error Rate)
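
The three evaluation metrics listed above share one definition and differ only in the unit used to tokenize the reference and the hypothesis. The sketch below uses the standard edit-distance formulation, (substitutions + deletions + insertions) / reference length; it is a generic illustration, and the scoring tool actually used in the experiments is not stated in the slides.

```python
def error_rate(reference, hypothesis):
    """Edit distance between two token lists, divided by the reference length."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must not be empty")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m] / n


# Toy example: one substitution (b -> x) and one deletion (d) over 4 tokens.
print(error_rate("a b c d".split(), "a x c".split()))  # 0.5
```

Passing word lists gives WER; re-segmenting both sides into syllables or character clusters before the call gives SER or CCER.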

Experiments: baseline results • Grapheme vs. phoneme • Performance of the grapheme-based and phoneme-based models is comparable • This shows the potential of the grapheme-based approach

Experiments: Word/Sub-Word LM • Comparison of word and sub-word LMs • A Khmer word is on average composed of 3.2 syllables and 4.3 CCs

Experiments: Hybrid LMs • Hybrid LMs: • Progressively add the N most frequent words to the CC vocabulary to create hybrid vocabularies • The small V5K vocabulary gives performance comparable to the word-based LM
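
The hybrid vocabulary construction described on this slide can be sketched as follows. The function names are hypothetical, and the second helper reflects an assumption about how training text would be rewritten for such an LM (frequent words kept whole, all other words split into CCs); the slides do not spell that step out. The CC splitter is passed in as a parameter, and the toy example uses single Latin letters as stand-ins for character clusters.

```python
from collections import Counter


def hybrid_vocabulary(cc_vocab, word_tokens, n):
    """Return (CC vocabulary plus the n most frequent words, those words)."""
    top_words = {w for w, _ in Counter(word_tokens).most_common(n)}
    return set(cc_vocab) | top_words, top_words


def to_hybrid_units(words, top_words, split_cc):
    """Keep frequent words whole; split every other word into CCs (assumed)."""
    units = []
    for w in words:
        if w in top_words:
            units.append(w)
        else:
            units.extend(split_cc(w))
    return units


# Toy data: each Latin letter stands in for one character cluster.
words = ["ab", "cd", "ab", "ef"]
cc_vocab = {"a", "b", "c", "d", "e", "f"}
vocab, top_words = hybrid_vocabulary(cc_vocab, words, n=1)
print(sorted(vocab))  # ['a', 'ab', 'b', 'c', 'd', 'e', 'f']
print(to_hybrid_units(words, top_words, split_cc=list))
```

With n = 0 the result is the pure CC vocabulary; raising n moves it step by step toward a word vocabulary while the CCs keep every word representable.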

Conclusion & Future work • ASR for Khmer, an under-resourced, unsegmented language • Tools for language data acquisition and processing • Word/sub-word units for language modeling: hybrid LMs • Grapheme-based vs. rule-based grapheme-to-phoneme acoustic modeling • Future work • Discounting illegal sub-word sequences • A tree structure for words/sub-words at the LM level • System combination scheme based on lattice combination

Questions or suggestions?

References
[MorphoChallenge 05] M. Kurimo et al. Unsupervised segmentation of words into morphemes - Morpho Challenge 2005: Application to Automatic Speech Recognition. In Proc. Interspeech, pages 1021-1024, Pittsburgh, PA, 2006.
[Le 06] Viet-Bac Le. « Reconnaissance automatique de la parole pour les langues peu dotées ». Thèse de doctorat de l’Université J. Fourier - Grenoble I, France, 2006.
[Den 06] E. Denoual and Y. Lepage. The character as an appropriate unit of processing for non-segmenting languages. NLP Annual Meeting, pages 731-734, Tokyo, Japan, 2006.
[Huffman 70] Franklin Huffman. Cambodian System of Writing and Beginning Reader. Yale University Press, 1970.
