Progress in Arabic Broadcast News Transcription at BBN • Mohamed Afify, Long Nguyen, John Makhoul • STT Workshop, Philadelphia, PA, March 24, 2005
Overview • Problems in Arabic speech recognition • Arabic Treebank and Buckwalter morphological analyzer • Building phonetic systems in Arabic • Comparison of phonetic and grapheme models • Experimental results • Summary and future work
Problems in Arabic Speech Recognition • Lack of short vowels in existing corpora • Creates ambiguity for acoustic and language models • Most systems rely on grapheme-based acoustic models • No explicit models for short vowels • Therefore, no detailed phonetic acoustic models • Language models also ignore short vowels • Affixes create a large number of “words” • e.g., “and he will write it” is one word in Arabic • OOV rate is around 5% for a 64K lexicon, compared to around 0.5% for English • Morphological richness also adds to the large number of words
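The OOV figures above depend on how the lexicon is built. As a minimal sketch, assuming the lexicon is simply the N most frequent training words (the slides do not say how the 64K lexicon was selected), the OOV rate is the fraction of test tokens that fall outside that list. The function name `oov_rate` and the toy tokens are illustrative only, not from the original work.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, lexicon_size):
    """Fraction of test tokens missing from a top-N frequency lexicon.

    Assumption: the lexicon is the `lexicon_size` most frequent
    training words; the original system may have chosen it differently.
    """
    counts = Counter(train_tokens)
    lexicon = {word for word, _ in counts.most_common(lexicon_size)}
    if not test_tokens:
        return 0.0
    misses = sum(1 for word in test_tokens if word not in lexicon)
    return misses / len(test_tokens)

# Toy, made-up tokens; an affixed form such as "wsyktbh" is a separate
# "word" and is easily missed by a small lexicon.
train = "w ktb w qAl ktb w".split()
test = "w ktb wsyktbh qAl".split()
rate_small = oov_rate(train, test, lexicon_size=2)
rate_large = oov_rate(train, test, lexicon_size=3)
```

Growing the lexicon lowers the rate in this toy case, which mirrors the "increase lexicon size" remedy, but affixed forms that never occur in training stay out of vocabulary regardless.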
Possible Solutions • Short Vowels • Obtain vowelization of words in dictionary from Arabic Treebank and morphological analysis • Bootstrap acoustic-phonetic models for all phonemes, including short vowels • Extend vowelization process to language model • Affixes and morphological richness • Reduce OOV rate by increasing lexicon size • Use morphological analysis to decompose words into components • Current focus • Bootstrap acoustic models for short vowels • Build phonetic system • Available resources • No vowelized speech corpus • Arabic Treebank • Buckwalter morphological analyzer
LDC Arabic Treebank • Text only; no speech • Consists of three parts • The words in the articles in Parts 1 and 2 are vowelized in context • The unique words in Part 3 have multiple pronunciations based on the Buckwalter morphological analyzer
Buckwalter Morphological Analyzer • Available from LDC • Uses a lexicon and a set of rules for affixes to • Assign parts of speech to a word • Produce different vowelizations for each word • Version 2.0 was recently released • Several new features • Produces all possible ending vowelizations for the input word • Can only analyze words whose stems are in its lexicon • Lexicon has about 40K stems • Does not include many foreign words • Does not deal with misspelled words
Building an Arabic Phonetic System • Use Arabic Treebank and Buckwalter morphological analyzer to bootstrap short vowels for acoustic training data and recognition lexicon • Method 1 • Look up word in Treebank dictionary • If not found, pass to morphological analyzer • If both fail, discard word or manually vowelize • Method 2 • Pass word to morphological analyzer • If that fails, look up in Treebank dictionary • If both fail, discard word or manually vowelize • As a result, some acoustic training data and words in recognition lexicon were discarded • We found Method 2 to give more consistent vowelizations than Method 1
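Methods 1 and 2 differ only in the order in which the two sources are consulted. A minimal sketch of that cascade follows. The dictionaries `TREEBANK` and `ANALYZER` are hypothetical stand-ins with toy entries, not the real Treebank dictionary or the Buckwalter analyzer interface, and dropping a whole utterance when one word fails is an assumption about how training data was discarded.

```python
from typing import Callable, List, Optional

Vowelizer = Callable[[str], Optional[List[str]]]

def make_cascade(first: Vowelizer, second: Vowelizer) -> Vowelizer:
    """Try `first`, fall back to `second`, return None if both fail."""
    def vowelize(word: str) -> Optional[List[str]]:
        for source in (first, second):
            result = source(word)
            if result:
                return result
        return None  # caller discards the word or vowelizes it manually
    return vowelize

# Hypothetical toy resources keyed by unvowelized form.
TREEBANK = {"ktb": ["kataba"], "qAl": ["qAla"]}
ANALYZER = {"ktb": ["kataba", "kutiba", "kutub"], "drs": ["darasa"]}

def treebank_lookup(word: str) -> Optional[List[str]]:
    return TREEBANK.get(word)

def analyzer_lookup(word: str) -> Optional[List[str]]:
    return ANALYZER.get(word)

method1 = make_cascade(treebank_lookup, analyzer_lookup)  # Treebank first
method2 = make_cascade(analyzer_lookup, treebank_lookup)  # analyzer first

def vowelize_utterance(words: List[str], vowelize: Vowelizer):
    """Return per-word vowelizations, or None if any word fails
    (assumed here to mean the utterance is dropped from training)."""
    vowelized = []
    for word in words:
        options = vowelize(word)
        if options is None:
            return None
        vowelized.append(options)
    return vowelized
```

In this toy setup `method1("ktb")` returns only the single Treebank form, while `method2("ktb")` returns every form the analyzer stand-in lists, so the same word receives different pronunciation sets depending on the order of the cascade.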
Arabic Phonetic System (cont’d) • Starting with 100 hrs of possible acoustic training data and a 64K recognition lexicon, we were able to keep: • 80 hrs (63K utterances) of data with short vowels • 62K recognition lexicon with short vowels • A 35-phoneme set (28 consonants + 6 vowels + “taa marbuuTa”) • Phonetic transcription rules are relatively straightforward starting from vowelized transcriptions • Built a conventional phonetic system and compared it to the grapheme system • No vowelization for language model
Initial Results • Dev03, unadapted results, Method 1 vowelization • Normalization I: Normalize “hamza” at beginning of the word • Normalization II: Normalize “hamza” at beginning of the word, after popular prefixes, and also frequent “Y” and “y” confusions at end of word • Text normalization is much more important for the phonetic system
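The two normalizations can be sketched on Buckwalter-transliterated text. Everything specific here is an assumption made for illustration: that the hamza-carrying alef forms are ">", "<" and "|" and are mapped to bare "A", that the "popular prefixes" are the ones listed in `PREFIXES`, and that a final "Y" is mapped to "y" (the slides do not give the direction). The actual rules used in the system are not stated.

```python
# Assumed Buckwalter symbols: ">" and "<" are alef with hamza above/below,
# "|" is alef madda, "A" is bare alef, "Y" is alef maqsura, "y" is yaa.
HAMZA_ALEF_FORMS = "><|"

# Hypothetical prefix list, longest first so "wAl" is tried before "w".
PREFIXES = ("wAl", "bAl", "Al", "w", "b", "l", "f")

def normalize_initial_hamza(word: str) -> str:
    """Replace a word-initial hamza-carrying alef with bare alef."""
    if word and word[0] in HAMZA_ALEF_FORMS:
        return "A" + word[1:]
    return word

def normalization_1(word: str) -> str:
    """Normalization I: hamza at the beginning of the word only."""
    return normalize_initial_hamza(word)

def normalization_2(word: str) -> str:
    """Normalization II: also after a prefix, plus final Y/y merging."""
    word = normalize_initial_hamza(word)
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and len(rest) > 1 and rest[0] in HAMZA_ALEF_FORMS:
            word = prefix + "A" + rest[1:]
            break
    if word.endswith("Y"):
        word = word[:-1] + "y"  # assumed direction of the merge
    return word
```

Under these assumptions `normalization_1` leaves a form like "w>Hmd" untouched, while `normalization_2` maps it to "wAHmd". The same function would be applied to acoustic transcripts, language model text, and scoring references so that all three agree.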
Updated Development Results • Use Normalization II on acoustic and language training data, and for scoring • Use Method 2 to bootstrap short vowels • Expanded phonetic transcription rules to include assimilation of word-initial hamza and definite article • Dev03 test set, unadapted decoding • About 13% improvement for phonetic system
Experimental Results • About 80 hrs of net acoustic training data • ML models for unadapted decoding • ML SAT models for adapted decoding • About 300M words of language training data • 3-gram language models • 60K recognition lexicon • Adapted decoding on different test sets
Next Immediate Steps • Use all 100 hrs for acoustic training • Phonetic models can automatically vowelize discarded sentences • Possibly manually vowelize missing words • Use 64K recognition lexicon • Manually vowelize missing words • Gain is about 1% absolute on Dev03 for grapheme system • Switch to MMI models for unadapted and adapted decoding
Summary and Future Work • Quickly bootstrapped a phonetic system for Arabic • Text normalization and Buckwalter morphological analyzer version 2.0 are key to success • Improvement of 8% to 13.5% over grapheme system across different test sets • Further improvement can be obtained by straightforward upgrades • Future work • Using vowelization in language model • Increasing lexicon size to reduce OOV rate • Statistical vowelization for missed words, mainly foreign names