10 likes | 128 Views
High level features. Orthography. Initial transcription. Target transcription. Alignment process (letter-to-sound). Alignment process (sound-to-sound). Transformation learning. Learn morphological classes. Example generation. Stochastic rule induction.
E N D
High level features Orthography Initial transcription Target transcription Alignment process (letter-to-sound) Alignment process (sound-to-sound) Transformation learning Learn morphological classes Example generation Stochastic rule induction Towards improved proper name recognition Bert Réveil and Jean-Pierre Martens DSSP group, Ghent University, Department of Electronics and Information Systems Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium {breveil,martens}@elis.ugent.be • Topic description • -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- • Automatic proper name recognition is a key component of multiple speech-based applications (e.g. voice-driven navigation systems). This recognition is challenged by the mismatch between the way the names are represented in the recognizer and the way they are actually pronounced: • Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters can’t cope with archaic spelling and foreign name parts, manual transcriptions are too costly (e.g. Ugchelsegrensweg, Haînautlaan) • Multiple plausible name pronunciations: within or across languages (e.g. Roger) • Cross-lingual pronunciation variation: foreign names, foreign application users • In order to improve the phonemic transcriptions and capture the pronunciation variation we adopt acoustic and lexical modeling approaches. Acoustic modeling targets a better modeling of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon. Please guide me towards ‘A&u.stIn RECOGNITION SYSTEM GPS HMMs Lexicon … Austin 'O.stIn … “O” … • Experimental set-up • ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- • Database: Autonomata Spoken Name Corpus (ASNC) • 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers • Every speaker reads 181 names with either Dutch, English, French, Moroccan or Turkish origin • Non-overlapping train and test set (disjunctive names, speakers) • Human expert transcriptions • TY: typical Dutch transcription (one for each name from TeleAtlas) • AV: auditory verified Dutch transcription (one for each name utterance) • This work: only Dutch native utterances + non-native utterances of Dutch names • Speech recognizer: state-of-the-art VoCon 3200 from Nuance • Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others) • Acoustic and lexical modeling strategies • ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- • The modeling approaches are firstly conceived for the primary targeted users, also called the native (NAT) users (in our case Dutch natives). W.r.t. these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2). • Strategy 1: Incorporating NN1 language knowledge • Acoustic modeling: two model sets • AC-MONO : standard NAT Dutch model (trained on Dutch speech alone) • AC-MULTI : Dutch (20%) and NN1 training data (English, French and German) • Lexical modeling • G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS) • Foreign transcriptions are nativized in combination with AC-MONO • Data-driven selection of one extra G2P converter per name origin • Strategy 2: Creating pronunciation variants (lexical modeling) • Computed per (speaker, name) combination • Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters • Construction of phoneme-to-phoneme converters • -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- • P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances. These 3-tuples are supplied to a 4 step training procedure: • Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription • Transformation retrieval • Generation of training examples: describe linguistic context • Previous and next phonemes and graphemes • Lexical context (Part Of Speech) • Prosodic context (stressed syllable or not) • Morphological context (word prefix/suffix) • External features: e.g. name type, name source, speaker tongue • Rule induction • Learn decision tree per input (pattern): stochastic rules in leaf nodes • Rule formalism: if context→ leaf node then [input pattern] → [output pattern] with probability Pfir In generation mode: rules applied to initial G2P transcription of unseen name variants with probabilities • Experimental assessment • -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- • Incorporating NN1 language knowledge • Including extra G2P transcriptions (acoustic model = AC-MONO) • Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names • Degradation for (DU,DU): reduced by selecting only one extra G2P • Decoding with multilingual acoustic model • NAT speakers: loss for NAT names, boost for English names only • Dutch sounds not as well modeled as before • English better known than French? • English and Dutch sound inventories differ more than French and Dutch? • Foreign speakers: boost for both NN1 name origins • mother tongue sounds better modeled • Plain multilingual G2P transcriptions bring no improvement • Creating pronunciation variants • Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets • Alternative P2Psfor (DU,NN1) and (NN1,DU) cells • create additional P2P that starts from NN1 G2P transcriptions • combine most probable variants generated by both P2P converters • P2P variants lead to significant improvements for all (speaker, name) cells • 10 .. 25% relative for NAT + foreign names , 5 .. 17% for foreign speakers Acknowledgments -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas. References ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [1] B. Réveil, J.-P. Martens and B. D’hoore, How speaker tongue and name source language affect the automatic recognition of spoken names, in Proc. InterSpeech 2009, UK, Brighton [2] H. van den Heuvel, B. Réveil and J.-P. Martens, Pronunciation-based ASR for names, in Proc. InterSpeech 2009, UK, Brighton [3] B. Réveil, J.-P. Martens and H. van den Heuvel, Improving proper name recognition by adding automatically learned pronunciation variants to the lexicon, in Proc. LREC 2010, Valletta, Malta