
Construction of phoneme-to-phoneme converters

High level features. Orthography. Initial transcription. Target transcription. Alignment process (letter-to-sound). Alignment process (sound-to-sound). Transformation learning. Learn morphological classes. Example generation. Stochastic rule induction.



Towards improved proper name recognition

Bert Réveil and Jean-Pierre Martens
DSSP group, Ghent University, Department of Electronics and Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
{breveil,martens}@elis.ugent.be

Topic description
-----------------
Automatic proper name recognition is a key component of many speech-based applications (e.g. voice-driven navigation systems). Recognition is hampered by the mismatch between the way names are represented in the recognizer and the way they are actually pronounced:
• Incorrect phonemic name transcriptions: common grapheme-to-phoneme (G2P) converters cannot cope with archaic spellings and foreign name parts (e.g. Ugchelsegrensweg, Haînautlaan), and manual transcriptions are too costly
• Multiple plausible name pronunciations: within or across languages (e.g. Roger)
• Cross-lingual pronunciation variation: foreign names, foreign application users
To improve the phonemic transcriptions and capture the pronunciation variation, we adopt acoustic and lexical modeling approaches.
Acoustic modeling aims at a better modeling of the expected utterance sounds. Lexical modeling tries to foresee the most plausible phonemic transcription(s) for each name in the recognition lexicon.

[Figure: "Please guide me towards 'A&u.stIn" → RECOGNITION SYSTEM (GPS): HMMs, Lexicon (… Austin 'O.stIn …), "O" …]

Experimental set-up
-------------------
• Database: Autonomata Spoken Name Corpus (ASNC)
  • 120 Dutch, 40 English, 20 French, 40 Moroccan and 20 Turkish speakers
  • Every speaker reads 181 names of Dutch, English, French, Moroccan or Turkish origin
  • Non-overlapping train and test sets (disjoint names and speakers)
  • Human expert transcriptions
    • TY: typical Dutch transcription (one for each name, from TeleAtlas)
    • AV: auditorily verified Dutch transcription (one for each name utterance)
  • This work: only Dutch native utterances + non-native utterances of Dutch names
• Speech recognizer: state-of-the-art VoCon 3200 from Nuance
  • Grammar: name loop with 21K different names (3.5K names of ASNC + 17.5K others)

Acoustic and lexical modeling strategies
----------------------------------------
The modeling approaches are conceived first for the primary target users, also called the native (NAT) users (in our case Dutch natives). With respect to these users, two types of non-native languages are distinguished: foreign languages that most NAT speakers are familiar with (NN1), and other foreign languages (NN2).
• Strategy 1: Incorporating NN1 language knowledge
  • Acoustic modeling: two model sets
    • AC-MONO: standard NAT Dutch model (trained on Dutch speech alone)
    • AC-MULTI: Dutch (20%) and NN1 training data (English, French and German)
  • Lexical modeling
    • G2P transcribers for NAT and NN1 languages (Nuance RealSpeak TTS)
    • Foreign transcriptions are nativized in combination with AC-MONO
    • Data-driven selection of one extra G2P converter per name origin
• Strategy 2: Creating pronunciation variants (lexical modeling)
  • Computed per (speaker, name) combination
  • Created from initial G2P transcriptions by means of automatically learned phoneme-to-phoneme (P2P) converters

Construction of phoneme-to-phoneme converters
---------------------------------------------
P2P learning requires the orthographic transcription, an initial G2P transcription and a target phonemic transcription (e.g. TY or AV) of a sufficiently large collection of name utterances.
These 3-tuples are supplied to a 4-step training procedure:
• Two-fold alignment: Orthography ↔ Initial transcription ↔ Target transcription
• Transformation retrieval
• Generation of training examples: describe linguistic context
  • Previous and next phonemes and graphemes
  • Lexical context (part of speech)
  • Prosodic context (stressed syllable or not)
  • Morphological context (word prefix/suffix)
  • External features: e.g. name type, name source, speaker tongue
• Rule induction
  • Learn decision tree per input (pattern): stochastic rules in leaf nodes
  • Rule formalism: if context → leaf node then [input pattern] → [output pattern] with probability Pfir
In generation mode: rules applied to initial G2P transcription of unseen name → variants with probabilities

Experimental assessment
-----------------------
• Incorporating NN1 language knowledge
  • Including extra G2P transcriptions (acoustic model = AC-MONO)
    • Boost for (DU,-DU): NAT speakers use NN1 knowledge when reading foreign names, including NN2 names
    • Degradation for (DU,DU): reduced by selecting only one extra G2P
  • Decoding with multilingual acoustic model
    • NAT speakers: loss for NAT names, boost for English names only
      • Dutch sounds not as well modeled as before
      • English better known than French?
      • English and Dutch sound inventories differ more than French and Dutch?
    • Foreign speakers: boost for both NN1 name origins
      • mother tongue sounds better modeled
    • Plain multilingual G2P transcriptions bring no improvement
• Creating pronunciation variants
  • Baseline P2Ps: Dutch G2P transcriptions as initials, AV transcriptions as targets
  • Alternative P2Ps for (DU,NN1) and (NN1,DU) cells
    • create additional P2P that starts from NN1 G2P transcriptions
    • combine most probable variants generated by both P2P converters
  • P2P variants lead to significant improvements for all (speaker, name) cells
    • 10 .. 25% relative for NAT + foreign names, 5 .. 17% for foreign speakers

Acknowledgments
---------------
The presented work was carried out in the Autonomata TOO project, granted under the Dutch-Flemish STEVIN program (http://taalunieversum.org/taal/technologie/stevin/), with partners RU Nijmegen, Universiteit Utrecht, Nuance and TeleAtlas.

References
----------
[1] B. Réveil, J.-P. Martens and B. D'hoore, How speaker tongue and name source language affect the automatic recognition of spoken names, in Proc. InterSpeech 2009, Brighton, UK
[2] H. van den Heuvel, B. Réveil and J.-P. Martens, Pronunciation-based ASR for names, in Proc. InterSpeech 2009, Brighton, UK
[3] B. Réveil, J.-P. Martens and H. van den Heuvel, Improving proper name recognition by adding automatically learned pronunciation variants to the lexicon, in Proc. LREC 2010, Valletta, Malta
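
Illustrative sketch
-------------------
To make the P2P construction steps more concrete, the following Python sketch walks through a strongly simplified version of the pipeline: sound-to-sound alignment of an initial and a target transcription, transformation retrieval, and generation of variants with probabilities. It is not the authors' implementation. In particular, it assumes single-phoneme input patterns, uses `difflib.SequenceMatcher` as a stand-in aligner, skips the letter-to-sound alignment, and replaces the context-dependent decision trees by context-free relative frequencies. The function names `align`, `learn_rules` and `generate_variants` are hypothetical.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import product


def align(initial, target):
    """Pair each initial phoneme with the tuple of target phonemes it maps to.

    Assumption: a generic sequence matcher replaces the poster's
    sound-to-sound alignment, which is not specified in detail.
    """
    outputs = [[] for _ in initial]
    matcher = SequenceMatcher(a=initial, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        block = target[j1:j2]
        if tag == "insert":
            # Attach inserted phonemes to the preceding input phoneme
            # (or to the first one when the insertion is word-initial).
            if outputs:
                outputs[max(i1 - 1, 0)].extend(block)
            continue
        # equal / replace / delete: map positionally; the last input
        # phoneme of the block absorbs any remaining target phonemes.
        for offset, i in enumerate(range(i1, i2)):
            is_last = i == i2 - 1
            outputs[i].extend(block[offset:] if is_last else block[offset:offset + 1])
    return [(p, tuple(o)) for p, o in zip(initial, outputs)]


def learn_rules(examples):
    """Transformation retrieval: estimate P(output pattern | input phoneme).

    Simplification: context-free relative frequencies instead of one
    decision tree per input pattern with stochastic rules in the leaves.
    """
    counts = defaultdict(Counter)
    for initial, target in examples:
        for src, dst in align(initial, target):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }


def generate_variants(initial, rules, top_n=3):
    """Generation mode: apply the rules to an unseen initial transcription.

    Returns the top_n variants as (phoneme tuple, probability) pairs.
    Phonemes without a learned rule are copied with probability 1.
    Enumerates all combinations, so this is only suitable for toy input.
    """
    options = [list(rules.get(p, {(p,): 1.0}).items()) for p in initial]
    variants = Counter()
    for combo in product(*options):
        phones = tuple(ph for out, _ in combo for ph in out)
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants[phones] += prob
    return variants.most_common(top_n)
```

With a list of (initial, target) transcription pairs, each given as a list of phoneme symbols, `learn_rules(pairs)` yields the rule table and `generate_variants(new_transcription, rules)` returns the most probable variants for a name that was not in the training data. Restoring the linguistic context features listed on the poster would mean conditioning each probability on the context of the input phoneme rather than on the phoneme alone.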
