The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman Dr. Sherif Mahdy Abdou
Agenda • Motivations & Applications • Emotional Synthesis Approaches • Unit Selection & Blending Data Approach • Arabic Speech Database & Challenges • Festival – Speech Synthesis framework • Arabic Voice Building • Proposed Utterance Structure • Proposed Target Cost Function • System Evaluation & Results • Conclusion & Future Works
Motivations & Applications • Emotions are inseparable components of natural human speech. • Because of that, the level of natural human speech can only be achieved with the ability to synthesize emotions. • With the tremendous increase in demand for speech synthesis systems, emotional synthesis has become one of the major trends in this field.
Motivations & Applications • Emotional speech synthesis can be applied in different applications such as: • Spoken dialogue systems • Customer-care centers • Task planning • Tutorial systems • Automated agents • Future trends point towards artificial intelligence
Emotional Synthesis Approaches • Attempts to add emotional effects to synthesised speech have existed for more than a decade. • Most emotional synthesis approaches inherit the properties of the various normal synthesis techniques they build on. • Different emotional synthesis techniques provide control over acoustic parameters to very different degrees. • Important emotion-related acoustic parameters include pitch, duration, speaking rate, etc.
Emotional Synthesis Approaches • Emotional Formant Synthesis • Also known as rule-based synthesis • No human speech recordings are involved at run time. • The resulting speech sounds relatively unnatural and “robot-like”. • Emotional Diphone Concatenation • Uses diphone concatenation on emotional recordings • A majority of reports show a degree of success in conveying some emotions. • May harm voice quality
Emotional Synthesis Approaches • HMM-Based Parametric Synthesis • Uses HMM models to change the speaking style and emotional expression of the synthetic speech. • Models are trained on speech produced with different speaking styles. • Requires several hundred sentences spoken in each different speaking style. • Emotion-Based Unit Selection • Uses variable-length units • Units that are similar to the given target are likely to be selected. • This is the approach used in our proposed system.
Emotional Unit Selection • Units are selected from large databases of natural speech. • The quality of unit selection synthesis depends heavily on the quality of the speaker and on the coverage of the database. • The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units. • Two major cost functions are used • Join cost function • Target cost function
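To make the two cost functions concrete, here is a minimal, illustrative sketch (not Festival's actual implementation; the feature names, weights and pitch-based join distance are all made up for the example): the target cost is a weighted sum of feature mismatches, and the join cost penalizes acoustic discontinuity at each concatenation point.

```python
# Illustrative unit-selection costs. Units and targets are plain dicts of
# features; real systems use richer linguistic and acoustic features.

def target_cost(target, candidate, weights):
    """Weighted sum of per-feature mismatch penalties (1 on mismatch)."""
    return sum(w * (0 if target.get(f) == candidate.get(f) else 1)
               for f, w in weights.items())

def join_cost(prev_unit, next_unit):
    """Penalty for discontinuity at the join point. Here: absolute pitch
    difference stands in for a real spectral/pitch distance."""
    return abs(prev_unit["pitch"] - next_unit["pitch"])

def sequence_cost(targets, candidates, weights):
    """Total cost of one candidate unit sequence: sum of target costs
    plus join costs between consecutive units."""
    total = sum(target_cost(t, c, weights) for t, c in zip(targets, candidates))
    total += sum(join_cost(a, b) for a, b in zip(candidates, candidates[1:]))
    return total
```

In a full synthesizer this cost would be minimized over all candidate sequences (e.g. with a Viterbi search), selecting the sequence with the lowest total cost.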
Blending Data Approach • Two main techniques for building emotional voices for unit selection • Tiering techniques • Limited-domain synthesis. • Require a large number of databases for the different emotions. • Blending techniques • General-purpose application. • All databases are placed in one data pool used for selection. • This is the technique used in our system. • In the blending approach, the system chooses units from this single data pool. • Requires a careful selection criterion to match the target speaking style.
Arabic Speech Database • In our system we used the RDI TTS Saudi-speaker database. • This database consists of 10 hours of recordings with neutral emotion and one hour of recordings for four different emotions: sadness, happiness, surprise and questioning. • The EGG signal is also recorded to support pitch marking during the synthesis process. • The database is for a male speaker, sampled at a 16 kHz sampling rate. • An HMM-based Viterbi alignment procedure is used to produce the phone-level segmentation boundaries
Festival – Speech Synthesis framework • The Festival TTS system was developed at the University of Edinburgh by Alan Black and Paul Taylor. • Festival is primarily a research toolkit for speech synthesis (TTS). • It provides a state-of-the-art unit selection synthesis module called “Multisyn”. • Festival uses a data structure called an utterance structure that consists of items and relations. • An utterance represents some chunk of text that is to be rendered as speech
Festival – Speech Synthesis framework • Example of utterance structure: [Figure: utterance structure with items and relations]
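The original slide's illustration is not reproduced here; as a rough stand-in, the sketch below (a hypothetical data layout, not Festival's actual API) models an utterance as items carrying features, grouped into named relations such as Word and Segment, with a structure relation linking words to their segments.

```python
# Toy utterance structure: items are feature dicts, relations are named
# groupings. SylStructure maps each word index to its segment indices
# (Festival's real relations are linked trees, richer than this).

utterance = {
    "Word":    [{"name": "hello"}, {"name": "world"}],
    "Segment": [{"name": "h"}, {"name": "@"}, {"name": "l"}, {"name": "ou"},
                {"name": "w"}, {"name": "er"}, {"name": "l"}, {"name": "d"}],
    "SylStructure": {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]},
}

def segments_of_word(utt, word_index):
    """Traverse the structure relation to find a word's segments."""
    return [utt["Segment"][i]["name"] for i in utt["SylStructure"][word_index]]
```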
Arabic Voice Building • Voice building phase is one of the major steps in building a unit selection synthesizer. • The voice building toolkit that comes with festival has been used for building our Arabic Emotional voice. The following steps were used: • Generate power factors and wave file normalization • Using EGGs for accurate pitch marking extraction • Generating LPC and residuals from wave files • Generate MFCCs, pitch values and spectral coefficients
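The power-normalization step in the list above can be sketched as follows (an assumed convention for the example: scale each file so its RMS energy matches a common target level; Festival's actual voice-building tools are more involved).

```python
import math

def rms(samples):
    """Root-mean-square energy of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def power_factor(samples, target_rms=0.2):
    """Gain to apply so the file's RMS matches the target level."""
    return target_rms / rms(samples)

def normalize(samples, target_rms=0.2):
    """Return the samples scaled to the target RMS level."""
    g = power_factor(samples, target_rms)
    return [s * g for s in samples]
```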
Arabic Voice Building (Pitch marking) • In unit selection, accurate estimation of pitch marks is necessary for pitch modification, to assure optimal quality of the synthetic speech. • We used the EGG signal to extract pitch marks. • Pitch-marking enhancements have been carried out by a matrix optimization process. • The low-pass and high-pass filter cut-off frequencies were the parameters chosen for the optimization process.
Arabic Voice Building (Pitch marking) • Example of pitch marks produced with the default parameters versus the optimized ones. [Figure: default pitch marking vs. optimized pitch marking]
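A much-simplified sketch of pitch-mark extraction from an EGG signal: glottal closure instants appear as sharp positive peaks in the differentiated EGG, so we pick local maxima of dEGG above a relative threshold (a real extractor would also apply the low-pass/high-pass filtering mentioned above; the threshold value here is arbitrary).

```python
def pitch_marks(egg, threshold=0.5):
    """Return sample indices of candidate glottal closure instants:
    local maxima of the differentiated EGG above threshold * max(dEGG)."""
    degg = [b - a for a, b in zip(egg, egg[1:])]   # first difference
    peak = max(degg)
    marks = []
    for i in range(1, len(degg) - 1):
        if (degg[i] > threshold * peak
                and degg[i] >= degg[i - 1]
                and degg[i] > degg[i + 1]):
            marks.append(i)
    return marks
```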
Proposed Utterance Structure • The HMM-based alignment transcription is converted to the utterance structure. • ASMO-449 is used to transliterate the Arabic text into English characters in the utterance files. • The utterance structures of both the training-database utterances and the target utterances are changed to carry emotional information. • A new feature called emotion has been added to the word item type in the utterance structure.
Proposed Utterance Structure • The proposed system is designed to work with three emotional states (normal, sad and question). • So, the emotion feature takes one of three values: • Normal → normal state • Sad → sad emotion state • Question → question emotion state
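Attaching the emotion feature to word items can be sketched as follows (a hypothetical data layout for illustration, not the actual Festival utterance API).

```python
# The three emotional states supported by the proposed system.
EMOTIONS = {"normal", "sad", "question"}

def tag_emotion(words, emotion):
    """Return copies of the word items with the `emotion` feature set."""
    if emotion not in EMOTIONS:
        raise ValueError("unsupported emotion: " + emotion)
    return [dict(w, emotion=emotion) for w in words]
```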
Proposed Target Cost Function • The target cost is a weighted sum of functions that check whether features in the target utterance match those in the candidate utterance. • The standard target cost in Festival makes no computational distinction based on the emotional state of the utterance. • A new emotional target cost is proposed
Proposed Emotional Target Cost Function • The target cost: • The Emotional target cost:
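The equation images did not survive conversion. For reference, the standard unit-selection target cost is a weighted sum of per-feature subcosts, and the emotional variant presumably adds a weighted emotion-mismatch term; the forms below are a reconstruction of this general shape, not necessarily the thesis's exact formulas.

```latex
% Standard weighted-sum target cost between target t_i and candidate u_i
C^{t}(t_i, u_i) = \sum_{j=1}^{p} w_j \, C^{t}_{j}(t_i, u_i)

% Sketch of the emotional extension: an extra penalty term C_e, zero when
% the candidate's emotional state matches the target's, scaled by a tuned
% weight w_e
C^{t}_{emo}(t_i, u_i) = \sum_{j=1}^{p} w_j \, C^{t}_{j}(t_i, u_i)
                      + w_e \, C_{e}(t_i, u_i)
```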
Proposed Target Cost Function • The algorithm in general favors units that are similar in emotional state, or that are classified as emotional, through two stages of penalties. • The tuning weight of the emotional target cost is optimized by trial and error to find an appropriate value.
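The two-stage penalty described above might look like the sketch below (the penalty values and the idea of a reduced second-stage penalty are placeholders for illustration, not the thesis's tuned values).

```python
def emotion_penalty(target_emotion, unit_emotion, unit_is_emotional,
                    full=1.0, reduced=0.5):
    """Two-stage emotional penalty:
    stage 1 -- no penalty if the unit already has the target emotion;
    stage 2 -- a reduced penalty if the unit is at least classified as
    emotional; otherwise the full penalty applies."""
    if unit_emotion == target_emotion:
        return 0.0
    if unit_is_emotional:
        return reduced
    return full
```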
System Evaluation • Six sentences were synthesized for each emotional state. • The sentences were taken from news websites, everyday conversations and the holy Qur’an. • Two major sets of evaluation were used • Deterministic evaluation • Perceptual evaluation • In the deterministic evaluation tests, the emotion classifier system Emovoice is applied to the output utterances. • Emovoice is trained on the Arabic data using an SVM model and its complete feature set.
System Evaluation • In the perceptual tests, 15 participants were involved • Two types of listening tests were performed in order to evaluate the system perceptually. • Intelligibility test • Listeners were asked to type in what they heard • The Word Error Rate (WER) is computed
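The WER used in the intelligibility test is the word-level edit distance (substitutions, insertions and deletions) between what the listener typed and the reference sentence, divided by the number of reference words; a minimal implementation:

```python
def wer(reference, hypothesis):
    """Word Error Rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```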
System Evaluation • Naturalness & emotiveness test • The actual experiment took place on a computer, where subjects listened to the synthesized sentences through headphones. • The listeners were asked to rate the quality of the output voice in terms of naturalness and emotiveness • Participants were asked to give ratings between 1 and 4 for poor, acceptable, good and excellent respectively.
System Evaluation - Results • The confusion matrix between the classifier's output emotion and the target emotional state of the utterance is shown below. [Table: confusion matrix of classified vs. target emotions]
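Building such a confusion matrix from classifier outputs can be sketched as follows (the labels in the test are illustrative, not the thesis's actual results):

```python
from collections import Counter

def confusion_matrix(targets, predictions):
    """Return cm[target][predicted] = count, over all labels seen."""
    counts = Counter(zip(targets, predictions))
    labels = sorted(set(targets) | set(predictions))
    return {t: {p: counts[(t, p)] for p in labels} for t in labels}
```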
System Evaluation - Results • The WER results are shown in the figure. • They show a maximum average WER of 8%, for the sad emotion. [Figure: WER per emotional state]
System Evaluation - Results • The descriptive statistics of the naturalness and emotiveness ratings are summarized below. [Tables: emotiveness ratings, naturalness ratings]
System Evaluation - Results • The overall mean of the naturalness ratings is 2.7, which approaches good-quality naturalness. • The average of the overall emotiveness ratings is 3.1, which indicates a good emotive state in the synthesized speech • The naturalness and emotiveness ratings for question-emotion sentences have a lower mean and higher variance, which means that they were, to some extent, not recognized as natural human speech. However, they show good emotiveness scores.
Conclusions • The main goal of this research was to develop an emotional Arabic TTS voice. • This research focused on three important emotional states: normal, sad and question. • According to the different tests performed, the system shows promising results: most participants found the voice acceptably natural, with a clear and good emotive state.
Future works • It is recommended to increase the duration of acted or real emotional utterances in the RDI Arabic speech database. • Despite the work done on accurate pitch marking, some further enhancements are needed, especially for question utterances. • Optimize the pitch-modification module in Festival for better concatenation across different emotions. • Use emotional speech conversion based on signal-processing and feature-modelling technologies. The initial key features are commonly known to be pitch and duration
Questions?