The Creation of Emotional Effects for An Arabic Speech Synthesis System

The Creation of Emotional Effects for An Arabic Speech Synthesis System Prepared by: Waleed Mohamed Azmy Under Supervision: Prof. Dr. Mahmoud Ismail Shoman Dr. SherifMahdyAbdou

Agenda • Motivations & Applications • Emotional Synthesis Approaches • Unit Selection & Blending Data Approach • Arabic Speech Database & Challenges • Festival – Speech Synthesis framework • Arabic Voice Building • Proposed Utterance Structure • Proposed Target Cost Function • System Evaluation & Results • Conclusion & Future Works

Motivations & Applications • Emotions are inseparable components of the natural human speech. • Because of that, the level of human speech can only be achieved with the ability to synthesize emotions. • As the tremendous increasing in the demand of speech synthesis systems, Emotional synthesis become one of the major tends in this field.

Motivations & Applications • Emotional speech synthesis can be applied to different applications like: • Spoken Dialogue Systems • Customer-care centers • Task planning • Tutorial systems • Automated agents • Future trends are towards approaching Artificial Intelligence

Emotional Synthesis Approaches • Attempts to add emotion effects to synthesised speech have existed for more than a decade. • Most emotional synthesis approaches inherent properties of the various normal synthesis techniques used. • Different emotional synthesis techniques provide control over acoustic parameters to very different degrees. • Important emotional related acoustic parameters like: Pitch, Duration, speaking rate …etc

Emotional Synthesis Approaches • Emotional Formant Synthesis • Also known as rule-based synthesis • No human speech recordings are involved at run time. • The resulting speech sounds relatively unnatural and “robot-like”. • Emotional Diphone Concatenation • Use Diphone concatenation on Emotional recordings • A majority reports shows a degree of success in conveying some emotions. • May harm voice quality

Emotional Synthesis Approaches • HMM-Based Parametric synthesis • Use HMM models to change speaking style and emotional expression of the synthetic speech. • Train models on speech produced with different speaking styles. • Requires several hundreds of sentences spoken in a different speaking style. • Emotional-Based Unit-Selection • Use variable length units • Units that are similar to the given target are likely to be selected. • It is the used approach in our proposed system.

Emotional Unit Selection • Units are selected from large databases of natural speech. • The quality of unit selection synthesis depends heavily on the quality of the speaker and on the coverage of the database. • The quality of the synthesis relies on the fact that little or no signal processing is done on the selected units. • Two major cost functions are used • Joint Cost function • Target Cost function

Blending Data Approach • Two main techniques of building emotional voices for unit selection • Tiering techniques • Limited domain synthesis. • Requires a large number of databases for different emotions. • Blending techniques • General purpose application. • All databases are located in one data pool used for selection. • … used technique in our system. • In blending approach, The system will choose the units from only one Database. • Requires careful selection criterion to match target speaking style.

Arabic Speech Database • In our system we used RDI TTS Saudi speaker database. • This database consists of 10 hours of recording with neutral emotion and one hour of recordings for four different emotions that are sadness, happiness, surprise and questioning. • The EGG signal is also recorded to support pitch marking during the synthesis process. • The database is for male speaker sampled with 16 kHZ sampling rate. • HMMs based Viterbi alignment procedure is used to produce the phone level segmentation boundaries

Festival – Speech Synthesis framework • The Festival TTS system was developed at the University of Edinburgh by Alan Black and Paul Taylor. • Festival is primarily a research toolkit for speech synthesis (TTS). • It provides a state-of-the-art unit selection synthesis module called “Multisyn”. • Festival uses a data structure called an utterance structure that consists of items and relations. • An utterance represents some chunk of text that is to be rendered as speech

Festival – Speech Synthesis framework • Example of utterance structure:

Arabic Voice Building • Voice building phase is one of the major steps in building a unit selection synthesizer. • The voice building toolkit that comes with festival has been used for building our Arabic Emotional voice. The following steps were used: • Generate power factors and wave file normalization • Using EGGs for accurate pitch marking extraction • Generating LPC and residuals from wave files • Generate MFCCs, pitch values and spectral coefficients

Arabic Voice Building(Pitch marking) • In unit selection; Accurate estimation of pitchmarks is necessary for pitch modification to assure optimal quality of synthetic speech. • We used EGG signal to extract pitchmarks. • Pitchmarking enhancements has been carried out by matrix optimization process. • The low pass filter and high pass filter cut-off frequencies have been chosen for the optimization process.

Arabic Voice Building(Pitch marking) • Example of the default parameters of the pitchmarking application versus the optimized ones. Default Pitchmarking Optimized Pitchmarking

Proposed Utterance Structure • The HMM-based alignment transcription is converted to the utterance structure. • ASMO-449 is used to transliterate the Arabic text to English characters in the utterance files. • The utterance structures of both the utterances in the training databases and the target utterances is changed to carry emotional information. • A new feature called emotion has been added to the word item type in the utterance structure.

Proposed Utterance Structure • The proposed system is designed to work with three emotional state (normal, sad and question). • So, The emotion feature takes one of three emotional values. • Normal  Normal state • Sad  Sad emotion state • Question  Question emotion sate

Proposed Target Cost Function • The target cost is a weighted sum of functions that check if features in the target utterance match those features in the candidate utterance. • The standard target cost in festival does not contain any computation differences for emotional state of the utterance. • A new emotional target cost is proposed

Proposed Emotional Target Cost Function • The target cost: • The Emotional target cost:

Proposed Target Cost Function • The algorithm in general favors units that are similar in the emotional sate or classified to be emotional in a two stages of penalties. • The tuning weighting factor of the emotional target cost is optimized by try-and-error to find appropriate value.

System Evaluation • Six sentences were synthesized for each emotional state. • The Sentences were from the news websites, usual conversations and the holy Qur’an. • Two major set of evaluation used • Deterministic Evaluation • Perceptual Evaluation • In deterministic evaluation tests the emotional classifier system Emovoice is used across the output utterance. • Emovoice is trained on the Arabic data using SVM model and its complete feature set.

System Evaluation • In perceptual tests, 15 participant have involved • Two types of listening tests were performed in order to evaluate the system perceptually. • Test intelligibility • The listener was asked to type in what they heard • Word Error Rate (WER) is computed

System Evaluation • Test Naturalness & Emotiveness • The actual experiment took place on a computer where subjects had to listen to the synthesized sentences using headphones. • The listeners are asked to rate the quality of the output voice in terms of (Naturalness & Emotiveness) • Participants were asked to give ratings between 1 and 4 for poor, acceptable, good and excellent respectively.

System Evaluation - Results • The confusion matrix of the classifier output emotion and the target emotional state of the utterance is

System Evaluation - Results • The results of the WER is shown in figure. • It shows a maximum average of 8% in sad emotion WER

System Evaluation - Results • The descriptive statistics of the naturalness and emotiveness ratings is summarized.. Emotiveness Naturalness

System Evaluation - Results • The overall mean of naturalness ratings is 2.7 which approach a good quality naturalness. • The average ratings of overall emotiveness are 3.1 which indicate good emotive state of the synthesized speech • The naturalness and emotiveness ratings for question emotion sentences has lower mean value and high variance which means that they were not –to some extent- recognized as natural human speech. However it shows a good emotiveness scores.

Conclusions • The main goal of this research was to develop an Emotional Arabic TTS voice. • This research focused on three important emotional sates; normal, sad and questions. • According to the different tests performed on the system, it shows promising results. At most the participants feel acceptable natural voice with clear good emotive state.

Future works • It is recommended to increase the duration of acted or real emotional utterances in the RDI Arabic speech database. • However the work done for accurate pitch-marking, some further enhancements are needed especially for question speech utterances. • Optimizing the pitch modification module in festival for better concatenation with different emotions. • Use emotional speech conversion based on signal processing and feature modelling technologies. The initial key features are commonly known to be pitch and duration

? Questions

The Creation of Emotional Effects for An Arabic Speech Synthesis System

The Creation of Emotional Effects for An Arabic Speech Synthesis System

Presentation Transcript

Emotional Speech

Emotional Speech

The Emotional Effects of Massage

A Text-to-Speech Synthesis System

Speech synthesis

The Emotional Effects of Retirement

Creation of HMM-based Speech M odel for Estonian Text-to-Speech Synthesis

Emotional Speech

Speech Synthesis

Speech Synthesis

The Arabic Writing System

Design of Tree-based Context Clustering for an HMM-based Thai Speech Synthesis System

The LIG Arabic / English Speech Translation System at IWSLT07

Perspectives for Articulatory Speech Synthesis

Speech Service Creation

Speech Synthesis

Novel Speech Recognition Models for Arabic

Emotional Speech Synthesis: a comparison of different methods

Numerical Text-to-Speech Synthesis System

4. Speech Synthesis

Novel Speech Recognition Models for Arabic