Learn about the integration of emotion into an Arabic Text-to-Speech system, including prosody analysis, rule extraction, and synthesis techniques. Discover how prosody enhances speech intelligibility and expressiveness.
Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS
Presented by Dr. O. Al Dakkak
Dr. O. Dakkak & Dr. N. Ghneim: HIAST
M. Abu-Zleikha & S. Al-Moubyed: IT fac., Damascus U.
Outline • Arabic TTS • Why Prosody generation? • Prosody Analysis and Rule Extraction • Emotion Inclusion • Results • Conclusion
Arabic Text-to-Speech System • Arabic Text-to-Phonemes (ATOPH), including the open /E/ and /O/ phonemes and emphatic vowels • MBROLA diphone units are used to synthesize speech until our semi-syllable units are ready (the corpus is currently being recorded) • Prosody Generation and Emotion Inclusion
Arabic Text-to-Speech System • MBROLA synthesizes phonemes with control over duration and the F0 contour (given as a set of segments); we implemented a tool to control the amplitude. • Absent phonemes are replaced by the nearest available phonemes • Prosody can be generated and tested
Why Prosody Generation? • Increases intelligibility and expressiveness • Provides the context in which speech is interpreted • Signals speaker intentions (special aids) • Man-machine communication (airports, ...) • Dubbing
Methodology • Based on the punctuation marks (‘,’, ‘.’, ‘?’ and ‘!’), we classify sentences into continuous affirmation, long affirmation, interrogative and exclamation, respectively. • A corpus is recorded and its sentences are analyzed to produce F0 and intensity curves. • A statistical study of the curves is followed by rule extraction, so that the curves can be generated automatically.
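The punctuation-based classification above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_sentence` is hypothetical, and the mapping of the Arabic comma and question mark onto their Latin counterparts is an assumption added here, since the slide lists only the four Latin marks.

```python
# Sentence types keyed by the closing punctuation mark, as listed on the slide.
SENTENCE_TYPES = {
    ",": "continuous affirmation",
    ".": "long affirmation",
    "?": "interrogative",
    "!": "exclamation",
}

# Assumption: Arabic punctuation is treated like its Latin counterpart.
ARABIC_TO_LATIN = {"،": ",", "؟": "?"}


def classify_sentence(text):
    """Return the sentence type for the final punctuation mark, or None."""
    stripped = text.strip()
    if not stripped:
        return None
    last = stripped[-1]
    last = ARABIC_TO_LATIN.get(last, last)
    return SENTENCE_TYPES.get(last)
```

A sentence that ends without one of these marks returns `None`, leaving the caller to decide on a default contour.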
The corpus • A pre-recorded corpus of 12 short sentences for each type, read by 5 speakers (4 male and 1 female). Each sentence has 14 phonemes at most. • A further 10 sentences of variable lengths were recorded, pronounced by 3 speakers. • short: 4-20 phonemes • medium: 20-40 phonemes • long: more than 40 phonemes • The F0 and intensity curves were already available for the pre-recorded corpus and were computed for the further set of recordings.
Rules Extraction • Re-definition of the length concept, using fuzzy sets:
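One way to read the fuzzy re-definition of length is sketched below. The slide's own membership functions are not reproduced in this text, so everything here is an assumption: the helper names, the linear shape of the transitions, and the hypothetical overlap width `w` around the 20- and 40-phoneme boundaries of the corpus.

```python
def falling(x, lo, hi):
    """Membership that is 1 below lo, 0 above hi, and linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def rising(x, lo, hi):
    """Complement of falling(): 0 below lo, 1 above hi."""
    return 1.0 - falling(x, lo, hi)


def length_memberships(n_phonemes, w=5):
    """Degrees of 'short', 'medium' and 'long' for a sentence length.

    The boundaries 20 and 40 come from the corpus description; the
    overlap half-width w is a hypothetical value, not from the slides.
    """
    return {
        "short": falling(n_phonemes, 20 - w, 20 + w),
        "medium": min(rising(n_phonemes, 20 - w, 20 + w),
                      falling(n_phonemes, 40 - w, 40 + w)),
        "long": rising(n_phonemes, 40 - w, 40 + w),
    }
```

With this shape, a sentence near a boundary belongs partly to both neighbouring classes, so a prosody rule could blend the two corresponding contours instead of switching abruptly.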
Rules Extraction • Curve stylization after stochastic analysis, ex:
Emotion Inclusion • Recording a corpus of sentences in 5 different emotions (joy, anger, sadness, fear and surprise) together with their emotionless versions (20 sentences per emotion). • Measuring the prosodic features F0, duration and intensity, and their variations, with Praat. • Extracting rules to automatically produce emotion in synthetic speech. • Validating the rules.
أَهُوَ ذَنْبِي أَنْ أَتَحَمَّلَ أَنَا ذلك؟ (“Is it my fault to bear it?”) • Jitter: irregularities between successive glottal pulses • Range: difference between F0 max and F0 min • F0 average: mean value • Pitch: variation of F0 • Variability: degree of pitch variation (high, low, ...) • Contour slope: shape of the contour slope (range variation)
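The measures defined above can be illustrated with a short Python sketch. The authors measured them with Praat; the functions below are a stand-in, not their procedure. Two assumptions are made here: unvoiced frames are marked with an F0 of 0, and jitter is computed as the common "local" variant (mean absolute difference between consecutive glottal periods, divided by the mean period).

```python
def f0_stats(f0_hz):
    """Mean and range of an F0 contour, ignoring unvoiced frames.

    Assumption: unvoiced frames are encoded as 0 Hz.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("contour has no voiced frames")
    return {
        "mean": sum(voiced) / len(voiced),
        "range": max(voiced) - min(voiced),
    }


def local_jitter(periods_s):
    """Relative irregularity between successive glottal periods (seconds)."""
    if len(periods_s) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods_s) / len(periods_s)
    return mean_diff / mean_period
```

Comparing these values between the emotionless and emotional recordings of the same sentence gives the relative changes (for example "+40%") that the extracted rules are expressed in.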
Example: Anger emotion • F0 mean: + 40%-75% • F0 range: + 50%-100% • F0 at vowels and semi-vowels: + 30% • F0 slope: + • Speech rate: + • Silence rate: - • Duration of vowels and semi-vowels: + • Intensity mean: + • Intensity varies monotonically with F0 • Others: F0 variability: +, F0 jitter: +
Analysis & Rule Extraction: Anger (figure: emotionless vs. with emotion)
Emotion Synthesis: Anger • F0 mean: + 30% • F0 range: + 30% • F0 at vowels and semi-vowels: +100% • Speech rate: +75%-80% • Duration of vowels and semi-vowels: +30% • Duration of fricatives: +20%
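A minimal sketch of how such rules might be applied to MBROLA input follows. It assumes the usual `.pho` line layout (phoneme, duration in ms, then optional pairs of position-in-percent and F0 in Hz). Only four of the rules above are covered: F0 mean, F0 range, and the vowel and fricative durations. Several things here are assumptions, not taken from the slides: the phoneme classes `VOWELS` and `FRICATIVES` are hypothetical symbol sets, "range +30%" is interpreted as widening each F0 target's distance from the utterance mean, and all function names are illustrative.

```python
# Hypothetical phoneme classes; a real system would use its own inventory.
VOWELS = {"a", "i", "u", "a:", "i:", "u:", "w", "j"}
FRICATIVES = {"f", "s", "z", "S", "h", "x"}

# Relative changes taken from the anger synthesis slide.
ANGER = {"f0_mean": 0.30, "f0_range": 0.30,
         "vowel_dur": 0.30, "fricative_dur": 0.20}


def parse_pho(line):
    """Parse 'phoneme duration [pos f0]...' into (name, dur, points)."""
    parts = line.split()
    name, dur = parts[0], float(parts[1])
    nums = [float(p) for p in parts[2:]]
    return name, dur, list(zip(nums[0::2], nums[1::2]))


def apply_rules(segments, rules):
    """Scale durations by phoneme class and reshape F0 around its mean."""
    f0s = [f for _, _, pts in segments for _, f in pts]
    mean = sum(f0s) / len(f0s) if f0s else 0.0
    new_mean = mean * (1 + rules["f0_mean"])
    out = []
    for name, dur, pts in segments:
        if name in VOWELS:
            dur *= 1 + rules["vowel_dur"]
        elif name in FRICATIVES:
            dur *= 1 + rules["fricative_dur"]
        # Assumption: the range rule widens each target's offset from the mean.
        new_pts = [(pos, new_mean + (f - mean) * (1 + rules["f0_range"]))
                   for pos, f in pts]
        out.append((name, dur, new_pts))
    return out


def format_pho(segment):
    """Render a segment back into a .pho-style line with rounded values."""
    name, dur, pts = segment
    fields = [name, str(round(dur))]
    fields += [f"{round(pos)} {round(f)}" for pos, f in pts]
    return " ".join(fields)
```

Typical use would be to parse each line of the emotionless `.pho` output, call `apply_rules(segments, ANGER)`, and write the formatted lines back for MBROLA. The speech-rate rule and the extra F0 boost on vowels and semi-vowels are left out of this sketch.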
Synthetic examples (each one emotionless and with emotion) • Anger: “Who do you think you are?” • Joy: “No more clouds in the sky” • Sadness: “I’m so sad today” • Fear: “What a scary scene!” • Surprise: “What a beautiful scene!”
EmoGen Interface Text Editor Voice Input Text Speech and emotion properties Mbrola Player interface Normal text to MBROLA text Converter (NTMTC) Prosody Generator Emotion Generator
Results • Five sentences for each emotion were synthesized and listened to by 10 people. • Each listener reports the perceived emotion for each sentence (listeners are not given our list of emotions).
Conclusion • An automated tool for emotional Arabic synthesis has been developed. • The prosodic model proposed and tested in this work proved successful, especially in conversational contexts. • Further work will follow to include other emotions: disgust, annoyance, ...