
Prosodic Feature Introduction for Emotion Incorporation in Arabic TTS

Learn about the integration of emotion into an Arabic Text-to-Speech system, including prosody analysis, rule extraction, and synthesis techniques. Discover how prosody enhances speech intelligibility and expressiveness.





Presentation Transcript


  1. Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS • Presented by Dr. O. Al Dakkak • Dr. O. Dakkak & Dr. N. Ghneim: HIAST • M. Abu-Zleikha & S. Al-Moubyed: IT Faculty, Damascus University

  2. Outline • Arabic TTS • Why Prosody Generation? • Prosody Analysis and Rule Extraction • Emotion Inclusion • Results • Conclusion

  3. Arabic Text-to-Speech System • Arabic Text-to-Phonemes (ATOPH), including the open /E/ and /O/ phonemes and emphatic vowels • Use of MBROLA diphone units to synthesize speech until our semi-syllable units are ready (their corpus is currently being recorded) • Prosody generation and emotion inclusion

  4. Arabic Text-to-Speech System • MBROLA synthesizes phonemes with control over duration and the F0 contour (given as a set of segments); we implemented an additional tool to control amplitude. • Phonemes absent from the MBROLA database are replaced by the nearest available phonemes. • This makes it possible to generate and test prosody.
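A minimal Python sketch of the MBROLA input described above, assuming the standard .pho text format (one phoneme per line: symbol, duration in ms, then optional pairs of position in % of the phoneme and F0 in Hz; lines starting with ';' are comments). The `Segment` class, the `to_pho` function and the `FALLBACK` table are hypothetical illustrations of the "nearest present phoneme" replacement, not the authors' tool, and the symbols are not those of any real Arabic MBROLA voice.

```python
from dataclasses import dataclass, field

# HYPOTHETICAL fallback table: phonemes absent from the voice are mapped to a
# "nearest" phoneme the voice does contain. Symbols are illustrative only.
FALLBACK = {"E": "a", "O": "u"}


@dataclass
class Segment:
    phoneme: str
    duration_ms: int
    # Pitch targets as (position in % of the segment duration, F0 in Hz).
    pitch: list = field(default_factory=list)


def to_pho(segments, available):
    """Render segments as MBROLA .pho lines: 'phoneme duration [pos f0]...'."""
    lines = ["; generated by a sketch"]  # ';' starts a comment line in .pho files
    for seg in segments:
        name = seg.phoneme
        if name not in available:
            # Substitute the nearest available phoneme; the symbol is kept
            # unchanged when no fallback is known.
            name = FALLBACK.get(name, name)
        targets = " ".join(f"{pos} {f0}" for pos, f0 in seg.pitch)
        lines.append(f"{name} {seg.duration_ms} {targets}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    segs = [
        Segment("_", 100),                          # silence ('_' in many voices)
        Segment("m", 60, [(50, 120)]),
        Segment("E", 110, [(0, 130), (100, 110)]),  # open /E/, absent from the voice
        Segment("_", 100),
    ]
    print(to_pho(segs, available={"_", "m", "a", "u"}))
```

Running it prints a comment line followed by four phoneme lines, with the open /E/ rendered as `a` because the assumed voice lacks it. A real substitution table would be derived from the phoneme inventory of the chosen voice.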

  5. Why Prosody Generation? • Increases intelligibility and expressiveness • Provides the context in which speech is interpreted • Signals speaker intentions (special aids) • Man-machine communication (airports, ...) • Dubbing*

  6. Methodology • Based on the punctuation marks (‘,’, ‘.’, ‘?’ and ‘!’), we classify sentences as continuous affirmation, long affirmation, interrogative and exclamation, respectively. • Recording a corpus and analyzing its sentences to produce F0 and intensity curves • Statistical study of the curves and extraction of rules to generate them automatically
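The punctuation-based classification can be sketched as a lookup on the mark that closes each chunk of text. The mapping follows the slide; accepting the Arabic comma '،' and question mark '؟' as well is an assumption, as are the function names `classify` and `split_and_classify`.

```python
import re
from typing import Optional

# Slide 6 mapping from punctuation mark to sentence type. The Arabic comma '،'
# and question mark '؟' are an added assumption: Arabic text often uses them.
SENTENCE_TYPES = {
    ",": "continuous affirmation", "،": "continuous affirmation",
    ".": "long affirmation",
    "?": "interrogative", "؟": "interrogative",
    "!": "exclamation",
}
# A chunk is a run of non-punctuation characters closed by one of the marks.
_CHUNK = re.compile(r"[^,.?!،؟]+[,.?!،؟]")


def classify(sentence: str) -> Optional[str]:
    """Return the sentence type implied by the final punctuation mark, if any."""
    stripped = sentence.rstrip()
    return SENTENCE_TYPES.get(stripped[-1]) if stripped else None


def split_and_classify(text: str):
    """Split text at the punctuation marks and label each chunk with its type."""
    return [(m.group().strip(), SENTENCE_TYPES[m.group()[-1]])
            for m in _CHUNK.finditer(text)]
```

For example, `split_and_classify("I came, I saw. Did you?")` yields a continuous affirmation, a long affirmation and an interrogative. Trailing text without a closing mark is dropped, and decimal points would be treated as sentence ends, which a production front end would need to handle.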

  7. The Corpus • Use of a pre-recorded corpus of 12 short sentences for each type, spoken by 5 speakers (4 male, 1 female); each sentence has at most 14 phonemes. • Recording of 10 additional sentences of variable length, pronounced by 3 speakers: • short: 4-20 phonemes • medium: 20-40 phonemes • long: more than 40 phonemes • F0 and intensity curves were already available for the pre-recorded corpus and were computed for the new recordings.
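To illustrate what "computing the F0 and intensity curves" involves, here is a plain autocorrelation pitch tracker with a frame-wise RMS intensity in dB. It is a simplified stand-in, not Praat's algorithm (which the project used); the frame length, hop, F0 search range and voicing threshold are assumed values.

```python
import math


def frames(signal, size, hop):
    """Yield successive analysis frames of `size` samples, `hop` samples apart."""
    for start in range(0, len(signal) - size + 1, hop):
        yield signal[start:start + size]


def intensity_db(frame, ref=1.0):
    """RMS level of a frame in dB relative to `ref` (floored to avoid log(0))."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)


def f0_autocorr(frame, sr, fmin=75.0, fmax=500.0, voicing=0.3):
    """F0 (Hz) from the highest normalized autocorrelation peak, or None if unvoiced."""
    avg = sum(frame) / len(frame)
    x = [s - avg for s in frame]
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return None
    lo, hi = int(sr / fmax), min(int(sr / fmin), len(x) - 1)
    best_lag, best_r = None, voicing  # a peak must exceed the voicing threshold
    for lag in range(lo, hi + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return sr / best_lag if best_lag else None


def contours(signal, sr, frame_ms=40, hop_ms=10):
    """F0 and intensity curves, one value per hop; F0 is None in unvoiced frames."""
    size, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    f0, level = [], []
    for fr in frames(signal, size, hop):
        f0.append(f0_autocorr(fr, sr))
        level.append(intensity_db(fr))
    return f0, level
```

On a 200 Hz sine sampled at 8 kHz, every frame reports 200 Hz; on silence, every frame is marked unvoiced.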

  8. Rules Extraction • Redefinition of the length concept using fuzzy sets: [figure]
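A fuzzy version of the sentence-length classes might look like the following. The membership shapes and breakpoints of the original figure are not available, so the trapezoids and their breakpoints here are hypothetical, chosen only to soften the crisp 20- and 40-phoneme boundaries of the corpus classes.

```python
import math

# HYPOTHETICAL breakpoints (in phonemes) for trapezoidal fuzzy sets (a, b, c, d).
LENGTH_SETS = {
    "short": (-math.inf, -math.inf, 15, 25),
    "medium": (15, 25, 35, 45),
    "long": (35, 45, math.inf, math.inf),
}


def trapezoid(x, a, b, c, d):
    """Membership that is 0 outside (a, d), 1 on [b, c], and linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def length_memberships(n_phonemes):
    """Degree to which a sentence of `n_phonemes` is short, medium and long."""
    return {name: trapezoid(n_phonemes, *abcd) for name, abcd in LENGTH_SETS.items()}
```

With these breakpoints a 20-phoneme sentence, which sits on the crisp short/medium boundary, is 0.5 short and 0.5 medium, so rules for both classes can be blended instead of switching abruptly.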

  9. Rules Extraction • Curve stylization after stochastic analysis, e.g.: [figure]
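The stylization method behind this slide is not described in the transcript. As one common way to reduce an F0 curve to a few straight-line segments, the sketch below swaps in Ramer-Douglas-Peucker splitting; the tolerance in Hz is an assumption (perceptual work often uses semitones instead).

```python
def stylize(times, f0, tol_hz):
    """Piecewise-linear F0 stylization by Ramer-Douglas-Peucker splitting.

    Returns the kept (time, F0) breakpoints; every input point lies within
    `tol_hz` (vertical distance) of the line joining the breakpoints around it.
    Assumes `times` is strictly increasing and `f0` holds voiced values only.
    """
    if len(f0) < 3:
        return list(zip(times, f0))
    keep = {0, len(f0) - 1}
    stack = [(0, len(f0) - 1)]
    while stack:
        i, j = stack.pop()
        worst, worst_err = None, tol_hz
        for k in range(i + 1, j):
            # F0 predicted at times[k] by the straight line from point i to j.
            line = f0[i] + (f0[j] - f0[i]) * (times[k] - times[i]) / (times[j] - times[i])
            err = abs(f0[k] - line)
            if err > worst_err:
                worst, worst_err = k, err
        if worst is not None:  # split at the worst point and process both halves
            keep.add(worst)
            stack += [(i, worst), (worst, j)]
    return [(times[k], f0[k]) for k in sorted(keep)]
```

A rising straight contour collapses to its two end points, while a rise-fall contour keeps its peak as a third breakpoint.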

  10. Emotion Inclusion • Recording a corpus of sentences in 5 emotions (joy, anger, sadness, fear and surprise) together with their emotionless versions (20 sentences per emotion) • Measurement of the prosodic features F0, duration and intensity, and of their variations, using Praat • Extraction of rules to automatically produce emotion in synthetic speech • Rule validation
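Rules such as "F0 mean: +40%-75%" compare each emotional sentence with its emotionless version. A minimal sketch of that comparison, assuming per-utterance F0 contours (None for unvoiced frames) and durations in seconds are already measured; the feature names and the min/max summary are illustrative choices.

```python
from statistics import mean


def utterance_features(f0, duration_s):
    """Mean F0, F0 range (Hz) and duration of one utterance; None F0 = unvoiced."""
    voiced = [v for v in f0 if v is not None]
    return {"f0_mean": mean(voiced), "f0_range": max(voiced) - min(voiced),
            "duration": duration_s}


def percent_change(neutral, emotional):
    """Relative change in % from the emotionless to the emotional value."""
    return 100.0 * (emotional - neutral) / neutral


def rule_interval(pairs, feature):
    """Smallest and largest % change of `feature` over (neutral, emotional) pairs."""
    changes = [percent_change(n[feature], e[feature]) for n, e in pairs]
    return min(changes), max(changes)
```

Collecting `rule_interval` over all 20 sentence pairs of an emotion gives an interval of the same kind as the ones reported for anger on the next slides.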

  11. أَهُوَ ذَنْبِي أَنْ أَتَحَمَّلَ أَنَا ذلك؟ (“Is it my fault that I have to bear this?”) • Jitter: irregularity between successive glottal pulses • Range: difference between F0max and F0min • F0 average: mean value • Pitch: variation of F0 • Variability: degree of that variation (high, low, ...) • Contour slope: shape of the contour slope (range variation)
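The measures defined on this slide can be computed roughly as follows. Jitter uses the usual "local jitter" definition (mean absolute difference between consecutive glottal periods divided by the mean period, as in Praat's jitter (local)) but without Praat's period filtering. Reading "variability" as the population standard deviation of F0 and "contour slope" as a least-squares line fit are assumptions.

```python
from statistics import mean, pstdev


def local_jitter(periods):
    """Mean absolute difference of consecutive glottal periods over the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return mean(diffs) / mean(periods)


def f0_measures(times, f0):
    """Average, range, variability and least-squares slope (Hz per time unit)."""
    avg, t_avg = mean(f0), mean(times)
    slope = (sum((t - t_avg) * (y - avg) for t, y in zip(times, f0))
             / sum((t - t_avg) ** 2 for t in times))
    return {"average": avg, "range": max(f0) - min(f0),
            "variability": pstdev(f0), "slope": slope}
```

For voiced F0 samples 100, 110, 120, 130 Hz taken 1 s apart, this gives an average of 115 Hz, a range of 30 Hz and a slope of 10 Hz/s.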

  12. Example: Anger emotion • F0 mean: +40% to +75% • F0 range: +50% to +100% • F0 at vowels and semi-vowels: +30% • F0 slope: + • Speech rate: + • Silence rate: - • Duration of vowels and semi-vowels: + • Intensity mean: + • Intensity varies monotonically with F0 • Others: F0 variability: +, F0 jitter: +

  13. Analysis & Rule Extraction: Anger [figure: emotionless vs. with emotion]

  14. Emotion Synthesis: Anger • F0 mean: + 30% • F0 range: + 30% • F0 at vowels and semi-vowels: +100% • Speech rate: +75%-80% • Duration of vowels and semi-vowels: +30% • Duration of fricatives: +20%
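A hedged sketch of applying these anger rules to a segment list. The `Seg` representation, the class labels and, above all, how the rules combine are assumptions: F0 excursions around the utterance mean are scaled by the range gain and the mean is raised by the mean gain; the speech-rate gain shortens every segment by dividing its duration by (1 + gain); the vowel/semi-vowel and fricative gains are then applied on top. The slide does not say whether the +100% vowel F0 is relative to the neutral or to the already-raised contour.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Seg:
    phoneme: str
    kind: str          # hypothetical labels: "vowel", "semivowel", "fricative", "other"
    duration_ms: float
    pitch: tuple = ()  # ((position %, F0 Hz), ...)


def apply_rules(segs, f0_mean_gain=0.30, f0_range_gain=0.30, vowel_f0_gain=1.00,
                rate_gain=0.75, vowel_dur_gain=0.30, fricative_dur_gain=0.20):
    """Return new segments with slide-14 style gains applied (combination assumed)."""
    f0s = [f for s in segs for _, f in s.pitch]
    m = sum(f0s) / len(f0s) if f0s else 0.0
    new_m = m * (1 + f0_mean_gain)
    out = []
    for s in segs:
        # Stretch F0 excursions around the utterance mean, then move the mean.
        pitch = tuple((p, new_m + (f - m) * (1 + f0_range_gain)) for p, f in s.pitch)
        dur = s.duration_ms / (1 + rate_gain)  # faster speech: shorter segments
        if s.kind in ("vowel", "semivowel"):
            pitch = tuple((p, f * (1 + vowel_f0_gain)) for p, f in pitch)
            dur *= 1 + vowel_dur_gain
        elif s.kind == "fricative":
            dur *= 1 + fricative_dur_gain
        out.append(replace(s, duration_ms=dur, pitch=pitch))
    return out
```

The defaults take the lower end (75%) of the speech-rate range. Setting `vowel_f0_gain=0` isolates the mean and range rules; because both are +30%, together they reduce to multiplying every F0 target by 1.3.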

  15. Synthetic Examples (emotionless vs. with emotion) • Anger: “Who do you think you are?” • Joy: “No more clouds in the sky” • Sadness: “I’m so sad today” • Fear: “What a scary scene!” • Surprise: “What a beautiful scene!”

  16. EmoGen Interface [diagram] • Text Editor • Voice Input • Text • Speech and emotion properties • MBROLA Player interface • Normal Text to MBROLA Text Converter (NTMTC) • Prosody Generator • Emotion Generator

  17. Results • Five sentences for each emotion were synthesized and played to 10 listeners. • Each listener named the perceived emotion for each sentence (we did not provide our list of emotions).
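With 5 emotions × 5 sentences × 10 listeners, this test yields 250 free-form judgments. A minimal sketch of scoring them as a confusion table and per-emotion recognition rates follows; since listeners were not given a list, their answers must be normalized first, and the `SYNONYMS` map below is a hypothetical example of that step.

```python
from collections import Counter, defaultdict

# HYPOTHETICAL synonym map: free-form answers are mapped to the emotion names.
SYNONYMS = {"angry": "anger", "happy": "joy", "sad": "sadness",
            "afraid": "fear", "scared": "fear", "surprised": "surprise"}


def normalize(label):
    label = label.strip().lower()
    return SYNONYMS.get(label, label)


def confusion(responses):
    """Count perceived labels per intended emotion from (intended, perceived) pairs."""
    table = defaultdict(Counter)
    for intended, perceived in responses:
        table[normalize(intended)][normalize(perceived)] += 1
    return table


def recognition_rates(responses):
    """Share of responses, per intended emotion, that named that same emotion."""
    return {emo: counts[emo] / sum(counts.values())
            for emo, counts in confusion(responses).items()}
```

Keeping the full confusion table, not only the rates, shows which emotions are mistaken for one another (for example fear heard as surprise).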

  18. Results

  19. Conclusion • An automated tool for emotional Arabic speech synthesis has been developed. • The prosodic model proposed and tested in this work proved successful, especially in conversational contexts. • Further work will include other emotions: disgust, annoyance, …
