This project aims to model and synthesize emotional speech, focusing on intonation contours and their interaction with emotional states. The presentation reviews prior work on intonation modelling, training and synthesis, and adaptation and perceptual tests; it then covers alternative intonation labels, controlling more parameters, and the collection of emotional speech data, closing with future directions for emotional speech synthesis.
Emotional Speech Modelling and Synthesis
Zeynep Inanoglu
Machine Intelligence Laboratory, CU Engineering Department
Supervisor: Prof. Steve Young
Agenda
• Project Motivation
• Review of Work on Intonation Modelling
  • Intonation Models and Training
  • Intonation Synthesis from HMMs
  • Intonation Adaptation and Perceptual Tests
• Alternative Intonation Labels
  • Prosodizer Labels
  • Lexical Stress Labels
• Controlling More Parameters
  • Pitch Synchronous Harmonic Model
  • Transplantation of pitch, duration and voice quality
• Emotional Speech Data Collection
• Summary and Future Direction
Review: Project Motivation
• To synthesize or resynthesize speech with the desired emotional expressivity.
• Initial focus on pitch modelling.
• Intonation (F0) contours have two distinct functions:
  • Convey the prominence structure and sentence modality.
  • Convey signals about the speaker's emotional state.
• The interaction between these two functions is largely unexplored (Banziger & Scherer, 2005).
• Goal:
  • Choose the building blocks of intonation.
  • Model them statistically.
  • Adapt the models to different emotions.
  • Generate intonation contours from the models.
Review: Intonation Modelling
• Basic models
  • Seven basic models: A (accent), U (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL
  • 3-state, single-mixture, left-to-right HMMs
  • Data: Boston Radio Corpus (48 minutes of speech, female speaker)
  • Features: mean-normalized raw F0 and energy values as well as their differentials (a sketch follows below)
• Context-sensitive models
  • Tri-unit models (U+A-RB)
  • Full-context models (U+A-RB::vowel_pos=2::num_a=1::…)
  • Decision-tree-based parameter tying was performed for the context-sensitive models.
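A minimal sketch of the feature extraction implied above, assuming utterance-level mean normalization and a simple gradient for the differentials (both assumptions; the project's exact framing is not shown on the slide):

```python
import numpy as np

def intonation_features(f0, energy):
    """Build per-frame observation vectors for the intonation HMMs:
    [mean-normalized f0, mean-normalized energy, and their differentials].

    f0 and energy are 1-D per-frame tracks; unvoiced frames are assumed
    to have been interpolated beforehand."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    f0_norm = f0 - f0.mean()           # remove the utterance-level mean
    en_norm = energy - energy.mean()
    d_f0 = np.gradient(f0_norm)        # simple differential approximation
    d_en = np.gradient(en_norm)
    return np.column_stack([f0_norm, en_norm, d_f0, d_en])
```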
Review: Generation from Models
• The goal is to generate an optimal sequence of F0 values directly from the syllable HMMs, given the intonation models.
• Maximizing the output likelihood alone results in a piecewise-constant sequence of state means.
• The cepstral parameter generation algorithm of the HTS system is used for interpolated F0 generation (Tokuda et al., 1995).
• Differential F0 features are used as constraints in contour generation, which results in smoother contours.
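The underlying criterion is the standard HMM parameter generation formulation of Tokuda et al. (1995); the notation below is assumed, as the slide's equation did not survive extraction:

\[
\hat{\mathbf{c}} \;=\; \arg\max_{\mathbf{c}}\; \mathcal{N}\!\left(W\mathbf{c} \mid \boldsymbol{\mu}_q, U_q\right)
\quad\Longleftrightarrow\quad
W^{\top} U_q^{-1} W\, \hat{\mathbf{c}} \;=\; W^{\top} U_q^{-1} \boldsymbol{\mu}_q
\]

where \(\mathbf{c}\) is the static F0 sequence, \(W\) stacks the delta windows so that \(\mathbf{o} = W\mathbf{c}\) contains statics and differentials, and \(\boldsymbol{\mu}_q, U_q\) are the means and covariances along the chosen state sequence \(q\). With \(W = I\) the solution degenerates to the piecewise-constant state means; the delta constraints yield the smoother, interpolated contours.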
Review: Generation From Models
[Figure: generated F0 contours for "I saw him yesterday" under two label sequences: a a u u u u, and a a u a u fb]
Review: Model Adaptation with MLLR
• Models are adapted with Maximum Likelihood Linear Regression (MLLR).
• Adaptation data comes from the Emotional Prosody Corpus, which consists of four-syllable phrases in a variety of emotions.
• Happy and sad speech were chosen for this experiment.
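For reference, the standard MLLR mean transform (textbook formulation; the regression-class detail is an assumption, not stated on the slide):

\[
\hat{\boldsymbol{\mu}} \;=\; A\boldsymbol{\mu} + \mathbf{b} \;=\; W\boldsymbol{\xi}, \qquad \boldsymbol{\xi} = [\,1,\ \boldsymbol{\mu}^{\top}\,]^{\top}
\]

where the transform \(W = [\,\mathbf{b}\ A\,]\) is estimated to maximize the likelihood of the happy or sad adaptation data, and one transform can be shared across a regression class of models when the adaptation set is small.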
Review: Perceptual Tests
• Utterances with sad contours were identified 80% of the time; this was significant (p < 0.01).
• Listeners formed a bimodal distribution in their ability to detect happy utterances: overall, only 46% of the happy intonation was judged happier than neutral. (The ambiguity of the "smiling voice" is well documented in the literature.)
• The happy models worked better on utterances with more accents and rising boundaries: the organization of the labels matters!
Alternative Intonation Labels
• Manual intonation labels are subjective and time-consuming to create.
• Two alternative labelling methods were evaluated:
  • Automatic ToBI labels generated by Prosodizer; the Prosodizer-generated labels are converted to the seven basic units (see the sketch below).
  • Lexical stress labels: one (primary stress), two (secondary stress), zero (no stress), sil (silence).
• Evaluation on the Boston Radio Corpus female speaker f2b.
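A minimal sketch of one plausible collapse of ToBI labels onto the seven basic units, assuming a per-syllable accent/boundary decision; the actual mapping used in the project is not given on the slide, so this table is an assumption:

```python
def tobi_to_basic(accent, boundary):
    """Map a syllable's ToBI annotation to one of the seven basic units.

    accent: ToBI pitch accent on the syllable (e.g. "H*") or None;
    boundary: phrase-final boundary tone (e.g. "H-H%", "L-L%") or None.
    """
    if accent is None and boundary is None:
        return "U"                      # plain unstressed syllable
    if boundary is None:
        return "A"                      # accented, phrase-internal
    rising = boundary.startswith("H")   # rising vs. falling boundary tone
    if accent is None:
        return "RB" if rising else "FB"
    return "ARB" if rising else "AFB"   # accent + boundary on one syllable
```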
Perceptual Investigation of Other Emotions
[Figure: perceptual results for lexical-stress-based models (units such as sil-one+one, ..one.., ..zero.., zero-one+sil) across boredom, contempt, disgust, interest, cold anger, panic and hot anger]
Controlling More Parameters
• Pitch Synchronous Harmonic Model (PSHM; Hui Ye, 2004)
  • The analysis/synthesis window size equals one pitch period.
  • Each frame is represented as a sum of harmonically related sinusoids (amplitudes and phases), as sketched below.
  • For voiced frames, LSF representations of the vocal tract are acquired.
  • A better framework for manipulating pitch, duration and voice quality.
• Implemented pitch, duration and voice quality transplantation.
• Set up a framework for emotion conversion (prosody, duration and vocal tract).
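A minimal sketch of the frame model, assuming per-harmonic amplitudes and phases have already been estimated; the names are illustrative, not taken from Hui Ye's implementation:

```python
import numpy as np

def synthesize_frame(f0, amps, phases, fs, n_samples):
    """Reconstruct one pitch-synchronous frame as a sum of harmonics.

    f0: fundamental frequency in Hz; amps/phases: amplitude and phase of
    harmonics k = 1..K; n_samples: frame length, roughly one pitch period
    (about fs/f0 samples)."""
    n = np.arange(n_samples)
    frame = np.zeros(n_samples)
    for k, (a, p) in enumerate(zip(amps, phases), start=1):
        frame += a * np.cos(2 * np.pi * k * f0 * n / fs + p)
    return frame
```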
Transplantation with PSHM
• Duration transplantation
• Pitch transplantation per phone; for each phone:
  • Compute the pitch alignment.
  • Recompute the spectral envelope (see the sketch below).
  • Restore the timing.
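A minimal sketch of the envelope-recomputation step for one voiced frame, using the generic sinusoidal-model recipe (keep the amplitude envelope, move the harmonics to the target f0); whether PSHM does exactly this is an assumption:

```python
import numpy as np

def retune_frame(amps_src, f0_src, f0_tgt, fs):
    """Resample per-harmonic amplitudes at a new fundamental.

    amps_src[k-1] is the source amplitude at frequency k*f0_src. The
    log-amplitude envelope is linearly interpolated at the new harmonic
    frequencies k*f0_tgt (edge values are held outside the source range)."""
    freqs_src = f0_src * np.arange(1, len(amps_src) + 1)
    n_harm = int((fs / 2) // f0_tgt)          # harmonics below Nyquist
    freqs_tgt = f0_tgt * np.arange(1, n_harm + 1)
    log_env = np.log(np.maximum(amps_src, 1e-10))
    return np.exp(np.interp(freqs_tgt, freqs_src, log_env))
```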
Transplantation with PSHM
• Vocal tract transplantation (see the sketch below)
  • Frames are aligned by DTW over an MFCC distance.
  • LSF parameters are converted to LPC.
  • The source harmonics are filtered with the target LPC.
  • New sinusoidal amplitudes are computed.
[Figure: neutral vs. happy spectral envelopes for /eh/]
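A minimal sketch of the filtering step, assuming the LSFs have already been converted back to LPC coefficients (e.g. with a standard lsf2poly routine); the parameter names are hypothetical:

```python
import numpy as np

def apply_target_envelope(amps_src, f0, a_src, a_tgt, fs):
    """Swap the source LPC envelope for the target one at the harmonics:
    new_amp_k = amp_k * |H_tgt(w_k)| / |H_src(w_k)|,  w_k = 2*pi*k*f0/fs.

    a_src, a_tgt: LPC coefficient vectors [1, a_1, ..., a_p] for the
    time-aligned source and target frames."""
    k = np.arange(1, len(amps_src) + 1)
    w = 2 * np.pi * k * f0 / fs

    def envelope(a):
        # All-pole magnitude |1/A(e^{jw})| with A(z) = sum_m a[m] z^{-m}
        z = np.exp(-1j * np.outer(w, np.arange(len(a))))
        return 1.0 / np.abs(z @ np.asarray(a, dtype=complex))

    # Divide out the source envelope (leaving the excitation harmonics),
    # then filter with the target envelope.
    return np.asarray(amps_src) * envelope(a_tgt) / envelope(a_src)
```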
Transplantation Results
• Converting voice quality improves perception of the target emotion in all transplantations.
• LSF transplantation is the driving factor for anger, while both LSF and prosody transplantation play an important role for happy and sad.
Emotional Speech Data Collection
• 4 emotions: happy, sad, surprised, angry.
• Two speakers: 1 male and 1 female (Suzanne Park, Matthew Johnson).
• Toshiba TTS Training Corpus.
• Happy & sad: 1250 sentences
  • 900 from the phonetically balanced short sentences
  • 300 long sentences
  • 25 questions & 25 exclamations
• Surprise & anger: 625 sentences
  • 300 phonetically balanced short sentences
  • 300 long sentences
  • 25 questions
• Neutral data collection for the male speaker (1250 sentences).
Emotional Speech Data Collection
• Emotion elicitation by context prompting; for example, for the sentence "I like a party with an atmosphere":
  • Happy: You have just arrived at the best party in town.
  • Sad: You never get invitations to good parties any more.
• Expected recording time, in 6-hour days:
  • 12 days for the female speaker (two and a half weeks non-stop).
  • 15 days for the male speaker (twice a week for two months).
• Post-processing:
  • Phonetic alignment
  • Syllable boundaries
  • Pitch marks
  • Prosodizer labels
  • Text analysis
Future Direction
• Data collection and labelling.
• Experiments with emotion conversion:
  • Prosody conversion based on HMM models.
  • Voice quality conversion.
• Joint modelling of prosody and voice quality in emotional speech.
• Investigation of the voice source and its effects on emotion.
• Integration of speech modification techniques into a TTS framework.
• Comparison of speech modification techniques with unit-selection techniques.