A Statistical Approach To Emotional Prosody Generation. Zeynep Inanoglu, Machine Intelligence Laboratory, CU Engineering Department. Supervisor: Prof. Steve Young
Agenda • Previous Toshiba Update • A Review of Emotional Speech Synthesis • Motivation for Proposed Approach • Proposed Approach: Intonation Generation from Syllable HMMs • Intonation Models and Training • Recognition Performance of Intonation Units • Intonation Synthesis from HMMs • MLLR-based Intonation Adaptation • Perceptual Tests • Summary and Future Direction
Previous Toshiba Update: A Brief Review • Emotion Recognition • Demonstrated work on HMM-based emotion detection in voicemail messages (Emotive Alert) • Reported the set of acoustic features that maximizes classification accuracy for each emotion type, identified using the sequential forward floating selection algorithm • Expressive Speech Synthesis • Demonstrated the importance of prosody in emotional expression through copy-synthesis of emotional prosody onto neutral utterances • Proposed linguistically descriptive intonation units (accents, boundary tones) for prosody modelling
A Review of Emotional Synthesis • The importance of prosody in emotional expression has been confirmed (Banse & Scherer, 1996; Mozziconacci, 1998) • The available prosody rules are mainly defined for global parameters (mean pitch, pitch range, speaking rate, declination) • The interaction of linguistic units and emotion is largely untested (Banziger, 2005) • Strategies for emotional synthesis vary based on the type of synthesizer: • Formant Synthesis allows control over various segmental and prosodic parameters. Emotional prosody rules extracted from the literature are applied by modifying neutral synthesizer parameters. (Cahn, 1990; Burkhardt, 2000; Murray & Arnott, 1995) • Diphone Synthesis allows prosody control by defining target contours and durations based on emotional prosody rules. (Schroeder, 2004; Burkhardt, 2005) • Unit-Selection Synthesis provides minimal parametric flexibility. Attempts at emotional expression involve recording entire unit databases for each emotion and selecting units from the appropriate database at run time. (Iida et al., 2003) • HMM Synthesis allows spectral and prosodic control at the segmental level and provides a statistical framework for modelling emotions. (Tsuzuki et al., 2004)
A Review of Emotional Synthesis [Chart: synthesis approaches positioned by METHOD (rule-based vs. statistical) against GRANULARITY (segmental, intonational (syllable/phrase), global)] • Formant / Diphone Synthesis (rule-based): only as good as the hand-crafted rules; poor to medium baseline quality • Unit-Selection Synthesis (unit replication): very good quality, but not scalable and requires too much effort • HMM Synthesis (statistical, segmental): statistical, but too granular for prosody modelling • Statistical modelling at the intonational (syllable/phrase) level is unexplored
Motivation For Proposed Approach¹ • We propose a generative model of prosody. • We envision evaluating this prosodic model in a variety of synthesis contexts through signal manipulation schemes such as TD-PSOLA. • Statistical • Rule-based systems are only as good as their hand-crafted rules. Why not learn the rules from data? • Success of HMM methods in speech synthesis. • Syllable-based • Pitch movements are most relevant at the syllable or intonational phrase level. However, the effects of emotion on contour shapes and linguistic units are largely unexplored. • Linguistic Units of Intonation • The coupling of emotion and linguistic phenomena has not been investigated. ¹This work will be published in the Proceedings of ACII, October 2005, Beijing.
Overview (current focus is on pitch modelling only) • Step 1: Train syllable-based intonation models (context-sensitive HMMs) on neutral speech data • Step 2: Generate F0 contours from the HMMs, given syllable labels and boundaries (e.g. "1 1.5 c", "1.5 1.9 a", "1.9 2.3 c", "2.3 2.5 rb", …) • Step 3: Adapt the models to emotion HMMs via MLLR and a mean pitch shift, given a small amount of emotion data • Step 4: Transplant the synthesized contour onto an utterance with TD-PSOLA, using the phonetic labels
Intonation Models and Training • Basic Models • Seven basic models: A (accent), C (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL • Context-Sensitive Models • Tri-unit models (preceding and following intonation unit) • Full-context models (position of syllable in intonational phrase, forward counts of accents and boundary tones in the IP, position of vowel in syllable, number of phones in the syllable) • Decision tree-based parameter tying was performed for the context-sensitive models. A sketch of the full-context encoding follows below. • Data: Boston Radio Corpus. • Features: Normalized raw F0 and energy values as well as their differentials.
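To make the context encoding concrete, here is a small illustrative sketch of how such full-context labels might be assembled. The label layout, field names, and the make_label helper are hypothetical, loosely in the spirit of HTS-style labels, and not the exact scheme used in this work.

```python
# Hypothetical full-context label builder for syllable intonation units.
# The field layout (prev-unit, unit, next-unit, plus context features) is
# an assumption modelled on HTS-style labels, not the original format.

def make_label(prev_unit, unit, next_unit,
               pos_in_phrase, accents_ahead, boundaries_ahead,
               vowel_pos, n_phones):
    """Encode one syllable's intonation unit together with its context."""
    return (f"{prev_unit}-{unit}+{next_unit}"
            f"@pos:{pos_in_phrase}"          # position of syllable in the IP
            f"@acc_fwd:{accents_ahead}"      # accents remaining in the IP
            f"@bnd_fwd:{boundaries_ahead}"   # boundary tones remaining in the IP
            f"@vpos:{vowel_pos}"             # position of vowel in the syllable
            f"@nphones:{n_phones}")          # number of phones in the syllable

# Example: an accent preceded by an unstressed syllable and followed by a
# rising boundary, as the second syllable of its intonational phrase.
print(make_label("c", "a", "rb", 2, 1, 1, 2, 3))
# -> c-a+rb@pos:2@acc_fwd:1@bnd_fwd:1@vpos:2@nphones:3
```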
Recognition Results • Evaluation of the models was performed in a recognition framework to assess how well they represent intonation units and to quantify the benefits of incorporating context. • A held-out test set was used for predicting intonation sequences. • Basic models were tested with varying numbers of mixture components, and the resulting accuracy was compared with that of the full-context models.
Intonation Synthesis From HMM • The goal is to generate an optimal sequence of observations O directly from the syllable HMMs given the intonation models λ: O_max = argmax_O P(O | λ) • The optimal state sequence Q_max is predetermined by basic duration models, so the parameter generation problem becomes O_max = argmax_O P(O | Q_max, λ) • Without further constraints, the solution is simply the sequence of state mean vectors for the state sequence Q_max, which yields a discontinuous, piecewise-constant contour • We used the cepstral parameter generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995) • Differential F0 features (Δf and ΔΔf) are used as constraints in contour generation; maximization is done over the static parameters only
Intonation Synthesis From HMM • A single observation vector consists of static and dynamic features: o_t = [f_t, Δf_t, ΔΔf_t]ᵀ • The dynamic features are computed from the static features as, e.g., Δf_t = (f_{t+1} − f_{t−1})/2 and ΔΔf_t = f_{t+1} − 2f_t + f_{t−1} • This relationship can be expressed in matrix form as O = WF, where O is the sequence of full feature vectors, F is the sequence of static features only, and W is the matrix form of the window functions. The maximization problem then becomes F_max = argmax_F P(WF | Q_max, λ) • Setting the derivative to zero yields a set of linear equations, WᵀU⁻¹WF = WᵀU⁻¹M (M: stacked state means, U: covariances), that can be solved in a time-recursive manner (Tokuda et al., 1995)
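As an illustration of this generation step, the sketch below solves the linear system above for a toy F0 trajectory, assuming simple first- and second-difference windows and diagonal covariances; the window definitions, dimensions, and numbers are illustrative assumptions, not the exact HTS configuration.

```python
import numpy as np

def build_window_matrix(T):
    """Stack static, delta and delta-delta windows into W (3T x T).
    Row blocks pick out f_t, delta f_t and delta-delta f_t respectively.
    Window shapes are illustrative assumptions."""
    I = np.eye(T)
    D1 = np.zeros((T, T))   # delta:       0.5 * (f[t+1] - f[t-1])
    D2 = np.zeros((T, T))   # delta-delta: f[t+1] - 2 f[t] + f[t-1]
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, hi] += 0.5
        D1[t, lo] -= 0.5
        D2[t, hi] += 1.0
        D2[t, lo] += 1.0
        D2[t, t] -= 2.0
    return np.vstack([I, D1, D2])

def generate_f0(means, variances):
    """Solve W' U^-1 W F = W' U^-1 M for the static F0 trajectory.
    means, variances: (3T,) stacked [static, delta, delta-delta]
    statistics taken from the state sequence Q_max."""
    T = means.shape[0] // 3
    W = build_window_matrix(T)
    U_inv = np.diag(1.0 / variances)
    A = W.T @ U_inv @ W
    b = W.T @ U_inv @ means
    return np.linalg.solve(A, b)

# Toy example: 5 frames whose static means step up mid-utterance.
M = np.concatenate([[100, 100, 120, 120, 120],   # static means (Hz)
                    np.zeros(5), np.zeros(5)])   # delta / delta-delta means
U = np.concatenate([np.full(5, 25.0),            # static variances
                    np.full(5, 1.0), np.full(5, 1.0)])
print(generate_f0(M, U))  # smooth contour instead of a hard step
```

The delta constraints pull neighbouring frames toward each other, which is why the generated contour interpolates smoothly through the state means rather than jumping between them.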
Intonation Synthesis From HMM [Diagram, repeated from the Overview: neutral speech data trains context-sensitive HMMs; MLLR with a mean pitch shift adapts them to emotion HMMs using emotion data; F0 generation takes syllable labels and boundaries ("1 1.5 c", "1.5 1.9 a", "1.9 2.3 c", "2.3 2.5 rb", …) together with the phonetic labels and produces the synthesized contour.]
Perceptual Effects of Intonation Units [Figure: contours generated from different syllable label sequences, e.g. "a a c c c c a a c a c fb", illustrating the perceptual effect of the chosen intonation units.]
Pitch Contour Samples [Figure: generated neutral contours transplanted onto unseen utterances; original vs. synthesized contours shown for both tri-unit and full-context models.]
MLLR Adaptation to Emotional Speech • Maximum Likelihood Linear Regression (MLLR) adaptation computes a set of linear transformations for the mean and variance parameters of a continuous HMM. • The number of transforms is determined by a regression tree together with a threshold for what counts as "enough" adaptation data; a node without enough data reuses the transformation from its parent node. [Figure: regression tree with nodes 1 to 7.] • Adaptation data came from the Emotional Prosody Corpus, which consists of four-syllable phrases spoken in a variety of emotions. • Happy and sad speech were chosen for this experiment.
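For concreteness, a minimal sketch of an MLLR mean update follows; the single global transform and the toy numbers are assumptions for illustration (in practice the regression tree assigns one transform per node with enough data).

```python
import numpy as np

def mllr_mean_update(mu, A, b):
    """Apply an MLLR mean transform: mu_hat = A @ mu + b.
    Equivalently written mu_hat = W @ xi with W = [b A], xi = [1, mu]."""
    return A @ mu + b

# Toy example: a 3-dimensional feature mean [f, delta f, delta-delta f].
mu_neutral = np.array([120.0, 0.0, 0.0])
A = np.array([[1.2, 0.0, 0.0],      # widen pitch excursions (illustrative)
              [0.0, 1.1, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([15.0, 0.0, 0.0])      # shift mean pitch upward (illustrative)
print(mllr_mean_update(mu_neutral, A, b))  # adapted mean vector
```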
MLLR Adaptation To Happy & Sad Data [Figure: contours for the label sequence "c c c c c arb" generated from models adapted to happy and sad data.]
Perceptual Tests Test 1: How natural are neutral contours? • Ten listeners were asked to rate utterances in terms of naturalness of intonation. • Some utterances were unmodified and others had synthetic contours. • A t-test at the p < 0.05 level showed no significant difference between the ratings of the two hidden groups, i.e. the synthetic contours were rated as natural as the unmodified ones.
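This comparison amounts to a two-sample t-test on the two hidden groups of ratings; the sketch below shows the computation with made-up ratings, since the actual scores are not reproduced here.

```python
from scipy import stats

# Hypothetical 5-point naturalness ratings; illustrative only, not the
# actual experimental data.
unmodified = [4, 5, 4, 4, 3, 5, 4, 4, 3, 4]
synthetic  = [4, 4, 3, 5, 4, 4, 3, 4, 4, 3]

t, p = stats.ttest_ind(unmodified, synthetic)
print(f"t = {t:.2f}, p = {p:.3f}")  # p > 0.05 -> no significant difference
```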
Perceptual Tests Test 2: Does adaptation work? • The goal is to find out if adapted models produce contours that people perceive to be more emotional than the neutral contours. • Given pairs of utterances, 14 listeners were asked to identify the happier/sadder one.
Perceptual Tests • Utterances with sad contours were identified 80% of the time, a significant result (p < 0.01). • Listeners formed a bimodal distribution in their ability to detect happy utterances: overall, only 46% of the happy contours were identified as happier than neutral. (The difficulty of conveying "smiling voice" through prosody alone is well documented in the literature.) • Happy models worked better on utterances with more accents and rising boundaries, so the organization of the labels matters!
Summary and Future Direction • A statistical approach to prosody generation was proposed, with an initial focus on F0 contours. • The results of the perceptual tests were encouraging and yielded guidelines for future work: • Bypass the use of perceptual labels: use lexical stress information as a prior in automatic labelling of corpora. • Investigate the role of emotion on accent frequency to build a "language model" of emotion. • Duration Modelling: evaluate the HSMM framework as well as duration adaptation using vowel-specific conversion functions. • Voice Source Modelling: treat LF parameters as part of prosody. • Investigate the use of graphical models to allow hierarchical constraints on generated parameters. • Incorporate the framework into one or more TTS systems.