Data-driven Parameter Generation for Emotional Speech Synthesis

Zeynep Inanoglu & Steve Young, Machine Intelligence Lab, CUED, March 12th, 2008

Project Goals • Rapid generation of expressive speech using LIMITED training data. • A modular data-driven emotion conversion framework. • Focus on advanced prosody conversion techniques. • Two alternative techniques for F0 conversion • A duration conversion module • Incorporation of basic linguistic context in prosody conversion • Case study with three emotions: surprise, sadness and anger • 272 utterances for training and 28 for testing • Evaluation of individual modules and the combined emotion conversion system

Experimental Setup [Block diagram. Labelled components: pitch-synchronous LPC analysis, GMM-based spectral conversion and OLA synthesis, giving a converted waveform; neutral phone durations and linguistic context (phone, syllable, word) feeding duration conversion, which yields a duration tier applied with TD-PSOLA to give a converted waveform; extraction of new syllable durations; linguistic context (syllable, word) and the neutral F0 contour feeding HMM-based F0 generation and F0 segment selection, whose F0 contour is applied with TD-PSOLA to give the final waveform.]

GMM-based Spectral Conversion • Complements the prosody conversion modules. • A popular method for voice conversion (Stylianou et al., 1998). • Pitch-synchronous LPC analysis (order = 30) and OLA synthesis. • A linear transformation F is learned from the parallel data and applied to each speech frame. • Figure: long-term average spectra of a vowel in the test data, /ae/.
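For reference, the conversion function in the cited GMM approach (Stylianou et al., 1998) is commonly written as a posterior-weighted sum of per-component linear transforms. The sketch below uses that general form. The symbols (m mixture components with weights \alpha_i, means \mu_i and covariances \Sigma_i fitted to the neutral frames x, and learned parameters \nu_i, \Gamma_i) are notation assumed here and do not appear on the slides.

```latex
F(x) = \sum_{i=1}^{m} P(C_i \mid x)\,\bigl[\nu_i + \Gamma_i \Sigma_i^{-1} (x - \mu_i)\bigr],
\qquad
P(C_i \mid x) = \frac{\alpha_i\,\mathcal{N}(x;\mu_i,\Sigma_i)}
                     {\sum_{j=1}^{m} \alpha_j\,\mathcal{N}(x;\mu_j,\Sigma_j)}
```

In this form, \nu_i and \Gamma_i are the quantities estimated from the parallel neutral and emotional frames, typically by least squares.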

F0 Generation from Syllable HMMs • Syllable modelling from scratch based on limited training data. • Linguistic units for F0 modelling • Model initialization with short labels • spos, wpos, lex • Model training with full-context labels • spos, wpos, lex, pofs, ppofs, onset, coda • Decision-tree based parameter tying was performed based on a log-likelihood threshold.

Syllable MSD-HMMs • Features: F0, ΔF0, ΔΔF0 • Previously used interpolated contours; updated to syllable MSD-HMMs. • Two spaces: voiced / unvoiced • A uv model was introduced to represent all unvoiced regions within syllables. • Full-context label example: spos@7:wpos@3:lex@0+pofs@5:ppofs@8:onset@1:coda@1

F0 Generation from Syllable HMMs • Given a sequence of syllable durations and syllable context for an utterance, generate an F0 contour using the parameter generation algorithm of the HTS framework (Tokuda et al., 2000). • Incorporation of global variance (GV) in parameter generation made only a slight perceptual difference for surprise. [Audio samples: sad, surprised, angry; without GV and with GV]
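As a reminder of what the parameter generation step computes, the standard maximum-likelihood formulation without global variance can be sketched as follows. The notation is assumed here and is not from the slides: c is the static F0 sequence, W the window matrix that appends ΔF0 and ΔΔF0 to each frame, and \mu and U the stacked state means and covariances for the state sequence implied by the given durations. With MSD-HMMs this would apply to the voiced regions only.

```latex
o = W c, \qquad
\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(W c;\ \mu,\ U\right)
\quad\Longrightarrow\quad
W^{\top} U^{-1} W\, \hat{c} = W^{\top} U^{-1} \mu
```

Because the delta constraints couple neighbouring frames, the solution is a smooth contour rather than a piecewise-constant sequence of state means.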

F0 Segment Selection • Unit selection applied to syllable F0 segments. • Parallel neutral and emotional F0 segments and their common context form the units (U). • An input specification sequence I consists of syllable context and the input contour. • Given a sequence of input syllable specifications, the goal is to find the minimum-cost path through the trellis of possible F0 segment combinations using Viterbi search.

F0 Segment Selection Continued • Target cost T is a Manhattan distance consisting of P weighted subcosts. • Two types of target subcosts: • Binary value (0 or 1) indicating context match for a given context feature. • Euclidean distance between the input segment and the neutral segment of unit uj. • Concatenation cost J: • Zero if adjacent syllables are detached, i.e. separated by an unvoiced region. • Non-zero if syllables are attached, i.e. part of one continuous voiced segment. • Distance between the last F0 value of unit s-1 (E) and the first F0 value of unit s (B): $J_{s,s-1} = |B - E|$. • Pruning based on segment durations.
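The search described on these two slides can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The `Unit` and `Spec` classes, the `weights` layout, and the fixed-length resampling of segments are hypothetical; the binary subcost is taken to be 1 on a mismatch; the concatenation cost is measured on the emotional segments; a single weight set is used for brevity, and duration-based pruning is omitted.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    context: Dict[str, str]  # linguistic context shared by the parallel pair
    neutral: List[float]     # neutral F0 segment (Hz), assumed fixed length
    emotional: List[float]   # parallel emotional F0 segment (Hz)


@dataclass
class Spec:
    context: Dict[str, str]  # context of the input syllable
    contour: List[float]     # input (neutral) F0 segment, same fixed length
    attached: bool           # True if voiced-contiguous with previous syllable


def target_cost(spec, unit, weights):
    """Weighted sum of subcosts: context mismatches plus input distance."""
    cost = 0.0
    for feat, w in weights["context"].items():
        # Assumption: the binary subcost is 0 on a match and 1 on a mismatch.
        cost += w * (0.0 if spec.context.get(feat) == unit.context.get(feat) else 1.0)
    # Euclidean distance between input segment and the unit's neutral segment.
    dist = sum((a - b) ** 2 for a, b in zip(spec.contour, unit.neutral)) ** 0.5
    return cost + weights["input"] * dist


def concat_cost(prev, cur, attached):
    """|B - E| for attached syllables, zero for detached ones."""
    if not attached:
        return 0.0
    return abs(cur.emotional[0] - prev.emotional[-1])


def select_segments(specs, units, weights):
    """Viterbi search; returns one unit index per input syllable."""
    n = len(units)
    best = [target_cost(specs[0], u, weights) for u in units]
    back = []
    for s in range(1, len(specs)):
        new_best, pointers = [], []
        for cur in units:
            t = target_cost(specs[s], cur, weights)
            costs = [
                best[j] + weights["concat"] * concat_cost(units[j], cur, specs[s].attached)
                for j in range(n)
            ]
            j_min = min(range(n), key=costs.__getitem__)
            new_best.append(costs[j_min] + t)
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final unit.
    k = min(range(n), key=best.__getitem__)
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path
```

The converted contour would then be assembled from the `emotional` segments of the selected units. The slides use separate weight sets for attached and detached syllables, which this sketch does not model.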

Weights Used in Segment Selection • P weights, wT, corresponding to the P target subcosts, are estimated automatically for detached syllables (P = 8). • A separate set of weights, wTJ, is estimated for attached syllables, where the cost of attaching unit uk to a preceding unit uj combines the target costs of both units with the concatenation cost of joining them.

Conversion of F0 & Spectra

Duration Conversion • A regression tree was built for each broad class and emotion. • Broad classes: vowels, nasals, glides and fricatives. • Relative trees, which predict scaling factors, outperform absolute trees. • Cross-validation of training data to find the best pruning level. • Seven feature groups (FG) were investigated as predictors in the tree. • The FG resulting in the smallest RMS error on test data was used for each emotion and broad class. • Optimal RMS errors are 25-35 ms, which is not very small. • The RMSE for surprise improves with higher-level linguistic features.
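The idea of predicting a scaling factor rather than an absolute duration can be illustrated with a deliberately tiny stand-in for a regression tree: a depth-1 split on one categorical feature, with the mean scaling factor stored at each leaf. The feature name and the sample dictionary keys are hypothetical, and the real system uses pruned regression trees over richer feature groups.

```python
from collections import defaultdict
from statistics import mean


def fit_relative_stump(samples, feature):
    """Depth-1 'tree': mean scaling factor (emotional / neutral) per feature value."""
    leaves = defaultdict(list)
    for s in samples:
        leaves[s[feature]].append(s["emotional_ms"] / s["neutral_ms"])
    return {value: mean(factors) for value, factors in leaves.items()}


def convert_duration(neutral_ms, context, stump, feature):
    """Scale the neutral duration by the factor predicted for this context."""
    # Assumption: unseen feature values leave the duration unchanged (factor 1.0).
    return neutral_ms * stump.get(context[feature], 1.0)
```

A relative target keeps the prediction tied to the neutral duration of the phone being converted, which is one plausible reason relative trees did better than absolute ones.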

Duration Conversion & F0 Segment Selection Duration Tier

Perceptual Listening Test – Part 1 • Evaluation of spectral conversion through preference tests. • Which one sounds angrier/sadder/more surprised? • Spectral conversion applied to only one in each pair. • Segment selection was used to predict F0 contours for both. • Five pairs of stimuli for each emotion presented to 20 subjects. • Carrier utterances changed after 10 subjects for variety. • Spectral conversion contributes to the perception of anger and sadness • Spectral conversion did not have an effect on the perception of surprise. Due to its less muffled quality, unmodified speech was usually preferred for surprise.

Perceptual Listening Test – Part 2 • Two-way comparison of HMM-based F0 contours with a baseline: • Spectral conversion applied to both stimuli, no duration modification • Five pairs of stimuli for each emotion presented to 30 subjects. • Carrier sentences changed for each group of 10 subjects – 15 unique utterances evaluated.

Perceptual Listening Test – Part 3 • Three-way comparison of HMM-based F0 contours, segment selection and a naive baseline. • Which one sounds angriest/saddest/most surprised? • Spectral conversion applied to all stimuli, no duration modification. • Ten comparisons for each emotion presented to 30 subjects. • In all parts of the test, subjects were asked to identify the emotion for which they had the most difficulty deciding between the options.

Perceptual Listening Test – Part 3

Perceptual Listening Test – Part 4 • Two-way comparison of segment selection with and without duration conversion. • Spectral conversion applied to all stimuli. • Note that converted durations may affect the selected contours due to the pruning criteria. • Ten comparisons for each emotion presented to 30 subjects.

Perceptual Listening Test – Part 5 • Forced-choice emotion classification task on original emotional utterances. • 15 utterances (5 per emotion) were randomly presented to 30 subjects. • A “Can’t decide” option was available in order to avoid forcing subjects to choose an emotion. • Speaker conveys anger and sadness very clearly. • Surprise is frequently confused with anger. • Speaker used tense voice quality similar to anger, which may have misled people despite clear prosody.

Perceptual Listening Test – Part 5 • Forced-choice emotion classification task on converted utterances. • Thirty utterances (10 for each emotion) were randomly presented to 30 subjects. • Spectral conversion and duration conversion applied to all utterances. • Two hidden groups within each emotion (HMM-based contours and segment selection). • Intonation quality question: “Sounds OK” or “Sounds strange”? • Anger “sometimes” doesn’t sound “mean enough”; it sounds “stern”, not “angry”, yet natural. • The surprise element is “not always in the right place”. Sounds awkward more than half the time.

Perceptual Listening Test – Part 5 • Significant improvement in the recognition of anger with segment selection (64.7% with syllable HMMs). • No loss of intonation quality (75.3% “sounds OK” with syllable HMMs, 77.3% with segment selection). • Prosody, not just voice quality, plays an important role in communicating anger. • Remaining 10% is probably lost due to spectral smoothing. • The recognition rate for surprise improves from 60.7% with syllable HMMs to 76.7%, better than the perception of original surprised speech. • Spectral conversion fails to capture the harsh voice quality for surprise, which was misleading the subjects in the original surprised speech (a nice side effect). • Intonation quality for surprise was much higher with segment selection (from 47.3% to 73.3%). • Sadness was captured slightly less consistently with segment selection. Intonation quality was consistent.

Conclusions & Future Research • An emotion conversion system was implemented as a means of parameter generation for expressive speech. • Trained on 15 minutes of parallel data in three emotions. • Incorporated basic linguistic context in conversion. • The subjective evaluation showed that each module helps communicate a particular emotion to varying degrees. The overall system was able to communicate all emotions with considerable success. • Segment selection proved to be a highly successful method for F0 conversion even when very little data is available. • Possible areas for future research: • Modeling of pause and phrasing • Conversion of perceptual intensity • Context-sensitive conversion of spectra • More advanced concatenation costs for segment selection.

Thank you for your support….

Backup Slides: Model Training • 3-state left-to-right MSD-HMM with 3 mixtures • Two mixture components for the voiced space • A single zero-dimensional mixture for the unvoiced space. • Model training ensures the uv model has a very high weight for the zero-dimensional mixture component. • Decision tree-based parameter tying using a log-likelihood criterion. • Separate trees were built for each position in the sentence and each state.

F0 Generation from Syllable HMMs • Objective evaluation is difficult since perceptually correlated measures do not exist. • RMS distance to the true emotional contour is not a reliable source of information. • Speakers have a multitude of strategies for expressing a given emotion. • As a crude measure, it may still help compare general patterns across methods.
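The crude RMS measure mentioned above could be computed along these lines. The sketch assumes the two contours are already time-aligned frame by frame and that unvoiced frames are marked with `None`; both conventions are assumptions for illustration, not details from the slides.

```python
import math


def rms_distance(generated, reference):
    """RMS distance (Hz) over frames that are voiced in both contours."""
    diffs = [
        (g - r) ** 2
        for g, r in zip(generated, reference)
        if g is not None and r is not None
    ]
    if not diffs:
        raise ValueError("no frames voiced in both contours")
    return math.sqrt(sum(diffs) / len(diffs))
```

As the slide notes, a low value does not imply a perceptually better contour, so the number is only useful for comparing broad patterns across methods.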

Weight Estimation for Detached Syllables • P weights, wT, are estimated corresponding to the P target subcosts using held-out data. • Least squares framework with X equations and P unknowns, where X >> P. • We already have the target F0 segment we would like to predict. • Find the N-best and N-worst candidates in the unit database. • Each error E represents the RMS error between the target contour and one of the best or worst units.
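Read as a system of equations, the least squares setup described here can be sketched as follows. The indexing is an assumption for illustration: x runs over the X (held-out syllable, candidate unit) pairs drawn from the N-best and N-worst lists, T_p(x) is the p-th target subcost for that pair, and w_p is the corresponding entry of the weight vector wT.

```latex
E_x \approx \sum_{p=1}^{P} w_p\, T_p(x), \qquad x = 1, \dots, X
\qquad\Longleftrightarrow\qquad
\mathbf{e} \approx \mathbf{T}\,\mathbf{w},
\qquad
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \left\lVert \mathbf{T}\mathbf{w} - \mathbf{e} \right\rVert_2^{2}
```

Here \mathbf{T} is the X-by-P matrix of subcosts and \mathbf{e} the vector of RMS errors. Since X >> P the system is overdetermined, which is why a least squares solution is used.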

Weight Estimation for Attached Syllables • A separate set of weights, wTJ, is estimated for attached syllables. • The cost of attaching unit uk to a preceding unit uj is the sum of the target costs for both units and the concatenation cost of joining them; this cost is set equal to the RMS error for the N-best and N-worst pairs.
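One way to write the attached-syllable equations implied by this description is sketched below. It assumes that the same P target weights apply to both units of a pair and that a single extra weight w_J scales the concatenation cost; neither detail is stated on the slide, and i_{s-1}, i_s denote the input specifications of the two syllables.

```latex
E_x \approx \sum_{p=1}^{P} w_p \bigl[\, T_p(i_{s-1}, u_j) + T_p(i_s, u_k) \,\bigr] + w_J\, J(u_j, u_k),
\qquad x = 1, \dots, X
```

Under these assumptions the unknowns (w_1, ..., w_P, w_J) make up the weight set wTJ, and the system is again solved by least squares over the N-best and N-worst pairs.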

Weights Used in Segment Selection • Lexical stress and position in word are the most important linguistic factors across all emotions. • Position in sentence is most important for surprise. • High weights for previous part of speech for all emotions. • Similarity of the input contour to neutral units is very important for anger, not as important for surprise. • Lexical stress, input cost and concatenation cost are the major contributors to segment selection for attached syllables. • The importance of input cost is once again highest for anger. • Concatenation cost plays a very important role for sadness and surprise.
