This study presents a multidisciplinary approach to building an emotive text-to-speech (TTS) synthesis system using a diphone synthesizer. The goal is to make an audio-visual avatar emotive by incorporating emotional affect into synthetic speech. The presentation outlines the general framework, describes a rule-based system, and concludes with an evaluation and summary of the approach.
A Framework of Emotive Text-to-Speech (TTS) Synthesis Using a Diphone Synthesizer
Hao Tang
Supervisor: Prof. Thomas S. Huang
Presented to the Qualifying Examination Committee
A multidisciplinary study: an emotive audio-visual avatar
(Image: a neutral avatar.) GOAL: make the avatar emotive!
Co-Principal Investigators:
• Professor Thomas Huang, Electrical and Computer Engineering
• Professor Mark Hasegawa-Johnson, Electrical and Computer Engineering
• Professor Charissa Lansing, Speech & Hearing Science
• Professor Jesse Spencer-Smith, Psychology
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
What is the problem to be addressed?
• Building an emotive text-to-speech (TTS) synthesis system on top of a diphone synthesizer
• Key issues:
  • Framework and approach
  • Prosody transformation
Evolution of TTS technologies (timeline, 1962-1997):
• Formant synthesis (Bell Labs; Joint Speech Research Unit; MIT Klatt-Talk; Haskins Lab): poor intelligibility, poor naturalness
• Diphone synthesis (Bell Labs; CNET; Bellcore; Berkeley Speech Technology): good intelligibility, poor naturalness
• Unit-selection synthesis (ATR in Japan; CSTR in Scotland; BT in England; AT&T Labs; L&H in Belgium): excellent intelligibility, good naturalness
Why is emotion important for TTS?
• The unsatisfactory naturalness of current state-of-the-art TTS systems is largely due to their lack of emotional affect
• To improve naturalness, we must simulate emotions in synthetic speech
How do we model emotions in synthetic speech?
• The most intuitive approach would be to model the physiological effects of emotions on the human vocal tract
• Why not do it?
  • Most speech synthesizers do not model the physical movements of the vocal tract
  • Instead, they create synthetic speech by acoustic modeling of speech or by concatenation of speech units
  • Both rely heavily on the adjustment of a number of speech parameters
• Hence, a parameter-adjustment-based approach
  • There have long been debates as to which speech parameters are essential for expressing emotions in speech, but…
Speech parameters
• Prosody
  • Fundamental frequency (f0)
  • Duration
  • Energy
• Voice quality
  • Modal, breathy, whispery, creaky, harsh, etc.
Different speech synthesis techniques provide control over these parameters to different degrees.
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
Formant synthesis
• Based on a source-filter model of the formants produced in the vocal tract
• Creates speech entirely through rules on the acoustic correlates of various speech sounds (rule-based synthesis)
• Cons:
  • Imperfect rules
  • Robot-like sound
• Pros:
  • Small footprint
  • Many degrees of freedom
Examples (audio): Janet Cahn (1989): angry, sad, afraid; Iain Murray (1989): angry, sad, afraid; Felix Burkhardt (2000): angry, sad, afraid
Diphone synthesis
• Creates synthetic speech by concatenating small units of recorded speech
• Diphone: from the middle of one phone to the middle of the next phone
• Recorded at a monotonic pitch
• Prosody forced-matching is performed to achieve the desired intonation and rhythm
• Pros:
  • Small footprint
  • High-quality speech
  • Can control prosody
• Cons:
  • Artifacts (spectral distortion)
  • Difficult to control voice quality
Examples (audio): Marc Schröder (2004): angry, joyful, sad, afraid
Unit-selection synthesis
• Creates synthetic speech by concatenating non-uniform speech segments drawn from a large corpus
• Unit selection is based on minimizing a cost function: a weighted sum of target and link (concatenation) costs
• Maintains the natural prosody preserved in the database units
• Cons:
  • Large footprint
  • Good quality not guaranteed for every utterance
  • Difficult to control prosody and voice quality
• Pros:
  • “Playback” quality is achievable, especially in limited domains
Examples (audio): Akemi Iida (2002): angry, joyful, sad; Gregor Hofer (2004): angry, joyful; Ellen Eide (IBM, 2004): good news, bad news
Quality vs. freedom (chart, 1962-1997):
• Formant synthesis: poor intelligibility, poor naturalness; high degree of freedom for parameter control
• Diphone synthesis: good intelligibility, poor naturalness; medium degree of freedom
• Unit-selection synthesis: good intelligibility, human-quality naturalness (in limited domains); low degree of freedom
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
Why do I use a diphone synthesizer?
• Small footprint
  • Ideal for low-cost implementation on resource-constrained devices
• Can control prosody
  • Generalizes easily to any emotion
• The lack of control over voice quality can be partially remedied
A previous attempt using a diphone synthesizer (Marc Schröder's work)
• Represents emotional states in a 2-dimensional space
• Finds rules that map each point in the space onto its acoustic correlates by analyzing a database of spontaneous emotions
• The acoustic variables are related to global prosody settings:
  • Fundamental frequency
  • Tempo
  • Intensity
• Limitations: loses local effects; loses the interaction with linguistic structure
Functional diagram of TTS using a diphone synthesizer:
Text → NLP [text analysis (lexicons, rules) → prosody prediction (models)] → phonemes + prosody → DSP [prosody matching → spectral smoothing (diphone databases)] → speech
A general framework of emotive TTS synthesis using a diphone synthesizer:
Text → NLP [text analysis → prosody prediction] → phonemes + neutral prosody → emotion filter [prosody transformation → voice quality transformation (rules, models)] → phonemes + emotive prosody → DSP [prosody matching → spectral smoothing (diphone databases)] → speech
The emotion filter is inserted between the NLP and DSP stages, converting neutral prosody into emotive prosody.
A difference approach
• The rules or models learn and predict the parameter differences between an emotional state and the neutral state:
  • p(neutral): neutral prosodic parameters
  • p(emotive): emotive prosodic parameters
  • D(p) = p(emotive) − p(neutral): parameter differences
• At synthesis time, the predicted quantities combine additively:
  • p: predicted neutral prosodic parameters (from the baseline TTS)
  • D(p): predicted parameter differences (from the rules or models)
  • p + D(p): predicted emotive prosodic parameters
(A minimal code sketch of this approach follows.)
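Here is a minimal sketch of the difference approach, assuming prosodic parameters are represented as per-unit numeric arrays; the representation itself is an illustrative assumption, not taken from the original slides.

```python
import numpy as np

def difference_target(p_emotive: np.ndarray, p_neutral: np.ndarray) -> np.ndarray:
    """Training target for the rules/models: D(p) = p(emotive) - p(neutral)."""
    return p_emotive - p_neutral

def emotive_prosody(p_neutral_pred: np.ndarray, d_p_pred: np.ndarray) -> np.ndarray:
    """Synthesis-time combination: predicted emotive prosody = p + D(p)."""
    return p_neutral_pred + d_p_pred
```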
Prosody transformation by rules
• Derive prosody manipulation rules:
  • Record an actor speaking with several emotions
  • Measure the differences in prosodic parameters:
    • Pitch level, range, and variability
    • Pitch contour shape
    • Speaking rate
• Challenges:
  • How to deal with local effects
  • How to use information from the linguistic structure
(A sketch of such rules is given below.)
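As an illustration, here is a minimal sketch of rule-based pitch and speaking-rate transformation applied to a neutral f0 contour. The rule values in the table (e.g., a +30 Hz pitch-level shift for anger) are hypothetical placeholders for this sketch, not the actual rules used in the system.

```python
import numpy as np

# Hypothetical rule table: per-emotion offsets/scales for the f0 contour.
# These numbers are illustrative placeholders, not the system's actual rules.
PITCH_RULES = {
    "angry": {"level_shift_hz": 30.0, "range_scale": 1.4, "rate_scale": 1.15},
    "sad":   {"level_shift_hz": -20.0, "range_scale": 0.7, "rate_scale": 0.85},
}

def transform_prosody(f0_hz: np.ndarray, durations_ms: np.ndarray, emotion: str):
    """Apply pitch-level, pitch-range, and speaking-rate rules to neutral prosody."""
    rule = PITCH_RULES[emotion]
    voiced = f0_hz > 0                       # unvoiced frames carry f0 = 0
    mean_f0 = f0_hz[voiced].mean()           # pitch level of the neutral contour
    f0_out = f0_hz.copy()
    # Expand/compress excursions about the mean (pitch range), then shift the level.
    f0_out[voiced] = mean_f0 + (f0_out[voiced] - mean_f0) * rule["range_scale"]
    f0_out[voiced] += rule["level_shift_hz"]
    # A faster speaking rate means proportionally shorter segment durations.
    dur_out = durations_ms / rule["rate_scale"]
    return f0_out, dur_out
```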
Prosody transformation by statistical models
• Statistical prosody prediction models:
  • F0 model
  • Duration model
  • Energy model
• The models are trained on speech databases
• They capture local effects
• They incorporate linguistic context
(Diagram: feature vector → statistical model → difference in f0, duration, and energy)
F0 model
• Features should represent a syllable's linguistic context:
  • Lexical stress of the current syllable
  • Position of the current syllable in the current word
  • Focus stress of the current word
  • Position of the current word in the current phrase
  • Part of speech (POS) of the current word
  • ...
• Take into account the influence of neighboring syllables:
  • Use a context window
• The features contain plenty of nominal (non-metric) data:
  • We need a tool that can efficiently learn from nominal data
(Diagram: feature vector → F0 model → f0 difference)
CART: classification and regression trees
• “Impurity” of a node N (e.g., the entropy impurity i(N) = −Σ_j P(ω_j) log2 P(ω_j)):
  • i(N) = 0 if the node is “pure”
  • i(N) is large if the classes are uniformly distributed
• A split performs a property test, sending samples to the left child N_L or the right child N_R
• Split criterion: maximize the impurity reduction, Δi(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)
• A leaf node corresponds to a decision:
  • A leaf node's impurity might be positive
  • A leaf node's class label is usually assigned by majority vote
  • In our case, each leaf node stores the mean of the samples in that node (differences of syllabic f0 contours)
(A training sketch follows.)
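To make this concrete, here is a minimal sketch of training a regression tree on nominal linguistic features to predict per-syllable f0 differences. The feature names, the toy data, and the use of scikit-learn are illustrative assumptions; the actual system's feature set and CART implementation may differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor

# Toy training data: one dict of nominal linguistic features per syllable
# (feature names are illustrative, not the system's actual feature set).
syllable_features = [
    {"lex_stress": "primary", "pos": "NN", "syl_pos_in_word": "initial"},
    {"lex_stress": "none",    "pos": "DT", "syl_pos_in_word": "final"},
    {"lex_stress": "primary", "pos": "VB", "syl_pos_in_word": "initial"},
]
# Targets: per-syllable f0 differences (emotive minus neutral), in Hz.
f0_differences = [42.0, -5.0, 35.0]

# One-hot encode the nominal features so the tree can split on them.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(syllable_features)

# A regression tree stores the mean target value of each leaf's samples,
# matching the slide's description of the leaf nodes.
model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=1)
model.fit(X, f0_differences)

# Predict the f0 difference for a new syllable's linguistic context.
x_new = vectorizer.transform([{"lex_stress": "primary", "pos": "NN",
                               "syl_pos_in_word": "final"}])
print(model.predict(x_new))
```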
Duration model
• Features include:
  • Identity of the current phoneme
  • Voicing of the current phoneme
  • Sonority class of the current phoneme
  • Position of the current phoneme in the current syllable
  • Position of the current syllable in the current word
  • Lexical stress of the current syllable
  • Focus stress of the current word
  • Position of the current word in the current phrase
  • Part of speech (POS) of the current word
  • ...
• Form the feature vector using a context window
• Store the mean of the phonemic durations in the leaf nodes
(Diagram: feature vector → duration model → duration difference)
Prosody prediction models (training)
• Model input: text-based features (spanning a context window), obtained by text analysis
• Model output: differences of prosodic parameters, D(p) = p(emotive) − p(neutral), obtained by forced alignment and speech analysis of a neutral speech DB and an emotional speech DB
• A CART model is trained to map the input features to the output differences
Prosody prediction models (prediction)
• Text analysis of the sentence to be synthesized yields the text-based features
• The difference model predicts D(p); the neutral “full” model predicts p
• Emotive prosody: p + D(p)
(A prediction-step sketch follows.)
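A minimal sketch of the prediction step, assuming a trained difference model (such as the regression tree above) and a baseline TTS front end that supplies the neutral prosody p; the function and parameter names are placeholders.

```python
import numpy as np

def predict_emotive_prosody(neutral_f0, feature_vectors, difference_model):
    """Combine the neutral 'full' model output with predicted differences.

    neutral_f0:       per-syllable f0 values p from the baseline TTS (Hz)
    feature_vectors:  one-hot encoded linguistic features, one row per syllable
    difference_model: trained regressor predicting D(p) = p(emotive) - p(neutral)
    """
    d_p = difference_model.predict(feature_vectors)   # predicted D(p)
    return np.asarray(neutral_f0) + d_p               # p + D(p) = emotive prosody
```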
What I have done
• Presented a general framework of emotive TTS synthesis using a diphone synthesizer
• Developed a rule-based emotive TTS system based on the Festival-MBROLA architecture
• Designed and conducted subjective listening experiments to evaluate the expressiveness of the system
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
Recall the general framework of emotive TTS synthesis using a diphone synthesizer: the emotion filter sits between the NLP and DSP stages, transforming the phonemes with neutral prosody into phonemes with emotive prosody.
A rule-based emotive TTS system
• NLP: Festival performs text analysis, producing a hierarchical syntax tree, phonemes, f0 contours, and durations (neutral prosody)
• Emotion filter: rules transform the neutral prosody into emotive prosody (phonemes, f0 contours, durations)
• DSP: MBROLA synthesizes the emotive speech
Pipeline: text → Festival → emotion filter → MBROLA → emotive speech
A rule-based emotive TTS system
• Pitch transformation
  • F0 mean
  • F0 range
  • F0 variability
  • F0 contour shape (phrase and syllable levels)
• Duration transformation
  • Whole phrase (overall speaking rate)
  • Syllables of different stress types
  • Phonemes of different sonority classes
• Voice quality transformation
  • Jitter simulation (see the sketch below)
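Since MBROLA takes per-phoneme pitch targets as input, jitter (cycle-to-cycle f0 perturbation) can be approximated by randomly perturbing those targets. Below is a minimal sketch of this idea; the tuple representation of .pho entries and the 5% perturbation depth are my assumptions, not the system's actual implementation.

```python
import random

def simulate_jitter(pho_entries, depth=0.05, seed=None):
    """Approximate jitter by perturbing MBROLA pitch targets.

    pho_entries: MBROLA .pho data, e.g. ("b", 62, [(0, 120), (80, 125)]),
                 i.e. (phoneme, duration in ms, [(percent position, f0 in Hz), ...]).
    depth:       relative perturbation depth (0.05 = up to +/-5% per target).
    """
    rng = random.Random(seed)
    jittered = []
    for phoneme, dur_ms, pitch_targets in pho_entries:
        new_targets = [(pos, f0 * (1.0 + rng.uniform(-depth, depth)))
                       for pos, f0 in pitch_targets]
        jittered.append((phoneme, dur_ms, new_targets))
    return jittered
```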
A rule-based emotive TTS system
Eight basic emotions: angry, neutral, happy, afraid, bored, joyful, sad, and yawning
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
Subjective listening experiments
• Experiment design
  • 11 subjects
  • 4 experiments, in each of which every subject was presented with:
    • Experiment 1: eight speech files, all synthesized from the same semantically neutral sentence
    • Experiment 2: eight speech files, each synthesized from a semantically meaningful sentence appropriate for the particular emotion
    • Experiment 3: the eight speech files of experiment 1, plus consistent, synchronized facial expressions of an emotive avatar
    • Experiment 4: the eight speech files of experiment 2, plus consistent, synchronized facial expressions of an emotive avatar
  • The subjects were asked to identify the emotion of each speech file as they perceived it
  • The subjects rated each choice with a confidence score (1-5)
Results
(Exp. 1: speech only; Exp. 2: speech + verbal content; Exp. 3: speech + facial expressions; Exp. 4: speech + verbal content + facial expressions. R = recognition rate; S = average confidence score.)
• A certain degree of expressiveness is achieved by the system
• “Happy” and “joyful”, “angry” and “afraid”, and “bored” and “yawning” are often mistaken for each other
• Perception of emotions improves when speech is complemented by other channels
• Facial expressions seem to dominate in experiments 3 and 4
(A sketch of how R and S can be computed follows.)
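For reference, a minimal sketch of how the two reported measures could be computed from the subjects' responses; the response format, and the choice to average confidence over all responses rather than over correct ones only, are my assumptions.

```python
def recognition_rate_and_confidence(responses, true_emotion):
    """Compute R (recognition rate) and S (average confidence score).

    responses: list of (chosen_emotion, confidence 1-5) tuples, one per subject.
    """
    n_correct = sum(1 for emo, _ in responses if emo == true_emotion)
    r = n_correct / len(responses)                            # fraction recognized
    s = sum(conf for _, conf in responses) / len(responses)   # mean confidence
    return r, s
```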
Outline • Introduction • Brief literature review • General framework of emotive TTS synthesis • A rule-based emotive TTS system • Evaluation • Summary
Summary
• Presented a general framework of emotive TTS synthesis using a diphone synthesizer
• Developed a rule-based emotive TTS system based on the Festival-MBROLA architecture, capable of synthesizing eight basic emotions
• Subjective listening experiments indicate that a certain degree of expressiveness is achieved by the system
• Potential improvements can be made by using statistical prosody prediction models instead of prosody manipulation rules, as I have elaborated in the framework
A multidisciplinary study: an emotive audio-visual avatar
(Images: the avatar looking happy and sad.)
Co-Principal Investigators:
• Professor Thomas Huang, Electrical and Computer Engineering
• Professor Mark Hasegawa-Johnson, Electrical and Computer Engineering
• Professor Charissa Lansing, Speech & Hearing Science
• Professor Jesse Spencer-Smith, Psychology