HMM-Based Synthesis of Creaky Voice

Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl

Creaky voice • Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration • Highly irregular with secondary laryngeal excitations

Use of creaky voice • Usually involuntary, but various systematic usages have been reported • For instance, creaky voice has been observed as • phrase boundary marker • turn-yielding mechanism • indication of hesitations • portrayal of social status • cue for communicating attitude and affective states

Synthesis of creaky voice • HMM-based synthesis of creaky voice requires • Algorithm for automatic detection of creaky voice • Accurate f0 estimation and voicing decision • Prediction of creaky voice from context (text input) • Vocoder capable of rendering creaky excitation

This work… • Compares different f0 estimation methods suitable for building creaky voice synthesis • Brings together the previous research into a framework for creaky voice synthesis • Explores the conversion of a normal synthetic voice to a creaky one

What modifications are required in order to construct a creaky voice synthesizer from a conventional HTS system?

[Diagram: conventional HTS system. Training: speech data and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

A) Use a database of creaky voice

[Diagram: HTS system trained on creaky speech. Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

B) Replace f0 estimation method with one suitable for creaky voice

f0 estimation of creaky voice • Creaky voice has low f0 and irregular excitation • Many f0 trackers output spurious values or classify creak as unvoiced • A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice: • GlottHMM • SWIPE (with SPTK 3.6 voicing decision) • RAPT (SPTK 3.6) • SPTK 3.1 cepstrum-based pitch function • STRAIGHT TEMPO

f0 estimation of creaky voice – Evaluation • Methods were mostly used with default settings • Frame length was set to 45 ms whenever possible • Speech data: • 3 databases of read speech for TTS development • American English male BDL • Finnish male MV • Finnish female HS • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
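One way to see why a long analysis frame matters for creaky voice is to count how many pitch periods fit into a frame. The sketch below does only that arithmetic; the f0 values are illustrative assumptions and are not taken from the evaluation, and `periods_per_frame` is a hypothetical helper name.

```python
def periods_per_frame(f0_hz: float, frame_ms: float) -> float:
    """Return how many pitch periods fit into one analysis frame."""
    period_ms = 1000.0 / f0_hz
    return frame_ms / period_ms


# Illustrative f0 values (assumed): a modal male voice and two creak-like lows.
for f0 in (120.0, 50.0, 25.0):
    for frame_ms in (25.0, 45.0):
        n = periods_per_frame(f0, frame_ms)
        print(f"f0 = {f0:5.1f} Hz, frame = {frame_ms:4.1f} ms: {n:.2f} periods")
```

A tracker that relies on seeing at least two periods per frame would, under these assumed values, cover a 50 Hz segment with a 45 ms frame but not with a 25 ms frame.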

f0 estimation of creaky voice – Results • GlottHMM [1] performed best with TTS data • SPTK performed best with conversational speech • For creaky voice TTS development, GlottHMM f0 estimation was chosen • [1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, in IEEE Trans. on Audio, Speech, and Lang. Proc., 2011

What modifications are required in order to construct a creaky voice synthesizer from a conventional HTS system?

C) Detect creaky regions and model creak as a special case

[Diagram: conventional HTS system. Training: speech data and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

[Diagram: HTS system extended for creaky voice. Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; creaky voice detection; extract creaky excitation; average creaky residual; HMM training (spectrum, f0, creaky probability); creaky voice model. Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); speech (creaky).]

Creaky voice detection • Hand-annotation is too laborious → automatic methods • An automatic creaky voice detection method by Kane & Drugman [1,2] • Based on linear prediction (LP) residual features • [1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012 • [2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
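The detection features above are computed from the LP residual. As a minimal sketch of how such a residual can be obtained (and nothing more: it does not implement the detector of [1,2]), the code below estimates LP coefficients with the standard autocorrelation method and Levinson-Durbin recursion, then inverse-filters the signal. The toy signal, the filter order, and all function names are assumptions made for illustration.

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion; returns A(z) coefficients [1, a1, ..., ap]."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a


def inverse_filter(x, a):
    """LP residual e[n] = sum_k a[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


# Toy signal (assumed): an impulse train passed through a fixed 2-pole filter.
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
signal = []
for n, e in enumerate(excitation):
    y1 = signal[n - 1] if n >= 1 else 0.0
    y2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(e + 1.3 * y1 - 0.6 * y2)

coeffs = lpc(signal, order=2)
residual = inverse_filter(signal, coeffs)
```

On this toy signal the residual is close to the original impulse train, which is the property that makes residual-based features useful for locating excitation events.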

[Figure panels: probability of creak; LP residual]

Modeling creaky excitation • Extension of the deterministic plus stochastic model (DSM) [1,2] that properly models creaky voice • [1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012 • [2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, in IEEE Trans. on Audio, Speech and Lang. Proc., 2012.

[Figure labels: deterministic component; envelope of the stochastic component; main excitation; secondary excitation; GCI markers]

[Diagram: HTS system extended for creaky voice. Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; creaky voice detection; extract creaky excitation; average creaky residual; HMM training (spectrum, f0, creaky probability); creaky voice model. Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); speech (creaky).]

Voice building and synthesis • Training: • Standard HTS method with the addition of a 1-dimensional stream of creaky probability • Spectrum: 30th order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs) • Synthesis: • Excitation: DSM vocoder with creaky parts rendered with the creaky excitation • Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
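At synthesis time the generated creaky-probability stream has to be turned into a per-frame choice between normal and creaky excitation. The slides do not state how that decision is made, so the sketch below simply assumes a fixed threshold of 0.5; `select_excitation` and its parameters are hypothetical names used for illustration.

```python
def select_excitation(creaky_prob, voiced, threshold=0.5):
    """Label each frame as 'unvoiced', 'normal' or 'creak'.

    creaky_prob: generated creaky probability per frame (0..1)
    voiced:      generated voicing decision per frame (bool)
    threshold:   assumed decision threshold, not taken from the slides
    """
    labels = []
    for prob, is_voiced in zip(creaky_prob, voiced):
        if not is_voiced:
            labels.append("unvoiced")
        elif prob >= threshold:
            labels.append("creak")    # render with the creaky excitation
        else:
            labels.append("normal")   # render with the regular DSM excitation
    return labels


labels = select_excitation([0.1, 0.7, 0.9, 0.2], [True, True, False, True])
```

Keeping the decision in one small function makes it easy to try other rules, such as smoothing the probability track before thresholding.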

Evaluation • The following systems were compared • Conventional (STRAIGHT f0) • Proposed (GlottHMM f0) • Proposed (GlottHMM f0 and creaky excitation) • Subjective online listening tests • Stimuli: 20 sentences from the held-out data of BDL and MV • 29 test subjects

Evaluation – MOS naturalness • Results indicate that systems 2 and 3 have higher ratings (p<0.001) than system 1 • Difference between systems 2 and 3 is not significant • Conclusions: • Use of GlottHMM f0 improves naturalness • Modeling of creaky excitation has no effect on MOS [Figure: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation]

Evaluation – Creaky rendering • Pairwise comparison of samples • Systems 2 and 3 are preferred over system 1 • System 3 is preferred over system 2 • Conclusions: • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering [Figure: pairwise preferences, including "no preference", for system pairs 1 vs 2, 1 vs 3, and 2 vs 3; 1 = STRAIGHT f0, 2 = GlottHMM f0, 3 = GlottHMM f0 + creaky excitation]

Is it possible to transplant a creaky voice quality to a non-creaky speaker?

Adding creak to a non-creaky speaker • Convert the non-creaky voice of Scottish English male AWB to a creaky one • Transplantation strategy: • Creaky voice is predicted from American English male BDL • Creaky excitation pulse from BDL is used to render creak • f0 is either: • kept as is • substituted with BDL f0 by stream substitution • transformed only in the creaky parts
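The three f0 options can be pictured at the level of generated frame values. The sketch below is an illustration under strong assumptions: both contours are taken to be already aligned frame by frame, stream substitution is mimicked by copying the source contour (in the actual system it happens in the models), and the unspecified "transformation" in creaky parts is stood in for by copying the source value in frames flagged as creaky. The function name, mode names, and numbers are all hypothetical.

```python
def transplant_f0(target_f0, source_f0, creaky, mode):
    """Return an f0 contour for the target voice under one of three options.

    target_f0: f0 per frame of the non-creaky voice
    source_f0: f0 per frame of the creaky voice (assumed aligned)
    creaky:    per-frame flag, True where creak is predicted
    mode:      'keep', 'substitute' or 'creaky_only' (assumed names)
    """
    if not (len(target_f0) == len(source_f0) == len(creaky)):
        raise ValueError("all inputs must have the same number of frames")
    if mode == "keep":
        return list(target_f0)
    if mode == "substitute":
        return list(source_f0)
    if mode == "creaky_only":
        # Stand-in for the transformation: use the source value in creaky frames.
        return [s if c else t for t, s, c in zip(target_f0, source_f0, creaky)]
    raise ValueError(f"unknown mode: {mode}")


# Made-up contours in Hz, for illustration only.
target = [110.0, 108.0, 105.0, 100.0]
source = [95.0, 90.0, 45.0, 40.0]
flags = [False, False, True, True]
mixed = transplant_f0(target, source, flags, "creaky_only")
```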

Evaluation • Four different voices were built: • AWB (baseline) • AWB with BDL creaky excitation • AWB with BDL creaky excitation and BDL f0 • AWB with BDL creaky excitation and f0 transformation

Evaluation • Subjective online listening tests • 14 test subjects • 28 synthesized stimuli • Samples were rated with two scales: • Standard MOS naturalness • Impression of creakiness from 1 to 5 • 1 – does not sound like creaky voice • 2 – • 3 – • 4 – • 5 – sounds exactly like creaky voice

Evaluation results – MOS • System 3 is rated lower than system 1 • No other statistically significant differences • Conclusions: • Creaky voice transformation does not decrease naturalness, except when the f0 of BDL was used • Degradation of system 3 is probably due to different prosody [Figure: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]

Evaluation results – Creakiness • System 1 is rated less creaky than the other systems • Conclusions: • Creaky voice transformation is successful: all transformed voices are rated creaky • f0 has less effect on the impression of creakiness, but it contributes to naturalness [Figure: creakiness ratings for AWB, AWB + creaky excitation, AWB + f0 transformation + creaky excitation, and AWB + BDL f0 stream + creaky excitation]

Summary • Methods for the HMM-based synthesis of creaky voice were investigated • This requires: • a method for detecting creaky voice • a robust pitch tracker and voicing decision • prediction of creaky voice from contextual factors • a dedicated vocoder for rendering the creaky excitation • Evaluation showed a significant improvement in naturalness and creakiness • Transformation of a non-creaky speaker to a creaky one was successful

Thank you!
