HMM-Based Synthesis of Creaky Voice. Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl.
Creaky voice
• Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration
• Highly irregular, with secondary laryngeal excitations
Use of creaky voice
• Usually involuntary, but various systematic usages have been reported
• For instance, creaky voice has been observed as a
  • phrase boundary marker
  • turn-yielding mechanism
  • indication of hesitation
  • portrayal of social status
  • cue for communicating attitude and affective states
Synthesis of creaky voice
• HMM-based synthesis of creaky voice requires
  • An algorithm for automatic detection of creaky voice
  • Accurate f0 estimation and voicing decision
  • Prediction of creaky voice from context (text input)
  • A vocoder capable of rendering creaky excitation
This work…
• Compares different f0 estimation methods suitable for building creaky voice synthesis
• Brings together the previous research in a framework for creaky voice synthesis
• Explores the conversion of a normal synthetic voice to a creaky one
What modifications are required to construct a creaky voice synthesis system from a conventional HTS system?
[Diagram: conventional HTS pipeline. Training: speech data and labels; spectrum estimation; f0 estimation and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]
[Diagram: the same HTS pipeline, with creaky speech data as the training input.]
B) Replace f0 estimation method with one suitable for creaky voice
f0 estimation of creaky voice
• Creaky voice has low f0 and irregular excitation
• Many f0 trackers output spurious values or classify creak as unvoiced
• A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice:
  • GlottHMM
  • SWIPE (with SPTK 3.6 voicing decision)
  • RAPT (SPTK 3.6)
  • SPTK 3.1 cepstrum-based pitch function
  • STRAIGHT TEMPO
f0 estimation of creaky voice – Evaluation
• Methods were mostly used with default settings
• Frame length was set to 45 ms whenever possible
• Speech data:
  • 3 databases of read speech for TTS development
    • American English male BDL
    • Finnish male MV
    • Finnish female HS
  • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)
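A long frame matters for creak because an analysis window can only resolve pitch periods that fit inside it. The sketch below illustrates this with simple arithmetic. It assumes that an estimator needs roughly two full periods per frame, which is a common rule of thumb and not a figure from this work; the function name and the 25 ms comparison value are likewise illustrative.

```python
def min_f0_for_frame(frame_ms: float, periods_per_frame: float = 2.0) -> float:
    """Lowest f0 (Hz) for which `periods_per_frame` periods fit in one frame.

    The two-period requirement is an assumed rule of thumb.
    """
    frame_s = frame_ms / 1000.0
    return periods_per_frame / frame_s


# Compare an assumed shorter frame (25 ms) with the 45 ms frame used here.
for frame_ms in (25.0, 45.0):
    limit = min_f0_for_frame(frame_ms)
    print(f"{frame_ms:.0f} ms frame: lowest resolvable f0 is about {limit:.1f} Hz")
```

Under this assumption, lengthening the frame from 25 ms to 45 ms lowers the resolvable f0 from about 80 Hz to about 44 Hz, which is closer to the low-frequency range where creak occurs.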
f0 estimation of creaky voice – Results
• GlottHMM [1] performed best with TTS data
• SPTK performed best with conversational speech
• For creaky voice TTS development, GlottHMM f0 estimation was chosen
[1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, IEEE Trans. on Audio, Speech, and Lang. Proc., 2011
C) Detect creaky regions and model creak as a special case
[Diagram: HTS pipeline extended for creaky voice. Training: creaky speech data and labels; spectrum estimation; f0 estimation and voicing decision; creaky voice detection; HMM training (spectrum, f0, creaky probability); creaky voice model; extraction of the creaky excitation into an average creaky residual. Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis with normal or creaky excitation; creaky speech.]
Creaky voice detection
• Hand-annotation is too laborious, so automatic methods are needed
• An automatic creaky voice detection method by Kane & Drugman [1,2]
• Based on linear prediction (LP) residual features
[1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012
[2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
[Figure panels: probability of creak; LP residual]
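The detector's features are computed from the LP residual. The sketch below shows only how such a residual can be obtained, using the autocorrelation method with a Levinson-Durbin recursion; it is not the Kane & Drugman detector. The function names, the second-order toy resonator, and the pulse positions are assumptions made for illustration.

```python
import numpy as np


def lpc_autocorr(frame, order):
    """LP polynomial a[0..order] (a[0] = 1) via Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a


def lp_residual(signal, a):
    """Inverse-filter the signal with A(z) to obtain the LP residual."""
    return np.convolve(signal, a)[:len(signal)]


def all_pole_filter(excitation, a):
    """Filter an excitation through 1/A(z) (toy vocal-tract model)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y


# Toy example: sparse, irregularly spaced pulses through a resonator,
# loosely imitating the irregular excitation of creak.
rng = np.random.default_rng(0)
excitation = 0.01 * rng.standard_normal(2000)
excitation[[150, 610, 700, 1320, 1400, 1900]] += 1.0
speech = all_pole_filter(excitation, np.array([1.0, -1.3, 0.6]))

a_hat = lpc_autocorr(speech, order=2)
residual = lp_residual(speech, a_hat)
print("Estimated A(z):", np.round(a_hat, 2))
print("Largest residual magnitudes at samples:",
      np.sort(np.argsort(np.abs(residual))[-6:]))
```

Inverse filtering removes most of the resonance, so the residual is dominated by the excitation pulses, which is why residual-based features are a natural place to look for the irregular pulses of creak.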
Modeling creaky excitation
• Extension of the deterministic plus stochastic model (DSM) [1,2] that integrates dedicated modeling of creaky voice
[1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012
[2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, IEEE Trans. on Audio, Speech and Lang. Proc., 2012
[Figure: deterministic component and envelope of the stochastic component of the creaky excitation, with the main excitation, the secondary excitation, and the glottal closure instants (GCIs) marked]
Voice building and synthesis
• Training:
  • Standard HTS method with the addition of a 1-dimensional stream of creaky probability
  • Spectrum: 30th-order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs)
• Synthesis:
  • Excitation: DSM vocoder, with creaky parts rendered with the creaky excitation
  • Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
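At synthesis time, the generated creaky probability stream has to be turned into a per-frame choice between normal and creaky excitation. The source does not state how this decision is made, so the following is only a sketch under assumed settings: the 0.5 threshold, the minimum segment length, and the function name `creaky_segments` are all hypothetical.

```python
def creaky_segments(prob, threshold=0.5, min_frames=3):
    """Return (start, end) frame ranges to render with creaky excitation.

    `end` is exclusive. Both `threshold` and `min_frames` are assumed values,
    not settings reported in the source.
    """
    segments = []
    start = None
    for i, p in enumerate(prob):
        if p >= threshold and start is None:
            start = i                        # a candidate creaky run begins
        elif p < threshold and start is not None:
            if i - start >= min_frames:      # drop runs that are too short
                segments.append((start, i))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(prob) - start >= min_frames:
        segments.append((start, len(prob)))
    return segments


# Hypothetical generated trajectory of creaky probability, one value per frame.
trajectory = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.1, 0.9, 0.9, 0.9]
print(creaky_segments(trajectory))
```

Frames outside the returned ranges would use the normal DSM excitation. Discarding very short runs is one simple way to avoid switching excitation type on isolated frames.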
Evaluation
• The following systems were compared:
  • 1: Conventional (STRAIGHT f0)
  • 2: Proposed (GlottHMM f0)
  • 3: Proposed (GlottHMM f0 and creaky excitation)
• Subjective online listening tests
• Stimuli: 20 sentences from the held-out data of BDL and MV
• 29 test subjects
Evaluation – MOS naturalness
• Systems 2 and 3 are rated significantly higher (p < 0.001) than system 1
• The difference between systems 2 and 3 is not significant
• Conclusions:
  • Use of GlottHMM f0 improves naturalness
  • Modeling of creaky excitation has no significant effect on MOS
[Chart: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation]
Evaluation – Creaky rendering
• Pairwise comparison of samples
• Systems 2 and 3 are preferred over system 1
• System 3 is preferred over system 2
• Conclusions:
  • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering
[Chart: pairwise preference between systems 1 (STRAIGHT f0), 2 (GlottHMM f0), and 3 (GlottHMM f0 + creaky excitation), including a "no preference" option]
Is it possible to transplant a creaky voice quality to a non-creaky speaker?
Adding creak for a non-creaky speaker
• Convert the non-creaky voice of Scottish English male AWB to creaky
• Transplantation strategy:
  • Creaky voice is predicted from American English male BDL
  • Creaky excitation pulse from BDL is used to render creak
  • f0 is either:
    • kept as is
    • substituted with BDL f0 by stream substitution
    • transformed only in the creaky parts
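The three f0 options can be pictured as operations on frame-aligned f0 trajectories. The sketch below is a loose illustration, not the method used in this work: it assumes that the target (AWB) and donor (BDL) trajectories are already frame-aligned, and it approximates "transformed only in the creaky parts" by copying the donor value in creaky frames, since the source does not specify the transformation. The function and strategy names are hypothetical.

```python
import numpy as np


def transplant_f0(target_f0, donor_f0, creak_mask, strategy):
    """Combine target and donor f0 trajectories under one of three strategies.

    Assumes all inputs are frame-aligned and of equal length.
    """
    target_f0 = np.asarray(target_f0, dtype=float)
    donor_f0 = np.asarray(donor_f0, dtype=float)
    creak_mask = np.asarray(creak_mask, dtype=bool)
    if not (target_f0.shape == donor_f0.shape == creak_mask.shape):
        raise ValueError("trajectories and mask must have the same length")

    if strategy == "keep":
        # f0 of the target speaker is left untouched.
        return target_f0.copy()
    if strategy == "substitute":
        # Whole f0 stream is taken from the donor.
        return donor_f0.copy()
    if strategy == "creaky_only":
        # Assumed stand-in for the unspecified transformation:
        # use the donor value in creaky frames only.
        return np.where(creak_mask, donor_f0, target_f0)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical four-frame example (values in Hz).
target = [100.0, 110.0, 120.0, 130.0]
donor = [50.0, 40.0, 45.0, 60.0]
mask = [False, True, True, False]
for name in ("keep", "substitute", "creaky_only"):
    print(name, transplant_f0(target, donor, mask, name))
```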
Evaluation
• Four different voices were built:
  • 1: AWB (baseline)
  • 2: AWB with BDL creaky excitation
  • 3: AWB with BDL creaky excitation and BDL f0
  • 4: AWB with BDL creaky excitation and f0 transformation
Evaluation
• Subjective online listening tests
• 14 test subjects
• 28 synthesized stimuli
• Samples were rated on two scales:
  • Standard MOS naturalness
  • Impression of creakiness from 1 to 5, with only the end points labeled
    • 1 – does not sound like creaky voice
    • 5 – sounds exactly like creaky voice
Evaluation results – MOS
• System 3 is rated lower than system 1
• No other statistically significant differences
• Conclusions:
  • Creaky voice transformation does not decrease naturalness, except when the f0 of BDL was used
  • Degradation of system 3 is probably due to different prosody
[Chart: MOS naturalness for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]
Evaluation results – Creakiness
• System 1 is rated less creaky than the other systems
• Conclusions:
  • Creaky voice transformation is successful: all transformed voices are rated creaky
  • f0 has less effect on the impression of creakiness, but it contributes to naturalness
[Chart: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]
Summary
• Methods for the HMM-based synthesis of creaky voice were investigated
• This requires:
  • a method for detecting creaky voice
  • a robust pitch tracker and voicing decision
  • prediction of creaky voice from contextual factors
  • a dedicated vocoder for rendering the creaky excitation
• Evaluation showed a significant improvement in naturalness and creakiness
• Transformation of a non-creaky speaker to a creaky one was successful
Thank you!