HMM-Based Synthesis of Creaky Voice

Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl. Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration.


Presentation Transcript


  1. HMM-Based Synthesis of Creaky Voice • Tuomo Raitio, John Kane, Thomas Drugman, Christer Gobl

  2. Creaky voice • Creaky voice (vocal fry) is a distinctive phonation type involving low-frequency vocal fold vibration • Highly irregular with secondary laryngeal excitations

  3. Use of creaky voice • Usually involuntary, but various systematic usages have been reported • For instance, creaky voice has been observed as • phrase boundary marker • turn-yielding mechanism • indication of hesitations • portrayal of social status • cue for communicating attitude and affective states

  4. Synthesis of creaky voice • HMM-based synthesis of creaky voice requires • Algorithm for automatic detection of creaky voice • Accurate f0 estimation and voicing decision • Prediction of creaky voice from context (text input) • Vocoder capable of rendering creaky excitation

  5. This work… • Compares f0 estimation methods suitable for building a creaky voice synthesizer • Brings together previous research by creating a framework for creaky voice synthesis • Explores the conversion of a normal synthetic voice to a creaky one

  6. What modifications are required to build a creaky voice synthesizer from a conventional HTS system?

  7. [Diagram: HTS pipeline. Training: speech data and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

  8. A) Use a database of creaky voice

  9. [Diagram: HTS pipeline with creaky speech data. Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; HMM training (spectrum, f0); voice model. Synthesis: text; front-end; parameter generation (f0, spectrum); synthesis; speech.]

  10. B) Replace f0 estimation method with one suitable for creaky voice

  11. [Diagram: HTS pipeline with creaky speech data, as in slide 9.]

  12. f0 estimation of creaky voice • Creaky voice has low f0 and irregular excitation • Many f0 trackers output spurious values or classify creak as unvoiced • A range of state-of-the-art f0 estimation algorithms was evaluated with creaky voice: • GlottHMM • SWIPE (with SPTK 3.6 voicing decision) • RAPT (SPTK 3.6) • SPTK 3.1 cepstrum-based pitch function • STRAIGHT TEMPO
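The difficulty described on this slide can be illustrated with a minimal sketch of a generic autocorrelation-based f0 estimator with a voicing decision. It is not any of the trackers listed above; the function name, the 50 Hz floor, and the 0.5 voicing threshold are assumptions chosen for illustration. The sketch shows why the search range and the frame length matter for creak: a lag can only be tested if the frame is long enough to contain it, and a frame with low periodicity is labeled unvoiced.

```python
import math

def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0, voicing_threshold=0.5):
    """Return (f0_hz, voiced) for one non-empty frame.

    Generic normalized-autocorrelation sketch; all defaults are
    illustrative assumptions, not settings from the slides.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    # The lowest detectable f0 is limited by both f0_min and the frame length.
    lag_min = max(1, int(fs / f0_max))
    lag_max = min(n - 1, int(fs / f0_min))
    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        head = x[:n - lag]
        tail = x[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = num / den if den > 0.0 else 0.0
        # Require a real improvement so that a multiple of the true period
        # does not win on rounding noise alone.
        if score > best_score + 1e-9:
            best_lag, best_score = lag, score
    # Frames whose best periodicity score is too low are treated as unvoiced.
    voiced = best_lag > 0 and best_score >= voicing_threshold
    return (fs / best_lag if voiced else 0.0), voiced
```

With a 45 ms frame, at least two full periods fit only when f0 is above roughly 44 Hz, so very low creaky f0 values push a tracker of this kind toward unvoiced decisions or spurious lags.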

  13. f0 estimation of creaky voice – Evaluation • Methods were mostly used with default settings • Frame length was set to 45 ms whenever possible • Speech data: • 3 databases of read speech for TTS development • American English male BDL • Finnish male MV • Finnish female HS • Conversational speech data from 7 other speakers (Swedish, Japanese, American English)

  14. f0 estimation of creaky voice – Results • GlottHMM [1] performed best with TTS data • SPTK performed best with conversational speech • For creaky voice TTS development, GlottHMM f0 estimation was chosen • [1] Raitio, Suni, Yamagishi, Pulakka, Nurminen, Vainio & Alku, “HMM-based speech synthesis utilizing glottal inverse filtering”, in IEEE Trans. on Audio, Speech, and Lang. Proc., 2011

  15. What modifications are required to build a creaky voice synthesizer from a conventional HTS system?

  16. C) Detect creaky regions and model creak as a special case

  17. [Diagram: HTS pipeline, as in slide 7.]

  18. [Diagram: HTS pipeline with creaky speech data, as in slide 9.]

  19. [Diagram: HTS pipeline with creaky speech data, as in slide 9.]

  20. [Diagram: HTS pipeline extended for creaky voice. Training: speech data (creaky) and labels; spectrum estimation; f0 and voicing decision; creaky voice detection; HMM training (spectrum, f0, creaky probability); voice model. Creaky voice model: extract creaky excitation; average creaky residual. Synthesis: text; front-end; parameter generation (creaky probability, f0, spectrum); synthesis (normal/creak); speech (creaky).]

  21. Creaky voice detection • Hand-annotation is too laborious → automatic methods • An automatic creaky voice detection method by Kane & Drugman [1,2] • Based on linear prediction (LP) residual features • [1] Drugman, Kane & Gobl, “Resonator-based Creaky Voice Detection”, Interspeech, 2012 • [2] Kane, Drugman & Gobl, “Improved automatic detection of creak”, Computer Speech & Language, 2013
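As background for the LP residual mentioned on this slide, the following is a minimal sketch of how a residual can be obtained by inverse filtering with linear prediction coefficients (autocorrelation method, Levinson-Durbin recursion). It does not reproduce the detection features of the cited method; the function names and the default order of 10 are assumptions for illustration.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the sequence x."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return the prediction-error filter a = [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # signal is silent or already perfectly predicted
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient of stage m
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= (1.0 - k * k)
    return a

def lp_residual(x, order=10):
    """Inverse-filter x with its own LP coefficients: e[n] = sum_j a[j] * x[n-j]."""
    a = levinson_durbin(autocorrelation(x, order), order)
    return [sum(a[j] * x[n - j] for j in range(order + 1) if n - j >= 0)
            for n in range(len(x))]
```

For a signal produced by exciting an all-pole filter with isolated pulses, the residual is close to zero except near the pulses, which is why excitation events stand out in it.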

  22. [Figure: probability of creak; LP residual]

  23. Modeling creaky excitation • Extension of the deterministic plus stochastic model (DSM) [1,2] that integrates a proper model of creaky voice • [1] Drugman, Kane & Gobl, “Modeling the creaky excitation for parametric speech synthesis”, Interspeech, 2012 • [2] Drugman & Dutoit, “The Deterministic plus Stochastic Model of the Residual Signal and its Applications”, in IEEE Trans. on Audio, Speech and Lang. Proc., 2012.

  24. [Figure: creaky excitation. Labels: deterministic component; envelope of the stochastic component; main excitation; secondary excitation; glottal closure instants (GCIs)]

  25. [Diagram: HTS pipeline extended for creaky voice, as in slide 20.]

  26. Voice building and synthesis • Training: • Standard HTS method with the addition of 1-dimensional stream of creaky probability • Spectrum: 30th order mel-generalized cepstral analysis with alpha = 0.42 and gamma = -1/3 (converted to LSFs) • Synthesis: • Excitation: DSM vocoder with creaky parts rendered with the creaky excitation • Excitation was filtered with the mel-generalized log spectral approximation (MGLSA) filter
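The switch between normal and creaky rendering at synthesis time can be sketched as a per-frame decision on the generated creaky probability stream. The slides do not state how the decision is made, so the hard threshold of 0.5 and the function name below are assumptions, not the authors' method.

```python
def select_excitation(creaky_prob, voiced, threshold=0.5):
    """Label each frame 'unvoiced', 'creaky' or 'normal'.

    creaky_prob: generated creaky probability per frame (0..1)
    voiced:      voicing decision per frame (True/False)
    threshold:   hypothetical decision threshold, not from the slides
    """
    labels = []
    for prob, is_voiced in zip(creaky_prob, voiced):
        if not is_voiced:
            labels.append("unvoiced")
        elif prob >= threshold:
            labels.append("creaky")   # render with the creaky excitation
        else:
            labels.append("normal")   # render with the regular excitation
    return labels
```

A vocoder would then pick the creaky excitation pulse for frames labeled "creaky" and the regular excitation elsewhere.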

  27. Evaluation • The following systems were compared • System 1: Conventional (STRAIGHT f0) • System 2: Proposed (GlottHMM f0) • System 3: Proposed (GlottHMM f0 and creaky excitation) • Subjective online listening tests • Stimuli: 20 sentences from the held-out data of BDL and MV • 29 test subjects

  28. Evaluation – MOS naturalness • Results indicate that systems 2 and 3 have higher (p<0.001) ratings than 1 • Difference between systems 2 and 3 is not significant • Conclusions: • Use of GlottHMM f0 improves naturalness • Modeling of creaky excitation has no effect on MOS • [Chart: MOS naturalness for STRAIGHT f0, GlottHMM f0, and GlottHMM f0 + creaky excitation]
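MOS results of this kind are typically reported as a mean rating with a confidence interval per system. The slides do not say which statistical test produced the p-values, so the sketch below only shows the summary step, using a normal-approximation 95% interval; the function name and the z value of 1.96 are assumptions.

```python
import math
import statistics

def mos_summary(ratings, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    ratings: list of opinion scores (for example 1..5) for one system.
    The normal approximation is an assumption; it is not the test
    used in the slides, which is not stated.
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, float("nan")  # no spread estimate from a single rating
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width
```

Intervals from two systems that overlap heavily are consistent with a "not significant" outcome, but a formal claim still needs a proper paired test on the per-listener ratings.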

  29. Evaluation – Creaky rendering • Pairwise comparison of samples • Systems 2 and 3 are preferred over system 1 • System 3 is preferred over system 2 • Conclusions: • Both the use of GlottHMM f0 and the modeling of creaky excitation improve creaky voice rendering • [Chart: pairwise preferences, each with a "no preference" share, among 1 (STRAIGHT f0), 2 (GlottHMM f0), and 3 (GlottHMM f0 + creaky excitation)]

  30. Is it possible to transplant a creaky voice quality to a non-creaky speaker?

  31. Adding creak for non-creaky speaker • Convert non-creaky voice of Scottish English male AWB to creaky • Transplantation strategy: • Creaky voice is predicted from American English male BDL • Creaky excitation pulse from BDL is used to render creak • f0 is either: • kept as is • substituted with BDL f0 by stream substitution • transformed only in the creaky parts
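The third f0 option, transforming f0 only in the creaky parts, can be sketched as a per-frame operation gated by the predicted creaky probability. The slides do not describe the transformation itself, so the constant scale factor, the 0.5 threshold, and the function name are hypothetical placeholders.

```python
def transform_f0_in_creak(f0, creaky_prob, scale=0.5, threshold=0.5):
    """Scale f0 only in voiced frames predicted to be creaky.

    f0:          per-frame f0 in Hz, with 0.0 marking unvoiced frames
    creaky_prob: per-frame creaky probability (0..1)
    scale:       hypothetical lowering factor, not from the slides
    threshold:   hypothetical decision threshold, not from the slides
    """
    out = []
    for value, prob in zip(f0, creaky_prob):
        if value > 0.0 and prob >= threshold:
            out.append(value * scale)  # creaky frame: modify f0
        else:
            out.append(value)          # elsewhere: keep the speaker's own f0
    return out
```

Leaving f0 untouched outside the creaky regions keeps the target speaker's prosody intact in the rest of the utterance.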

  32. Evaluation • Four different voices were built: • System 1: AWB (baseline) • System 2: AWB with BDL creaky excitation • System 3: AWB with BDL creaky excitation and BDL f0 • System 4: AWB with BDL creaky excitation and f0 transformation

  33. Evaluation • Subjective online listening tests • 14 test subjects • 28 synthesized stimuli • Samples were rated with two scales: • Standard MOS naturalness • Impression of creakiness from 1 to 5 • 1 – does not sound like creaky voice • 2 – • 3 – • 4 – • 5 – sounds exactly like creaky voice

  34. Evaluation results – MOS • System 3 is rated lower than system 1 • No other statistically significant differences • Conclusions: • Creaky voice transformation does not decrease naturalness, except when f0 of BDL was used • Degradation of system 3 is probably due to different prosody • [Chart: MOS for baseline AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]

  35. Evaluation results – Creakiness • System 1 is rated less creaky than other systems • Conclusions: • Creaky voice transformation is successful: all transformed voices are rated creaky • f0 has less effect on impression of creakiness, but it contributes to naturalness • [Chart: creakiness ratings for AWB, AWB + creaky excitation, AWB + BDL f0 stream + creaky excitation, and AWB + f0 transformation + creaky excitation]

  36. Summary • Methods for the HMM-based synthesis of creaky voice were investigated • This requires: • method for detecting creaky voice • robust pitch tracker and voicing decision • prediction of creaky voice from contextual factors • dedicated vocoder for rendering the creaky excitation • Evaluation showed a significant improvement in naturalness and creakiness • Transformation of a non-creaky speaker to a creaky one was successful Thank you!
