230 likes | 353 Views
Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis. João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi. The Centre for Speech Technology Research The University of Edinburgh. Outline. Introduction Voice source model System
E N D
Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi The Centre for Speech Technology Research The University of Edinburgh
Outline • Introduction • Voice source model • System • Perceptual evaluation • Concluding remarks • Future work
Training speech F0 extraction Spectral features estimation Text analysis Text HMMs F0 spectrum Pulse train Synthetic Speech Synthesis filter + Noise component IntroductionHMM-based speech synthesizer [Tokuda et al]
Voice source modelObtaining the glottal source signal • Source-filter model: • Inverse filtering: Source Ug Vocal tract A(z) Lip radiation d/dz Speech Lip radiation cancellation (∫) Inverse Filter 1/A(z) Speech
Voice source modelLiljencrants-Fant model (LF-model) T : period to : opening instant tp : instant of max airflow te : instant of max excitation ta : return phase duration tc : closing instant Ee : excitation amplitude
Voice source modelOther parameters of the LF-model Open quotient: Speed quotient: Return quotient:
Fg glottal spectral peak Fc spectral tilt Voice source modelDescription of the LF-model spectrum Linear stylization of the LF-model spectrum [Doval and d’Alessandro]
Voice source modelFeatures extraction • utterances sampled at 16 kHz • pitch-synchronous analysis (ESPS tools) • LPCs calculated with windows centered at the glottal epochs and duration 20ms • inverse filtering to estimate DGS • pre-emphasis filter (α=0.97) • low-pass filtering of the residual at 4 kHz
Voice source modelEstimation of te and Ee • te and Ee are estimated from the pitch-marks
Voice source modelEstimation of tc, tp and to [Gobl & Chasaide]
Voice source modelEstimation of ta Fs : sampling frequency m : slope of the tangent at t=te
Voice source modelExamples of the estimated parameters Curves of the LF-parameters for 2 voiced regions of an utterance
SystemGeneral description - Nitech-HTS 2005 system - STRAIGHT method for analysis and synthesis - mixed multi-band excitation with phase manipulation / pulse train - Mel Log Spectrum Approximation (MLSA) filter How was the LF-model integrated in the synthesizer?
SystemGeneration of the periodic excitation (pulse signal) • Pulse centered within the frame • multiplied by asymmetric widows • summed with Gaussian noise
SystemPeriodic excitation with the LF-model • 2 LF-waveforms centered at the instant te • multiplied by asymmetric widows • summed with Gaussian noise
Solution: Post-filter Linear phase FIR filter: -6dB/dec 1Hz ≤ f≤ Fg (Hz) +6dB/dec Fg < f ≤ Fc (Hz) +12dB/dec Fc < f ≤ 16 kHz SystemTechnical problem • Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train
Perceptual evaluationGeneration of the stimuli • Built US-English voice EM001 provided by ATR for the Blizzard Challenge • Glottal parameters were measured in 8 utterances and the mean values were calculated • Simple excitation, without multi-band noise or phase manipulation • Ten utterances were synthesized, using the LF-model and the pulse model
Perceptual evaluationExperiment • Forced-choice test • Presented via a web-interface browser • Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.) • 18 listeners (7 native speakers of English) • Listeners panel was mainly university students and staff Example of test speech signals: Pulse: LF-model:
Conclusions • Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source • Results showed that the LF-model can give better speech quality than the traditionally used pulse train • Direct methods used for the estimation of the mean LF-parameters seemed to perform well • A technical problem with the integration of the LF-model in the system was solved using a post-filter
Future work • To find better analysis/synthesis methods to use with the LF-model in the HMM-based speech synthesis • To evaluate the speech quality when using the mixed excitation with the LF-model • To implement voice quality transformations using the LF-model • To evaluate the parameterization methods • To model the glottal parameters with HMMs
Acknowledgements This work was financially supported by the Marie Curie EdSST programme. Thank you!