Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011
Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda
Nagoya Institute of Technology, 2 September 2011
Background
• HMM-based speech synthesis
  • Quality of synthesized speech depends on the acoustic models
  • Model estimation is one of the most important problems
  • An appropriate training algorithm is required
• Deterministic annealing EM (DAEM) algorithm
  • Overcomes the local maxima problem
• Step-wise model selection
  • Jointly optimizes model structures and state sequences
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Overview of HMM-based system
• Training part: extract excitation and spectral parameters from the speech database, then train context-dependent HMMs and duration models from these parameters and the labels
• Synthesis part: analyze the input text into labels, generate excitation and spectral parameters from the trained HMMs, then produce synthesized speech by passing the generated excitation through the synthesis filter
Base techniques
• Hidden semi-Markov model (HSMM)
  • HMM with explicit state duration probability distributions
  • Estimates both state output and duration probability distributions
• STRAIGHT
  • A high-quality speech vocoding method
  • Spectrum, F0, and aperiodicity measures
• Parameter generation considering global variance (GV)
  • GV features calculated only from speech regions, excluding silence and pauses
  • Context-dependent GV models
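As a minimal illustration of the GV feature mentioned above (a simplified sketch, not the NIT implementation; `global_variance` and the toy data are hypothetical), the GV of an utterance is the per-dimension variance of the parameter trajectory, computed only over speech frames:

```python
# Sketch: compute a global variance (GV) feature vector from frame-level
# parameters, excluding silence/pause frames as the slide describes.
# (Hypothetical helper; real systems compute this per utterance from
# mel-cepstral or similar trajectories.)

def global_variance(frames, is_silence):
    """frames: list of feature vectors; is_silence: parallel list of bools."""
    speech = [f for f, sil in zip(frames, is_silence) if not sil]
    dim = len(speech[0])
    n = len(speech)
    mean = [sum(f[d] for f in speech) / n for d in range(dim)]
    return [sum((f[d] - mean[d]) ** 2 for f in speech) / n for d in range(dim)]

# toy 2-dimensional "spectral" trajectory with silence at both ends
frames = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
silence = [True, False, False, False, True]
gv = global_variance(frames, silence)
```

Excluding the silence frames matters: the zero-valued frames at the utterance edges would otherwise inflate the variance estimate.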
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
EM algorithm
• Maximum likelihood (ML) criterion
  $\hat{\lambda} = \arg\max_{\lambda} P(o \mid \lambda)$
• Expectation-maximization (EM) algorithm
  • E-step: compute the posterior $P(q \mid o, \lambda)$
  • M-step: $\lambda \leftarrow \arg\max_{\lambda'} \sum_{q} P(q \mid o, \lambda) \log P(o, q \mid \lambda')$
  ($\lambda$: model parameters, $o$: training data, $q$: HMM state sequence)
• Suffers from the local maxima problem
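The E/M steps above can be sketched on a toy model (a 2-component 1-D Gaussian mixture standing in for HMM training; `em_gmm` and the data are hypothetical, not the NIT system):

```python
import math

def em_gmm(data, iters=50):
    """Plain EM for a 2-component 1-D Gaussian mixture (toy illustration).

    E-step: posterior component responsibilities.
    M-step: re-estimate parameters from the responsibilities.
    """
    # crude initialisation; EM is sensitive to this (the local maxima problem)
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def gauss(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: posterior probability of each component for each sample
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: maximise the expected complete-data log likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
            w[k] = nk / len(data)
    return mu, var, w

# two well-separated clusters around 0.0 and 5.0
data = [0.1, -0.2, 0.0, 0.2, 5.0, 4.8, 5.2, 5.1]
mu, var, w = em_gmm(data)
```

With a poor initialisation the same loop can converge to a much worse local maximum, which is the motivation for the DAEM algorithm on the next slides.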
DAEM algorithm
• Tempered posterior probability
  $P_{\beta}(q \mid o, \lambda) = \dfrac{P(o, q \mid \lambda)^{\beta}}{\sum_{q'} P(o, q' \mid \lambda)^{\beta}}$
  ($\beta$: temperature parameter, $0 \le \beta \le 1$)
• Model update process
  • E-step: compute the tempered posterior $P_{\beta}(q \mid o, \lambda)$
  • M-step: $\lambda \leftarrow \arg\max_{\lambda'} \sum_{q} P_{\beta}(q \mid o, \lambda) \log P(o, q \mid \lambda')$
  • Gradually increase the temperature parameter $\beta$ from near 0 to 1
Optimization of state sequence
• Likelihood function in the DAEM algorithm: product of state output and state transition probabilities over time
• Early in annealing (small $\beta$), all state sequences have uniform posterior probability
• As $\beta$ increases, the posterior over state sequences changes from uniform to sharp
• This gradual sharpening allows reliable acoustic models to be estimated
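The annealing behaviour above can be sketched on the same toy mixture (a stand-in for HMM state-sequence estimation; `daem_gmm`, the $\beta$ schedule, and the data are hypothetical). The only change from plain EM is that joint likelihoods are raised to the power $\beta$ before normalisation, and $\beta$ is increased step by step:

```python
import math

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def loglik(data, mu, var, w):
    """Log likelihood of a 2-component 1-D Gaussian mixture."""
    return sum(math.log(sum(w[k] * gauss(x, mu[k], var[k]) for k in range(2)))
               for x in data)

def daem_gmm(data, betas=(0.1, 0.3, 0.6, 1.0), iters_per_beta=5):
    """DAEM sketch: tempered responsibilities with an increasing beta.

    Small beta makes the posterior nearly uniform over assignments;
    beta = 1 recovers ordinary EM.
    """
    mu = [0.0, 0.5]          # deliberately poor initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    for beta in betas:       # gradually sharpen towards the true posterior
        for _ in range(iters_per_beta):
            # E-step: responsibilities proportional to likelihood ** beta
            resp = []
            for x in data:
                p = [(w[k] * gauss(x, mu[k], var[k])) ** beta for k in range(2)]
                s = sum(p)
                resp.append([pk / s for pk in p])
            # M-step: identical to plain EM
            for k in range(2):
                nk = sum(r[k] for r in resp)
                mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
                var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                                 for r, x in zip(resp, data)) / nk, 1e-6)
                w[k] = nk / len(data)
    return mu, var, w

data = [0.1, -0.2, 0.0, 0.2, 5.0, 4.8, 5.2, 5.1]
mu, var, w = daem_gmm(data)
```

The design point is that early, near-uniform posteriors keep the model from committing to a bad assignment before the parameters are trustworthy.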
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Problem of context clustering
• Context-dependent models
  • Appropriate model structures are required
• Decision-tree-based context clustering (questions such as "Vowel?", "/a/?", "Silence?")
  • Assumes that state occupancies do not change during clustering
  • But state occupancies depend on the model structure
  • State sequences and model structures should therefore be optimized simultaneously
Step-wise model selection
• Gradually change the size of the decision trees
  • Performs joint optimization of model structures and state sequences
• Minimum description length (MDL) criterion
  $\mathrm{DL} = -\log P(o \mid \hat{\lambda}) + \alpha D M \log W$
  ($\alpha$: tuning parameter, $W$: amount of training data assigned to the root node, $M$: number of nodes, $D$: dimension of the feature vector)
Model training process
1. Estimate monophone models (DAEM)
   • Number of temperature parameter updates: 10
   • Number of EM steps at each temperature: 5
2. Select decision trees by the MDL criterion using the tuning parameter
3. Estimate context-dependent models (EM)
   • Number of EM steps: 5
4. Decrease the tuning parameter (4, then 2, then 1) and repeat from step 2
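The schedule above can be laid out as a skeleton loop (the step functions are placeholders recorded in a log, not the actual training code, and the linear $\beta$ schedule is an assumption; the slide only gives the counts):

```python
def train_system():
    """Skeleton of the training schedule: returns a log of the steps taken."""
    log = []

    # Step 1: monophone DAEM training
    # 10 temperature updates, 5 EM steps at each temperature
    for step in range(1, 11):
        beta = step / 10.0            # assumed linear schedule towards 1.0
        for _ in range(5):
            log.append(("daem_em_step", beta))

    # Steps 2-4: step-wise model selection with a shrinking tuning parameter
    for alpha in (4, 2, 1):
        log.append(("mdl_tree_selection", alpha))   # rebuild decision trees
        for _ in range(5):                          # 5 EM steps per stage
            log.append(("em_step", alpha))
    return log

schedule = train_system()
```

Re-selecting the trees after each re-estimation pass is what makes the optimization of model structures and state sequences joint rather than one-shot.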
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Likelihood & model structure
• Average log likelihood of the monophone models and number of leaf nodes were compared (charts omitted)
• Phone set: Unilex (58 phonemes)
• Number of leaf nodes (full-context): 6,175,466
Experimental results
• Compared with the benchmark HMM-based system
  • The NIT system achieved comparable performance
  • High intelligibility
• Compared with the benchmark unit-selection system
  • Worse in speaker similarity
  • Better in intelligibility
Speech samples
• Generated highly intelligible speech
• But voiced/unvoiced errors remain
• Feature extraction and excitation generation need improvement
Conclusion
• NIT HMM-based speech synthesis system
  • DAEM algorithm: overcomes the local maxima problem
  • Step-wise model selection: jointly optimizes state sequences and model structures
  • Generates highly intelligible speech
• Future work
  • Improve feature extraction and excitation generation
  • Investigate the schedules for the temperature parameter and for step-wise model selection