A Bayesian Approach to HMM-Based Speech Synthesis

Kei Hashimoto¹, Heiga Zen¹, Yoshihiko Nankaku¹, Takashi Masuko², and Keiichi Tokuda¹
¹Nagoya Institute of Technology  ²Tokyo Institute of Technology
Background
• HMM-based speech synthesis system
  • Spectrum, excitation, and duration are modeled by HMMs
  • Speech parameter sequences are generated from the trained models
• Maximum likelihood (ML) criterion
  • Used both to train the HMMs and to generate speech parameters
  • Point estimate ⇒ the over-fitting problem
• Bayesian approach
  • Estimates the posterior distribution of the model parameters
  • Prior information can be used ⇒ alleviates the over-fitting problem
Outline
• Bayesian speech synthesis
  • Variational Bayesian method
  • Speech parameter generation
• Bayesian context clustering
  • Prior distribution using cross validation
• Experiments
• Conclusion & future work
Bayesian speech synthesis (1/2)
Model training and speech synthesis:
  ML:    λ̂ = argmax_λ p(O | W, λ),   ô = argmax_o p(o | w, λ̂)
  Bayes: p(o | w, O, W) = ∫ p(o | w, λ) p(λ | O, W) dλ
where
  λ : model parameters
  o : synthesis data seq.
  w : label seq. for synthesis
  W : label seq. for training
  O : training data seq.
Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood):
  p(o | w, O, W) = Σ_z Σ_Z ∫ p(o, z | w, λ) p(O, Z | W, λ) p(λ) dλ / p(O | W)
where
  z : HMM state seq. for synthesis data
  Z : HMM state seq. for training data
  p(o, z | w, λ) : likelihood of synthesis data
  p(O, Z | W, λ) : likelihood of training data
  p(λ) : prior distribution for model parameters
⇒ Approximated by the variational Bayesian method [Attias; ’99]
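The point of the predictive distribution is that it averages over the model parameters instead of plugging in one point estimate. A minimal numeric sketch (not the paper's HMM model — a hypothetical 1-D Gaussian with unknown mean, known variance, and a conjugate Gaussian prior) shows the effect: the Bayesian predictive has heavier tails than the ML plug-in, so it is less over-confident away from the training data.

```python
import math

# Toy illustration (assumed setup, not the paper's model): 1-D Gaussian
# likelihood with known variance s2 and conjugate Gaussian prior N(m0, t0)
# on the mean.  Marginalizing the mean gives a Gaussian predictive whose
# variance is inflated by the posterior uncertainty about the mean.

def predictive(train, x, s2=1.0, m0=0.0, t0=10.0):
    n = len(train)
    xbar = sum(train) / n
    # Conjugate posterior over the mean: N(mn, tn)
    tn = 1.0 / (1.0 / t0 + n / s2)
    mn = tn * (m0 / t0 + n * xbar / s2)
    var = s2 + tn        # predictive variance = noise + parameter uncertainty
    return math.exp(-(x - mn) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ml_plugin(train, x, s2=1.0):
    xbar = sum(train) / len(train)   # ML point estimate of the mean
    return math.exp(-(x - xbar) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

data = [0.9, 1.1, 1.3, 0.7]
# Heavier tails: the predictive assigns more mass far from the data...
assert predictive(data, 4.0) > ml_plugin(data, 4.0)
# ...and is correspondingly less peaked near the sample mean
assert ml_plugin(data, 1.0) > predictive(data, 1.0)
```

The HMM case on this slide adds sums over state sequences, which is why the integral is intractable and the variational approximation below is needed.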
Variational Bayesian method (1/2)
Estimate an approximate posterior dist. Q by maximizing a lower bound F:
  log p(o, O | w, W) ≥ E_Q[ log { p(o, O, z, Z | w, W, λ) p(λ) / Q(z, Z, λ) } ] = F
(Jensen’s inequality)
where
  Q(z, Z, λ) : approximate distribution of the true posterior distribution
  E_Q[·] : expectation w.r.t. Q
Variational Bayesian method (2/2)
• Assume the random variables are statistically independent:
  Q(z, Z, λ) = Q(z) Q(Z) Q(λ)
• Optimal posterior distributions (C_z, C_Z, C_λ : normalization terms):
  Q(λ) = C_λ p(λ) exp( E_{z,Z}[ log p(o, z | w, λ) p(O, Z | W, λ) ] )
  Q(z) = C_z exp( E_λ[ log p(o, z | w, λ) ] )
  Q(Z) = C_Z exp( E_λ[ log p(O, Z | W, λ) ] )
⇒ Iterative updates as in the EM algorithm
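The EM-like iteration can be sketched on a much smaller model than the slide's HMM. Below is a standard coordinate-ascent VB update for the mean and precision of a 1-D Gaussian under a factorized Q(μ, τ) = Q(μ)Q(τ) with a conjugate Normal-Gamma prior (textbook example, assumed hyperparameter values; not the paper's model): each factor is refreshed in turn using expectations under the other, exactly the update pattern the slide describes.

```python
# Coordinate-ascent variational Bayes for a 1-D Gaussian N(mu, 1/tau)
# with factorized approximation Q(mu, tau) = Q(mu) Q(tau).
# Prior (assumed toy values): mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).

data = [1.2, 0.8, 1.1, 0.9, 1.0, 1.3]
N = len(data)
xbar = sum(data) / N

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = a0 / b0                          # initial guess for E[tau]
for _ in range(50):
    # Update Q(mu) = N(mu_n, 1/lam_n) given the current E[tau]
    mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_n = (lam0 + N) * E_tau
    # Update Q(tau) = Gamma(a_n, b_n) given the current Q(mu)
    a_n = a0 + (N + 1) / 2
    b_n = b0 + 0.5 * (sum((x - mu_n) ** 2 + 1 / lam_n for x in data)
                      + lam0 * ((mu_n - mu0) ** 2 + 1 / lam_n))
    E_tau = a_n / b_n                    # expectation used by the next sweep

# mu_n is pulled from xbar = 1.05 toward the prior mean 0 (here mu_n = 0.9)
print(mu_n, E_tau)
```

In the paper's setting the same alternation runs between Q(λ) and the state posteriors Q(z), Q(Z), with forward-backward supplying the state expectations.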
Approximation for speech synthesis
• Q(λ) depends on the synthesis data
  ⇒ Huge computational cost in the synthesis part
• Ignore the dependency on the synthesis data
  ⇒ Q(λ) is estimated from the training data only
Prior distribution
• Conjugate prior distribution
  ⇒ Posterior dist. becomes the same family of dist. as the prior dist.
  (Gaussian likelihood function ⇒ Gauss–Wishart conjugate prior)
• Determined from the statistics of prior data:
  N : # of prior data
  D : dimension of the feature vector
  μ : mean of prior data
  Σ : covariance of prior data
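A minimal sketch of what conjugacy buys, in 1-D with known variance (a hypothetical simplification of the slide's Gauss-Wishart case): a conjugate prior built from "prior data" statistics acts like N0 pseudo-observations at the prior mean, and the posterior stays in the same family with hyperparameters that are simple count-weighted combinations.

```python
# Conjugacy sketch (assumed toy setup): Gaussian likelihood with known
# variance s2, conjugate Gaussian prior summarized by prior-data statistics
# (N0 pseudo-observations with mean m0).  The posterior is again Gaussian.

def posterior_mean_var(train, m0, N0, s2=1.0):
    """Posterior N(mean, var) over the Gaussian mean; the prior enters
    exactly like N0 extra observations at m0."""
    N = len(train)
    xbar = sum(train) / N
    mean = (N0 * m0 + N * xbar) / (N0 + N)   # count-weighted combination
    var = s2 / (N0 + N)                      # shrinks as data accumulate
    return mean, var

mean, var = posterior_mean_var([2.0, 2.2, 1.8], m0=0.0, N0=3)
# Equal real and pseudo counts ⇒ posterior mean is the midpoint of 0 and 2
assert abs(mean - 1.0) < 1e-12
assert abs(var - 1.0 / 6.0) < 1e-12
```

This is why the slide parameterizes the prior by (N, μ, Σ): those statistics are all the update needs.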
Speech parameter generation
• Speech parameters consist of static and dynamic features
  ⇒ Only the static feature seq. is generated
• Speech parameter generation based on the Bayesian approach
  ⇒ Generate the static feature seq. that maximizes the lower bound F
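The generation step can be sketched with the classic ML-style linear system (the Bayesian variant on the next slide uses posterior expectations of the parameters but solves the same kind of system): given per-frame means μ and precisions U⁻¹ over stacked static+delta features, the static trajectory c solves (WᵀU⁻¹W)c = WᵀU⁻¹μ, where W maps statics to statics and deltas. All numbers below are made up for illustration.

```python
import numpy as np

# Sketch of trajectory generation with delta constraints (toy 5-frame,
# 1-D example; hypothetical means and unit precisions).

T = 5
mu = np.zeros(2 * T)
mu[0::2] = [0.0, 0.0, 0.5, 1.0, 1.0]    # static means: step from 0 to 1
prec = np.ones(2 * T)                    # U^-1: unit precisions (toy)

# W stacks static and delta rows: delta_t = 0.5 * (c_{t+1} - c_{t-1})
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                    # static row picks out c_t
    if 0 < t < T - 1:
        W[2 * t + 1, t - 1] = -0.5       # delta row
        W[2 * t + 1, t + 1] = 0.5

# Solve (W' U^-1 W) c = W' U^-1 mu for the static trajectory c
A = W.T @ (prec[:, None] * W)
b = W.T @ (prec * mu)
c = np.linalg.solve(A, b)
print(np.round(c, 3))   # a smoothed rise from near 0 to near 1
```

The zero delta means penalize abrupt changes, which is what makes the generated static trajectory smooth rather than a hard step.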
Relation between Bayes and ML
Compare with the ML criterion:
  ML    ⇒ output dist. uses point estimates of the model parameters
  Bayes ⇒ output dist. uses expectations of the model parameters
• Can be solved in the same fashion as ML
Outline
• Bayesian speech synthesis
  • Variational Bayesian method
  • Speech parameter generation
• Bayesian context clustering
  • Prior distribution using cross validation
• Experiments
• Conclusion & future work
Bayesian context clustering
Context clustering based on maximizing the lower bound F
• Select the question with the largest gain of F
  e.g., “Is this phoneme a vowel?” (yes / no)
• Stopping condition
  ⇒ Split the node based on the gain (only while the gain is positive)
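The greedy split selection on this slide can be sketched with a stand-in node score (here the Gaussian log-likelihood of the node's data; the paper instead uses the variational lower bound F): evaluate every question, keep the one with the largest positive gain, and stop when no question improves the score. Contexts, questions, and values below are invented for illustration.

```python
import math

# Toy greedy question selection for context clustering.  node_score is a
# stand-in for the paper's lower bound F (we use a Gaussian log-likelihood).

def node_score(xs):
    n = len(xs)
    if n < 2:
        return 0.0
    m = sum(xs) / n
    var = max(sum((x - m) ** 2 for x in xs) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(data, questions):
    """data: list of (context, value); questions: name -> yes/no predicate.
    Returns the question with the largest score gain (None if no gain)."""
    base = node_score([v for _, v in data])
    best = (None, 0.0)
    for name, q in questions.items():
        yes = [v for c, v in data if q(c)]
        no = [v for c, v in data if not q(c)]
        gain = node_score(yes) + node_score(no) - base   # gain of this split
        if gain > best[1]:
            best = (name, gain)
    return best

# Hypothetical data: vowel contexts cluster high, consonants low
data = [('a', 1.0), ('i', 1.1), ('u', 0.9),
        ('k', -1.0), ('t', -1.1), ('s', -0.9)]
questions = {'is-vowel': lambda c: c in 'aiueo',
             'is-a': lambda c: c == 'a'}
name, gain = best_split(data, questions)
assert name == 'is-vowel' and gain > 0
```

Using F instead of a likelihood score is what lets the Bayesian tree choose its own size: the gain goes negative once further splits are not supported by the data and prior.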
Impact of prior distribution
• The prior affects model selection as a tuning parameter
  ⇒ A technique for determining the prior dist. is required
• Conventional: maximize the marginal likelihood
  • Leads to the over-fitting problem, as in ML
  • Tuning parameters are still required
• Determination of the prior distribution using cross validation [Hashimoto; ’08]
Bayesian approach using CV
Prior distribution based on cross validation:
• Training data is randomly divided into K groups (e.g., K = 3)
• The cross-valid prior dist. for each group is calculated from the other
  groups (2,3 / 1,3 / 1,2)
• The posterior dist. and likelihood are then calculated on the held-out group
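The data-splitting step above can be sketched directly (toy 1-D statistics, assumed round-robin fold assignment): each fold's prior is built from the statistics of the other K-1 folds, so the prior never sees the data it will later be combined with.

```python
# Sketch of cross-validation prior statistics (toy 1-D version).
# For each of K folds, the prior (count, mean, variance) comes from the
# K-1 remaining folds only.

def cv_prior_stats(data, K=3):
    folds = [data[i::K] for i in range(K)]          # simple round-robin split
    priors = []
    for k in range(K):
        held_out = [x for j, f in enumerate(folds) if j != k for x in f]
        n = len(held_out)
        mean = sum(held_out) / n
        var = sum((x - mean) ** 2 for x in held_out) / n
        priors.append((n, mean, var))               # prior-data statistics
    return priors

stats = cv_prior_stats([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], K=3)
# Each prior is estimated from the 4 samples outside its own fold
assert all(n == 4 for n, _, _ in stats)
```

These (N, μ, Σ)-style statistics are exactly the quantities the conjugate-prior slide needs, which is how the CV priors plug into the Bayesian training without extra tuning parameters.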
Outline
• Bayesian speech synthesis
  • Variational Bayesian method
  • Speech parameter generation
• Bayesian context clustering
  • Prior distribution using cross validation
• Experiments
• Conclusion & future work
Experimental conditions (2/2)
• Compared approaches: ML and Bayes
• Mean Opinion Score (MOS) test
  • Subjects were 10 Japanese students
  • 20 sentences were chosen at random
Subjective listening test
[Figure: mean opinion scores of the compared systems; model sizes shown as 2,491 / 25,911 and 2,553 / 27,106]
Conclusions and future work
• A new framework based on the Bayesian approach
  • All processes are derived from a single predictive distribution
  • Improves the naturalness of synthesized speech
• Future work
  • Introduce HSMMs instead of HMMs
  • Investigate the relation between speech quality and model structures