Hyperparameter Estimation for Speech Recognition Based on Variational Bayesian Approach
Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee and Keiichi Tokuda (Nagoya Institute of Technology)

[Figures: the value of F and phoneme accuracy (%) plotted against the hyperparameter T (20,000 sentences); phoneme accuracy (%) against tree size (%) for 20,000-, 2,500- and 200-sentence training sets; a context-clustering decision tree over HMM states such as /a/.state[2] and /a/.state[4]; a notation legend (observation vectors, hidden variables, model parameters, hyperparameters, variational posteriors, mean vectors, inverse covariance matrices, leaf-node indices and occupations, root-node mean vector and covariance matrix)]

1. Introduction
• Recent speech recognition systems
  • ML (Maximum Likelihood) criterion
    ⇒ Reduces the estimation accuracy
  • MDL (Minimum Description Length) criterion
    ⇒ Based on an asymptotic approximation
• Variational Bayesian (VB) approach
  • Higher generalization ability
  • Appropriate model structures can be selected
  • Performance depends on hyperparameters
• Objective: estimating appropriate hyperparameters
  ⇒ Maximize F w.r.t. the hyperparameters

2. Bayesian Framework
• Model parameters are regarded as probabilistic variables
  • Prior distribution and likelihood ⇒ evidence and posterior distribution ⇒ predictive distribution
• Advantages
  • Prior knowledge can be integrated
  • Model structure can be selected
  • Robust classification
• Disadvantage
  • Includes integral and expectation calculations
    ⇒ An effective approximation technique is required
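To make the Bayesian framework above concrete, here is a minimal sketch, not the poster's HMM model: a single Gaussian mean with known variance and a conjugate Gaussian prior, where the posterior and predictive distributions stay in closed form. All names and numbers are illustrative.

```python
import math

def posterior(mu0, tau0_sq, sigma_sq, data):
    """Conjugate update for a Gaussian mean with known variance sigma_sq.

    Prior:      mu  ~ N(mu0, tau0_sq)
    Likelihood: x_i ~ N(mu, sigma_sq)
    Returns the posterior mean and variance of mu (also Gaussian).
    """
    n = len(data)
    xbar = sum(data) / n
    post_prec = 1.0 / tau0_sq + n / sigma_sq          # precisions add
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma_sq)
    return post_mean, post_var

def predictive_var(post_var, sigma_sq):
    """Predictive distribution of a new x is N(post_mean, sigma_sq + post_var)."""
    return sigma_sq + post_var

data = [4.8, 5.1, 5.3, 4.9, 5.2]
m, v = posterior(mu0=0.0, tau0_sq=10.0, sigma_sq=1.0, data=data)
# The posterior mean lies between the prior mean (0.0) and the sample mean,
# and the predictive variance exceeds the observation noise alone.
print(m, v, predictive_var(v, 1.0))
```

This illustrates the "robust classification" point: predictions integrate over parameter uncertainty (the extra `post_var` term) instead of committing to an ML point estimate.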
3. Variational Bayesian Approach
• Variational Bayes [Attias; 1999]
  • Approximates posterior distributions by the variational method
  • Defines a lower bound F on the log-likelihood
  ⇒ Maximize F w.r.t. the variational posteriors
• Use a conjugate prior distribution
  • Output probability distribution ⇒ Gaussian distribution
  • Conjugate prior distribution ⇒ Gauss-Wishart distribution
  • Likelihood function ⇒ proportional to a Gauss-Wishart distribution
• Context clustering based on Variational Bayes [Watanabe et al.; 2002]
  • Split each leaf node P into Yes/No children by a phonetic question Q
  • ΔF = F_Y + F_N − F_P
  • Stop the node split if ΔF < 0

4. Hyperparameter Estimation
• Estimate hyperparameters by maximizing the marginal likelihood
  ⇒ Maximize F w.r.t. the hyperparameters, based on the posterior distributions
• Conventional: use monophone HMM state statistics
  ⇒ Maximize F at the root node
• Proposed: use the statistics of all leaf nodes
  ⇒ Maximize F of the tree structure
• Define a new hyperparameter T representing the amount of prior data
• Tying structure of prior distributions
  • Consider four kinds of tying structure (all, phone, state, leaf)
  • If prior distributions have a tying structure ⇒ F is good for model selection
  • Otherwise ⇒ F increases monotonically as T increases
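The ΔF split test above can be sketched with scalar data. In the poster, F is the VB lower bound E_q[log p(X, Z, Λ) / q(Z, Λ)] on the log marginal likelihood; as a stand-in, this sketch uses the exact log evidence of 1-D Gaussian data under a conjugate normal-gamma prior (which has a closed form) and accepts a question's split only when ΔF = F_Y + F_N − F_P > 0. The hyperparameter values and data are illustrative, not the poster's.

```python
import math

def log_evidence(data, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Exact log marginal likelihood of 1-D data under a normal-gamma prior.

    x_i ~ N(mu, 1/lam), (mu, lam) ~ NormalGamma(mu0, kappa0, alpha0, beta0).
    """
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)         # within-node scatter
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * ss
              + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - (n / 2.0) * math.log(2.0 * math.pi))

def delta_f(parent, yes, no):
    """Gain of splitting a parent node into Yes/No children: F_Y + F_N - F_P."""
    return log_evidence(yes) + log_evidence(no) - log_evidence(parent)

yes = [-0.1, 0.0, 0.1, 0.05, -0.05]   # frames answering the question "Yes"
no = [9.9, 10.0, 10.1, 10.05, 9.95]   # frames answering "No"
gain = delta_f(yes + no, yes, no)
print(gain > 0)   # well-separated children -> accept the split
```

Because the evidence integrates out the parameters, the split is penalized for its extra flexibility automatically, which is why ΔF can serve as a stopping criterion without a separate complexity term (unlike MDL's asymptotic penalty).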
5. Experimental Conditions
[Conditions table not recovered; the result figures use training sets of 20,000, 2,500 and 200 sentences]

6. Experimental Results
• Relationships between F and recognition accuracy
  • F and recognition accuracy behaved similarly
  ⇒ The proposed technique gives a consistent improvement in the value of F
• Relationships between tying structure and the amount of training data
  • The appropriate tying structure of prior distributions depends on the amount of training data
    • Large training data set ⇒ tie few prior distributions
    • Small training data set ⇒ tie many prior distributions
• VB clustering with an appropriate prior distribution improves the recognition performance
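One way to read Section 4's hyperparameter T, "the amount of prior data", is as an equivalent sample size that weights a tied statistic when forming each node's prior. The poster's exact parameterization is not recoverable from this text, so the following is only a sketch under that reading: the prior mean is a tied statistic, T acts like T pseudo-observations, and the resulting posterior mean interpolates between the prior and the node's data.

```python
def posterior_mean(tied_mean, T, data):
    """Posterior mean of a Gaussian mean when the prior is worth T pseudo-counts.

    T = 0 ignores the prior entirely; a large T pulls the estimate toward the
    tied (prior) statistic, mimicking a large amount of prior data.
    NOTE: illustrative parameterization, not the poster's exact formulation.
    """
    n = len(data)
    xbar = sum(data) / n
    return (T * tied_mean + n * xbar) / (T + n)

leaf_data = [1.8, 2.2, 2.0, 1.9, 2.1]   # observations assigned to one leaf
tied_mean = 0.0                          # statistic shared by a tied prior

print(posterior_mean(tied_mean, T=0.0, data=leaf_data))   # = sample mean
print(posterior_mean(tied_mean, T=50.0, data=leaf_data))  # pulled toward 0.0
```

This matches the experimental trend above: with much training data, the data term dominates and few priors need to be tied; with little data, heavier tying (a stronger shared prior) stabilizes the estimates.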