Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011
Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda
Nagoya Institute of Technology, 2 September 2011
Background
• HMM-based speech synthesis
  • Quality of synthesized speech depends on the acoustic models
  • Model estimation is one of the most important problems
  • An appropriate training algorithm is required
• Deterministic annealing EM (DAEM) algorithm
  • Overcomes the local maxima problem
• Step-wise model selection
  • Jointly optimizes model structures and state sequences
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Overview of HMM-based system
• Training part: extract excitation and spectral parameters from the speech database, then train context-dependent HMMs and duration models from these parameters and the labels
• Synthesis part: analyze the input text into labels, generate excitation and spectral parameters from the trained HMMs, then produce synthesized speech by passing the generated excitation through the synthesis filter
Base techniques
• Hidden semi-Markov model (HSMM)
  • HMM with explicit state duration probability distributions
  • Estimates both state output and duration probability distributions
• STRAIGHT
  • A high-quality speech vocoding method
  • Spectrum, F0, and aperiodicity measures
• Parameter generation considering global variance (GV)
  • GV features calculated only from speech regions, excluding silence and pauses
  • Context-dependent GV models
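As a minimal illustration of the GV feature mentioned above (a simplified sketch, not the NIT implementation; `global_variance` and the toy data are hypothetical), the GV of an utterance is the per-dimension variance of the parameter trajectory, computed only over speech frames:

```python
# Sketch: compute a global variance (GV) feature vector from frame-level
# parameters, excluding silence/pause frames as the slide describes.
# (Hypothetical helper; real systems compute this per utterance from
# mel-cepstral or similar trajectories.)

def global_variance(frames, is_silence):
    """frames: list of feature vectors; is_silence: parallel list of bools."""
    speech = [f for f, sil in zip(frames, is_silence) if not sil]
    dim = len(speech[0])
    n = len(speech)
    mean = [sum(f[d] for f in speech) / n for d in range(dim)]
    return [sum((f[d] - mean[d]) ** 2 for f in speech) / n for d in range(dim)]

# toy 2-dimensional "spectral" trajectory with silence at both ends
frames = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
silence = [True, False, False, False, True]
gv = global_variance(frames, silence)
```

Excluding the silence frames matters: the zero-valued frames at the utterance edges would otherwise inflate the variance estimate.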
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
EM algorithm
• Maximum likelihood (ML) criterion
  $\hat{\lambda} = \arg\max_{\lambda} P(o \mid \lambda)$
• Expectation-maximization (EM) algorithm
  • E-step: compute the posterior $P(q \mid o, \lambda)$
  • M-step: $\lambda \leftarrow \arg\max_{\lambda'} \sum_{q} P(q \mid o, \lambda) \log P(o, q \mid \lambda')$
  ($\lambda$: model parameters, $o$: training data, $q$: HMM state sequence)
• Suffers from the local maxima problem
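The E/M steps above can be sketched on a toy model (a 2-component 1-D Gaussian mixture standing in for HMM training; `em_gmm` and the data are hypothetical, not the NIT system):

```python
import math

def em_gmm(data, iters=50):
    """Plain EM for a 2-component 1-D Gaussian mixture (toy illustration).

    E-step: posterior component responsibilities.
    M-step: re-estimate parameters from the responsibilities.
    """
    # crude initialisation; EM is sensitive to this (the local maxima problem)
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def gauss(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: posterior probability of each component for each sample
        resp = []
        for x in data:
            p = [w[k] * gauss(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: maximise the expected complete-data log likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
            w[k] = nk / len(data)
    return mu, var, w

# two well-separated clusters around 0.0 and 5.0
data = [0.1, -0.2, 0.0, 0.2, 5.0, 4.8, 5.2, 5.1]
mu, var, w = em_gmm(data)
```

With a poor initialisation the same loop can converge to a much worse local maximum, which is the motivation for the DAEM algorithm on the next slides.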
DAEM algorithm
• Tempered posterior probability
  $P_{\beta}(q \mid o, \lambda) = \dfrac{P(o, q \mid \lambda)^{\beta}}{\sum_{q'} P(o, q' \mid \lambda)^{\beta}}$
  ($\beta$: temperature parameter, $0 \le \beta \le 1$)
• Model update process
  • E-step: compute the tempered posterior $P_{\beta}(q \mid o, \lambda)$
  • M-step: $\lambda \leftarrow \arg\max_{\lambda'} \sum_{q} P_{\beta}(q \mid o, \lambda) \log P(o, q \mid \lambda')$
  • Gradually increase the temperature parameter $\beta$ from near 0 to 1
Optimization of state sequence
• Likelihood function in the DAEM algorithm: product of state output and state transition probabilities over time
• Early in annealing (small $\beta$), all state sequences have uniform posterior probability
• As $\beta$ increases, the posterior over state sequences changes from uniform to sharp
• This gradual sharpening allows reliable acoustic models to be estimated
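The annealing behaviour above can be sketched on the same toy mixture (a stand-in for HMM state-sequence estimation; `daem_gmm`, the $\beta$ schedule, and the data are hypothetical). The only change from plain EM is that joint likelihoods are raised to the power $\beta$ before normalisation, and $\beta$ is increased step by step:

```python
import math

def gauss(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def loglik(data, mu, var, w):
    """Log likelihood of a 2-component 1-D Gaussian mixture."""
    return sum(math.log(sum(w[k] * gauss(x, mu[k], var[k]) for k in range(2)))
               for x in data)

def daem_gmm(data, betas=(0.1, 0.3, 0.6, 1.0), iters_per_beta=5):
    """DAEM sketch: tempered responsibilities with an increasing beta.

    Small beta makes the posterior nearly uniform over assignments;
    beta = 1 recovers ordinary EM.
    """
    mu = [0.0, 0.5]          # deliberately poor initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    for beta in betas:       # gradually sharpen towards the true posterior
        for _ in range(iters_per_beta):
            # E-step: responsibilities proportional to likelihood ** beta
            resp = []
            for x in data:
                p = [(w[k] * gauss(x, mu[k], var[k])) ** beta for k in range(2)]
                s = sum(p)
                resp.append([pk / s for pk in p])
            # M-step: identical to plain EM
            for k in range(2):
                nk = sum(r[k] for r in resp)
                mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
                var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                                 for r, x in zip(resp, data)) / nk, 1e-6)
                w[k] = nk / len(data)
    return mu, var, w

data = [0.1, -0.2, 0.0, 0.2, 5.0, 4.8, 5.2, 5.1]
mu, var, w = daem_gmm(data)
```

The design point is that early, near-uniform posteriors keep the model from committing to a bad assignment before the parameters are trustworthy.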
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Problem of context clustering
• Context-dependent models
  • Appropriate model structures are required
• Decision-tree-based context clustering (questions such as "Vowel?", "/a/?", "Silence?")
  • Assumes that state occupancies do not change during clustering
  • But state occupancies depend on the model structure
  • State sequences and model structures should therefore be optimized simultaneously
Step-wise model selection
• Gradually change the size of the decision trees
  • Performs joint optimization of model structures and state sequences
• Minimum description length (MDL) criterion
  $\mathrm{DL} = -\log P(o \mid \hat{\lambda}) + \alpha D M \log W$
  ($\alpha$: tuning parameter, $W$: amount of training data assigned to the root node, $M$: number of nodes, $D$: dimension of the feature vector)
Model training process
1. Estimate monophone models (DAEM)
   • Number of temperature parameter updates: 10
   • Number of EM steps at each temperature: 5
2. Select decision trees by the MDL criterion using the tuning parameter
3. Estimate context-dependent models (EM)
   • Number of EM steps: 5
4. Decrease the tuning parameter (4, then 2, then 1) and repeat from step 2
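The schedule above can be laid out as a skeleton loop (the step functions are placeholders recorded in a log, not the actual training code, and the linear $\beta$ schedule is an assumption; the slide only gives the counts):

```python
def train_system():
    """Skeleton of the training schedule: returns a log of the steps taken."""
    log = []

    # Step 1: monophone DAEM training
    # 10 temperature updates, 5 EM steps at each temperature
    for step in range(1, 11):
        beta = step / 10.0            # assumed linear schedule towards 1.0
        for _ in range(5):
            log.append(("daem_em_step", beta))

    # Steps 2-4: step-wise model selection with a shrinking tuning parameter
    for alpha in (4, 2, 1):
        log.append(("mdl_tree_selection", alpha))   # rebuild decision trees
        for _ in range(5):                          # 5 EM steps per stage
            log.append(("em_step", alpha))
    return log

schedule = train_system()
```

Re-selecting the trees after each re-estimation pass is what makes the optimization of model structures and state sequences joint rather than one-shot.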
Outline • HMM-based speech synthesis system • Deterministic annealing EM (DAEM) algorithm • Step-wise model selection • Experiments • Conclusion & future work
Likelihood & model structure
• Average log likelihood of the monophone models and number of leaf nodes were compared (charts omitted)
• Phone set: Unilex (58 phonemes)
• Number of leaf nodes (full-context): 6,175,466
Experimental results
• Compared with the benchmark HMM-based system
  • The NIT system achieved comparable performance
  • High intelligibility
• Compared with the benchmark unit-selection system
  • Worse in speaker similarity
  • Better in intelligibility
Speech samples
• Generated highly intelligible speech
• But voiced/unvoiced errors remain
• Feature extraction and excitation generation need improvement
Conclusion
• NIT HMM-based speech synthesis system
  • DAEM algorithm: overcomes the local maxima problem
  • Step-wise model selection: jointly optimizes state sequences and model structures
  • Generates highly intelligible speech
• Future work
  • Improve feature extraction and excitation generation
  • Investigate the schedules for the temperature parameter and for step-wise model selection