HMM-based speech synthesis: the new generation of artificial voices

Thomas Drugman
thomas.drugman@umons.ac.be

TCTS Lab
« Laboratoire de Théorie des Circuits et de Traitement du Signal » (Circuit Theory and Signal Processing Lab)
25 people: 3 Profs, 10 PhD students
TCTS Lab: Image & Video, Numerical Arts, Audio & Speech
Drugman Thomas

Content
• Speech synthesis: history
• HMM-based speech synthesis
• Parametric modeling of speech
• Statistical generation
• Conclusions

Speech Synthesis
Text-to-speech system: « Hello »
GOAL: Produce a spoken reading of an unknown text typed by the user

Challenges
• Naturalness
• Intelligibility
• Cost-effectiveness
• Expressivity

Challenge 3: Cost-effectiveness
Industry expects Intelligibility + Naturalness + …
• Small footprint: a few megabytes
• Small CPU requirements (embedded market)
• Easy extension to other languages
• Possibility to create new voices as fast as possible
  • Through automatic recording/segmentation process
  • Through efficient voice conversion
• Possibility to bootstrap an existing TTS voice into any voice

Challenge 4 (new): Expressivity
= “Emotional speech synthesis” (art!)
• Being able to render an expressive voice
  • In terms of prosody
  • In terms of voice quality
• Knowing when to do it (yet unsolved)
• Today’s holy grail for the industry
• Strategic advantage for whoever gets it first
• New markets (ebooks?)

Methods for Speech Synthesis
• Expert-based (rule-based) approach
• Corpus-based approach
  • Diphone concatenation
  • Unit Selection
  • Statistical parametric synthesis (“HMM-based synthesis”)

Von Kempelen’s talking machine (1791)
[Figure labels: main bellows, small bellows, nostrils, mouth, 'S' pipe, 'S' lever, 'Sh' pipe, 'Sh' lever]
Prof. Thierry Dutoit

Homer Dudley’s Voder (Bell Labs, 1936)

And other developments in articulatory synthesis
• Work by: K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter & M. Sondhi…
• More recently: O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan & S. Narayanan (MRI)…

Rule-based synthesis
Intelligibility, Naturalness, Mem/CPU/Voices, Expressivity


Diphone concatenation
Intelligibility, Naturalness ~, Mem/CPU/Voices, Expressivity

Unit selection
Intelligibility, Naturalness, Mem/CPU/Voices ~, Expressivity ~

Statistical Parametric Speech Synthesis
[Figure: SPS synthesizer block diagram]
TRAINING: Database → Speech Analysis → Speech Parameters → Statistical Modeling
SYNTHESIS: « Hello! » → Statistical Generation → Speech Parameters → Speech Processing → Hello!

HMM-based speech synthesis
http://hts.sp.nitech.ac.jp/
Intelligibility, Naturalness ?, Mem/CPU/Voices, Expressivity ?

TRAINING OF THE HMM-BASED SYNTHESIZER

Parameter extraction

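Parameter extraction
Source-filter model: the excitation (a pulse train for voiced sounds, white noise for unvoiced sounds) is passed through a filter to produce the synthetic speech.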
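The source-filter scheme on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 16 kHz sample rate and the single-coefficient all-pole filter are hypothetical choices, not the vocoder actually used in the system.

```python
import random

def excitation(n_samples, f0, sample_rate=16000, seed=0):
    """Pulse train when f0 > 0 (voiced), white Gaussian noise otherwise."""
    if f0 > 0:
        period = round(sample_rate / f0)  # pitch period in samples
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(source, a):
    """y[n] = x[n] - sum_k a[k] * y[n-1-k], where a holds a_1..a_p."""
    out = []
    for n, x in enumerate(source):
        acc = x
        for k, coeff in enumerate(a):
            if n - 1 - k >= 0:
                acc -= coeff * out[n - 1 - k]
        out.append(acc)
    return out

# Toy filter with one pole at 0.9 (assumption, stands in for the spectral envelope).
voiced = all_pole_filter(excitation(400, f0=100.0), a=[-0.9])
unvoiced = all_pole_filter(excitation(400, f0=0.0), a=[-0.9])
```

In a real system the filter coefficients would come from the extracted spectral parameters and would change from frame to frame.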

Labels

Labels
Labels consist of a description of the phonetic environment.
• Contextual factors:
  • Phone identity
  • Syntactic factors
  • Stress-related factors
  • Locational factors, …
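As a rough illustration of the phone-identity part of a label, the sketch below builds a quinphone context string for each phone. The `ll^l-c+r=rr` layout is an assumption inferred from the `m^*`-style patterns in the question list later in this deck; the function name and the `x` padding symbol are hypothetical, and the syntactic, stress-related and locational fields are left out.

```python
def quinphone_labels(phones, pad="x"):
    """Return one 'll^l-c+r=rr' string per phone (phone-identity context only)."""
    padded = [pad, pad] + list(phones) + [pad, pad]
    labels = []
    for i in range(2, len(padded) - 2):
        ll, l, c, r, rr = padded[i - 2:i + 3]
        labels.append(f"{ll}^{l}-{c}+{r}={rr}")
    return labels

labels = quinphone_labels(["hh", "ax", "l", "ow"])
for label in labels:
    print(label)
```

A full label would append further fields for the other contextual factors after this phonetic part.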

Labels Example

HMM training

System architecture
Contextual factors may affect duration, source and filter differently.
Context-oriented clustering using decision trees

System architecture
• State duration model: decision tree for state duration
• HMM for source and filter: decision trees for the filter, decision trees for the source

Training decision trees
An exhaustive list of possible questions is first drawn up.
Example:
QS "LL-Nasal" {m^*,n^*,en^*,ng^*}
QS "LL-Fricative" {ch^*,dh^*,f^*,hh^*,hv^*,s^*,sh^*,th^*,v^*,z^*,zh^*}
QS "LL-Liquid" {el^*,hh^*,l^*,r^*,w^*,y^*}
QS "LL-Front" {ae^*,b^*,eh^*,em^*,f^*,ih^*,ix^*,iy^*,m^*,p^*,v^*,w^*}
QS "LL-Central" {ah^*,ao^*,axr^*,d^*,dh^*,dx^*,el^*,en^*,er^*,l^*,n^*,r^*,s^*,t^*,th^*,z^*,zh^*}
QS "LL-Back" {aa^*,ax^*,ch^*,g^*,hh^*,jh^*,k^*,ng^*,ow^*,sh^*,uh^*,uw^*,y^*}
QS "LL-Front_Vowel" {ae^*,eh^*,ey^*,ih^*,iy^*}
QS "LL-Central_Vowel" {aa^*,ah^*,ao^*,axr^*,er^*}
QS "LL-Back_Vowel" {ax^*,ow^*,uh^*,uw^*}
QS "LL-Long_Vowel" {ao^*,aw^*,el^*,em^*,en^*,en^*,iy^*,ow^*,uw^*}
QS "LL-Short_Vowel" {aa^*,ah^*,ax^*,ay^*,eh^*,ey^*,ih^*,ix^*,oy^*,uh^*}
QS "LL-Dipthong_Vowel" {aw^*,axr^*,ay^*,el^*,em^*,en^*,er^*,ey^*,oy^*}
QS "LL-Front_Start_Vowel" {aw^*,axr^*,er^*,ey^*}
Total: about 1500 questions
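Each question is a set of wildcard patterns tested against a label, and the answer is yes if any pattern matches. The sketch below shows the idea with Python's `fnmatch` module standing in for the pattern matcher of the actual toolkit (an assumption); the example label strings and the `answer` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Two questions copied from the list above; patterns test the left-left phone.
QUESTIONS = {
    "LL-Nasal": ["m^*", "n^*", "en^*", "ng^*"],
    "LL-Back_Vowel": ["ax^*", "ow^*", "uh^*", "uw^*"],
}

def answer(question, label):
    """True if the label matches at least one pattern of the question."""
    return any(fnmatchcase(label, pattern) for pattern in QUESTIONS[question])

print(answer("LL-Nasal", "m^ax-l+ow=x"))       # left-left phone is 'm'
print(answer("LL-Nasal", "em^ax-l+ow=x"))      # 'em' is not in the nasal set
print(answer("LL-Back_Vowel", "ax^l-ow+x=x"))  # left-left phone is 'ax'
```

Because the match is anchored at the start of the label, `m^*` does not accept a label that begins with `em^`.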

Training decision trees
Decision trees are trained using a Maximum Likelihood criterion.
Example:
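One way to picture the Maximum Likelihood criterion: at each node, every question splits the data into a "yes" and a "no" set, and the question giving the largest gain in log-likelihood is kept. The sketch below assumes scalar observations and a single Gaussian per node; the data, the question names and the helper functions are hypothetical, and no stopping rule is included.

```python
import math

def gaussian_loglik(values, var_floor=1e-4):
    """Log-likelihood of the values under their own maximum-likelihood Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, var_floor)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_question(samples, questions):
    """samples: list of (left_phone, value); questions: name -> set of phones."""
    parent = gaussian_loglik([v for _, v in samples])
    best_name, best_gain = None, 0.0
    for name, phones in questions.items():
        yes = [v for p, v in samples if p in phones]
        no = [v for p, v in samples if p not in phones]
        if not yes or not no:
            continue  # a question that does not split the data is useless
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

samples = [("m", 1.0), ("n", 1.2), ("m", 0.9), ("s", 5.0), ("f", 5.3), ("s", 4.8)]
questions = {"L-Nasal": {"m", "n", "ng"}, "L-is-m": {"m"}}
name, gain = best_question(samples, questions)
print(name, gain)
```

Applying the same selection recursively to the "yes" and "no" sets grows the tree.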

Emission likelihood and training
Finally, each leaf is modeled by a Gaussian Mixture Model (GMM).
Training is guided by the Viterbi and Baum-Welch re-estimation algorithms.
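For reference, the emission likelihood of a leaf modeled by a GMM is a weighted sum of Gaussian densities. The sketch below evaluates it in the log domain for a scalar observation; the one-dimensional setting and the function name are simplifying assumptions, since real systems use multivariate parameter vectors.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of scalar x under a 1-D Gaussian mixture (log-sum-exp)."""
    logs = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for w, mu, var in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(value - top) for value in logs))

single = gmm_loglik(0.0, [1.0], [0.0], [1.0])
mixture = gmm_loglik(0.0, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
print(single, mixture)
```

Baum-Welch re-estimation would update the weights, means and variances from state occupancy statistics; that step is not shown here.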

SYNTHESIS BY THE HMM-BASED SYNTHESIZER

Text analysis

Parameter generation

Parameter generation
Given the sequence of labels, durations are determined by maximizing the state sequence likelihood.
A trajectory through the context-dependent HMM states is then known.
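With Gaussian state-duration models, maximizing the likelihood gives each state its mean duration, optionally shifted in proportion to its variance when a total length is imposed. The sketch below follows that commonly described rule; the function name, the example numbers and the rounding to whole frames are assumptions.

```python
def state_durations(means, variances, total_frames=None):
    """d_k = m_k + rho * var_k, with rho chosen so the durations sum to the target."""
    rho = 0.0  # no target length: each state simply takes its mean duration
    if total_frames is not None:
        rho = (total_frames - sum(means)) / sum(variances)
    # At least one frame per state; rounding may shift the total slightly.
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]

natural = state_durations([3.0, 5.0, 4.0], [1.0, 4.0, 1.0])
slowed = state_durations([3.0, 5.0, 4.0], [1.0, 4.0, 1.0], total_frames=18)
print(natural, slowed)
```

States with a large duration variance absorb most of the stretching, which is why the second state grows the most in the slowed example.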

Parameter generation
Using this trajectory, source and filter parameters are generated by maximizing the output probability.
With dynamic features, the evolution of the parameters is more realistic and smooth.
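The smoothing effect of dynamic features can be illustrated on a one-dimensional toy case. Writing the observations as o = W c, where W stacks the static and delta windows, the most likely static track c solves (Wᵀ P W) c = Wᵀ P μ, with P the diagonal precision matrix. The sketch below assumes a single delta window 0.5·(c[t+1] − c[t−1]) with edge frames replicated, and uses a plain dense solver; production systems typically add delta-delta features and a banded solver.

```python
def build_window_matrix(T):
    """Rows alternate: static window, then delta window, for each frame."""
    rows = []
    for t in range(T):
        static = [0.0] * T
        static[t] = 1.0
        delta = [0.0] * T
        delta[max(t - 1, 0)] -= 0.5      # edge frames are replicated
        delta[min(t + 1, T - 1)] += 0.5
        rows.append(static)
        rows.append(delta)
    return rows

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def mlpg(means, variances):
    """means, variances: per-frame (static, delta) pairs -> static trajectory."""
    T = len(means)
    W = build_window_matrix(T)
    mu = [m for pair in means for m in pair]
    prec = [1.0 / v for pair in variances for v in pair]
    A = [[sum(W[k][i] * prec[k] * W[k][j] for k in range(2 * T))
          for j in range(T)] for i in range(T)]
    b = [sum(W[k][i] * prec[k] * mu[k] for k in range(2 * T)) for i in range(T)]
    return solve(A, b)

# Two states of three frames each: static means jump from 0 to 1, delta means are 0.
means = [(0.0, 0.0)] * 3 + [(1.0, 0.0)] * 3
variances = [(1.0, 1.0)] * 6
track = mlpg(means, variances)
print([round(value, 3) for value in track])
```

Without the delta rows the solution would be the step-shaped sequence of state means; the delta constraint pulls neighbouring frames toward each other and softens the jump between the two states.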

Speech synthesizers comparison

Speech synthesizers comparison
[Figure: quality versus footprint for Unit Selection, HTS and Diphone Concatenation; footprint marks: 200 MB, 5 MB, <1 MB]

Problem positioning
Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders.
Source-filter approach: enhance the excitation signal.
[Figure: pulse train or white noise → filter → synthetic speech]

Proposed solution
[Figure: SOURCE and FILTER blocks]
T. Drugman, G. Wilfart, T. Dutoit, « A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis », Interspeech 2009

Results
[Audio samples: Traditional, Proposed]

Problem of oversmoothing

Compensation of oversmoothing

Global Variance

Results

Speech synthesizers comparison
• Rule-based synthesis: Intelligibility, Naturalness, Mem/CPU/Voices, Expressivity
• Diphone concatenation: Intelligibility, Naturalness ~, Mem/CPU/Voices, Expressivity
• Unit selection: Intelligibility, Naturalness, Mem/CPU/Voices ~, Expressivity ~
• HMM-based speech synthesis: Intelligibility, Naturalness ?, Mem/CPU/Voices, Expressivity ?
