HMM-based speech synthesis: the new generation of artificial voices Thomas Drugman thomas.drugman@umons.ac.be
TCTS Lab « Laboratoire de Théorie des Circuits et de Traitement du Signal » (Circuit Theory and Signal Processing Lab) • 25 people: 3 Profs, 10 PhD students • Image & Video • Numerical Arts • Audio & Speech Drugman Thomas
Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions
Speech Synthesis Text-to-speech system « Hello » GOAL: Read aloud any unknown text typed by the user
Challenges • Naturalness • Intelligibility • Cost-effectiveness • Expressivity
Challenge 3: Cost-effectiveness Industry expects Intelligibility + Naturalness + … • Small footprint: a few MB • Small CPU requirements (embedded market) • Easy extension to other languages • Possibility to create new voices as fast as possible: through an automatic recording/segmentation process, or through efficient voice conversion • Possibility to bootstrap an existing TTS voice into any voice
Challenge 4 (new): Expressivity = “Emotional speech synthesis” (art!) • Being able to render an expressive voice, in terms of prosody and in terms of voice quality • Knowing when to do it (still unsolved) • Today’s holy grail for the industry • Strategic advantage for whoever gets it first • New markets (e-books?)
Methods for Speech Synthesis • Expert-based (rule-based) approach • Corpus-based approach • Diphone concatenation • Unit Selection • Statistical parametric synthesis (“HMM-based synthesis”)
Von Kempelen’s talking machine (1791) [Figure labels: main bellows, small bellows, nostrils, mouth, 'S' pipe, 'S' lever, 'Sh' pipe, 'Sh' lever] Prof. Thierry Dutoit
Homer Dudley’s Voder (Bell Labs, 1936) Prof. Thierry Dutoit
And other developments in articulatory synthesis • Work by: K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter & M. Sondhi… • More recently: O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan & S. Narayanan (MRI)… Prof. Thierry Dutoit
Rule-based synthesis: Intelligibility, Naturalness, Mem/CPU/Voices, Expressivity Prof. Thierry Dutoit
Methods for Speech Synthesis • Expert-based (rule-based) approach • Corpus-based approach • Diphone concatenation • Unit Selection • Statistical parametric synthesis (“HMM-based synthesis”)
Diphone concatenation: Intelligibility, Naturalness ~, Mem/CPU/Voices, Expressivity
Unit selection: Intelligibility, Naturalness, Mem/CPU/Voices ~, Expressivity ~
Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions
Statistical Parametric Speech Synthesis (SPS) • TRAINING: Database → Speech Analysis → Speech Parameters → Statistical Modeling → SPS Synthesizer • SYNTHESIS: « Hello! » (text) → Statistical Generation → Speech Parameters → Speech Processing → Hello! (speech)
HMM-based speech synthesis http://hts.sp.nitech.ac.jp/ Intelligibility, Naturalness ?, Mem/CPU/Voices, Expressivity ?
TRAINING OF THE HMM-BASED SYNTHESIZER
Parameter extraction • Source-filter model: excitation (pulse train or white noise) → Filter → Synthetic speech
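The source-filter diagram above can be sketched in a few lines. This is only an illustration under stated assumptions: a pulse train is used for voiced sounds and white noise for unvoiced ones (the usual convention in this kind of vocoder), the filter is a plain all-pole filter, and the coefficients in `a`, the sampling rate, and the function names are hypothetical choices, not values taken from the slides.

```python
import numpy as np

def make_excitation(n_samples, f0_hz, fs_hz, voiced, rng):
    """Pulse train at the pitch period if voiced, white noise otherwise."""
    if voiced:
        period = int(round(fs_hz / f0_hz))  # pitch period in samples
        excitation = np.zeros(n_samples)
        excitation[::period] = 1.0          # one pulse per pitch period
        return excitation
    return rng.standard_normal(n_samples)

def all_pole_filter(excitation, a):
    """y[n] = e[n] - sum_k a[k] * y[n-k], with a = [a1, ..., ap]."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

rng = np.random.default_rng(0)
fs = 16000
a = [-1.3, 0.8]  # hypothetical stable 2nd-order resonator (pole radius < 1)
pulse_train = make_excitation(800, 100.0, fs, voiced=True, rng=rng)
voiced_speech = all_pole_filter(pulse_train, a)
unvoiced_speech = all_pole_filter(
    make_excitation(800, 100.0, fs, voiced=False, rng=rng), a)
```

In a real system the filter coefficients would change frame by frame, following the spectral parameters extracted from the recordings.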
Labels Labels consist of a description of the phonetic environment • Contextual factors: • Phone identity • Syntactic factors • Stress-related factors • Locational factors, …
Labels Example
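As a rough picture of what such a label carries, the sketch below packs a few contextual factors into one string. The `ll^l-c+r=rr` layout for the five neighbouring phones mirrors the `m^*` patterns used in the question list of this talk; the `/S:` and `/P:` fields, the `sil` padding, the per-phone stress flag, and all names are hypothetical simplifications, not the actual label format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhoneContext:
    ll: str                 # phone two positions to the left
    l: str                  # previous phone
    c: str                  # current phone
    r: str                  # next phone
    rr: str                 # phone two positions to the right
    stressed: bool          # stress-related factor (simplified)
    position_in_word: int   # locational factor (1-based)

    def to_label(self) -> str:
        return (f"{self.ll}^{self.l}-{self.c}+{self.r}={self.rr}"
                f"/S:{int(self.stressed)}/P:{self.position_in_word}")

def contexts_for_word(phones, stressed_index, pad="sil"):
    """Build one context per phone, padding the word edges with `pad`."""
    padded = [pad, pad] + list(phones) + [pad, pad]
    contexts = []
    for i in range(len(phones)):
        ll, l, c, r, rr = padded[i:i + 5]
        contexts.append(
            PhoneContext(ll, l, c, r, rr, i == stressed_index, i + 1))
    return contexts

labels = [ctx.to_label()
          for ctx in contexts_for_word(["hh", "ax", "l", "ow"], stressed_index=3)]
```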
System architecture Contextual factors may affect duration, source and filter differently → Context-oriented clustering using decision trees
System architecture • State duration model: decision tree for state duration • HMM for source and filter: decision trees for filter, decision trees for source
Training decision trees
An exhaustive list of possible questions is first drawn up. Example:
QS "LL-Nasal" {m^*,n^*,en^*,ng^*}
QS "LL-Fricative" {ch^*,dh^*,f^*,hh^*,hv^*,s^*,sh^*,th^*,v^*,z^*,zh^*}
QS "LL-Liquid" {el^*,hh^*,l^*,r^*,w^*,y^*}
QS "LL-Front" {ae^*,b^*,eh^*,em^*,f^*,ih^*,ix^*,iy^*,m^*,p^*,v^*,w^*}
QS "LL-Central" {ah^*,ao^*,axr^*,d^*,dh^*,dx^*,el^*,en^*,er^*,l^*,n^*,r^*,s^*,t^*,th^*,z^*,zh^*}
QS "LL-Back" {aa^*,ax^*,ch^*,g^*,hh^*,jh^*,k^*,ng^*,ow^*,sh^*,uh^*,uw^*,y^*}
QS "LL-Front_Vowel" {ae^*,eh^*,ey^*,ih^*,iy^*}
QS "LL-Central_Vowel" {aa^*,ah^*,ao^*,axr^*,er^*}
QS "LL-Back_Vowel" {ax^*,ow^*,uh^*,uw^*}
QS "LL-Long_Vowel" {ao^*,aw^*,el^*,em^*,en^*,en^*,iy^*,ow^*,uw^*}
QS "LL-Short_Vowel" {aa^*,ah^*,ax^*,ay^*,eh^*,ey^*,ih^*,ix^*,oy^*,uh^*}
QS "LL-Dipthong_Vowel" {aw^*,axr^*,ay^*,el^*,em^*,en^*,er^*,ey^*,oy^*}
QS "LL-Front_Start_Vowel" {aw^*,axr^*,er^*,ey^*}
Total: about 1500 questions
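Each question is a set of wildcard patterns, and a label answers "yes" if it matches any of them. The sketch below shows this with Python's standard `fnmatch` module on two of the questions listed above. The example labels and the function names are made up for illustration, and it is an assumption that shell-style wildcard matching is close enough to the matching rule of the actual tools.

```python
from fnmatch import fnmatchcase

# Two questions copied from the list above, as pattern lists.
QUESTIONS = {
    "LL-Nasal": ["m^*", "n^*", "en^*", "ng^*"],
    "LL-Back_Vowel": ["ax^*", "ow^*", "uh^*", "uw^*"],
}

def answer(question, label):
    """True if the label matches at least one pattern of the question."""
    return any(fnmatchcase(label, pattern) for pattern in QUESTIONS[question])

def split_labels(question, labels):
    """Partition labels into the 'yes' and 'no' branches of a tree node."""
    yes = [lab for lab in labels if answer(question, lab)]
    no = [lab for lab in labels if not answer(question, lab)]
    return yes, no

# Hypothetical labels in an ll^l-c+r=rr layout.
example_labels = ["m^ae-t+ax=l", "ow^k-ey+sil=sil", "en^d-ih+s=t"]
yes_branch, no_branch = split_labels("LL-Nasal", example_labels)
```

The "LL" prefix refers to the phone two positions to the left, which is why every pattern only constrains the part of the label before `^`.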
Training decision trees Decision trees are trained using a Maximum Likelihood criterion Example:
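One way to read the Maximum Likelihood criterion: at each node, pick the question whose yes/no split most increases the log-likelihood of the data when each branch gets its own Gaussian. The sketch below does this for a scalar parameter. The data, the two questions, the variance floor, and all function names are hypothetical, and a real system would also apply a stopping rule that is not shown here.

```python
import math

def gaussian_loglik(values, var_floor=1e-6):
    """Log-likelihood of values under their own ML-fitted Gaussian."""
    n = len(values)
    mean = sum(values) / n
    sq_sum = sum((v - mean) ** 2 for v in values)
    var = max(sq_sum / n, var_floor)
    return -0.5 * n * math.log(2 * math.pi * var) - 0.5 * sq_sum / var

def split_gain(samples, phone_set):
    """Likelihood gain from splitting on 'is the LL phone in phone_set?'."""
    yes = [v for label, v in samples if label.split("^")[0] in phone_set]
    no = [v for label, v in samples if label.split("^")[0] not in phone_set]
    if not yes or not no:
        return float("-inf")  # the question does not split this node
    parent = [v for _, v in samples]
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(parent)

# Hypothetical (label, scalar parameter) pairs for one tree node.
samples = [("m^ae-t+ax=l", 5.0), ("n^ae-t+ax=l", 5.2), ("m^eh-t+ax=l", 4.9),
           ("ae^k-t+ax=l", 1.0), ("eh^k-t+ax=l", 1.1), ("ae^p-t+ax=l", 0.9),
           ("s^k-t+ax=l", 1.05)]
questions = {
    "LL-Nasal": {"m", "n", "en", "ng"},
    "LL-Front_Vowel": {"ae", "eh", "ey", "ih", "iy"},
}
gains = {name: split_gain(samples, phones) for name, phones in questions.items()}
best_question = max(gains, key=gains.get)
```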
Emission likelihood and training Finally, each leaf is modeled by a Gaussian Mixture Model (GMM) Training is guided by the Viterbi and Baum-Welch re-estimation algorithms
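For reference, the emission log-likelihood of one observation under a leaf's GMM can be computed as below. This is a minimal sketch assuming diagonal covariances; the function name and the toy parameters are illustrative, and the re-estimation step itself (Baum-Welch) is not shown.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x) for a diagonal-covariance GMM.

    weights: shape (K,), means and variances: shape (K, D), x: shape (D,).
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    log_components = (np.log(weights)
                      - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                      - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    # Log-sum-exp over components, for numerical stability.
    top = log_components.max()
    return top + np.log(np.exp(log_components - top).sum())

# Toy leaf: two components in a 2-dimensional parameter space.
leaf_loglik = gmm_loglik([0.1, -0.2],
                         weights=[0.6, 0.4],
                         means=[[0.0, 0.0], [1.0, 1.0]],
                         variances=[[1.0, 1.0], [0.5, 0.5]])
```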
SYNTHESIS BY THE HMM-BASED SYNTHESIZER
Parameter generation Given the sequence of labels, durations are determined by maximizing the state-sequence likelihood → A trajectory through the context-dependent HMM states is then known
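If each state duration is modeled by a single Gaussian (an assumption here, though a common choice in HMM-based synthesis), the most likely duration of a state is simply its mean, and a speaking-rate factor `rho` can shift every duration in proportion to its variance. The sketch below illustrates this; the numbers and function names are made up.

```python
def state_durations(means, variances, rho=0.0):
    """Most likely duration per state, in frames (at least 1 frame each)."""
    return [max(1, round(m + rho * v)) for m, v in zip(means, variances)]

def rho_for_total(means, variances, total_frames):
    """Rate factor that makes the unrounded durations sum to total_frames."""
    return (total_frames - sum(means)) / sum(variances)

def state_trajectory(durations):
    """Expand durations into a frame-by-frame sequence of state indices."""
    return [state for state, d in enumerate(durations) for _ in range(d)]

# Hypothetical duration statistics for three states.
dur_means = [3.2, 5.7, 4.1]
dur_vars = [1.0, 2.0, 1.0]
default_durations = state_durations(dur_means, dur_vars)
slower = state_durations(dur_means, dur_vars,
                         rho=rho_for_total(dur_means, dur_vars, 17))
trajectory = state_trajectory(default_durations)
```

States with a larger duration variance absorb more of the stretching, which is the intended effect of weighting by variance.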
Parameter generation Using this trajectory, source and filter parameters are generated by maximizing the output probability Taking dynamic features into account makes the parameter evolution more realistic and smooth
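The role of the dynamic features can be seen in a small sketch of maximum-likelihood parameter generation for one scalar parameter. Assumptions, none of which come from the slides: one Gaussian per frame, a single delta feature defined as 0.5 * (c[t+1] - c[t-1]) with clamped edges, and diagonal covariances. Under these assumptions the most likely static trajectory `c` solves the linear system (Wᵀ P W) c = Wᵀ P μ, where W maps statics to [static, delta] pairs and P holds the precisions.

```python
import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Most likely static trajectory given static and delta Gaussians."""
    mu_static = np.asarray(mu_static, dtype=float)
    n_frames = len(mu_static)
    # W stacks, for each frame t, a static row and a delta row.
    W = np.zeros((2 * n_frames, n_frames))
    for t in range(n_frames):
        W[2 * t, t] = 1.0
        prev_t, next_t = max(t - 1, 0), min(t + 1, n_frames - 1)
        W[2 * t + 1, prev_t] -= 0.5
        W[2 * t + 1, next_t] += 0.5
    mu = np.empty(2 * n_frames)
    mu[0::2], mu[1::2] = mu_static, mu_delta
    precision = np.empty(2 * n_frames)
    precision[0::2] = 1.0 / np.asarray(var_static, dtype=float)
    precision[1::2] = 1.0 / np.asarray(var_delta, dtype=float)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Two states with static means 0 and 1; delta means are zero.
step_means = np.array([0.0] * 5 + [1.0] * 5)
ones = np.ones(10)
smooth = mlpg(step_means, ones, np.zeros(10), 0.1 * ones)
```

Without the delta constraint the output would be the step-shaped sequence of state means; with it, the transition between the two states is spread over several frames.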
Speech synthesizers comparison [Figure: quality versus footprint for Unit Selection, HTS and Diphone Concatenation; footprint axis labels: 200 MB, 5 MB, <1 MB]
Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions
Problem positioning • Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders • Source-filter approach: enhance the excitation signal • Diagram: excitation (pulse train or white noise) → Filter → Synthetic speech
Proposed solution [Diagram: SOURCE → FILTER] T. Drugman, G. Wilfart, T. Dutoit, « A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis », Interspeech 2009
Results Traditional: Proposed:
Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions
Problem of oversmoothing
Compensation of oversmoothing
Global Variance
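As a rough illustration of the idea: the global variance of a parameter is its variance over a whole utterance, and over-smoothed trajectories tend to have too little of it. The sketch below is a simplified stand-in (plain rescaling around the mean toward a target variance), not the likelihood-based formulation used in HMM-based synthesizers. The function names, the toy trajectory, and the target value are assumptions.

```python
import numpy as np

def global_variance(trajectory):
    """Variance of each parameter over all frames of the utterance."""
    return np.var(trajectory, axis=0)

def scale_to_global_variance(trajectory, target_gv, floor=1e-12):
    """Rescale around the utterance mean so the variance matches target_gv."""
    trajectory = np.asarray(trajectory, dtype=float)
    mean = trajectory.mean(axis=0)
    gain = np.sqrt(np.asarray(target_gv, dtype=float)
                   / np.maximum(global_variance(trajectory), floor))
    return mean + gain * (trajectory - mean)

# Toy over-smoothed trajectory: 6 frames, 1 parameter.
smoothed = np.array([[0.40], [0.45], [0.50], [0.55], [0.50], [0.45]])
target = np.array([0.01])  # hypothetical variance measured on natural speech
enhanced = scale_to_global_variance(smoothed, target)
```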
Results
Content • Speech synthesis: history • HMM-based speech synthesis • Parametric modeling of speech • Statistical generation • Conclusions
Speech synthesizers comparison
• Rule-based synthesis: Intelligibility, Naturalness, Mem/CPU/Voices, Expressivity
• Diphone concatenation: Intelligibility, Naturalness ~, Mem/CPU/Voices, Expressivity
• Unit selection: Intelligibility, Naturalness, Mem/CPU/Voices ~, Expressivity ~
• HMM-based speech synthesis: Intelligibility, Naturalness ?, Mem/CPU/Voices, Expressivity ?