300 likes | 468 Views
Implementation of a speech Analysis-Synthesis Toolbox using Harmonic plus Noise Model. Didier Cadic 1 , engineering student supervised by Olivier Cappé 1 , Maurice Charbit 1 , Gérard Chollet 1 , Eric Moulines 1 (presented here by Guido Aversano 1,2 ) 2 IIASS, Vietri sul Mare (SA), Italy.
E N D
Implementation of a speech Analysis-Synthesis Toolbox using Harmonic plus Noise Model Didier Cadic1, engineering studentsupervised by Olivier Cappé1, Maurice Charbit1, Gérard Chollet1, Eric Moulines1 (presented here by Guido Aversano1,2) 2IIASS, Vietri sul Mare (SA), Italy 1Département TSI, ENST, Paris, France
Plan of the presentation • Text-to-speech: classic methods • HNM model • Analysis • Synthesis • Analysis-Synthesis examples • Conclusions
Text-To-Speech by concatenation Examples realized on the AT&T web site: English, male English, female (vocal server example) English, female (another vocal server example) German, male French, female
Text-To-Speech by concatenation 2 major challenges : • smooth connection between acoustic units • flexible prosody
Analysis : TD-PSOLA method • Pitch estimation • Pitch-synchronous windowing Synthesis : • Rearrangement of frames
TD-PSOLA method Some very good-quality results: • Time-scaling Singing, original Singing, modified • Pitch-shifting Cello, original Cello, modified
Artifacts appearing in non-voiced sounds: TD-PSOLA method "ss", original "ss", slowed down (classic method) "ss", slowed down (improved) "rain", original "rain", 0.5 rate
Phase Vocoder method Intuitive description: Compression/stretchingof (narrow-band) spectrogram’s time-frequency scales… time-scaling pitch-shifting
Main problem : Phase Vocoder method • phase coherence is lost in the synthesized signal Examples : "rain", male voice Slow-motion by Vocoder (PSOLA : ) "The quick fox …", female voice Slow-motion by Vocoder
We need a parametric model • TD-PSOLA and Vocoder allow basic prosodic modifications. • The problem of unit concatenation for TTS is not solved. • Other kinds of modifications (timbre, denoising, …) should be considered.
Harmonic plus Noise Model (HNM) • Main assumption : • stationary segments of a speech signal can be always seen as the superposition of a periodic and a noisy part
HNM Model Modelling : = + S(t) H(t) B(t) where : H(t) = Ak cos ( 2 k f0 t + k ) and B(t) = white noise passed through an AR filter
HNM analysis of a frame • Pitch estimation Spectral comb method
HNM analysis of a frame • Pitch estimation "aka…aga" • Good results are obtained • In some cases the method erroneously returns f0/2 • Possibility of tracking…
min s(t) – H(t) 2 ak, bk HNM analysis of a frame • Harmonic part: extraction of amplitudes Least squares method H(t) = akcos ( 2k f0 t ) + bksin ( 2k f0 t )
HNM analysis of a frame • Extraction of amplitudes Problem: the noisy part gives anon-null contribution to the spectral power • Gain correction for the harmonics(using an euristic formula g(DV), where DV is the estimated voicing degree)
HNM analysis of a frame • Extraction of amplitudes Residual: R(t) = s(t) - H(t)
HNM analysis of a frame • Extraction of amplitudes Possibility of improving harmonic estimation
1 a0 + a1 z-1 + … + aN z-N HNM analysis of a frame AR filter estimation for the residual: R(t) = Bg F(t) where Bg = gaussian white noise and F(t) = AR filter, F(z) = Linear prediction method
. k(ta) = 2k f0(ta) is known by pitch analysis HNM Synthesis • Interpolation for each harmonic between two succesive frames H(t) = ak(t)cos ( 2k f0(t)t ) + bk(t)sin ( 2k f0(t)t ) = = Ak(t)cos k(t) Ak(ta) and k(ta) are known at analysis instants ta
HNM Synthesis Erroneous pitch (usually f0/2) • harmonic correspondence problem is solved introducing fictitious harmonics
Linear interpolation Unwrapping + cubic interpolation HNM Synthesis Ak cos k(t)
HNM Synthesis Noisy part • Generation of normally distributed random numbers • AR filtering (abrupt changes of coefficients between 2 windows have no incidence…)
original original original original original original "wazi" : a-e-i-o-u : Tuba : "Carottes" : singing : "Lawyer" : synthesized synthesized synthesized synthesized synthesized synthesized HNM Synthesis Results
original original original original "coiffe" : Discours : "aka aga" : Andie : synthesized synthesized synthesized synthesized HNM Synthesis Results original synthesized Dussolier : noisy part
Synthesis with time-stretching Synthesis instants (ts) Analysis instants (ta) The following parameters remain unchanged: • Noisy part parameters • The pitch • The amplitudes Ak of the harmonics
Synthesis with time-stretching Phase adaptation • Simple phase trajectories resampling or • "harmonic" rephasing original a-e-i-o-u : slow-motion with phase "stretching" slow-motion with "harmonic" rephasing
Final results Synthesized with rate : Original 1 0.4 0.5 0.6 0.7 0.8 1.2 1.5 2 "carottes" : "lawyer" : tuba : "wazi" : singing : "a-e-i-o-u" : Dussolier : Discours : Andie : "aka aga": "coiffe" :
Conclusions • Good results, showing method’s potential for different applications including TTS • Future work will include other kinds of modifications (pitch shifting, timbre etc.)