AN EXPECTATION MAXIMIZATION APPROACH FOR FORMANT TRACKING USING A PARAMETER-FREE NON-LINEAR PREDICTOR
Issam Bazzi, Alex Acero, and Li Deng
Microsoft Research, One Microsoft Way, Redmond, WA, USA
2003
Outline
• Introduction
• The Model
• EM Training
• Formant Tracking
• Experiment Results
• Conclusion
Ts'ai, Chung-Ming, Speech Lab, NTUST, 2007
Introduction
• Traditional methods use LPC analysis or match stored templates of spectral cross-sections
• In either case, formant tracking is error-prone because too few candidates or templates are available
• This paper uses a predictor codebook of MFCCs to represent formant relationships
• The method also explores the complete formant space, avoiding the premature elimination of candidates that occurs in LPC or template matching
The Model
• o_t = F(x_t) + r_t
• o_t is the observed MFCC vector at frame t
• x_t is the vector of vocal tract resonance (VTR) frequencies and their corresponding bandwidths
• F(x_t) is the MFCC vector predicted from the quantized formant frequencies and bandwidths x_t; the set of F(x) over all quantized x is called the predictor codebook
• r_t is the residual signal
Constructing F(x)
• All-pole model
• Assume there are I formants
• x = (F1, B1, F2, B2, ..., FI, BI)
• Use the z-transform to obtain the transfer function H(z)
• Finally, each quantized VTR vector x is transformed into an MFCC vector F(x)
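The construction of F(x) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard resonator form in which formant i contributes a conjugate pole pair z_i = exp(-pi*B_i/fs + j*2*pi*F_i/fs), and the sampling rate, FFT size, and number of mel filters are hypothetical choices that the slides do not specify. The gain of 1 and the 12 cepstral coefficients without C0 follow the experiment settings.

```python
import numpy as np

def allpole_magnitude(x, fs=8000.0, n_fft=256, gain=1.0):
    """|H(e^{jw})| at n_fft//2 + 1 frequencies for x = (F1, B1, ..., FI, BI) in Hz."""
    x = np.asarray(x, dtype=float)
    freqs, bws = x[0::2], x[1::2]
    # Assumed resonator form: one conjugate pole pair per formant.
    poles = np.exp(-np.pi * bws / fs + 2j * np.pi * freqs / fs)
    w = np.linspace(0.0, np.pi, n_fft // 2 + 1)
    zinv = np.exp(-1j * w)
    denom = np.ones_like(zinv)
    for p in poles:
        denom *= (1 - p * zinv) * (1 - np.conj(p) * zinv)
    return gain / np.abs(denom)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    bin_freqs = np.linspace(0.0, fs / 2, n_fft // 2 + 1)
    fb = np.zeros((n_filters, bin_freqs.size))
    for m in range(n_filters):
        lo, mid, hi = edges_hz[m:m + 3]
        up = (bin_freqs - lo) / (mid - lo)
        down = (hi - bin_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def predict_mfcc(x, fs=8000.0, n_fft=256, n_filters=24, n_ceps=12):
    """Map one VTR vector x to a predicted MFCC vector F(x), excluding C0."""
    power = allpole_magnitude(x, fs, n_fft) ** 2
    log_mel = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-12)
    # DCT-II of the log mel energies; k starts at 1 so that C0 is dropped.
    n = np.arange(n_filters)
    k = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct @ log_mel

# Example: one codebook entry for three formants (values are illustrative).
mfcc = predict_mfcc([500, 80, 1500, 120, 2500, 160])
```

Because F(x) has no trainable parameters, the whole codebook can be computed once, ahead of training, by calling a function like `predict_mfcc` on every quantized x.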
EM Training (1/2)
• A single Gaussian models the residual r_t
• The utterance has T frames; θ denotes the parameters (mean and covariance) of the Gaussian
• Formant values x are assumed to be uniformly distributed over C quantized values
EM Training (2/2)
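The update equations for this slide are not reproduced in the text, so the following is a sketch of textbook EM for the model as stated (o_t = F(x_t) + r_t, a single Gaussian residual, a uniform prior over C codebook entries), not a transcription of the paper's formulas. It assumes a diagonal covariance for simplicity; the function names and the variance floor are hypothetical.

```python
import numpy as np

def e_step(obs, codebook, mean, var):
    """Posterior gamma[t, c] = p(x_t = c | o_t); the uniform prior cancels out."""
    resid = obs[:, None, :] - codebook[None, :, :] - mean      # shape (T, C, D)
    log_lik = -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var), axis=2)
    log_lik -= log_lik.max(axis=1, keepdims=True)               # numerical stability
    gamma = np.exp(log_lik)
    return gamma / gamma.sum(axis=1, keepdims=True)

def m_step(obs, codebook, gamma):
    """Re-estimate the residual mean and (diagonal) variance from the posteriors."""
    resid = obs[:, None, :] - codebook[None, :, :]              # shape (T, C, D)
    n_frames = obs.shape[0]
    mean = np.einsum("tc,tcd->d", gamma, resid) / n_frames
    var = np.einsum("tc,tcd->d", gamma, (resid - mean) ** 2) / n_frames
    return mean, np.maximum(var, 1e-6)                          # floor is an assumption

def fit_residual(obs, codebook, n_iter=20):
    """Run EM starting from a zero-mean, unit-variance residual."""
    dim = obs.shape[1]
    mean, var = np.zeros(dim), np.ones(dim)
    for _ in range(n_iter):
        gamma = e_step(obs, codebook, mean, var)
        mean, var = m_step(obs, codebook, gamma)
    return mean, var
```

Here `obs` is a (T, D) array of observed MFCC frames and `codebook` is a (C, D) array holding F(x) for each quantized x. The dense (T, C, D) intermediate is only practical for small codebooks; a codebook of the size used in the experiments would need to be processed in chunks.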
Formant Tracking (1/2)
• Frame-by-frame tracking
• Formants in each frame are estimated independently
• Maximum a posteriori (MAP) estimate
• Minimum mean squared error (MMSE) estimate
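The two frame-by-frame estimators differ only in how they use the per-frame posterior over codebook entries. The sketch below assumes the posterior `gamma[t, c]` has already been computed and that `vtr_values[c]` holds the quantized VTR vector for entry c; both names are hypothetical.

```python
import numpy as np

def map_estimate(gamma, vtr_values):
    """MAP: pick the single most probable codebook entry in each frame."""
    return vtr_values[np.argmax(gamma, axis=1)]

def mmse_estimate(gamma, vtr_values):
    """MMSE: posterior-weighted average of all codebook entries in each frame."""
    return gamma @ vtr_values

# Two frames, three codebook entries, one formant (frequency, bandwidth) in Hz.
gamma = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
vtr_values = np.array([[500.0, 80.0],
                       [600.0, 90.0],
                       [700.0, 100.0]])
map_track = map_estimate(gamma, vtr_values)
mmse_track = mmse_estimate(gamma, vtr_values)
```

The MAP track always lies on the quantization grid, whereas the MMSE track can fall between grid points because it averages over entries.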
Formant Tracking (2/2)
• Tracking with continuity constraints
• First-order state model: x_t = x_{t-1} + w_t
• w_t is modeled as a zero-mean Gaussian with diagonal covariance Σw
• The MAP estimate can be computed with a Viterbi search
• The MMSE estimate is much more complex; the paper obtains it with an approximate method, which is not described in detail here
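A Viterbi search for the MAP track under the first-order continuity model can be sketched as below. This is a generic Viterbi decoder under stated assumptions, not the paper's code: `log_obs[t, c]` is taken to be log p(o_t | x_t = c), the transition score is the log of a Gaussian with diagonal variance `w_var` evaluated at the difference between two codebook entries, and all names are hypothetical.

```python
import numpy as np

def viterbi_track(log_obs, vtr_values, w_var):
    """Return the most likely sequence of codebook indices, one per frame."""
    n_frames, n_states = log_obs.shape
    # log_trans[i, j]: score of moving from entry i to entry j, using w_t = x_j - x_i.
    diff = vtr_values[None, :, :] - vtr_values[:, None, :]
    log_trans = -0.5 * np.sum(diff ** 2 / w_var + np.log(2 * np.pi * w_var), axis=2)

    delta = log_obs[0].copy()                    # best score ending in each state
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans      # rows: previous state, cols: current
        back[t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + log_obs[t]

    # Trace the best path backwards from the final frame.
    path = np.empty(n_frames, dtype=int)
    path[-1] = np.argmax(delta)
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Frame 1 weakly prefers a distant entry; the continuity term overrides it.
log_obs = np.array([[0.0, -5.0, -5.0],
                    [-1.0, -5.0, 0.0],
                    [0.0, -5.0, -5.0]])
vtr_values = np.array([[500.0], [600.0], [2000.0]])
smooth_path = viterbi_track(log_obs, vtr_values, w_var=np.array([1e4]))
```

The dense C x C transition table makes this sketch suitable only for small codebooks; with hundreds of thousands of entries, some form of pruning or restricted neighbourhood would be needed.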
Experiment Settings
• Three formants are tracked
• Frequencies are first mapped onto the mel scale and then uniformly quantized
• Bandwidths are simply uniformly quantized
• The constraint F1 < F2 < F3 is enforced, giving 767,500 codebook entries in total
• Gain = 1
• MFCCs are 12-dimensional, excluding C0
• 20 utterances from one male speaker are used for EM training
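The quantization scheme described above can be sketched as follows. The frequency ranges, bandwidth ranges, and numbers of levels here are hypothetical, since the slides do not give them, so this example does not reproduce the 767,500-entry codebook; it only illustrates mel-uniform quantization and the F1 < F2 < F3 ordering constraint.

```python
import itertools
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def quantize_on_mel(f_lo, f_hi, n_levels):
    """Return n_levels frequencies in Hz that are evenly spaced on the mel scale."""
    return mel_to_hz(np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), n_levels))

def build_vtr_grid(freq_levels, bw_levels):
    """Enumerate (F1, B1, F2, B2, F3, B3) rows, keeping only F1 < F2 < F3."""
    entries = []
    for f in itertools.product(*freq_levels):
        if not (f[0] < f[1] < f[2]):
            continue                              # drop out-of-order formants
        for b in itertools.product(*bw_levels):
            entries.append([f[0], b[0], f[1], b[1], f[2], b[2]])
    return np.array(entries)

# Hypothetical ranges (Hz) and level counts, chosen only for illustration.
freq_levels = [quantize_on_mel(200, 900, 4),
               quantize_on_mel(600, 2800, 4),
               quantize_on_mel(1400, 3800, 4)]
bw_levels = [np.linspace(40, 300, 2)] * 3         # bandwidths: uniform in Hz
grid = build_vtr_grid(freq_levels, bw_levels)
```

Each row of `grid` is one quantized x; mapping every row to its predicted MFCC vector would give the predictor codebook.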
Experiment Results: utterance “they were what”
Experiment Results: with bandwidth
Experiment Results: residual
Conclusion
• The method is fully unsupervised and requires no labeling
• Works well in unvoiced frames
• No gross errors
• May be applicable to speech recognition systems