Explore Linear Dynamic Model (LDM) in Automatic Speech Recognition (ASR) research, focusing on Kalman Filtering and improving ASR accuracy using LDM's noise robustness and frame correlation capabilities.
Linear Dynamic Model (LDM) for Automatic Speech Recognition
PhD Candidate: Tao Ma
Advised by: Dr. Joseph Picone
Institute for Signal and Information Processing (ISIP)
Mississippi State University
An Example of the Kalman Filter (another name for the LDM)
• In control systems engineering, the Kalman filter successfully models a system from noisy observations
• Filtering: position at the present time (noise effects removed)
• Predicting: position at a future time
• Smoothing: position at a time in the past
• A Kalman filter models the evolution of the observed position
Outline
• Why Linear Dynamic Model (LDM)?
• Linear Dynamic Model
• Pilot experiment: LDM phone classification on Aurora 4
• Hybrid HMM/LDM decoder architecture for LVCSR
• Future work
HMM & Speech Recognition System Hidden Markov Models
Is HMM a Perfect Model for ASR?
• Progress on improving the accuracy of HMM-based systems has slowed in the past decade
• Theoretical drawbacks of HMM:
• False assumption that frames are independent and stationary
• Spatial correlation is ignored (diagonal covariance matrix)
• Limited discrete state space
(Plot: recognition accuracy over time for clean vs. noisy data)
Motivation for Linear Dynamic Model (LDM) Research
• A model that better reflects the characteristics of speech signals should ultimately lead to significant ASR performance improvements
• LDM incorporates frame-to-frame correlation in the speech signal, which has the potential to increase recognition accuracy
• The "filtering" characteristic of LDM has the potential to improve the noise robustness of speech recognition
• Rapidly growing computational capacity (thanks to Intel) makes it realistic to build a two-way HMM/LDM hybrid speech recognition engine
State Space Model
• The Linear Dynamic Model (LDM) is derived from the state space model
• Equations of the state space model:
•   y_t = h(x_t) + ε_t
•   x_t = f(x_{t-1}, ..., x_1) + η_t
• y: observation feature vector
• x: corresponding internal state vector
• h(): relationship function between y and x at the current time
• f(): relationship function between the current state and all previous states
• ε (epsilon): observation noise component
• η (eta): state noise component
Linear Dynamic Model
• Equations of the Linear Dynamic Model (LDM):
•   y_t = H x_t + ε_t
•   x_t = F x_{t-1} + η_t
• The current state is determined only by the previous state
• H and F are linear transform matrices
• ε (epsilon) and η (eta) are the driving (noise) components
• y: observation feature vector
• x: corresponding internal state vector
• H: linear transform matrix between y and x
• F: linear transform matrix between the current state and the previous state
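To make the two equations concrete, here is a minimal NumPy sketch (not from the original slides) that samples a synthetic observation sequence from an LDM; the dimensions, the matrices F and H, and the noise covariances are arbitrary illustrative choices, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4-dim hidden state, 13-dim observations (e.g. MFCCs), 50 frames
state_dim, obs_dim, T = 4, 13, 50

F = 0.9 * np.eye(state_dim)                    # state evolution matrix
H = rng.standard_normal((obs_dim, state_dim))  # observation matrix
Q = 0.1 * np.eye(state_dim)                    # covariance of the state noise eta
R = 0.5 * np.eye(obs_dim)                      # covariance of the observation noise epsilon

x = np.zeros(state_dim)
frames = []
for t in range(T):
    x = F @ x + rng.multivariate_normal(np.zeros(state_dim), Q)  # x_t = F x_{t-1} + eta_t
    y = H @ x + rng.multivariate_normal(np.zeros(obs_dim), R)    # y_t = H x_t + epsilon_t
    frames.append(y)

Y = np.stack(frames)  # (T, obs_dim) synthetic feature sequence
print(Y.shape)
```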
Kalman Filtering for State Inference (E-Step of EM Training)
(Diagram: for a speech sound, the human speech production system generates noisy observations; Kalman filtering estimates the underlying state)
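A sketch of the standard Kalman filter forward recursions that perform this state inference for an LDM (assuming NumPy; the function name and interface are illustrative, not the ISIP implementation). The per-frame innovation likelihoods summed here can also serve as a segment score later.

```python
import numpy as np

def kalman_filter(Y, F, H, Q, R, x0, P0):
    """Forward pass: filtered state means/covariances plus the total log-likelihood."""
    T, obs_dim = Y.shape
    x, P = x0, P0
    means, covs, loglik = [], [], 0.0
    for t in range(T):
        # Predict the state one frame ahead: x_{t|t-1}, P_{t|t-1}
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Innovation (prediction error) and its covariance
        innov = Y[t] - H @ x_pred
        S = H @ P_pred @ H.T + R
        # Accumulate the Gaussian log-likelihood of this frame
        loglik += -0.5 * (innov @ np.linalg.solve(S, innov)
                          + np.linalg.slogdet(S)[1]
                          + obs_dim * np.log(2.0 * np.pi))
        # Correct the prediction with the Kalman gain
        K = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + K @ innov
        P = (np.eye(len(x)) - K @ H) @ P_pred
        means.append(x)
        covs.append(P)
    return np.stack(means), np.stack(covs), loglik
```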
RTS Smoother for Better Inference
• Rauch-Tung-Striebel (RTS) smoother
• Additional backward pass to minimize inference error
• During EM training, computes the expectations of the state statistics
(Figure: inference with the standard Kalman filter vs. the Kalman filter with RTS smoothing)
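A corresponding sketch of the RTS backward pass over the filtered estimates (again with illustrative names, building on the kalman_filter sketch above); full EM would additionally need the cross-covariances E[x_t x_{t-1}^T], which are omitted here for brevity.

```python
import numpy as np

def rts_smoother(filt_means, filt_covs, F, Q):
    """Rauch-Tung-Striebel backward pass over Kalman-filtered estimates."""
    T = len(filt_means)
    sm_means = filt_means.copy()
    sm_covs = filt_covs.copy()
    for t in range(T - 2, -1, -1):
        # One-step prediction from the filtered estimate at time t
        P_pred = F @ filt_covs[t] @ F.T + Q
        # Smoother gain
        J = filt_covs[t] @ F.T @ np.linalg.inv(P_pred)
        # Fold information from future frames back into frame t
        sm_means[t] = filt_means[t] + J @ (sm_means[t + 1] - F @ filt_means[t])
        sm_covs[t] = filt_covs[t] + J @ (sm_covs[t + 1] - P_pred) @ J.T
    return sm_means, sm_covs
```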
Maximum Likelihood Parameter Estimation (M-Step of EM Training)
• Each phone (aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ...) has its own set of LDM parameters
• The update equations are nothing but matrix multiplication!
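To show what "nothing but matrix multiplication" means, here is a hedged sketch of the closed-form updates for H and F given the smoothed state statistics from the E-step; the full update set (including the noise means and covariances in Digalakis' formulation) is larger than shown, and the function name and argument layout are assumptions.

```python
import numpy as np

def m_step_H_F(Y, Ex, Exx, Exx_prev):
    """Closed-form updates for H and F from E-step sufficient statistics.

    Y        : (T, obs_dim)  observed feature vectors
    Ex       : (T, d)        smoothed state means E[x_t]
    Exx      : (T, d, d)     second moments E[x_t x_t^T]
    Exx_prev : (T-1, d, d)   cross moments E[x_t x_{t-1}^T] for t = 1..T-1
    """
    # H = (sum_t y_t E[x_t]^T) (sum_t E[x_t x_t^T])^{-1}
    H = (Y.T @ Ex) @ np.linalg.inv(Exx.sum(axis=0))
    # F = (sum_t E[x_t x_{t-1}^T]) (sum_t E[x_{t-1} x_{t-1}^T])^{-1}
    F = Exx_prev.sum(axis=0) @ np.linalg.inv(Exx[:-1].sum(axis=0))
    return H, F
```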
LDM for Speech Classification
(Diagram: MFCC features are scored by per-phone models (aa, ch, eh, ...) to produce a hypothesis, shown for both HMM-based and LDM-based recognition)
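A sketch of likelihood-based phone classification with per-phone LDMs, reusing the kalman_filter sketch above; the phone_models dictionary and its layout are assumptions for illustration, not the ISIP interface.

```python
def classify_segment(Y, phone_models):
    """Score one feature segment under every phone's LDM and pick the best phone.

    phone_models maps a phone label (e.g. "aa", "ch", "eh") to its parameter
    tuple (F, H, Q, R, x0, P0); scoring uses the kalman_filter sketch above.
    """
    scores = {}
    for phone, (F, H, Q, R, x0, P0) in phone_models.items():
        _, _, loglik = kalman_filter(Y, F, H, Q, R, x0, P0)
        scores[phone] = loglik
    best = max(scores, key=scores.get)
    return best, scores
```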
Challenges of Applying LDM to ASR
• Segment-based model: frame-to-phoneme alignment is needed before classification
• EM training is sensitive to state initialization
• Each phoneme is modeled by an LDM; EM training finds a set of parameters for that specific LDM
• There is no good mechanism for state initialization yet
• More parameters than HMM (2-3x)
• Currently a monophone model; building a triphone model for LVCSR would need more training data
Pilot Experiment: Phone Classification on Aurora 4
• Aurora 4: Wall Street Journal corpus plus six kinds of noise
• Airport, Babble, Car, Restaurant, Street, and Train
• Frame-to-phone alignments are generated by the ISIP decoder (forced-alignment mode)
• Adding a language model yields 93% accuracy on clean data
• 40 phones, one-vs-all classifiers
Hybrid HMM/LDM Decoder Architecture for LVCSR
(Diagram: hybrid HMM/LDM decoder with a confidence-measurement stage producing the best hypothesis)
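The slide only names a confidence-measurement stage and a best-hypothesis output, so the following is a guess at one common hybrid arrangement, N-best rescoring, rather than the actual ISIP design: the HMM decoder supplies scored, phone-aligned hypotheses and the per-phone LDMs re-score each segment (reusing the kalman_filter sketch from above; all names and the score interpolation are assumptions).

```python
def rescore_nbest(nbest, phone_models, ldm_weight=0.5):
    """Combine HMM scores with LDM segment likelihoods and keep the best hypothesis.

    nbest: list of (hmm_score, segments) pairs, where segments is a list of
           (phone_label, feature_matrix) pairs from the HMM forced alignment.
    """
    best_hyp, best_score = None, float("-inf")
    for hmm_score, segments in nbest:
        # Sum the LDM log-likelihood of each aligned phone segment
        ldm_score = 0.0
        for phone, features in segments:
            F, H, Q, R, x0, P0 = phone_models[phone]
            _, _, loglik = kalman_filter(features, F, H, Q, R, x0, P0)
            ldm_score += loglik
        combined = (1.0 - ldm_weight) * hmm_score + ldm_weight * ldm_score
        if combined > best_score:
            best_hyp, best_score = segments, combined
    return best_hyp, best_score
```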
Status and Future Work
• Development of the HMM/LDM hybrid decoder is still in progress
• The HMM/LDM hybrid decoder is expected to be completed in 2009
• The ISIP HMM/SVM hybrid decoder serves as the reference for the implementation
• Future work:
• Research has demonstrated nonlinear effects in speech signals
• Investigate the possibility of replacing Kalman filtering with nonlinear filtering (such as the Extended Kalman Filter or Unscented Kalman Filter)
Thank you! • Questions?
References
• Digalakis, V., "Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition," Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992.
• Digalakis, V., Rohlicek, J. and Ostendorf, M., "ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993.
• Frankel, J., "Linear Dynamic Models for Automatic Speech Recognition," Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.