Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone Department of Electrical and Computer Engineering Mississippi State University URL: http://www.ece.msstate.edu/research/isip/publications/conferences/interspeech/2008/mar_hmm/ This material is based upon work supported by the National Science Foundation under Grant No. IIS-0414450. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Abstract Gaussian mixture models are a very successful method for modeling the output distribution of a state in a hidden Markov model (HMM). However, this approach is limited by the assumption that the dynamics of speech features are linear and can be modeled with static features and their derivatives. In this paper, a nonlinear mixture autoregressive model is used to model state output distributions (MAR-HMM). Estimation of model parameters is extended to handle vector features. MAR-HMMs are shown to provide superior performance to comparable Gaussian mixture model-based HMMs (GMM-HMM) with lower complexity on two pilot classification tasks.
Motivation • Popular approaches to speech recognition use: • Gaussian Mixture Models (GMMs) to model state output distributions in hidden Markov model-based acoustic models. • Additional passes of discriminative training to optimize model performance. • Discriminative classifiers such as Support Vector Machines for rescoring or combining hypotheses. • Limitations of current approaches: • GMMs model the marginal pdf and do not explicitly model dynamics. • Adding differential features to GMMs still only allows modeling of a linear dynamical system and does not directly model nonlinear dynamics. • Extraction of features that explicitly model the nonlinear dynamics of speech (May, 2008) has not provided significant improvements that are robust to noise or mismatched training conditions. • Goal: Directly capture dynamic information in our acoustic model.
Nonlinear Dynamic Invariants as Features (May, 2008) • Nonlinear Features: Quantify the amount of nonlinearity in the signal (e.g., Lyapunov exponents, fractal dimension, and correlation entropy). • Pros: Invariant to signal transformations such as noise distortions and channel effects (at least in theory). This leads to the expectation that nonlinear invariants could be noise-robust. • Cons: Contrary to expectations, experiments with continuous speech data show that performance with invariants is not noise-robust (May, 2008). While there is an 11.1% relative increase in recognition performance on the clean evaluation set, a 7.6% relative decrease is noted on the noisy set. • Hypotheses: • (Theoretical) Even when estimated correctly, the amount of nonlinearity is only a very crude measure of dynamic information. Invariant values can be similar even when the signals are quite different. This places a limit on the gain achievable by using nonlinear invariants as features. • (Practical) Invariant feature estimation algorithms are very sensitive to changes in channel conditions – almost never “invariant.”
A Mixture Autoregressive Model (MAR) MAR (Wong and Li, 2000) of order p with m components: x_t = a_{i,0} + a_{i,1} x_{t-1} + … + a_{i,p} x_{t-p} + ε_i, with probability w_i (i = 1, …, m), where ε_i is zero-mean Gaussian noise with variance σ_i², “w.p. w_i” means with probability w_i, a_{i,j} (j > 0) are the AR predictor coefficients, and a_{i,0} is the mean (offset) for component i. The model is thus a weighted sum of Gaussian AR processes, which adds nonlinear modeling capability beyond a single linear AR component.
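As a concrete illustration of the generative form above, here is a minimal sketch of sampling from a scalar MAR process, assuming NumPy; the function name and the parameter values in the usage example are illustrative only, not taken from the paper.

```python
import numpy as np

def sample_mar(T, w, a, sigma, rng=None):
    """Draw T samples from a scalar MAR process.

    w     : (m,) mixture weights summing to 1
    a     : (m, p+1) coefficients; a[i, 0] is the offset, a[i, 1:] the AR coefficients
    sigma : (m,) per-component noise standard deviations
    """
    rng = np.random.default_rng() if rng is None else rng
    m, p = a.shape[0], a.shape[1] - 1
    x = np.zeros(T + p)                       # zero-padded history for the first samples
    for t in range(p, T + p):
        i = rng.choice(m, p=w)                # pick component i with probability w[i]
        past = x[t - p:t][::-1]               # x[t-1], ..., x[t-p]
        mean = a[i, 0] + a[i, 1:] @ past      # conditional mean of component i
        x[t] = mean + rng.normal(0.0, sigma[i])
    return x[p:]

# Example: a 2-component MAR of order 2 with made-up parameters.
x = sample_mar(500,
               w=np.array([0.6, 0.4]),
               a=np.array([[0.0, 0.5, -0.2],
                           [1.0, -0.4, 0.1]]),
               sigma=np.array([0.5, 1.0]))
```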
Interpretations • MAR as a weighted sum of Gaussian autoregressive (AR) processes: • Analysis/Modeling: Conditional distribution modeled as a weighted combination of AR filters with Gaussian white noise inputs (hence, this models data as colored non-Gaussian signals). • Synthesis: A random process in which data sampled at any one point in time is generated from one of the components of an AR mixture process chosen randomly according to its weight, wi. • Comparisons to GMMs: • GMMs model the marginal distribution as a weighted sum of Gaussian random variables. • MAR with AR filter order 0 is equivalent to a GMM. • MAR can capture information in dynamics in addition to the static information captured by a GMM.
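To make the analysis/modeling interpretation concrete, the sketch below evaluates the MAR conditional density p(x_t | past) as a weighted sum of Gaussians whose means depend on the previous p samples (assuming NumPy and SciPy). With AR order 0 the means collapse to constants and the expression is exactly a GMM marginal pdf, matching the equivalence noted above.

```python
import numpy as np
from scipy.stats import norm

def mar_conditional_pdf(x_t, past, w, a, sigma):
    """p(x_t | past) for a MAR.

    past : (p,) previous samples ordered [x_{t-1}, ..., x_{t-p}]
    a    : (m, p+1) coefficients; with p = 0 this reduces to a GMM marginal.
    """
    means = a[:, 0] + a[:, 1:] @ past          # per-component conditional means
    return np.sum(w * norm.pdf(x_t, loc=means, scale=sigma))
```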
Overview of the MAR-HMM Approach • An example MAR-HMM system with 3 states: each state observation is modeled by a 2-component MAR.
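One possible way to lay out the parameters of such a system in code is sketched below; the field names, placeholder values, and left-to-right transition matrix are our own illustration, not the paper's implementation.

```python
import numpy as np

# Illustrative (untrained) parameter layout for a 3-state MAR-HMM whose state
# output densities are 2-component MARs of order p.
n_states, n_mix, p = 3, 2, 2
mar_hmm = {
    "trans": np.array([[0.8, 0.2, 0.0],        # state transition probabilities
                       [0.0, 0.8, 0.2],
                       [0.0, 0.0, 1.0]]),
    "states": [
        {
            "weights": np.full(n_mix, 1.0 / n_mix),   # mixture weights w_i
            "coeffs":  np.zeros((n_mix, p + 1)),      # a_{i,0..p} per component
            "sigmas":  np.ones(n_mix),                # noise std dev per component
        }
        for _ in range(n_states)
    ],
}
```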
MAR Properties • Ability to model nonlinearities: • Each AR component is linear. • Probabilistic mixing of AR processes leads to nonlinearities. • Dynamics of the probability distribution: • In a GMM: • The distribution is invariant to past samples. • Only the marginal distribution is modeled. • In a MAR: • The conditional distribution varies with past samples. • The joint distribution of the random process is modeled. • Nonlinear evolution of the time series is modeled using a probabilistic mixing of linear AR processes.
MAR Parameter Estimation Using EM • E-step: • Conditional probability that sample x_t came from mixture component i: γ_{t,i} = w_i N(x_t; μ_{t,i}, σ_i²) / Σ_k w_k N(x_t; μ_{t,k}, σ_k²), where μ_{t,i} = a_{i,0} + Σ_{j=1..p} a_{i,j} x_{t-j}. • In a GMM-HMM, a_{i,j} = 0 for j > 0, while a_{i,0} corresponds to the mean; this reduces to the familiar EM equations.
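A sketch of this E-step for a scalar observation sequence, assuming NumPy and SciPy; function and variable names are ours, and within full MAR-HMM training these posteriors would additionally be weighted by HMM state occupancies.

```python
import numpy as np
from scipy.stats import norm

def mar_e_step(x, w, a, sigma):
    """Posterior probability that each sample came from each mixture component.

    x : (T,) scalar observations; the first p samples serve only as history.
    Returns gamma with shape (T - p, m)."""
    m, p = a.shape[0], a.shape[1] - 1
    T = len(x)
    # Design matrix of past samples: row for frame t is [1, x[t-1], ..., x[t-p]]
    X = np.column_stack([np.ones(T - p)] +
                        [x[p - j:T - j] for j in range(1, p + 1)])
    means = X @ a.T                                        # (T-p, m) conditional means
    lik = norm.pdf(x[p:, None], loc=means, scale=sigma)    # per-component likelihoods
    gamma = w * lik
    return gamma / gamma.sum(axis=1, keepdims=True)
```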
MAR Parameter Estimation Using EM • M-step parameter updates: • Weights: w_i = (1/T) Σ_t γ_{t,i}. • AR coefficients: a_i is the γ_{t,i}-weighted least-squares regression of x_t on [1, x_{t-1}, …, x_{t-p}]. • Variances: σ_i² = Σ_t γ_{t,i} (x_t − μ_{t,i})² / Σ_t γ_{t,i}. • Again, for a GMM-HMM, a_{i,j} = 0 for j > 0 and a_{i,0} is the component mean, reducing to the familiar EM equations.
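Correspondingly, a sketch of the M-step under the same assumptions: each component's coefficients come from a posterior-weighted least-squares fit, and its variance from the posterior-weighted residuals.

```python
import numpy as np

def mar_m_step(x, gamma, p):
    """Re-estimate weights, AR coefficients, and variances from E-step posteriors."""
    T = len(x)
    X = np.column_stack([np.ones(T - p)] +
                        [x[p - j:T - j] for j in range(1, p + 1)])
    y = x[p:]
    m = gamma.shape[1]
    w = gamma.mean(axis=0)                                  # new mixture weights
    a = np.zeros((m, p + 1))
    sigma = np.zeros(m)
    for i in range(m):
        G = gamma[:, i]
        # Weighted least squares: solve (X' diag(G) X) a_i = X' diag(G) y
        a[i] = np.linalg.solve((X * G[:, None]).T @ X, X.T @ (G * y))
        resid = y - X @ a[i]
        sigma[i] = np.sqrt((G * resid ** 2).sum() / G.sum())
    return w, a, sigma
```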
Vector Extension of MAR • The basic formulation of MAR is only for scalar time series. • How can we extend MAR to handle vector time series (e.g., MFCCs) without fundamental modifications to the EM equations? • Solution: Tie MAR mixture components across the scalar feature dimensions using a single weight per mixture, estimated by combining (multiplying) the likelihoods of all scalar dimensions for that mixture. • Pros: Only a simple modification to the weight update during EM training. • Cons: Assumes the scalar dimensions are decoupled (except for the shared mixture weight parameter). This is similar to the assumption of a diagonal covariance in GMMs – a very common assumption for MFCC features.
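A sketch of the tied-weight combination, assuming NumPy: per-dimension component log-likelihoods are summed across dimensions (i.e., the likelihoods are multiplied) before forming a single posterior per mixture component. The function name and array layout are our own illustration.

```python
import numpy as np

def tied_mixture_posteriors(w, loglik):
    """Combine per-dimension log-likelihoods into one posterior per component.

    w      : (m,) tied mixture weights shared by all feature dimensions
    loglik : (T, D, m) log N(x_{t,d}; mu_{t,i,d}, sigma_{i,d}^2)
    Returns gamma with shape (T, m)."""
    joint = np.log(w) + loglik.sum(axis=1)          # multiply likelihoods over dimensions
    joint -= joint.max(axis=1, keepdims=True)       # numerical stability
    gamma = np.exp(joint)
    return gamma / gamma.sum(axis=1, keepdims=True)
```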
Relationship to Other AR-HMMs • (Poritz, 1980) Linear Predictive HMM: Equivalent to a MAR-HMM with a single mixture component. • (Juang and Rabiner, 1985) Mixture Autoregressive HMM: Somewhat similar to the model investigated here (even identical in name!), but with two major differences: • Component means were assumed to be zero. • Variances were assumed to be equal across components. • (Ephraim and Roberts, 2005) Switching AR-HMM: Restricted to a single-component MAR-HMM. • All of the above were applied to scalar time series with speech samples as observations. • Our vector extension of MAR has been applied to modeling of the MFCC vector feature stream.
(Synthetic) Signal Classification – Experimental Design • Two-class problem • Same marginal distribution • Different dynamics • Signal 1: no dependence on past samples • Signal 2: nonlinear dependence on past samples • Signal parameters for the two classes were chosen so that the marginals match but the dynamics differ.
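A minimal sketch of how such a two-class data set could be generated, assuming NumPy; the means, variances, and the particular nonlinearity below are illustrative stand-ins, since the actual parameter table from the paper is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000

# Signal 1: i.i.d. draws from a two-component Gaussian mixture (no memory).
comp = rng.choice(2, size=T, p=[0.5, 0.5])
sig1 = rng.normal(loc=np.where(comp == 0, -1.0, 1.0), scale=0.5)

# Signal 2: the component at time t is chosen by a nonlinear (sign) function of the
# previous sample, so the marginal stays roughly the same two-lobed mixture while
# the dynamics become strongly dependent on the past.
sig2 = np.zeros(T)
for t in range(1, T):
    mean = -1.0 if sig2[t - 1] > 0 else 1.0   # flip lobes based on the previous sample
    sig2[t] = rng.normal(mean, 0.5)
```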
(Synthetic) Signal Classification – Results • Table 1: Classification (% accuracy) results for synthetic data • Static-only features: GMM can only do as well as a random guess (because both signals have the same marginal); MAR uses the distinct signal dynamics to yield 100% classification accuracy. • Static+∆+∆∆: GMM performance improves significantly due to the addition of dynamic information. Yet MAR outperforms GMM because it can capture the nonlinear dynamics in the signals.
Sustained Phone Classification – Experimental Design • Collected artificially elongated pronunciations of three sounds: vowel /aa/, nasal /m/, sibilant /sh/, sampled at 16 kHz. • For this preliminary study, we wanted to avoid artifacts introduced by coarticulation. • Static feature database: 13 MFCCs (including energy) extracted at a frame rate of 10 ms and window duration of 25 ms. • Static+∆+∆∆ database: Static + velocity + acceleration MFCC coefficients. • Training: 70 recordings of each phone. • Testing: 30 recordings of each phone.
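A sketch of a comparable front end using librosa (our choice of toolkit, not necessarily the one used in the paper): 13 MFCCs per 25 ms window every 10 ms at 16 kHz, plus velocity and acceleration coefficients; the file name is hypothetical.

```python
import librosa
import numpy as np

y, sr = librosa.load("sustained_aa.wav", sr=16000)     # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=512, win_length=400, hop_length=160)  # 25 ms / 10 ms
delta = librosa.feature.delta(mfcc)                    # velocity coefficients
delta2 = librosa.feature.delta(mfcc, order=2)          # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])            # 39 x n_frames feature stream
```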
Sustained Phone Classification (Static Features) – Results • Table 2: Sustained phone classification (% accuracy) results with MAR and GMM using static (13) MFCC features • For an equal number of parameters, MAR outperforms GMM. • MAR exploits dynamic information to better model the evolution of the MFCCs within a phone, which GMM is unable to capture.
Sustained Phone Classification (Static+∆+∆∆) – Results • Table 3: Sustained phone classification (% accuracy) results with MAR and GMM using static+∆+∆∆ (39) MFCC features • MAR performs better than GMM when the number of parameters is equal. • MAR performance saturates as the number of parameters increases. • Why? The assumption during MAR training that the feature dimensions are uncorrelated is invalid: the static features may be nearly uncorrelated, but the ∆ features are obviously correlated with the static features. This causes problems for both diagonal-covariance GMMs and MAR, but it is more severe for MAR due to its explicit modeling of dynamics.
Summary and Future Work • Conclusions: • Presented a novel nonlinear mixture autoregressive HMM (MAR-HMM) for speech recognition with a vector extension. • Demonstrated the efficacy of MAR-HMM over GMM-HMM for signal classification with a two-class problem using synthetic signals. • Showed the superiority of MAR-HMM over GMM-HMM for sustained phone classification using static MFCC features. • Experiments with static+∆+∆∆ features were not as conclusive – MAR-HMM performance saturated more quickly than GMM-HMM. • Future Work: • Perform large-scale phone recognition experiments to study MAR-HMM performance in more practical scenarios involving noise (e.g., Aurora-4). • Investigate whether phone-dependent decorrelation (class-based PCA) of the features improves MAR-HMM performance.
References • D. May, Nonlinear Dynamic Invariants For Continuous Speech Recognition, M.S. Thesis, Department of Electrical and Computer Engineering, Mississippi State University, May 2008. • Wong, C. S. and Li, W. K., “On a Mixture Autoregressive Model,” Journal of the Royal Statistical Society, vol. 62, no. 1, pp. 95‑115, Feb. 2000. • Poritz, A. B., “Linear Predictive Hidden Markov Models,” Proceedings of the Symposium on the Application of Hidden Markov Models to Text and Speech, Princeton, New Jersey, USA, pp. 88-142, Oct. 1980. • Juang, B. H., and Rabiner, L. R., “Mixture Autoregressive Hidden Markov Models for Speech Signals,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 6, pp. 1404-1413, Dec. 1985. • Ephraim, Y. and Roberts, W. J. J., “Revisiting Autoregressive Hidden Markov Modeling of Speech Signals,” IEEE Signal Processing Letters, vol. 12, no. 2, pp. 166-169, Feb. 2005.