A NONLINEAR MIXTURE AUTOREGRESSIVE MODEL FOR SPEAKER VERIFICATION

A NONLINEAR MIXTURE AUTOREGRESSIVE MODEL FOR SPEAKER VERIFICATION • Author: SundararajanSrinivasan Dept. Electrical and Computer Eng. Mississippi State University • Email: ss754@ece.msstate.edu

Speaker Verification Overview • Speaker Recognition • Speaker Identification • Speaker Verification • Speaker Verification • Accept or Reject identity claim made by a speaker (a Binary Decision) • Applications: Secured access, surveillance, multimodal authentication

Speaker Verification Performance Measures • Two Kinds of Errors • False Alarms: Imposter is accepted • Misses: True speaker is rejected • Threshold value determines operating point. By varying the value of threshold, one error can be reduced at the expense of the other. • Detection Error Tradeoff (DET) Curve: Graph with false alarms on x-axis and misses on y-axis. • Model A better than Model B if DET curve of A lies consistently closer to origin than that of B. • Scalar performance measures more convenient. • Equal Error Rate (EER): Point at which line with slope 1 and passing through origin intersects DET curve. • Weights false alarms and misses equally.

Features for Speech • Speaker Verification is a pattern classification problem. • Requires features to represent information in class data. • Speech Mel-Frequency Cepstral Coefficients (MFCCs) • Most popular in speech and speaker recognition applications. • Physically motivated based on auditory perception properties of the human ear. • Both absolute MFCCs as well as their dynamics are considered to be very useful in speech and speaker recognition.

Nonlinearities in speech • Traditional speech representation and modeling approaches were restricted to linear dynamics. • Recent studies indicate significant nonlinearities are present in speech signal that could be useful in speech and speaker recognition, especially under noisy and mismatched channel conditions. • Most research striving to utilize nonlinear information in speech use nonlinear dynamic invariants as additional features. • These have found limited success and have failed to show improvements in noisy conditions because of: • Difficulty in parameter estimation from short-time segments. • Inadequacy in representing the actual nonlinear dynamics. • Capturing nonlinearities at the modeling level is desirable.

Gaussian Mixture Modeling (GMM) – The Tradition • A random variable x drawn from a Gaussian Mixture Model has a probability density function defined by: • An equivalent formulation of a GMM: where is a Gaussian r.v. with mean 0 and covariance .

GMM in speech processing systems and its limitations • GMM is the primary statistical representation for speech signals currently. • Can be easily incorporated into a Hidden Markov Model (HMM). • Found to be very successful in speech recognition but offers no improvement over GMMs in speaker recognition. • ML estimation with Expectation-Maximization algorithm is quick to converge (typically 3-4 iterations). • In speaker recognition, each Gaussian represents a different broad phone class of sounds produced by a speaker. Since the same phoneme is pronounced differently by different speakers, the GMMs of speakers are dissimilar. • Main Drawback: It is a model for RV, so cannot model dynamics in feature streams. • Old Solution: Use differential MFCC features to represent dynamics; append with absolute features, and model using GMM. • Main Drawbacks: Differential features are only a linear approximation to the nonlinear dynamics; Redundancy is present in combined features. • Proposed Solution: Use a nonlinear model to capture the static as well as nonlinear dynamic information in speech MFCC streams.

Mixture Autoregressive Model (MixAR) – A Nonlinear Model • A mixture autoregressive process (MixAR) of order p with m components, X={x[n]}, is defined as :

MixAR in Speech Processing • MixAR is distinct from other autoregressive and mixture autoregressive models found in speech literature. • This is the more general than all other mixture autoregressive models found in speech literature. • ML parameter estimation can be achieved using Generalized EM algorithm • Generalized EM: At each iteration, the likelihood is not maximized, but the algorithm moves along the direction of increasing likelihood. • Probabilistic mixing of AR processes implies nonlinearity in the model. • Hope to capture both static as well as dynamics in speech signals using absolute MFCCs alone. • MixAR in Speech Processing – what to expect? • Use only static MFCCs with MixARs to perform as well as GMMs using static+differential features. • Remove feature redundancy = > Fewer parameters • Using nonlinear information in speech, MixAR performs better than GMM especially under noisy conditions.

Preliminary Experiment I Two-Way Classification with Synthetic Data • Model for Class 1 data • (Linear Dynamics) • Model for Class 2 data • (Nonlinear Dynamics) • Classification Error Rate (%) • (number of parameters in • paranthesis)

Preliminary Experiment II • Two-Way Classification with Speech-Like Data • Two speakers from NIST 2001 database were chosen • X1: Data with linear dynamics generated from trained HMMs (3 states, 4 Gaussian mixtures per state). • X2: Data with nonlinear dynamics generated from trained MixAR (32 mixtures) . • A range of signals with varying degrees of nonlinearity generated using: • Classification Error Rate (%) • (number of parameters in • paranthesis)

Preliminary Experiment III • Speaker Verification Experiments with Synthetic Data • All 60 speakers from development part of NIST2001 SRE Corpus were used. • Clean Data • Linear Data: generated from trained HMMs (3 states, 4 Gaussian mixtures per state). • Nonlinear Data: generated from trained MixAR (32-mix). • Noisy Data • Clean utterances corrupted with 5 dB car noise audio using FANT software. • Linear and Nonlinear data generated as above. • Results • Not much difference in performance between GMM and MixAR if data are either linear or clean. But if data are both noisy and nonlinear a significant difference in performance was found!

Preliminary Experiment III Speaker Verification Experiments with Synthetic Data • MixAR using fewer parameters and only static features does consistently better than GMM using more parameters and static+delta features.

Speaker Verification Experiment with NIST 2001 • All 60 speakers from development part of NIST2001 SRE Corpus were used. • EER performance for different feature combinations • Adding delta features does not help MixAR but helps GMM. • MixAR with only static features does better than GMM with static+delta features. • Adding energy feature degrades performance.

Speaker Verification Experiment with NIST 2001 • EER performance as a function of number of mixtures • MixAR with only static features does better than GMM with static+delta features. • MixAR uses fewer parameters to achieve better performance than GMM.

Speaker Verification Experiment with NIST 2001 • DET curves • MixAR using fewer parameters and only static features does consistently better than GMM using more parameters and static+delta features.

Speaker Verification Experiment with TIMIT • All 168 speakers in the core test set were used. • 5 utterances were used to train each speaker model and the remaining 5 for evaluation. • EER performance as a function of number of mixtures • MixAR using fewer parameters and only static features does better than GMM using more parameters and static+delta features.

Proposed Experiments I • Speaker Verification Experiments with TIMIT under Noisy Conditions • This is to study noise robustness of MixAR in comparison to GMM. • Add several kinds of noise to TIMIT database at various SNR levels and study speaker verification performance. • The final table would look like this ----------->

Proposed Experiments II • Speaker Verification Experiments with Variable Amounts of Training Data • This is to study effects of training and evaluation utterance durations on verification performance. • Important to study if MixAR is applicable when utterances are short. • The final table would look like this ----------->

Other Important Issues • Study computational complexity of both MixAR and GMM, especially for the evaluation stage which needs to be done near real-time while training is typically offline. • Extend the concept of universal background models (UBM) in GMMs to MixAR by deriving speaker adaptation techniques. This can help training models for speakers with very little data. • Design discriminative approaches to MixAR training parallel to those for GMM and note if performance of MixAR is improved. • Extend the applicability of MixAR to other speech processing problems – particularly speech recognition. This is perhaps the most important though also difficult step in establishing MixAR as a superior alternative to GMM.

Proposed Contents of My Dissertation • Chapter 1: Introduction to theSpeaker Verification Problem • Background information about speaker verification problem, traditional approaches to modeling and feature extraction. • Chapter 2: Nonlinearity of Speech Signals • Survey of evidence for nonlinear dynamics in speech and how including this information can help improve robustness of speech processing systems. • Chapter 3: The Gaussian Mixture Model • Background information, application to speech, and drawbacks. • Chapter 4: The Mixture Autoregressive Model • Motivation, background information, and what we expect from MixAR in speech processing systems. • Chapter 5: Pilot Experiments with Synthetic data • Chapter 6: Speaker Verification Experiments with speech data • Experiments on NIST 2001, TIMIT, noise-performance, and training/evaluation utterance duration effects on performance. • Chapter 7: Conclusions

Summary of proposed future work • Noise Performance: • It is not certain whether MixAR, using fewer parameters, would outperform GMM in noisy environments for real –speech data. • It is very important to study noise robustness of MixAR in comparison to that of GMM because noise is ever-present in real-life and a terrible limitation to current systems. • A detailed study across different realistic noise scenarios and at various noise levels needs to be conducted. • If this is successful, this would be a significant contribution to speech processing, opening a new line of modeling approach. • Training/Evaluation Utterance Duration Effects on Performance • MixAR being a nonlinear model and using a more complicated GEM training approach, could require more data for processing. • In several practical speaker recognition applications, speaker data is very limited. • This necessitates study of how duration affects training and evaluation performance for MixAR and GMM.

Brief Bibliography • X. Huang, A. Acero, and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice-Hall, 2001 • D. A. Reynolds, “Speaker Identification and Verification using Gaussian Mixture Speaker Models”, Speech Communication, vol. 17, no. 1-2, pp. 91-108, 1995. • D. May, Nonlinear Dynamic Invariants For Continuous Speech Recognition, M.S. Thesis, Department of Electrical and Computer Engineering, Mississippi State University, USA, May 2008. • M. Zeevi, R. Meir, and R. Adler, “Nonlinear Models for Time Series using Mixtures of Autoregressive Models”, Technical Report, Technion University, Israel, available online at: http://ie.technion.ac.il/~radler/mixar.pdf, October 2000. • C. S. Wong, and W. K. Li, “On a Mixture Autoregressive Model,” Journal of the Royal Statistical Society, vol. 62, no. 1, pp. 95‑115, February 2000. • S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone, "Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition," Proceedings of the International Conference on Spoken Language Processing, pp. 960-963, Brisbane, Australia, September 2008.

Speech Recognition Toolkits: compare front ends to standard approaches using a state of the art ASR toolkit Available Resources

Publications • S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone, "Nonlinear Statistical Modeling of Speech," presentated at the 29th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2009), Oxford, Mississippi, USA, July 2009. • S. Srinivasan, T. Ma, D. May, G. Lazarou and J. Picone, "Nonlinear Mixture Autoregressive Hidden Markov Models for Speech Recognition," Proceedings of the International Conference on Spoken Language Processing, pp. 960-963, Brisbane, Australia, September 2008. • S. Prasad, S. Srinivasan, M. Pannuri, G. Lazarou and J. Picone, “Nonlinear Dynamical Invariants for Speech Recognition,” Proceedings of the International Conference on Spoken Language Processing, pp. 2518-2521, Pittsburgh, Pennsylvania, USA, September 2006. • In Preparation • S. Srinivasan, G. Lazarou and J. Picone, “Nonlinear Mixture Autoregressive Modeling for Robust Speaker Verification,” IEEE Transactions on Audio, Speech and Language Processing, to be submitted (expected March, 2010).

A NONLINEAR MIXTURE AUTOREGRESSIVE MODEL FOR SPEAKER VERIFICATION