Robust HMM classification schemes for speaker recognition using integral decode • Marie Roch • Florida International University
Speaker Recognition • Types of speaker recognition
Speaker Recognition • Why is it hard? • Minimal training data • Background noise • Transducer mismatch • Channel distortions • People’s voices change over time and under stress • Performance
Feature Extraction • Extract speech • Spectral analysis • Cepstrum: • Cepstral mean removal
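The feature extraction steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the front end used in the talk: the frame length, hop size, Hamming window, and number of coefficients are all assumed values, and `real_cepstrum` and `cepstral_features` are hypothetical helper names.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=12, eps=1e-10):
    """Real cepstrum of one frame: inverse FFT of the log magnitude spectrum."""
    windowed = frame * np.hamming(len(frame))      # assumed window choice
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + eps)
    cepstrum = np.fft.irfft(log_mag, n=len(frame))
    return cepstrum[1:n_coeffs + 1]                # drop c0 (overall energy)

def cepstral_features(signal, frame_len=256, hop=128, n_coeffs=12):
    """Frame the signal, compute cepstra, then subtract the per-utterance mean."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    feats = np.array([real_cepstrum(signal[s:s + frame_len], n_coeffs)
                      for s in starts])
    # Cepstral mean removal: a fixed linear channel adds a roughly constant
    # offset in the cepstral domain, so subtracting the mean reduces it.
    return feats - feats.mean(axis=0)
```

Mean removal is done per utterance here, which is one common choice; other normalization windows are possible.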
Hidden Markov Models • Statistical pattern recognition • State dependent modeling • Distribution/state • Radial basis functions common • State sequence unobservable
HMM • Efficient decoders: • Training • EM algorithm • Convergence to local maxima guaranteed
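One efficient decoder for scoring an utterance against an HMM is the forward algorithm, sketched here in the log domain. The slide does not say which decoder was used, so treat this as an illustrative assumption; `log_forward` is a hypothetical name, and the state output likelihoods are taken as a precomputed matrix.

```python
import numpy as np

def log_forward(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under one HMM.

    log_pi: (N,)   log initial state probabilities
    log_A:  (N, N) log transition probabilities, row i -> column j
    log_B:  (T, N) log output likelihood of frame t in state j
    """
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # Sum over predecessor states i in the log domain, then emit frame t.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)
```

The cost is O(T N^2) per model, compared with N^T for enumerating every state sequence.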
Recognition • Model for each speaker • Maximum a posteriori (MAP) decision rule • [Figure: features are scored against the speaker models; the arg max of the scores selects the speaker]
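The decision rule on this slide amounts to picking the speaker whose model gives the highest score. A minimal sketch, assuming per-speaker log-likelihoods have already been computed (for example by an HMM decoder); `identify_speaker` and the dictionary layout are hypothetical:

```python
import numpy as np

def identify_speaker(log_likelihoods, log_priors=None):
    """Return the speaker with the largest log-likelihood plus log prior.

    log_likelihoods: dict mapping speaker name -> log P(features | model)
    log_priors:      optional dict mapping speaker name -> log P(speaker);
                     omitted priors are treated as equal.
    """
    names = list(log_likelihoods)
    scores = np.array([log_likelihoods[n] for n in names], dtype=float)
    if log_priors is not None:
        scores += np.array([log_priors[n] for n in names], dtype=float)
    return names[int(np.argmax(scores))]
```

With equal priors the rule reduces to choosing the maximum-likelihood model.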
The MAP decision rule • Optimal decision rule provided we have accurate distribution parameters & observations. • Problem: • Corruption of feature vectors. • Distribution known to be inaccurate.
Integral decode • Goal: Include uncorrupted observation ô_t. • Problem: ô_t unobservable. • Determine a local neighborhood about o_t and use a priori information to weight the likelihood:
Integral decode issues • Problems approximating the integral • High frame rate * number of models • Non-trivial dimensionality • Selection of the neighborhood
Approximating the integral • Monte Carlo impractical • Use simplified cubature technique:
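The specific simplified cubature rule from the talk is not recoverable from this slide. As a stand-in, the sketch below uses a standard degree-3 spherical-radial cubature rule for a Gaussian weight with diagonal covariance: it evaluates the state density at 2D points around the observed vector instead of drawing many Monte Carlo samples. The Gaussian weighting and the names `cubature_points` and `integrated_likelihood` are assumptions made for illustration.

```python
import numpy as np

def cubature_points(center, sigma):
    """2D points center +/- sqrt(D) * sigma_k * e_k with equal weights 1/(2D)."""
    d = len(center)
    offsets = np.sqrt(d) * np.diag(sigma)          # row k is the k-th axis offset
    points = np.vstack([center + offsets, center - offsets])
    weights = np.full(2 * d, 1.0 / (2 * d))
    return points, weights

def integrated_likelihood(density, center, sigma):
    """Approximate the integral of density(x) * N(x; center, diag(sigma^2)) dx."""
    center = np.asarray(center, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    points, weights = cubature_points(center, sigma)
    values = np.array([density(p) for p in points])
    return float(np.dot(weights, values))
```

This rule is exact for polynomial integrands up to degree 3 and costs 2D density evaluations per frame and state, which is the kind of fixed, small budget that makes an integral practical at a high frame rate.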
Neighborhood choice • Choosing an appropriate neighborhood: • Upper bound difference neighborhoods [Merhav and Lee 93] • Error source modeling
Upper bound difference neighborhoods • Arbitrary signal pairs with a few general conditions. • PSD • Cepstra
Taking the upper bound • Asymptotic difference between cepstral parameters:
Error source modeling • Multiple error sources • Simplifying assumption of one normal distribution with zero mean • Use time series analysis to estimate the noise • Trend
Error Source Modeling • Estimate variance from detrended signal
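A rough sketch of estimating the noise variance from a detrended feature track. The slides do not say which trend estimator was used, so a centered moving average is assumed here; `detrended_variance` and the window length are hypothetical.

```python
import numpy as np

def detrended_variance(track, window=5):
    """Estimate noise variance of one feature over time after removing a trend.

    track:  (T,) values of a single cepstral coefficient across frames
    window: odd length of the moving average used as the trend (assumption)
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the average is centered")
    track = np.asarray(track, dtype=float)
    padded = np.pad(track, window // 2, mode="edge")   # repeat edge values
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    residual = track - trend
    return residual.var(ddof=1), trend
```

The residual is treated as the zero-mean error term, and its sample variance parameterizes the single normal distribution assumed for the error sources.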
Error source modeling • Problem: • Neighborhood is infinite (a normal distribution has unbounded support) • Solution: • Most of the points are outliers • Set percentage of distribution beyond which points are culled.
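Culling at a fixed percentage of a zero-mean normal distribution turns the unbounded support into a finite interval per dimension. A minimal sketch, assuming a symmetric interval and an illustrative coverage value; `neighborhood_half_width` is a hypothetical name.

```python
from statistics import NormalDist

def neighborhood_half_width(sigma, coverage=0.95):
    """Half-width h such that [-h, h] holds `coverage` of a N(0, sigma^2) mass."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    # Symmetric interval: leave (1 - coverage) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * sigma
```

For example, keeping 95% of the distribution gives a half-width of about 1.96 standard deviations.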
Complexity of integration • Expensive • Ways to reduce/cope • Implemented • Top K processing • Principal Component Analysis • Possible • Gaussian Selection • Sub-band Models • SIMD or MIMD parallelism
Top K Processing • [Figure: results for 1-second, 3-second, and 5-second test segments]
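The slides do not spell out how Top K processing works. One plausible reading, sketched here as an assumption, is a two-pass scheme: score every speaker with the cheap standard decode, then run the expensive integral decode only on the K best candidates. `top_k_rescore` and its arguments are hypothetical.

```python
def top_k_rescore(cheap_scores, expensive_score, k):
    """Rescore only the k best candidates from a cheap first pass.

    cheap_scores:    dict mapping speaker name -> first-pass score
    expensive_score: callable taking a speaker name, returning the costly score
    k:               number of candidates kept for the second pass
    """
    shortlist = sorted(cheap_scores, key=cheap_scores.get, reverse=True)[:k]
    rescored = {name: expensive_score(name) for name in shortlist}
    best = max(rescored, key=rescored.get)
    return best, rescored
```

Under this reading the expensive pass costs K model evaluations instead of one per enrolled speaker, at the risk of dropping the true speaker when the first pass ranks it below K.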
Principal Component Analysis • Choose P most important directions
Principal Component Analysis • Integrate using new basis set for step function
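Choosing the P most important directions can be sketched with an eigendecomposition of the feature covariance. This is a generic PCA illustration, not the talk's implementation; `principal_directions` is a hypothetical name, and how the resulting basis feeds the integration rule is not shown on the slides.

```python
import numpy as np

def principal_directions(features, p):
    """Return the p directions of largest variance and their variances.

    features: (T, D) matrix of feature vectors, one row per frame
    p:        number of directions to keep (p <= D)
    """
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:p]          # indices of the p largest
    return eigvecs[:, order], eigvals[order]
```

Integrating along only P directions of the new basis, rather than all D feature dimensions, reduces the number of evaluation points needed per frame.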
Speech Corpus • King-92 • Used San Diego subset • 26 male speakers • Long distance telephone speech • Quiet room environment • 5 sessions recorded one week apart • Sessions 1-3 used for training • Sessions 4-5 partitioned into test segments
Integral decode performance • [Figure: results for 1-second, 3-second, and 5-second test segments]
Integral decode with other conditions • Performance on • high quality speech • transducer mismatch
Future work • Extensions to the integral decode • Automatic parameter selection • Gaussian selection • distributed computation • Efficient multiple class preclassifiers
Optimal/utterance hyperparameters – 5 seconds • [Figure: panels for KingNB26, KingWB51, SpidreF18XDR, and SpidreM27XDR]
95% Confidence Intervals • Caveat: • Per speaker means • Large granularity
Pattern Recognition • Long term statistics [Bricker et al 71, Markel et al 77] • Vector Quantization [Soong et al 87] • HMM [Rosenberg et al 90, Tishby 91, Matsui & Furui 92, Reynolds et al 95] • Connectionist frameworks • Feed forward [Oglesby & Mason 90] • Learning vector quantization [He et al 99]
Pattern Recognition Contd. • Hybrid/Modified HMMs • Min Classification Error discriminant [Liu et al 95] • Tree structured neural classifiers [Liou & Mammone 95] • Trajectory modeling [Russell et al 85, Liu et al 95, Ostendorf et al 96, He et al 99] • Sub-band recognition [Besacier & Bonastre 97]