Speaker ID Smorgasbord
or How I Spent My Summer at ICSI
Kofi A. Boakye
International Computer Science Institute
Speech Group Lunch Talk

Outline
• Keyword System
• Enhancements
• Monophone System
• Hybrid HMM/SVM
• Score Combinations
• Possible Directions

Keyword System: A Review
• Motivation
  I. Text-dependent systems have high performance but limited flexibility compared to text-independent systems
  - Capitalize on the advantages of text-dependent systems in this text-independent domain by limiting the words of interest to a select group: backchannels (yeah, uhhuh), filled pauses (um, uh), discourse markers (like, well, now…)
  => High frequency and high speaker-characteristic quality
  II. GMMs assume frames are independent and fail to take advantage of sequential information
  => Use HMMs instead to model the evolution of speech in time

Keyword System: A Review
• Approach
• Model each speaker using a collection of keyword HMMs
• Speaker models generated via adaptation of background models trained from a development data set
• Use standard likelihood ratio approach:
  - Compute log likelihood ratio scores using accumulated log probabilities from keyword HMMs
• Use a speech recognizer to:
  - Locate words in the speech stream
  - Align speech frames to the HMM
  - Generate acoustic likelihood scores
[Figure: signal -> Word Extractor -> HMM-UBM 1, HMM-UBM 2, …, HMM-UBM N -> Combination]

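The likelihood ratio scoring on this slide can be sketched as follows. This is a minimal sketch, assuming the recognizer has already produced, for every keyword token, an accumulated log-probability under the target speaker's HMM and under the background HMM. The function name, the tuple layout, and the normalization by total frame count are illustrative assumptions, not details given in the talk.

```python
def keyword_llr(tokens):
    """Accumulate a log likelihood ratio over keyword tokens.

    tokens: list of (target_logprob, background_logprob, n_frames),
    one entry per keyword occurrence found in the test conversation.
    """
    total_llr = 0.0
    total_frames = 0
    for target_lp, background_lp, n_frames in tokens:
        # Log of the ratio = difference of the log-probabilities.
        total_llr += target_lp - background_lp
        total_frames += n_frames
    if total_frames == 0:
        # No keyword tokens found: return a neutral score (assumption).
        return 0.0
    # Frame normalization is an assumption made for this sketch.
    return total_llr / total_frames
```

A positive score means the keyword tokens are better explained by the target speaker's models than by the background models.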
Keyword System: A Review
• Keywords
  - Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
  - Filled pauses: {um, uh}
  - Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know}
• Keyword Models
  - Simple left-to-right (whole word) HMMs with self-loops and no skips
  - 4 Gaussian components per state
  - Number of states related to number of phones and median number of frames for word
  - HMMs trained and scored using HTK
  - Acoustic features: 19 mel-cepstra, zeroth cepstrum, and their first differences

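The "left-to-right with self-loops and no skips" topology can be illustrated with a transition matrix. This is a minimal sketch: the function name and the initial self-loop probability are hypothetical, and the non-emitting entry and exit states that an HTK model definition would carry are omitted, so the final state is simply made absorbing here.

```python
def left_to_right_transitions(n_states, self_loop_prob=0.6):
    """Return an n_states x n_states transition matrix (list of rows).

    Each state may only stay where it is (self-loop) or advance to the
    next state; there are no skip transitions.
    """
    if n_states < 1:
        raise ValueError("need at least one state")
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            matrix[i][i] = self_loop_prob            # stay in state i
            matrix[i][i + 1] = 1.0 - self_loop_prob  # advance to state i+1
        else:
            matrix[i][i] = 1.0  # last state; exit transition omitted
    return matrix
```

These values would only be a starting point; training re-estimates them from data.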
System Performance
• Switchboard 1 Dev Set
  - Data partitioned into 6 splits
  - Tests use jack-knifing procedure: test on splits 1-3 using background model trained on splits 4-6 (and vice versa)
  - For development, tested primarily on split 1 with 8-side training
• Result: EER = 0.83%

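For reference, the equal error rate (EER) quoted throughout is the operating point at which the miss rate equals the false-alarm rate. The sketch below is one simple way to approximate it from lists of target and impostor trial scores; the function name and the threshold sweep over observed scores are assumptions of this sketch, not the scoring tool used for the results above.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap = None
    eer = None
    for th in thresholds:
        miss = sum(1 for s in target_scores if s < th) / len(target_scores)
        false_alarm = sum(1 for s in impostor_scores if s >= th) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (miss + false_alarm) / 2.0
    return eer
```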
System Performance
• Observations:
  - Well-performing bigrams have comparable EERs
  - Poorly-performing bigrams suffer from a paucity of data
  - Suggests possibility of frequency threshold for performance
  - Single word ‘yeah’ yields EER of 4.62%

Enhancements: Words
• Examine the performance of other words
• Sturim et al. propose word sets for text-constrained GMM system
  - Full set: 50 words that occur in > 70% of conversation sides
    {and, I, that, yeah, you, just, like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that’s, on, its, about, do, for, was, don’t, one, get, all, with, oh, a, we, be, there, of, this, I’m, what, out, or, if, are, at}

Enhancements: Words
• Examine the performance of other words
• Sturim et al. propose word sets for text-constrained GMM system
  - Full set: 50 words that occur in > 70% of conversation sides
  - Min set: 11 words that yield the lowest word-specific EERs
    {and, I, that, yeah, you, just, like, uh, to, think, the}

Enhancements: Words
• Performance
  - Full set: EER = 1.16%
  - My set ∩ Full set = {yeah, like, uh, well, I, think, you}

Enhancements: Words
• Observations:
  - Some poorly performing words occur quite frequently
  - Such words may simply not be highly discriminative in nature
  - Single word ‘and’ yields EER of 2.48%!

Enhancements: Words
• Performance
  - Min set: EER = 0.99%
  - My set ∩ Min set = {yeah, like, uh, I, you, think}

Enhancements: Words
• Observations:
  - Except for ‘and’, min set words have comparable performance
  - Most fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or in conjunction

Enhancements: HNorm
• Target model scores have different distributions for utterances based on handset type
• Perform mean and variance normalization of scores based on estimated impostor score distribution
• For split 1, use impostor utterances from splits 2 and 3
  - 75 females
  - 86 males
[Figure: LR scores vs. HNorm scores for two target models (tgt1, tgt2), each with electret (elec) and carbon-button (carb) handset distributions]

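The normalization described on this slide can be sketched as below. This is a minimal sketch, assuming each target model has been scored against a set of impostor utterances whose handset type is known; the function names and the dictionary layout are hypothetical, and the handset labels `elec` and `carb` are taken from the figure.

```python
from statistics import mean, pstdev


def estimate_hnorm_params(impostor_scores_by_handset):
    """For one target model, map handset label -> (mean, std) of impostor scores."""
    return {
        handset: (mean(scores), pstdev(scores))
        for handset, scores in impostor_scores_by_handset.items()
    }


def hnorm(score, handset, params):
    """Normalize a raw likelihood ratio score using the handset-specific statistics."""
    mu, sigma = params[handset]
    if sigma == 0.0:
        raise ValueError("impostor scores for this handset have zero spread")
    return (score - mu) / sigma
```

For example, with impostor scores `{"elec": [1.0, 3.0], "carb": [0.0, 4.0]}`, a raw score of 4.0 normalizes to 2.0 on an electret test utterance and to 1.0 on a carbon-button one, so the same raw score is treated as stronger evidence on the handset whose impostor scores are more tightly clustered.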
Enhancements: HNorm
• Performance
  - EER = 1.65%
  - Performance worsened!
  - Possible issue in HNorm implementation?

Enhancements: HNorm
• Examine effect of HNorm on particular speaker scores
• Speakers of interest: those generating the most errors
  - 3 speakers, each generating 4 errors

Enhancements: HNorm
• Conclusion: HNorm works… but doesn’t
• One possibility: look at computed deviations…
  - Distributions are widening in some cases

Enhancements: Deltas
• Problem: System performance differs significantly by gender
• Hypothesis: Higher deltas for females may be noisier
• Solution: Use longer window for delta computation to smooth

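To make the smoothing effect of the window concrete, here is a minimal sketch of regression-based first differences, d_t = Σ_k k·(c_{t+k} − c_{t−k}) / (2·Σ_k k²) for k = 1..window, which is the form documented in the HTK Book. It is an assumption of this sketch that the "window size" on these slides corresponds to the `window` parameter below; the function name and the edge handling (replicating the first and last frames) are likewise illustrative.

```python
def delta_features(frames, window=2):
    """Compute first-difference (delta) features for a list of frames.

    frames: list of equal-length feature vectors (lists of floats).
    window: number of frames used on each side; larger values smooth more.
    """
    n = len(frames)
    dim = len(frames[0]) if n else 0
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # replicate last frame
                behind = frames[max(t - k, 0)][d]      # replicate first frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas
```

On a feature that rises by exactly 1 per frame, interior frames get a delta of 1.0 regardless of the window. The window only changes how strongly frame-to-frame noise is averaged out.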
Enhancements: Deltas
• Extended window size from 2 -> 3
• Result: EER = 0.83%
  - Performance nearly indistinguishable
  - Male and female disparity remains

Enhancements: Deltas
• Extended window size from 3 -> 5
• Result: EER = 1.32%
  - Performance worsens!
  - Male and female disparity widens
• Further investigation necessary

Monophone System
• Motivation
  - Keyword system, with its use of HMMs, appears to have good performance
  - However, we are only using a small amount (~10%) of the total data available
  => Get full coverage by using phone HMMs rather than word HMMs
  - System represents a trade-off between token coverage and “sharpness” of modeling

Monophone System
• Implementation
• System implemented similarly to keyword system, with phones replacing words
• Background models differ in that:
  - All models have 3 states, with 128 Gaussians per state
  - Models trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian

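The "successive splitting" schedule can be sketched as follows: going from 1 to 128 Gaussians per state takes seven doublings, with re-estimation after each. This is a minimal sketch under stated assumptions: every component is split at each step, the means are perturbed by ±0.2 standard deviations (the perturbation factor is an assumption), and Baum-Welch re-estimation is represented only by a placeholder callback. All names are hypothetical.

```python
import math


def split_mixture(components, perturb=0.2):
    """Double a diagonal-covariance mixture.

    components: list of (weight, means, variances) tuples.
    Each component becomes two copies with half the weight and
    means shifted by +/- perturb standard deviations.
    """
    doubled = []
    for weight, means, variances in components:
        offsets = [perturb * math.sqrt(v) for v in variances]
        doubled.append((weight / 2, [m + o for m, o in zip(means, offsets)], list(variances)))
        doubled.append((weight / 2, [m - o for m, o in zip(means, offsets)], list(variances)))
    return doubled


def grow_mixture(components, target_size, reestimate=lambda c: c):
    """Alternate splitting and re-estimation until target_size is reached.

    reestimate stands in for Baum-Welch re-estimation (placeholder).
    """
    while len(components) < target_size:
        components = reestimate(split_mixture(components))
    return components
```

Starting from a single Gaussian, `grow_mixture(start, 128)` yields 128 components whose weights still sum to 1.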
Monophone System
• Performance
  - EER = 1.16%
  - Similar performance to keyword system
  - Uses a lot more data!

Hybrid HMM/SVM System
• Motivation
  - SVMs have been shown to yield good performance in speaker recognition systems
• Features used:
  - Frames
  - Phone and word n-gram counts/frequencies
  - Phone lattices

Hybrid HMM/SVM System
• Motivation
  - Keyword system looks at “distance” between target and background models as measured by log-probabilities
  - Look at distance between models more explicitly
  => Use model parameters as features

Hybrid HMM/SVM System
• Approach
  - Use concatenated mixture means as features for SVM
  - Positive examples obtained by adapting background HMM to each of 8 training conversations
  - Negative examples obtained by adapting background HMM to each conversation in the background set
  - Keyword-level SVM outputs combined to give final score
    - Presently a simple linear combination with equal weighting is used (though clearly suboptimal)

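The feature construction and the equal-weight combination can be sketched as below. This is a minimal sketch: the data layout and function names are hypothetical, and the SVM training itself is left out. With the keyword models described earlier (4 Gaussians per state, and 40-dimensional features if the 19 mel-cepstra plus zeroth cepstrum are doubled by their first differences), each state would contribute 160 values to the vector.

```python
def mean_supervector(hmm_states):
    """Concatenate all mixture means of an adapted HMM into one feature vector.

    hmm_states: list of states; each state is a list of mean vectors,
    one per Gaussian component.
    """
    return [value for state in hmm_states for mean_vec in state for value in mean_vec]


def combine_keyword_scores(svm_outputs):
    """Equal-weight linear combination of per-keyword SVM outputs.

    svm_outputs: dict mapping keyword -> SVM output for the test conversation.
    """
    if not svm_outputs:
        raise ValueError("no keyword scores to combine")
    return sum(svm_outputs.values()) / len(svm_outputs)
```

Each adapted HMM (one per training or background conversation) yields one such vector, labeled positive or negative for the keyword's SVM.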
Hybrid HMM/SVM System
• Performance
  - EER = 1.82%
  - Promising first start

Score Combination
• We have three independent systems, so let’s see how they combine…
• Perform post facto (read: cheating) linear combination
• Each best combination yields same EER
  => Possibly approaching EER limit for data set

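A post facto linear combination of three systems can be sketched as a grid search over weights that sum to 1, choosing the weights by the EER on the very trials being evaluated (which is what makes it "cheating"). This is a minimal sketch: the function names, the grid resolution, and the simple threshold-sweep EER are assumptions, not the procedure used for the results above.

```python
def approx_eer(targets, impostors):
    """Approximate EER; a trial is accepted when score >= threshold."""
    best_gap, eer = None, None
    for th in sorted(set(targets) | set(impostors)):
        miss = sum(1 for s in targets if s < th) / len(targets)
        fa = sum(1 for s in impostors if s >= th) / len(impostors)
        if best_gap is None or abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer


def fuse(weights, system_scores):
    """Weighted sum across systems for every trial."""
    n_trials = len(system_scores[0])
    return [sum(w * s[j] for w, s in zip(weights, system_scores)) for j in range(n_trials)]


def best_linear_combination(system_targets, system_impostors, steps=10):
    """Grid-search weights (w0, w1, w2), w0 + w1 + w2 = 1, for three systems.

    system_targets[i][j]: score of system i on target trial j
    (likewise for impostor trials). Returns (eer, weights).
    """
    best = None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            weights = (a / steps, b / steps, (steps - a - b) / steps)
            eer = approx_eer(fuse(weights, system_targets), fuse(weights, system_impostors))
            if best is None or eer < best[0]:
                best = (eer, weights)
    return best
```

In an honest evaluation the weights would be tuned on held-out data rather than on the test trials.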
Possible Directions
• Develop on SWB2
• Create word “master list” for keyword system
• TNorm
• Modify features to address gender-specific performance disparity
• Score combination for hybrid system
• Modified hybrid system
• Tuning
• Plowing

Fin