Speaker identification and verification using EigenVoices
O. Thyes, R. Kuhn, P. Nguyen, and J.-C. Junqua, in ICSLP 2000
Presented by 王瑞璋 Nick Wang
Philips Research East Asia-Taipei / Speech Processing Laboratory, NTU
25 October 2000
Speaker identification and verification
• Speaker identification
  • to identify a speaker, as one of the clients, from speech input
• Speaker verification
  • to verify a speaker, as the claimed one, from speech input
• Problem definition: the amount of available data per speaker is limited
  • 60 seconds ==> enough to train a GMM
  • 5 seconds ==> not enough to train a GMM, but enough to estimate EigenVoices coefficients
• Aim: to incorporate EigenVoices into GMM speaker modeling
When GMM meets EigenVoices
• GMM
  • one Gaussian mixture p.d.f. per client
  • for example, a GMM of 32 multivariate Gaussian p.d.f.s
  • given acoustic feature vectors of 26 components (13 + 13)
  • model size: 32 x 26 = 832 variables
• EigenVoices: principal axes of GMM parameter supervectors
  • reduce the dimensionality of the GMM model by PCA, LDA, or MLES (see the PCA sketch below)
  • eliminate the effect of estimation error by removing the lower-variation (noise) axes and keeping the higher-variation (signal) axes ==> subspace selection with SNR > threshold (presenter's note)
  • or a fixed dimension for the EigenVoices space: 20 to 70 EigenVoices (the higher-variation axes)
  • speaker location in EigenVoices space ==> reconstruct an adapted GMM
  • model size: 20 to 70 variables
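The supervector construction and axis selection can be sketched in a few lines. A minimal, illustrative sketch (not the paper's implementation), assuming per-speaker scikit-learn GaussianMixture models are already trained; the names gmms, supervectors, eigenvoices are mine:

```python
# Hypothetical sketch: stack each training speaker's GMM means into a
# supervector, then extract the highest-variance axes ("EigenVoices")
# with PCA. `gmms` is an assumed list of fitted sklearn GaussianMixture
# models, one per extra-training speaker.
import numpy as np
from sklearn.decomposition import PCA

n_mix, n_dim = 32, 26                 # 32 Gaussians, 26-dim features
# one 832-dim supervector per speaker: concatenated GMM mean vectors
supervectors = np.stack([g.means_.ravel() for g in gmms])

pca = PCA(n_components=20)            # keep the 20 highest-variance axes
pca.fit(supervectors)
eigenvoices = pca.components_         # shape (20, 832): the "EigenVoices"
mean_voice = pca.mean_                # origin of the eigenspace
```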
When GMM meets EigenVoices
• Benefits of principal axes
  • robust & fast: keeping the higher-variation axes produces less estimation error and shows improvement immediately
  • an explicit and compact representation of the speaker distribution (vs. MAP or MLLR)
  • more applications: e.g. SPID, telephony, embedded systems, ...
• Corpora
  • extra training data (to train the SI model and/or the EigenVoices)
    • large amounts of data from a large and diverse set of speakers
  • client data (to train client models)
    • small amounts of data per speaker
  • test set
    • small amounts of data per speaker (from clients or impostors)
When GMM meets EigenVoices
• Training procedure
  • train a GMM for each speaker in the extra training data
    • large amounts of data per speaker
  • train the EigenVoices (principal axes of the GMMs) using PCA, LDA, or MLES
    • on the model-parameter supervectors
  • apply environmental adaptation to all EigenVoices using MLLR
    • on all client data
  • apply MLED to estimate the eigen-coefficients for each client
    • small amounts of data per speaker
  • compose a client model for each client from the EigenVoices and their coefficients (see the sketch below)
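A minimal sketch of the last two steps, with a plain orthogonal projection standing in for the paper's MLED estimator (MLED maximizes the likelihood of the client data directly; the projection below only illustrates the geometry). client_sv is a hypothetical supervector built from the client's sparse enrollment data; eigenvoices and mean_voice come from the PCA sketch above:

```python
# Illustrative sketch: estimate eigen-coefficients for a client and
# rebuild the adapted GMM means. A least-squares projection stands in
# for MLED here, so treat this as geometry, not the paper's estimator.
import numpy as np

def estimate_coefficients(client_sv, eigenvoices, mean_voice):
    """Project a client supervector onto the eigenvoice axes."""
    return eigenvoices @ (client_sv - mean_voice)   # (n_eigenvoices,)

def compose_client_model(coeffs, eigenvoices, mean_voice, n_mix=32, n_dim=26):
    """Reconstruct adapted GMM mean vectors from eigen-coefficients."""
    sv = mean_voice + coeffs @ eigenvoices          # back to 832-dim space
    return sv.reshape(n_mix, n_dim)                 # one mean per mixture
```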
Speaker identification/verification
• Measurement (both rules are sketched in code below)
  • eigenDistance decoding: eigenDist(test, client)
    • the test speaker's distance from the client speaker in the eigenspace
  • eigenGMM decoding: eigenGMM_client(test)
    • the test speaker's likelihood under the client's eigen-adapted GMM
• Speaker identification
  • decision(test) = argmin_client eigenDist(test, client)
  • or decision(test) = argmax_client eigenGMM_client(test)
• Speaker verification
  • decision(test, claim) = accept if eigenDist(test, claim) < thr, otherwise reject
  • or decision(test, claim) = accept if eigenGMM_claim(test) > thr, otherwise reject
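The decision rules above, written out as a hedged sketch; client_coords (eigenspace coordinates per client) and client_gmms (eigen-adapted GaussianMixture models per client) are assumed to be precomputed as in the earlier sketches:

```python
# Sketch of the identification/verification decision rules.
import numpy as np

def identify_by_distance(test_coords, client_coords):
    """argmin over clients of Euclidean eigenDistance."""
    dists = {c: np.linalg.norm(test_coords - v) for c, v in client_coords.items()}
    return min(dists, key=dists.get)

def verify_by_distance(test_coords, claim_coords, thr):
    """Accept iff eigenDist(test, claim) < thr."""
    return np.linalg.norm(test_coords - claim_coords) < thr

def identify_by_gmm(test_features, client_gmms):
    """argmax over clients of eigen-adapted GMM log-likelihood."""
    scores = {c: g.score(test_features) for c, g in client_gmms.items()}
    return max(scores, key=scores.get)
```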
Experiments
• Setup
  • Corpora
    • TIMIT: mismatched extra training data, 630 speakers x 10 sentences
    • YOHO: extra training, client, and test data, 82 speakers x 96 sentences
• Results for abundant (360 sec) enrollment data in SPID
  • 82 clients with 360 seconds of enrollment speech each
  • 5 seconds of test speech
  • GMM: 98.8% correct identification
  • no eigenGMM model beats the plain GMM under the constraint of at most 71 EigenVoices
    • reason: enrollment data is plentiful, and the eigenspace is constrained to 71 of 832 axes
  • the best eigen result is 98.0%, with LDA EigenVoices, 71 axes (the maximum), and eigenGMM decoding
Experiments
• Results for sparse (10 sec) enrollment data in SPID
Experiments
• Results for sparse (10 sec) enrollment data in speaker verification
  • SI impostor model for eigenGMM decoding (normalization sketched below)
    • 40 EigenVoices on 64-GMM supervectors over 72 speakers
  • EigenVoices help, in particular:
    • LDA-EigenVoices
    • eigenDistance decoding
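For eigenGMM verification, the SI impostor model is typically used to normalize the claim score via a log-likelihood ratio; a sketch under the same assumptions as before (claim_gmm and si_gmm are fitted GaussianMixture models, thr is a tuned threshold):

```python
# Illustrative sketch of impostor-model normalization: score the test
# utterance against the claimed client's eigen-adapted GMM and against
# the speaker-independent (SI) impostor model, then threshold the ratio.
def verify_with_impostor(test_features, claim_gmm, si_gmm, thr):
    llr = claim_gmm.score(test_features) - si_gmm.score(test_features)
    return llr > thr   # accept the claimed identity iff the ratio is high
```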
Experiments
• Results for matched/mismatched extra training data in SPID
  • MLLR adaptation helps to overcome environment mismatch
  • TIMIT is not suitable for LDA-EigenVoices because it has:
    • only 10 sentences per speaker
    • more allophonic variability
Conclusions
• EigenVoices provides a confined subspace
  • for abundant client data, it is worse than a conventional GMM because of the loss of degrees of freedom
  • for sparse client data, it performs better than a conventional GMM
• For eigenDistance speaker verification, no impostor model is needed to normalize for utterance-likelihood dependencies
  • the eigenspace itself implicitly normalizes for utterance likelihood: two utterances with very different likelihoods may map to the same point in the eigenspace
• Environment mismatch hurts the client models
  • even when MLLR adaptation is applied
• LDA for EigenVoices generation will not work if
  • there are few utterances per speaker
  • or there is strong allophonic variability
Comments & my future work
• Since EigenVoices is a confinement, can we enlarge the speaker models before applying it?
  • GMM: makes no use of fine speech structure
  • LVCSR (segmentation => adaptation => SA score difference from the SI one): using speech-structure information hurt speaker-recognition performance
  • Sequential Non-Parametric (SNP) or DTW distances: SNP + GMM worked best overall
• To try EigenMLLR speaker recognition
  • 1/1000 of the memory requirement of EigenVoices
• To split the test data into several fragments, each one very small
  • eigenspace decoding
  • joint decision