Fast Scoring for Mixture of PLDA in I-Vector/PLDA Speaker Verification
Man-Wai Mak
APSIPA 2015
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University
Contents
• Motivation of Work
• Conventional PLDA vs. Mixture of PLDA
• Fast Scoring for Mixture of PLDA
• Experiments on NIST 2012 SRE
• Conclusions
Motivation
Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions.
[Figure: block diagram with enrollment i-vectors, a single PLDA model, and the PLDA score]
Motivation
We argue that a PLDA model should focus on a small range of SNR.
[Figure: block diagram with PLDA Models 1 to 3, each producing its own PLDA score]
Proposed Solution
The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR (Mak, Interspeech 2014; Mak et al., T-ASLP 2016).
[Figure: block diagram with an SNR estimator, an SNR posterior estimator, PLDA Models 1 to 3, and the PLDA score]
M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.
Key Features of Proposed Solution
• Mixture of PLDA performs much better than conventional PLDA when the test utterances exhibit a wide range of SNR.
• However, the scoring function of this model is significantly more complex than that of conventional PLDA.
• This paper proposes a method that reduces the scoring time by up to 60%.
I-Vectors
• A low-dimensional representation of the entire utterance.
• Factor analysis model: $\boldsymbol{\mu}_s = \boldsymbol{\mu} + \mathbf{T}\mathbf{x}_s$, where $\boldsymbol{\mu}_s$ is the speaker- and channel-dependent supervector, $\boldsymbol{\mu}$ is the UBM supervector, $\mathbf{T}$ is the low-rank total variability matrix, and $\mathbf{x}_s$ is the speaker- and channel-dependent latent factor.
• Given T and an utterance of speaker s, the posterior mean of the latent factor x_s is the i-vector representing speaker s.
• Do the same for test speakers.
• Extraction is totally unsupervised.
• I-vectors contain both speaker and channel information.
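The posterior-mean step can be sketched in a few lines. This is a minimal sketch of the standard i-vector formula, not code from the talk: the names `sigma_diag` (diagonal UBM covariances stacked as a supervector), `N` (zeroth-order statistics per UBM component) and `F_centered` (first-order statistics centered on the UBM means) are assumptions introduced here for illustration.

```python
import numpy as np

def extract_ivector(T, sigma_diag, N, F_centered):
    """Posterior mean of the latent factor under the total variability model.

    T          : (C*F, R) total variability matrix (C components, F-dim features)
    sigma_diag : (C*F,)   diagonal UBM covariances, stacked per component
    N          : (C,)     zeroth-order statistics of the utterance
    F_centered : (C*F,)   first-order statistics centered on the UBM means
    """
    n_comp = N.shape[0]
    feat_dim = T.shape[0] // n_comp
    # Repeat each occupancy count over the feature dimensions of its component.
    N_sv = np.repeat(N, feat_dim)
    # Sigma^{-1} T, exploiting the diagonal covariance.
    T_w = T / sigma_diag[:, None]
    # Posterior precision: I + T^T Sigma^{-1} N T.
    precision = np.eye(T.shape[1]) + T.T @ (N_sv[:, None] * T_w)
    # Posterior mean: precision^{-1} T^T Sigma^{-1} F_centered.
    return np.linalg.solve(precision, T_w.T @ F_centered)
```

With no observed frames (all statistics zero) the function returns the prior mean, a zero vector, which is a quick sanity check on any implementation.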
Probabilistic LDA (PLDA)
• In PLDA, the i-vectors x are modeled by a factor analyzer of the form: $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{z}_i + \boldsymbol{\epsilon}_{ij}$, where $\mathbf{x}_{ij}$ is the i-vector extracted from the j-th session of the i-th speaker, $\mathbf{m}$ is the global mean of all i-vectors, $\mathbf{V}$ is the low-rank speaker factor loading matrix, $\mathbf{z}_i$ is the speaker factor, and $\boldsymbol{\epsilon}_{ij}$ is residual noise with covariance Σ.
• V is trained using the i-vectors of many speakers, each with multiple sessions.
• Speaker labels are used in training.
• The aim is to suppress channel effects on the verification scores.
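To make the role of V and Σ in scoring concrete, here is a minimal sketch of a log-likelihood-ratio score for a single enrollment i-vector and a single test i-vector under a Gaussian PLDA model. It assumes a standard-normal prior on the speaker factor and a global mean `m`; the function names are hypothetical, and the talk's own scoring function may be written in a different (more efficient) form.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian, via slogdet and solve."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)

def plda_llr(x_s, x_t, m, V, Sigma):
    """Log of p(x_s, x_t | same speaker) / (p(x_s) p(x_t))."""
    tot = V @ V.T + Sigma   # covariance of a single i-vector
    ac = V @ V.T            # cross-covariance when the speaker factor is shared
    pair = np.concatenate([x_s, x_t])
    mean2 = np.concatenate([m, m])
    cov_same = np.block([[tot, ac], [ac, tot]])
    log_same = log_gauss(pair, mean2, cov_same)
    log_diff = log_gauss(x_s, m, tot) + log_gauss(x_t, m, tot)
    return log_same - log_diff
```

The score is symmetric in its two i-vectors, and it collapses to zero when V is zero, because the model then carries no speaker information.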
Mixture of PLDA (mPLDA)
• Generative model of the i-vector and the SNR (dB): [equation not recovered]
• Model parameters: one group for modeling the SNR of utterances and one group for modeling SNR-dependent i-vectors. [equation not recovered]
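Since the slide's equations did not survive, the following is only a sketch of one plausible SNR-dependent mixture of PLDA, written as a sampler. It assumes that the SNR of an utterance is drawn from a one-dimensional Gaussian mixture and that the selected mixture component also picks the PLDA parameters that generate the i-vector; every parameter name (`pi`, `mu`, `sigma`, `m`, `V`, `Sigma`) is an assumption made for illustration.

```python
import numpy as np

def sample_speaker(rng, n_sessions, pi, mu, sigma, m, V, Sigma):
    """Draw (SNR, i-vector) pairs for one speaker from an assumed mPLDA model.

    pi, mu, sigma : (K,)       weights, means, std. devs. of the SNR mixture
    m             : (K, D)     component-dependent i-vector means
    V             : (K, D, Q)  component-dependent speaker loading matrices
    Sigma         : (K, D, D)  component-dependent residual covariances
    """
    K, D, Q = V.shape
    z = rng.standard_normal(Q)  # one speaker factor shared by all sessions
    snrs, ivectors, components = [], [], []
    for _ in range(n_sessions):
        k = rng.choice(K, p=pi)                  # SNR group of this utterance
        snr = rng.normal(mu[k], sigma[k])        # SNR in dB
        x = rng.multivariate_normal(m[k] + V[k] @ z, Sigma[k])
        snrs.append(snr)
        ivectors.append(x)
        components.append(k)
    return np.array(snrs), np.stack(ivectors), np.array(components)
```

Sharing `z` across sessions while letting the component vary per utterance is what lets one speaker appear at several SNR levels.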
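Graphical Model of mPLDA
[Figure: graphical model of mPLDA, with one group of nodes for modeling the SNR of utterances and one group for modeling SNR-dependent i-vectors; the SNR node is the SNR of the j-th utterance from the i-th speaker]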
Likelihood-Ratio Scores of mPLDA
• Different-speaker likelihood: [equation not recovered]
• $\text{Verification score} = \dfrac{\text{same-speaker likelihood}}{\text{different-speaker likelihood}}$
• For the full derivation, see http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
Complexity Analysis
[Table not recovered; the one surviving annotation defines a symbol as the dimension of the i-vectors]
Sparseness Analysis of SNR Posteriors
• Key idea: if the posterior probabilities of SNR are sparse, we may drop the combinations that lead to small posteriors.
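The key idea can be illustrated with a short sketch. It assumes that the SNR posteriors come from a one-dimensional Gaussian mixture over SNR and that a (target, test) component pair is kept only when the product of its two posteriors reaches a threshold; the threshold rule and all names here are assumptions for illustration, not necessarily the selection rule used in the paper.

```python
import math

def snr_posteriors(snr, pi, mu, sigma):
    """Posterior of each SNR mixture component given an utterance's SNR (dB)."""
    joint = [
        p * math.exp(-0.5 * ((snr - u) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, u, s in zip(pi, mu, sigma)
    ]
    total = sum(joint)  # a log-domain version would be safer for extreme SNRs
    return [j / total for j in joint]

def significant_pairs(post_s, post_t, threshold=1e-3):
    """Keep (target component, test component) pairs with non-negligible weight."""
    return [
        (ks, kt, ps * pt)
        for ks, ps in enumerate(post_s)
        for kt, pt in enumerate(post_t)
        if ps * pt >= threshold
    ]

# Hypothetical 3-component SNR model centered at 0, 10 and 20 dB.
pi, mu, sigma = [1 / 3, 1 / 3, 1 / 3], [0.0, 10.0, 20.0], [2.0, 2.0, 2.0]
pairs = significant_pairs(snr_posteriors(1.0, pi, mu, sigma),
                          snr_posteriors(19.0, pi, mu, sigma))
```

In this toy setting only one of the nine component pairs survives, so a scorer that loops over `pairs` instead of all K × K combinations evaluates a single likelihood term.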
Sparseness Analysis of SNR Posteriors
[Figure: combinations of target-speaker utterance and test utterance pairs, sorted by SNR posterior probability]
PLDA vs. Fast mPLDA Scoring
• PLDA: [scoring function and complexity not recovered]
• Fast mPLDA: [scoring function and complexity not recovered]
Experiments
• Evaluation dataset: common evaluation conditions 3 and 4 of the NIST SRE 2012 core set.
• Parameterization: 19 MFCCs together with energy, plus their 1st and 2nd derivatives (60-dim).
• UBM: gender-dependent, 1024 mixtures.
• Total variability matrix: gender-dependent, 500 total factors.
• I-vector preprocessing: whitening by WCCN, then length normalization, followed by LDA (500-dim → 200-dim) and WCCN.
• PLDA and mPLDA with 150 speaker factors.
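The first two preprocessing steps can be sketched as follows. This is a minimal sketch, assuming WCCN is implemented as a projection by the Cholesky factor of the inverse within-class covariance matrix; the function names are hypothetical and the LDA stage is omitted.

```python
import numpy as np

def within_class_cov(X, labels):
    """Average of the per-speaker covariance matrices of the rows of X."""
    classes = np.unique(labels)
    W = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        Xc = X[labels == c]
        centered = Xc - Xc.mean(axis=0)
        W += centered.T @ centered / len(Xc)
    return W / len(classes)

def wccn_projection(X, labels):
    """Return B with B B^T = W^{-1}, so that projected data has identity W."""
    W = within_class_cov(X, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def length_normalize(X):
    """Scale every i-vector (row) to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def preprocess(X, B):
    """Apply the WCCN projection, then length normalization."""
    return length_normalize(X @ B)
```

After the projection alone the within-class covariance is the identity matrix; length normalization then places every i-vector on the unit sphere.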
Evaluation Conditions
[Table not recovered: definitions of common conditions CC3 and CC4]
Comparing Scoring Time: Common Condition 3
[Figure: EER (%) and scoring time (sec.), with results for K = 2, 3, and 4]
Comparing Scoring Time: Common Condition 4
[Figure: EER (%) and scoring time (sec.), with results for K = 2, 3, and 4]
Conclusions
• Mixture of SNR-dependent PLDA (mPLDA) is a flexible model that can handle noisy speech with a wide range of SNRs.
• This paper reduces the scoring time of mPLDA by half with minor degradation in performance.
• This is achieved by omitting the computation of likelihood terms whose corresponding SNR posterior probabilities are small.
• Further information: http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
Distribution of SNR in SRE12
Each SNR region is handled by a PLDA model.
Likelihood-Ratio Scores of mPLDA
• Same-speaker likelihood: [equation not recovered; its arguments are the SNRs of the target and test utterances and the i-vectors of the target and test speakers]
Training Data
• In NIST 2012 SRE, training utterances from telephone channels are clean, but some of the test utterances are noisy.
• We used the FaNT tool to add babble noise to the clean training utterances.
[Figure: diagram with babble noise, FaNT, and utterances from microphone and telephone channels]
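What adding noise at a target SNR amounts to can be sketched in a few lines. This is not FaNT and does not reproduce its filtering or its speech-level measurement; it is a simplified sketch that assumes the SNR is defined from the average power of the whole speech signal and the whole noise segment, and `mix_at_snr` is a hypothetical helper name.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the result has the requested SNR in dB.

    speech, noise : sequences of float samples; noise must be at least as long
    snr_db        : target ratio of speech power to noise power, in dB
    """
    if len(noise) < len(speech):
        raise ValueError("noise segment is shorter than the speech signal")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Scale the noise so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling it with several `snr_db` values on the same clean utterance yields training material that covers a range of SNRs.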