This paper introduces a method to accelerate scoring in the Mixture of PLDA model for noise-robust speaker verification, reducing computation time by up to 60%. The model is shown to outperform conventional PLDA for a wide range of SNR conditions.
Fast Scoring for Mixture of PLDA in I-Vector/PLDA Speaker Verification
Man-Wai Mak
APSIPA 2015
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University
Contents
• Motivation of Work
• Conventional PLDA vs. Mixture of PLDA
• Fast Scoring for Mixture of PLDA
• Experiments on NIST 2012 SRE
• Conclusions
Motivation
• Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions.
[Figure: enrollment i-vectors, a single PLDA model, PLDA score]
Motivation
• We argue that a PLDA model should focus on a small range of SNR.
[Figure: PLDA Models 1, 2 and 3, each with its own PLDA score]
Proposed Solution
• The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance’s SNR (Mak, Interspeech14; Mak et al., T-ASLP 16).
[Figure: SNR estimator, SNR posterior estimator, PLDA Models 1 to 3, PLDA score]
Reference: M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.
Key Features of Proposed Solution
• Mixture of PLDA was found to perform much better than conventional PLDA when the test utterances exhibit a wide range of SNR.
• However, the scoring function of this model is significantly more complex than that of conventional PLDA.
• This paper proposes a method that reduces the scoring time by up to 60%.
I-Vectors
• A low-dimensional representation of the entire utterance.
• Factor analysis model: $\boldsymbol{\mu}_s = \boldsymbol{\mu} + \mathbf{T}\mathbf{x}_s$, where $\boldsymbol{\mu}_s$ is the speaker- and channel-dependent supervector, $\boldsymbol{\mu}$ is the UBM supervector, $\mathbf{T}$ is the low-rank total variability matrix, and $\mathbf{x}_s$ is the speaker- and channel-dependent latent factor.
• Given $\mathbf{T}$ and an utterance of speaker s, the posterior mean of the latent factor $\mathbf{x}_s$ is the i-vector representing speaker s.
• Do the same for test speakers.
• The procedure is totally unsupervised.
• I-vectors contain both speaker and channel information.
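Probabilistic LDA (PLDA)
• In PLDA, the i-vectors x are modeled by a factor analyzer of the form $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{z}_i + \boldsymbol{\epsilon}_{ij}$, where $\mathbf{x}_{ij}$ is the i-vector extracted from the j-th session of the i-th speaker, $\mathbf{m}$ is the global mean of all i-vectors, $\mathbf{V}$ is the low-rank speaker factor loading matrix, $\mathbf{z}_i$ is the speaker factor, and $\boldsymbol{\epsilon}_{ij}$ is residual noise with covariance $\boldsymbol{\Sigma}$.
• V is trained using the i-vectors of many speakers, each with multiple sessions.
• Speaker labels are used in the training.
• The aim is to suppress channel effects on the verification scores.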
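The scoring step of conventional PLDA can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the usual Gaussian PLDA form in which an i-vector is the global mean plus V times a speaker factor plus residual noise with covariance Σ, and the function names `gauss_logpdf` and `plda_llr` are hypothetical.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)

def plda_llr(x_s, x_t, m, V, Sigma):
    """Log-likelihood ratio of the same-speaker vs. different-speaker hypotheses.

    x_s, x_t : target and test i-vectors (1-D arrays of equal length)
    m, V, Sigma : global mean, speaker loading matrix, residual covariance
    """
    between = V @ V.T          # covariance shared by i-vectors of one speaker
    total = between + Sigma    # marginal covariance of a single i-vector
    # Same speaker: both i-vectors share one speaker factor, so they are
    # correlated through the between-speaker covariance.
    joint_cov = np.block([[total, between], [between, total]])
    same = gauss_logpdf(np.concatenate([x_s, x_t]),
                        np.concatenate([m, m]), joint_cov)
    # Different speakers: the two i-vectors are independent.
    diff = gauss_logpdf(x_s, m, total) + gauss_logpdf(x_t, m, total)
    return same - diff
```

The sketch forms the joint covariance explicitly for readability; a practical system would precompute the matrix inverses once, since they do not depend on the i-vectors being scored.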
Mixture of PLDA (mPLDA)
• Generative model: [Equation: the modeled variables are the i-vector and the SNR (dB)]
• Model parameters: [Equation: one group of parameters models the SNR of utterances, the other models SNR-dependent i-vectors]
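How the SNR of an utterance could select among the mixture components can be sketched as below. The sketch assumes that each component k has a mixture weight and models SNR with a one-dimensional Gaussian (mean and standard deviation in dB); these assumptions and the names `snr_posteriors`, `weights`, `means` and `stds` are illustrative and are not taken from the slides.

```python
import math

def snr_posteriors(snr_db, weights, means, stds):
    """Posterior probability of each mixture component given an utterance's SNR.

    Assumes component k models SNR as a 1-D Gaussian N(means[k], stds[k]^2)
    with prior weights[k]; all three lists have one entry per component.
    """
    # Log of weight * Gaussian density; the constant -0.5*log(2*pi) is the
    # same for every component and cancels in the normalization.
    log_joint = [
        math.log(w) - math.log(s) - 0.5 * ((snr_db - mu) / s) ** 2
        for w, mu, s in zip(weights, means, stds)
    ]
    top = max(log_joint)                       # subtract the max for stability
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

When the component means are well separated relative to their spreads, most of the posterior mass falls on one or two components for any given SNR.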
Graphical Model of mPLDA
[Figure: graphical model of mPLDA, including the SNR of the j-th utterance from the i-th speaker; one group of parameters models the SNR of utterances, the other models SNR-dependent i-vectors]
Likelihood-Ratio Scores of mPLDA
• Different-speaker likelihood: [Equation]
• Verification score = same-speaker likelihood / different-speaker likelihood
For the full derivation, see http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
Complexity Analysis
[Equation: scoring complexity, expressed in terms of the dimension of i-vectors]
Sparseness Analysis of SNR Posteriors
• Key idea: if the posterior probabilities of SNR are sparse, we may drop the combinations of mixture components (one for the target utterance, one for the test utterance) that lead to small posterior probabilities.
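The key idea can be sketched as follows. The sketch assumes that the same-speaker likelihood is a sum over pairs of mixture components, each term weighted by the product of the two utterances' SNR posteriors; the exact scoring function is given in the supplementary material and is not reproduced here. The names `fast_mplda_numerator`, `pair_loglik` and `threshold` are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def fast_mplda_numerator(post_s, post_t, pair_loglik, threshold=0.01):
    """Posterior-weighted same-speaker log-likelihood with pruning.

    post_s, post_t : SNR posteriors of the target and test utterances
    pair_loglik    : hypothetical callable (ks, kt) -> joint log-likelihood of
                     the two i-vectors under components ks and kt (costly)
    threshold      : pairs whose posterior product is below this are skipped
    Returns (log-likelihood, number of pairs actually evaluated).
    """
    # The most probable pair is always kept so the sum is never empty.
    best = (max(range(len(post_s)), key=lambda k: post_s[k]),
            max(range(len(post_t)), key=lambda k: post_t[k]))
    terms = []
    for ks, g_s in enumerate(post_s):
        for kt, g_t in enumerate(post_t):
            weight = g_s * g_t
            if weight <= 0.0 or (weight < threshold and (ks, kt) != best):
                continue  # skip the costly likelihood for unlikely pairs
            terms.append(math.log(weight) + pair_loglik(ks, kt))
    return log_sum_exp(terms), len(terms)
```

Setting `threshold=0.0` recovers the unpruned sum over every pair with non-zero posterior, which gives a reference for checking how much accuracy a given threshold gives up.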
Sparseness Analysis of SNR Posteriors
[Figure: combinations of target-speaker and test utterance pairs, sorted by SNR posterior probability]
PLDA vs. Fast mPLDA Scoring
• PLDA: [Equation: scoring function and its complexity]
• Fast mPLDA: [Equation: scoring function and its complexity]
Experiments
• Evaluation dataset: common evaluation conditions 3 and 4 of the NIST SRE 2012 core set.
• Parameterization: 19 MFCCs together with energy, plus their 1st and 2nd derivatives (60-dim).
• UBM: gender-dependent, 1024 mixtures.
• Total variability matrix: gender-dependent, 500 total factors.
• I-vector preprocessing: whitening by WCCN, then length normalization, followed by LDA (500-dim → 200-dim) and WCCN.
• PLDA and mPLDA: 150 speaker factors.
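The first two preprocessing steps can be sketched as below. The sketch assumes the whitening transform is available as a precomputed matrix (its estimation by WCCN, and the later LDA and WCCN stages, are not shown); the function name and arguments are hypothetical.

```python
import numpy as np

def whiten_and_length_normalize(ivectors, mean, whitening):
    """Center, whiten, and scale each i-vector to unit Euclidean length.

    ivectors  : array of shape (num_utterances, dim)
    mean      : global mean estimated on training data, shape (dim,)
    whitening : precomputed whitening matrix, shape (dim, dim) (assumed given)
    """
    centered = ivectors - mean
    projected = centered @ whitening.T
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / norms
```

Length normalization projects every i-vector onto the unit sphere, which brings the data closer to the Gaussian assumption made by PLDA.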
Evaluation Conditions CC3 CC4
Comparing Scoring Time: Common Condition 3
[Figure: EER (%) and scoring time (sec.) for K = 2, 3 and 4]
Comparing Scoring Time: Common Condition 4
[Figure: EER (%) and scoring time (sec.) for K = 2, 3 and 4]
Conclusions
• Mixture of SNR-dependent PLDA (mPLDA) is a flexible model that can handle noisy speech with a wide range of SNR.
• This paper reduces the scoring time of mPLDA by half with minor degradation in performance.
• This is achieved by omitting the computation of likelihood terms whose corresponding SNR posterior probabilities are small.
• Further information: http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
Distribution of SNR in SRE12
• Each SNR region is handled by a PLDA model.
Likelihood-Ratio Scores of mPLDA
• Same-speaker likelihood: [Equation: a function of the SNR of the target and test utterances and the i-vectors of the target and test speakers]
Training Data
• In NIST 2012 SRE, training utterances from telephone channels are clean, but some of the test utterances are noisy.
• We used the FaNT tool to add babble noise to the clean training utterances.
[Figure: babble noise and utterances from microphone and telephone channels are fed to FaNT]
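The principle of adding noise at a target SNR can be sketched as below. This is a plain power-based mixer written for illustration; it is not FaNT, and FaNT's own level measurement and filtering are not reproduced. The function name and arguments are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so that the result has the requested SNR in dB.

    speech, noise : 1-D float arrays of samples
    snr_db        : target ratio of speech power to noise power, in dB
    """
    # Repeat or truncate the noise so that it covers the whole utterance.
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr/10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Applying this at several target SNRs to the same clean utterance yields training data that covers a range of SNR conditions.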