A Baseline System for Speaker Recognition
C. Mokbel, H. Greige, R. Zantout, H. Abi Akl, A. Ghaoui, J. Chalhoub, R. Bayeh
University of Balamand - ELISA
Outline
• Introduction
• Baseline speaker recognition system
• NIST 2002 evaluation
• Conclusion and perspective
C. Mokbel - UOB - NIST2002
Introduction
• A baseline system was built and used in the NIST 2002 speaker recognition evaluation
• GMM-based system
• Score normalization using z-norm
• Adaptation technique used to estimate the speaker model starting from the world model
Baseline Speaker Recognition System
• Feature extraction:
• Feature vectors of the kind used in speech recognition
• 13 MFCC coefficients, including the energy on a logarithmic scale
• Plus first- and second-order derivatives
• Giving 39 feature parameters per frame
• Preprocessing using cepstral mean normalization
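The feature pipeline on this slide can be illustrated with a minimal sketch. The slides do not give the derivative formula, so the two-frame difference below is an assumption, as are the function names and the toy input. MFCC extraction itself is not shown.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every coefficient."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def deltas(frames):
    """Assumed derivative: (x[t+1] - x[t-1]) / 2, edge frames repeated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def build_features(static):
    """Stack static, first- and second-order coefficients: 13 -> 39."""
    normalized = cepstral_mean_normalize(static)
    d1 = deltas(normalized)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(normalized, d1, d2)]


# Toy input: 5 frames of 13 made-up coefficients (not real MFCCs).
static = [[float(t + c) for c in range(13)] for t in range(5)]
features = build_features(static)
```

Each row of `features` holds 39 values: 13 mean-normalized static coefficients followed by their 13 first-order and 13 second-order derivatives.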
Baseline Speaker Recognition System
• GMM modeling for both hypotheses: speaker and non-speaker (world)
• EM algorithm (Baum-Welch) to train the world model
• Initialization using LBG vector quantization
• Speaker model: mean vectors adapted from the world model
• Approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, Vol. 9, No. 4, May 2001)
Baseline Speaker Recognition System
• Speaker adaptation:
• World model Gaussian distributions are grouped in a binary tree
• Gaussian classes are determined from the speaker data
• MLLR is applied based on these classes; only the means of the Gaussian distributions are adapted
• MAP is applied to the Gaussian distributions at the leaves
Baseline Speaker Recognition System
• Building the Gaussian tree bottom-up:
• The closest Gaussian distributions are grouped two by two
• The distance between two Gaussian distributions is the loss in likelihood of the associated data when the two Gaussians are merged into a single Gaussian
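The merge distance can be sketched as follows. This assumes diagonal-covariance Gaussians, each summarized by an occupancy count, a mean vector and a variance vector, with all parameters at their maximum-likelihood values. The slides do not state these details, and the function names are illustrative.

```python
import math


def merge(n1, mean1, var1, n2, mean2, var2):
    """Moment-matched merge of two diagonal Gaussians with counts n1, n2."""
    n = n1 + n2
    mean = [(n1 * a + n2 * b) / n for a, b in zip(mean1, mean2)]
    var = [
        (n1 * (v1 + a * a) + n2 * (v2 + b * b)) / n - m * m
        for a, v1, b, v2, m in zip(mean1, var1, mean2, var2, mean)
    ]
    return n, mean, var


def merge_distance(n1, mean1, var1, n2, mean2, var2):
    """Log-likelihood lost by modeling both data sets with one Gaussian."""
    n, _, var = merge(n1, mean1, var1, n2, mean2, var2)
    return 0.5 * sum(
        n * math.log(v) - n1 * math.log(v1) - n2 * math.log(v2)
        for v, v1, v2 in zip(var, var1, var2)
    )


# Identical Gaussians lose nothing; distant ones lose more.
same = merge_distance(10, [0.0], [1.0], 10, [0.0], [1.0])
far = merge_distance(10, [0.0], [1.0], 10, [2.0], [1.0])
```

Under these assumptions, a bottom-up tree would be built by repeatedly merging the pair with the smallest `merge_distance`.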
Baseline Speaker Recognition System
• After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated up through the tree to the root
• Going from the root to the leaves, a node is selected whenever one of its two children has a weight below a threshold
• The selected nodes define a partition that is used in the MLLR algorithm
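The selection rule can be sketched on a small hand-built tree. The `Node` class, the leaf weights and the threshold value are hypothetical. Treating a leaf that is reached as its own class is also an assumption, since the slide does not cover that case.

```python
class Node:
    """Binary tree node; leaves carry the E-step occupancy weight."""

    def __init__(self, name, weight=0.0, left=None, right=None):
        self.name = name
        self.weight = weight
        self.left = left
        self.right = right


def propagate(node):
    """Sum the leaf weights upward so every node holds its subtree total."""
    if node.left is None:
        return node.weight
    node.weight = propagate(node.left) + propagate(node.right)
    return node.weight


def select_classes(node, threshold):
    """Top-down: keep a node when either child falls below the threshold."""
    if node.left is None:
        return [node]
    if node.left.weight < threshold or node.right.weight < threshold:
        return [node]
    return select_classes(node.left, threshold) + select_classes(
        node.right, threshold
    )


# Toy tree with four leaves and made-up weights.
ab = Node("ab", left=Node("a", 5.0), right=Node("b", 4.0))
cd = Node("cd", left=Node("c", 6.0), right=Node("d", 1.0))
root = Node("root", left=ab, right=cd)
propagate(root)
classes = [n.name for n in select_classes(root, threshold=3.0)]
```

Here leaf `d` has too little data, so `c` and `d` share one class at node `cd`, while `a` and `b` each get their own.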
Baseline Speaker Recognition System
• MAP algorithm:
• The Gaussian mean parameters estimated at the leaves are smoothed, using a fixed weight, with the parameters of the corresponding world Gaussians
Baseline Speaker Recognition System
• Given a target speaker model $\lambda_s$, the world model $\lambda_w$ and a test utterance $X$, the score for this utterance is computed as the log-likelihood ratio: $s = \log \frac{p(X \mid \lambda_s)}{p(X \mid \lambda_w)}$
• This score should be normalized because the world model is not precise
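The log-likelihood ratio can be sketched for diagonal-covariance GMMs. The representation of a model as a list of `(weight, mean, variance)` tuples, the toy parameters and the function names are assumptions for illustration. Frames are treated as independent, so the utterance log-likelihood is the sum over frames.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log sum_k w_k N(x; mean_k, var_k)."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in gmm]
        peak = max(logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total


def llr_score(frames, speaker, world):
    """s = log p(X | speaker model) - log p(X | world model)."""
    return gmm_log_likelihood(frames, speaker) - gmm_log_likelihood(frames, world)


# Toy one-dimensional models: the speaker model moves one world mean.
world = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
speaker = [(0.5, [1.0], [1.0]), (0.5, [4.0], [1.0])]
target_like = llr_score([[1.0], [1.2], [0.8]], speaker, world)
impostor_like = llr_score([[0.0], [-0.2]], speaker, world)
```

Frames close to the adapted mean give a positive score, and frames that fit the world model better give a negative one.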
Baseline Speaker Recognition System
• Normalization using the z-norm:
• A few impostor utterances are used
• A score is computed for every utterance
• These scores define a distribution for each target speaker
• The target speakers' distributions should be similar so that a decision can use a single threshold
• Center and scale the distribution: $n_s = a \cdot s + b$
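In the usual z-norm formulation, the coefficients are $a = 1/\sigma$ and $b = -\mu/\sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the impostor scores against one target model. The sketch below assumes that formulation and uses the population standard deviation. The scores are made up.

```python
import statistics


def znorm_params(impostor_scores):
    """Return (a, b) such that a * s + b centers and scales the scores."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return 1.0 / sigma, -mu / sigma


def znorm(score, a, b):
    """Normalized score ns = a * s + b."""
    return a * score + b


# Made-up impostor scores against one target speaker model.
impostor_scores = [-3.0, -3.0, -1.0, -1.0]
a, b = znorm_params(impostor_scores)
normalized = [znorm(s, a, b) for s in impostor_scores]
test_score = znorm(0.5, a, b)
```

After normalization, the impostor scores of every target speaker have zero mean and unit variance, which is what makes a single global threshold meaningful.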
Baseline Speaker Recognition System
• Based on the data from the 2001 evaluation, a DET curve can be plotted
• Find the optimal decision threshold, the one that minimizes the cost defined for NIST 2002: $C_{Det} = C_{Miss} \cdot P_{Miss|Target} \cdot P_{Target} + C_{FalseAlarm} \cdot P_{FalseAlarm|NonTarget} \cdot (1 - P_{Target})$
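The threshold search can be sketched as a sweep over the observed development scores. The slides do not state the cost parameters. The defaults below (miss cost 10, false-alarm cost 1, target prior 0.01) are the values commonly quoted for NIST evaluations of that period and should be treated as an assumption. The scores are toy data.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Cdet = Cmiss * Pmiss * Ptarget + Cfa * Pfa * (1 - Ptarget)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(target_scores, nontarget_scores):
    """Try each observed score as a threshold; accept when score >= threshold."""
    best = None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        cost = detection_cost(p_miss, p_fa)
        if best is None or cost < best[1]:
            best = (thr, cost)
    return best


# Toy, perfectly separable development scores.
threshold, cost = best_threshold([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
example_cost = detection_cost(p_miss=0.1, p_fa=0.05)
```

With a small target prior, false alarms dominate the cost, so the minimizing threshold tends to sit high on the score axis.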
NIST 2002 evaluation
• Feature vector: 13 MFCCs + 13 Δ + 13 ΔΔ
• Cepstral mean normalization
• Gender-dependent GMM with 256 Gaussian components for the world model
• Trained on a subset of the cellular data of the NIST 2001 evaluation
NIST 2002 evaluation
• Target speaker model adapted from the world model
• At every iteration, after the E-step:
• A threshold (cumulative probability = 3.0) selects the tree nodes
• MLLR updates the Gaussian means
• An approximated MAP smooths the MLLR-estimated parameters: a linear combination of the MLLR-estimated mean (weight 0.8) and the world (a priori) mean (weight 0.2)
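The smoothing step reduces to a fixed-weight interpolation per Gaussian mean. A minimal sketch, with hypothetical names and toy values, using the 0.8 / 0.2 weights given on the slide:

```python
def smooth_means(mllr_means, world_means, mllr_weight=0.8):
    """Interpolate each MLLR-estimated mean with its world (prior) mean."""
    prior_weight = 1.0 - mllr_weight
    return [
        [mllr_weight * m + prior_weight * w for m, w in zip(mv, wv)]
        for mv, wv in zip(mllr_means, world_means)
    ]


# One two-dimensional Gaussian: MLLR moved the mean from (0, 0) to (1, 2).
smoothed = smooth_means([[1.0, 2.0]], [[0.0, 0.0]])
```

The fixed weight stands in for the data-dependent weight of a full MAP estimate, which is why the slide calls it an approximation.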
NIST 2002 evaluation
• 16 male and 21 female speakers (NIST 2001) used as impostors (about 8 test files from each)
• The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker
• Global threshold estimated on NIST 2001 data to minimize the cost
NIST 2002 evaluation
• System characteristics:
• CPU time on a Pentium III 800 MHz: 2.1 ms per frame and per speaker for speaker model adaptation; 0.92 ms per frame for the test
• Memory usage: ~360 Kbytes per test
NIST 2002 evaluation
• Results:
• Cdet = 0.100292
• Min Cdet = 0.097833
• DET curve: [figures not recovered]
Conclusions and perspectives
• A new baseline system has been developed and evaluated
• Much work remains, mainly:
• Optimize the feature extraction module
• Implement the complete unified adaptation approach
• Investigate new normalization strategies
• Integrate automatic labeling of speech segments