
A Baseline System for Speaker Recognition



Presentation Transcript


  1. A Baseline System for Speaker Recognition • C. Mokbel, H. Greige, R. Zantout, H. Abi Akl, A. Ghaoui, J. Chalhoub, R. Bayeh • University of Balamand - ELISA

  2. Outline • Introduction • Baseline speaker recognition system • NIST 2002 evaluation • Conclusions and perspectives C. Mokbel - UOB - NIST2002

  3. Introduction • A baseline system was built and used in the NIST 2002 speaker recognition evaluation • GMM-based system • Score normalization using z-norm • An adaptation technique estimates each speaker model starting from the world model

  4. Baseline Speaker Recognition System • Feature extraction: • Feature vectors of the kind used in speech recognition • 13 MFCCs, including the energy on a logarithmic scale • Plus first- and second-order derivatives • Giving 39 feature parameters per frame • Preprocessing using cepstral mean normalization
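
The 13 -> 39 expansion and the cepstral mean normalization described on this slide can be sketched as follows. This is a minimal illustration, not the authors' front end: the regression window of +/- 2 frames, the edge clamping, and the choice to normalize the static coefficients before taking derivatives are assumptions, and the function names are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every coefficient."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def deltas(frames, window=2):
    """Regression-based time derivative over +/- `window` frames.

    Frames beyond the utterance edges are replaced by the first/last frame
    (an assumption; other edge policies exist).
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def build_features(static):
    """Stack static, first- and second-order derivatives: 13 -> 39 per frame."""
    static = cepstral_mean_normalize(static)
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 10 frames of 13 "MFCCs" that grow linearly in time.
static = [[float(t * (c + 1)) for c in range(13)] for t in range(10)]
feats = build_features(static)
```

Because the derivatives are unaffected by subtracting a constant, applying the mean normalization before or after the derivative computation gives the same derivative values.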

  5. Baseline Speaker Recognition System • GMM modeling for both hypotheses: speaker and non-speaker (world) • EM algorithm (Baum-Welch) to train the world model • Initialization using LBG VQ • Speaker model: mean vectors adapted from the world model • Approximation of the “unified adaptation approach” (“Online Adaptation of HMMs to Real-Life Conditions: A Unified Framework”, IEEE Trans. on SAP, vol. 9, no. 4, May 2001)
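
One EM iteration for a GMM of the kind used as the world model might look like the sketch below. It assumes diagonal covariances and a variance floor, neither of which is stated on the slide, and it starts from hand-picked parameters rather than the LBG VQ initialization the authors used. All names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration.

    Returns the updated (weights, means, variances) and the average
    log-likelihood per frame under the parameters passed in.
    """
    n_mix, dim = len(weights), len(data[0])
    occ = [0.0] * n_mix
    first = [[0.0] * dim for _ in range(n_mix)]
    second = [[0.0] * dim for _ in range(n_mix)]
    total_ll = 0.0
    for x in data:
        # E-step: posterior probability of each component for this frame.
        logp = [
            math.log(weights[m]) + log_gauss_diag(x, means[m], variances[m])
            for m in range(n_mix)
        ]
        norm = logsumexp(logp)
        total_ll += norm
        for m in range(n_mix):
            g = math.exp(logp[m] - norm)
            occ[m] += g
            for d in range(dim):
                first[m][d] += g * x[d]
                second[m][d] += g * x[d] * x[d]
    # M-step: re-estimate weights, means and variances from the statistics.
    new_w, new_mu, new_var = [], [], []
    for m in range(n_mix):
        new_w.append(occ[m] / len(data))
        if occ[m] < 1e-10:  # component received no data: keep old parameters
            new_mu.append(list(means[m]))
            new_var.append(list(variances[m]))
            continue
        mu = [first[m][d] / occ[m] for d in range(dim)]
        var = [max(second[m][d] / occ[m] - mu[d] ** 2, var_floor) for d in range(dim)]
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var, total_ll / len(data)


# Toy 1-D data with two clusters and a deliberately rough starting point.
data = [[-2.1], [-1.9], [-2.0], [2.0], [1.9], [2.1]]
w, mu, var = [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]
lls = []
for _ in range(10):
    w, mu, var, ll = em_step(data, w, mu, var)
    lls.append(ll)
```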

  6. Baseline Speaker Recognition System • Speaker adaptation: • World model Gaussian distributions are grouped in a binary tree • The Gaussian classes are determined from the speaker data • MLLR is applied per class: only the means of the Gaussian distributions are adapted • MAP is applied to the Gaussian distributions at the leaves
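
For one class of Gaussians, a mean-only MLLR transform can be estimated as sketched below. This follows the standard closed-form solution for diagonal covariances (one small linear system per feature dimension); the slides do not say which MLLR variant the authors implemented, so the formulation, the function names, and the tiny linear solver are all assumptions for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    Fails with a division by zero if the system is singular, which happens
    when the class has too few Gaussians for the dimension.
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_mllr(means, variances, occ, first):
    """Estimate W = [b A] so that the adapted mean is A @ mu + b.

    means, variances: world-model Gaussians in the class (diagonal covariances)
    occ[m]:   sum over speaker frames of the posterior of Gaussian m
    first[m]: posterior-weighted sum of the speaker frames for Gaussian m
    """
    dim = len(means[0])
    ext = [[1.0] + list(mu) for mu in means]  # extended means [1, mu]
    rows = []
    for i in range(dim):
        g = [[0.0] * (dim + 1) for _ in range(dim + 1)]
        k = [0.0] * (dim + 1)
        for m, xi in enumerate(ext):
            inv_var = 1.0 / variances[m][i]
            for p in range(dim + 1):
                k[p] += inv_var * first[m][i] * xi[p]
                for q in range(dim + 1):
                    g[p][q] += inv_var * occ[m] * xi[p] * xi[q]
        rows.append(solve_linear(g, k))
    return rows


def apply_mllr(w, mean):
    """Apply W to one mean vector; variances and weights stay untouched."""
    ext = [1.0] + list(mean)
    return [sum(wi * e for wi, e in zip(row, ext)) for row in w]


# Toy check: statistics generated from a known transform are recovered.
means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
variances = [[1.0, 2.0] for _ in means]
occ = [10.0, 20.0, 5.0, 8.0]
true_a, true_b = [[1.1, 0.2], [-0.1, 0.9]], [0.5, -0.3]
target = [
    [true_b[i] + true_a[i][0] * m[0] + true_a[i][1] * m[1] for i in range(2)]
    for m in means
]
first = [[n * t for t in tm] for n, tm in zip(occ, target)]
w = estimate_mllr(means, variances, occ, first)
```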

  7. Baseline Speaker Recognition System • Building the Gaussian tree bottom-up: • The closest Gaussian distributions are grouped two by two • The distance between two Gaussian distributions is the loss in likelihood of the associated data if the two Gaussians are merged into a single Gaussian
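
The likelihood-loss distance can be written in closed form under two assumptions that the slide does not spell out: covariances are diagonal, and each Gaussian is the maximum-likelihood fit to the data associated with it (with occupation counts n1 and n2). The constant terms then cancel and only the log-variances remain. A sketch with hypothetical names:

```python
import math


def merge(n1, mu1, var1, n2, mu2, var2):
    """Moment-matched merge of two diagonal Gaussians with counts n1, n2."""
    n = n1 + n2
    mu = [(n1 * a + n2 * b) / n for a, b in zip(mu1, mu2)]
    var = [
        (n1 * (v1 + a * a) + n2 * (v2 + b * b)) / n - m * m
        for a, v1, b, v2, m in zip(mu1, var1, mu2, var2, mu)
    ]
    return n, mu, var


def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Log-likelihood lost on the associated data when the two are merged."""
    n, _, var = merge(n1, mu1, var1, n2, mu2, var2)
    return 0.5 * (
        n * sum(math.log(v) for v in var)
        - n1 * sum(math.log(v) for v in var1)
        - n2 * sum(math.log(v) for v in var2)
    )


def closest_pair(gaussians):
    """Return (loss, i, j) for the pair whose merge loses the least likelihood.

    Each entry of `gaussians` is a (count, mean, variance) tuple.
    """
    best = None
    for i in range(len(gaussians)):
        for j in range(i + 1, len(gaussians)):
            loss = merge_loss(*gaussians[i], *gaussians[j])
            if best is None or loss < best[0]:
                best = (loss, i, j)
    return best


gaussians = [(1.0, [0.0], [1.0]), (1.0, [0.1], [1.0]), (1.0, [5.0], [1.0])]
loss, i, j = closest_pair(gaussians)
```

Building the tree bottom-up then amounts to repeatedly merging the closest pair and treating the merged Gaussian as a new node until a single root remains.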

  8. Baseline Speaker Recognition System • After the E-step of the EM algorithm, the weights associated with the leaves of the tree are propagated up through the tree to the root • Going from the root to the leaves, a node is selected whenever one of its two children has a weight below a threshold • The selected nodes define a partition that is used by the MLLR algorithm
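
The propagation and top-down selection described on this slide can be sketched as below. The `Node` class and function names are hypothetical, and treating a leaf that is reached during the descent as its own class is an assumption about a case the slide does not cover.

```python
class Node:
    """Binary tree node; a leaf carries the index of one world-model Gaussian."""

    def __init__(self, left=None, right=None, leaf_id=None):
        self.left, self.right, self.leaf_id = left, right, leaf_id
        self.weight = 0.0


def propagate(node, leaf_weights):
    """Sum the leaf weights (E-step occupations) upward to the root."""
    if node.leaf_id is not None:
        node.weight = leaf_weights[node.leaf_id]
    else:
        node.weight = propagate(node.left, leaf_weights) + propagate(node.right, leaf_weights)
    return node.weight


def leaves(node):
    if node.leaf_id is not None:
        return [node.leaf_id]
    return leaves(node.left) + leaves(node.right)


def select_classes(node, threshold):
    """Descend from the root; stop at a node when a child is too light.

    Returns a partition of the leaf indices, one list per selected node.
    """
    if node.leaf_id is not None:
        return [[node.leaf_id]]
    if node.left.weight < threshold or node.right.weight < threshold:
        return [leaves(node)]
    return select_classes(node.left, threshold) + select_classes(node.right, threshold)


# Toy tree over four Gaussians: ((0, 1), (2, 3)).
root = Node(
    Node(Node(leaf_id=0), Node(leaf_id=1)),
    Node(Node(leaf_id=2), Node(leaf_id=3)),
)
total = propagate(root, [5.0, 4.0, 0.5, 6.0])
classes = select_classes(root, threshold=3.0)
```

In this toy case Gaussian 2 has too little data on its own, so it shares a transform with Gaussian 3, while Gaussians 0 and 1 each get their own class.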

  9. Baseline Speaker Recognition System • MAP algorithm: • The Gaussian mean parameters estimated at the leaves are smoothed, using a fixed weight, with the parameters of the corresponding world Gaussian

  10. Baseline Speaker Recognition System • Given a target speaker model $\lambda_s$, the world model $\lambda_w$ and a test utterance $X$, the score for this utterance is computed as the log-likelihood ratio: $s = \log [p(X \mid \lambda_s) / p(X \mid \lambda_w)]$ • This score should be normalized because the world model is not precise
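
The log-likelihood ratio score can be computed as in the sketch below, assuming diagonal-covariance GMMs and frames treated as independent so that the utterance log-likelihood is a sum over frames. The `(weights, means, variances)` tuple layout and the function names are hypothetical. Many systems also divide the score by the number of frames; the slide does not say whether the authors did.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(frames, gmm):
    """Total log-likelihood of the frames; gmm = (weights, means, variances)."""
    weights, means, variances = gmm
    total = 0.0
    for x in frames:
        logp = [
            math.log(w) + log_gauss_diag(x, mu, var)
            for w, mu, var in zip(weights, means, variances)
        ]
        top = max(logp)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(v - top) for v in logp))
    return total


def llr_score(frames, speaker_gmm, world_gmm):
    """s = log p(X | speaker model) - log p(X | world model)."""
    return gmm_loglik(frames, speaker_gmm) - gmm_loglik(frames, world_gmm)


# Toy single-component models that differ only in their mean.
world = ([1.0], [[0.0]], [[1.0]])
speaker = ([1.0], [[1.0]], [[1.0]])
score = llr_score([[1.0], [1.0]], speaker, world)
```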

  11. Baseline Speaker Recognition System • Normalization using the z-norm: • A few impostor utterances are used • A score is computed for every utterance • The scores define a distribution for each target speaker • The target speakers' distributions should be similar so that a single threshold can be used for the decision • Center and reduce the distribution: ns = a * s + b
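
Centering and reducing the impostor score distribution fixes the two coefficients: with impostor mean mu and standard deviation sigma, a = 1/sigma and b = -mu/sigma. A minimal sketch, where the use of the sample standard deviation and the function names are assumptions:

```python
import statistics


def znorm_params(impostor_scores):
    """Return (a, b) such that a * s + b has zero mean and unit deviation
    over the impostor scores obtained against one target speaker model."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.stdev(impostor_scores)
    return 1.0 / sigma, -mu / sigma


def znorm(score, a, b):
    return a * score + b


# Toy impostor scores against a single target model.
impostor_scores = [-1.0, 0.0, 1.0, 2.0]
a, b = znorm_params(impostor_scores)
normalized = [znorm(s, a, b) for s in impostor_scores]
```

The coefficients are estimated once per target speaker model and then applied to every test score computed against that model.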

  12. Baseline Speaker Recognition System • Based on the data from the 2001 evaluation, a DET curve can be plotted • Find the optimal decision threshold that minimizes the cost defined for NIST 2002, i.e.: $C_{det} = C_{miss} \cdot \Pr(\mathrm{miss} \mid \mathrm{target}) \cdot \Pr(\mathrm{target}) + C_{FalseAlarm} \cdot \Pr(\mathrm{FalseAlarm} \mid \mathrm{NonTarget}) \cdot (1 - \Pr(\mathrm{target}))$
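
The threshold search can be sketched as a sweep over candidate thresholds on development scores. The slide does not give the cost parameters; the defaults below (miss cost 10, false alarm cost 1, target prior 0.01) are the values commonly quoted for NIST evaluations of that period and should be read as an assumption, as should the function names and the convention that a trial is accepted when its score is at or above the threshold.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost for given miss and false alarm rates."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def best_threshold(target_scores, nontarget_scores, **cost):
    """Return (cost, threshold) minimizing the detection cost on these scores."""
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    candidates.append(candidates[-1] + 1.0)  # threshold that rejects everything
    best = None
    for thr in candidates:
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        value = detection_cost(p_miss, p_fa, **cost)
        if best is None or value < best[0]:
            best = (value, thr)
    return best


# Toy, perfectly separable development scores.
cost, threshold = best_threshold([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```

With these assumed parameters, rejecting every trial costs 0.1 and accepting every trial costs 0.99, which gives a sense of scale for reported cost values.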

  13. NIST 2002 evaluation • Feature vector: 13 MFCCs + 13 first-order derivatives + 13 second-order derivatives • Cepstral mean normalization • Gender-dependent GMMs with 256 Gaussian components for the world model • Trained on a subset of the cellular data of the NIST 2001 evaluation

  14. NIST 2002 evaluation • Target speaker model adapted from the world model • At every iteration, after the E-step: • A threshold (cumulative probability = 3.0) selects the tree nodes • MLLR updates the Gaussian means • An approximated MAP smooths the MLLR-estimated parameters: a linear combination of the MLLR-estimated mean (weight 0.8) and the world (a priori) mean (weight 0.2)
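
The smoothing step is a fixed-weight interpolation of two mean vectors. A minimal sketch using the 0.8 / 0.2 weights from the slide, with hypothetical function names:

```python
def smooth_mean(mllr_mean, world_mean, alpha=0.8):
    """Linear combination: alpha * MLLR estimate + (1 - alpha) * world mean."""
    return [alpha * a + (1.0 - alpha) * w for a, w in zip(mllr_mean, world_mean)]


def smooth_model(mllr_means, world_means, alpha=0.8):
    """Apply the same fixed weight to every Gaussian of the model."""
    return [smooth_mean(a, w, alpha) for a, w in zip(mllr_means, world_means)]


speaker_means = smooth_model([[1.0, 2.0], [0.0, -1.0]], [[0.0, 0.0], [1.0, 1.0]])
```

Unlike a full MAP estimate, the weight here does not depend on how much speaker data each Gaussian received, which is why the slide calls it an approximation.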

  15. NIST 2002 evaluation • 16 male and 21 female speakers (NIST 2001) used as impostors (~8 test files from each) • The pseudo-impostor scores define a distribution used to z-normalize the score for a given target speaker • Global threshold estimated on NIST 2001 data to minimize the cost

  16. NIST 2002 evaluation • System characteristics: • CPU time on a Pentium III 800 MHz: 2.1 ms per frame and per speaker for speaker model adaptation; 0.92 ms per frame for the test • Memory usage: ~360 Kbytes per test
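
To turn the per-frame figures into wall-clock estimates, a frame rate is needed. The slides do not state one; the sketch below assumes 100 frames per second (a 10 ms frame shift, common for MFCC front ends), and the audio durations are purely illustrative.

```python
ADAPT_MS_PER_FRAME = 2.1   # from the slide: adaptation, per frame and per speaker
TEST_MS_PER_FRAME = 0.92   # from the slide: test, per frame


def processing_seconds(audio_seconds, ms_per_frame, frames_per_second=100):
    """CPU seconds needed for `audio_seconds` of speech.

    frames_per_second=100 is an assumption, not a figure from the slides.
    """
    return audio_seconds * frames_per_second * ms_per_frame / 1000.0


adapt_time = processing_seconds(120, ADAPT_MS_PER_FRAME)  # 2 minutes of training speech
test_time = processing_seconds(30, TEST_MS_PER_FRAME)     # 30 seconds of test speech
```

Under that assumption both steps run faster than real time on the stated hardware: roughly 25 s to adapt on two minutes of speech and under 3 s to score a 30 s test segment.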

  17. NIST 2002 evaluation • Results: • Cdet = 0.100292 • Min Cdet = 0.097833 • DET curve:

  18. NIST 2002 evaluation

  19. NIST 2002 evaluation

  20. NIST 2002 evaluation

  21. Conclusions and perspectives • A new baseline system has been developed and evaluated • Much work remains, mainly: • Optimize the feature extraction module • Implement the complete unified adaptation approach • Investigate new normalization strategies • Integrate automatic labeling of speech segments
