Combining Gaussian Mixture Models with Support Vector Machines for Text-independent Speaker Verification
Jamal Kharroubi, Dijana Petrovska-Delacrétaz, Gérard Chollet
(kharroub, petrovsk, chollet)@tsi.enst.fr
Eurospeech, 3-7 September 2001
Overview • Motivations • SVM principles • Previous experiments with SVMs for speaker identification • Our method: combining GMMs and SVMs for speaker verification • Database • Experimental protocol • Results • Conclusions and perspectives
Motivations • Gaussian Mixture Models (GMMs) • State of the art for speaker verification • Support Vector Machines (SVMs) • A new and promising technique in statistical learning theory • Discriminative method • Good performance in image processing and multi-modal authentication • Seem to be "à la mode" at Eurospeech 2001 • Combine GMMs and SVMs for speaker verification
SVM Principles • Pattern classification problem: given a set of labelled training data, learn to classify unlabelled test data • Find decision boundaries that separate the classes while minimising the number of classification errors • SVMs are: • Binary classifiers • Capable of automatically determining the complexity of the decision boundary
SVM principles, cont'd. [Figure: separating hyperplanes H in input space and the optimal hyperplane Ho; the mapping from input space to feature space, where y(X) determines Class(X)]
SVM Theory [Figure: mapping from input space to feature space and the resulting classification function]
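The figure's content reduces to the standard SVM formulation; as a hedged reconstruction of what the slide likely showed (the details are not preserved in the residue): given labelled training data (x_i, y_i) with y_i ∈ {−1, +1} and a mapping Φ from input space to feature space, the classification function is

```latex
\mathrm{Class}(X) = \operatorname{sign}\Big( \sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, K(x_i, X) + b \Big),
\qquad K(x_i, X) = \Phi(x_i) \cdot \Phi(X)
```

where the support vectors SV are the training points with non-zero Lagrange multipliers α_i, and the kernel K evaluates feature-space dot products without computing Φ explicitly.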
Previous experiment with SVMs for Speaker Recognition. Speaker identification with SVMs: Schmidt and Gish, 1996 • Goal: identify one speaker among a given closed set of speakers • The input vectors of the SVMs are the spectral parameters • Slightly better performance with SVMs, using the pairwise classifier • Why these disappointing results? • Too short train/test durations • GMMs perhaps better suited to model the data • GMMs perhaps more robust to channel variation
SVM and Speaker Verification • Difficulty: mismatch in the quantity of labelled data, with more data available for impostor accesses than for true target accesses • Our preliminary test, with speech feature frames as input to the SVM, gave no satisfactory results • Present approach: combine GMMs and SVMs
Our method of combining GMMs and SVMs for Speaker Verification • Usual GMM modelling • An additional labelled development set is needed for the SVM • Construction of the SVM input vector from the GMM models • Model globally, with the SVM, the client-client against the client-impostor accesses (on the development data)
GMM speaker modelling [Diagram: world data → front-end → GMM modelling → world GMM model; target speaker → front-end → GMM modelling or adaptation → target GMM model]
Baseline GMM method [Diagram: test speech → front-end → scored against the hypothesised target GMM model and the world GMM model → LLR score]
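The slide only names the LLR score; the standard formulation for the GMM/world-model setup, given here as a reconstruction (the per-frame averaging convention is an assumption), is

```latex
\mathrm{LLR}(X) = \frac{1}{T} \sum_{t=1}^{T}
\big[ \log p(x_t \mid \lambda_{\mathrm{target}}) - \log p(x_t \mid \lambda_{\mathrm{world}}) \big]
```

where X = (x_1, …, x_T) is the test segment and λ_target, λ_world are the hypothesised target and world GMMs.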
Construction of the SVM input vectors: SVM(max) method • For each frame t of the test segment, frame scores are computed under the hypothesised target GMM model and under the world GMM model • If Gaussian component i gives the maximum score St for frame t, St is added to the ith component of that model's accumulator vector
Construction of the SVM input vectors, cont'd. • The SVM input vector is the concatenation of the target-model and world-model accumulator vectors • Summation and normalization of the SVM input vector by the number of frames T of the test segment (a sketch of this construction follows)
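A minimal numpy sketch of this construction, assuming diagonal-covariance GMMs passed as (weights, means, variances) tuples; the function names are mine, and taking the weighted component likelihood as the frame score St is an assumption where the slides leave the details open:

```python
import numpy as np

def component_likelihoods(frames, weights, means, variances):
    """Weighted likelihood of each frame under each diagonal-covariance
    Gaussian component; frames: (T, d), output: (T, M)."""
    d = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]                    # (T, M, d)
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_like = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    return weights[None, :] * np.exp(log_like)

def svm_max_vector(frames, target_gmm, world_gmm):
    """SVM(max) input vector: for each frame, the score of the best-scoring
    component is added to that component's slot, separately for the target
    and world models; the concatenation is normalised by the frame count T."""
    parts = []
    for gmm in (target_gmm, world_gmm):
        scores = component_likelihoods(frames, *gmm)                 # (T, M)
        acc = np.zeros(scores.shape[1])
        best = scores.argmax(axis=1)                                 # winning component per frame
        np.add.at(acc, best, scores[np.arange(len(best)), best])
        parts.append(acc)
    return np.concatenate(parts) / len(frames)                       # dimension 2 * M
```

With the 128-component models used in the experiments, this yields a 256-dimensional SVM input vector.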
SVM: Train / Test [Diagram: training uses SVM input vectors of the client class and the impostor class to build the SVM classifier; at test time, the test speech goes through SVM input vector construction and the classifier outputs a decision score]
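An illustrative train/test sketch, using scikit-learn's LinearSVC as a stand-in for the SvmFu package mentioned below; the data shapes and variable names are placeholders:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-in development data: one SVM(max) vector per access (2 x 128 dims),
# labelled +1 for client (true target) accesses and -1 for impostor accesses.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(200, 256))
y_dev = np.r_[np.ones(100), -np.ones(100)]

clf = LinearSVC(C=1.0)                  # linear kernel, as in the experiments
clf.fit(X_dev, y_dev)

# At test time, one vector per test segment; the signed distance to the
# separating hyperplane serves as the decision score.
x_test = rng.normal(size=(1, 256))
score = clf.decision_function(x_test)[0]
accept = score > 0
```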
Database. NIST '99 data split into: • Development data = 100 speakers • 2 min of speech per GMM model • Corresponding test data to train the SVM classifier (519 true and 5190 impostor accesses) • World data = 200 speakers • 4 sex/handset-dependent world models • Evaluation data = 100 speakers = 449 true and 4490 impostor accesses
Experimental Protocol: Feature Extraction • LFCC parametrization (32.5 ms windows every 10 ms) • Cepstral mean subtraction for channel compensation • Feature vector dimension is 33 (16 cep, 16 dcep, log E) (delta cepstral features computed on 5-frame windows; both steps are sketched below) • A frame removal algorithm is applied to the feature vectors to discard non-significant frames (bimodal energy distributions)
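A small numpy sketch of the cepstral mean subtraction and the 5-frame delta computation; the LFCC front-end itself is assumed to come from elsewhere, and the regression formula for the deltas is the standard one, not given on the slide:

```python
import numpy as np

def cms(cep):
    """Cepstral mean subtraction: remove the per-utterance mean of each
    coefficient as a simple channel compensation."""
    return cep - cep.mean(axis=0, keepdims=True)

def deltas(cep, half_win=2):
    """Delta features over a (2*half_win + 1)-frame window (5 frames for
    half_win=2), using the usual regression formula with edge padding."""
    T = len(cep)
    idx = np.arange(T)
    num = np.zeros_like(cep)
    for k in range(1, half_win + 1):
        num += k * (cep[np.minimum(idx + k, T - 1)] - cep[np.maximum(idx - k, 0)])
    return num / (2 * sum(k * k for k in range(1, half_win + 1)))

# 16 cepstra + 16 deltas + log energy = 33-dimensional feature vectors.
```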
Experimental Protocol, cont. • GMM speaker and world models • with 128 mixtures • diagonal covariance matrices • Standard EM algorithm with a maximum of 20 iterations • ML and MAP methods to construct the client model • with the ELISA Consortium toolkit • The SVM model was trained on part of the development corpus • A linear kernel is used • with the SvmFu package (a stand-in sketch of the GMM training follows)
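An illustrative stand-in for the GMM training step, using scikit-learn in place of the ELISA Consortium toolkit, with the slide's settings (128 mixtures, diagonal covariances, at most 20 EM iterations); the feature arrays are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
world_features = rng.normal(size=(5000, 33))    # placeholder feature frames
client_features = rng.normal(size=(1000, 33))

def train_gmm(frames):
    """128-mixture, diagonal-covariance GMM trained by EM (max 20 iterations)."""
    return GaussianMixture(n_components=128, covariance_type='diag',
                           max_iter=20).fit(frames)

world_gmm = train_gmm(world_features)
client_gmm_ml = train_gmm(client_features)      # ML training, independent of world
```

MAP training of the client model (adapting the world model's parameters) is not covered by this scikit-learn stand-in.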
[Results figure: GMM(ml) vs. hybrid GMM(ml)+SVM(max), no normalization]
[Results figure: GMM(map) vs. GMM(ml), no normalization]
[Results figure: GMM(map) vs. GMM(map)+SVM(max), no normalization]
Why no improvement with GMM(map)+SVM(max)? • ML training: the speaker model is trained independently of the world model • MAP training: the speaker model is derived by updating the world model's parameters via adaptation • Hence a stronger coupling between the speaker and world models
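The slide does not spell the adaptation out; for reference, the usual relevance-MAP mean update (Reynolds-style adaptation, given here as standard background rather than the authors' exact recipe) is

```latex
\hat{\mu}_i = \alpha_i \, E_i(x) + (1 - \alpha_i) \, \mu_i^{\mathrm{world}},
\qquad \alpha_i = \frac{n_i}{n_i + r}
```

where n_i is the soft count of client frames assigned to component i, E_i(x) their posterior-weighted mean, and r the relevance factor. Since every client component is interpolated with its world counterpart, the two models stay tightly coupled, which is the coupling the slide refers to.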
New SVM input vector representation: SVM(maxRatio) • As in SVM(max), the score St of frame t is added to the ith component of the accumulator vector for the test segment, but St now relates the hypothesised target GMM model to the world GMM model (one reading is given below)
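The residue does not preserve the exact definition of the new score; one plausible reading, stated here as an assumption, is that the accumulated frame score becomes the target-to-world likelihood ratio

```latex
S_t = \frac{p(x_t \mid \lambda_{\mathrm{target}})}{p(x_t \mid \lambda_{\mathrm{world}})}
```

so that the SVM input reflects the same target-versus-world contrast the MAP-adapted models encode.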
[Results figure: GMM(map), GMM(map)+SVM(max), and GMM(map)+SVM(maxRatio), no normalization]
Conclusions and Perspectives • Improvements with our hybrid GMM+SVM method are dependent on the GMM adaptation method • Better (or similar) results are obtained with the GMM+SVM method if an appropriate SVM input vector representation is chosen • Different kernel types and features will be experimented with • Normalization techniques will be applied