IRISA 2003 SPEAKER RECOGNITION SYSTEM
NIST Speaker Recognition Workshop, June 24-25, 2003
1sp DETECTION, Limited Data
M. BEN, G. GRAVIER, A. OZEROV & F. BIMBOT for the ELISA consortium
Outline
• IRISA 2003 system
  - Introduction
  - Description
  - NIST’03 SRE results
• Experiments
  - Front-end
  - Modeling
  - Score normalization
• Conclusions
IRISA 2003 system: Introduction
• IRISA is a member of the ELISA consortium
• The IRISA 2003 system is based on a newly developed audio segmentation software: audioseg
• Web links:
  - IRISA/METISS: http://www.irisa.fr/metiss/accueil.html
  - ELISA consortium: http://elisa.ddl.ish-lyon.cnrs.fr
IRISA 2003 system: Description (front-end)
• 20 ms frames every 10 ms
• 24-filter bank over 340-3400 Hz, giving 16 LFCC
• RASTA filtering (secondary system)
• deltas + delta log-energy are added
• frame selection: bi-gaussian modeling of the energy with ML classification of the frames (speech/silence)
• global feature normalization (zero mean, unit variance)
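The framing step listed above (20 ms frames every 10 ms) can be sketched as follows. This is only an illustration: the 8 kHz sampling rate is an assumption (the slide gives the 340-3400 Hz band but no sampling rate), and `frame_signal` is a hypothetical helper, not a function of audioseg.

```python
def frame_signal(samples, rate_hz=8000, frame_ms=20, shift_ms=10):
    """Cut a sample sequence into overlapping frames (20 ms every 10 ms by default)."""
    frame_len = rate_hz * frame_ms // 1000  # 160 samples at the assumed 8 kHz
    shift = rate_hz * shift_ms // 1000      # 80 samples at the assumed 8 kHz
    # Trailing samples that do not fill a whole frame are dropped.
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, shift)]


# One second of dummy samples, just to show the frame count.
frames = frame_signal(list(range(8000)))
```

With these assumed values, consecutive frames overlap by half their length, so one second of signal yields a little under 100 frames.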
IRISA 2003 system: Description (modeling)
• background modeling
  - gender-dependent background models
  - 256-component GMMs with diagonal covariance matrices
  - primary system: cellular data (NIST’01)
  - secondary system: cellular + landline data (NIST’01)
• speaker models
  - adapted from the background models with MAP estimation of the parameters (mean-only adaptation)
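Mean-only MAP adaptation of a speaker model from a background GMM can be sketched as below. The slide does not give the adaptation formula, so this assumes the common relevance-factor formulation; the value `relevance=16.0` and all function names are illustrative assumptions, not taken from the IRISA system.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def posteriors(x, weights, means, variances):
    """Posterior probability of each GMM component given frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return adapted means; weights and variances stay those of the background model."""
    n_comp, dim = len(means), len(means[0])
    counts = [0.0] * n_comp
    sums = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        for k, gamma in enumerate(posteriors(x, weights, means, variances)):
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    adapted = []
    for k in range(n_comp):
        if counts[k] <= 0.0:
            adapted.append(list(means[k]))  # component never observed: keep prior mean
            continue
        alpha = counts[k] / (counts[k] + relevance)  # data-dependent interpolation weight
        adapted.append([alpha * sums[k][d] / counts[k] + (1.0 - alpha) * means[k][d]
                        for d in range(dim)])
    return adapted
```

Components that see little enrollment data stay close to the background means, which is the usual motivation for this kind of adaptation when training data is limited.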
IRISA 2003 system: Description (scoring)
• frame score: log-likelihood ratio using the 10 best-matching Gaussians in the background model
• utterance score (formula not preserved in this transcript), with N_T the number of frames in the utterance
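The scoring scheme can be sketched as below. Two points are assumptions, since the slide formulas are not available here: the utterance score is taken to be the average of the frame scores over the N_T frames, and the speaker model is taken to share weights and variances with the background model (consistent with mean-only adaptation). Function names are illustrative.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def frame_llr(x, weights, bg_means, spk_means, variances, n_best=10):
    """Log-likelihood ratio of one frame, restricted to the n_best background Gaussians."""
    bg_logs = [math.log(w) + log_gauss_diag(x, m, v)
               for w, m, v in zip(weights, bg_means, variances)]
    # Indices of the Gaussians that best match the frame in the background model.
    best = sorted(range(len(bg_logs)), key=lambda k: bg_logs[k], reverse=True)[:n_best]
    spk_logs = [math.log(weights[k]) + log_gauss_diag(x, spk_means[k], variances[k])
                for k in best]
    return logsumexp(spk_logs) - logsumexp([bg_logs[k] for k in best])


def utterance_score(frames, weights, bg_means, spk_means, variances, n_best=10):
    """Average frame score over the N_T frames of the utterance (assumed form)."""
    return sum(frame_llr(x, weights, bg_means, spk_means, variances, n_best)
               for x in frames) / len(frames)
```

Restricting the sum to the best-matching background Gaussians means only a handful of speaker-model densities need to be evaluated per frame, instead of all 256.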
IRISA 2003 system: Description (score normalization: DT-norm)
• D-norm (formula not preserved in this transcript)
  - D(spk): symmetric Kullback-Leibler distance between the speaker (spk) and the background models
• DT-norm (formula not preserved in this transcript)
  - normalization parameters: mean and standard deviation of the D-norm scores of the test utterance against cohort impostor models (50 male + 50 female, from NIST’01 SRE)
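A possible reading of DT-norm is sketched below. The formulas are not available in this transcript, so the code assumes that D-norm divides the raw score by D(spk) and that DT-norm then standardizes the result with the cohort mean and standard deviation, as T-norm does. The KL distances are assumed to be precomputed and passed in; the use of the population standard deviation is also an assumption.

```python
import statistics


def d_norm(raw_score, kl_distance):
    """Assumed D-norm: raw score divided by the speaker/background KL distance D(spk)."""
    return raw_score / kl_distance


def dt_norm(raw_score, kl_distance, cohort_raw_scores, cohort_kl_distances):
    """Assumed DT-norm: D-norm score standardized by cohort D-norm statistics.

    cohort_raw_scores are the scores of the same test utterance against each
    cohort impostor model; cohort_kl_distances are the D values of those models.
    """
    target = d_norm(raw_score, kl_distance)
    cohort = [d_norm(s, d) for s, d in zip(cohort_raw_scores, cohort_kl_distances)]
    mu = statistics.mean(cohort)
    sigma = statistics.pstdev(cohort)  # population std; the slide does not say which
    return (target - mu) / sigma
```

For example, `dt_norm(2.0, 4.0, [0.1, 0.3], [1.0, 1.0])` standardizes a D-norm score of 0.5 against a two-model toy cohort; the system described here uses 100 cohort models.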
IRISA 2003 system: NIST’03 SRE results (1sp-limited)
• 2 systems submitted:
  - IRI_1 (primary): baseline system
  - IRI_2 (secondary): RASTA front-end, mixed cellular + landline data for world models
[Figure: DET curves]
DCF      min      actual
IRI_1    0.3176   0.3205
IRI_2    0.3333   0.3396
Experiments: Front-end (frame selection)
• speech/silence classification based on a bi-gaussian modeling of the frame energy
• ML classification or threshold-based selection? Threshold: $t = \mu_2 - c\,\sigma_2$, with $c$ a constant coefficient to optimise
[Figure: Gaussians $G_1(\mu_1, \sigma_1)$ and $G_2(\mu_2, \sigma_2)$ plotted over the energy axis]
Experiments: Front-end (frame selection)
• speech/silence classification based on a bi-gaussian modeling of the frame log-energy
• ML classification or threshold-based selection? Threshold: $t = \mu_2 - c\,\sigma_2$, with $c$ a constant coefficient to optimise
[Figure: Gaussians $G_1(\mu_1, \sigma_1)$ and $G_2(\mu_2, \sigma_2)$ plotted over the log-energy axis]
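The two selection rules can be sketched as below. Several points are assumptions: G2 is taken to be the higher-mean (speech) Gaussian, frames above the threshold are kept, ML classification compares the two class densities without priors, and the EM initialization (lower and upper half of the sorted values) is an arbitrary choice. The same code applies to energy or log-energy values.

```python
import math


def log_gauss(x, mean, std):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def fit_bigaussian(values, n_iter=30):
    """Fit a 2-component 1-D GMM with EM; returns [silence, speech] as (weight, mean, std)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    params = []
    for part in (ordered[:half], ordered[half:]):  # crude init: lower / upper half
        mean = sum(part) / len(part)
        std = math.sqrt(sum((v - mean) ** 2 for v in part) / len(part))
        params.append((0.5, mean, max(std, 1e-3)))
    for _ in range(n_iter):
        resp = []  # E-step: component posteriors for every frame
        for v in values:
            logs = [math.log(w) + log_gauss(v, m, s) for w, m, s in params]
            top = max(logs)
            exps = [math.exp(value - top) for value in logs]
            resp.append([e / sum(exps) for e in exps])
        new_params = []  # M-step: re-estimate weight, mean and std
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mean = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - mean) ** 2 for r, v in zip(resp, values)) / nk
            new_params.append((nk / len(values), mean, max(math.sqrt(var), 1e-3)))
        params = new_params
    return sorted(params, key=lambda p: p[1])  # lower mean first (assumed silence)


def select_ml(values, silence, speech):
    """Keep a frame when the speech Gaussian gives it the higher likelihood."""
    return [log_gauss(v, speech[1], speech[2]) > log_gauss(v, silence[1], silence[2])
            for v in values]


def select_threshold(values, speech, c):
    """Keep a frame when its energy exceeds t = mu_2 - c * sigma_2."""
    threshold = speech[1] - c * speech[2]
    return [v > threshold for v in values]
```

A larger `c` lowers the threshold and keeps more frames, which is why the coefficient has to be tuned separately for energy and log-energy.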
Experiments: Front-end (frame selection)
• SYS_fs1: ML selection (E)
• SYS_fs2: optimal threshold-based selection (E): c = 0.8
• SYS_fs3: ML selection (LogE)
• SYS_fs4: optimal threshold-based selection (LogE): c = 2.5
• energy (E) bi-gaussian modeling with ML selection of the frames performs best
• drastic selection: about 50% of the frames are discarded!
Data: NIST ’03 SRE
Experiments: Front-end (feature normalization)
- st-norm: short-term normalization (zero mean, unit variance) on a 3 s sliding window
- lt-norm: long-term normalization (zero mean, unit variance) over all features
• st-norm is applied before frame selection
• lt-norm can be applied before or after frame selection
• SYS_fn1: lt-norm + frame selection
• SYS_fn2: st-norm + frame selection
• SYS_fn3: frame selection + lt-norm
Data: NIST ’02 SRE (subset)
Experiments: Front-end (feature normalization)
• SYS_fn5: frame selection + lt-norm (baseline system, primary)
• SYS_fn6: st-norm + frame selection + lt-norm
• short-term normalization does not seem to work well (buggy?)
• long-term normalization at the end of the front-end seems to be crucial
• best results are obtained with frame selection followed by long-term normalization of the remaining features
Data: NIST ’03 SRE
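The two normalizations compared in these experiments can be sketched as below. The window of 300 frames corresponds to 3 s at the 10 ms frame shift; centering the window on the current frame and shrinking it at the utterance edges are assumptions, as are the function names.

```python
import math


def lt_norm(features):
    """Long-term normalization: zero mean, unit variance per dimension over all frames."""
    n, dim = len(features), len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    # A constant dimension has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n) or 1.0
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in features]


def st_norm(features, window=300):
    """Short-term normalization over a sliding window centered on each frame."""
    half = window // 2
    normalized = []
    for t in range(len(features)):
        start = max(0, t - half)
        segment = features[start:t + half + 1]  # window is truncated at the edges
        normalized.append(lt_norm(segment)[t - start])
    return normalized


def select_then_lt_norm(features, keep_mask):
    """Configuration reported as best: frame selection first, then lt-norm."""
    kept = [f for f, keep in zip(features, keep_mask) if keep]
    return lt_norm(kept)
```

Applying `lt_norm` after selection means the statistics are computed on the retained frames only, so discarded silence does not bias the mean and variance. The `st_norm` loop recomputes the statistics for every frame and is written for clarity, not speed.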
Experiments: Modeling
• Does size matter?
  - SYS_nbg1: 256-component GMMs (baseline)
  - SYS_nbg2: 2048-component GMMs
• no performance gain with 2048 Gaussians in the mixture
• may be due to the frame selection process, which removes a large number of frames (?)
Data: NIST ’02 SRE (subset)
Experiments: Score normalization
• SYS_sn1: no score normalization
• SYS_sn2: T-norm
• SYS_sn3: DT-norm
• SYS_sn4: DZT-norm
• all score normalizations improve performance
• DT-norm seems to perform better than T-norm and DZT-norm at the minimum DCF point
Data: NIST ’02 SRE (subset)
Conclusions
• IRISA participation in NIST’03 SRE
  - validation of the new toolkit audioseg
  - new baseline system performs well
  - frame selection is crucial for good performance
• Perspectives
  - work on feature transformations (PCA, ICA ...)
  - model adaptation on test data
  - hierarchical structural model adaptation