IAFPA 2007 Plymouth, July 22-25, 2007 Developments in automatic speaker recognition at the BKA Michael Jessen, Bundeskriminalamt Franz Broß, Univ. Applied Sciences Koblenz Stefan Gfroerer, Bundeskriminalamt
Some of our motivations for developing automatic speaker recognition • For about ten years, general automatic speaker recognition technology has been adapted to meet the demands of forensic applications. • Substantial increase in casework involving foreign languages; automatic speaker recognition is claimed to be language-independent. • Using automatic speaker recognition as a check against errors in traditional auditory-acoustic speaker identification (cf. the collaborative exercise by Tina Cambier-Langeveld).
Stage 1 (2002): Developing a standard automatic speaker recognition system • Material: • Various lab-speech data sets from U Koblenz with 82 male speakers • Lab-speech experiment „Pool 2010“ at the BKA with 100 male speakers in systematically varied conditions, including Lombard speech • Methods: • Standard deviation of LPC-cepstral coefficients as speaker features • Calculating intra-speaker and inter-speaker distances (ca. 12,000 distances in the Koblenz material, 80,000 in the BKA material) • Noise reduction with a Wiener filter • Speech-pause detection • Results and perspective: • Error rates too high for forensic applications (EER of 28% even for good-quality material!) • Implementation of a GMM-based approach necessary
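The Stage 1 approach can be illustrated with a minimal sketch: a recording is reduced to the per-coefficient standard deviation of its cepstral coefficients, and recordings are compared by the logarithm of the Euclidean distance between these signatures. This is an illustration under stated assumptions, not the BKA/Koblenz implementation; the feature arrays here are synthetic random data standing in for real LPC-cepstra.

```python
import numpy as np

def cepstral_stddev_signature(cepstra):
    """Per-coefficient standard deviation over all frames.
    cepstra: (n_frames, n_coeffs) array of LPC-cepstral coefficients."""
    return np.std(cepstra, axis=0)

def log_euclidean_distance(sig_a, sig_b):
    """Log of the Euclidean distance between two signatures, matching the
    logarithmic distance axis used when plotting the distance histograms."""
    return np.log(np.linalg.norm(sig_a - sig_b))

# Toy data: two recordings of the "same" speaker vs. a different speaker
rng = np.random.default_rng(0)
spk1_a = rng.normal(0.0, 1.0, (500, 12))
spk1_b = rng.normal(0.0, 1.0, (500, 12))
spk2   = rng.normal(0.0, 2.0, (500, 12))   # different variability profile

d_intra = log_euclidean_distance(cepstral_stddev_signature(spk1_a),
                                 cepstral_stddev_signature(spk1_b))
d_inter = log_euclidean_distance(cepstral_stddev_signature(spk1_a),
                                 cepstral_stddev_signature(spk2))
print(d_intra < d_inter)  # intra-speaker distance is smaller
```

Pooling many such pairwise distances across speakers yields the intra- and inter-speaker distance distributions shown on the next slide.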
[Figure: histograms of intra-speaker and inter-speaker distances; y-axis: estimated probability, x-axis: logarithmic Euclidean distance]
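The equal error rate quoted for Stage 1 is the operating point where the two distance distributions trade off equally: the threshold at which the false-acceptance rate (inter-speaker distances below threshold) equals the false-rejection rate (intra-speaker distances above threshold). A minimal sketch, using hypothetical Gaussian score samples rather than the actual BKA distances:

```python
import numpy as np

def equal_error_rate(intra, inter):
    """EER for distance-type scores (small = similar). Sweeps all observed
    values as thresholds and returns the rate where FAR and FRR meet."""
    best_gap, eer = 1.0, 0.0
    for t in np.sort(np.concatenate([intra, inter])):
        far = np.mean(inter <= t)   # different speakers wrongly accepted
        frr = np.mean(intra > t)    # same speaker wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

rng = np.random.default_rng(1)
intra = rng.normal(0.0, 1.0, 1000)   # hypothetical intra-speaker distances
inter = rng.normal(2.0, 1.0, 1000)   # hypothetical inter-speaker distances
print(round(equal_error_rate(intra, inter), 3))
```

For these two unit-variance distributions two standard deviations apart, the EER comes out near 16%; the closer and broader the real distributions, the higher the EER.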
Stage 2 (2003/04): Improving forensic significance • Material: • Increasing forensic relevance by • re-recording the Pool 2010 data via real GSM transmissions • adding background noise, including natural noise (e.g. traffic, a river) • Methods: • Using MFCC and GMM • Different enhancements (Wiener filtering for aperiodic disturbances, adaptive deconvolution for periodic ones) • Calculation of distances between speech samples and GMMs • Adding world model compensation • Results: • Reduction of the equal error rate down to 0.01%! The better EER is mainly due to • better enhancement • GMMs based on data from several speaking styles, incl. Lombard speech • world model compensation (from 3% to 0.01%)
Stage 3 (2005/06): Learning from the professionals • Selecting and processing a collection of authentic case data, including from U Trier (Köster) • Applying BATVOX to the case data • Getting to know the ASPIC* system from EPFL** (Drygajlo, Meuwly, Alexander etc.) • Project in which the BKA/Koblenz system (SPES***) was supplemented by procedures from ASPIC • Testing this new system with case data • Testing this new system with the NFI-TNO test * ASPIC = Automatic Speaker Individualisation by Computer **EPFL = École Polytechnique Fédérale de Lausanne (Swiss Federal Institute of Technology) ***SPES = Sprechererkennungssystem (speaker identification system)
Stage 4 (2006/07): Further technological developments • Using PLPCC (Perceptual Linear Prediction Cepstral Coefficients) instead of MFCC, inspired by the RASTA-PLP features used by Drygajlo et al. This change of parameters led to significant improvements (EER from 27% to 17%). • Optimisations, including • increasing the number of feature vectors along with the number of GMM components • experimenting with different sizes and roll-off values of windows in the frequency domain • improvement through averaging across different runs with different parameter settings • Currently: implementing delta features
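Delta features, mentioned as the current work item, append the local time derivative of each cepstral coefficient to the feature vector. A minimal sketch of the standard regression formula, d_t = Σ_n n·(c_{t+n} − c_{t−n}) / (2·Σ_n n²); the function name and edge handling here are illustrative choices, not taken from SPES:

```python
import numpy as np

def delta_features(feats, N=2):
    """First-order delta (regression) coefficients over a +/- N frame
    window. feats: (n_frames, n_coeffs); edges are handled by repeating
    the first/last frame."""
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        # c_{t+n} - c_{t-n}, vectorised over all frames t at once
        deltas += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return deltas / denom

# On a linear ramp the interior delta is exactly the slope
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = delta_features(ramp)
print(d[5, 0])  # prints 1.0
```

Stacking `feats` and `delta_features(feats)` column-wise gives the dynamic feature vectors typically fed to the GMMs.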
Fully automatic vs. manually-guided automatic speaker recognition
Casework experience (conclusion) • Much better performance is possible for lab speech data than for real case data. • The more varied the suspect material, the better the applicability of the Drygajlo normalisation and the better the results. It is therefore an advantage to have control over the recording or compilation of the suspect material and to practice a manually-guided approach. • Impostor tests are useful, i.e. including edits from a non-relevant conversation partner in the analysis as a control. • Whether the method is applied to German or another language has so far made no practical difference.
Casework experience (continued) • Only in about 1/3 of the cases can automatic speaker recognition be applied at all; otherwise the signals are too poor or too short, or there are technical/behavioural mismatches between the questioned and the suspect speaker. • Where applicable, the results are usually congruent with the results from the auditory-acoustic method. • Discussion: what to do in the case of non-congruent results?
[Figure: distributions of within-speaker and between-speaker similarities with the evidence score marked; y-axis: estimated probability density, x-axis: log(LR)]
Direct method vs. Drygajlo normalisation* • Drygajlo normalisation: intra-speaker variability is modelled from different recordings of the same speaker in the case (the suspect) • Direct method: intra-speaker variability is modelled from the within-speaker comparisons in a population of speakers unrelated to the case * In collaboration with Didier Meuwly, Anil Alexander et al.
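Both methods evaluate the evidence score against two similarity distributions and report a likelihood ratio: the density of the observed score under the within-speaker (same-source) distribution divided by its density under the between-speaker (different-source) distribution. Under the Drygajlo normalisation the within-speaker scores come from comparisons among the suspect's own recordings; under the direct method, from within-speaker comparisons in an unrelated population. A minimal sketch with hypothetical score samples, using kernel density estimates for the two distributions:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Hypothetical similarity scores (higher = more similar)
within_scores  = rng.normal(5.0, 1.0, 200)   # suspect vs. suspect recordings
between_scores = rng.normal(2.0, 1.0, 500)   # suspect vs. population speakers
evidence = 4.5                               # questioned vs. suspect score

# LR = p(evidence | same speaker) / p(evidence | different speakers),
# with each density estimated from the corresponding score sample
p_same = gaussian_kde(within_scores)(evidence)[0]
p_diff = gaussian_kde(between_scores)(evidence)[0]
log_lr = np.log10(p_same / p_diff)
print(log_lr > 0)  # positive log(LR) supports the same-speaker hypothesis
```

The choice of which recordings feed `within_scores` is exactly what separates the two methods; the LR arithmetic itself is identical.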
Test results with case data
[Chart: rates of correct classification, incorrect classification, and non liquet for SPES with the direct method vs. SPES with Drygajlo normalisation]
General result: compared with the result achieved with lab data (0.01% EER), the EER on real case data rose to 35%, improving to 28% over the period from 2005 to early 2006.
Analysis of discrepancies between the automatic (SPES) and auditory-acoustic (BKA) methods • SPES identical – BKA non-identical: 2x (voices similar, but linguistic/phonetic differences) • SPES identical – BKA non liquet: 2x • SPES non-identical – BKA identical: 2x • SPES non-identical – BKA non liquet: 2x • SPES non liquet – BKA identical: 4x • SPES non liquet – BKA non-identical: 1x Where SPES gave a discrepant non-identical or non liquet result, the cause was poor technical quality, very short duration, or a technical/behavioural mismatch between the questioned and the suspect speaker.
NFI/TNO test
[Figure: NFI/TNO test results with MFCC vs. with PLPCC]
[Diagram: Drygajlo normalisation flow — the questioned recording (spontaneous), recordings of the suspect (spontaneous and read speech; R or SDB, C or TDB), and a speech data corpus (read and spontaneous speech; population read-speech samples only) feed the within-speaker and between-speaker similarity distributions from which the evidence is evaluated]