IAFPA 2007 Plymouth, July 22-25, 2007 Developments in automatic speaker recognition at the BKA Michael Jessen, Bundeskriminalamt Franz Broß, Univ. Applied Sciences Koblenz Stefan Gfroerer, Bundeskriminalamt
Some of our motivations for developing automatic speaker recognition • For about ten years, general automatic speaker recognition technology has been adapted to meet the demands of forensic applications. • Substantial increase in casework involving foreign languages; automatic speaker recognition is claimed to be language-independent. • Using automatic speaker recognition as a check against errors in traditional auditory-acoustic speaker identification (cf. the collaborative exercise by Tina Cambier-Langeveld).
Stage 1 (2002): Developing a standard automatic speaker recognition system • Material: • Various lab-speech data sets from U Koblenz with 82 male speakers • Lab-speech experiment „Pool 2010“ at the BKA with 100 male speakers in systematically varied conditions, including Lombard speech • Methods: • Standard deviation of LPC-cepstral coefficients as speaker features • Calculating intra-speaker and inter-speaker distances (ca. 12,000 distances in the Koblenz material, 80,000 in the BKA material) • Noise reduction with a Wiener filter • Speech-pause detection • Results and perspective: • Error rates too high for forensic applications (EER of 28% even for good-quality material!) • Implementation of a GMM-based approach necessary
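The Stage 1 approach can be illustrated with a minimal sketch: a recording is reduced to the per-coefficient standard deviation of its cepstral coefficients, and recordings are compared by the logarithm of the Euclidean distance between these signatures. This is an illustration under stated assumptions, not the BKA/Koblenz implementation; the feature arrays here are synthetic random data standing in for real LPC-cepstra.

```python
import numpy as np

def cepstral_stddev_signature(cepstra):
    """Per-coefficient standard deviation over all frames.
    cepstra: (n_frames, n_coeffs) array of LPC-cepstral coefficients."""
    return np.std(cepstra, axis=0)

def log_euclidean_distance(sig_a, sig_b):
    """Log of the Euclidean distance between two signatures, matching the
    logarithmic distance axis used when plotting the distance histograms."""
    return np.log(np.linalg.norm(sig_a - sig_b))

# Toy data: two recordings of the "same" speaker vs. a different speaker
rng = np.random.default_rng(0)
spk1_a = rng.normal(0.0, 1.0, (500, 12))
spk1_b = rng.normal(0.0, 1.0, (500, 12))
spk2   = rng.normal(0.0, 2.0, (500, 12))   # different variability profile

d_intra = log_euclidean_distance(cepstral_stddev_signature(spk1_a),
                                 cepstral_stddev_signature(spk1_b))
d_inter = log_euclidean_distance(cepstral_stddev_signature(spk1_a),
                                 cepstral_stddev_signature(spk2))
print(d_intra < d_inter)  # intra-speaker distance is smaller
```

Pooling many such pairwise distances across speakers yields the intra- and inter-speaker distance distributions shown on the next slide.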
[Figure: histograms of intra-speaker and inter-speaker distances; y-axis: estimated probability, x-axis: logarithmic Euclidean distance]
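The equal error rate quoted for Stage 1 is the operating point where the two distance distributions trade off equally: the threshold at which the false-acceptance rate (inter-speaker distances below threshold) equals the false-rejection rate (intra-speaker distances above threshold). A minimal sketch, using hypothetical Gaussian score samples rather than the actual BKA distances:

```python
import numpy as np

def equal_error_rate(intra, inter):
    """EER for distance-type scores (small = similar). Sweeps all observed
    values as thresholds and returns the rate where FAR and FRR meet."""
    best_gap, eer = 1.0, 0.0
    for t in np.sort(np.concatenate([intra, inter])):
        far = np.mean(inter <= t)   # different speakers wrongly accepted
        frr = np.mean(intra > t)    # same speaker wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

rng = np.random.default_rng(1)
intra = rng.normal(0.0, 1.0, 1000)   # hypothetical intra-speaker distances
inter = rng.normal(2.0, 1.0, 1000)   # hypothetical inter-speaker distances
print(round(equal_error_rate(intra, inter), 3))
```

For these two unit-variance distributions two standard deviations apart, the EER comes out near 16%; the closer and broader the real distributions, the higher the EER.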
Stage 2 (2003/04): Improving forensic significance • Material: • Increasing forensic relevance by • re-recording the Pool 2010 data via real GSM transmissions • adding background noise, including natural noise (e.g. traffic, a river) • Methods: • Using MFCC and GMM • Different enhancements (Wiener filtering for aperiodic disturbances, adaptive deconvolution for periodic ones) • Calculation of distances between speech samples and GMMs • Adding world model compensation • Results: • Reduction of the equal error rate down to 0.01%! The better EER is mainly due to • better enhancement • GMMs based on data from several speaking styles, incl. Lombard speech • world model compensation (from 3% to 0.01%)
Stage 3 (2005/06): Learning from the professionals • Selecting and processing a collection of authentic case data, including from U Trier (Köster) • Applying BATVOX to the case data • Getting to know the ASPIC* system from EPFL** (Drygajlo, Meuwly, Alexander etc.) • Project in which the BKA/Koblenz system (SPES***) was supplemented by procedures from ASPIC • Testing this new system with case data • Testing this new system with the NFI-TNO test * ASPIC = Automatic Speaker Individualisation by Computer **EPFL = École Polytechnique Fédérale de Lausanne (Swiss Federal Institute of Technology) ***SPES = Sprechererkennungssystem (speaker identification system)
Stage 4 (2006/07): Further technological developments • Using PLPCC (Perceptual Linear Prediction Cepstral Coefficients) instead of MFCC, inspired by the RASTA-PLP features used by Drygajlo et al. This change of parameters led to significant improvements (EER from 27% to 17%). • Optimisations, including • increasing the number of feature vectors along with the number of GMM components • experimenting with different sizes and roll-off values of windows in the frequency domain • improvement through averaging across different runs with different parameter settings • Currently: implementing delta features
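Delta features, mentioned as the current work item, append the local time derivative of each cepstral coefficient to the feature vector. A minimal sketch of the standard regression formula, d_t = Σ_n n·(c_{t+n} − c_{t−n}) / (2·Σ_n n²); the function name and edge handling here are illustrative choices, not taken from SPES:

```python
import numpy as np

def delta_features(feats, N=2):
    """First-order delta (regression) coefficients over a +/- N frame
    window. feats: (n_frames, n_coeffs); edges are handled by repeating
    the first/last frame."""
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        # c_{t+n} - c_{t-n}, vectorised over all frames t at once
        deltas += n * (padded[N + n : N + n + T] - padded[N - n : N - n + T])
    return deltas / denom

# On a linear ramp the interior delta is exactly the slope
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = delta_features(ramp)
print(d[5, 0])  # prints 1.0
```

Stacking `feats` and `delta_features(feats)` column-wise gives the dynamic feature vectors typically fed to the GMMs.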
Fully automatic vs. manually-guided automatic speaker recognition
Casework experience (conclusion) • Much better performance is possible for lab speech data than for real case data. • The more varied the suspect material, the better the applicability of the Drygajlo normalisation and the better the results. It is therefore an advantage to have control over the recording or compilation of the suspect material and to practice a manually-guided approach. • Impostor tests are useful, i.e. including edits from a non-relevant conversation partner in the analysis as a control. • Whether the method is applied to German or another language has so far made no practical difference.
Casework experience (continued) • Only in about 1/3 of the cases can automatic speaker recognition be applied at all; otherwise the signals are too poor or too short, or there are technical/behavioural mismatches between the questioned and the suspect speaker. • Where applicable, the results are usually congruent with the results from the auditory-acoustic method. • Discussion: what to do in the case of non-congruent results?
[Figure: distributions of within-speaker and between-speaker similarities with the evidence score marked; y-axis: estimated probability density, x-axis: log(LR)]
Direct method vs. Drygajlo normalisation* • Drygajlo normalisation: intra-speaker variability is modelled from different recordings of the same speaker in the case (the suspect) • Direct method: intra-speaker variability is modelled from the within-speaker comparisons in a population of speakers unrelated to the case * In collaboration with Didier Meuwly, Anil Alexander et al.
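Both methods evaluate the evidence score against two similarity distributions and report a likelihood ratio: the density of the observed score under the within-speaker (same-source) distribution divided by its density under the between-speaker (different-source) distribution. Under the Drygajlo normalisation the within-speaker scores come from comparisons among the suspect's own recordings; under the direct method, from within-speaker comparisons in an unrelated population. A minimal sketch with hypothetical score samples, using kernel density estimates for the two distributions:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Hypothetical similarity scores (higher = more similar)
within_scores  = rng.normal(5.0, 1.0, 200)   # suspect vs. suspect recordings
between_scores = rng.normal(2.0, 1.0, 500)   # suspect vs. population speakers
evidence = 4.5                               # questioned vs. suspect score

# LR = p(evidence | same speaker) / p(evidence | different speakers),
# with each density estimated from the corresponding score sample
p_same = gaussian_kde(within_scores)(evidence)[0]
p_diff = gaussian_kde(between_scores)(evidence)[0]
log_lr = np.log10(p_same / p_diff)
print(log_lr > 0)  # positive log(LR) supports the same-speaker hypothesis
```

The choice of which recordings feed `within_scores` is exactly what separates the two methods; the LR arithmetic itself is identical.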
Test results with case data
[Chart: rates of correct classification, incorrect classification, and non liquet for SPES with the direct method vs. SPES with Drygajlo normalisation]
General result: compared with the result achieved with lab data (0.01% EER), the EER on real case data rose to 35%, improving to 28% over the period from 2005 to early 2006.
Analysis of discrepancies between the automatic (SPES) and auditory-acoustic (BKA) methods • SPES identical – BKA non-identical: 2x (voices similar, but linguistic/phonetic differences) • SPES identical – BKA non liquet: 2x • SPES non-identical – BKA identical: 2x • SPES non-identical – BKA non liquet: 2x • SPES non liquet – BKA identical: 4x • SPES non liquet – BKA non-identical: 1x Where SPES gave a discrepant non-identical or non liquet result, the cause was poor technical quality, very short duration, or a technical/behavioural mismatch between the questioned and the suspect speaker.
NFI/TNO test
[Figure: NFI/TNO test results with MFCC vs. with PLPCC]
[Diagram: Drygajlo normalisation flow — the questioned recording (spontaneous), recordings of the suspect (spontaneous and read speech; R or SDB, C or TDB), and a speech data corpus (read and spontaneous speech; population read-speech samples only) feed the within-speaker and between-speaker similarity distributions from which the evidence is evaluated]