University of Joensuu, Department of Computer Science
PUMS 2003-2004 seminar, 14.10.2004, Turku
Speaker Recognition
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
Research Group (PUMS project)
• Juhani Saastamoinen, project manager
• Pasi Fränti, professor
• Evgeny Karpov, project researcher
• Tomi Kinnunen, researcher
• Ismo Kärkkäinen, clustering algorithms
• Ville Hautamäki, project researcher
PUMS & JoY
• Speaker Recognition
• PUMS season 2003-2004:
• Identification, not verification
• Porting to a mobile phone
• Feature fusion
• Real-time operation
• http://cs.joensuu.fi/pages/pums
Speaker Recognition: Application Scenarios
• Speaker verification: "Is this Bob's voice?" The speaker makes an identity claim, and the system either accepts it or rejects the speaker as an impostor.
• Speaker identification: "Whose voice is this?" The system picks the best-matching speaker from the database, with no identity claim given.
Identification System
Speech → audio signal processing → feature vectors → speaker modelling
• Training: add trained speaker profiles to the speaker profile database
• Recognition: use all profiles; minimize the MSE within the database over the input speech → decision
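The "minimize MSE within the database" decision rule can be sketched as follows. This is a minimal illustration, not the project's srlib code: speaker profiles are assumed to be VQ codebooks, and the best speaker is the one whose codebook gives the smallest average quantization distortion over the input feature vectors. All names and the toy data are hypothetical.

```python
import numpy as np

def avg_quantization_distortion(features, codebook):
    """Mean squared distance from each feature vector to its
    nearest code vector in the speaker's codebook."""
    # pairwise squared Euclidean distances: shape (n_frames, n_codes)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def identify(features, profiles):
    """Return the speaker whose profile gives the smallest
    average distortion over the input speech."""
    return min(profiles,
               key=lambda spk: avg_quantization_distortion(features, profiles[spk]))

# toy example with two hypothetical speaker profiles
rng = np.random.default_rng(0)
profiles = {
    "alice": rng.normal(0.0, 1.0, size=(8, 12)),  # 8 code vectors, 12-dim features
    "bob":   rng.normal(5.0, 1.0, size=(8, 12)),
}
test = profiles["bob"] + rng.normal(0.0, 0.1, size=(8, 12))
print(identify(test, profiles))  # → bob
```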
Results 2003-2004
Applications built on a common speaker recognition application interface:
• sprofiler: SpeakerProfiler, TCL/TK console UI (HY)
• Winsprofiler: Windows console UI
• Epocsprofiler: ProfMatch, Series 60
Core library srlib: fusion, real-time operation, speech features (HY), DB
Planned Results
Building on the 2003-2004 results (sprofiler, Winsprofiler, Epocsprofiler on the common speaker recognition application interface; srlib with fusion, real-time operation, speech features (HY), DB):
• Large-scale database
• Applications: teleconference, access control, mobile phone login?
• srlib additions: verification, segmentation, VAD
System in a Mobile Phone
Port to the Symbian OS with the Series 60 UI platform
Symbian Phones (UIQ, Series 60, Series 80)
• Series 60 phone features:
• 16 MB ROM
• 8 MB RAM
• 176 x 208 display
• 32-bit ARM processor
• No floating-point unit!
FFTGEN
• Multiplication results must fit in 32 bits: truncate the multiplication inputs
• FFTGEN: truncate both inputs to 16 bits ("16/16 FFT")
• FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) → 32-bit multiplication result: 16 used bits, 16 crop-off bits; the 16 crop-off bits are dropped before the next layer
Proposed Information-Preserving "22/10 FFT"
• Approximate the DFT operator F with G
• Allow a larger ||F - G|| but preserve more signal information
• Minimize the maximum relative error in scaled sine values with respect to the scale; a scale of 980 is good for FFT sizes up to 1024
• Truncate the multiplication inputs to 22/10 bits (signal/operator): FFT layer input (32-bit integer, 22 used bits, 10 crop-off bits) x FFT twiddle factor (16-bit integer, 10 used bits) → 32-bit multiplication result; 10 bits are cropped off before the next layer
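The two truncation schemes can be illustrated with plain integer arithmetic. This is a simplified sketch of the bit budgeting only (the sample values are arbitrary, and real FFT butterflies also add products): both schemes keep the product within a 32-bit register, which matters on the FPU-less ARM core, but 22/10 spends more of the budget on the signal and less on the twiddle factor.

```python
def truncate(x, keep_bits, total_bits):
    """Keep the top keep_bits of a total_bits-wide value by
    shifting away the low-order crop-off bits."""
    return x >> (total_bits - keep_bits)

signal = 0x12345678   # 32-bit FFT layer input (arbitrary example value)
twiddle = 0x7ABC      # 16-bit scaled twiddle factor (arbitrary example value)

# 16/16 scheme: signal truncated to 16 bits, twiddle already 16 bits,
# so the product fits in a 32-bit register.
p_1616 = truncate(signal, 16, 32) * twiddle

# 22/10 scheme: keep 22 signal bits and only 10 twiddle bits
# (22 + 10 = 32), trading twiddle precision for signal precision.
p_2210 = truncate(signal, 22, 32) * truncate(twiddle, 10, 16)

print(p_1616.bit_length() <= 32, p_2210.bit_length() <= 32)  # → True True
```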
Scale of Error in Proposed FFT (figure: error magnitudes of the 16/16 and 22/10 FFT variants)
Improving Accuracy by Information Fusion
The feature vector is split into feature sets, each fed to its own classifier; a score combiner merges the classifier scores into the final decision:
• Feature set 1 (e.g. 5 MFCCs) → Classifier 1 → score 1
• Feature set 2 (e.g. F0 + ΔF0) → Classifier 2 → score 2
• Feature set 3 (e.g. formants F1, F2, F3) → Classifier 3 → score 3
• Score combiner → Decision
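A minimal sketch of the score combiner in this architecture, assuming a weighted-sum combination of per-speaker match scores where a smaller score means a better match (e.g. VQ distortion). The weights and score values are hypothetical, chosen only to illustrate the mechanism.

```python
def fuse_scores(score_lists, weights):
    """Score-level fusion: combine per-classifier match scores for
    each speaker with a weighted sum, then pick the best speaker."""
    fused = {}
    for scores, w in zip(score_lists, weights):
        for speaker, s in scores.items():
            fused[speaker] = fused.get(speaker, 0.0) + w * s
    # assuming smaller score = better match (e.g. VQ distortion)
    return min(fused, key=fused.get), fused

# hypothetical scores from three feature-set classifiers
mfcc_scores    = {"alice": 0.40, "bob": 0.90}
f0_scores      = {"alice": 0.70, "bob": 0.60}
formant_scores = {"alice": 0.50, "bob": 0.80}

best, fused = fuse_scores([mfcc_scores, f0_scores, formant_scores],
                          weights=[0.5, 0.2, 0.3])
print(best)  # → alice
```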
Information Fusion Results (error rates, %)

Feature set combination | Baseline: best individual | Feature-level fusion | Score-level fusion | Decision-level fusion
MFCC + ΔMFCC            | 16.8                      | 15.8                 | 14.6               | N/A
LPCC + ΔLPCC            | 16.0                      | 19.8                 | 14.7               | N/A
ARCSIN + ΔARCSIN        | 17.1                      | 18.2                 | 16.8               | N/A
FMT + ΔFMT              | 19.4                      | 29.9                 | 52.0               | N/A
All feature sets        | 16.0                      | 21.2                 | 15.2               | 12.6

The slide annotations mark where fusion is successful (fused error below the baseline) and where it fails (e.g. FMT fusion is much worse than its baseline).
Real-Time Speaker Identification
Processing of the speech input stream:
1. Fill the buffer with new data; frame blocking → all frames
2. Silence detection → non-silent frames
3. Feature extraction → feature vectors
4. Reducing the number of vectors (pre-quantization): averaging, random sampling, decimation, or clustering (LBG) → reduced set of vectors
5. Speeding up the NN search: vantage-point tree (VPT) indexing of the code vectors
6. Reducing the number of speakers (database pruning): static, hierarchical, adaptive, or confidence-based pruning → active vs. pruned speakers
7. Matching against the speaker database (speaker models 1..N) → list of candidate speakers
8. Decision? If not yet confident, loop back for more data; otherwise end.
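The database-pruning step above can be sketched as follows. This is an illustrative guess at the "adaptive pruning" variant, assuming the rule is to drop speakers whose cumulative distortion exceeds the mean plus η standard deviations over the currently active speakers; the exact rule and parameter values in the project may differ.

```python
import statistics

def adaptive_prune(cum_scores, eta=1.0):
    """One pruning step: drop speakers whose cumulative distortion
    exceeds mean + eta * stdev of the active speakers' scores."""
    mu = statistics.mean(cum_scores.values())
    sigma = statistics.pstdev(cum_scores.values())
    threshold = mu + eta * sigma
    return {spk: s for spk, s in cum_scores.items() if s <= threshold}

# hypothetical cumulative distortions after a few frames
scores = {"s1": 1.2, "s2": 1.3, "s3": 1.1, "s4": 4.0}
active = adaptive_prune(scores)
print(sorted(active))  # → ['s1', 's2', 's3']  (s4 is pruned)
```

As more frames arrive, the loop re-accumulates distortions for the surviving speakers and prunes again, so the candidate list shrinks while the decision confidence grows.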
Results: Baseline System (TIMIT)
(Average length of test utterance = 8.9 s)
• 4 x real-time; the real-time requirement is satisfied
Results: Pre-Quantization (TIMIT)
(Codebook size = 64)
• 9 x real-time
• Averaging performs worst, clustering best
• About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy
Results: Pruning Variants (TIMIT)
(Codebook size = 64)
• 11 x real-time
• Recommended method: adaptive pruning (AP)
Results: PQ, Pruning and PQP (TIMIT)
(Codebook size = 64)
• 33 x real-time
• Recommended method: combination of pre-quantization and pruning (PQP)
Results: VQ vs. GMM (TIMIT)
(Average length of test utterance = 8.9 s)
• VQ: best time 0.27 s = 33 x real-time at error rate 0.32 %; smallest error 0.00 % at 0.31 s = 28 x real-time
• GMM: best time 0.18 s = 49 x real-time at error rate 0.16 %; smallest error 0.16 % at 0.18 s = 49 x real-time
• Speed-up without degradation: 13:1 (VQ) and 9:1 to 10:1 (GMM)
Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance = 30.4 s)
• VQ: best time 0.82 s = 37 x real-time at error rate 19.36 %; smallest error 16.90 % at 37.9 s = 0.8 x real-time
• GMM: best time 0.48 s = 63 x real-time at error rate 19.22 %; smallest error 17.34 % at 11.4 s = 3 x real-time
• Speed-up with minor degradation: 13:1 to 16:1 (VQ) and 23:1 to 34:1 (GMM)