A Speaker Pruning Algorithm for Real-Time Speaker Identification
Tomi Kinnunen, Evgeny Karpov, Pasi Fränti
University of Joensuu, FINLAND, Department of Computer Science
AVBPA 2003, Guildford, UK, June 9-11, 2003
Abstract
• The speaker identification task is computationally very expensive
• Most of the computation originates from calculating the matching scores
• Proposed method: drop out unlikely speakers “on the fly”
• Result: reduced computation time with a slightly increased error rate
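VQ-Based Speaker Identification
[Diagram: the unknown voice passes through feature extraction to give the vector set X. X is matched against every codebook C1, C2, C3, ..., CN in the speaker model database (a loop over the whole database), producing the scores { D(X,C1), …, D(X,Ci), …, D(X,CN) }. The speaker with the minimum score is selected.]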
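The full-search scheme on this slide can be sketched in a few lines of Python. The slide does not define D(X, C), so the sketch assumes a common choice for VQ models: the average squared Euclidean distance from each input vector to its nearest code vector. The function names and the toy 2-D data are hypothetical, for illustration only.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def distortion(X: List[Vector], codebook: List[Vector]) -> float:
    """Assumed D(X, C): mean distance from each vector in X to its nearest code vector."""
    return sum(min(squared_distance(x, c) for c in codebook) for x in X) / len(X)


def identify_full_search(X: List[Vector], models: Dict[str, List[Vector]]) -> str:
    """Score X against every speaker codebook and return the speaker with minimum D."""
    scores = {name: distortion(X, codebook) for name, codebook in models.items()}
    return min(scores, key=lambda name: scores[name])


# Toy 2-D codebooks and input vectors (hypothetical data).
models = {
    "speaker_a": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_b": [(5.0, 5.0), (6.0, 6.0)],
}
X = [(0.1, 0.0), (0.9, 1.2)]
best = identify_full_search(X, models)
```

Every input vector is compared with every code vector of every speaker, which is why the matching step dominates the computation.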
Towards Speaker Pruning ...
• Only a few vectors are enough to rule out most of the speakers
• Confidence increases as more vectors are processed
Speaker pruning: drop the unlikely speakers out of the competition as more data arrives
• No more distance calculations are needed for the pruned speakers
Illustration of Pruning
[Figure: timeline of the unknown speaker's voice sample, marked with the 1st pruning, 2nd pruning, 3rd pruning and the final decision]
Variant 1: Static Pruning
Idea: maintain an ordered list of match scores, and prune out the K worst speakers
Let C = {C1,…,CN} be the set of all speaker models ;
Let X = Ø ;
WHILE (C ≠ Ø AND vectors left in input buffer) DO
    Insert M new vectors from input buffer to set X ;
    Re-evaluate dissimilarities D(X, Ci) for all Ci in C ;
    Remove K most dissimilar models from C ;
END
RETURN arg min_i { D(X, Ci) | Ci ∈ C } ;
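A minimal Python sketch of static pruning follows. It makes two assumptions that are not on the slide: D(X, C) is taken to be the average squared distance to the nearest code vector, and the loop stops once a single speaker remains so that the active set never becomes empty. The names `static_pruning`, `stream`, `M` and `K` are illustrative.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def distortion(X: List[Vector], codebook: List[Vector]) -> float:
    """Assumed D(X, C): mean squared distance to the nearest code vector."""
    return sum(
        min(sum((p - q) ** 2 for p, q in zip(x, c)) for c in codebook) for x in X
    ) / len(X)


def static_pruning(stream: List[Vector], models: Dict[str, List[Vector]],
                   M: int, K: int) -> str:
    """Read M vectors at a time and drop the K worst-matching speakers each round."""
    if not stream:
        raise ValueError("input buffer is empty")
    active = dict(models)          # the set C of speakers still in the competition
    X: List[Vector] = []           # the vectors processed so far
    pos = 0
    # Assumption: stop when one speaker is left, so C is never emptied.
    while len(active) > 1 and pos < len(stream):
        X.extend(stream[pos:pos + M])
        pos += M
        scores = {name: distortion(X, cb) for name, cb in active.items()}
        ranked = sorted(active, key=lambda name: scores[name])  # best match first
        n_drop = min(K, len(active) - 1)
        for name in ranked[len(ranked) - n_drop:]:
            del active[name]       # pruned speakers are never scored again
    return min(active, key=lambda name: distortion(X, active[name]))


# Toy data (hypothetical): four single-vector codebooks, input close to "a".
models = {"a": [(0.0, 0.0)], "b": [(3.0, 3.0)], "c": [(6.0, 6.0)], "d": [(9.0, 9.0)]}
stream = [(0.2, 0.1), (-0.1, 0.0), (0.0, 0.3)]
winner = static_pruning(stream, models, M=1, K=1)
```

For clarity the sketch recomputes D(X, Ci) from scratch in every round; an implementation aimed at real-time use would more likely accumulate the per-speaker distances incrementally.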
Variant 2: Adaptive Pruning
Idea: determine a pruning threshold θ from the distribution of the active speakers' distances
Let C = {C1,…,CN} be the set of all speaker models ;
Let X = Ø ;
WHILE (C ≠ Ø AND vectors left in input buffer) DO
    Insert M new vectors from input buffer to set X ;
    Re-evaluate dissimilarities D(X, Ci) for all Ci in C ;
    Compute μ and σ of the distribution { D(X, Ci) | Ci ∈ C } ;
    Let θ = μ + ησ be the pruning threshold ;
    Remove all speakers i from C satisfying D(X, Ci) > θ ;
END
RETURN arg min_i { D(X, Ci) | Ci ∈ C } ;
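The adaptive variant can be sketched the same way. As assumptions not stated on the slide, D(X, C) is the average squared distance to the nearest code vector, σ is computed as the population standard deviation, η is non-negative (so the best-matching speaker always survives), and the loop stops when one speaker remains. The names `adaptive_pruning`, `stream`, `M` and `eta` are illustrative.

```python
import statistics
from typing import Dict, List, Sequence

Vector = Sequence[float]


def distortion(X: List[Vector], codebook: List[Vector]) -> float:
    """Assumed D(X, C): mean squared distance to the nearest code vector."""
    return sum(
        min(sum((p - q) ** 2 for p, q in zip(x, c)) for c in codebook) for x in X
    ) / len(X)


def adaptive_pruning(stream: List[Vector], models: Dict[str, List[Vector]],
                     M: int, eta: float) -> str:
    """Read M vectors at a time and drop speakers whose distance exceeds mu + eta*sigma."""
    if not stream:
        raise ValueError("input buffer is empty")
    active = dict(models)
    X: List[Vector] = []
    pos = 0
    # Assumption: stop when a single speaker remains.
    while len(active) > 1 and pos < len(stream):
        X.extend(stream[pos:pos + M])
        pos += M
        scores = {name: distortion(X, cb) for name, cb in active.items()}
        values = list(scores.values())
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values)    # population standard deviation (assumption)
        theta = mu + eta * sigma             # pruning threshold
        # With eta >= 0 the minimum score never exceeds theta, so C stays non-empty.
        active = {name: cb for name, cb in active.items() if scores[name] <= theta}
    return min(active, key=lambda name: distortion(X, active[name]))


# Toy data (hypothetical): four single-vector codebooks, input close to "a".
models = {"a": [(0.0, 0.0)], "b": [(3.0, 3.0)], "c": [(6.0, 6.0)], "d": [(9.0, 9.0)]}
stream = [(0.2, 0.1), (-0.1, 0.0), (0.0, 0.3)]
winner = adaptive_pruning(stream, models, M=1, eta=0.5)
```

Unlike the static variant, the number of speakers removed per round is not fixed: it depends on how far the current scores spread around their mean.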
Illustration of Adaptive Pruning
[Figure: histograms of matching scores as a function of time, with the pruned speakers marked. Axes: match score (distance) vs. frequency of occurrence]
Parameters of the Variants
• Static pruning: number of speakers to prune at each interval
• Adaptive pruning: the η parameter in the pruning threshold θ = μ + ησ
  - It is assumed that the distances follow a Gaussian distribution with mean μ and variance σ²
  - η specifies a certain confidence interval
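Under the Gaussian assumption above, the share of active speakers expected to fall beyond θ = μ + ησ is the upper tail probability 1 − Φ(η), where Φ is the standard normal distribution function. The short sketch below evaluates this tail; the function name is hypothetical, and real match scores need not be exactly Gaussian, so the value is only a rough guide when choosing η.

```python
import math


def expected_pruned_fraction(eta: float) -> float:
    """Upper Gaussian tail P(D > mu + eta*sigma) = 1 - Phi(eta)."""
    return 0.5 * math.erfc(eta / math.sqrt(2.0))


# Expected share of speakers pruned per round for a few example values of eta.
for eta in (0.0, 0.5, 1.0, 2.0):
    print(f"eta = {eta:.1f}: {100.0 * expected_pruned_fraction(eta):.1f} % pruned")
```

A larger η prunes fewer speakers per round, which is safer but slower; a smaller η prunes more aggressively.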
Experimental Setup
• TIMIT corpus:
  - N = 630 American English speakers, clean speech
  - Sample rate Fs = 8 kHz, 16 bits per sample
• Pre-processing and MFCC feature extraction:
  - Silence removed, pre-emphasis H(z) = 1 - 0.97 z^-1
  - 30 ms Hamming window, shifted by 10 ms
  - 27 triangular bandpass filters spaced equally on the mel scale
  - 0th cepstral coefficient excluded
• Speaker models:
  - Codebooks of 64 vectors by the Linde-Buzo-Gray algorithm
  - Training data: 8.8 seconds / speaker (without silence)
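The pre-emphasis and windowing steps listed above can be sketched as follows. At Fs = 8 kHz a 30 ms window is 240 samples and a 10 ms shift is 80 samples. The Hamming coefficients 0.54 and 0.46 are the standard textbook definition rather than something stated on the slide, and silence removal, the mel filterbank and the cepstral transform are left out. All function names are illustrative.

```python
import math
from typing import List

FS = 8000                        # sample rate in Hz
FRAME_LEN = FS * 30 // 1000      # 30 ms window -> 240 samples
FRAME_SHIFT = FS * 10 // 1000    # 10 ms shift  -> 80 samples


def pre_emphasize(signal: List[float], alpha: float = 0.97) -> List[float]:
    """Apply H(z) = 1 - alpha*z^-1, i.e. y[n] = x[n] - alpha*x[n-1], with x[-1] = 0."""
    previous = [0.0] + list(signal[:-1])
    return [x - alpha * p for x, p in zip(signal, previous)]


def hamming(n: int) -> List[float]:
    """Standard Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def windowed_frames(signal: List[float]) -> List[List[float]]:
    """Cut the signal into overlapping frames and weight each one with a Hamming window."""
    window = hamming(FRAME_LEN)
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        chunk = signal[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# One second of a constant signal (hypothetical input, for shape checking only).
emphasized = pre_emphasize([1.0] * FS)
frames = windowed_frames(emphasized)
```

Each frame would then go through the mel filterbank and cepstral analysis to yield one MFCC vector, so one second of speech produces on the order of a hundred vectors.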
Evaluation Criteria
• Identification error rate + average identification time per speaker
• Combined: error rate as a function of time
• Reference point:
  - Full search (no speaker pruning) achieves a 0.15 % error rate (one misclassified speaker) in 230 seconds (about 4 minutes) on average
Static Pruning
[Figure annotation: error < 0.5 % in 50 seconds. Full search: 0.15 % in 230 seconds]
Adaptive Pruning
[Figure annotation: error < 0.5 % in 25 seconds. Full search: 0.15 % in 230 seconds]
Comparison of the Variants
[Figure annotations:
At 25 s: static 5.5 %, adaptive 0.5 %
At 50 s: static 0.5 %, adaptive 0.18 %
Full search: 0.15 % in 230 seconds]
Conclusions
• Speed-up ratio 9:1 with only minor degradation in accuracy
  - Full search: 629/630 correct in 220 seconds
  - Static pruning: 595/630 correct in 25 seconds
  - Adaptive pruning: 627/630 correct in 25 seconds
• Adaptive variant outperforms static variant
• Selection of the parameters not crucial
• Easy to apply in practice
  - Both variants are straightforward to implement
  - Easily extendable to other models (e.g. GMM)
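The counts and times in the conclusions can be turned into error rates and a speed-up ratio with a few lines of arithmetic. The sketch below uses only the numbers listed on this slide; the variable and function names are illustrative.

```python
N_SPEAKERS = 630

# (correctly identified speakers, identification time in seconds), from the conclusions
results = {
    "full search": (629, 220.0),
    "static pruning": (595, 25.0),
    "adaptive pruning": (627, 25.0),
}


def error_rate_percent(correct: int, total: int = N_SPEAKERS) -> float:
    """Identification error rate in percent."""
    return 100.0 * (total - correct) / total


def speed_up(method: str, baseline: str = "full search") -> float:
    """Ratio of the baseline time to the time of the given method."""
    return results[baseline][1] / results[method][1]


for name, (correct, seconds) in results.items():
    print(f"{name}: {error_rate_percent(correct):.2f} % error, "
          f"{seconds:.0f} s, speed-up {speed_up(name):.1f}:1")
```

The computed values agree with the percentages quoted on the earlier slides to within rounding or truncation, and the ratio of 220 s to 25 s is 8.8, which the slide reports as roughly 9:1.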