Speaker Detection Without Models
Dan Gillick, July 27, 2004
Motivation
Want to develop a speaker ID algorithm that:
• captures sequential information
• takes advantage of extended data
• combines well with existing baseline systems
The Algorithm
• Rather than build models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test-data frames to training-data frames.
• We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture.
• The comparisons are guided by token-level alignments extracted from a speech recognizer.
Front-End
Using 40 MFCC features per 10 ms frame:
• 19 cepstral coefficients and energy (C0)
• their deltas
The Algorithm: Overview
Cut the test and target data into tokens:
• use word- or phone-level time alignments from the SRI recognizer
• note that these alignments contain many errors (both word errors and alignment errors)
The Algorithm: Overview
Compare test and target data:
• Take the first test token
• Find every instance of this token in the target data
• Measure the distance between the test token and each target instance
• Move on to the next test token
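The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `compare_tokens`, the `(label, frames)` token layout, and the pluggable `distance` function are all assumptions made here for clarity.

```python
# Hypothetical sketch of the token-comparison loop described above.
# Each token is assumed to be a (label, frames) pair, where frames is
# a list of per-frame feature vectors from the recognizer alignment.

def compare_tokens(test_tokens, target_tokens, distance):
    """For each test token, find every target instance with the same
    label and record the distance to each one."""
    scores = {}
    for i, (label, test_frames) in enumerate(test_tokens):
        # "Find every instance of this token in the target data"
        instances = [frames for (lab, frames) in target_tokens if lab == label]
        if instances:
            # "Measure the distance between the test token and each
            # target instance"
            scores[i] = [distance(test_frames, inst) for inst in instances]
    return scores
```

The distance function is left as a parameter because, as the later slides discuss, there are several reasonable choices (truncation, sliding window, DTW).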
The Algorithm
[Diagram: test data vs. training data]
"Take the first test token": grab the sequence of frames corresponding to this token, according to the recognizer output (e.g., the token "Hello" in the test data).
The Algorithm
"Find every instance of this token in the target data": the test token "Hello" matches Hello (1), Hello (2), and Hello (3) in the training data.
The Algorithm
"Measure the distance between the test token and each target instance": distance = sum of the (Euclidean) distances between frames of the test and target instances. Here, Hello (1) scores 25, Hello (2) scores 40, and Hello (3) scores 18.
The Algorithm: Distance Function
But these instances have different lengths. How do we line up the frames? Here are some possibilities:
• 1. Line up the first frames and cut off the longer at the shorter
• 2. Use a sliding-window approach: slide the shorter through the longer, taking the best (smallest) total distance
• 3. Use dynamic time warping (DTW)
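Options 2 and 3 can be sketched as follows. These are textbook versions written for this summary, not the DTW code mentioned in the acknowledgments; the frame distance is passed in as a parameter, and scalar "frames" are used in the test purely for brevity.

```python
def sliding_window_distance(seq_a, seq_b, frame_distance):
    """Option 2: slide the shorter sequence along the longer one and
    keep the smallest total frame-by-frame distance."""
    short, long_ = sorted((seq_a, seq_b), key=len)
    best = float("inf")
    for off in range(len(long_) - len(short) + 1):
        total = sum(frame_distance(f, long_[off + k]) for k, f in enumerate(short))
        best = min(best, total)
    return best

def dtw_distance(seq_a, seq_b, frame_distance):
    """Option 3: dynamic time warping -- minimum cumulative frame
    distance over all monotonic alignments of the two sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_a
                                 cost[i][j - 1],      # stretch seq_b
                                 cost[i - 1][j - 1])  # match both
    return cost[n][m]
```

DTW pays a quadratic cost in sequence length but tolerates local timing differences that truncation and sliding windows cannot absorb.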
The Algorithm: Take the 1-Best
From the candidate distances (Hello (1) = 25, Hello (2) = 40, Hello (3) = 18), the token score is the 1-best: Token Score = 18. Now what do we do with these scores? There are a number of options, but we keep only the 1-best score per token. One motivation for this decision is that we are mainly interested in positive information.
The Algorithm: Scoring
So we accumulate scores for each token (Hello = 18, my = 16.5, name = 21, etc.). What do we do with these? Some options:
• 1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score)
• 2. Focus on some subset of the scores
  • a. Positive evidence (Hit score): ∑ [ (#frames) / (k^score) ]
  • b. Negative evidence: ∑ [ (#frames × target count) / (k^(M − score)) ]
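The Basic and Hit scores above can be sketched as below. This is an illustrative reading of the formulas, not the authors' code: the slide does not fix the base k (a tunable constant here), and the function names are invented for this summary.

```python
def basic_score(token_scores, token_frames=None):
    """Option 1 (Basic score): average the token scores, normalizing
    by the number of tokens, or by total frames if counts are given."""
    if token_frames is None:
        return sum(token_scores) / len(token_scores)
    return sum(token_scores) / sum(token_frames)

def hit_score(token_scores, token_frames, k=2.0):
    """Option 2a (Hit score, positive evidence):
    sum over tokens of #frames / k**score.  Small distances (good
    matches) contribute large terms; k is a tunable base, assumed
    here, since the slide does not specify its value."""
    return sum(f / k ** s for s, f in zip(token_scores, token_frames))
```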
Normalization
• Most systems use a UBM (universal background model) to center the test scores
• Since this system has no model, we create a background by lumping together speech from a number of different held-out speakers and running the algorithm with this group as the training data
• ZNorm to center the "models": find the mean score for each "model" (training set) by running a number of held-out impostors against it
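The ZNorm step reduces to standardizing each raw score against impostor statistics for the same training set. A minimal sketch (the standard ZNorm recipe, with names assumed here):

```python
import statistics

def znorm(raw_score, impostor_scores):
    """ZNorm: center and scale a raw trial score using the mean and
    standard deviation of held-out impostor scores computed against
    the same "model" (training set)."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.stdev(impostor_scores)
    return (raw_score - mu) / sigma
```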
Results
Results reported on split 1 (of 6) of Switchboard I (1,624 test-vs.-target scores)
Results
[Results table not preserved in this transcript.]
For reference, GMM performance on the same data set: 0.67% EER; 0.0491 DCF
Key: Style: sw = sliding window; Bkg: number of speakers in the background set; Znorm: number of speakers in the znorm set
Results
How do positive and negative evidence compare?
• Word bigrams + bkg (positive evidence): 3.16% EER
• Word bigrams + bkg (negative evidence): 26.5% EER
Results
How is the system affected by errorful recognizer transcripts?
• Word bigrams + bkg + znorm (recognized transcripts): 1.83% EER
• Word bigrams + bkg + znorm (true transcripts): 1.16% EER
Results
How does the system combine with the GMM? This experiment was done on the first half (splits 1, 2, 3) of Switchboard I.
                             EER (%)   DCF
SRI GMM system               0.97      0.04806
Best phone-bigram system     1.46      0.06110
GMM + phone bigrams          0.49      0.02040
Future Stuff
• Try a larger background population and a larger znorm set
• Try other, non-Euclidean distance functions
• Change the front-end features (feature mapping)
• Run the system on Switchboard II and the 2004 eval data
• Dynamic token selection: while the system already works well, perhaps its real strength has not yet been exploited. Since there are no models, we could dynamically select the longest available frame sequences in the test and target data for scoring.
Thanks
• Steve (wrote all the DTW code, versions 1 through 5…)
• Barry (tried to make my slides fancy)
• Barbara
• Everyone else in the Speaker ID group