Spectral Features for Automatic Text-Independent Speaker Recognition
Tomi Kinnunen
Research seminar, 27.2.2004
Department of Computer Science, University of Joensuu
Based on a True Story … T. Kinnunen: Spectral Features for Automatic Text-Independent Speaker Recognition, Ph.Lic. thesis, 144 pages, Department of Computer Science, University of Joensuu, 2004. Downloadable in PDF from: http://cs.joensuu.fi/pages/tkinnu/research/index.html
Why Study Feature Extraction?
• Feature extraction is the first component in the recognition chain, so its selection strongly determines the accuracy of classification
Why Study Feature Extraction? (cont.)
• Typical feature extraction methods are "borrowed" directly from the speech recognition task
• This is quite contradictory, considering the "opposite" nature of the two tasks
• In general, it seems that we are currently, at best, guessing at what might be individual in our speech!
• And because it is interesting & challenging!
Studied Features 1. FFT-implemented filterbanks (subband processing) 2. FFT-cepstrum 3. LPC-derived features 4. Dynamic spectral features (delta features)
Speech Material & Evaluation Protocol
• Each test file is split into segments of T = 350 vectors (about 3.5 seconds of speech)
• Each segment is classified by vector quantization (VQ)
• Speaker models (codebooks) are constructed from the training data by the RLS clustering algorithm
• Performance measure = classification error rate (%)
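As a concrete illustration of this protocol, here is a minimal sketch of the VQ classification rule, assuming the speaker codebooks have already been trained (the function and variable names are hypothetical):

```python
import numpy as np

def classify_segment(segment, codebooks):
    """Assign a segment (T x d array of feature vectors) to the speaker
    whose codebook gives the smallest average quantization distortion."""
    best_speaker, best_distortion = None, np.inf
    for speaker, codebook in codebooks.items():  # codebook: K x d array
        # Euclidean distance from every frame to every code vector (T x K)
        dists = np.linalg.norm(segment[:, None, :] - codebook[None, :, :], axis=2)
        # Quantize each frame to its nearest code vector; average over frames
        distortion = dists.min(axis=1).mean()
        if distortion < best_distortion:
            best_speaker, best_distortion = speaker, distortion
    return best_speaker
```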
Computation of Subband Features
• Pipeline: windowed speech frame → magnitude spectrum by FFT → smoothing by a filterbank → nonlinear mapping of the filter outputs → compressed filter outputs f = (f1, f2, …, fM)T
• Parameters of the filterbank: number of subbands; filter shapes & bandwidths; type of frequency warping; filter output nonlinearity
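A minimal sketch of this pipeline for a single frame, with linearly spaced triangular filters and log-compression as (replaceable) defaults; the FFT size and filter count below are illustrative assumptions, not the thesis's exact settings:

```python
import numpy as np

def subband_features(frame, n_filters=30, nonlinearity=np.log1p, n_fft=512):
    """Compressed filterbank outputs f = (f1, ..., fM) for one windowed frame."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))              # magnitude spectrum by FFT
    bins = np.arange(len(spectrum))
    edges = np.linspace(0, len(spectrum) - 1, n_filters + 2)  # linearly spaced band edges
    feats = np.empty(n_filters)
    for m in range(n_filters):
        lo, center, hi = edges[m], edges[m + 1], edges[m + 2]
        # Triangular filter rising on [lo, center], falling on [center, hi]
        tri = np.clip(np.minimum((bins - lo) / (center - lo),
                                 (hi - bins) / (hi - center)), 0.0, None)
        feats[m] = nonlinearity(tri @ spectrum)               # smooth, then compress
    return feats
```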
Frequency Warping… What's That?!
• The "real" frequency axis (Hz) is stretched and compressed locally according to a (bijective) warping function
• [Figure: a 24-channel Bark-warped filterbank and the Bark scale]
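One common approximation of the Hz-to-Bark warping function, Zwicker's formula, is:

```latex
z(f) \;=\; 13\arctan(0.00076\,f) \;+\; 3.5\arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right)
```

where f is in Hz and z in Bark; a Bark-warped filterbank spaces its channels uniformly in z rather than in Hz.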
Discrimination of Individual Subbands (F-ratio)
(Fixed parameters: 30 linearly spaced triangular filters)
[Figure: F-ratio as a function of frequency, Helsinki and TIMIT corpora]
• The low end (~0–200 Hz) and the mid/high frequencies (~2–4 kHz) are important; the region ~200–2000 Hz is less important (however, not consistently!)
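The F-ratio is, in its standard form, the ratio of the between-speaker variance to the average within-speaker variance of a feature; for S speakers with per-speaker means μi, per-speaker variances σi², and grand mean μ:

```latex
F \;=\; \frac{\tfrac{1}{S}\sum_{i=1}^{S} (\mu_i - \mu)^2}{\tfrac{1}{S}\sum_{i=1}^{S} \sigma_i^2}
```

A subband with a high F-ratio separates speakers well relative to their internal variability.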
Subband Features: The Effect of the Filter Output Nonlinearity
(Fixed parameters: 30 linearly spaced triangular filters)
• Nonlinearities compared: 1. Linear: f(x) = x; 2. Logarithmic: f(x) = log(1 + x); 3. Cubic: f(x) = x^(1/3)
• Helsinki and TIMIT: a consistent ordering of error rates (!): cubic < log < linear
Subband Features: The Effect of the Filter Shape
(Fixed parameters: 30 linearly spaced filters, log-compression)
• Shapes compared: 1. Rectangular; 2. Triangular; 3. Hanning
• Helsinki and TIMIT: the differences are small and there is no consistent ordering; the filter shape is probably not as crucial as the other parameters
Subband Features: The Number of Subbands (1)
• Experiment 1: from 5 to 50 subbands (fixed parameters: linearly spaced, triangular-shaped filters, log-compression)
• Observation (Helsinki and TIMIT): error rates decrease monotonically with an increasing number of subbands (in most cases)…
Subband Features: The Number of Subbands (2)
• Experiment 2: from 50 to 250 subbands (fixed parameters: linearly spaced, triangular-shaped filters, log-compression)
• Helsinki: an (almost) monotonic decrease in errors with an increasing number of subbands
• TIMIT: the optimum number of bands is in the range 50–100
• The differences between the corpora are (partly) explained by the discrimination curves
Discussion of the Subband Features
• The (typically used) log-compression should be replaced with cubic compression or some better nonlinearity
• The number of subbands should be relatively high (at least 50, based on these experiments)
• The shape of the filter does not seem to be important
• Discriminative information is not evenly distributed along the frequency axis
• The relative discriminatory powers of the subbands depend on the selected speaker population/language/speech content…
Computation of FFT-Cepstrum
• Processing is very similar to "raw" subband processing
• Common steps: windowed speech frame → magnitude spectrum by FFT → smoothing by a filterbank → nonlinear mapping of the filter outputs
• Additional steps: decorrelation by DCT → coefficient selection → cepstrum vector c = (c1, …, cM)T
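A sketch of the additional steps, reusing the hypothetical subband_features() from the subband sketch above; the DCT type and the number of retained coefficients are illustrative defaults:

```python
from scipy.fftpack import dct

def fft_cepstrum(frame, n_coeffs=15, **filterbank_kwargs):
    """FFT-cepstrum: compressed filterbank outputs decorrelated by DCT."""
    feats = subband_features(frame, **filterbank_kwargs)  # common steps (see above)
    c = dct(feats, type=2, norm='ortho')                  # decorrelation by DCT
    return c[1:n_coeffs + 1]                              # coefficient selection, c[0] excluded
```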
FFT-Cepstrum: Type of Frequency Warping
(Fixed parameters: 30 triangular filters, log-compression, DCT-transformed filter outputs, 15 lowest cepstral coefficients excluding c[0])
• Warpings compared: 1. Linear; 2. Mel; 3. Bark; 4. ERB
• Helsinki: the mel-warped cepstrum gives the best results on average
• TIMIT: the linearly warped cepstrum gives the best results on average
• Same explanation as before: the discrimination curves
FFT-Cepstrum: Number of Cepstral Coefficients
(Fixed parameters: mel-warped triangular filters, log-compression, DCT-transformed filter outputs, c[0] excluded, codebook size = 64)
• Helsinki and TIMIT: the error minimum is reached at around ~10 coefficients, rather independently of the number of filters
Discussion About the FFT-Cepstrum
• Same performance as with the subband features, but with a smaller number of features
• For computational and modeling reasons, the cepstrum is the preferred method of the two in automatic recognition
• The commonly used mel-warped filterbank is not the best choice in the general case!
• There is no reason to assume that it would be, since the mel-cepstrum is based on modeling of human hearing and was originally meant for speech recognition purposes
• I prefer/recommend linear frequency warping: it is easier to control the amount of resolution on the desired subbands (e.g. by linear weighting), whereas in nonlinear warping the relationship between the "real" and "warped" frequency axes is more complicated
What Is Linear Predictive Coding (LPC)?
• In the time domain, the current sample is approximated as a linear combination of the past p samples: ŝ[n] = a[1]s[n−1] + a[2]s[n−2] + … + a[p]s[n−p]
• The objective is to determine the LPC coefficients a[k], k = 1, …, p, such that the squared prediction error Σn (s[n] − ŝ[n])² is minimized
• In the frequency domain, the LPC coefficients define an all-pole IIR filter whose poles correspond to local maxima of the magnitude spectrum
Computation of LPC and LPC-Based Features
• Windowed speech frame → autocorrelation computation → solving of the Yule-Walker AR equations by the Levinson-Durbin algorithm → LPC coefficients (LPC) and reflection coefficients (REFL)
• Atal's recursion → linear predictive cepstral coefficients (LPCC)
• LAR conversion → log area ratios (LAR)
• asin(·) → arcus sine coefficients (ARCSIN)
• Complex polynomial expansion + root-finding algorithm → line spectral frequencies (LSF)
• LPC pole finding → formants (FMT)
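A compact sketch of the core of this chain: the Levinson-Durbin recursion for the Yule-Walker equations, plus the standard REFL-to-LAR/ARCSIN conversions and Atal's LPCC recursion. Sign and normalization conventions vary between textbooks; this sketch follows the A(z) = 1 + a[1]z⁻¹ + … + a[p]z⁻ᵖ convention and is not necessarily the thesis's exact implementation:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p
    from autocorrelation lags r[0..p]; also returns the reflection coefficients."""
    a = np.zeros(p + 1); a[0] = 1.0
    refl = np.zeros(p)
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        refl[i - 1] = k
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]  # order-update of the polynomial
        err *= 1.0 - k * k
    return a, refl

def log_area_ratios(refl):
    """LAR conversion of the reflection coefficients."""
    return np.log((1.0 - refl) / (1.0 + refl))

def arcsin_coefficients(refl):
    """Arcus sine coefficients of the reflection coefficients."""
    return np.arcsin(refl)

def lpcc(a, n):
    """Atal's recursion: cepstrum c[1..n] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = np.zeros(n + 1)
    for m in range(1, n + 1):
        acc = a[m] if m <= p else 0.0
        acc += sum((k / m) * c[k] * a[m - k] for k in range(max(1, m - p), m))
        c[m] = -acc
    return c[1:]
```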
Linear Prediction (LPC): Number of LPC Coefficients (Helsinki and TIMIT)
• The error minimum is reached at around ~15 coefficients (not consistently, however)
• Error rates are surprisingly small in general!
• The LPC coefficients were used directly in a Euclidean-distance-based classifier, even though the literature usually warns: "Do not ever use LPCs directly, at least with the Euclidean metric."
Comparison of the LPC-Derived Features
(Fixed parameters: LPC predictor order p = 15; figure annotation: "A programming bug???")
• Overall performance is very good
• Raw LPC coefficients give the worst performance on average
• Differences between the feature sets are rather small
• Other factors to be considered: computational complexity; ease of implementation
LPC-Derived Formants
(Fixed parameters: codebook size = 64; Helsinki and TIMIT)
• Formants give comparable, and surprisingly good, results!
• Why "surprisingly good"?
• 1. The analysis procedure was very simple (it produces spurious formants)
• 2. Subband processing, LPC, the cepstrum, etc. describe the spectrum continuously; formants, on the other hand, pick only a discrete (and small!) number of maximum peaks from the spectrum
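For illustration, a crude sketch of the kind of simple pole-picking such an analysis procedure amounts to (it indeed produces spurious formants); the thesis's exact procedure may differ:

```python
import numpy as np

def formants_from_lpc(a, sample_rate):
    """Crude formant picker: angles of the complex LPC poles in the
    upper half plane, converted to Hz. Spurious formants are not filtered out."""
    poles = np.roots(a)                      # roots of A(z), a = [1, a1, ..., ap]
    poles = poles[np.imag(poles) > 0]        # keep one pole of each conjugate pair
    freqs = np.angle(poles) * sample_rate / (2.0 * np.pi)
    return np.sort(freqs)
```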
Discussion About the LPC-Derived Features
• In general, the results are promising, even for the raw LPC coefficients
• The differences between the feature sets were small
• From the implementation and efficiency viewpoint, the most attractive are LPCC, LAR and ARCSIN
• Formants also give (surprisingly) good results, which indirectly indicates that the regions of the spectrum with high amplitude might be important for speaker recognition
• An idea for future study: how about selecting subbands around local maxima?
Dynamic Spectral Features
• Dynamic feature: an estimate of the time derivative of the time trajectory of an original feature; the first derivative gives the Δ-feature, the second the ΔΔ-feature
• Can be applied to any feature
• Two widely used estimation methods: the differentiator and the linear regression method (formulas below; M = number of neighboring frames, typically M = 1…3)
• Typical phrase: "Don't use the differentiator, it emphasizes noise"
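The two estimators, in their standard forms (f_t denotes the feature at frame t):

```latex
\Delta f_t^{\mathrm{diff}} \;=\; f_{t+M} - f_{t-M},
\qquad
\Delta f_t^{\mathrm{reg}} \;=\; \frac{\sum_{m=-M}^{M} m\, f_{t+m}}{\sum_{m=-M}^{M} m^{2}}
```

Applying either estimator to the Δ-trajectory again gives the ΔΔ-feature.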
Delta Features: Comparison of the Two Estimation Methods
• Differentiator: best on Helsinki Δ-ARCSIN (8.1 %) with M = 4; best on TIMIT Δ-LSF (7.0 %) with M = 1
• Regression: best on Helsinki Δ-LSF (10.6 %) with M = 2; best on TIMIT Δ-ARCSIN (8.8 %) with M = 1
Delta Features: Comparison with the Static Features
• The optimum order is small (in most cases M = 1, 2 neighboring frames)
• The differentiator method is better in most cases (a surprising result, again!)
• Delta features are worse than the static features but might provide uncorrelated extra information (for multiparameter recognition)
• The commonly used delta-cepstrum gives quite poor results!
FFT-Cepstrum Revisited
• Question: is log-compression / the mel-cepstrum best? Answer (Helsinki and TIMIT): NO!
• Please note: the segment length is now reduced to T = 100 vectors, which is why the absolute recognition rates are worse than before (ran out of time for the thesis…)
FFT- vs. LPC-Cepstrum
• Question: is it really true that "the FFT-cepstrum is more accurate"? Answer (Helsinki and TIMIT): NO! (TIMIT shows this quite clearly)
The Essential Difference Between the FFT- and LPC-Cepstra?
• The FFT-cepstrum approximates the spectrum by a linear combination of cosine functions (a non-parametric model) and represents a "smooth" spectrum
• LPC makes a least-squares fit of an all-pole filter to the spectrum (a parametric model) and captures more spectral "details"
• The FFT-cepstrum first smoothes the original spectrum with a filterbank, whereas the LPC filter is fitted directly to the original spectrum
• However, one might argue that we could drop the filterbank from the FFT-cepstrum…
General Summary and Discussion
• The number of subbands should be high (30–50 for these corpora)
• The number of cepstral coefficients (LPC/FFT-based) should be high (≥ 15)
• In particular, the numbers of subbands and coefficients and the LPC order are clearly higher than those generally used in speech recognition
• Formants give (surprisingly) good performance; the number of formants should be high (≥ 8)
• In most cases, the differentiator method outperforms the regression method in delta-feature computation
• All of these findings indirectly indicate the importance of spectral details and rapid spectral changes
“Philosophical Discussion”
• Our current knowledge of speaker individuality is far from perfect: engineers concentrate on tuning complex feature compensation methods but don't (necessarily) understand what is individual in speech, while phoneticians try to find the "individual code" in the speech signal but don't (necessarily) know how to apply the engineers' methods
• Why do we believe that speech would be any less individual than e.g. fingerprints?
• Compare the histories of the "fingerprint" and the "voiceprint": fingerprints have been studied systematically since the 17th century (1684), whereas the spectrograph wasn't invented until 1946! How could we possibly claim that we know what speech is, with less than 60 years of research?
• Why do we believe that human beings are optimal speaker discriminators? Our ears can already be fooled (e.g. by MP3 encoding).