Reporter: Shih-Hsiang (士翔)
Introduction • The speech signal carries information from many sources • Not all of this information is relevant or important for speech recognition • Feature extraction is the first crucial step • The acoustic features may greatly affect the performance of a speech recognizer • Discriminability • Robustness • Complexity • MFCCs are used almost as “standard” acoustic parameters in currently available speech recognition systems • However, MFCCs do not cope well with noisy speech • Common remedies: Wiener filtering, spectral subtraction, RASTA, PMC, MLLR, etc. • In this paper, the authors present the differential power spectrum (DPS) for speech recognition
Definition of the differential power spectrum • Signal model: y(t) = s(t) * h(t) + v(t) = x(t) + v(t) • y(t): received speech signal • s(t): original clean speech signal • h(t): impulse response of the transmission channel • x(t): the noise-free speech signal • v(t): ambient noise • Short-time analysis is assumed over frames y(n), 0 ≤ n < N, where N is the frame length • Power spectrum: Y(ω) = Σ_{τ=-(N-1)}^{N-1} r_y(τ) e^{-jωτ} • ω: radian frequency • r_y(τ): the short-time autocorrelation
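Not part of the original slides: a small numeric check of the power-spectrum definition above, i.e. that the DFT of the short-time autocorrelation r_y(τ) equals |FFT{y}|² when the FFT length is at least 2N−1. The random frame stands in for windowed speech; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                      # frame length
y = rng.standard_normal(N)  # one analysis frame (stand-in for windowed speech)

K = 2 * N                   # FFT length >= 2N-1, so circular == linear correlation

# Power spectrum computed directly: Y(k) = |FFT{y}|^2
Y = np.abs(np.fft.fft(y, K)) ** 2

# Short-time autocorrelation r_y(tau) for tau = -(N-1)..(N-1)
r = np.correlate(y, y, mode="full")   # length 2N-1, lag 0 at index N-1

# Arrange lags for the DFT: r(0)..r(N-1), zeros, then r(-(N-1))..r(-1)
r_padded = np.zeros(K)
r_padded[:N] = r[N - 1:]              # non-negative lags
r_padded[K - (N - 1):] = r[:N - 1]    # negative lags wrap around

Y_from_r = np.fft.fft(r_padded).real  # DFT of the autocorrelation

print(np.allclose(Y, Y_from_r))       # True: the two definitions agree
```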
Definition of the differential power spectrum (cont.) • Assume the noise and the speech signal are mutually uncorrelated, so the power spectra add: Y(ω) = X(ω) + V(ω) • Differential power spectrum (DPS), in the continuous frequency domain: D(ω) = dY(ω)/dω = dX(ω)/dω + dV(ω)/dω
Definition of the differential power spectrum (cont.) • Its discrete counterpart can be approximated in terms of the following difference equation: D(k) ≈ Σ_{l=-P}^{O} b_l · Y(k+l) • P and O: the orders of the difference equation • b_l: real-valued weighting coefficients • 0 ≤ k < K, where K is the FFT length
Definition of the differential power spectrum (cont.) • D(k) = Y(k) – Y(k+1)
Representing DPS into speech features • Three problems • The selection of proper orders of the difference equation • The determination of the weights b_l • How the DPS should be converted into a few parameters • An optimal solution to any of the three problems is difficult to achieve • For the first two problems, three special forms are proposed • DPS1: D(k) = Y(k) – Y(k+1) • DPS2: D(k) = Y(k) – Y(k+2) • DPS3: D(k) = Y(k-2) + Y(k-1) – Y(k+1) – Y(k+2) • For the third problem, the DPS is converted into cepstral coefficients • An absolute-value operation makes the negative parts positive • The magnitude of the DPS is passed through a mel-frequency filter bank • The logarithmic filter bank outputs are compressed into a feature vector
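The three special forms and the DPS-to-cepstrum steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names (`dps`, `dct2`), the sampling rate, FFT length, number of filters, and number of cepstra are all assumptions, and the triangular mel filter bank is a generic textbook construction.

```python
import numpy as np

def power_spectrum(frame, K=512):
    return np.abs(np.fft.fft(frame, K)) ** 2      # Y(k), k = 0..K-1

def dps(Y, form=1):
    """The three special difference forms from the slide (hypothetical helper)."""
    if form == 1:                                  # DPS1: Y(k) - Y(k+1)
        return Y[:-1] - Y[1:]
    if form == 2:                                  # DPS2: Y(k) - Y(k+2)
        return Y[:-2] - Y[2:]
    return Y[:-4] + Y[1:-3] - Y[3:-1] - Y[4:]      # DPS3

def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filt, K, sr):
    """Generic triangular mel filter bank over FFT bins 0..K//2."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2))
    bins = np.floor((K + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filt, K // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def dct2(x):
    """DCT-II: the usual step from log filter bank energies to cepstra."""
    n = np.arange(len(x))
    return np.cos(np.pi * n[:, None] * (2 * n + 1) / (2 * len(x))) @ x

# Assumed parameters for the sketch
sr, K, n_filt, n_ceps = 8000, 512, 24, 13
frame = np.random.default_rng(1).standard_normal(400)   # stand-in speech frame

Y = power_spectrum(frame, K)
D = np.abs(dps(Y, form=1))             # absolute value makes negative parts positive
D = np.append(D, D[-1])[:K // 2 + 1]   # pad back to K//2+1 bins for the filter bank
fb = mel_filterbank(n_filt, K, sr)
feats = dct2(np.log(fb @ D + 1e-10))[:n_ceps]   # DPSCC feature vector
print(feats.shape)                               # (13,)
```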
Comparison with the cepstral liftering technique • If x_i is the i-th cepstral coefficient, then the corresponding liftered cepstral coefficient is given by x̂_i = W_i · x_i, where the W_i define the lifter • Various types of lifters have been proposed in the literature
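As a concrete instance of x̂_i = W_i · x_i: the sinusoidal lifter below is one common choice from the literature (the slide does not specify which W_i the paper uses, so the lifter form and the length L = 22 are assumptions).

```python
import numpy as np

L = 22                                   # common lifter length (an assumption)
i = np.arange(1, 13)                     # cepstral indices 1..12
W = 1 + (L / 2) * np.sin(np.pi * i / L)  # sinusoidal lifter weights W_i

x = np.random.default_rng(2).standard_normal(12)  # stand-in cepstral coefficients
x_liftered = W * x                       # liftered coefficients: x̂_i = W_i * x_i
print(x_liftered.shape)                  # (12,)
```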
Comparison with the cepstral liftering technique (cont.) Effect of cepstral liftering on the performance of a DTW-based speech recognizer
Comparison with the cepstral liftering technique (cont.) • When the Mahalanobis distance is used (e.g. in an HMM recognizer), liftering has no effect on the recognition process • When liftered cepstral coefficients are used, the lifter acts as a weighting matrix that is absorbed into the covariance term of the distance, so the distance is unchanged
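The cancellation claim above can be verified numerically for a diagonal-covariance Gaussian (the typical HMM state model): scaling the features by W and the variances by W² leaves the Mahalanobis distance unchanged. The specific weights and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 12
x, y = rng.standard_normal(d), rng.standard_normal(d)
var = rng.uniform(0.5, 2.0, d)           # diagonal covariance of a Gaussian state
W = 1 + 6 * np.sin(np.pi * np.arange(1, d + 1) / 22)   # any lifter weights

def mahalanobis(a, b, v):
    return np.sum((a - b) ** 2 / v)

d_plain = mahalanobis(x, y, var)
# Liftering scales the features by W and the variances by W^2,
# so the weights cancel and the distance is unchanged:
d_lift = mahalanobis(W * x, W * y, W ** 2 * var)
print(np.isclose(d_plain, d_lift))       # True
```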
Comparison with the cepstral liftering technique (cont.) • In the DPS-based cepstrum, by contrast, the spectral differencing changes the features themselves rather than merely rescaling them, so its effect is not cancelled in the recognition process
Comparison with the spectral subtraction • SS can be formulated as S(k) = max( Y(k) − α·V(k), β·Y(k) ), where V(k) is an estimate of the noise power spectrum • α: controls the amount of noise subtracted from the noisy signal • β: spectral flooring • For speech recognition, it was found that SS operated in each band-pass filter could yield a more consistent improvement for MFCC features against noise • E_Y(k) is the output of the k-th band-pass filter when Y(k) is passed through the filter • attack / decay: time constants used when updating the noise estimate
Experiments • In this paper they conduct a number of speech recognition experiments • Isolated speech recognition • SNR improvement • Connected digits recognition • Phone recognition • Evaluation on AURORA task
Experiments - Isolated speech recognition • TI46 database – an isolated spoken words database (TI) • 16 speakers (8 male / 8 female) • The vocabulary consists of • 10 isolated digits from ‘ZERO’ to ‘NINE’ • 26 isolated English letters from ‘A’ to ‘Z’ • 10 isolated words: “ENTER, ERASE, GO, HELP, NO, RUBOUT, REPEAT, STOP, START, YES” • 26 utterances of each word from each speaker (10 for training / 16 for testing) • In this experiment, four sets of features are considered • MFCC • DPSCC1 • DPSCC2 • DPSCC3
Experiments - Isolated speech recognition (cont.) • The DPS-based features yield at least comparable performance to the standard MFCCs • For both MFCCs and DPSCCs, the inclusion of dynamic and acceleration features greatly improves performance
Experiments - SNR improvement • Clean speech signals are taken from the TI46 database • Lynx noise is taken from the NOISEX database • Two SNR measures are compared • Power spectrum based • DPS based
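One plausible way to carry out such a comparison: measure an SNR on the power-spectrum representation and the same SNR on its first-order difference (DPS). The exact SNR definition used in the paper is not shown on the slide, so the formula and signals below are purely illustrative.

```python
import numpy as np

def snr_db(clean_repr, noisy_repr):
    """Assumed per-frame SNR on a spectral representation:
    signal power over the power of the representation error."""
    err = noisy_repr - clean_repr
    return 10 * np.log10(np.sum(clean_repr ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(5)
x = rng.standard_normal(256)                 # stand-in "clean" frame
y = x + 0.3 * rng.standard_normal(256)       # noisy frame

X = np.abs(np.fft.fft(x)) ** 2
Y = np.abs(np.fft.fft(y)) ** 2
snr_ps = snr_db(X, Y)                        # power-spectrum based
snr_dps = snr_db(np.diff(X), np.diff(Y))     # DPS based (first-order difference)
print(snr_ps, snr_dps)
```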
Experiments - SNR improvement (cont.) Tge average SNRD is approximately 4 dB higher than SNRY
Experiments - Connected digits recognition • TI connected digits database – contains digit strings uttered by adult and child speakers • The vocabulary consists of • 11 words – the 10 digits and “oh” • Each speaker uttered 77 sequences of these words • Noise is added to the test-set speech, while the training speech is kept clean • Wide-band stationary speech noise, machine-gun noise, Lynx noise • Four sets of feature vectors are investigated • MFCC • DPSCC • MFCC + CMN • DPSCC + CMN
Experiments - Connected digits recognition (cont.) • Compared with MFCCs, DPSCC yields at least comparable performance in clean conditions • In most strong-noise conditions, DPSCC outperforms MFCC • CMN is effective in improving the robustness of both
Experiments - Phone recognition • TIMIT phoneme-based continuous speech database • Contains a total of 6300 sentences • 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the US • Phonetic recognition is performed over the set of 39 classes that are commonly used for evaluation • Noise is added to the test-set speech, while the training speech is kept clean • Wide-band stationary speech noise, machine-gun noise, Lynx noise • Two feature sets are used • MFCC + CMN (39 coefficients) • DPSCC + CMN (39 coefficients)
Experiments - Phone recognition (cont.) • The MFCC and DPSCC features yield comparable results in clean and weak-noise conditions • The DPSCC features slightly outperform the MFCC features in strong-noise conditions
Experiments - Evaluation on AURORA task • Noise signals are recorded at different places • Suburban train, babble, car, exhibition hall, restaurant, street, airport and train station • Two training modes are defined • Training on clean data only • 8440 utterances (55 male / 55 female speakers) • Signals are filtered with the G.712 characteristic, without noise added • Training on clean as well as noisy data (multi-condition) • 8440 utterances, split into 20 subsets (422 utterances each) • Suburban train, babble, car, and exhibition hall noises are added to the 20 subsets at 5 different SNRs (20, 15, 10, 5 dB and the clean condition) • Three test sets are defined • Test Set A, Test Set B, Test Set C
Experiments - Evaluation on AURORA task (cont.) • With the use of CMN, the average word error rate is reduced by 8.8% • When SS is used together with CMN, the average performance improves by 19.3% • The DPS-based cepstrum outperforms MFCC, and it also yields slightly better performance than SS
Discussion and conclusion • DPS also preserves the spectral information needed to discriminate among different linguistic units (e.g. phonemes and words) • DPS has a higher SNR than the power spectrum, especially for voiced frames • DPS-based features should therefore be more resilient to noise than power-spectrum-based features • The DPSCC yields at least comparable performance to the conventional MFCCs • In most cases, it outperforms MFCC • Compared to the estimation of MFCC, the extraction of DPSCC requires (K/2 − 1) more addition (subtraction) and absolute-value operations per frame • This increase in computational complexity is negligible on today’s computers