Wen-Yi Chu Department of Computer Science & Information Engineering

Subband Feature Statistics Normalization Techniques Based on a Discrete Wavelet Transform for Robust Speech Recognition
Jeih-weih Hung, Member, IEEE, and Hao-Teng Fan
Wen-Yi Chu, Department of Computer Science & Information Engineering, National Taiwan Normal University

Outline • Introduction • Subband Feature Statistics Normalization Method • Experimental Setup • Experimental Results And Discussions • Concluding Remarks And Future Work

Introduction • This letter proposes a novel scheme that applies feature statistics normalization techniques for robust speech recognition. • Partially motivated by the above observations, we propose decomposing the feature stream into subband streams and then performing the normalization process on some or all of the subband streams separately. The new feature stream is reconstructed by properly integrating all substreams. • In particular, the above decomposition and reconstruction procedures are based on the well-known discrete wavelet transform (DWT).

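Subband Feature Statistics Normalization Method(1/4) • Discrete Wavelet Transform (DWT) • x[n]: the discrete input signal. • g[n]: low-pass filter, which removes the high-frequency part of the input signal and outputs the low-frequency part. • h[n]: high-pass filter, the opposite of the low-pass filter: it removes the low-frequency part and outputs the high-frequency part. • Q: downsampling filter (downsampler), which reduces the rate of the output signal to 1/Q of the rate of the input signal. Here Q = 2 is used as an example.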
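The filter-and-downsample step on this slide can be sketched as follows. This is a minimal illustration, assuming Haar filters for g[n] and h[n] (the slides do not say which wavelet is used) and Q = 2; the names `convolve` and `dwt_one_level` are hypothetical.

```python
import math

# Assumed Haar analysis filters; the slides do not name the wavelet.
G = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # g[n]: low-pass
H = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # h[n]: high-pass


def convolve(x, taps):
    """Full linear convolution of x with a short FIR filter."""
    y = [0.0] * (len(x) + len(taps) - 1)
    for i, xi in enumerate(x):
        for j, tj in enumerate(taps):
            y[i + j] += xi * tj
    return y


def dwt_one_level(x, q=2):
    """Filter x with g[n] and h[n], then keep every q-th output sample."""
    low = convolve(x, G)[q - 1::q]    # low-frequency half band
    high = convolve(x, H)[q - 1::q]   # high-frequency half band
    return low, high


x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low, high = dwt_one_level(x)
```

With Q = 2, each output stream has half as many samples as x[n], so the two outputs together hold as many values as the input.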

Subband Feature Statistics Normalization Method(2/4) • We consider the mel-scaled filter-bank cepstral coefficients (MFCC) for speech recognition.

Subband Feature Statistics Normalization Method(3/4) • Given that the frame rate of is in Hz, and that is within the modulation spectral band , the band range of the subband stream can be approximately represented as • If MVN is selected as the normalization method, then the relationship between and is • If HEQ is selected as the normalization method, then the relationship between and is
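The two normalization choices named on this slide can be sketched in their generic textbook forms. This is a hedged illustration, not the letter's exact formulation: the function names, the mid-rank CDF estimate, and the use of a list of reference samples to stand in for the target distribution are all assumptions.

```python
import math
from bisect import bisect_right


def mvn(stream, target_mean=0.0, target_std=1.0):
    """Mean and variance normalization: shift and scale to target statistics."""
    n = len(stream)
    mean = sum(stream) / n
    var = sum((v - mean) ** 2 for v in stream) / n
    std = math.sqrt(var) or 1.0  # guard against a constant stream
    return [(v - mean) / std * target_std + target_mean for v in stream]


def heq(stream, reference):
    """Histogram equalization: empirical CDF, then inverse reference CDF.

    `reference` is an assumed list of samples representing the target
    distribution (for example, pooled training features).
    """
    sorted_stream = sorted(stream)
    sorted_ref = sorted(reference)
    n, m = len(sorted_stream), len(sorted_ref)
    out = []
    for v in stream:
        # Mid-rank estimate of the CDF, always strictly inside (0, 1).
        cdf = (bisect_right(sorted_stream, v) - 0.5) / n
        # Nearest reference quantile as the inverse target CDF.
        out.append(sorted_ref[min(m - 1, int(cdf * m))])
    return out


normalized = mvn([1.0, 2.0, 3.0, 4.0])
equalized = heq([3.0, 1.0, 4.0, 2.0], [10.0, 20.0, 30.0, 40.0])
```

In the subband setting, either function would be applied to one subband stream at a time, with target statistics estimated separately for each subband.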

Subband Feature Statistics Normalization Method(4/4) • Finally, we reconstruct the new feature stream for the utterance from the updated subband streams together with the other unchanged streams, using an inverse discrete wavelet transform (IDWT) with the same number of levels as the decomposition, as depicted on the right side of Fig. 1. • In SB-MVN, the streams corresponding to different subbands have different target means and variances. A similar condition holds for SB-HEQ: the streams for different subbands employ different target distribution functions. • In the proposed methods, the lower frequencies are covered by more subbands, each with a narrower bandwidth. • Due to the down-sampling operation in the DWT, the total number of data points across all subband streams is approximately equal to that of the original stream.
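The decompose, normalize, and reconstruct steps can be sketched end to end for the MVN case. This is a minimal sketch under stated assumptions: a Haar wavelet (the slides do not name the wavelet), a stream length divisible by 2 to the power of the number of levels, and hypothetical names (`decompose`, `reconstruct`, `subband_mvn`) and target values.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(x):
    """One Haar analysis level: (approximation, detail), each half length."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """One Haar synthesis level, the exact inverse of haar_dwt."""
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return x


def decompose(x, levels):
    """Return levels + 1 substreams ordered from lowest to highest band."""
    details, current = [], list(x)
    for _ in range(levels):
        current, d = haar_dwt(current)
        details.append(d)
    return [current] + details[::-1]


def reconstruct(bands):
    """Multi-level IDWT over substreams ordered as in decompose()."""
    current = bands[0]
    for d in bands[1:]:
        current = haar_idwt(current, d)
    return current


def mvn(stream, target_mean, target_std):
    n = len(stream)
    mean = sum(stream) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in stream) / n) or 1.0
    return [(v - mean) / std * target_std + target_mean for v in stream]


def subband_mvn(x, levels, targets, selected):
    """Normalize only the selected subbands, then rebuild the stream.

    targets maps a subband index to an assumed (mean, std) pair.
    """
    bands = decompose(x, levels)
    for b in selected:
        bands[b] = mvn(bands[b], *targets[b])
    return reconstruct(bands)


x = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
bands = decompose(x, 3)
y = subband_mvn(x, 3, {0: (0.0, 1.0), 1: (0.0, 1.0)}, [0, 1])
```

Here the four substreams of a 16-point stream have lengths 2, 2, 4, and 8, so their total matches the original length, and the unselected substreams pass through the reconstruction unchanged.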

Experimental Setup • Each feature sequence for each utterance in both the training and testing sets is decomposed into L subband streams. For each subband, the features of all of the utterances in the training set are used to estimate the required target statistics, which will be used for each utterance in the training and testing sets. • The parameter L is preliminarily set to 4, which indicates that a three-level DWT is performed, and the frequency ranges for the four octave subband streams are approximately , , and , respectively.
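The octave band edges implied by an (L-1)-level DWT can be computed as below. The 100 Hz frame rate is an assumption made only for illustration (it corresponds to a 10 ms frame shift) and is not taken from the slides; `octave_band_ranges` is a hypothetical helper.

```python
def octave_band_ranges(frame_rate_hz, num_subbands):
    """Approximate (low, high) edges in Hz of each octave subband.

    A (num_subbands - 1)-level DWT repeatedly halves the lowest band,
    starting from the full range [0, frame_rate_hz / 2].
    """
    nyquist = frame_rate_hz / 2.0
    levels = num_subbands - 1
    edges = [0.0] + [nyquist / 2 ** k for k in range(levels, -1, -1)]
    return list(zip(edges[:-1], edges[1:]))


# Assumed frame rate of 100 Hz; L = 4 as in the setup above.
ranges = octave_band_ranges(100.0, 4)
```

Under that assumed frame rate, the four bands come out as 0 to 6.25 Hz, 6.25 to 12.5 Hz, 12.5 to 25 Hz, and 25 to 50 Hz; a different frame rate scales all edges proportionally.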

Experimental Results And Discussions(1/3) • The results in Fig. 2 indicate that all the normalization methods provide significant accuracy improvements for all noise types.

Experimental Results And Discussions (2/3) • These results are somewhat consistent with the observation in past research that the modulation frequency components between 1 Hz and 16 Hz are particularly important for speech recognition. • These results imply that, given a fixed number of subbands, placing more subbands in lower frequencies is more helpful in the proposed methods.

Experimental Results And Discussions(3/3)

Concluding Remarks And Future Work • In this letter, we propose performing a normalization process on the subband feature streams and show that the subband MVN and HEQ are superior to the conventional full-band MVN and HEQ. • In future work, we will integrate other normalization techniques such as HOCMN and CSN into the subband processing scheme to determine whether better performance can be achieved. • In addition, we will apply other types of wavelet functions in the DWT and IDWT processes of our approach to investigate whether a different analysis/synthesis operation influences the recognition accuracy.
