Speech Discrimination Based on Multiscale Spectro–Temporal Modulations Nima Mesgarani, Shihab Shamma, University of Maryland Malcolm Slaney, IBM Reporter: Chen, Hung-Bin
Outline • Introduction: VAD (Voice Activity Detection and Speech Segmentation) • discriminating speech from non-speech (noise and other sounds) • multiscale spectro-temporal modulation features extracted using a model of the auditory cortex • Two state-of-the-art systems • Robust Multifeature Speech/Music Discriminator • Robust Speech Recognition in Noisy Environments • Auditory model • Experimental results • Summary and Conclusions
Introduction - VAD • significance • For speech recognition systems designed for real-world conditions, robust discrimination of speech from other sounds is a crucial step. • advantage • Speech discrimination can also be used in coding and telecommunication applications. • proposed system • a feature set inspired by investigations of various stages of the auditory system
Two state-of-the-art systems • Multi–feature System • Features • Thirteen features in the time, frequency, and cepstral domains are used to model speech and music (or noise). • Classification • A Gaussian mixture model (GMM) models each class of data as the union of several Gaussian clusters in the feature space (see the sketch below). • Reference: • [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997.
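A minimal sketch of this kind of per-class GMM classifier, assuming scikit-learn; the feature matrices here are random placeholders standing in for the 13 real time/frequency/cepstral features:

```python
# Sketch of a GMM-based speech/music classifier in the spirit of [1].
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder feature matrices (frames x 13 features); real features would
# come from the audio front end described above.
speech_feats = rng.normal(0.0, 1.0, size=(500, 13))
music_feats = rng.normal(0.5, 1.2, size=(500, 13))

# One GMM per class, each modeling its class as a union of Gaussian clusters.
gmm_speech = GaussianMixture(n_components=8, covariance_type="diag").fit(speech_feats)
gmm_music = GaussianMixture(n_components=8, covariance_type="diag").fit(music_feats)

def classify(frames):
    """Label each frame by the class whose GMM assigns higher likelihood."""
    ll_speech = gmm_speech.score_samples(frames)
    ll_music = gmm_music.score_samples(frames)
    return np.where(ll_speech > ll_music, "speech", "music")

print(classify(rng.normal(0.0, 1.0, size=(5, 13))))
```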
Two state-of-the-art systems (cont) • Voicing–energy System • Features • Frame-by-frame maximum autocorrelation and log-energy features are used to make the speech/non-speech decision. • PLP features • LDA+MLLT transforms • Segmentation • An HMM-based segmentation procedure with two models is used, one for speech segments and one for non-speech segments (a minimal Viterbi sketch follows below). • Reference: • [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002.
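A minimal two-state Viterbi segmentation sketch in the spirit of [2]; the per-frame log-likelihoods here are synthetic stand-ins for GMM scores over the real autocorrelation/log-energy (or PLP) features, and the transition probabilities are assumed values:

```python
# Two-state (speech / non-speech) HMM segmentation via Viterbi decoding.
import numpy as np

def viterbi_two_state(loglik, log_trans, log_init):
    """loglik: (T, 2) frame log-likelihoods for [non-speech, speech]."""
    T = loglik.shape[0]
    delta = np.full((T, 2), -np.inf)      # best path score ending in each state
    psi = np.zeros((T, 2), dtype=int)     # backpointers
    delta[0] = log_init + loglik[0]
    for t in range(1, T):
        for j in range(2):
            scores = delta[t - 1] + log_trans[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + loglik[t, j]
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path  # 0 = non-speech, 1 = speech

# "Sticky" transitions discourage rapid switching between segments.
log_trans = np.log(np.array([[0.99, 0.01], [0.01, 0.99]]))
log_init = np.log(np.array([0.5, 0.5]))
loglik = np.random.default_rng(1).normal(size=(100, 2))
print(viterbi_two_state(loglik, log_trans, log_init)[:20])
```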
Auditory model • The computational auditory model is based on neurophysiological, biophysical, and psychoacoustical investigations at various stages of the auditory system. • Its first stage transforms the acoustic signal into an internal neural representation, the auditory spectrogram.
Auditory model (cont) • The cochlear input is a complex spatiotemporal pattern of vibrations along the basilar membrane of the cochlea. • Hair-cell transduction is modeled as a 3-step process: • a highpass filter, followed by an instantaneous nonlinear compression • a lowpass filter (hair cell membrane leakage) • A lateral inhibitory network then detects discontinuities in the responses across the tonotopic axis of the auditory nerve array. • The later cortical stage is implemented computationally via a bank of modulation-selective filters centered at each frequency along the tonotopic axis.
Auditory model (cont) • Sound is analyzed by a model of the cochlea consisting of a bank of 128 constant-Q bandpass filters with center frequencies equally spaced on a logarithmic frequency axis. A rough sketch of this early stage follows.
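A rough sketch of the early auditory stage described above, assuming SciPy; the Butterworth bands, tanh compression, and cutoff frequencies are crude stand-ins for the model's actual cochlear and hair-cell filters, and only 16 channels are used for brevity (the model uses 128):

```python
# Crude early-auditory-stage sketch: constant-Q bandpass analysis, hair-cell
# highpass + compression + lowpass, lateral inhibition, and ~8 ms integration.
import numpy as np
from scipy.signal import butter, lfilter

def auditory_spectrogram(x, fs, n_channels=16, fmin=180.0, fmax=4000.0):
    cfs = np.geomspace(fmin, fmax, n_channels)         # log-spaced center freqs
    bh, ah = butter(1, 20.0 / (fs / 2), btype="high")  # fluid-cilia highpass
    bl, al = butter(1, 1000.0 / (fs / 2))              # membrane-leakage lowpass
    chans = []
    for cf in cfs:
        lo, hi = cf / 2 ** (1 / 6), cf * 2 ** (1 / 6)  # crude constant-Q band
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = lfilter(b, a, x)
        band = np.tanh(3.0 * lfilter(bh, ah, band))    # instantaneous compression
        chans.append(lfilter(bl, al, band))
    y = np.asarray(chans)
    lin = np.maximum(np.diff(y, axis=0), 0.0)          # lateral inhibition + rectify
    frame = int(0.008 * fs)                            # ~8 ms integration window
    n = lin.shape[1] // frame
    return lin[:, : n * frame].reshape(lin.shape[0], n, frame).mean(axis=2)

fs = 16000
t = np.arange(fs) / fs
print(auditory_spectrogram(np.sin(2 * np.pi * 440 * t), fs).shape)
```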
Multilinear Analysis of Cortical Representation • The output of the auditory model is a multidimensional array. • The time dimension is averaged over a given time window, yielding a three-mode tensor for each window, with each element representing the overall modulation energy at the corresponding frequency, rate, and scale (128 frequency channels × 26 rates × 6 scales). A sketch of the rate-scale analysis follows.
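One way such a frequency × rate × scale tensor might be formed, sketched here with Gaussian rate-scale windows in the 2D modulation (FFT) domain as a rough approximation of the model's modulation-selective filters; the actual cortical filters are defined differently (they are complex-valued and directional), and all dimensions are reduced for brevity:

```python
# Approximate rate-scale analysis of an auditory spectrogram.
import numpy as np

def cortical_tensor(spec, frame_rate, chans_per_oct,
                    rates=(2, 4, 8, 16), scales=(0.5, 1, 2)):
    F, T = spec.shape
    S2 = np.fft.fft2(spec)                        # 2D modulation spectrum
    wf = np.fft.fftfreq(F, d=1.0 / chans_per_oct) # scale axis (cyc/octave)
    wt = np.fft.fftfreq(T, d=1.0 / frame_rate)    # rate axis (Hz)
    out = np.zeros((F, len(rates), len(scales)))
    for i, r in enumerate(rates):
        for j, s in enumerate(scales):
            # Gaussian transfer function centered on this (rate, scale) pair.
            H = (np.exp(-((np.abs(wf)[:, None] - s) ** 2) / (2 * (0.3 * s) ** 2))
                 * np.exp(-((np.abs(wt)[None, :] - r) ** 2) / (2 * (0.3 * r) ** 2)))
            filtered = np.fft.ifft2(S2 * H)
            out[:, i, j] = np.abs(filtered).mean(axis=1)  # average over time
    return out

spec = np.random.default_rng(2).random((32, 200))
print(cortical_tensor(spec, frame_rate=125.0, chans_per_oct=8).shape)  # (32, 4, 3)
```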
Multilinear Analysis of Cortical Representation (cont) • Multidimensional PCA is used to tailor the amount of reduction in each subspace independently. • To handle multidimensional tensors, a generalization of the SVD (Singular Value Decomposition) to tensors is used: • D = S ×₁ U_frequency ×₂ U_rate ×₃ U_scale ×₄ U_samples • D : the data tensor • S : the core tensor, of size I1 × I2 × ... × IN • Original dimensions: 128 (frequency channels) × 26 (rates) × 6 (scales) • The resulting tensor, projected onto the retained singular vectors in each mode (7 for frequency, 5 for rate, and 3 for scale), is used for classification (see the sketch below). • Classification was performed using a Support Vector Machine (SVM).
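A minimal HOSVD-style sketch of this reduction plus SVM classification, assuming NumPy and scikit-learn; the sample tensors are random placeholders, while the tensor shape (128 × 26 × 6) and the retained ranks (7, 5, 3) follow the slide:

```python
# HOSVD-style dimensionality reduction of cortical tensors, then SVM.
import numpy as np
from sklearn.svm import SVC

def unfold(T, mode):
    """Mode-n unfolding: the given mode becomes rows, the rest columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_bases(tensors, ranks):
    """Leading left singular vectors of each mode's unfolding."""
    stacked = np.stack(tensors)  # samples x 128 x 26 x 6
    return [np.linalg.svd(unfold(stacked, m + 1), full_matrices=False)[0][:, :r]
            for m, r in enumerate(ranks)]

def project(T, bases):
    """Core tensor T x_1 U_f^T x_2 U_r^T x_3 U_s^T, flattened to a vector."""
    for m, U in enumerate(bases):
        T = np.moveaxis(np.tensordot(U.T, T, axes=(1, m)), 0, m)
    return T.ravel()  # 7 * 5 * 3 = 105 features

rng = np.random.default_rng(3)
X = [rng.random((128, 26, 6)) for _ in range(60)]  # placeholder samples
y = np.array([0, 1] * 30)                          # speech / non-speech labels
bases = mode_bases(X, ranks=(7, 5, 3))
feats = np.array([project(T, bases) for T in X])
clf = SVC(kernel="rbf").fit(feats, y)
print(clf.predict(feats[:5]))
```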
Experimental Results • Speech audio from the TIMIT database • Training data: 300 samples • Testing data: 150 different sentences spoken by 50 different speakers (25 male, 25 female) • The training and test sets were disjoint. • The non-speech class was assembled from the BBC Sound Effects audio CDs, the RWC Genre Database, and the Noisex and Aurora databases. • Training set: 300 speech and 740 non-speech samples • Testing set: 150 speech and 450 non-speech samples • All samples have equal audio length.
Experimental Results (cont) • Speech detection/discrimination: Tables 1 and 2 show the detection and discrimination performance of the proposed and baseline systems.
Experimental Results (cont) • In these tests, white and pink noise were added to the speech at specified signal-to-noise ratios (SNRs), as in the sketch below.
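A short sketch of mixing noise at a target SNR; the pink-noise shaping shown is one common 1/f approximation, not necessarily the exact procedure used in the paper:

```python
# Scale noise so the mixture hits a target SNR in dB, then add it to speech.
import numpy as np

def add_noise(speech, noise, snr_db):
    """Scale `noise` to achieve the requested SNR relative to `speech`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(4)
speech = rng.normal(size=16000)  # placeholder speech signal
white = rng.normal(size=16000)   # white noise

# Pink (1/f) noise via spectral shaping of white noise.
spec = np.fft.rfft(rng.normal(size=16000))
freqs = np.maximum(np.fft.rfftfreq(16000), 1.0 / 16000)  # avoid divide-by-zero
pink = np.fft.irfft(spec / np.sqrt(freqs), n=16000)

noisy_white = add_noise(speech, white, snr_db=10)
noisy_pink = add_noise(speech, pink, snr_db=10)
```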
Experimental Results (cont) • The effect of different levels of reverberation on performance was also measured (a simple reverberation sketch follows).
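A simple way to simulate such a reverberation test: convolve the signal with a synthetic exponentially decaying noise tail. This is a crude stand-in for real room impulse responses; the paper does not specify its exact reverberation procedure here:

```python
# Synthetic reverberation: convolve with an exponentially decaying noise tail.
import numpy as np

def reverberate(x, fs, t60=0.5):
    n = int(t60 * fs)
    t = np.arange(n) / fs
    # Energy decays by 60 dB over t60 seconds: amplitude ~ 10^(-3 t / t60).
    rir = np.random.default_rng(5).normal(size=n) * 10 ** (-3.0 * t / t60)
    rir[0] = 1.0  # keep the direct path dominant
    return np.convolve(x, rir)[: len(x)]

fs = 16000
dry = np.random.default_rng(6).normal(size=fs)  # placeholder signal
wet = reverberate(dry, fs, t60=0.3)
```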
Summary and Conclusions • This work is but one in a series of efforts at incorporating multiscale cortical representations (and, more broadly, perceptual insights) into a variety of audio and speech processing applications. • Applications include • automatic classification • segmentation of animal sounds • efficient encoding of speech and music
Reference • Two state-of-the-art systems • [1] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator", ICASSP '97, 1997. • [2] B. Kingsbury, G. Saon, L. Mangu, M. Padmanabhan and R. Sarikaya, "Robust speech recognition in noisy environments: The 2001 IBM SPINE evaluation system", ICASSP 2002, vol. I, pp. 53–56, 2002. • Central auditory system • [4] K. Wang and S. A. Shamma, "Spectral shape analysis in the central auditory system", IEEE Trans. Speech Audio Proc., vol. 3 (5), pp. 382–395, 1995. • [6] M. Elhilali, T. Chi and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility", Speech Communication, vol. 41, pp. 331–348, 2003. • S. A. Shamma, "Auditory cortical representation of complex acoustic spectra as inferred from the ripple analysis method" • http://www.isr.umd.edu/People/faculty/Shamma.html