Abstract
We report comparisons between a model incorporating a bank of dual-resonance nonlinear (DRNL) filters and one incorporating a bank of linear gammatone filters. Previous computational models of the auditory system have typically used linear gammatone filters to model the frequency selectivity of the basilar membrane. These filters have been adequate as a first approximation, but they lack certain characteristics that have been demonstrated in psychophysical and physiological studies: compression, downward shifts in best frequency and widening of the filter response with increasing level, and tuning curves showing a long low-frequency tail and a sharp high-frequency cutoff. The complete model incorporates several stages of processing. English vowels synthesised using the Klatt synthesiser are passed through a pre-emphasis filter modelling the outer/middle ear transfer function. A filterbank models the frequency selectivity of the basilar membrane. Auditory nerve spikes are generated for each frequency channel using a model of inner hair cell/auditory nerve (IHC/AN) function. The spiking activity in each channel is used to generate an autocorrelation function (ACF) to display signal periodicity, and the ACFs are summed across all channels to generate a summary autocorrelation function (SACF). The SACF picks up timbral properties of the vowels at delays from 0 to 4.5 ms. The model is run using both a nonlinear and a linear bank of filters, and the output patterns from the two banks are distinctly different. Each linear filter shows a unique response dominated by its corresponding harmonic. By contrast, adjacent nonlinear filters may show similar responses dominated by the nearest spectral peak that is lower in frequency than the filter’s best frequency. This difference in the pattern of filter responses is reflected in the ACF channels and therefore in the SACF. In addition, the nonlinear model retains the same pattern of peaks and troughs in the SACF when the signal level is varied between 50 and 90 dB SPL, while the linear model shows large changes in the SACF at different levels. This is because the IHC/AN stage in the nonlinear model becomes saturated at low (< 50 dB) levels across all channels, whereas the IHC/AN stage in the linear model saturates only slowly with increasing level in the spectral troughs, so the overall spike pattern, and therefore the SACF, changes as the level varies. We anticipate that the level invariance of the nonlinear model will facilitate vowel recognition in future modelling work. This investigation was carried out using the Development System for Auditory Modelling (DSAM).
Introduction
We report comparisons of the responses to synthesised English vowels of two computational models of auditory processing. The two models differ only in the filters used to simulate the basilar membrane. The first, referred to as the linear model, incorporates a bank of linear (gammatone) auditory filters. The second, referred to as the nonlinear model, incorporates a bank of dual-resonance nonlinear (DRNL) auditory filters. The linear model is typical of previous computational models in using gammatone filters (e.g. Meddis & Hewitt, 1991; De Cheveigne, 1997). A problem is that linear filters lack certain characteristics known from physiology and psychophysics (e.g. Plack & Oxenham, 2000; Rhode & Cooper, 1996):
• Compression with increasing level
• Downward shift in best frequency (BF) with increasing level
• Widening of filter bandwidth with increasing level
• Long low-frequency tail and sharp high-frequency cutoff.
It is desirable to include such characteristics in modelling work. For this reason a nonlinear model using DRNL filters is presented, in order to see how these characteristics affect an established vowel representation, the summary autocorrelation function (SACF). Of particular interest are changes in the SACF with vowel intensity.
Model Description
A computational model comprising several sequential processing stages:
1. Stimulus input
2. Outer/middle ear filter
3. Filterbank (linear or nonlinear)
4. Inner hair cell excitation / auditory nerve spiking
5. Autocorrelation function (ACF)
6. Summary autocorrelation (SACF)
• Stimulus input stage: English vowels synthesised using a Klatt (1980) synthesiser at a 10 kHz sampling rate.
• Subsequent stages are described on the following pages; a minimal sketch of how the stages chain together is shown below.
• All following figures show responses to vowels with a fundamental frequency of 100 Hz.
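The stage sequence can be read as a simple function composition. The Python sketch below is not part of DSAM; the function names and signatures are hypothetical, and minimal sketches of the individual stage functions appear in the sections that follow.

```python
def run_model(stimulus, fs, outer_middle_ear, filterbank, ihc_an, acf_sacf):
    # Hypothetical wiring of the six stages; each callable is sketched in
    # the corresponding section below. `stimulus` is the Klatt-synthesised
    # vowel waveform sampled at `fs` (10 kHz in this study).
    pre = outer_middle_ear(stimulus, fs)       # 2. pre-emphasis filter
    bm = filterbank(pre, fs)                   # 3. BM velocity, one row per channel
    spikes = ihc_an(bm, fs)                    # 4. AN spike counts per channel
    acf, sacf, lags_ms = acf_sacf(spikes, fs)  # 5-6. per-channel ACF and summary ACF
    return acf, sacf, lags_ms
```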
Outer/Middle Ear and Basilar Membrane Filtering
• Outer/middle ear pre-emphasis filter: a single linear bandpass filter, 450 Hz - 5000 Hz 3 dB down points.
• Basilar membrane (BM) filtering: a bank of 100 filters, logarithmically spaced, with centre frequencies from 100 to 4000 Hz; either linear or nonlinear filters. Each filter channel converts the pressure at its frequency to BM velocity. A minimal code sketch of this stage follows below.
• The synthesised vowel ‘Ah’ has formants at 650, 950, 2950, 3300 and 3850 Hz.
• The linear model clearly picks out the spectral peaks in the signal, whereas the nonlinear model shows a response distributed across a wider range of filters.
[Figures: Linear model response to ‘Ah’ at 50 dB; Nonlinear model response to ‘Ah’ at 50 dB]
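For concreteness, here is a minimal Python sketch of the linear version of this stage, using SciPy's gammatone filter design (scipy.signal.gammatone). The poster does not state the pre-emphasis filter family, so a first-order Butterworth is assumed; with a 10 kHz sampling rate the 5000 Hz upper edge sits at Nyquist, so a high-pass approximates the bandpass. The DRNL alternative is not sketched here.

```python
import numpy as np
from scipy import signal

FS = 10_000                           # Klatt synthesis rate (Hz)
CFS = np.geomspace(100, 4000, 100)    # 100 log-spaced centre frequencies (Hz)

def outer_middle_ear(x, fs=FS):
    # Assumed 1st-order Butterworth; the 450 Hz lower 3 dB point is from the
    # poster, and the 5000 Hz upper edge (= Nyquist at 10 kHz) is dropped.
    b, a = signal.butter(1, 450, btype="highpass", fs=fs)
    return signal.lfilter(b, a, x)

def gammatone_filterbank(x, fs=FS, cfs=CFS):
    # Linear BM stage: one IIR gammatone filter per channel, returning an
    # array with one row of "BM velocity" per centre frequency.
    out = np.empty((len(cfs), len(x)))
    for i, cf in enumerate(cfs):
        b, a = signal.gammatone(cf, "iir", fs=fs)
        out[i] = signal.lfilter(b, a, x)
    return out
```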
IHC Excitation / AN Spike Generation
• The Meddis2000 IHC/AN module (Sumner et al., 2000) converts BM velocity into auditory nerve spikes: BM velocity generates excitation in the inner hair cell model, which leads to spike generation in the auditory nerve model.
• AN fibre refractory period of 1 ms.
• 170 high spontaneous rate (50 spikes/s) auditory nerve fibres per channel.
• Limited dynamic range of 30 dB.
• A simplified sketch of this stage follows below.
[Figure: AN spiking (linear model) for ‘Ah’ at 50 dB]
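The full Meddis2000 module models hair-cell transduction and synaptic dynamics in detail; the sketch below is only a toy stand-in illustrating the ingredients named above (a saturating rate-level function, 170 fibres per channel, a 1 ms absolute refractory period). The rate bounds and the saturation constant `sat_vel` are illustrative assumptions, not values from Sumner et al. (2000).

```python
import numpy as np

def ihc_an_spikes(bm_velocity, fs=10_000, n_fibres=170, refrac_s=1e-3,
                  spont_rate=50.0, max_rate=300.0, sat_vel=1e-4, rng=None):
    # Toy stand-in for the Meddis2000 IHC/AN stage: half-wave rectified BM
    # velocity drives a saturating firing rate, which in turn drives an
    # inhomogeneous Poisson spike process per fibre with a 1 ms absolute
    # refractory period. Slow but transparent; max_rate and sat_vel are
    # illustrative only.
    rng = np.random.default_rng() if rng is None else rng
    drive = np.maximum(bm_velocity, 0.0) / sat_vel            # half-wave rectify
    rate = spont_rate + (max_rate - spont_rate) * drive / (1.0 + drive)
    p_spike = rate / fs                                       # per-sample probability
    refrac = int(refrac_s * fs)
    counts = np.zeros_like(bm_velocity)
    n_ch, n_t = bm_velocity.shape
    for ch in range(n_ch):
        for _ in range(n_fibres):
            last = -refrac
            for t in np.flatnonzero(rng.random(n_t) < p_spike[ch]):
                if t - last >= refrac:                        # enforce refractoriness
                    counts[ch, t] += 1
                    last = t
    return counts                                             # spike counts per channel
```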
Autocorrelation & Summary Autocorrelation Functions
• The IHC/AN output is correlated with itself at varying lags to produce an autocorrelation function (ACF) for each channel. The ACF detects signal periodicity in each channel: the dominant period(s) in a channel produce a peak at the corresponding lag.
• All ACF channels are summed at each lag to give the summary autocorrelation function (SACF). A minimal sketch of this computation follows below.
• Note that the peak at 10 ms corresponds to the fundamental frequency of the vowel (100 Hz).
[Figure: ACF (linear model) for ‘Ah’ at 50 dB]
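A minimal sketch of this stage, assuming spike counts arranged as a (channels × samples) array. A plain unwindowed autocorrelation is used; the running, exponentially weighted ACF of Meddis & Hewitt (1991) is not reproduced here. The maximum lag is a parameter: 0-4.5 ms was used for the SACF timbre plots, while lags beyond 10 ms are needed to see the F0 peak.

```python
import numpy as np

def acf_sacf(spike_counts, fs=10_000, max_lag_s=0.0105):
    # Per-channel autocorrelation up to max_lag, then a sum across channels.
    # The default lag range extends past 10 ms so the F0 (100 Hz) peak is visible.
    n_ch, n = spike_counts.shape
    n_lags = int(max_lag_s * fs)
    acf = np.empty((n_ch, n_lags))
    for ch in range(n_ch):
        full = np.correlate(spike_counts[ch], spike_counts[ch], mode="full")
        acf[ch] = full[n - 1:n - 1 + n_lags]   # lags 0 .. max_lag
    sacf = acf.sum(axis=0)                     # summary ACF
    lags_ms = np.arange(n_lags) / fs * 1e3
    return acf, sacf, lags_ms
```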
ACF Variations with Signal Level (Linear Model)
• Comparison of the ACF plots for the vowels ‘Ah’ and ‘Ee’ at 50 and 90 dB SPL. Individual channels generally respond differently from their neighbours.
• At 50 dB SPL the ACFs for each vowel are distinctive, because only the channels at the spectral peaks are active.
• With increasing signal level the ACF changes as more channels become active. At high signal levels all the channels become saturated, so the ACFs for the two vowels become more similar as spectral information is lost.
[Figures: ACFs for ‘Ah’ and ‘Ee’ at 50 dB and 90 dB]
ACF Variations with Signal Level (Nonlinear Model)
• Comparison of the ACF plots for the vowels ‘Ah’ and ‘Ee’ at 50 dB SPL and 90 dB SPL. Bands of coherent activity are visible in the figures.
• The ACFs for each vowel are distinctive at both 50 and 90 dB SPL.
• With increasing signal level the ACF changes as more channels become active.
[Figures: ACFs for ‘Ah’ and ‘Ee’ at 50 dB and 90 dB]
SACF Variations with Signal Level
• The first formants correspond to lags of 1.54 ms (650 Hz) for ‘Ah’ and 4 ms (250 Hz) for ‘Ee’ (see the worked conversion below).
• The linear model SACFs pick out no strong formant features at 50 dB, only the contribution of the 200 and 300 Hz harmonics (corresponding to lags of 5 ms and 3.3 ms respectively) at 50 and 90 dB. The SACFs for the two vowels are not significantly different for the linear model at 90 dB.
• The nonlinear model SACFs do vary, but show peaks at the same lags across sound levels, corresponding to the first formants of the vowels.
[Figures: SACFs for ‘Ah’ and ‘Ee’ at 50 dB and 90 dB, linear and nonlinear models]
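The lag of a formant peak is simply the reciprocal of the formant frequency, as this check confirms:

```python
# Lag (ms) of an SACF peak produced by a component at frequency f (Hz): 1000 / f.
for vowel, f1_hz in (("Ah", 650), ("Ee", 250)):
    print(f"{vowel}: F1 = {f1_hz} Hz -> lag = {1000 / f1_hz:.2f} ms")
# Ah: F1 = 650 Hz -> lag = 1.54 ms
# Ee: F1 = 250 Hz -> lag = 4.00 ms
```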
Results
• The output patterns from the two banks of filters are distinctly different. Each linear filter shows a unique response dominated by its corresponding harmonic. Adjacent nonlinear filters may show similar responses, dominated by the nearest spectral peak that is lower in frequency than the filter’s best frequency.
• The linear model does not highlight vowel formants in the SACF, because few channels respond to the formant harmonics. This is the case even though the filterbank response at 50 dB shows the spectral peaks.
• At 90 dB, nearly all the channels in the linear model are saturated, so spectral information about the vowel formants is lost. The SACF is dominated by the more densely packed low-frequency channels, regardless of the spectral shape of the stimulus.
• In contrast, the nonlinear model retains a representation of the vowel formants across signal levels. Within each filter channel the strongest harmonic drives the activity.
• Many channels in the nonlinear model are already saturated at 50 dB. There is less growth in activity at higher levels, just a spread of response around the first formant of the vowel, resulting in a more level-invariant response.
Conclusion
• Preliminary results suggest that the nonlinear model generates a more level-invariant representation of the vowel formants than the linear model allows, a property that is useful with respect to vowel identification.
• Although the linear filters are invariant with level, the combination of linear filters and a (nonlinear) hair cell model is not invariant with level.
• Nonlinear filters vary with level, but the combination of nonlinear filters with a (nonlinear) hair cell model may be invariant with level.
• The linear model does not reflect vowel formants in the SACF at high signal levels, owing to saturation in all channels. It would therefore not be expected to distinguish between different vowels presented at high levels (90 dB).
• The nonlinear model, in contrast, does represent vowel formants and should therefore distinguish between different vowels. Ongoing work is investigating this prediction.
• We anticipate that the level invariance demonstrated here by the nonlinear model will improve vowel identification.
References
• De Cheveigne, A. (1997). Concurrent vowel identification. III. A neural model of harmonic interference cancellation. Journal of the Acoustical Society of America, 101, 2857-2865.
• Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67, 971-990.
• Meddis, R. & Hewitt, M. J. (1991). Virtual pitch and phase sensitivity of a computer model of the auditory periphery. I: Pitch identification. Journal of the Acoustical Society of America, 89, 2866-2882.
• Plack, C. J. & Oxenham, A. J. (2000). Basilar-membrane nonlinearity estimated by pulsation threshold. Journal of the Acoustical Society of America, 107, 501-507.
• Rhode, W. S. & Cooper, N. P. (1996). Nonlinear mechanics in the apical turn of the chinchilla cochlea in vivo. Auditory Neuroscience, 3, 101-121.
• Sumner, C. J., Meddis, R. & O’Mard, L. P. (2000). An enhanced computational model of the inner hair cell auditory-nerve complex. British Journal of Audiology, 34, 117.

Acknowledgments
This work was carried out using the Development System for Auditory Modelling (DSAM), developed by Dr. Lowel P. O’Mard.