A glimpsing model of speech perception

Martin Cooke & Sarah Simpson
Speech and Hearing Research, Department of Computer Science, University of Sheffield
http://www.dcs.shef.ac.uk/~martin

Motivation: The nonstationarity ‘paradox’
• Speech technology performance falls with the nonstationarity of the noise background (Aurora eval) …
• … while listeners appear to prefer a nonstationary background (8-12 dB SRT gain; Miller, 1947)
Simpson & Cooke (2003)

Possible factors
In a 1-speaker background, listeners can …
• … employ organisational cues from the background source to help segregate foreground
• … employ schemas for both foreground and background
• … benefit from better glimpses of the speech target
but: multi-speaker backgrounds have certain advantages …
• … less chance of informational masking
• … easier enhancement algorithm

Glimpsing opportunities
Spectro-temporal glimpse densities: % of time-frequency regions with a locally-positive SNR
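The glimpse density described on this slide can be illustrated with a minimal sketch. It assumes the target and masker are available separately as time-frequency level maps in dB, so that local SNR is simply their difference. The function name `glimpse_density` and the toy levels are hypothetical and do not come from the study.

```python
def glimpse_density(target_db, masker_db, threshold_db=0.0):
    """Percentage of time-frequency cells whose local SNR
    (target level minus masker level, in dB) exceeds threshold_db."""
    total = 0
    glimpsed = 0
    for target_frame, masker_frame in zip(target_db, masker_db):
        for t, m in zip(target_frame, masker_frame):
            total += 1
            if t - m > threshold_db:
                glimpsed += 1
    return 100.0 * glimpsed / total if total else 0.0


# Toy example: 3 frames x 4 channels of made-up levels in dB (assumption).
target = [[60, 55, 40, 30], [62, 50, 45, 20], [58, 57, 35, 25]]
masker = [[50, 58, 45, 20], [50, 50, 30, 40], [60, 50, 30, 35]]
density = glimpse_density(target, masker)
print(f"glimpse density: {density:.1f}%")
```

Raising or lowering `threshold_db` gives the density for other local SNR criteria.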

Glimpsing
Informal definition: a glimpse is some time-frequency region which contains a reasonably undistorted ‘view’ of local signal properties
Precursors
• Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech
• Related to ‘multiple looks’ model of Viemeister & Wakefield (1991), which demonstrated ‘intelligent’ temporal integration of tone bursts
• Assmann & Summerfield (in press) suggest ‘glimpsing & tracking’ as a way of understanding how listeners cope with adverse conditions
• Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s
• de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification
• Close relation to missing data processing (Cooke et al, 1994)

Types of glimpses
• Comodulated, e.g. Miller & Licklider (1950)
• Spectral, e.g. Warren et al (1995)
• General uncomodulated, e.g. Howard-Jones & Rosen (1993), Buss et al (2003)

Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Intelligibility was 60% when 98% of the signal was missing.

Glimpsing in natural conditions: the dominance effect
Although audio signals combine additively, the occlusion metaphor is more appropriate because of log-like compression in the auditory system. Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
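A small worked example may help show why log-like compression makes occlusion a reasonable metaphor. The sketch assumes the two sources are uncorrelated, so that their energies add within a time-frequency region. This is an approximation, and the function name `mixture_excess_db` is hypothetical. It computes how far the mixture level lies above the louder source for a given level difference.

```python
import math


def mixture_excess_db(level_diff_db):
    """Amount (dB) by which the summed energy of two uncorrelated sources
    exceeds the louder one, given their level difference in dB."""
    return 10.0 * math.log10(1.0 + 10.0 ** (-abs(level_diff_db) / 10.0))


for diff in (0, 3, 6, 10, 20):
    excess = mixture_excess_db(diff)
    print(f"{diff:>2} dB apart: mixture is {excess:.2f} dB above the louder source")
```

Under this assumption the excess is about 3 dB when the sources are equal in level and shrinks quickly as one source dominates, so on a dB scale the mixture is close to the maximum of the two sources.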

Issues for a glimpsing model
• What constitutes a useful glimpse?
• Is sufficient information contained in glimpses?
• Glimpse detection: how do listeners detect glimpses?
• Glimpse integration: how can they be integrated?

Glimpsing study
Aims
• Determine if glimpses contain sufficient information
• Explore definition of useful glimpse
Comparison between listeners and model using natural VCV stimuli
• Subset of Shannon et al (1999) corpus: V = /a/, C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }
• Background source: reversed multispeaker babble for N = 1, 8 speakers (allows variation in glimpsing opportunities)
• 3 SNRs (TMRs): 0, -6 and -12 dB
• 12 listeners heard 160 tokens in each condition (2 repeats × 16 VCVs × 5 male speakers)
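Mixing a target with a background at a given SNR (TMR) could be done along the following lines. The slides do not say how levels were measured, so this sketch assumes mean power over the whole token. The helper names are hypothetical.

```python
import math


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def masker_gain_for_snr(target, masker, snr_db):
    """Gain to apply to the masker so that the target-to-masker
    power ratio equals snr_db (assumption: whole-token mean power)."""
    ratio = 10.0 ** (snr_db / 10.0)
    return math.sqrt(mean_power(target) / (mean_power(masker) * ratio))


def mix_at_snr(target, masker, snr_db):
    """Add the rescaled masker to the target, sample by sample."""
    gain = masker_gain_for_snr(target, masker, snr_db)
    return [t + gain * m for t, m in zip(target, masker)]


# Toy signals (made up); the three SNRs are those used in the study.
target = [1.0, -1.0, 1.0, -1.0]
masker = [0.5, 0.5, -0.5, -0.5]
mixtures = {snr: mix_at_snr(target, masker, snr) for snr in (0, -6, -12)}
```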

Identification results
[Figure: identification results; panels: 1-speaker and 8-speaker backgrounds]

Glimpsing model
• CDHMM employing missing data techniques
  • 16 whole-word HMMs
  • 8 states
  • 4 component Gaussian mixture per state
• Input representation
  • 10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
  • NB: only simultaneous masking is modelled
• Training
  • 8 repetitions of each VCV by 5 male speakers per model
• Testing
  • As for listeners, viz. 2 repetitions of each VCV by 5 male speakers
• Performance in clean: > 99%
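The slide says only that the recogniser uses "missing data techniques". One common form is marginalisation, in which a state's Gaussian mixture is evaluated over the reliable (glimpsed) channels alone. The sketch below assumes plain marginalisation with diagonal covariances. The actual system may use a different variant, and all names here are hypothetical.

```python
import math


def log_gauss_1d(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def marginal_log_likelihood(frame, reliable, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    summing over reliable channels only; unreliable channels are
    integrated out, which for a diagonal Gaussian means skipping them."""
    component_scores = []
    for w, mu, var in zip(weights, means, variances):
        score = math.log(w)
        for x, ok, m, v in zip(frame, reliable, mu, var):
            if ok:
                score += log_gauss_1d(x, m, v)
        component_scores.append(score)
    # Log-sum-exp over mixture components for numerical stability.
    peak = max(component_scores)
    return peak + math.log(sum(math.exp(s - peak) for s in component_scores))
```

With every channel marked unreliable the score reduces to the log of the summed mixture weights (zero for a normalised mixture), so such a frame carries no evidence for or against any state.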

Model performance I: ideal glimpses
Ideal glimpses
• All time-frequency regions whose local SNR exceeds a threshold
• Optimum threshold = 0 dB
• For this task, there is more than sufficient information in the glimpsed regions
• Listeners perform suboptimally with respect to this glimpse definition
[Figure panels: 1-speaker, 8-speaker]

Model performance: variation in detection threshold
Q: Can varying the local SNR threshold for glimpse detection produce a better match?
• No choice of local SNR threshold provides a good fit to listeners
• Closest fit shown (-6 dB)
[Figure panels: 1-speaker, 8-speaker]

Analysis
• It is unreasonable to expect listeners to detect individual glimpses in a sea of noise unless the glimpse region is large enough

Model performance: useable glimpses
• Definition: glimpsed region must occupy at least N ERBs and T ms
• Search over 1-15 ERBs, 10-100 ms, at various detection thresholds
• Best match at
  • 6.3 ERBs (9 channels)
  • 40 ms
  • 0 dB local SNR threshold
• Howard-Jones & Rosen (1993) suggested 2-4 bands limit for uncomodulated glimpsing
• Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands
[Figure panels: 1-speaker, 8-speaker]
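The useable-glimpse criterion above might be sketched as a filter on a binary glimpse mask. The slides do not say how a region's extent was measured. This sketch assumes 4-connected regions and uses the bounding box of each region, in channels and frames. The thresholds in the example come from the slide (9 channels, and 40 ms, which is 4 frames at the 10 ms frame rate). The function name and the toy mask are hypothetical.

```python
def usable_glimpse_mask(mask, min_channels, min_frames):
    """Keep only 4-connected regions of `mask` (a list of frames, each a
    list of booleans per channel) whose bounding box spans at least
    min_channels channels and min_frames frames."""
    n_frames = len(mask)
    n_channels = len(mask[0]) if mask else 0
    seen = [[False] * n_channels for _ in range(n_frames)]
    kept = [[False] * n_channels for _ in range(n_frames)]
    for t0 in range(n_frames):
        for c0 in range(n_channels):
            if not mask[t0][c0] or seen[t0][c0]:
                continue
            # Flood-fill one connected region starting at (t0, c0).
            region, stack = [], [(t0, c0)]
            seen[t0][c0] = True
            while stack:
                t, c = stack.pop()
                region.append((t, c))
                for dt, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nt, nc = t + dt, c + dc
                    inside = 0 <= nt < n_frames and 0 <= nc < n_channels
                    if inside and mask[nt][nc] and not seen[nt][nc]:
                        seen[nt][nc] = True
                        stack.append((nt, nc))
            frames = [t for t, _ in region]
            channels = [c for _, c in region]
            wide_enough = max(channels) - min(channels) + 1 >= min_channels
            long_enough = max(frames) - min(frames) + 1 >= min_frames
            if wide_enough and long_enough:
                for t, c in region:
                    kept[t][c] = True
    return kept


# Toy mask: a 4-frame x 9-channel block plus one isolated cell (made up).
mask = [[False] * 10 for _ in range(6)]
for t in range(4):
    for c in range(9):
        mask[t][c] = True
mask[5][9] = True
kept = usable_glimpse_mask(mask, min_channels=9, min_frames=4)
```

In the toy mask the block meets both thresholds and survives, while the isolated cell is discarded.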

Consonant identification
• Reasonable matches overall apart from b, s & z
• However, little token-by-token agreement between common listener errors and model errors
• Why?

Factors
[Diagram labels]
• ‘Confusability’
• Audibility of target
• Informational masking
• Energetic masking
• Existence of schemas for target
• Successful identification
• Organisational cues in target
• Existence of schemas for background
• Organisational cues in background

Measuring energetic masking
Approach: resynthesise glimpses alone
• Filter, time-reverse, refilter to remove phase distortion
• Select regions based on local SNR mask
Results
• Little difference for 1-speaker background, suggesting relatively low contribution of informational masking in this case (due to reversed masker?)
• Larger difference for 8-speaker case, possibly due to ‘unrealistic’ glimpses
[Figure panels: 1-speaker, 8-speaker; conditions: glimpses alone, speech+noise]
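The "filter, time-reverse, refilter" step is the standard forward-backward trick for cancelling a filter's phase response. The sketch below illustrates it with a simple one-pole low-pass filter standing in for the auditory filters actually used. That substitution is an assumption, and the function names are hypothetical.

```python
def one_pole_lowpass(signal, a=0.5):
    """Causal one-pole low-pass: y[n] = (1 - a) * x[n] + a * y[n - 1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = (1.0 - a) * x + a * prev
        out.append(prev)
    return out


def zero_phase_filter(signal, a=0.5):
    """Filter, time-reverse, refilter, then restore the time order.
    The second pass cancels the phase shift introduced by the first."""
    forward = one_pole_lowpass(signal, a)
    backward = one_pole_lowpass(forward[::-1], a)
    return backward[::-1]


# An impulse in the middle of a silent signal shows the effect.
impulse = [0.0] * 201
impulse[100] = 1.0
single_pass = one_pole_lowpass(impulse)   # response smeared to the right only
double_pass = zero_phase_filter(impulse)  # response symmetric about the impulse
```

The single pass delays energy after the impulse, whereas the two-pass response is symmetric around it, so features of the resynthesised glimpses stay aligned in time across channels.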

Comparison with ideal model
Results
• Ideal model performs well in excess of listeners when supplied with precisely the same information
Possible reasons:
• Distortions
• Glimpses do not occur in isolation: possibility that a noise background will help
• Lack of nonsimultaneous masking model will inflate model performance
[Figure curves: Ideal (model), Ideal? (listeners)]

The glimpse decoder
• Attempt at a unifying statistical theory for primitive and model-driven processes in CASA
• Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use
• Key advantage: no longer need to rely on clean acoustics!
• Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation
• Barker, J., Cooke, M.P. & Ellis, D.P.W., “Decoding speech in the presence of other sources”, accepted for Speech Communication

Summary & outlook
• Proposed a glimpsing model of speech identification in noise
• Demonstrated sufficiency of information in target glimpses, at least for VCV task
• Preliminary definition of useful glimpse gives good overall model-listener match
• Introduced 2 procedures for measuring the amount of energetic masking: (i) via ASR, (ii) via glimpse resynthesis
• Need nonsimultaneous masking model
• Need to isolate effects due to schemas
• Repeat using non-reversed speech to introduce more informational masking
• Need to quantify effect of distortion in glimpse resynthesis
• …

Masking noise can be beneficial
• Warren et al (1995) demonstrated a spectral induction effect with 2 narrow bands of speech with intervening noise
• Cooke & Cunningham (in prep): spectral induction with single speech-bands
[Figure label: fullband]

Speech modulated noise
• As in Brungart (2001)
• Model results and glimpse distributions indicate increase in energetic masking for this type of masker
[Figure: natural speech vs speech modulated noise (SMN); series: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr]

Speech modulated noise
• Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with natural speech masker
• Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work needed to quantify relative contribution of
  • Release from IM
  • Absence of background models/cues
[Figure panels: 1-speaker, 8-speaker; curves: NAT (model), NAT (listeners), SMN (listeners), SMN (model)]
