Speech Segregation
DeLiang Wang
Perception & Neurodynamics Lab, Ohio State University
http://www.cse.ohio-state.edu/pnl/
Outline of presentation • Introduction: Speech segregation problem • Auditory scene analysis (ASA) • Speech enhancement • Speech segregation by computational auditory scene analysis (CASA) • Segregation as binary classification • Concluding remarks
Real-world audition
What?
• Speech message: speaker age, gender, linguistic origin, mood, …
• Music
• Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise
Sources of intrusion and distortion
• Additive noise from other sound sources
• Channel distortion
• Reverberation from surface reflections
Cocktail party problem • Term coined by Cherry • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry, 1957) • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp, 1992) • Ball-room problem by Helmholtz • “Complicated beyond conception” (Helmholtz, 1863) • Speech segregation problem
Listener performance
Speech reception threshold (SRT)
• The speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility, dependent upon the speech materials (Miller et al., 1951)
Source: Steeneken (1992)
Effects of competing source
[Figure: speech reception thresholds for different competing sources; the SRT difference across maskers is as large as 23 dB. Source: Wang and Brown (2006)]
Some applications of speech segregation • Robust automatic speech and speaker recognition • Processor for hearing prosthesis • Hearing aids • Cochlear implants • Audio information retrieval
Approaches to speech segregation
Monaural approaches (focus of this tutorial)
• Speech enhancement
• CASA
Microphone-array approaches
• Spatial filtering (beamforming): extract target sound from a specific spatial direction with a sensor array
  • Limitation: configuration stationarity. What if the target switches or changes location?
• Independent component analysis: find a demixing matrix from mixtures of sound sources
  • Limitation: strong assumptions, chief among them stationarity of the mixing matrix
Part II: Auditory scene analysis
• Human auditory system
• How does the human auditory system organize sound?
• Auditory scene analysis account
Auditory periphery A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers
Beyond the periphery
• The auditory system is complex, with four relay stations between the periphery and the cortex rather than one as in the visual system
• In comparison to the auditory periphery, central parts of the auditory system are less understood
• The number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex, despite the fact that the number of fibers in the auditory nerve is far fewer than that of the optic nerve (thousands vs. millions)
[Figures: the auditory system (Source: Arbib, 1989); the auditory nerve]
Auditory scene analysis
• Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source – a stream – in the perceptual process of auditory scene analysis (Bregman, 1990)
• From acoustic events to perceptual streams
• Two conceptual processes of ASA:
  • Segmentation: decompose the acoustic mixture into sensory elements (segments)
  • Grouping: combine segments into streams, so that segments in the same stream originate from the same source
Simultaneous organization
Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
• Proximity in frequency (spectral proximity)
• Common periodicity
  • Harmonicity
  • Temporal fine structure
• Common spatial location
• Common onset (and to a lesser degree, common offset)
• Common temporal modulation
  • Amplitude modulation (AM)
  • Frequency modulation (FM)
• Demo
Sequential organization
Sequential organization groups sound components across time. ASA cues for sequential organization:
• Proximity in time and frequency
• Temporal and spectral continuity
• Common spatial location; more generally, spatial continuity
• Smooth pitch contour
• Smooth formant transition?
• Rhythmic structure
Demo: streaming in African xylophone music
• Notes in a pentatonic scale
Primitive versus schema-based organization • Primitive grouping. Innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception – feature-based or bottom-up • It is domain-general, and exploits intrinsic structure of environmental sound • Grouping cues described earlier are primitive in nature • Schema-driven grouping. Learned knowledge about speech, music and other environmental sounds – model-based or top-down • It is domain-specific, e.g. organization of speech sounds into syllables
Organization in speech: Spectrogram
[Figure: spectrogram of the utterance "… pure pleasure …", annotated with continuity, onset synchrony, offset synchrony, and harmonicity cues]
Interim summary of ASA • Auditory peripheral processing amounts to a decomposition of the acoustic signal • ASA cues essentially reflect structural coherence of natural sound sources • A subset of cues believed to be strongly involved in ASA • Simultaneous organization: Periodicity, temporal modulation, onset • Sequential organization: Location, pitch contour and other source characteristics (e.g. vocal tract)
Part III. Speech enhancement • Speech enhancement aims to remove or reduce background noise • Improve signal-to-noise ratio (SNR) • Assumes stationary noise or at least that noise is more stationary than speech • A tradeoff between speech distortion and noise distortion (residual noise) • Types of speech enhancement algorithms • Spectral subtraction • Wiener filtering • Minimum mean square error (MMSE) estimation • Subspace algorithms • Material in this part is mainly based on Loizou (2007)
Spectral subtraction
• It is based on a simple principle: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum
• The noise spectrum can be estimated (and updated) during periods when the speech signal is absent or when only noise is present
• It requires voice activity detection or speech pause detection
Basic principle
In the signal domain: y(n) = x(n) + d(n)
x: speech signal; d: noise; y: noisy speech
In the DFT domain: Y(ω) = X(ω) + D(ω)
Hence we have the estimated signal magnitude spectrum: |X̂(ω)| = |Y(ω)| − |D̂(ω)|, where D̂ is the estimated noise spectrum
Negative magnitude estimates can occur due to noise estimation errors; to ensure nonnegative magnitudes, half-wave rectification is applied
Basic principle (cont.)
Assuming that speech and noise are uncorrelated, we have the estimated signal power spectrum: |X̂(ω)|² = |Y(ω)|² − |D̂(ω)|²
In general: |X̂(ω)|^p = |Y(ω)|^p − |D̂(ω)|^p, where p = 1 gives magnitude subtraction and p = 2 gives power subtraction
Again, half-wave rectification needs to be applied
Flow diagram
Noisy speech → FFT → subtraction of the estimated noise spectrum (noise estimation/update during speech pauses) → combination with the phase of the noisy speech → IFFT → enhanced speech
Musical noise
Isolated peaks remaining in the subtracted spectrum cause musical noise
Over-subtraction to reduce musical noise
By over-subtracting the noise spectrum, we can reduce the amplitude of isolated peaks and in some cases eliminate them altogether. This by itself, however, is not sufficient, because the deep valleys surrounding the peaks still remain in the spectrum. For that reason, spectral flooring is used to "fill in" the spectral valleys:
|X̂(ω)|² = |Y(ω)|² − α|D̂(ω)|² if |Y(ω)|² − α|D̂(ω)|² > β|D̂(ω)|², and β|D̂(ω)|² otherwise
α is the over-subtraction factor (α > 1), and β is the spectral floor parameter (β < 1)
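As a concrete illustration of the over-subtraction rule above, here is a minimal NumPy sketch for a single analysis frame. The function name and the default α and β values are illustrative, not taken from the original slides.

```python
import numpy as np

def oversubtract_frame(noisy_power, noise_power, alpha=3.0, beta=0.02):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    noisy_power: |Y(w)|^2 per frequency bin
    noise_power: estimated |D(w)|^2 per frequency bin
    alpha: over-subtraction factor (> 1); beta: spectral floor parameter (< 1)
    """
    diff = noisy_power - alpha * noise_power
    floor = beta * noise_power
    # Keep the subtracted value where it stays above the floor; otherwise
    # "fill in" the valley with the floor value.
    return np.where(diff > floor, diff, floor)
```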
Effects of parameters: Sound demo
• Half-wave rectification: α = 1, β = 0
• α = 3, β = 0
• α = 8, β = 0
• α = 8, β = 0.1
• α = 8, β = 1
• α = 15, β = 0
• Noisy sentence (+5 dB SNR)
• Original (clean) sentence
Wiener filter
• Aim: to find the optimal filter that minimizes the mean square error between the desired signal (clean signal) and the estimated output
• Input to this filter: noisy speech
• Output of this filter: enhanced speech
Wiener filter in frequency domain
Wiener filter for noise reduction: H(ω) denotes the filter
Minimizing the mean square error between the filtered noisy speech and the clean speech leads to, for frequency ωk:
H(ωk) = Pxx(ωk) / (Pxx(ωk) + Pdd(ωk))
Pxx(ωk): power spectrum of x(n)
Pdd(ωk): power spectrum of d(n)
Wiener filter in terms of a priori SNR
Define the a priori SNR at frequency ωk: ξk = Pxx(ωk) / Pdd(ωk)
The Wiener filter becomes: H(ωk) = ξk / (ξk + 1)
More attenuation at lower SNR and less attenuation at higher SNR
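A minimal sketch of the two equivalent gain expressions above (helper names are hypothetical; NumPy assumed):

```python
import numpy as np

def wiener_gain(p_xx, p_dd):
    """Wiener gain H(w_k) = Pxx / (Pxx + Pdd) per frequency bin."""
    return p_xx / (p_xx + p_dd)

def wiener_gain_from_snr(xi):
    """Equivalent form in terms of the a priori SNR: H = xi / (xi + 1)."""
    xi = np.asarray(xi, dtype=float)
    return xi / (xi + 1.0)
```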
Iterative Wiener filtering
The optimal Wiener filter depends on the clean speech power spectrum, which is not available. In practice, we can estimate the Wiener filter iteratively.
Consider the following procedure at iteration i to estimate H(ω):
Step 1: Obtain an estimate of the Wiener filter Hi(ω) based on the enhanced signal obtained at the previous iteration
• Initialize with the noisy speech signal
Step 2: Filter the noisy signal through the newly obtained Wiener filter according to X̂i(ω) = Hi(ω)·Y(ω) to get the new enhanced signal
Repeat the above procedure
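The iterative procedure could be sketched as follows for a single frame. The crude speech-power estimate (subtracting the noise PSD from the previous iterate's power) is an assumption made for illustration, not necessarily the estimator used in practice.

```python
import numpy as np

def iterative_wiener_frame(noisy_spec, noise_psd, n_iter=3, eps=1e-12):
    """Iteratively refine the Wiener filter for one frame's DFT.

    noisy_spec: complex DFT Y(w) of the noisy frame
    noise_psd:  estimated noise power spectrum Pdd(w)
    """
    enhanced = noisy_spec.copy()
    for _ in range(n_iter):
        # Step 1: estimate the speech power spectrum and the Wiener filter
        p_xx = np.maximum(np.abs(enhanced) ** 2 - noise_psd, eps)
        h = p_xx / (p_xx + noise_psd)
        # Step 2: filter the *noisy* spectrum with the newly obtained filter
        enhanced = h * noisy_spec
    return enhanced
```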
MMSE estimator
• The Wiener filter is the optimal (in the mean square error sense) complex spectrum estimator, not the optimal magnitude spectrum estimator
• Ephraim and Malah (1984) proposed an MMSE estimator that is the optimal magnitude spectrum estimator
• Unlike the Wiener estimator, the MMSE estimator does not require a linear model between the observed data and the estimator, but makes assumptions about the probability distributions of the speech and noise DFT coefficients:
  • Fourier transform coefficients (real and imaginary parts) have a Gaussian probability distribution; the mean of the coefficients is zero, and the variances are time-varying due to the nonstationarity of speech
  • Fourier transform coefficients are statistically independent, and hence uncorrelated
MMSE estimator (cont.)
In the frequency domain: Y(ωk) = X(ωk) + D(ωk), or, in polar form, Yk·e^{jθy(k)} = Xk·e^{jθx(k)} + Dk·e^{jθd(k)}
The MMSE derivation leads to the amplitude estimate
X̂k = (√π/2)·(√νk/γk)·exp(−νk/2)·[(1 + νk)·I0(νk/2) + νk·I1(νk/2)]·Yk
In(·) is the modified Bessel function of order n
νk = [ξk/(1 + ξk)]·γk, where ξk is the a priori SNR
a posteriori SNR: γk = Yk²/λd(k), with λd(k) the noise variance at frequency ωk
MMSE gain function
[Figure: MMSE spectral gain function plotted against the a priori and a posteriori SNRs]
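Assuming the Ephraim-Malah gain expression reconstructed on the previous slide, a minimal NumPy/SciPy sketch of the gain function might look like this. The function name is hypothetical; the exponentially scaled Bessel functions i0e/i1e are used for numerical stability at high SNR.

```python
import numpy as np
from scipy.special import i0e, i1e

def mmse_stsa_gain(xi, gamma):
    """MMSE short-time spectral amplitude gain (after Ephraim & Malah, 1984).

    xi:    a priori SNR per frequency bin
    gamma: a posteriori SNR per frequency bin
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    nu = xi / (1.0 + xi) * gamma
    # i0e(x) = exp(-x) * I0(x), so exp(-nu/2) * I0(nu/2) == i0e(nu/2);
    # the same holds for i1e, which avoids overflow for large nu.
    bessel = (1.0 + nu) * i0e(nu / 2.0) + nu * i1e(nu / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(nu) / gamma) * bessel

# Usage: enhanced_magnitude = mmse_stsa_gain(xi, gamma) * noisy_magnitude
```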
Estimating the a priori SNR
The suppression curves suggest that the a posteriori SNR has a small effect and the a priori SNR is the main factor influencing suppression
The a priori SNR can be estimated recursively (frame-wise) using the so-called "decision-directed" approach at frame m:
ξ̂k(m) = a·X̂k²(m − 1)/λd(k) + (1 − a)·max[γk(m) − 1, 0]
0 < a < 1, and a = 0.98 is found to work well
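A sketch of the decision-directed update for one frame (names are hypothetical; the xi_min floor is a common practical addition and not part of the formula above):

```python
import numpy as np

def decision_directed_xi(prev_enhanced_mag, noise_psd, gamma, a=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate for the current frame m.

    prev_enhanced_mag: |X_hat(m-1)| from the previous frame
    noise_psd:         estimated noise variance per frequency bin
    gamma:             a posteriori SNR of the current frame
    """
    xi = a * (prev_enhanced_mag ** 2) / noise_psd \
         + (1.0 - a) * np.maximum(gamma - 1.0, 0.0)
    return np.maximum(xi, xi_min)  # floor the estimate (practical safeguard)
```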
Other remarks and sound demo
• When the a priori SNR is estimated using the "decision-directed" approach, the enhanced speech has no "musical noise"
• A log-MMSE estimator also exists, which might be perceptually more meaningful
Sound demo:
• Noisy sentence (5 dB SNR)
• MMSE estimator
• Log-MMSE estimator
Subspace-based algorithms This class of algorithms is based on singular value decomposition (SVD) or eigenvalue decomposition of either data matrices or covariance matrices The basic idea behind the SVD approach is that the singular vectors corresponding to the largest singular values contain speech information, while the remaining singular vectors contain noise information Noise reduction is therefore accomplished by discarding the singular vectors corresponding to the smallest singular values
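A minimal sketch of the rank-reduction idea described above; the construction of the data matrix (typically a Hankel/trajectory matrix built from a frame of the noisy signal) is assumed and not shown.

```python
import numpy as np

def svd_denoise(data_matrix, keep):
    """Keep the `keep` largest singular values (assumed speech subspace) and
    discard the rest (assumed noise subspace), then reconstruct the matrix."""
    u, s, vt = np.linalg.svd(data_matrix, full_matrices=False)
    s[keep:] = 0.0                 # discard the smallest singular values
    return (u * s) @ vt            # low-rank (denoised) reconstruction
```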
Subjective evaluations
• In terms of speech quality, a subset of algorithms improves overall quality in a few conditions relative to the unprocessed condition; no algorithm produces improvement in multitalker babble
• In terms of intelligibility, no algorithm produces significant improvement over unprocessed noisy speech
Interim summary on speech enhancement
• Algorithms are derived analytically
  • Optimization theory
• Noise estimation is key
  • Reliable noise estimation is particularly needed in highly non-stationary environments
• Speech enhancement algorithms cannot deal with multitalker mixtures
• Inability to improve speech intelligibility
Part IV. CASA-based speech segregation
• Fundamentals of CASA for monaural mixtures
• CASA for speech segregation
  • Feature-based algorithms
  • Model-based algorithms
Cochleagram: Auditory spectrogram
Spectrogram
• Plot of log energy across time and frequency (linear frequency scale)
Cochleagram
• Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compression operations (log and cube root)
• Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
• A waveform signal can be constructed (inverted) from a cochleagram
[Figures: spectrogram and cochleagram of the same utterance]
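A rough cochleagram sketch following the description above: a hand-rolled 4th-order gammatone impulse response, half-wave rectification, and cube-root-compressed frame energies. All parameter values (filter order, bandwidth constants, frame length and hop) are illustrative, not prescriptive.

```python
import numpy as np
from scipy.signal import fftconvolve

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Impulse response of a 4th-order gammatone filter centered at fc Hz."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 + 0.108 * fc                        # equivalent rectangular bandwidth at fc
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, center_freqs, frame_len=320, hop=160):
    """Gammatone filtering -> half-wave rectification -> cube-root-compressed
    frame energies (20 ms frames with a 10 ms hop at 16 kHz are illustrative)."""
    n_frames = (len(x) - frame_len) // hop + 1
    cg = np.zeros((len(center_freqs), n_frames))
    for i, fc in enumerate(center_freqs):
        out = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        out = np.maximum(out, 0.0)                 # half-wave rectification (hair-cell stage)
        for j in range(n_frames):
            seg = out[j * hop : j * hop + frame_len]
            cg[i, j] = np.sum(seg ** 2) ** (1.0 / 3.0)
    return cg
```

The center frequencies would typically be spaced quasi-logarithmically (e.g. equally on the ERB scale) between roughly 50 Hz and half the sampling rate.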
Neural autocorrelation for pitch perception
[Figure: Licklider's (1951) neural autocorrelation model]
Correlogram
• Short-term autocorrelation of the output of each frequency channel of the cochleagram
• Peaks in the summary correlogram indicate pitch periods (F0)
• A standard model of pitch perception
[Figure: correlogram and summary correlogram of a vowel with an F0 of 100 Hz]
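A sketch of the correlogram for one time frame, assuming a (channels x samples) array of rectified cochlear filter outputs at the signal sampling rate (variable names are hypothetical):

```python
import numpy as np

def correlogram_frame(channel_responses, start, frame_len=320, max_lag=200):
    """Per-channel short-term autocorrelation plus the summary correlogram.

    channel_responses: (channels x samples) rectified cochlear filter outputs
    start:             first sample of the analysis frame
    A peak at lag T in the summary suggests a pitch period of T samples,
    i.e. F0 = fs / T.
    """
    n_channels = channel_responses.shape[0]
    acf = np.zeros((n_channels, max_lag))
    for c in range(n_channels):
        seg = channel_responses[c, start:start + frame_len]
        n = len(seg)
        for lag in range(max_lag):
            acf[c, lag] = np.dot(seg[:n - lag], seg[lag:n])
    return acf, acf.sum(axis=0)   # per-channel ACF and summary correlogram
```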
Onset and offset detection
An onset (offset) corresponds to a sudden intensity increase (decrease), which can be detected by taking the time derivative of the intensity
To reduce intensity fluctuations, Gaussian smoothing (low-pass filtering) is typically applied (as in edge detection for image analysis)
Note that d/dt [s(t) ∗ G(t)] = s(t) ∗ G′(t), where s(t) denotes intensity and G(t) is a Gaussian smoothing kernel with standard deviation σ
Onset and offset detection (cont.)
Hence onset and offset detection is a three-step procedure:
1. Convolve the intensity s(t) with G′ to obtain O(t)
2. Identify the peaks and the valleys of O(t)
3. Onsets are the peaks above a certain threshold, and offsets are the valleys below a certain threshold
[Figure: detected onsets and offsets]
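A sketch of this three-step procedure using SciPy's Gaussian derivative filter; sigma and the peak threshold are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def detect_onsets_offsets(intensity, sigma=4.0, threshold=0.05):
    """Smooth-and-differentiate onset/offset detection on an intensity envelope."""
    # Step 1: convolve s(t) with the derivative of a Gaussian, i.e. s * G'
    d = gaussian_filter1d(intensity, sigma=sigma, order=1)
    onsets, offsets = [], []
    # Steps 2-3: find peaks/valleys of O(t) and apply the thresholds
    for t in range(1, len(d) - 1):
        if d[t] > d[t - 1] and d[t] > d[t + 1] and d[t] > threshold:
            onsets.append(t)        # local maximum above threshold -> onset
        if d[t] < d[t - 1] and d[t] < d[t + 1] and d[t] < -threshold:
            offsets.append(t)       # local minimum below threshold -> offset
    return onsets, offsets
```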
Segmentation versus grouping
Mirroring Bregman's two-stage conceptual model, a CASA model generally consists of a segmentation stage and a subsequent grouping stage
• The segmentation stage decomposes an acoustic scene into a collection of segments, each of which is a contiguous region in the cochleagram with energy primarily from one source
  • Based on cross-channel correlation, which encodes correlated responses (temporal fine structure) of adjacent filter channels, and temporal continuity
  • Based on onset and offset analysis
• Grouping aggregates segments into streams based on various ASA cues
Cross-channel correlation for segmentation
[Figures: correlogram and cross-channel correlation for a mixture of speech and trill telephone; segments generated based on cross-channel correlation and temporal continuity]
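A sketch of cross-channel correlation computed as the normalized correlation between the autocorrelation responses of adjacent channels (names are hypothetical; acf could come from the correlogram sketch above):

```python
import numpy as np

def cross_channel_correlation(acf):
    """Normalized correlation between adjacent channels' autocorrelations.

    acf: (channels x lags) autocorrelation responses for one frame.
    High values suggest the two channels respond to the same source component.
    """
    n_channels = acf.shape[0]
    corr = np.zeros(n_channels - 1)
    for c in range(n_channels - 1):
        a = acf[c] - acf[c].mean()
        b = acf[c + 1] - acf[c + 1].mean()
        denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + 1e-12
        corr[c] = np.sum(a * b) / denom     # correlation in [-1, 1]
    return corr
```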
Ideal binary mask
• A main CASA goal is to retain the parts of a mixture where the target sound is stronger than the acoustic background (i.e. to mask interference by the target), and discard the other parts (Hu & Wang, 2001; 2004)
  • What the target is depends on intention, attention, etc.
• In other words, the goal is to identify the ideal binary mask (IBM), which is 1 for a time-frequency (T-F) unit if the SNR within the unit exceeds a threshold, and 0 otherwise
• It does not actually separate the mixture!
• More discussion on the IBM in Part V
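A sketch of IBM construction when the premixed target and interference energies are available (a 0 dB local criterion is shown; the threshold is a free parameter, and the names are hypothetical):

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Ideal binary mask: 1 where the local (per T-F unit) SNR exceeds the
    local criterion lc_db, 0 otherwise.

    target_energy, noise_energy: T-F energies of the premixed target and
    interference (e.g. per-unit cochleagram energies before compression).
    """
    local_snr = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr > lc_db).astype(np.int8)
```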