Speech Segregation
DeLiang Wang
Perception & Neurodynamics Lab, Ohio State University
http://www.cse.ohio-state.edu/pnl/
Outline of presentation • Introduction: Speech segregation problem • Auditory scene analysis (ASA) • Speech enhancement • Speech segregation by computational auditory scene analysis (CASA) • Segregation as binary classification • Concluding remarks
Real-world audition
What?
• Speech message: speaker age, gender, linguistic origin, mood, …
• Music
• Car passing by
Where?
• Left, right, up, down
• How close?
Channel characteristics
Environment characteristics
• Room reverberation
• Ambient noise
Sources of intrusion and distortion
• Additive noise from other sound sources
• Channel distortion
• Reverberation from surface reflections
Cocktail party problem • Term coined by Cherry • “One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it ‘the cocktail party problem’…” (Cherry, 1957) • “For ‘cocktail party’-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers” (Bronkhorst & Plomp, 1992) • Ball-room problem by Helmholtz • “Complicated beyond conception” (Helmholtz, 1863) • Speech segregation problem
Listener performance
Speech reception threshold (SRT)
• The speech-to-noise ratio needed for 50% intelligibility
• Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility, dependent upon the speech materials (Miller et al., 1951)
Source: Steeneken (1992)
Effects of competing source
[Figure: speech reception thresholds for different competing sources; the SRT difference across maskers is as large as 23 dB. Source: Wang and Brown (2006)]
Some applications of speech segregation • Robust automatic speech and speaker recognition • Processor for hearing prosthesis • Hearing aids • Cochlear implants • Audio information retrieval
Approaches to speech segregation
Monaural approaches (focus of this tutorial)
• Speech enhancement
• CASA
Microphone-array approaches
• Spatial filtering (beamforming): extract target sound from a specific spatial direction with a sensor array
  • Limitation: configuration stationarity. What if the target switches or changes location?
• Independent component analysis: find a demixing matrix from mixtures of sound sources
  • Limitation: strong assumptions, chief among them stationarity of the mixing matrix
Part II: Auditory scene analysis
• Human auditory system
• How does the human auditory system organize sound?
• Auditory scene analysis account
Auditory periphery A complex mechanism for transducing pressure variations in the air to neural impulses in auditory nerve fibers
Beyond the periphery
• The auditory system is complex, with four relay stations between the periphery and the cortex rather than one as in the visual system
• In comparison to the auditory periphery, central parts of the auditory system are less understood
• The number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex, despite the fact that the number of fibers in the auditory nerve is far fewer than that of the optic nerve (thousands vs. millions)
[Figures: the auditory system (Source: Arbib, 1989); the auditory nerve]
Auditory scene analysis
• Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source – a stream – in the perceptual process of auditory scene analysis (Bregman, 1990)
• From acoustic events to perceptual streams
• Two conceptual processes of ASA:
  • Segmentation: decompose the acoustic mixture into sensory elements (segments)
  • Grouping: combine segments into streams, so that segments in the same stream originate from the same source
Simultaneous organization
Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
• Proximity in frequency (spectral proximity)
• Common periodicity
  • Harmonicity
  • Temporal fine structure
• Common spatial location
• Common onset (and to a lesser degree, common offset)
• Common temporal modulation
  • Amplitude modulation (AM)
  • Frequency modulation (FM)
• Demo
Sequential organization
Sequential organization groups sound components across time. ASA cues for sequential organization:
• Proximity in time and frequency
• Temporal and spectral continuity
• Common spatial location; more generally, spatial continuity
• Smooth pitch contour
• Smooth formant transition?
• Rhythmic structure
Demo: streaming in African xylophone music
• Notes in a pentatonic scale
Primitive versus schema-based organization • Primitive grouping. Innate data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception – feature-based or bottom-up • It is domain-general, and exploits intrinsic structure of environmental sound • Grouping cues described earlier are primitive in nature • Schema-driven grouping. Learned knowledge about speech, music and other environmental sounds – model-based or top-down • It is domain-specific, e.g. organization of speech sounds into syllables
Organization in speech: Spectrogram
[Figure: spectrogram of the utterance "… pure pleasure …", annotated with continuity, onset synchrony, offset synchrony, and harmonicity cues]
Interim summary of ASA • Auditory peripheral processing amounts to a decomposition of the acoustic signal • ASA cues essentially reflect structural coherence of natural sound sources • A subset of cues believed to be strongly involved in ASA • Simultaneous organization: Periodicity, temporal modulation, onset • Sequential organization: Location, pitch contour and other source characteristics (e.g. vocal tract)
Part III. Speech enhancement • Speech enhancement aims to remove or reduce background noise • Improve signal-to-noise ratio (SNR) • Assumes stationary noise or at least that noise is more stationary than speech • A tradeoff between speech distortion and noise distortion (residual noise) • Types of speech enhancement algorithms • Spectral subtraction • Wiener filtering • Minimum mean square error (MMSE) estimation • Subspace algorithms • Material in this part is mainly based on Loizou (2007)
Spectral subtraction
• It is based on a simple principle: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum
• The noise spectrum can be estimated (and updated) during periods when the speech signal is absent or when only noise is present
• It requires voice activity detection or speech pause detection
Basic principle
In the signal domain: y(n) = x(n) + d(n)
x: speech signal; d: noise; y: noisy speech
In the DFT domain: Y(ω) = X(ω) + D(ω)
Hence we have the estimated signal magnitude spectrum: |X̂(ω)| = |Y(ω)| − |D̂(ω)|, where D̂ is the estimated noise spectrum
Negative magnitude estimates can occur due to noise estimation errors; to ensure nonnegative magnitudes, half-wave rectification is applied
Basic principle (cont.)
Assuming that speech and noise are uncorrelated, we have the estimated signal power spectrum: |X̂(ω)|² = |Y(ω)|² − |D̂(ω)|²
In general: |X̂(ω)|^p = |Y(ω)|^p − |D̂(ω)|^p, where p = 1 gives magnitude subtraction and p = 2 gives power subtraction
Again, half-wave rectification needs to be applied
Flow diagram
Noisy speech → FFT → subtraction of the estimated noise spectrum (noise estimation/update during speech pauses) → combination with the phase of the noisy speech → IFFT → enhanced speech
Musical noise
Isolated peaks remaining in the subtracted spectrum cause musical noise
Over-subtraction to reduce musical noise
By over-subtracting the noise spectrum, we can reduce the amplitude of isolated peaks and in some cases eliminate them altogether. This by itself, however, is not sufficient, because the deep valleys surrounding the peaks still remain in the spectrum. For that reason, spectral flooring is used to "fill in" the spectral valleys:
|X̂(ω)|² = |Y(ω)|² − α|D̂(ω)|² if |Y(ω)|² − α|D̂(ω)|² > β|D̂(ω)|², and β|D̂(ω)|² otherwise
α is the over-subtraction factor (α > 1), and β is the spectral floor parameter (β < 1)
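As a concrete illustration of the over-subtraction rule above, here is a minimal NumPy sketch for a single analysis frame. The function name and the default α and β values are illustrative, not taken from the original slides.

```python
import numpy as np

def oversubtract_frame(noisy_power, noise_power, alpha=3.0, beta=0.02):
    """Power spectral subtraction with over-subtraction and a spectral floor.

    noisy_power: |Y(w)|^2 per frequency bin
    noise_power: estimated |D(w)|^2 per frequency bin
    alpha: over-subtraction factor (> 1); beta: spectral floor parameter (< 1)
    """
    diff = noisy_power - alpha * noise_power
    floor = beta * noise_power
    # Keep the subtracted value where it stays above the floor; otherwise
    # "fill in" the valley with the floor value.
    return np.where(diff > floor, diff, floor)
```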
Effects of parameters: Sound demo
• Half-wave rectification: α = 1, β = 0
• α = 3, β = 0
• α = 8, β = 0
• α = 8, β = 0.1
• α = 8, β = 1
• α = 15, β = 0
• Noisy sentence (+5 dB SNR)
• Original (clean) sentence
Wiener filter
• Aim: to find the optimal filter that minimizes the mean square error between the desired signal (clean signal) and the estimated output
• Input to this filter: noisy speech
• Output of this filter: enhanced speech
Wiener filter in frequency domain
Wiener filter for noise reduction: H(ω) denotes the filter
Minimizing the mean square error between the filtered noisy speech and the clean speech leads to, for frequency ωk:
H(ωk) = Pxx(ωk) / (Pxx(ωk) + Pdd(ωk))
Pxx(ωk): power spectrum of x(n)
Pdd(ωk): power spectrum of d(n)
Wiener filter in terms of a priori SNR
Define the a priori SNR at frequency ωk: ξk = Pxx(ωk) / Pdd(ωk)
The Wiener filter becomes: H(ωk) = ξk / (ξk + 1)
More attenuation at lower SNR and less attenuation at higher SNR
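A minimal sketch of the two equivalent gain expressions above (helper names are hypothetical; NumPy assumed):

```python
import numpy as np

def wiener_gain(p_xx, p_dd):
    """Wiener gain H(w_k) = Pxx / (Pxx + Pdd) per frequency bin."""
    return p_xx / (p_xx + p_dd)

def wiener_gain_from_snr(xi):
    """Equivalent form in terms of the a priori SNR: H = xi / (xi + 1)."""
    xi = np.asarray(xi, dtype=float)
    return xi / (xi + 1.0)
```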
Iterative Wiener filtering
The optimal Wiener filter depends on the clean speech power spectrum, which is not available. In practice, we can estimate the Wiener filter iteratively.
Consider the following procedure at iteration i to estimate H(ω):
Step 1: Obtain an estimate of the Wiener filter Hi(ω) based on the enhanced signal obtained at the previous iteration
• Initialize with the noisy speech signal
Step 2: Filter the noisy signal through the newly obtained Wiener filter according to X̂i(ω) = Hi(ω)·Y(ω) to get the new enhanced signal
Repeat the above procedure
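The iterative procedure could be sketched as follows for a single frame. The crude speech-power estimate (subtracting the noise PSD from the previous iterate's power) is an assumption made for illustration, not necessarily the estimator used in practice.

```python
import numpy as np

def iterative_wiener_frame(noisy_spec, noise_psd, n_iter=3, eps=1e-12):
    """Iteratively refine the Wiener filter for one frame's DFT.

    noisy_spec: complex DFT Y(w) of the noisy frame
    noise_psd:  estimated noise power spectrum Pdd(w)
    """
    enhanced = noisy_spec.copy()
    for _ in range(n_iter):
        # Step 1: estimate the speech power spectrum and the Wiener filter
        p_xx = np.maximum(np.abs(enhanced) ** 2 - noise_psd, eps)
        h = p_xx / (p_xx + noise_psd)
        # Step 2: filter the *noisy* spectrum with the newly obtained filter
        enhanced = h * noisy_spec
    return enhanced
```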
MMSE estimator
• The Wiener filter is the optimal (in the mean square error sense) complex spectrum estimator, not the optimal magnitude spectrum estimator
• Ephraim and Malah (1984) proposed an MMSE estimator that is the optimal magnitude spectrum estimator
• Unlike the Wiener estimator, the MMSE estimator does not require a linear model between the observed data and the estimator, but makes assumptions about the probability distributions of the speech and noise DFT coefficients:
  • Fourier transform coefficients (real and imaginary parts) have a Gaussian probability distribution; the mean of the coefficients is zero, and the variances are time-varying due to the nonstationarity of speech
  • Fourier transform coefficients are statistically independent, and hence uncorrelated
MMSE estimator (cont.)
In the frequency domain: Y(ωk) = X(ωk) + D(ωk), or, in polar form, Yk·e^{jθy(k)} = Xk·e^{jθx(k)} + Dk·e^{jθd(k)}
The MMSE derivation leads to the amplitude estimate
X̂k = (√π/2)·(√νk/γk)·exp(−νk/2)·[(1 + νk)·I0(νk/2) + νk·I1(νk/2)]·Yk
In(·) is the modified Bessel function of order n
νk = [ξk/(1 + ξk)]·γk, where ξk is the a priori SNR
a posteriori SNR: γk = Yk²/λd(k), with λd(k) the noise variance at frequency ωk
MMSE gain function
[Figure: MMSE spectral gain function plotted against the a priori and a posteriori SNRs]
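Assuming the Ephraim-Malah gain expression reconstructed on the previous slide, a minimal NumPy/SciPy sketch of the gain function might look like this. The function name is hypothetical; the exponentially scaled Bessel functions i0e/i1e are used for numerical stability at high SNR.

```python
import numpy as np
from scipy.special import i0e, i1e

def mmse_stsa_gain(xi, gamma):
    """MMSE short-time spectral amplitude gain (after Ephraim & Malah, 1984).

    xi:    a priori SNR per frequency bin
    gamma: a posteriori SNR per frequency bin
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    nu = xi / (1.0 + xi) * gamma
    # i0e(x) = exp(-x) * I0(x), so exp(-nu/2) * I0(nu/2) == i0e(nu/2);
    # the same holds for i1e, which avoids overflow for large nu.
    bessel = (1.0 + nu) * i0e(nu / 2.0) + nu * i1e(nu / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(nu) / gamma) * bessel

# Usage: enhanced_magnitude = mmse_stsa_gain(xi, gamma) * noisy_magnitude
```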
Estimating the a priori SNR
The suppression curves suggest that the a posteriori SNR has a small effect and the a priori SNR is the main factor influencing suppression
The a priori SNR can be estimated recursively (frame-wise) using the so-called "decision-directed" approach at frame m:
ξ̂k(m) = a·X̂k²(m − 1)/λd(k) + (1 − a)·max[γk(m) − 1, 0]
0 < a < 1, and a = 0.98 is found to work well
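A sketch of the decision-directed update for one frame (names are hypothetical; the xi_min floor is a common practical addition and not part of the formula above):

```python
import numpy as np

def decision_directed_xi(prev_enhanced_mag, noise_psd, gamma, a=0.98, xi_min=1e-3):
    """Decision-directed a priori SNR estimate for the current frame m.

    prev_enhanced_mag: |X_hat(m-1)| from the previous frame
    noise_psd:         estimated noise variance per frequency bin
    gamma:             a posteriori SNR of the current frame
    """
    xi = a * (prev_enhanced_mag ** 2) / noise_psd \
         + (1.0 - a) * np.maximum(gamma - 1.0, 0.0)
    return np.maximum(xi, xi_min)  # floor the estimate (practical safeguard)
```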
Other remarks and sound demo
• When the a priori SNR is estimated using the "decision-directed" approach, the enhanced speech has no "musical noise"
• A log-MMSE estimator also exists, which might be perceptually more meaningful
Sound demo:
• Noisy sentence (5 dB SNR)
• MMSE estimator
• Log-MMSE estimator
Subspace-based algorithms This class of algorithms is based on singular value decomposition (SVD) or eigenvalue decomposition of either data matrices or covariance matrices The basic idea behind the SVD approach is that the singular vectors corresponding to the largest singular values contain speech information, while the remaining singular vectors contain noise information Noise reduction is therefore accomplished by discarding the singular vectors corresponding to the smallest singular values
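A minimal sketch of the rank-reduction idea described above; the construction of the data matrix (typically a Hankel/trajectory matrix built from a frame of the noisy signal) is assumed and not shown.

```python
import numpy as np

def svd_denoise(data_matrix, keep):
    """Keep the `keep` largest singular values (assumed speech subspace) and
    discard the rest (assumed noise subspace), then reconstruct the matrix."""
    u, s, vt = np.linalg.svd(data_matrix, full_matrices=False)
    s[keep:] = 0.0                 # discard the smallest singular values
    return (u * s) @ vt            # low-rank (denoised) reconstruction
```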
Subjective evaluations
• In terms of speech quality, a subset of algorithms improves overall quality in a few conditions relative to the unprocessed condition; no algorithm produces improvement in multitalker babble
• In terms of intelligibility, no algorithm produces significant improvement over unprocessed noisy speech
Interim summary on speech enhancement
• Algorithms are derived analytically
  • Optimization theory
• Noise estimation is key
  • Reliable noise estimation is particularly needed in highly non-stationary environments
• Speech enhancement algorithms cannot deal with multitalker mixtures
• Inability to improve speech intelligibility
Part IV. CASA-based speech segregation
• Fundamentals of CASA for monaural mixtures
• CASA for speech segregation
  • Feature-based algorithms
  • Model-based algorithms
Cochleagram: Auditory spectrogram
Spectrogram
• Plot of log energy across time and frequency (linear frequency scale)
Cochleagram
• Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compression operations (log and cube root)
• Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
• A waveform signal can be constructed (inverted) from a cochleagram
[Figures: spectrogram and cochleagram of the same utterance]
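A rough cochleagram sketch following the description above: a hand-rolled 4th-order gammatone impulse response, half-wave rectification, and cube-root-compressed frame energies. All parameter values (filter order, bandwidth constants, frame length and hop) are illustrative, not prescriptive.

```python
import numpy as np
from scipy.signal import fftconvolve

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Impulse response of a 4th-order gammatone filter centered at fc Hz."""
    t = np.arange(0, duration, 1.0 / fs)
    erb = 24.7 + 0.108 * fc                        # equivalent rectangular bandwidth at fc
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, center_freqs, frame_len=320, hop=160):
    """Gammatone filtering -> half-wave rectification -> cube-root-compressed
    frame energies (20 ms frames with a 10 ms hop at 16 kHz are illustrative)."""
    n_frames = (len(x) - frame_len) // hop + 1
    cg = np.zeros((len(center_freqs), n_frames))
    for i, fc in enumerate(center_freqs):
        out = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        out = np.maximum(out, 0.0)                 # half-wave rectification (hair-cell stage)
        for j in range(n_frames):
            seg = out[j * hop : j * hop + frame_len]
            cg[i, j] = np.sum(seg ** 2) ** (1.0 / 3.0)
    return cg
```

The center frequencies would typically be spaced quasi-logarithmically (e.g. equally on the ERB scale) between roughly 50 Hz and half the sampling rate.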
Neural autocorrelation for pitch perception
[Figure: Licklider's (1951) neural autocorrelation model]
Correlogram
• Short-term autocorrelation of the output of each frequency channel of the cochleagram
• Peaks in the summary correlogram indicate pitch periods (F0)
• A standard model of pitch perception
[Figure: correlogram and summary correlogram of a vowel with an F0 of 100 Hz]
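A sketch of the correlogram for one time frame, assuming a (channels x samples) array of rectified cochlear filter outputs at the signal sampling rate (variable names are hypothetical):

```python
import numpy as np

def correlogram_frame(channel_responses, start, frame_len=320, max_lag=200):
    """Per-channel short-term autocorrelation plus the summary correlogram.

    channel_responses: (channels x samples) rectified cochlear filter outputs
    start:             first sample of the analysis frame
    A peak at lag T in the summary suggests a pitch period of T samples,
    i.e. F0 = fs / T.
    """
    n_channels = channel_responses.shape[0]
    acf = np.zeros((n_channels, max_lag))
    for c in range(n_channels):
        seg = channel_responses[c, start:start + frame_len]
        n = len(seg)
        for lag in range(max_lag):
            acf[c, lag] = np.dot(seg[:n - lag], seg[lag:n])
    return acf, acf.sum(axis=0)   # per-channel ACF and summary correlogram
```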
Onset and offset detection
An onset (offset) corresponds to a sudden intensity increase (decrease), which can be detected by taking the time derivative of the intensity
To reduce intensity fluctuations, Gaussian smoothing (low-pass filtering) is typically applied (as in edge detection for image analysis)
Note that d/dt [s(t) ∗ G(t)] = s(t) ∗ G′(t), where s(t) denotes intensity and G(t) is a Gaussian smoothing kernel with standard deviation σ
Onset and offset detection (cont.)
Hence onset and offset detection is a three-step procedure:
1. Convolve the intensity s(t) with G′ to obtain O(t)
2. Identify the peaks and the valleys of O(t)
3. Onsets are the peaks above a certain threshold, and offsets are the valleys below a certain threshold
[Figure: detected onsets and offsets]
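A sketch of this three-step procedure using SciPy's Gaussian derivative filter; sigma and the peak threshold are illustrative choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def detect_onsets_offsets(intensity, sigma=4.0, threshold=0.05):
    """Smooth-and-differentiate onset/offset detection on an intensity envelope."""
    # Step 1: convolve s(t) with the derivative of a Gaussian, i.e. s * G'
    d = gaussian_filter1d(intensity, sigma=sigma, order=1)
    onsets, offsets = [], []
    # Steps 2-3: find peaks/valleys of O(t) and apply the thresholds
    for t in range(1, len(d) - 1):
        if d[t] > d[t - 1] and d[t] > d[t + 1] and d[t] > threshold:
            onsets.append(t)        # local maximum above threshold -> onset
        if d[t] < d[t - 1] and d[t] < d[t + 1] and d[t] < -threshold:
            offsets.append(t)       # local minimum below threshold -> offset
    return onsets, offsets
```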
Segmentation versus grouping
Mirroring Bregman's two-stage conceptual model, a CASA model generally consists of a segmentation stage and a subsequent grouping stage
• The segmentation stage decomposes an acoustic scene into a collection of segments, each of which is a contiguous region in the cochleagram with energy primarily from one source
  • Based on cross-channel correlation, which encodes correlated responses (temporal fine structure) of adjacent filter channels, and temporal continuity
  • Based on onset and offset analysis
• Grouping aggregates segments into streams based on various ASA cues
Cross-channel correlation for segmentation
[Figures: correlogram and cross-channel correlation for a mixture of speech and trill telephone; segments generated based on cross-channel correlation and temporal continuity]
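A sketch of cross-channel correlation computed as the normalized correlation between the autocorrelation responses of adjacent channels (names are hypothetical; acf could come from the correlogram sketch above):

```python
import numpy as np

def cross_channel_correlation(acf):
    """Normalized correlation between adjacent channels' autocorrelations.

    acf: (channels x lags) autocorrelation responses for one frame.
    High values suggest the two channels respond to the same source component.
    """
    n_channels = acf.shape[0]
    corr = np.zeros(n_channels - 1)
    for c in range(n_channels - 1):
        a = acf[c] - acf[c].mean()
        b = acf[c + 1] - acf[c + 1].mean()
        denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + 1e-12
        corr[c] = np.sum(a * b) / denom     # correlation in [-1, 1]
    return corr
```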
Ideal binary mask
• A main CASA goal is to retain the parts of a mixture where the target sound is stronger than the acoustic background (i.e. to mask interference by the target), and discard the other parts (Hu & Wang, 2001; 2004)
  • What the target is depends on intention, attention, etc.
• In other words, the goal is to identify the ideal binary mask (IBM), which is 1 for a time-frequency (T-F) unit if the SNR within the unit exceeds a threshold, and 0 otherwise
• It does not actually separate the mixture!
• More discussion on the IBM in Part V
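A sketch of IBM construction when the premixed target and interference energies are available (a 0 dB local criterion is shown; the threshold is a free parameter, and the names are hypothetical):

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Ideal binary mask: 1 where the local (per T-F unit) SNR exceeds the
    local criterion lc_db, 0 otherwise.

    target_energy, noise_energy: T-F energies of the premixed target and
    interference (e.g. per-unit cochleagram energies before compression).
    """
    local_snr = 10.0 * np.log10((target_energy + eps) / (noise_energy + eps))
    return (local_snr > lc_db).astype(np.int8)
```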