Speech Enhancement • EE 516 Spring 2009 • Alex Acero
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Additive noise • Stationary noise: properties don’t change over time • White noise x[n] • flat power spectrum • Samples are uncorrelated • White Gaussian Noise • PDF is Gaussian (see chapter 10) • Typical noise is colored • Pink noise: low-pass in nature • Non-stationary: properties change over time • Babble noise • Cocktail party effect
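A minimal numpy sketch of the white-noise properties above (the signal length and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=20_000)   # white Gaussian noise: i.i.d. Gaussian samples

# Flat power spectrum: the periodogram fluctuates around a constant level (the variance)
psd = np.abs(np.fft.rfft(noise))**2 / len(noise)
print("average periodogram level:", psd.mean())   # ~1.0

# Uncorrelated samples: autocorrelation is ~0 at every nonzero lag
def autocorr(x, k):
    return np.mean(x[:len(x) - k] * x[k:]) if k else np.mean(x * x)

print("R[0] =", autocorr(noise, 0), " R[1] =", autocorr(noise, 1), " R[10] =", autocorr(noise, 10))
```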
Reverberation • Impulse response of an average office
Model of the Environment • (block diagram: clean speech x[m] passes through the channel/room response h[m], additive noise n[m] is summed in, and the observed signal is y[m])
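A minimal sketch of this model: clean speech x[m] filtered by the channel/room response h[m] plus additive noise n[m]. The filter taps and noise level here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=16_000)                        # stand-in for one second of clean speech at 16 kHz
h = np.array([1.0, 0.5, 0.25])                     # illustrative channel / reverberation taps
n = 0.1 * rng.normal(size=len(x) + len(h) - 1)     # additive noise at the microphone

y = np.convolve(x, h) + n                          # y[m] = x[m] * h[m] + n[m]
print(y.shape)
```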
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Cepstral Mean Normalization • Compute the mean of the cepstrum • And subtract it from the input • CMN is robust to channel distortion • Normalizes the average vocal tract or short filters • The average must include > 2 sec of speech
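A minimal CMN sketch over a matrix of cepstral vectors; the frames × coefficients layout and the 13 MFCCs are assumptions, not from the slides:

```python
import numpy as np

def cmn(cepstra):
    """Cepstral Mean Normalization: subtract the per-utterance mean cepstrum
    from every frame (cepstra has shape frames x coefficients)."""
    return cepstra - cepstra.mean(axis=0)

# 300 frames at a 10 ms shift is 3 s, satisfying the "> 2 sec of speech" guideline
frames = np.random.randn(300, 13) + 5.0      # a constant offset stands in for a channel/filter
normalized = cmn(frames)
print(np.round(normalized.mean(axis=0), 6))  # ~0 for every coefficient: channel offset removed
```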
RASTA • The mean in CMN is computed with a low-pass filter (rectangular window); subtracting it makes CMN a high-pass operation on the cepstral trajectories • Other low-pass filters can be used for the mean estimate • The RASTA filter is band-pass
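As a sketch of such a band-pass filter on feature trajectories, the coefficients below are the ones usually quoted for the RASTA filter (a short FIR differentiator followed by a leaky integrator); treat the exact values as an assumption rather than a definitive reference:

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(trajectories):
    """Band-pass filter each coefficient trajectory over time (frames x coefficients)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # FIR part: removes slow (channel-like) variation
    a = np.array([1.0, -0.98])                        # IIR part: smooths fast frame-to-frame variation
    return lfilter(b, a, trajectories, axis=0)

log_spectra = np.random.randn(300, 20) + 3.0          # toy log-spectral trajectories with a constant offset
filtered = rasta_filter(log_spectra)
print(filtered.shape)
```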
Retrain with noisy data • Mismatches between training and testing are bad for pattern recognition systems • Retrain with noisy data • Approximation: add noise to clean data and retrain
Multi-condition training • Very hard to predict exactly the type of noise we’ll encounter at test time • Too expensive to retrain the system for each noise condition • Train system offline with several noise types and levels
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Condenser Microphone • Microphone preamplifier (circuit diagram: microphone impedance ZM, load resistance RL, amplifier with gain G, output voltage v(t))
Omnidirectional microphones • Polar response (figure: mic opening and diaphragm)
Bidirectional microphones • Speech sound wave from the front, noise sound wave from the side (figure: source at distance r from the two ports at (−d, 0) and (d, 0), with path lengths r1 and r2)
Bidirectional microphones • Response of a bidirectional microphone with d = 1 cm at 0° incidence • The solid line corresponds to far-field conditions and the dotted line to near-field conditions
Unidirectional microphones Speech sound wave from the front Noise sound wave from the side
Dynamic microphones • (figure: diaphragm driving a coil in a magnet's field to produce the output voltage)
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Acoustic echo cancellation • (block diagram: the loudspeaker signal x[n] travels through the acoustic path H and reaches the microphone as echo d[n], together with the local speech signal s[n] and local noise r[n]; an adaptive filter driven by x[n] produces an echo estimate v[n], which is subtracted from the microphone signal to give the output e[n])
Line echo cancellation • (block diagram: Speaker A's signal x[n] leaks through the hybrid circuit H as echo d[n], mixing with Speaker B's speech s[n] and noise r[n]; an adaptive filter driven by x[n] produces an echo estimate v[n], which is subtracted to give e[n])
Least Mean Squares (LMS) • Given input • Estimate output • Compute error • Update filter • Need to tune step size
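The equations are missing from the extracted slide; a minimal LMS sketch for the echo-cancellation setting above is given below. Here x is the far-end (loudspeaker) reference, d is the microphone signal, and the filter length and step size are illustrative assumptions:

```python
import numpy as np

def lms(x, d, num_taps=32, mu=0.01):
    """Least Mean Squares adaptive filter: predict d from the last num_taps samples of x."""
    w = np.zeros(num_taps)
    e = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        xn = x[n - num_taps + 1:n + 1][::-1]   # x[n], x[n-1], ..., most recent first
        e[n] = d[n] - w @ xn                   # error: microphone minus estimated echo
        w += mu * e[n] * xn                    # LMS update (step size mu must be small enough)
    return w, e

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
h_true = np.array([0.8, 0.4, -0.2])            # unknown echo path (illustrative)
d = np.convolve(x, h_true)[:len(x)]
w, e = lms(x, d)
print(np.round(w[:3], 2))                      # converges towards h_true
```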
Normalized LMS • Make step size adaptive to ensure convergence • Where we track the input energy
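The same loop with a normalized step size (NLMS). The slide's "track the input energy" is implemented here as the instantaneous energy of the filter's input window; eps is an assumed regularizer to avoid division by zero:

```python
import numpy as np

def nlms(x, d, num_taps=32, mu=0.5, eps=1e-8):
    """Normalized LMS: the step size is divided by the current input energy,
    which keeps the update stable regardless of the input level."""
    w = np.zeros(num_taps)
    e = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        xn = x[n - num_taps + 1:n + 1][::-1]
        e[n] = d[n] - w @ xn
        w += (mu / (eps + xn @ xn)) * e[n] * xn   # energy-normalized update
    return w, e
```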
Recursive Least Squares (RLS) • Newton-Raphson (figure: successive iterates x0, x1 on a curve f(x)) • New weights • Faster convergence, but more CPU intensive
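And a sketch of one standard form of the RLS recursion, which maintains an estimate of the inverse input correlation matrix (the forgetting factor and initialization constant are assumptions); this matrix update is what makes RLS converge faster but cost more per sample:

```python
import numpy as np

def rls(x, d, num_taps=32, lam=0.999, delta=100.0):
    """Recursive Least Squares adaptive filter."""
    w = np.zeros(num_taps)
    P = delta * np.eye(num_taps)               # initial inverse correlation estimate
    e = np.zeros(len(x))
    for n in range(num_taps - 1, len(x)):
        xn = x[n - num_taps + 1:n + 1][::-1]
        k = P @ xn / (lam + xn @ P @ xn)       # gain vector
        e[n] = d[n] - w @ xn                   # a priori error
        w += k * e[n]                          # weight update
        P = (P - np.outer(k, xn @ P)) / lam    # update inverse correlation estimate
    return w, e
```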
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Microphone arrays: delay & sum • 5 microphones (M-2 … M2) spaced 5 cm apart, source S located at 5 m, angle 0° • (figure: beam patterns at 400 Hz, 880 Hz, 4400 Hz, 8000 Hz)
Microphone arrays: delay & sum • Same 5 microphones spaced 5 cm apart, source located at 5 m, angle 30° • (figure: beam patterns at 400 Hz, 880 Hz, 4400 Hz, 8000 Hz)
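A sketch of the delay-and-sum response for the geometry on these two slides (5 microphones, M-2 through M2, spaced 5 cm apart), under a far-field plane-wave assumption and steering at 0°; the speed of sound and the angle grid are assumptions. Spatial aliasing above c/(2 · 0.05 m) ≈ 3.4 kHz explains the grating lobes visible at the higher frequencies:

```python
import numpy as np

c = 343.0                                    # speed of sound in m/s (assumed)
mics = 0.05 * np.array([-2, -1, 0, 1, 2])    # microphone positions along the array (meters)
angles = np.radians(np.linspace(-90, 90, 361))

def delay_and_sum_response(freq_hz, steer_deg=0.0):
    """Magnitude response of a delay-and-sum beamformer versus arrival angle."""
    steer = np.radians(steer_deg)
    # far-field plane-wave phase at each mic, after compensating the steering delays
    phase = 2j * np.pi * freq_hz * mics[:, None] * (np.sin(angles) - np.sin(steer)) / c
    return np.abs(np.exp(phase).mean(axis=0))

for f in (400, 880, 4400, 8000):
    r = delay_and_sum_response(f)
    print(f, "Hz: gain at 0 deg =", round(r[180], 2), " at 30 deg =", round(r[240], 2))
```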
Bone microphone for noise robust ASR • Conventional microphones are sensitive to noise • Bone microphones are more noise resistant, but distort the signal • Not enough data to retrain recognizer with bone microphone • Fusion between acoustic microphone and bone microphone
Relationship between acoustic mic and bone mic • (figure: acoustic vs. bone/contact microphone signals)
Blind source separation • Linear mixing • Estimate filter • Separate signals • Using assumption signals are independent • Do gradient descent:
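The update equations are missing from the extracted slide. As one common instantiation of "gradient descent under an independence assumption," here is a sketch of an instantaneous (non-convolutive) natural-gradient ICA update; the tanh score function, learning rate, and toy Laplacian sources are all assumptions:

```python
import numpy as np

def natural_gradient_ica(Y, lr=0.01, iters=1000):
    """Blind source separation of instantaneous mixtures via natural-gradient ICA.
    Y: observed mixtures, shape (channels, samples)."""
    m, T = Y.shape
    W = np.eye(m)                                    # unmixing matrix, z = W y
    for _ in range(iters):
        Z = W @ Y                                    # current source estimates
        G = np.tanh(Z)                               # score function (suits super-Gaussian sources)
        W += lr * (np.eye(m) - (G @ Z.T) / T) @ W    # natural-gradient step toward independence
    return W

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))                      # two independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # mixing matrix (illustrative)
W = natural_gradient_ica(A @ S)
print(np.round(W @ A, 2))                            # ~ a scaled permutation of the identity
```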
Blind source separation • (figure: 2×2 mixing/unmixing network with filters h11[n], h12[n], h21[n], h22[n] combining y1[n] and y2[n] into z1[n] and z2[n]) • Idea: estimate filters h11[n] and h12[n] that maximize p(z1[n]|λ), where λ is an HMM • Approximate the HMM by a Gaussian mixture model with LPC parameters => EM algorithm with a linear set of equations
Outline • A model of the acoustical environment • Simple things first! • Microphones • Echo cancellation • Microphone arrays • Single channel noise suppression
Spectral subtraction • Corrupted signal: y[m] = x[m] + n[m] • Power spectrum: |Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2 Re{X(f) N*(f)} • but speech and noise are uncorrelated, so the cross term is zero on average • So |Y(f)|^2 ≈ |X(f)|^2 + |N(f)|^2 • Estimate the noise power spectrum from frames that contain only noise • Estimate the clean power spectrum by subtracting the noise estimate: |X(f)|^2 ≈ |Y(f)|^2 − |N(f)|^2
Spectral subtraction • Keep the original (noisy) phase for resynthesis • Ensure the estimated power spectrum is positive (floor negative values after subtraction)
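A minimal per-frame sketch combining the two slides above: subtract an estimated noise power spectrum, floor the result to keep it positive, and keep the noisy phase for resynthesis. The frame length, hop, window, and floor value are assumptions:

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd, frame=256, hop=128, floor=0.01):
    """Power spectral subtraction; noise_psd is the average |N(f)|^2 over noise-only frames."""
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * window
        Y = np.fft.rfft(seg)
        clean_psd = np.abs(Y)**2 - noise_psd                       # subtract the noise estimate
        clean_psd = np.maximum(clean_psd, floor * np.abs(Y)**2)    # ensure it stays positive
        X = np.sqrt(clean_psd) * np.exp(1j * np.angle(Y))          # reuse the original (noisy) phase
        out[start:start + frame] += np.fft.irfft(X)                # overlap-add resynthesis
    return out

# noise_psd is typically estimated as the mean |FFT|^2 over frames known to contain only noise.
```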
Aurora 2 • ETSI STQ group • TIDigits • Added noise at SNRs: -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, 20 dB • Set A: subway, babble, car, exhibition • Set B: restaurant, airport, street, station • Set C: one noise from Set A and one noise from Set B • Aurora 3 recorded in car (no digital mixing!) • Aurora 4 for large vocabulary • Advanced Front-End (AFE) standard (2001) uses a variant of spectral subtraction
Aurora 2 (Clean training) Using SPLICE algorithm
Aurora 2 (multi-condition training) Using SPLICE algorithm
Wiener Filtering • Find a linear estimate of the clean signal: x̂[n] = Σ h[k] y[n−k] • MMSE (Minimum Mean Squared Error) criterion • Wiener-Hopf equation: Σ h[k] Ryy[m−k] = Rxy[m] • In the frequency domain: H(f) = Sxy(f) / Syy(f) • If noise and signal are uncorrelated: H(f) = Sxx(f) / (Sxx(f) + Snn(f))
Wiener Filtering • Find a linear estimate of the clean signal • If noise and signal are uncorrelated: H(f) = Sxx(f) / (Sxx(f) + Snn(f)) • Equivalently, with SNR(f) = Sxx(f) / Snn(f): H(f) = SNR(f) / (1 + SNR(f)) • Compare with spectral subtraction
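A sketch of the Wiener gain in the uncorrelated case, applied per frequency bin; how the speech and noise PSDs are estimated (e.g. noise-only frames plus a smoothed speech estimate) is left outside this snippet:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, eps=1e-12):
    """H(f) = Sxx(f) / (Sxx(f) + Snn(f)): a real, zero-phase gain between 0 and 1."""
    return speech_psd / (speech_psd + noise_psd + eps)

def wiener_enhance(noisy_spectrum, speech_psd, noise_psd):
    """Apply the Wiener gain to a noisy spectrum; the noisy phase is left untouched."""
    return wiener_gain(speech_psd, noise_psd) * noisy_spectrum

# Compared with spectral subtraction, both methods apply a real-valued gain per bin;
# the Wiener gain never goes negative, so no explicit flooring of the power spectrum is needed.
```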
Vector Taylor Series (VTS) • Acero, Moreno • The power spectrum, on the average • Taking logs • Cepstrum is DCT (matrix C) of log power spectrum
Vector Taylor Series (VTS) • x, h, and n are Gaussian random vectors with means μx, μh, and μn and covariance matrices Σx, Σh, and Σn • Expand y in a first-order Taylor series
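The equations on this slide and the previous one are missing from the extracted text; the standard relations they refer to, written here as a hedged reconstruction in the log/cepstral domain, are:

```latex
% Power spectra, on the average (cross terms vanish):
\[
|Y(f)|^2 \;\approx\; |X(f)|^2\,|H(f)|^2 + |N(f)|^2
\]
% Taking logs and applying the DCT matrix C to get cepstra:
\[
\mathbf{y} \;=\; \mathbf{x} + \mathbf{h}
  + \mathbf{C}\,\ln\!\bigl(\mathbf{1} + e^{\,\mathbf{C}^{-1}(\mathbf{n}-\mathbf{x}-\mathbf{h})}\bigr)
  \;\equiv\; \mathbf{x} + \mathbf{h} + g(\mathbf{x},\mathbf{h},\mathbf{n})
\]
% First-order vector Taylor series around the means:
\[
\mathbf{y} \;\approx\; \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + g(\boldsymbol{\mu}_x,\boldsymbol{\mu}_h,\boldsymbol{\mu}_n)
  + \mathbf{A}(\mathbf{x}-\boldsymbol{\mu}_x) + \mathbf{A}(\mathbf{h}-\boldsymbol{\mu}_h)
  + (\mathbf{I}-\mathbf{A})(\mathbf{n}-\boldsymbol{\mu}_n)
\]
% with A the Jacobian of y with respect to x at the expansion point, so that
\[
\boldsymbol{\Sigma}_y \;\approx\;
  \mathbf{A}\,(\boldsymbol{\Sigma}_x + \boldsymbol{\Sigma}_h)\,\mathbf{A}^{\top}
  + (\mathbf{I}-\mathbf{A})\,\boldsymbol{\Sigma}_n\,(\mathbf{I}-\mathbf{A})^{\top}
\]
```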
Vector Taylor Series • Distribution of corrupted log-spectra • Noise with a mean of 0 dB and a standard deviation of 2 dB • Speech with a mean of 25 dB • Monte Carlo simulation • (figure: resulting distributions for speech standard deviations of 25 dB, 10 dB, and 5 dB)
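A minimal Monte Carlo sketch matching this slide: log-spectral noise with 0 dB mean and 2 dB standard deviation, speech with 25 dB mean, combined in the power domain. Only one speech standard deviation is shown here (the slide's figure varies it):

```python
import numpy as np

rng = np.random.default_rng(0)
num = 100_000

n_db = rng.normal(0.0, 2.0, num)     # noise log-spectrum: mean 0 dB, std dev 2 dB
x_db = rng.normal(25.0, 10.0, num)   # speech log-spectrum: mean 25 dB (10 dB std dev assumed)

# Corrupted log-spectrum: add speech and noise in the power domain, then go back to dB
y_db = 10.0 * np.log10(10.0**(x_db / 10.0) + 10.0**(n_db / 10.0))

# The result is no longer Gaussian: samples with low speech energy pile up near the noise floor
print("mean:", round(y_db.mean(), 1), "dB   std dev:", round(y_db.std(), 1), "dB")
```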
Phase matters • Corrupted signal: y[m] = x[m] + n[m] • Spectrum: |Y(f)|^2 = |X(f)|^2 + |N(f)|^2 + 2 |X(f)||N(f)| cos θ(f) • But |Y(f)|^2 ≈ |X(f)|^2 + |N(f)|^2 is only an approximation: the cross term depends on the relative phase θ(f)
Non-stationary noise • Speech/noise decomposition (Varga et al.) • (figure: observations modeled jointly by a speech HMM and a noise HMM)