Deep Neural Networks for Supervised Speech Separation DeLiang Wang Perception & Neurodynamics Lab Ohio State University
Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts
Auditory masking phenomenon • Definition: "The process by which the threshold of audibility for one sound is raised by the presence of another (masking) sound" (American Standards Association, 1960) • A basic phenomenon in auditory perception • Roughly speaking, a strong sound masks a weaker one within a critical band
Ideal binary mask as a separation goal • Motivated by the auditory masking phenomenon and auditory scene analysis, we suggested the ideal binary mask as a main goal of CASA (Hu & Wang, 2001; 2004) • The idea is to retain the parts of a mixture where the target sound is stronger than the acoustic background, and discard the rest • Definition of the ideal binary mask (IBM): IBM(t, f) = 1 if SNR(t, f) > θ, and 0 otherwise, where θ is a local SNR criterion (LC) in dB, typically chosen to be 0 dB • Optimal SNR: under certain conditions the IBM with θ = 0 dB is the optimal binary mask in terms of SNR gain (Li & Wang, 2009) • Maximal articulation index (AI) in a simplified version (Loizou & Kim, 2011) • Note that the IBM does not actually separate the mixture! (a computational sketch follows below)
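A minimal numpy sketch of this definition, assuming premixed target and noise T-F magnitudes are available (the function name and array layout are illustrative, not from the original work):

```python
import numpy as np

def ideal_binary_mask(target_mag, noise_mag, lc_db=0.0):
    """Label each T-F unit 1 if its local SNR exceeds the LC, else 0.

    target_mag, noise_mag: same-shape arrays of T-F magnitudes for the
    premixed target and noise (e.g., from a gammatone filterbank).
    lc_db: local SNR criterion (LC) in dB; 0 dB is the typical choice.
    """
    eps = np.finfo(float).eps
    local_snr_db = 10.0 * np.log10((target_mag ** 2 + eps) /
                                   (noise_mag ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)  # 1 = retain, 0 = discard
```

Because the IBM requires the premixed target and noise, it serves as a training label; it is not something computable from the mixture alone.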
Subject tests of ideal binary masking • IBM separation leads to large speech intelligibility improvements • Improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners (Brungart et al.’06; Li & Loizou’08; Cao et al.’11; Ahmadi et al.’13), and above 9 dB for hearing-impaired (HI) listeners (Anzalone et al.’06; Wang et al.’09) • Improvement for modulated noise is significantly larger than for stationary noise • With the IBM as the goal, the speech separation problem becomes a binary classification problem • This new formulation opens the problem to a variety of pattern classification methods
Speech perception of noise with binary gains • Wang et al. (2008) found that, when LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech) • [Sound demo: IBM-modulated speech-shaped noise]
DNN as subband classifier (Wang & Wang'13) • Why deep neural network (DNN)? • Automatically learns more abstract features as the number of layers increases • More abstract features tend to be more invariant to superficial variations • Y. Wang and Wang (2013) first introduced DNN to address the speech separation problem • DNN is used as a subband classifier, performing feature learning from raw acoustic features (see the sketch below)
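A sketch of one such subband classifier, written here in PyTorch (an assumption; the layer sizes and output head are illustrative, not the exact configuration of Wang & Wang, 2013):

```python
import torch
import torch.nn as nn

class SubbandClassifier(nn.Module):
    """One classifier per filterbank channel: maps a frame-level feature
    vector from that channel to an IBM label (0/1) for the T-F unit."""
    def __init__(self, n_features, n_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, 1),  # logit for the binary IBM label
        )

    def forward(self, x):
        return self.net(x)  # train with nn.BCEWithLogitsLoss
```

With a 64-channel filterbank, 64 such classifiers are trained, one per channel.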
Extensive training with DNN • Training on 200 randomly chosen utterances from both male and female IEEE speakers, mixed with 100 environmental noises (Hu'04) at 0 dB (~17 hours long) • About six million training samples in each channel, with 64 channels in total • Evaluated on 20 unseen speakers mixed with 20 unseen noises at 0 dB
DNN-based separation results • Comparison with a representative speech enhancement algorithm (Hendriks et al. 2010) • Using clean speech as ground truth, about 3 dB average SNR improvement • Using IBM-separated speech as ground truth, about 5 dB average SNR improvement
Speech intelligibility evaluation • We subsequently tested the speech intelligibility of hearing-impaired listeners (Healy et al.'13) • A very challenging problem: "The interfering effect of background noise is the single greatest problem reported by hearing aid wearers" (Dillon'12) • Two-stage DNN training to incorporate T-F context in classification
An illustration A HINT sentence mixed with speech-shaped noise at -5 dB SNR
Results and sound demos • Both HI and NH listeners showed intelligibility improvements • HI subjects with separation outperformed NH subjects without separation
Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts
On training targets (Y. Wang et al.'14) • What supervised training aims to learn is important for speech separation or enhancement • Different training targets lead to different mapping functions from noisy features to separated speech • Different targets may also generalize differently • While the IBM was the first target used in supervised separation, the quality of the separated speech is a persistent issue • What are reasonable alternative targets? • Target binary mask (TBM) (Kjems et al.'09; Gonzalez & Brookes'14) • Ideal ratio mask (IRM) (Srinivasan et al.'06; Narayanan & Wang'13; Hummersone et al.'14) • STFT spectral magnitude (Xu et al.'14; Han et al.'14)
Different training targets • TBM: similar to the IBM except that the interference is fixed to speech-shaped noise (SSN) • IRM: IRM(t, f) = (S²(t, f) / (S²(t, f) + N²(t, f)))^β, where S² and N² denote the target and noise energy in each T-F unit • β is a tunable parameter, and a good choice is 0.5 • With β = 0.5, the IRM becomes a square-root Wiener filter, which is the optimal estimator of the power spectrum (see the sketch below)
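A minimal numpy sketch of the IRM above (the function name is illustrative; inputs are per-unit target and noise energies):

```python
import numpy as np

def ideal_ratio_mask(target_energy, noise_energy, beta=0.5):
    """IRM(t, f) = (S^2 / (S^2 + N^2))^beta per T-F unit.

    With beta = 0.5 this is the square-root Wiener filter.
    """
    eps = np.finfo(float).eps
    return (target_energy / (target_energy + noise_energy + eps)) ** beta
```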
Different training targets (cont.) • Gammatone frequency power spectrum (GF-POW) • FFT-MAG of clean speech • FFT-MASK
Illustration of various training targets Factory noise at -5 dB
Evaluation methodology • Learning machine: DNN with 3 hidden layers, each with 1024 units • Target speech: TIMIT • Noises: SSN + four nonstationary noises from NOISEX • Training and testing on different segments of each noise • Trained at -5 and 0 dB, and tested at -5, 0, and 5 dB • Evaluation metrics • STOI: standard metric for predicted speech intelligibility • PESQ: standard metric for perceptual speech quality • SNR
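For reference, both metrics have open-source Python implementations; a minimal scoring sketch, assuming the third-party pystoi, pesq, and soundfile packages (not part of the original work) and illustrative file names:

```python
import soundfile as sf
from pystoi import stoi   # pip install pystoi
from pesq import pesq     # pip install pesq (fs must be 8 or 16 kHz)

clean, fs = sf.read('clean.wav')        # reference utterance
enhanced, _ = sf.read('enhanced.wav')   # separated/enhanced output

print('STOI:', stoi(clean, enhanced, fs))        # higher is better (0..1)
print('PESQ:', pesq(fs, clean, enhanced, 'wb'))  # higher is better (-0.5..4.5)
```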
Comparisons • Comparisons among different training targets • Trained and tested on each noise, but different segments • An additional comparison for the IRM target with multi-condition (MC) training of all noises (MC-IRM) • Comparisons with different approaches • Speech enhancement (Hendriks et al.’10) • Supervised NMF: ASNA-NMF (Virtanen et al.’13), trained and tested in the same way as supervised separation
STOI comparison for factory noise [Figure: STOI of the spectral-mapping targets, the NMF & speech enhancement (SPEH) baselines, and the masking targets]
PESQ comparison for factory noise [Figure: PESQ of the spectral-mapping targets, the NMF & speech enhancement (SPEH) baselines, and the masking targets]
Summary among different targets • Supervised separation yields substantial improvements over unprocessed mixtures • Between the two binary masks, IBM estimation performs better in PESQ than TBM estimation • Ratio masking tends to perform better than binary masking, mainly in speech quality • The IRM, FFT-MASK, and GF-POW produce comparable PESQ results • The FFT-MASK is a better estimation target than the FFT-MAG • Many-to-one mapping in FFT-MAG vs. one-to-one mapping in FFT-MASK; the latter should be easier to learn • Estimating spectral magnitudes or their compressed versions tends to magnify estimation errors
Generalization to new noise segments • While the previous speech intelligibility results are impressive, a major limitation is that training and test noise samples were drawn from the same noise segments • Speech utterances were different • Noise samples were randomized • We have recently addressed this limitation through extensive training (Healy et al.'15) • IRM estimation rather than IBM estimation • Frame-level estimation rather than subband classification • Training on the first 8 minutes of two nonstationary noises (20-talker babble and cafeteria noise) and testing on the last 2 minutes of the noises • Noise perturbation is used to enrich the noise samples for training (a generic illustration follows below) • See Chen et al.'s poster this afternoon
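The specific perturbations are described in Chen et al.'s poster; as one generic illustration only (not necessarily their technique), slightly resampling a noise recording changes its rate and yields additional training samples:

```python
import numpy as np
from scipy.signal import resample

def perturb_noise_rate(noise, rate):
    """Speed a noise recording up or down by resampling, producing a
    perturbed copy with shifted temporal and spectral structure."""
    return resample(noise, int(len(noise) / rate))

rng = np.random.default_rng(0)
noise = rng.standard_normal(16000)  # stand-in for a real noise recording
augmented = [perturb_noise_rate(noise, r) for r in (0.9, 1.05, 1.1)]
```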
Results and demos • HI listeners showed intelligibility improvements with both noises at both SNRs • NH listeners showed intelligibility improvements for babble noise, but not for cafeteria noise
Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts
Time-domain signal reconstruction • Current supervised separation learns to estimate either an ideal mask or spectral magnitudes • An intermediate stage: after estimation, speech is resynthesized in a separate step along with noisy phase • Can we use DNN to directly reconstruct the clean time-domain speech signal? • i.e., using time-domain clean speech as the training target? • Direct reconstruction of the time-domain signal can potentially lead to improved perceptual quality • Intelligibility is largely coded by spectral envelopes (Plomp’02) • It is assumed that quality is coded in temporal fine structure
Background analysis • Touched upon before by Tamura and Waibel (1988), who used a small MLP to predict the clean waveform from the noisy waveform • We will show that directly predicting the clean waveform does not work well • We instead incorporate domain knowledge into network training. We know that: • T-F masking is effective for suppressing background noise • Frequency-domain to time-domain conversion can be done by inverse transforms • We propose a new DNN architecture that unifies T-F masking and speech resynthesis (Y. Wang & Wang'15)
DNN architecture [Figure: proposed architecture, with a T-F masking layer followed by a speech resynthesis (inverse transform) layer]
DNN architecture (cont.) • Introducing a masking layer and an inverse transform layer • We perform masking in the DFT domain, so the resynthesis is conveniently implemented as an inverse DFT • Both forward and backward pass can be written as matrix operations, enabling efficient backprop training • Key differences from standard mask estimators • Speech resynthesis and masking are jointly trained • T-F mask is not predefined (e.g. IRM), but rather automatically learned in light of the loss function of interest, i.e. task-dependent masking • Phase information is explicitly used in learning, and one could use enhanced phase instead of noisy phase
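A numpy sketch of the forward pass through these two layers for a single frame (the sigmoid masking nonlinearity is an assumption; bin layout and windowing details are omitted):

```python
import numpy as np

def mask_and_inverse_dft(mask_logits, noisy_real, noisy_imag):
    """Masking layer followed by a fixed inverse-DFT layer.

    mask_logits : DNN outputs for one frame, one per DFT bin.
    noisy_real, noisy_imag : real/imaginary parts of the frame's full
    length-N noisy DFT, so the noisy phase is used.
    """
    n = len(mask_logits)
    mask = 1.0 / (1.0 + np.exp(-mask_logits))  # sigmoidal masking layer
    real = mask * noisy_real                   # elementwise T-F masking
    imag = mask * noisy_imag
    grid = 2.0 * np.pi * np.outer(np.arange(n), np.arange(n)) / n
    # Real part of the inverse DFT expressed as matrix operations, so
    # gradients can be backpropagated through this layer in training.
    return (np.cos(grid) @ real - np.sin(grid) @ imag) / n
```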
DNN training • Input feature: a complementary feature set (Wang et al.’13) plus gammatone filterbank energies • Training target: windowed clean waveform in each frame • The standard mean squared error between the estimated and clean signal is used as the loss function • In testing, we use the trained network to directly predict the windowed clean waveform snippets in each frame, which are overlap-added to produce the final time-domain reconstruction of the entire utterance
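A sketch of the overlap-add step used at test time (window normalization is omitted, as it depends on the analysis window and frame overlap):

```python
import numpy as np

def overlap_add(frames, hop):
    """Overlap-add predicted windowed frames into one waveform.

    frames : (n_frames, frame_len) array of per-frame estimates.
    hop    : frame shift in samples.
    """
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out
```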
Experimental setting • Use the TIMIT corpus • Randomly choose 2000 utterances for training • Test on the TIMIT core test set (192 utterances) • Speaker-independent setting • Four non-stationary noises • Factory, 100-talker babble, engine, and operation room (called “oproom”) noises from NOISEX-92 • Train on first half of the noise, and test on the second half • Each noise is 4 minutes long • Training on -5 and 0 dB, and test on -5, 0 and 5 dB
Experimental setting (cont.) • Comparisons among: • IBM-DNN: DNN-based IBM estimator • IRM-DNN: DNN-based IRM estimator • IFFT-DNN: the proposed system • DT-DNN: a standard DNN predicting the clean waveform from the noisy waveform • ASNA-NMF • DT-DNN uses the noisy waveform as its input feature, while the other DNN systems use the combined feature set mentioned before • All DNNs use 3 hidden layers, each having 1024 rectified linear units • We use the composite measure OVRL (Hu & Loizou'08), ranging from 1 to 5, to measure objective quality, along with STOI
Visualization • Masking layer activations obtained on a TIMIT utterance mixed with the factory noise at 5 dB • The learned activations look like a ratio mask
Results on -5 dB test set • IFFT-DNN significantly outperforms IRM-DNN in OVRL (e.g. on engine noise), and produces slightly worse or comparable STOI • DT-DNN does not work well; its STOI scores are even worse than those of the unprocessed mixtures • ASNA-NMF works reasonably well, but is significantly worse than the proposed system
Results on 5 dB test set • Intelligibility is not a concern in relatively high SNR conditions • IFFT-DNN produces similar STOI results to IRM-DNN, while still achieving better OVRL results
Demos • 5 dB cafeteria noise: Mixture, IRM-DNN, IFFT-DNN, Clean
Outline of presentation • Separation as DNN based classification • Ideal binary mask and speech intelligibility • Extensive training for better generalization • Speech intelligibility tests on hearing impaired listeners • On training targets • Masking vs. mapping, and binary vs. ratio masking • Speech intelligibility tests on new noise segments • Time-domain signal reconstruction • Related efforts
Joint binaural and monaural classification for reverberant speech separation • Speech separation has traditionally used binaural cues • CASA, ICA, beamforming • We recently used binaural (ITD and ILD) and monaural (GFCC) features to train a DNN classifier to estimate the IBM (Jiang et al.'14); a feature-extraction sketch follows below • DNN-based classification produces excellent results under low SNR, reverberation, and different spatial configurations • With reasonable coverage of training configurations, the approach generalizes well to untrained configurations • The inclusion of monaural features improves separation performance when target and interference are close, potentially overcoming a major hurdle for beamforming and other spatial filtering methods
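A minimal numpy sketch of the two binaural features for one frame of one channel (illustrative only; real systems compute these per T-F unit after gammatone filtering and append monaural GFCC features):

```python
import numpy as np

def binaural_features(left, right, fs):
    """ITD as the lag of the cross-correlation peak (in seconds) and
    ILD as the left/right energy ratio (in dB)."""
    eps = np.finfo(float).eps
    lag = np.argmax(np.correlate(left, right, mode='full')) - (len(right) - 1)
    itd = lag / fs
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    return itd, ild
```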
Special session this afternoon: DNN for supervised speech separation/enhancement • Gao et al. • Incorporating a DNN-based voice activity detector to provide frame-level speech presence probabilities, which are used to weight speech estimates • Substantial separation performance improvements at low SNRs • Weninger et al. • Recurrent neural networks (long short-term memory) for speech enhancement • Integration with an ASR backend leads to state-of-the-art robust ASR results on the CHiME-2 corpus • Kim & Smaragdis • Improving already-trained DNNs in unknown environments • Stacking an autoencoder on top of a denoising autoencoder to fine-tune the latter for adaptive separation • Chen et al. • Noise perturbation techniques to enlarge the set of noise samples for DNN training, leading to better separation performance
DNN or supervised learning for related tasks • Dereverberation, and joint dereverberation and denoising (Han et al.'14; '15) • Multipitch tracking (Lee & Ellis'12; Huang & Lee'13; Han & Wang'14) • Voice activity detection (e.g. Zhang et al.'13) • SNR estimation (Papadopoulos et al.'14)
Conclusion • Formulation of separation as classification or mask estimation enables the use of supervised learning • Extensive training with DNN is a promising direction • This approach has yielded the first demonstrations of speech intelligibility improvement in noise • Ratio masking should be preferred over binary masking for better speech quality • Mask estimation is preferred over spectral mapping • New DNN architecture to directly reconstruct time-domain clean signal • Joint training of a masking layer and an inverse DFT layer leads to better speech quality