Workshop “Research Domains in Electronics & Telecomm Engineering”, 9th July 2014, Vivekanand Education Society's Institute of Technology, Chembur, Mumbai – 400 074
Single-channel Speech Enhancement for Real-time Applications
Prof P C Pandey, SPI Lab, EE Dept., IIT Bombay
http://www.ee.iitb.ac.in/~pcpandey
http://www.ee.iitb.ac.in/~spilab
IIT Bombay, 09 July 2014
Overview
• Introduction
• Speech Enhancement Using Spectral Subtraction
• Investigations Using Offline Processing
• Implementation for Real-time Processing
• Summary & Conclusions
Main contributors: Santosh K. Waddi, Nitya Tiwari
References
• S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” in Proc. Nat. Conf. Commun. 2013, New Delhi, India, doi: 10.1109/NCC.2013.6487989.
• N. Tiwari, S. K. Waddi, and P. C. Pandey, "Speech enhancement and multi-band frequency compression for suppression of noise and intraspeech spectral masking in hearing aids," in Proc. 10th Annual Conference of the IEEE India Council (IEEE Indicon 2013), Mumbai, December 13-15, 2013, paper no. 524.
1. Introduction
• Sensorineural hearing loss
  - Increased hearing thresholds and high frequency loss
  - Decreased dynamic range & abnormal loudness growth
  - Reduced speech perception due to increased spectral & temporal masking
  → Decreased speech intelligibility in noisy environment
• Signal processing in hearing aids
  - Frequency selective amplification
  - Automatic volume control
  - Multichannel dynamic range compression (settable attack time, release time, and compression ratios)
• Processing for reducing the effect of increased spectral masking in sensorineural loss
  - Binaural dichotic presentation (Lunner et al. 1993, Kulkarni et al. 2012)
  - Spectral contrast enhancement (Yang et al. 2003)
  - Multiband frequency compression (Arai et al. 2004, Kulkarni et al. 2012)
Techniques for reducing the background noise
• Directional microphone
• Adaptive filtering (a second microphone needed for noise reference)
• Single-channel noise suppression using spectral subtraction (Boll 1979, Berouti et al. 1979, Martin 1994, Kamath & Loizou 2002, Loizou 2007, Lu & Loizou 2008, Paliwal et al. 2010)
Processing steps
• Dynamic estimation of non-stationary noise spectrum
  - During non-speech segments using voice activity detection
  - Continuously using statistical techniques
• Estimation of noise-free speech spectrum
  - Spectral noise subtraction
  - Multiplication by noise suppression function
• Speech resynthesis (using enhanced magnitude and noisy phase)
Objective
• Real-time single-input speech enhancement for use in hearing aids and other sensory aids (cochlear prostheses, etc.) for hearing impaired listeners, and in communication devices
Main challenges
• Noise estimation without voice activity detection, to avoid errors under low SNR & during long speech segments
• Low signal delay (algorithmic + computational) for real-time application
• Low computational complexity & memory requirement for implementation on a low-power processor
2. Speech Enhancement Using Spectral Subtraction
• Dynamic estimation of non-stationary noise spectrum
• Estimation of noise-free speech spectrum
• Speech resynthesis
Generalized spectral subtraction (Berouti et al. 1979)
• Power subtraction
  - Windowed speech spectrum: $X_n(k)$
  - Estimated noise magnitude spectrum: $D_n(k)$
  - Estimated speech spectrum: $Y_n(k) = \left[\,|X_n(k)|^2 - (D_n(k))^2\,\right]^{0.5} e^{j\angle X_n(k)}$
  - Problems: residual noise due to under-subtraction; distortion in the form of musical noise & clipping due to over-subtraction
• Generalized subtraction
  $$|Y_n(k)| = \begin{cases} \left[\,|X_n(k)|^{\gamma} - \alpha\,(D_n(k))^{\gamma}\,\right]^{1/\gamma}, & \text{if } |X_n(k)| > (\alpha+\beta)^{1/\gamma}\, D_n(k) \\ \beta^{1/\gamma}\, D_n(k), & \text{otherwise} \end{cases}$$
  - $\gamma$ = exponent factor (2: power subtraction, 1: magnitude subtraction)
  - $\alpha$ = over-subtraction factor (for limiting the effect of short-term variations in noise spectrum)
  - $\beta$ = floor factor to mask the musical noise due to over-subtraction
• Re-synthesis with noisy phase without explicit phase calculation: $Y_n(k) = |Y_n(k)|\, X_n(k) / |X_n(k)|$
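The subtraction rule and the phase-free resynthesis above can be sketched per frame as follows. This is a minimal NumPy illustration, not the authors' code: the function name is hypothetical, the default values α = 2, β = 0.01, γ = 1 are taken from the parameter ranges reported later in the slides, and the small `eps` guard against division by zero is an added assumption.

```python
import numpy as np

def generalized_spectral_subtraction(X, D, alpha=2.0, beta=0.01, gamma=1.0):
    """Apply generalized spectral subtraction to one frame.

    X: complex noisy spectrum X_n(k); D: estimated noise magnitude D_n(k).
    Returns the complex enhanced spectrum Y_n(k), which keeps the noisy phase.
    """
    X = np.asarray(X, dtype=complex)
    D = np.asarray(D, dtype=float)
    mag = np.abs(X)
    # Bins above this threshold are subtracted; the rest get the spectral floor.
    threshold = (alpha + beta) ** (1.0 / gamma) * D
    # Clamp at zero so the root is defined in bins where the floor is used anyway.
    diff = np.maximum(mag ** gamma - alpha * D ** gamma, 0.0)
    subtracted = diff ** (1.0 / gamma)
    floor = beta ** (1.0 / gamma) * D
    Y_mag = np.where(mag > threshold, subtracted, floor)
    # Resynthesis without explicit phase: Y = |Y| X / |X| (eps is an assumed guard).
    eps = np.finfo(float).tiny
    return Y_mag * X / np.maximum(mag, eps)
```

With γ = 1 this reduces to magnitude subtraction; setting γ = 2 gives power subtraction with the same floor logic.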
Multi-band spectral subtraction (Kamath & Loizou 2002)
• Noise does not affect the spectrum uniformly
• Speech spectrum divided into B non-overlapping bands; spectral subtraction is performed independently in each band
• Test material: 10 sentences from HINT database; noise: speech-shaped noise; noisy speech: 0 dB and 5 dB SNR
• Evaluation: Itakura-Saito (IS) distance as an objective measure
• Improvement over conventional power spectral subtraction, with very little trace of musical noise
Geometric approach to spectral subtraction (Lu & Loizou 2008)
• Derived without assuming the cross-terms to be zero
• Test material: NOIZEUS database; noise: babble, street, car, white; noisy speech: 0 dB, 5 dB and 15 dB SNR
• Evaluation: mean square error (MSE), PESQ, log likelihood ratio
• Cross terms can be ignored at very low and very high SNRs but not near 0 dB
• Processed output: no audible musical noise; smooth and pleasant residual noise
• Performed significantly better than power spectral subtraction in all conditions
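The cross-terms mentioned for the geometric approach can be made explicit with a short expansion. The symbols $S_n(k)$ (clean speech spectrum), $N_n(k)$ (complex noise spectrum) and the phases $\theta_S$, $\theta_N$ are notation assumed here for illustration; they do not appear in the slides.

```latex
% Additive noise model in the short-time spectral domain
X_n(k) = S_n(k) + N_n(k)

% Squared magnitude of the noisy spectrum
|X_n(k)|^2 = |S_n(k)|^2 + |N_n(k)|^2
           + 2\,|S_n(k)|\,|N_n(k)|\cos\!\big(\theta_S(k) - \theta_N(k)\big)
```

Power subtraction drops the last (cross) term, which is exact only when speech and noise are in quadrature in that bin. When one component dominates, the cross term is small relative to the larger power, which is in line with the slide's observation that it matters most near 0 dB SNR.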
Noise estimation
• Minimal-tracking algorithms
  - Minimum statistics (Martin 1994): tracks the noise as minima of past frames
  - Minimum tracking (Doblinger 1995): smooths noisy speech power spectra in each frequency bin using non-linear smoothing
• Time-recursive averaging algorithms
  - SNR-dependent recursive averaging (Lin et al. 2003): noisy speech decomposed into sub-band signals; noisy signal power is smoothed and noise is estimated adaptively; smoothing parameter is a function of estimated SNR
  - Weighted spectral averaging (Hirsch & Ehrlicher 1995): first-order recursive weighted average of past spectral magnitude values over 400 ms that are below an adaptive threshold
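As an illustration of the time-recursive family, the following is a rough sketch in the spirit of weighted spectral averaging: the noise estimate is updated by first-order recursion only in bins whose magnitude is below a threshold tied to the current estimate. The function name, the smoothing constant 0.9, the threshold factor 2.0, and initialization from the first frame are all assumptions made for this example, not values from the cited work.

```python
import numpy as np

def weighted_average_noise(frames_mag, smoothing=0.9, threshold=2.0):
    """Estimate the noise magnitude spectrum by thresholded recursive averaging.

    frames_mag: array of shape (num_frames, num_bins) with magnitude spectra.
    Returns the noise estimate after the last frame, shape (num_bins,).
    """
    frames_mag = np.asarray(frames_mag, dtype=float)
    noise = frames_mag[0].copy()  # assumption: the first frame is noise-only
    for mag in frames_mag[1:]:
        # Bins below the adaptive threshold are treated as noise-only.
        update = mag < threshold * noise
        noise[update] = smoothing * noise[update] + (1.0 - smoothing) * mag[update]
    return noise
```

Bins with strong speech are skipped, so the estimate holds its previous value there; this is also why such schemes can lag when the noise level rises abruptly.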
Improved minima-controlled recursive averaging (Cohen 2003)
• Two iterations of smoothing and tracking
  - First iteration: provides rough voice activity detection in each frequency band
  - Second iteration: smoothing excludes strong speech components, making the minimum tracking robust during speech activity
• Smoothing parameter: frequency-dependent & dynamically adjusted by signal presence probability
• Lower estimation error than minimum statistics
• Method is combined with log-spectral amplitude estimator
• Higher segmental SNR improvement than minimum statistics
Histogram-based technique (Hirsch & Ehrlicher 1995)
• Histogram: noisy speech over 400 ms
• Noise estimate: maximum of the distribution in each sub-band
• To avoid spikes, estimated values are smoothed along the time axis
• Objective evaluation: relative error; lower relative error than the weighted spectral average method (Hirsch & Ehrlicher 1995)
Quantile-based noise estimation (Stahl et al. 2000)
• Speech signal energy: low in most of the frames, high in only 10 – 20% of the frames
• Noise estimation: selecting a certain quantile value from previous frames of the noisy speech spectrum
• Quantile selection: frequency-dependent and SNR-dependent
• Median-based noise estimation is robust, but is difficult to use for real-time applications
Cascaded-median based estimation (Basha & Pandey 2012)
• Moving median approximated by a p-point q-stage cascaded median, with a saving in memory & computation for real-time implementation
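One way to picture a p-point q-stage cascaded median is as a chain of small buffers: each stage collects p values, emits their median to the next stage, and starts over, so the last stage summarizes p^q frames while only p·q values per frequency bin are stored. The class below is a sketch under that reading; the class name and the exact update schedule (each stage emptied after it emits) are assumptions and may differ from the cited implementation.

```python
import numpy as np

class CascadedMedian:
    """p-point q-stage cascaded median over a stream of magnitude spectra."""

    def __init__(self, num_bins, p=3, q=4):
        self.p, self.q = p, q
        # p*q stored values per frequency bin (12 for p = 3, q = 4).
        self.buffers = np.zeros((q, p, num_bins))
        self.counts = [0] * q   # number of values currently held in each stage
        self.estimate = None    # last-stage median; None until p**q frames are seen

    def update(self, mag):
        """Push one magnitude spectrum; return the current noise estimate or None."""
        value = np.asarray(mag, dtype=float)
        for stage in range(self.q):
            self.buffers[stage, self.counts[stage]] = value
            self.counts[stage] += 1
            if self.counts[stage] < self.p:
                break  # stage not full yet, nothing to pass on
            # Stage is full: take its median, empty it, and feed the next stage.
            value = np.median(self.buffers[stage], axis=0)
            self.counts[stage] = 0
            if stage == self.q - 1:
                self.estimate = value
        return self.estimate
```

For a monotonically increasing input 1, 2, ..., 81 in a single bin, the 3-point 4-stage cascade returns 41, which matches the true 81-point median; for general inputs the cascaded value is only an approximation of the moving median.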
MBNE vs CMBNE comparison (MBNE: median-based noise estimation; CMBNE: cascaded-median based noise estimation)
• Condition for reducing sorting operations: low p; p = 3 → code simplification for sorting operations
• Condition for reducing storage: q ≈ ln(M), with M the number of frames spanned by the moving median
Project objective
• Implementation of generalized spectral subtraction along with cascaded-median based noise estimation for real-time processing using a low-power DSP
• Selection of optimal set of processing steps and parameters, using offline processing
• Implementation on a DSP board with a 16-bit fixed-point processor & evaluation
3. Investigations Using Offline Processing
• Test material
  - Speech material 1: recording with three isolated vowels, a Hindi sentence, and an English sentence (-/a/-/i/-/u/ – “aayiye aapkaa naam kyaa hai?” – “Where were you a year ago?”) from a male speaker, referred to as "VHSES"
  - Speech material 2: six sentences from NOIZEUS database from one male speaker
  - Noise: white, pink, street, babble, car, and train noises
  - SNR: ∞, 18, 15, 12, 9, 6, 3, 0, -3, -6 dB
• Evaluation methods
  - Informal listening
  - Objective evaluation using PESQ measure (0 – 4.5)
• Investigations (fs = 10 kHz)
  - Overlap of 50% & 75%: indistinguishable outputs
  - γ = 1 (magnitude subtraction): higher tolerance to variation in α, β values
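Test signals at the listed SNR values can be prepared by scaling the noise before adding it to the clean speech. The helper below is a sketch with a hypothetical name; it assumes SNR is defined from average power over the whole utterance, which the slides do not specify, and that the noise recording is at least as long as the speech.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Return speech plus noise scaled to the requested overall SNR in dB."""
    speech = np.asarray(speech, dtype=float)
    # Assumption: the noise recording is at least as long as the speech.
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

The SNR = ∞ condition in the list corresponds to using the clean speech without calling this helper.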
Investigation on noise estimation
[Figure: scatter plots of magnitude spectra, speech material: VHSES. Panels: (a) clean speech signal, (b) white noise, (c) noisy speech: white noise, 3 dB SNR, (d) noisy speech: white noise, 0 dB SNR]
[Figure: scatter plots of magnitude spectra, speech material: NOIZEUS. Panels: (a) clean speech signal, (b) white noise, (c) noisy speech: white noise, 3 dB SNR, (d) noisy speech: white noise, 0 dB SNR]
[Figure: mean, median, and minimum of magnitude spectra of clean speech signal, noise, and noisy speech (white, SNR: 0 dB), speech material: VHSES. Panels: (a) mean, (b) median, (c) minimum]
• Noisy signal median tracks the noise median & noisy signal minimum tracks the noise minimum at almost all frequencies
[Figure: mean, median, and minimum of magnitude spectra of clean speech signal, noise, and noisy speech (white, SNR: 0 dB), speech material: NOIZEUS. Panels: (a) mean, (b) median, (c) minimum]
Relative RMS error (dB)
• Objective evaluation of the accuracy of noise estimation
• Relative RMS error (dB) decreases as SNR decreases
[Figure: relative RMS error (dB). Panels: (a) speech material: VHSES, (b) speech material: NOIZEUS]
Effect of window length and noise estimation duration
• Processing: magnitude spectral subtraction with median-based noise estimation
• High PESQ score: noise estimation across 81 past frames & 20 – 40 ms window length
• 30 ms window length was chosen (noise estimation duration of approximately 1.2 s)
[Figure panels: (a) speech: VHSES, noise: white, SNR: 0 dB, (b) speech: NOIZEUS, noise: white, SNR: 0 dB]
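The figure of approximately 1.2 s follows from the frame parameters. The arithmetic below assumes the 50% overlap used elsewhere in the slides, so that consecutive frames are 15 ms apart; the variable names are illustrative.

```python
fs = 10_000        # sampling frequency in Hz (from the slides)
window_ms = 30     # chosen analysis window length
overlap = 0.5      # assumed 50% overlap between frames
num_frames = 81    # past frames used for the median

window_len = fs * window_ms // 1000            # samples per window
frame_shift_ms = window_ms * (1 - overlap)     # time between frame starts
estimation_span_s = num_frames * frame_shift_ms / 1000

print(window_len, frame_shift_ms, estimation_span_s)
```

This gives a 300-sample window, a 15 ms frame shift, and a span of about 1.2 s over which the noise is estimated.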
Comparison of enhanced speech using MBNE and CMBNE
• MBNE requires large memory & is computation intensive
• 3-point 4-stage cascaded-median significantly reduces memory requirement & computations
  - Reduction in storage requirement per freq. bin: from 162 to 12 samples
  - Reduction in number of sorting operations per frame per freq. bin: from 40 to 3
• Informal listening: perceptually the same
• Objective evaluation: almost the same in most cases, with a maximum difference of 0.06 in PESQ score of the enhanced speech. Speech: NOIZEUS, SNR: 0 dB
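The storage and sorting figures can be reproduced with simple expressions in p, q, and M. The expressions used here (2M and (M − 1)/2 for the M-point median, p·q and p(p − 1)/2 for the cascade) are an assumption chosen because they match the slide's numbers; the slides themselves do not state them.

```python
p, q = 3, 4
M = p ** q                          # frames spanned by the cascade: 81

# Assumed cost expressions that reproduce the figures on the slide.
mbne_storage = 2 * M                # samples per frequency bin
mbne_sorts = (M - 1) // 2           # sorting operations per frame per bin
cmbne_storage = p * q
cmbne_sorts = p * (p - 1) // 2

print(mbne_storage, mbne_sorts, cmbne_storage, cmbne_sorts)
```

Under these assumptions the cascade needs 12 stored samples and 3 sorting operations per bin, against 162 and 40 for the direct 81-point median.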
Effect of spectral subtraction parameters
• Processing: magnitude spectral subtraction using 3-point 4-stage CMBNE
• Analysis-synthesis: 30 ms window length & 50% overlap
• Spectral floor factor β: 0.01 appropriate for all the cases
• Subtraction factor α: in 2 – 2.5 for VHSES and in 1.2 – 1.4 for NOIZEUS speech material
Phase estimation for spectral subtraction
• Processing: magnitude spectral subtraction using 3-point 4-stage CMBNE
• Phase: zero, cepstrum (Rabiner & Schafer 1978), Quatieri & Oppenheim 1981, Nawab et al. 1983
• Analysis-synthesis: 50% overlap rect. win., 75% overlap rect. win., Griffin-Lim method (Griffin & Lim 1984)
• Informal listening: no improvement from using estimated phase over using noisy phase
• Objective evaluation: signal estimated using noisy phase has higher PESQ score
Comparison of proposed method with other methods
• Proposed method: magnitude spectral subtraction with cascaded-median based noise estimation; analysis-synthesis with 30 ms window length and 50% overlap
• Comparison: spectral-subtractive, statistical-model based, and subspace algorithms (implementations available on CD accompanying Loizou 2007: specsub, mband, ga, wiener_iter, wiener_as, wiener_wt, mt_mask, audnoise, mmse, logmmse, logmmse_spu, stsa_weuchild, stsa_wcosh, stsa_mis, kli, pklt)
[Figure: comparison of PESQ scores for VHSES speech material, 0 dB SNR]
[Figure: comparison of PESQ scores for NOIZEUS speech material, 0 dB SNR]
• Observation: comparable to the best ones
Discussion
• FFT length N = 512 & higher: indistinguishable outputs
• Processing: magnitude spectral subtraction with 3-point 4-stage CMBNE, analysis-synthesis with 30 ms window length & 50% overlap
• Informal listening: significant enhancement for all noises at different SNRs
• Spectral subtraction parameters: β = 0.01 appropriate for all the cases, α in 2 – 2.5 for VHSES and in 1.2 – 1.4 for NOIZEUS speech material
• SNR advantage: 4 – 13 dB for VHSES & 2 – 7 dB for NOIZEUS speech material
4. Implementation for Real-time Processing
• 16-bit fixed-point DSP: TI/TMS320C5515
  - 16 MB memory space: 320 KB on-chip RAM with 64 KB dual access RAM, 128 KB on-chip ROM
  - Three 32-bit programmable timers, 4 DMA controllers each with 4 channels
  - FFT hardware accelerator (8 to 1024-point FFT)
  - Max. clock speed: 120 MHz
• DSP board: eZdsp
  - 4 MB on-board NOR flash for user program
  - Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization, 8 – 192 kHz sampling
• Development environment for C: TI's 'CCStudio, ver. 4.0'
Implementation
• One codec channel (ADC and DAC) with 16-bit quantization
• Sampling frequency: 10 kHz
• Window length of 30 ms (L = 300) with 50% overlap, FFT length N = 512
• Storage of input samples, spectral values, processed samples: 16-bit real & 16-bit imaginary parts
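The analysis-synthesis structure implied by these parameters (L = 300, 50% overlap, N = 512) can be sketched offline in floating point as below. This is not the fixed-point DSP code: the function name and the `frame_fn` callback are hypothetical, the periodic Hann window is an assumption (the slides do not name the window used in the real-time version), and truncating the inverse FFT to L samples is a simplification.

```python
import numpy as np

def process_overlap_add(x, frame_fn, L=300, N=512):
    """Window, transform, modify, and overlap-add a signal with 50% overlap."""
    x = np.asarray(x, dtype=float)
    S = L // 2  # frame shift (150 samples at L = 300)
    # Assumed periodic Hann window: copies shifted by S sum to exactly 1.
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(L) / L)
    y = np.zeros(len(x))
    for start in range(0, len(x) - L + 1, S):
        frame = x[start:start + L] * window
        X = np.fft.rfft(frame, n=N)   # zero-padded N-point FFT
        Y = frame_fn(X)               # spectral modification, e.g. noise subtraction
        # Keep the first L samples of the inverse transform and overlap-add.
        y[start:start + L] += np.fft.irfft(Y, n=N)[:L]
    return y
```

Passing `frame_fn=lambda X: X` reconstructs the input exactly except for the first and last half-frame, which is a convenient check before a spectral modification is plugged in.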
Data transfers and buffering operations (S = L/2)
• DMA cyclic buffers (each block with S samples)
  - 3-block input buffer
  - 2-block output buffer
• Pointers (incremented cyclically on DMA interrupt)
  - current input block
  - just-filled input block
  - current output block
  - write-to output block
• Signal delay
  - Algorithmic: 1 frame (30 ms)
  - Computational: ≤ frame shift (15 ms)
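The pointer rotation can be illustrated with a small simulation. The class and attribute names are hypothetical, and the reading that a frame of L = 2S samples is formed from the two most recently filled input blocks (while the third is being filled by DMA) is an assumption consistent with the 3-block input buffer, not a detail stated in the slides.

```python
class BlockPointers:
    """Cyclic block indices for a 3-block input and a 2-block output buffer."""

    def __init__(self, num_in=3, num_out=2):
        self.num_in, self.num_out = num_in, num_out
        self.current_in = 0    # input block the DMA is filling now
        self.current_out = 0   # output block the DMA is sending to the DAC now

    @property
    def just_filled_in(self):
        return (self.current_in - 1) % self.num_in

    @property
    def previous_in(self):
        # Assumed: with S = L/2, a frame spans the two most recently filled blocks.
        return (self.current_in - 2) % self.num_in

    @property
    def write_to_out(self):
        return (self.current_out + 1) % self.num_out

    def on_dma_interrupt(self):
        """Advance both cyclic pointers by one block."""
        self.current_in = (self.current_in + 1) % self.num_in
        self.current_out = (self.current_out + 1) % self.num_out
```

At every step the three input indices are distinct, so the processing never reads a block that the DMA is writing, and the output side alternates between one block being played and one being written.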
Results
[Figure: PESQ score vs SNR for noisy and enhanced speech using offline and real-time processing. Panels: (a) speech: VHSES, (b) speech: NOIZEUS]
• Offline processing improvement in PESQ score: 0.57 – 0.80 for VHSES & 0.28 – 0.44 for NOIZEUS
• Real-time processing improvement in PESQ score: 0.39 – 0.71 for VHSES & 0.22 – 0.32 for NOIZEUS
Example of processing: -/a/-/i/-/u/ – “aayiye aapkaa naam kyaa hai?” – “Where were you a year ago?”, with white noise at 3 dB SNR
[Figure panels: (a) clean speech, (b) noisy speech, (c) offline processed, (d) real-time processed]
Comparison of enhanced speech between offline and real-time processing. Speech: VHSES, SNR: 0 dB
• Real-time processing tested using white, babble, car, pink, and train noises: real-time processed output perceptually similar to the offline processed output
• Signal delay = 48 ms
• Lowest clock for satisfactory operation = 16.4 MHz → processing capacity used ≈ 1/7 of the capacity with the highest clock (120 MHz)
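The capacity and delay figures can be checked with a few lines of arithmetic. The variable names are illustrative, and the delay bound simply adds the algorithmic delay (one frame) and the maximum computational delay (one frame shift) given earlier in the slides.

```python
lowest_clock_mhz = 16.4     # lowest clock for satisfactory operation
highest_clock_mhz = 120.0   # maximum clock of the processor
fraction_used = lowest_clock_mhz / highest_clock_mhz

algorithmic_delay_ms = 30.0        # one 30 ms frame
max_computational_delay_ms = 15.0  # at most one frame shift
delay_bound_ms = algorithmic_delay_ms + max_computational_delay_ms

print(f"capacity used: {fraction_used:.3f} (1/7 = {1 / 7:.3f})")
print(f"algorithmic + computational bound: {delay_bound_ms} ms")
```

The ratio is close to one-seventh. The reported 48 ms delay is slightly above the 45 ms sum; the slides do not itemize the remaining few milliseconds.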
5. Summary & Conclusions
• Investigation & implementation of spectral subtraction for real-time operation: magnitude spectrum subtraction and resynthesis using noisy phase, along with cascaded-median based dynamic noise estimation for reducing computation and memory requirement
• Enhancement of speech with different types of additive stationary and non-stationary noise: SNR advantage of 4 – 13 dB for VHSES & 2 – 7 dB for NOIZEUS
• Implementation for real-time operation using 16-bit fixed-point processor TI/TMS320C5515: implementation with 10 kHz sampling using 1/7 of processing capacity, signal delay = 48 ms
Further work
• Frequency & a posteriori SNR-dependent subtraction & spectral floor factors
• Combination of speech enhancement technique with other processing techniques in the sensory aids
• Implementation using other processors
• Subjective evaluation of intelligibility and quality of enhanced speech
Abstract Sensorineural loss is generally associated with increased spectral masking due to widened auditory filters and the listeners having this kind of hearing impairment often experience great difficulty when the speech is contaminated by noise. This thesis presents investigations for real-time enhancement of noisy speech using spectral subtraction for suppressing the external noise. Investigation using offline processing for enhancing the noisy speech with different types of noise and SNR values is carried out to select the optimal set of steps and parameters for real-time processing. PESQ score is used for objective comparison of quality of the enhanced speech. Results show that median based noise estimation is effective in estimating noise from noisy speech without a voice activity detector, for a wide variety of stationary and non-stationary noises and range of SNR values and that a cascaded-median can be used as an approximation to median for significantly reducing the computation and memory requirement, without adversely affecting the noise estimation. Speech enhancement using magnitude spectrum subtraction with 3-point 4-stage cascaded median for noise estimation and resynthesis using noisy phase resulted in improvements in PESQ scores in the range 0.28 – 0.44 for speech material from NOIZEUS database with added white noise. Resynthesis using phase estimated from the enhanced magnitude spectrum did not result in any further improvement in the scores. The processing technique is implemented and tested for satisfactory operation, with sampling frequency of 10 kHz, 30 ms analysis window with 50% overlap, using a DSP board based on 16-bit fixed-point DSP processor TMS320C5515 with on-chip FFT hardware. The implementation uses data transfer and buffering operations devised for an efficient realization of analysis-synthesis and codec and DMA for acquisition of the input signal and outputting of the processed output signal. 
The real-time operation is achieved with signal delay of approximately 48 ms and using about one-seventh of the computing capacity of the processor.
Bibliography
[1] H. Levitt, J. M. Pickett, and R. A. Houde, Eds., Sensory Aids for the Hearing Impaired. New York: IEEE Press, 1980, pp. 3–10.
[2] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston, Mass.: Allyn & Bacon, 1999, pp. 289–323.
[3] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001.
[4] B. C. J. Moore, An Introduction to the Psychology of Hearing. London, UK: Academic, 1997, pp. 66–107.
[5] T. Lunner, S. Arlinger, and J. Hellgren, “8-channel digital filter bank for hearing aid use: preliminary results in monaural, diotic, and dichotic modes,” Scand. Audiol. Suppl., vol. 38, pp. 75–81, 1993.
[6] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss,” Int. J. Audiol., vol. 51, no. 4, pp. 334–344, 2012.
[7] J. Yang, F. Luo, and A. Nehorai, “Spectral contrast enhancement: Algorithms and comparisons,” Speech Commun., vol. 39, no. 1–2, pp. 33–46, 2003.
[8] T. Arai, K. Yasu, and N. Hodoshima, “Effective speech processing for various impaired listeners,” in Proc. 18th Int. Cong. Acoust. (ICA 2004), Kyoto, Japan, 2004, pp. 1389–1392.
[9] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss,” Speech Commun., vol. 54, no. 3, pp. 341–350, 2012.
[10] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC, 2007.
[11] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[12] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP 1979, Washington, DC, pp. 208–211.
[13] S. Kamath and P. Loizou, “A multi-band spectral subtraction method for enhancing speech corrupted by colored noise,” in Proc. IEEE ICASSP, 2002, Orlando, Florida, vol. 4, pp. IV–4164.
[14] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Commun., vol. 50, no. 6, pp. 453–466, 2008.
[15] K. Paliwal, K. Wojcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Commun., vol. 52, no. 5, pp. 450–475, 2010.
[16] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc. Eur. Signal Process. Conf., 1994, pp. 1182–1185.
[17] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, 2003.
[18] H. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE ICASSP, 1995, Detroit, MI, pp. 153–156.
[19] V. Stahl, A. Fischer, and R. Bippus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” in Proc. IEEE ICASSP, 2000, Istanbul, Turkey, pp. 1875–1878.
[20] G. Doblinger, “Computationally efficient speech enhancement by spectral minima tracking in subbands,” in Proc. 4th Eur. Conf. Speech Commun. and Technology (EUROSPEECH’95), Madrid, Spain, 1995, pp. 1513–1516.
[21] L. Lin, W. H. Holmes, and E. Ambikairajah, “Adaptive noise estimation algorithm for speech enhancement,” Electronics Letters, vol. 39, no. 9, pp. 754–755, 2003.
[22] C. Ris and S. Dupont, “Assessing local noise level estimation methods: application to noise robust ASR,” Speech Commun., vol. 34, no. 1–2, pp. 141–158, 2001.
[23] S. K. Basha and P. C. Pandey, “Real-time enhancement of electrolaryngeal speech by spectral subtraction,” in Proc. Nat. Conf. on Commun. 2012 (NCC 2012), Kharagpur, India, 2012, pp. 516–520.
[24] S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” in Proc. Nat. Conf. Commun. (NCC 2013), Delhi, India, 2013, paper no. 1569696063.
[25] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001.
[26] Y. Hu and P. C. Loizou, “Subjective evaluation and comparison of speech enhancement algorithms,” Speech Commun., vol. 49, pp. 588–601, 2007.
[27] T. F. Quatieri and A. V. Oppenheim, “Iterative techniques for minimum phase signal reconstruction from phase or magnitude,” IEEE Trans. Acoust., Speech, Signal Process., vol. 29, no. 6, pp. 1187–1193, 1981.
[28] S. H. Nawab, T. F. Quatieri, and J. S. Lim, “Signal reconstruction from short-time Fourier transform magnitude,” IEEE Trans. Acoust., Speech, Signal Process., vol. 31, no. 4, pp. 986–998, 1983.
[29] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, New Jersey: Prentice Hall, 1978, pp. 356–362.
[30] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, 1984.
[31] Spectrum Digital, Inc. (2010) TMS320C5515 eZdsp USB Stick Technical Reference. [online]. Available: support.spectrumdigital.com/boards/usbstk5515/reva/files/usbstk5515_TechRef_RevA.pdf
[32] Texas Instruments, Inc. (2011) TMS320C5515 Fixed-Point Digital Signal Processor. [online]. Available: focus.ti.com/lit/ds/symlink/tms320c5515.pdf
[33] Texas Instruments, Inc. (2008) TLV320AIC3204 Ultra Low Power Stereo Audio Codec. [online]. Available: focus.ti.com/lit/ds/symlink/tlv320aic3204.pdf
[34] S. K. Waddi, “Real-time enhancement of noisy speech using spectral subtraction,” M.Tech thesis, Electrical Engineering, Indian Institute of Technology Bombay, 2013.
[35] S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” in Proc. Nat. Conf. Commun. 2013, New Delhi, India, doi: 10.1109/NCC.2013.6487989.
[36] N. Tiwari, S. K. Waddi, and P. C. Pandey, “Speech enhancement and multi-band frequency compression for suppression of noise and intraspeech spectral masking in hearing aids,” in Proc. 10th Annual Conference of the IEEE India Council (IEEE Indicon 2013), Mumbai, December 13-15, 2013, paper no. 524.