Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation

Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners [Ref: S. K. Waddi, P. C. Pandey, N.Tiwari, Proc. NCC 2013, Delhi, 15-17 Feb. 2013, Paper 3.2_2_1569696063]

Overview Introduction Signal Processing for Spectral Subtraction Implementation for Real-time Processing Test Results Summary & Conclusion

1. Introduction • Sensorineural hearing loss • Increased hearing thresholds and high frequency loss • Decreased dynamic range & abnormal loudness growth • Reduced speech perception due to increased spectral & temporal masking • → Decreased speech intelligibility in noisy environment • Signal processing in hearing aids • Frequency selective amplification • Automatic volume control • Multichannel dynamic range compression (settable attack time, release time, and compression ratios) • Processing for reducing the effect of increased spectral masking in sensorineural loss • Binaural dichotic presentation (Lunneret al. 1993, Kulkarniet al. 2012) • Spectral contrast enhancement (Yang et al. 2003) • Multiband frequency compression (Arai et al. 2004, Kulkarniet al. 2012)

Techniques for reducing the background noise • Directional microphone • Adaptive filtering (a second microphone needed for noise reference) • Single-channel noise suppression using spectral subtraction • (Boll 1979, Beroutiet al.1979, Martin 1994, Loizou 2007, Lu & Loizou 2008, Paliwal et al. 2010) • Processing steps • Dynamic estimation of non-stationary noise spectrum • - During non-speech segments using voice activity detection • - Continuously using statistical techniques • Estimation of noise-free speech spectrum • - Spectral noise subtraction • - Multiplication by noise suppression function • Speech resynthesis (using enhanced magnitude and noisy phase)

Research objective • Real-time single-input speech enhancement for use in • hearing aids and other sensory aids (cochlear prostheses, etc) • for hearing impaired listeners • Main challenges • Noise estimation without voice activity detection to avoid errors • under low-SNR & during long speech segments • Low signal delay(algorithmic + computational) for real-time application • Low computational complexity & memory requirement for implementation on a low-power processor • Proposed technique: Spectral subtraction using cascaded-median based continuous updating of the noise spectrum (without using voice activity detection) • Real-time implementation: 16-bit fixed-point DSP with on-chip FFT hardware • Evaluation: Informal listening, PESQ-MOS

2. Signal Processing for Spectral Subtraction • Dynamic estimation of non-stationary noise spectrum • Estimation of noise-free speech spectrum • Speech resynthesis

Power subtraction Windowed speech spectrum = Xn(k) Estimated noise mag. spectrum = Dn(k) Estimated speech spectrum Yn(k) = [|Xn(k)|2 – (Dn(k))2 ] 0.5e j<Xn(k) Problems: residual noise due to under-subtraction, distortion in the form of musical noise & clipping due to over-subtraction. Generalized spectral subtraction (Beroutiet al. 1979) |Yn(k)|= β1/γDn(k), if |Xn(k)| < (α+β)1/γDn(k) [ |Xn(k)|γ – α(Dn(k))γ]1/γotherwise γ= exponent factor (2: power subtraction, 1: magnitude subtraction) α= over-subtraction factor (for limiting the effect of short-term variations in noise spectrum) β = floor factor to mask the musical noise due to over-subtraction Re-synthesis with noisy phase without explicit phase calculation Yn(k) = |Yn(k)| Xn(k) / |Xn(k)|

Dynamic estimation of noise magnitude spectrum • Min. statistics based est. (Martin 1994): SNR-dependent over-subtraction factor. • Median based est. (Stahl et al. 2000): Large computation & memory. • Pseudo-median based est. (Basha & Pandey 2012) : moving median approximated by p-point q-stage cascaded-median, with a saving in memory & computation for real-time implementation. Median Storage per freq. bin Sortings per frame per freq. bin M-point 2M (M–1)/2 p-pontq-stage (M =pq) pqp(p–1)/2 Condition for reducing sorting operations and storage: low p, q ≈ ln(M)

Re-synthesis of enhanced signal • Spectral subtraction → Enhanced magnitude spectrum • Enh. magnitude spectrum & original phase spectrum → Complex spectrum • Resynthesis using IFFT and overlap-add • Investigations using offline implementation (fs= 12 kHz, frame = 30 ms) • Overlap of 50% & 75% : indistinguishable outputs • FFT length N = 512 & higher : indistinguishable outputs • γ = 1 (magnitude subtraction) : higher tolerance to variation in α, βvalues • Duration needed for dynamic noise estimation ≈ 1 s • p = 3 for simplifying programming and reducing the sorting operations • 3-frame 4-stage cascaded-median (M=81, p=3, q=4) , 50% overlap: moving median over 1.215 s • Outputs using true-median & cascaded-median: indistinguishable • Reduction in storage requirement per freq. bin: from 162 to 12 samples • Reduction in number of sorting operations per frame per freq. bin: from 40 to 3

3. Implementationfor Real-time Processing • 16-bit fixed point DSP: TI/TMS320C5515 • 16 MB memory space : 320 KB on-chip RAM with 64 KB dual access RAM, 128 KB on-chip ROM • Three 32-bit programmable timers, 4 DMA controllers each with 4 channels • FFT hardware accelerator (8 to 1024-point FFT) • Max. clock speed: 120 MHz • DSP Board: eZdsp • 4 MB on-board NOR flash for user program • Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization , 8 – 192 kHz sampling • Development environment for C: TI's 'CCStudio, ver. 4.0'

Implementation • One codec channel (ADC and DAC) with 16-bit quantization • Sampling frequency: 12 kHz • Window length of 30 ms (L = 360) with 50% overlap, FFT length N= 512 • Storage of input samples, spectral values, processed samples: 16-bit real & 16-bit imaginary parts

Data transfers and buffering operations (S = L/2) • DMA cyclic buffers • 3 block input buffer • 2 block output buffer • (each with S samples) • Pointers • current input block • just-filled input block • current output block • write-to output block • (incremented cyclically on DMA interrupt) • Signal delay • Algorithmic: • 1 frame (30 ms) • Computational ≤ • frame shift (15 ms)

4. Test Results • Test material • Speech: Recording with three isolated vowels, a Hindi sentence, an English sentence (-/a/-/i/-/u/– “aayiyeaapkaanaamkyaahai?” – “Where were you a year ago?”) from a male speaker. • Noise: white, pink, babble, car, and train noises (AURORA ). SNR: ∞, 15, 12, 9, 6, 3, 0, -3, -6 dB. • Evaluation methods • Informal listening • Objective evaluation using PESQ measure (0 – 4.5) • Results: Offline processing • Processing parameters: β= 0.001, α: 1.5 – 2.5. • Informal listening: no audible roughness in the enhanced speech, speech clipping at larger α.

PESQ score vs SNR: noisy & enhanced speech SNR advantage : 13 dB for white noise, 4 dB for babble

Processing examples & PESQ scores • Speech material: speech-speech-silence-speech, speech: /-a-i-u-/––"ayiye"–– "aapkaanaamkyaahai?"––" where were you a year ago?” • Processing parameters: Frame length = 30 ms, Overlap = 50%, β = 0.001, noise estimation by 3-point 4-stage casc. median (estimation duration = 1.215 s)

Example: “Where were you a year ago”, white noise, input SNR = 3 dB. (a) Clean speech (c) Offline processed (b) Noisy speech (d) Real-time processed More examples: http://www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013

Results: Real-time processing • Real-time processing tested using white, babble, car, pink, train noises: • real-time processed output perceptually similar to the offline processed output • Signal delay = 48 ms • Lowest clock for satisfactory operation = 16.4 MHz → Processing capacity used ≈ 1/7 of the capacity with highest clock (120 MHz)

5. Summary & Conclusions • Proposed technique for suppression of additive noise: cascaded-median based dynamic noise estimation for reducing computation and memory requirement for real-time operation. • Enhancement of speech with different types of additive stationary and non-stationary noise: SNR advantage (at PESQ score = 2.5): 4 – 13 dB, Increase in PESQ score (at SNR = 0dB): 0.37 – 0.86. • Implementation for real-time operation using 16-bit fixed-point processor TI/TMS320C5515: used-up processing capacity ≈ 1/7, delay = 48 ms. • Further work • Frequency & a posteriori SNR-dependent subtraction & spectral floor factors • Combination of speech enhancement technique with other processing techniques in the sensory aids • Implementation using other processors

Thank You

Abstract A spectral subtraction technique is presented for real-time speech enhancement in the aids used by hearing impaired listeners. For reducing computational complexity and memory requirement, it uses a cascaded-median based estimation of the noise spectrum without voice activity detection. The technique is implemented and tested for satisfactory real-time operation, with sampling frequency of 12 kHz, processing using window length of 30 ms with 50% overlap, and noise estimation by 3-frame 4-stage cascaded-median, on a 16-bit fixed-point DSP processor with on-chip FFT hardware. Enhancement of speech with different types of additive stationary and non-stationary noise resulted in SNR advantage of 4 – 13 dB.

References [1]H. Levitt, J. M. Pickett, and R. A. Houde (eds.), Senosry Aids for the Hearing Impaired. New York: IEEE Press, 1980. [2] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston, Mass.: Allyn Bacon, 1999, pp. 289–323. [3] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001. [4] T. Lunner, S. Arlinger, and J. Hellgren, “8-channel digital filter bank for hearing aid use: preliminary results in monaural, diotic, and dichotic modes,” Scand. Audiol. Suppl., vol. 38, pp. 75–81, 1993. [5] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss,” Int. J. Audiol., vol. 51, no. 4, pp. 334–344, 2012. • [6]J. Yang, F. Luo, and A. Nehorai, “Spectral contrast enhancement: Algorithms and comparisons,” Speech Commun., vol. 39, no. 1–2, pp. 33–46, 2003. [7] T. Arai, K. Yasu, and N. Hodoshima, “Effective speech processing for various impaired listeners,” in Proc. 18th Int. Cong. Acoust., 2004, Kyoto, Japan, pp. 1389–1392. [8] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss,” Speech Commun., vol. 54, no. 3 pp. 341–350, 2012. • [9] P. C. Loizou, "Speech processing in vocoder-centric cochlear implants," in A. R. Moller (ed.), Cochlear and Brainstem Implants, Adv. Otorhinolaryngol. vol. 64, Basel: Karger, 2006, pp. 109–143. • [10] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC, 2007. • [11] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc. Eur. Signal Process. Conf., 1994, pp. 1182-1185. • [12] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466-475, 2003. • [13] H. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 153-156.

[14] V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” in Proc. IEEE ICASSP, 2000, pp. 1875-1878. [15] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP, 1979, pp. 208-211. [16] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, 1979. [17] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Commun., vol. 50, no. 6, pp. 453-466, 2008. • [18] S. K. Basha and P. C. Pandey, “Real-time enhancement of electrolaryngeal speech by spectral subtraction,” in Proc. Nat. Conf. Commun. 2012, Kharagpur, India, pp. 516-520. • [19] K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Commun., vol. 52, no. 5, pp. 450–475, 2010. • [20] Texas Instruments, Inc., “TMS320C5515 Fixed-Point Digital Signal Processor,” 2011, [online] Available: focus.ti.com/lit/ds/symlink/ tms320c5515.pdf. • [21] Spectrum Digital, Inc., “TMS320C5515eZdsp USB Stick Technical Reference,” 2010, [online] Available: support.spectrumdigital.com/boards/usbstk5515/reva/files/ usbstk5515_TechRef_RevA.pdf • [22] Texas Instruments, Inc., “TLV320AIC3204 Ultra Low Power Stereo Audio Codec,” 2008, [online] Available: focus.ti.com/lit/ds/ symlink/tlv320aic3204.pdf. • [23] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec., P.862, 2001. • [24] S. K. Waddi, “Speech enhancement results”, 2013, [online] Available: www.ee.iitb.ac.in/ ~spilab/material/santosh/ncc2013.

Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation

Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation

Presentation Transcript

Wavelet-Based Speech Enhancement

Dealing with Acoustic Noise Part 1: Spectral Estimation

Introduction to Spectral Estimation

Noise Supression Techniques for Speech Enhancement Using Adaptive Filtering

Speech Enhancement

Detecting Digital Forgeries using Blind Noise Estimation

Spectral contrast enhancement

Estimation of Voicing-Character of Speech Spectra Based on Spectral Shap

Spectral subtraction algorithm and optimize

Speech Enhancement Using Spectral Subtraction

Wavelet-Based Speech Enhancement

Speech Enhancement

Dealing with Acoustic Noise Part 1: Spectral Estimation

EEG Classification Using Maximum Noise Fractions and spectral classification

Speech Enhancement using Excitation Source Information

Enhancement of Electrolaryngeal Speech by Reducing Leakage Noise Using Spectral Subtraction by

Speech enhancement in nonstationary noise environments using noise properties

Enhancement of Electrolaryngeal Speech by Reducing Leakage Noise Using Spectral Subtraction by

Wearable Speech Enhancement

Implementation of a noise subtraction algorithm using Verilog HDL

Speech Enhancement through Noise Reduction

Speech Enhancement Based on Nonparametric Factor Analysis