Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners [Ref: S. K. Waddi, P. C. Pandey, N.Tiwari, Proc. NCC 2013, Delhi, 15-17 Feb. 2013, Paper 3.2_2_1569696063]
Overview Introduction Signal Processing for Spectral Subtraction Implementation for Real-time Processing Test Results Summary & Conclusion
1. Introduction • Sensorineural hearing loss • Increased hearing thresholds and high-frequency loss • Decreased dynamic range & abnormal loudness growth • Reduced speech perception due to increased spectral & temporal masking • → Decreased speech intelligibility in noisy environments • Signal processing in hearing aids • Frequency-selective amplification • Automatic volume control • Multichannel dynamic range compression (settable attack time, release time, and compression ratios) • Processing for reducing the effect of increased spectral masking in sensorineural loss • Binaural dichotic presentation (Lunner et al. 1993, Kulkarni et al. 2012) • Spectral contrast enhancement (Yang et al. 2003) • Multiband frequency compression (Arai et al. 2004, Kulkarni et al. 2012)
Techniques for reducing the background noise • Directional microphone • Adaptive filtering (a second microphone needed for noise reference) • Single-channel noise suppression using spectral subtraction (Boll 1979, Berouti et al. 1979, Martin 1994, Loizou 2007, Lu & Loizou 2008, Paliwal et al. 2010) • Processing steps • Dynamic estimation of non-stationary noise spectrum: during non-speech segments using voice activity detection, or continuously using statistical techniques • Estimation of noise-free speech spectrum: spectral noise subtraction, or multiplication by a noise suppression function • Speech resynthesis (using enhanced magnitude and noisy phase)
Research objective • Real-time single-input speech enhancement for use in hearing aids and other sensory aids (cochlear prostheses, etc.) for hearing impaired listeners • Main challenges • Noise estimation without voice activity detection, to avoid errors under low SNR & during long speech segments • Low signal delay (algorithmic + computational) for real-time application • Low computational complexity & memory requirement for implementation on a low-power processor • Proposed technique: spectral subtraction using cascaded-median based continuous updating of the noise spectrum (without using voice activity detection) • Real-time implementation: 16-bit fixed-point DSP with on-chip FFT hardware • Evaluation: informal listening, PESQ-MOS
2. Signal Processing for Spectral Subtraction • Dynamic estimation of non-stationary noise spectrum • Estimation of noise-free speech spectrum • Speech resynthesis
Power subtraction
Windowed speech spectrum: Xn(k); estimated noise magnitude spectrum: Dn(k)
Estimated speech spectrum: Yn(k) = [ |Xn(k)|^2 − (Dn(k))^2 ]^(1/2) e^(j∠Xn(k))
Problems: residual noise due to under-subtraction; distortion in the form of musical noise & clipping due to over-subtraction.
Generalized spectral subtraction (Berouti et al. 1979)
|Yn(k)| = β^(1/γ) Dn(k), if |Xn(k)| < (α+β)^(1/γ) Dn(k)
|Yn(k)| = [ |Xn(k)|^γ − α (Dn(k))^γ ]^(1/γ), otherwise
γ = exponent factor (2: power subtraction, 1: magnitude subtraction)
α = over-subtraction factor (for limiting the effect of short-term variations in the noise spectrum)
β = floor factor to mask the musical noise due to over-subtraction
Re-synthesis with noisy phase, without explicit phase calculation: Yn(k) = |Yn(k)| Xn(k) / |Xn(k)|
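The generalized subtraction rule and the noisy-phase resynthesis above can be sketched per frame as follows (a minimal NumPy sketch; the function name and default parameter values are illustrative, not taken from the paper):

```python
import numpy as np

def spectral_subtract(X, D, alpha=2.0, beta=0.001, gamma=1.0):
    """Generalized spectral subtraction (Berouti et al. 1979) on one frame.

    X: complex spectrum of the noisy windowed frame, Xn(k)
    D: estimated noise magnitude spectrum, Dn(k)
    Returns the enhanced complex spectrum Yn(k) carrying the noisy phase.
    """
    mag = np.abs(X)
    # Subtract alpha * D^gamma in the gamma-power domain, clamped at zero
    residual = np.maximum(mag**gamma - alpha * D**gamma, 0.0)
    # Apply the spectral floor beta^(1/gamma) * D where over-subtraction occurs
    Y_mag = np.where(mag < (alpha + beta)**(1.0 / gamma) * D,
                     beta**(1.0 / gamma) * D,
                     residual**(1.0 / gamma))
    # Re-synthesis with noisy phase, without explicit phase calculation
    return Y_mag * X / np.maximum(mag, 1e-12)
```

With gamma = 2 this reduces to power subtraction, with gamma = 1 to magnitude subtraction, matching the exponent-factor interpretation above.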
Dynamic estimation of noise magnitude spectrum • Minimum-statistics based estimation (Martin 1994): SNR-dependent over-subtraction factor. • Median based estimation (Stahl et al. 2000): large computation & memory. • Pseudo-median based estimation (Basha & Pandey 2012): moving median approximated by a p-point q-stage cascaded-median, with a saving in memory & computation for real-time implementation.
Median | Storage per freq. bin | Sortings per frame per freq. bin
M-point | 2M | (M−1)/2
p-point q-stage (M = p^q) | pq | p(p−1)/2
Condition for reducing sorting operations and storage: low p, q ≈ ln(M)
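The cascaded-median idea can be sketched per frequency bin as below (class name hypothetical; details may differ from the Basha & Pandey implementation). Each stage buffers p values and, once full, forwards its median to the next stage, so a moving median over M = p^q frames is approximated with pq stored values and at most one p-point sort per stage:

```python
class CascadedMedian:
    """p-point q-stage cascaded median for one frequency bin (sketch).

    Approximates a moving median over M = p**q frames: stage 0 receives one
    magnitude per frame; every p-th input, a stage emits its p-point median
    to the next stage. Storage is p*q values instead of 2*M, and sorting is
    p(p-1)/2 comparisons per active stage instead of (M-1)/2 per frame.
    """
    def __init__(self, p=3, q=4):
        self.p = p
        self.stages = [[] for _ in range(q)]
        self.estimate = 0.0   # last completed output of the final stage

    def update(self, x):
        v = x
        for s in self.stages:
            s.append(v)
            if len(s) < self.p:
                return self.estimate   # stage not yet full; keep old estimate
            v = sorted(s)[self.p // 2]  # p-point median
            s.clear()
        self.estimate = v               # final stage completed a median
        return self.estimate
```

For p = 3, q = 4 (M = 81) this matches the storage reduction from 162 to 12 samples per bin quoted in the next slide.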
Re-synthesis of enhanced signal • Spectral subtraction → enhanced magnitude spectrum • Enhanced magnitude spectrum & original phase spectrum → complex spectrum • Resynthesis using IFFT and overlap-add • Investigations using offline implementation (fs = 12 kHz, frame = 30 ms) • Overlap of 50% & 75%: indistinguishable outputs • FFT length N = 512 & higher: indistinguishable outputs • γ = 1 (magnitude subtraction): higher tolerance to variation in α, β values • Duration needed for dynamic noise estimation ≈ 1 s • p = 3 for simplifying programming and reducing the sorting operations • 3-frame 4-stage cascaded-median (M = 81, p = 3, q = 4), 50% overlap: moving median over 1.215 s • Outputs using true-median & cascaded-median: indistinguishable • Reduction in storage requirement per freq. bin: from 162 to 12 samples • Reduction in number of sorting operations per frame per freq. bin: from 40 to 3
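The offline analysis–resynthesis chain can be sketched as below (NumPy, with the parameters listed above: fs = 12 kHz, 30 ms frames, 50% overlap, N = 512, γ = 1). For brevity, a plain moving median over roughly 1 s of past frame magnitudes stands in for the cascaded-median estimator; the function name and the 67-frame history length are illustrative assumptions:

```python
import numpy as np

def enhance(x, fs=12000, frame_ms=30, N=512, alpha=2.0, beta=0.001):
    """Offline spectral-subtraction loop: window, FFT, magnitude subtraction
    (gamma = 1), and overlap-add resynthesis with the noisy phase."""
    L = int(fs * frame_ms // 1000)   # window length: 360 samples
    S = L // 2                       # 50% overlap -> 180-sample frame shift
    win = np.hanning(L)
    y = np.zeros(len(x) + N)
    mags = []                        # per-frame magnitude spectra
    for start in range(0, len(x) - L + 1, S):
        X = np.fft.rfft(x[start:start + L] * win, N)
        mag = np.abs(X)
        mags.append(mag)
        # Moving median over ~1 s of frames (67 x 15 ms), as a stand-in
        # for the 3-frame 4-stage cascaded-median of the paper
        D = np.median(mags[-67:], axis=0)
        Y_mag = np.maximum(mag - alpha * D, beta * D)   # magnitude subtraction
        Y = Y_mag * X / np.maximum(mag, 1e-12)          # re-use noisy phase
        y[start:start + N] += np.fft.irfft(Y, N)        # overlap-add
    return y[:len(x)]
```

On stationary noise the estimate D tracks the frame magnitudes, so the subtraction drives the output toward the spectral floor.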
3. Implementation for Real-time Processing • 16-bit fixed-point DSP: TI/TMS320C5515 • 16 MB memory space: 320 KB on-chip RAM with 64 KB dual-access RAM, 128 KB on-chip ROM • Three 32-bit programmable timers, 4 DMA controllers, each with 4 channels • FFT hardware accelerator (8- to 1024-point FFT) • Max. clock speed: 120 MHz • DSP board: eZdsp • 4 MB on-board NOR flash for user program • Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization, 8 – 192 kHz sampling • Development environment for C: TI's CCStudio, ver. 4.0
Implementation • One codec channel (ADC and DAC) with 16-bit quantization • Sampling frequency: 12 kHz • Window length of 30 ms (L = 360) with 50% overlap, FFT length N= 512 • Storage of input samples, spectral values, processed samples: 16-bit real & 16-bit imaginary parts
Data transfers and buffering operations (S = L/2) • DMA cyclic buffers: 3-block input buffer, 2-block output buffer (each block with S samples) • Pointers (incremented cyclically on DMA interrupt): current input block, just-filled input block, current output block, write-to output block • Signal delay: algorithmic = 1 frame (30 ms), computational ≤ frame shift (15 ms)
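The pointer bookkeeping can be illustrated with a small sketch (all names hypothetical): four block indices advance cyclically on each DMA interrupt, the input index wrapping over 3 blocks and the output index over 2, with S-sample blocks (S = 180 at 12 kHz, i.e. the 15 ms frame shift):

```python
IN_BLOCKS, OUT_BLOCKS = 3, 2   # cyclic buffer sizes in S-sample blocks

# Initial pointer state: input block 0 being filled, output block 0 playing
state = {"cur_in": 0, "just_filled": 2, "cur_out": 0, "write_out": 1}

def on_dma_interrupt(state):
    """Advance the four block pointers cyclically, once per S-sample block.

    With 50% overlap, the just-filled block and the block before it together
    form the current 2S-sample (L = 360) analysis frame.
    """
    state["just_filled"] = state["cur_in"]             # DMA just completed this block
    state["cur_in"] = (state["cur_in"] + 1) % IN_BLOCKS
    state["cur_out"] = (state["cur_out"] + 1) % OUT_BLOCKS
    state["write_out"] = (state["cur_out"] + 1) % OUT_BLOCKS
    return state
```

The processed frame is written into the write-to output block while the current output block is drained by DMA, which is what bounds the computational delay by one frame shift.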
4. Test Results • Test material • Speech: recording with three isolated vowels, a Hindi sentence, and an English sentence (/-a-/ -/i/- /u/ – “aayiye aapkaa naam kyaa hai?” – “Where were you a year ago?”) from a male speaker. • Noise: white, pink, babble, car, and train noises (AURORA). SNR: ∞, 15, 12, 9, 6, 3, 0, −3, −6 dB. • Evaluation methods • Informal listening • Objective evaluation using PESQ measure (0 – 4.5) • Results: offline processing • Processing parameters: β = 0.001, α: 1.5 – 2.5. • Informal listening: no audible roughness in the enhanced speech; speech clipping at larger α.
PESQ score vs SNR for noisy & enhanced speech. SNR advantage: 13 dB for white noise, 4 dB for babble.
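The SNR advantage is read off as the horizontal gap between the two PESQ-vs-SNR curves at a fixed score. A small sketch of that readout using linear interpolation (the curve values below are illustrative placeholders, not the paper's measured data):

```python
import numpy as np

snr = np.array([-6.0, -3.0, 0.0, 3.0, 6.0, 9.0, 12.0, 15.0])  # input SNR (dB)
# Illustrative monotonic PESQ-vs-SNR curves (placeholder values)
pesq_noisy = np.array([1.2, 1.4, 1.6, 1.9, 2.2, 2.5, 2.8, 3.1])
pesq_enh   = np.array([1.7, 2.0, 2.4, 2.8, 3.1, 3.3, 3.5, 3.7])

def snr_at_score(score, pesq, snr):
    """Invert a monotonically increasing PESQ-vs-SNR curve by interpolation."""
    return np.interp(score, pesq, snr)

# SNR advantage at a reference score: how much lower an input SNR the
# enhanced speech can tolerate while reaching the same PESQ score
advantage = snr_at_score(2.5, pesq_noisy, snr) - snr_at_score(2.5, pesq_enh, snr)
```

With the placeholder curves above, the noisy speech reaches PESQ 2.5 at 9 dB and the enhanced speech at 0.75 dB, giving an advantage of 8.25 dB.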
Processing examples & PESQ scores • Speech material: speech–speech–silence–speech, speech: /-a-/ -/i/- /u/ – “aayiye” – “aapkaa naam kyaa hai?” – “Where were you a year ago?” • Processing parameters: frame length = 30 ms, overlap = 50%, β = 0.001, noise estimation by 3-frame 4-stage cascaded median (estimation duration = 1.215 s)
Example: “Where were you a year ago”, white noise, input SNR = 3 dB. (a) Clean speech (b) Noisy speech (c) Offline processed (d) Real-time processed. More examples: http://www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013
Results: Real-time processing • Real-time processing tested using white, babble, car, pink, train noises: • real-time processed output perceptually similar to the offline processed output • Signal delay = 48 ms • Lowest clock for satisfactory operation = 16.4 MHz → Processing capacity used ≈ 1/7 of the capacity with highest clock (120 MHz)
5. Summary & Conclusions • Proposed technique for suppression of additive noise: cascaded-median based dynamic noise estimation for reducing the computation and memory requirement for real-time operation. • Enhancement of speech with different types of additive stationary and non-stationary noise: SNR advantage (at PESQ score = 2.5): 4 – 13 dB; increase in PESQ score (at SNR = 0 dB): 0.37 – 0.86. • Implementation for real-time operation using 16-bit fixed-point processor TI/TMS320C5515: used-up processing capacity ≈ 1/7, delay = 48 ms. • Further work • Frequency & a posteriori SNR-dependent subtraction & spectral floor factors • Combination of the speech enhancement technique with other processing techniques in the sensory aids • Implementation using other processors
Abstract A spectral subtraction technique is presented for real-time speech enhancement in the aids used by hearing impaired listeners. For reducing computational complexity and memory requirement, it uses a cascaded-median based estimation of the noise spectrum without voice activity detection. The technique is implemented and tested for satisfactory real-time operation, with sampling frequency of 12 kHz, processing using window length of 30 ms with 50% overlap, and noise estimation by 3-frame 4-stage cascaded-median, on a 16-bit fixed-point DSP processor with on-chip FFT hardware. Enhancement of speech with different types of additive stationary and non-stationary noise resulted in SNR advantage of 4 – 13 dB.
References
[1] H. Levitt, J. M. Pickett, and R. A. Houde (eds.), Sensory Aids for the Hearing Impaired. New York: IEEE Press, 1980.
[2] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston, Mass.: Allyn and Bacon, 1999, pp. 289–323.
[3] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001.
[4] T. Lunner, S. Arlinger, and J. Hellgren, “8-channel digital filter bank for hearing aid use: preliminary results in monaural, diotic, and dichotic modes,” Scand. Audiol. Suppl., vol. 38, pp. 75–81, 1993.
[5] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss,” Int. J. Audiol., vol. 51, no. 4, pp. 334–344, 2012.
[6] J. Yang, F. Luo, and A. Nehorai, “Spectral contrast enhancement: algorithms and comparisons,” Speech Commun., vol. 39, no. 1–2, pp. 33–46, 2003.
[7] T. Arai, K. Yasu, and N. Hodoshima, “Effective speech processing for various impaired listeners,” in Proc. 18th Int. Cong. Acoust., Kyoto, Japan, 2004, pp. 1389–1392.
[8] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss,” Speech Commun., vol. 54, no. 3, pp. 341–350, 2012.
[9] P. C. Loizou, “Speech processing in vocoder-centric cochlear implants,” in A. R. Moller (ed.), Cochlear and Brainstem Implants, Adv. Otorhinolaryngol., vol. 64, Basel: Karger, 2006, pp. 109–143.
[10] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC, 2007.
[11] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc. Eur. Signal Process. Conf., 1994, pp. 1182–1185.
[12] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, 2003.
[13] H. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 153–156.
[14] V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” in Proc. IEEE ICASSP, 2000, pp. 1875–1878.
[15] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP, 1979, pp. 208–211.
[16] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[17] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Commun., vol. 50, no. 6, pp. 453–466, 2008.
[18] S. K. Basha and P. C. Pandey, “Real-time enhancement of electrolaryngeal speech by spectral subtraction,” in Proc. Nat. Conf. Commun. 2012, Kharagpur, India, pp. 516–520.
[19] K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Commun., vol. 52, no. 5, pp. 450–475, 2010.
[20] Texas Instruments, Inc., “TMS320C5515 Fixed-Point Digital Signal Processor,” 2011, [online] Available: focus.ti.com/lit/ds/symlink/tms320c5515.pdf.
[21] Spectrum Digital, Inc., “TMS320C5515 eZdsp USB Stick Technical Reference,” 2010, [online] Available: support.spectrumdigital.com/boards/usbstk5515/reva/files/usbstk5515_TechRef_RevA.pdf.
[22] Texas Instruments, Inc., “TLV320AIC3204 Ultra Low Power Stereo Audio Codec,” 2008, [online] Available: focus.ti.com/lit/ds/symlink/tlv320aic3204.pdf.
[23] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001.
[24] S. K. Waddi, “Speech enhancement results,” 2013, [online] Available: www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013.