Explore noise suppression and spectral masking through speech enhancement & multi-band compression in hearing aids, improving speech perception.
Indicon2013, Mumbai, 13-15 Dec. 2013, Paper No. 524 (Track 4.1, Sat., 14th Dec., 1730 – 1900)
Speech Enhancement and Multi-band Frequency Compression for Suppression of Noise and Intraspeech Spectral Masking in Hearing Aids
Nitya Tiwari, Santosh K. Waddi, Prem C. Pandey
{nitya, pcpandey}@ee.iitb.ac.in, santosh4b6@gmail.com
IIT Bombay
Overview • 1. Introduction • 2. Noise Suppression • 3. Multi-band Frequency Compression • 4. Implementation for Real-time Processing • 5. Test Results • 6. Summary & Conclusions
1. Introduction • Sensorineural hearing loss • Increased hearing thresholds and high-frequency loss • Decreased dynamic range & abnormal loudness growth • Increased spectral & temporal masking • → Degraded speech perception, particularly in noisy environments • Signal processing in hearing aids • Frequency-selective amplification • Automatic volume control • Multichannel dynamic range compression (settable attack time, release time, and compression ratios)
Single-input speech enhancement for reducing the background noise (Boll 1979, Berouti et al. 1979, Martin 1994, Loizou 2007, Paliwal et al. 2010) • Dynamic estimation of non-stationary noise spectrum • during non-speech segments using voice activity detection, or • continuously using statistical techniques • Estimation of noise-free speech spectrum • spectral noise subtraction, or • multiplication by noise suppression function • Speech resynthesis • using enhanced magnitude and noisy phase
Multi-band frequency compression for reducing the effect of increased spectral masking (Arai et al. 2004, Kulkarni et al. 2012) • Splitting the short-time spectrum into analysis bands and compressing the spectral samples towards the band center, so that the speech energy is presented in relatively narrow bands to avoid masking by adjacent spectral components. • Segmentation and spectral analysis • Analysis-synthesis: fixed-frame or pitch-synchronous • Analysis bands: constant bandwidth or auditory critical bandwidth • Spectral modification • Modifying magnitude spectrum with original phase (Arai et al. 2004) • Modifying complex spectrum to reduce computation & processing-related artifacts (Kulkarni et al. 2012) • Speech resynthesis using overlap-add method
Research objective • Real-time single-input speech enhancement and multi-band frequency compression for improving speech perception by persons with moderate sensorineural loss. • Main challenges • Noise estimation without voice activity detection • Multi-band frequency compression with low processing artifacts • Low signal delay (algorithmic + computational) for real-time application • Low computational complexity & memory requirement for implementation on a low-power processor
Proposed technique • Spectral subtraction using cascaded-median based continuous updating of the noise spectrum, without using voice activity detection • Multi-band frequency compression based on least square error estimation (LSEE) of modified spectrum • Investigations using offline implementation • Selection of processing parameters • Real-time implementation • 16-bit fixed-point DSP with on-chip FFT hardware • Evaluation of the implementations • Informal listening, PESQ measure
2. Noise Suppression
Power subtraction
• Windowed speech spectrum: $X_n(k)$
• Estimated noise magnitude spectrum: $D_n(k)$
• Estimated speech spectrum: $Y_n(k) = \left[\,|X_n(k)|^2 - (D_n(k))^2\,\right]^{0.5} e^{j\angle X_n(k)}$
• Problems: residual noise due to under-subtraction; distortion in the form of musical noise & clipping due to over-subtraction
Generalized spectral subtraction (Berouti et al. 1979)
$$|Y_n(k)| = \begin{cases} \beta^{1/\gamma} D_n(k), & |X_n(k)| < (\alpha+\beta)^{1/\gamma} D_n(k) \\ \left[\,|X_n(k)|^{\gamma} - \alpha\,(D_n(k))^{\gamma}\,\right]^{1/\gamma}, & \text{otherwise} \end{cases}$$
• γ: exponent factor, α: over-subtraction factor, β: floor factor
Re-synthesis with noisy phase, without explicit phase calculation
$$Y_n(k) = |Y_n(k)|\, X_n(k) / |X_n(k)|$$
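The generalized subtraction rule and the phase-free resynthesis above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' fixed-point code: the function name is hypothetical, γ = 2 is an assumed default (the slides do not state the value used), and the guard against division by zero is an added safety step.

```python
import numpy as np

def generalized_spectral_subtraction(X, D, alpha=1.0, beta=0.01, gamma=2.0):
    """Apply generalized spectral subtraction to one frame.

    X: complex windowed speech spectrum X_n(k)
    D: estimated noise magnitude spectrum D_n(k)
    gamma=2.0 is an assumption; alpha and beta follow the slide's notation.
    """
    mag = np.abs(X)
    floor = beta ** (1.0 / gamma) * D                 # spectral floor
    threshold = (alpha + beta) ** (1.0 / gamma) * D   # switch-over point
    # Clamp at zero so the root is defined; such bins take the floor anyway.
    diff = np.maximum(mag ** gamma - alpha * D ** gamma, 0.0)
    out_mag = np.where(mag < threshold, floor, diff ** (1.0 / gamma))
    # Resynthesis with the noisy phase, without computing the phase:
    # Y = |Y| * X / |X|  (bins with |X| = 0 are left at zero).
    safe_mag = np.where(mag > 0.0, mag, 1.0)
    return out_mag * X / safe_mag

X = np.array([3 + 4j, 0.5 + 0j, 0j])
D = np.ones(3)
Y = generalized_spectral_subtraction(X, D)
```

In this example the first bin is well above the noise estimate and is reduced in magnitude with its phase unchanged, while the second bin falls below the threshold and is replaced by the floor value.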
Dynamic estimation of noise magnitude spectrum • Pseudo-median based estimation (Basha & Pandey 2012) • Moving median approximated by p-point q-stage cascaded-median, with a saving in memory & computation for real-time implementation. • Estimation improved by weighted average of medians from different stages. • Condition for reducing sorting operations and storage: low p, q ≈ ln(M)
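A p-point q-stage cascaded median can be sketched as below. The class and method names are hypothetical, and the weighted averaging of medians from different stages is left out, since the slides do not spell out how the weights map to stages. Each stage keeps only p values per frequency bin, and a stage passes its median to the next stage once it has collected p inputs.

```python
import numpy as np

class CascadedMedian:
    """p-point q-stage cascaded median, applied per frequency bin (sketch)."""

    def __init__(self, n_bins, p=3, q=5):
        self.p, self.q = p, q
        self.buf = np.zeros((q, p, n_bins))    # p*q stored values per bin
        self.fill = [0] * q                    # entries currently held per stage
        self.medians = np.zeros((q, n_bins))   # latest median from each stage

    def update(self, mag):
        """Feed one magnitude spectrum; return the last stage's latest median."""
        value = np.asarray(mag, dtype=float)
        for s in range(self.q):
            self.buf[s, self.fill[s]] = value
            self.fill[s] += 1
            if self.fill[s] < self.p:
                break                          # this stage is not full yet
            value = np.median(self.buf[s], axis=0)
            self.medians[s] = value            # pass the median to the next stage
            self.fill[s] = 0
        return self.medians[-1]

# Small check with p=3, q=2 (covers 3**2 = 9 frames).
est = CascadedMedian(n_bins=2, p=3, q=2)
outputs = [est.update([k, 5.0]).copy() for k in range(1, 10)]
```

With the ramp 1..9 in the first bin, the stage-1 medians are 2, 5, and 8, so the stage-2 output after the ninth frame is 5, which equals the true median of the nine values in this case. Before the ninth frame the last stage has produced no output and the sketch returns zeros.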
Investigations using offline implementation of spectral subtraction • fs: 10 kHz, Frame length: 25.6 ms, Overlap: 75%, FFT size N:512 • Dynamic estimation of noise spectrum: 3-frame 5-stage weighted-average cascaded-median (M=243, p=3, q=5) • Moving median over 1.55 s • Reduction in storage requirement: 486 to 15 samples per freq. bin • Reduction in sorting operations: 121 to 3 per frame per freq. bin • Empirically determined weights for averaging: 0, 0, 0, 0.2, 0.6, 0.2 • Best combination of processing parameters: β = 0.01, α = 1. Speech clipping at larger α.
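Some of the figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the parameters stated above; the variable names are illustrative, and the slide's counts for the plain moving median (486 samples, 121 sorting operations) are taken as given rather than re-derived.

```python
p, q = 3, 5                      # points per stage, number of stages
frame_length_ms = 25.6
overlap = 0.75

frame_shift_ms = frame_length_ms * (1 - overlap)   # 6.4 ms between frames
M = p ** q                                         # frames spanned by the cascade
span_s = M * frame_shift_ms / 1000                 # duration covered, in seconds
cascaded_storage = p * q                           # stored samples per freq. bin

print(M, round(span_s, 2), cascaded_storage)
```

The cascade spans 243 frames, about 1.55 s at a 6.4 ms frame shift, while storing 15 values per frequency bin.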
3. Multi-band Frequency Compression
Kulkarni et al. 2012: compression on the complex spectrum to reduce computation & processing-related artifacts
• Segmentation & spectral analysis • Spectral modification with spectral segment mapping • Re-synthesis using overlap-add
Spectral segment mapping
• Edges of input spectrum segment: a, b; compression factor: c
$$a = k_{ic} - \frac{k_{ic} - (k' - 0.5)}{c}, \qquad b = a + \frac{1}{c}$$
$$Y_c(k') = (m - a)\,Y(m) + \sum_{j=m+1}^{n-1} Y(j) + (b - n)\,Y(n)$$
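The segment edges a and b can be evaluated directly to see how an output sample k' draws on a wider stretch of the input spectrum. In this sketch the function name is hypothetical and k_ic is taken to be the index of the band center, an assumption consistent with the compression being towards the band center. Only the edges are computed; the weighted sum for Y_c(k') is not implemented because the slide does not define m and n.

```python
def segment_edges(k_out, k_centre, c):
    """Return (a, b): input-spectrum edges mapped to output sample k_out.

    k_centre is assumed to be the band-centre index (k_ic on the slide);
    c is the compression factor (0 < c <= 1).
    """
    a = k_centre - (k_centre - (k_out - 0.5)) / c
    b = a + 1.0 / c
    return a, b

c = 0.6
a0, b0 = segment_edges(20, 20, c)   # output sample at the band centre
a1, b1 = segment_edges(21, 20, c)   # next output sample
```

For c = 0.6 each output sample collects an input segment of width 1/c ≈ 1.67 samples. The segment for the band-center sample is symmetric about the center, and consecutive segments are contiguous (b for k' equals a for k' + 1).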
Results from listening tests (Kulkarni et al. 2012) • Processing for maximum improvement in speech perception • Pitch-synchronous analysis-synthesis • Auditory critical band based compression • c = 0.6 • Evaluation using Modified Rhyme Test on 8 hearing-impaired subjects with moderate loss • Increase of 16.5% in recognition score • Decrease of 0.89 s in response time • Problems in implementation for real-time processing • Fixed-frame analysis-synthesis: perceptible distortions • Pitch-synchronous analysis-synthesis: delay and computational complexity incompatible with real-time processing
Proposed solution for integrating multi-band frequency compression and suppression of background noise • Common FFT-based analysis-synthesis platform for computational efficiency • Modified fixed-frame multi-band frequency compression using the Griffin-Lim method of least-square error based signal estimation from modified STFT (Griffin & Lim, 1984) for avoiding processing artifacts • Windowing, multiplication with analysis window, & FFT • Spectral modification • IFFT, multiplication with analysis window, overlap-add • Window requirement: sum of squares of all overlapped window samples should be unity. Modified Hamming window: window length L & shift S = L/4
$$w(n) = \frac{1}{\sqrt{4d^2 + 2e^2}}\left[d + e\cos\!\left(\frac{2\pi (n + 0.5)}{L}\right)\right], \quad d = 0.54,\ e = -0.46$$
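The window requirement can be checked numerically. The sketch below builds the modified Hamming window from the expression above for L = 256 and S = L/4, then sums the squared samples of the four overlapping window positions; the function name is illustrative.

```python
import numpy as np

def modified_hamming(L, d=0.54, e=-0.46):
    """Modified Hamming window scaled for unity sum of squares at S = L/4."""
    n = np.arange(L)
    scale = 1.0 / np.sqrt(4 * d**2 + 2 * e**2)
    return scale * (d + e * np.cos(2 * np.pi * (n + 0.5) / L))

L = 256
S = L // 4
w = modified_hamming(L)

# Row i holds w[i*S : (i+1)*S]; summing the rows adds the squared samples
# of the four windows that overlap at each position within one shift.
overlap_sum = (w ** 2).reshape(4, S).sum(axis=0)
```

Every entry of `overlap_sum` comes out as 1 to within rounding: the cosine terms of the four shifted windows cancel, and the squared-cosine terms sum to 2, which is what the normalizing factor 1/√(4d² + 2e²) compensates for.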
Processing steps • Spectral subtraction → Enhanced magnitude spectrum • Enhanced magnitude spectrum & original phase spectrum → Complex spectrum • Multi-band frequency compression → Compressed complex spectrum • Resynthesis using IFFT and overlap-add • Investigations using offline implementation of modified multi-band frequency compression (fs = 10 kHz, frame length = 25.6 ms, FFT length N = 512) • No perceptible distortions: output of modified fixed-frame processing similar to that from the pitch-synchronous processing used by Kulkarni et al. 2012. • Modified fixed-frame processing also suitable for non-speech audio
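The shared analysis-synthesis framework can be sketched as a loop over frames, with the spectral modification passed in as a function. This is a floating-point sketch under stated assumptions, not the DSP implementation: the function names are hypothetical, and the default modification is the identity so that the reconstruction property can be checked. Spectral subtraction and frequency compression would take the place of `modify`.

```python
import numpy as np

def modified_hamming(L, d=0.54, e=-0.46):
    n = np.arange(L)
    return (d + e * np.cos(2 * np.pi * (n + 0.5) / L)) / np.sqrt(4 * d**2 + 2 * e**2)

def analysis_synthesis(x, L=256, N=512, modify=lambda spec: spec):
    """Window -> FFT -> modify -> IFFT -> window -> overlap-add (shift L/4)."""
    S = L // 4
    w = modified_hamming(L)
    y = np.zeros(len(x))
    for start in range(0, len(x) - L + 1, S):
        spec = np.fft.rfft(w * x[start:start + L], N)   # zero-padded to N points
        frame = np.fft.irfft(modify(spec), N)[:L]       # keep the first L samples
        y[start:start + L] += w * frame                 # second windowing + add
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = analysis_synthesis(x)
```

With the identity modification, every sample covered by four overlapping frames is reconstructed exactly, because the squared window samples sum to one. The first and last 3S samples are covered by fewer frames and are attenuated.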
4. Implementation for Real-time Processing • 16-bit fixed-point DSP: TI/TMS320C5515 • 16 MB memory space: 320 KB on-chip RAM with 64 KB dual-access RAM, 128 KB on-chip ROM • Three 32-bit programmable timers, 4 DMA controllers each with 4 channels • FFT hardware accelerator (8 to 1024-point FFT) • Max. clock speed: 120 MHz • DSP board: eZdsp • 4 MB on-board NOR flash for user program • Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization, 8 – 192 kHz sampling • Development environment for C: TI's 'CCStudio, ver. 4.0'
Implementation • One codec channel (ADC and DAC) with 16-bit quantization • Sampling frequency: 10 kHz • Window length of 25.6 ms (L = 256) with 75% overlap, FFT length N = 512 • Storage of input samples, spectral values, processed samples: 16-bit real & 16-bit imaginary parts
Data transfers and buffering operations (S = L/4) • DMA cyclic buffers (each block with S samples) • 5-block input buffer • 2-block output buffer • Pointers (incremented cyclically on DMA interrupt) • current input block • just-filled input block • current output block • write-to output block • Signal delay • Algorithmic: 1 frame (25.6 ms) • Computational ≤ frame shift (6.4 ms)
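The cyclic input buffering can be simulated in Python to show how a frame of L samples is assembled from blocks of S samples. The class below is a hypothetical sketch, not the DSP code: it assumes that a frame consists of the four most recently filled blocks, with the fifth block free for the DMA to fill next, which the slide does not state explicitly. The delay figures are recomputed from L, S, and the sampling frequency.

```python
import numpy as np

FS = 10_000          # sampling frequency, Hz
L = 256              # window length, samples
S = L // 4           # block length = frame shift, samples
N_IN_BLOCKS = 5      # input buffer size, blocks

class InputRing:
    """Cyclic 5-block input buffer (assumed frame = 4 latest blocks)."""

    def __init__(self):
        self.blocks = np.zeros((N_IN_BLOCKS, S))
        self.write = 0                       # block the DMA fills next

    def dma_block_done(self, samples):
        """Store one block of S samples; return the latest L-sample frame."""
        self.blocks[self.write] = samples
        just_filled = self.write
        self.write = (self.write + 1) % N_IN_BLOCKS   # cyclic increment
        # Oldest to newest: the three preceding blocks, then the new one.
        order = [(just_filled - i) % N_IN_BLOCKS for i in range(3, -1, -1)]
        return self.blocks[order].reshape(-1)

ring = InputRing()
for k in range(7):                           # 7 blocks forces a wrap-around
    frame = ring.dma_block_done(np.full(S, float(k)))

algorithmic_delay_ms = 1000 * L / FS         # one frame
max_computational_delay_ms = 1000 * S / FS   # one frame shift
```

After seven blocks the frame holds blocks 3, 4, 5, and 6 in order, even though the write pointer has wrapped around. The two delay terms evaluate to 25.6 ms and 6.4 ms, matching the slide.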
5. Test Results • Test material • Speech: “Where were you a year ago?” from a male speaker. • Noise: white, pink, babble, car, and train noises (AURORA). SNR: ∞, 15, 12, 9, 6, 3, 0, -3, -6 dB. • Evaluation methods • Informal listening • Objective evaluation using PESQ measure (scale: 0 – 4.5, acceptable: 2.5)
Results from offline processing (fs = 10 kHz, frame length = 25.6 ms, FFT size N = 512, β = 0.01, α = 1, c = 0.6) • Informal listening • No audible roughness or distortion in the enhanced and compressed speech • Spectral subtraction • PESQ improvement: 0.37 – 0.86, for input with 0 dB SNR • Equivalent SNR improvement: 4 – 13 dB for PESQ of 2.5 • Multi-band frequency compression • PESQ of modified fixed-frame processing with pitch-synchronous processing as reference: 3.7
Example of spectral subtraction • Speech: “Where were you a year ago” • Noise: white • Input SNR: 3 dB • [Figure panels: clean speech, noisy speech, output after noise suppression]
Example of multi-band frequency compression • Speech: “Where were you a year ago” • Noise: white • Input SNR: 3 dB • c = 0.6 • [Figure panels: clean speech, compression on clean speech, compression after noise suppression]
Results of real-time processing • Informal listening: real-time output perceptually similar to the offline output • PESQ for real-time w.r.t. offline : 2.5 – 3.4 • Signal delay = 36 ms • Lowest processor clock for satisfactory operation = 39 MHz • → Processing capacity used ≈ 1/3 of the capacity with the highest clock of 120 MHz
6. Summary & Conclusions • Integration of processing techniques to reduce the effects of background noise and increased intraspeech spectral masking associated with sensorineural hearing loss • Cascaded-median weighted-average approximation of moving median for dynamic estimation of noise spectrum for suppression of background noise. • Modified fixed-frame analysis-synthesis for multi-band frequency compression with low computational complexity and without perceptible distortions. • Processing suitable for speech and non-speech audio. • Processing implemented using 16-bit fixed-point DSP chip and tested for satisfactory operation.
Further work • Implementation along with automatic gain control, multi-band amplitude compression, and frequency selective amplification • Listening tests for evaluating the improvement in speech perception
Abstract Sensorineural hearing impairment is associated with increased intraspeech spectral masking, which degrades speech perception, particularly in noisy environments. Speech enhancement using spectral subtraction can be used for suppressing the external noise. Multi-band frequency compression of the complex spectral samples has been reported to reduce the effects of increased intraspeech masking. A combination of these techniques is implemented for real-time processing for improving speech perception by persons with moderate sensorineural loss. To reduce computational complexity and memory requirement, spectral subtraction is carried out using a cascaded-median based estimation of the noise spectrum without voice activity detection. Multi-band frequency compression, based on auditory critical bandwidths, is carried out using fixed-frame processing along with least-squares error based signal estimation to reduce the processing delay. To reduce computational complexity, the two processing stages share the FFT-based analysis-synthesis. The processing is implemented and tested for satisfactory operation, with a sampling frequency of 10 kHz and a 25.6 ms window with 75% overlap, using a 16-bit fixed-point DSP processor. Real-time operation is achieved with a signal delay of approximately 36 ms, using about one-third of the computing capacity of the processor.
References
[1] H. Levitt, J. M. Pickett, and R. A. Houde, Eds., Sensory Aids for the Hearing Impaired. New York: IEEE Press, 1980.
[2] B. C. J. Moore, An Introduction to the Psychology of Hearing. London, UK: Academic, 1997, pp. 66–107.
[3] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston, Mass.: Allyn Bacon, 1999, pp. 289–323.
[4] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001.
[5] T. Baer, B. C. J. Moore, and S. Gatehouse, “Spectral contrast enhancement of speech in noise for listeners with sensorineural hearing impairment: effects on intelligibility, quality, and response times,” Int. J. Rehab. Res., vol. 30, no. 1, pp. 49–72, 1993.
[6] J. Yang, F. Luo, and A. Nehorai, “Spectral contrast enhancement: Algorithms and comparisons,” Speech Commun., vol. 39, no. 1–2, pp. 33–46, 2003.
[7] T. Arai, K. Yasu, and N. Hodoshima, “Effective speech processing for various impaired listeners,” in Proc. 18th Int. Cong. Acoust. (ICA 2004), Kyoto, Japan, 2004, pp. 1389–1392.
[8] K. Yasu, M. Hishitani, T. Arai, and Y. Murahara, “Critical-band based frequency compression for digital hearing aids,” Acoustical Science and Technology, vol. 25, no. 1, pp. 61–63, 2004.
[9] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss,” Speech Commun., vol. 54, no. 3, pp. 341–350, 2012.
[10] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC, 2007.
[11] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc. 7th Eur. Signal Processing Conf. (EUSIPCO'94), Edinburgh, U.K., 1994, pp. 1182–1185.
[12] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466–475, 2003.
[13] H. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE ICASSP 1995, Detroit, MI, pp. 153–156.
[14] V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” in Proc. IEEE ICASSP 2000, Istanbul, Turkey, pp. 1875–1878.
[15] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP 1979, Washington, DC, pp. 208–211.
[16] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, 1979.
[17] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Commun., vol. 50, no. 6, pp. 453–466, 2008.
[18] K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Commun., vol. 52, no. 5, pp. 450–475, 2010.
[19] S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” in Proc. Nat. Conf. Commun. (NCC 2013), Delhi, India, 2013, paper no. 1569696063.
[20] N. Tiwari, P. C. Pandey, and P. N. Kulkarni, “Real-time implementation of multi-band frequency compression for listeners with moderate sensorineural impairment,” in Proc. 13th Annual Conf. of the Int. Speech Commun. Assoc. (Interspeech 2012), Portland, Oregon, 2012, paper no. 689.
[21] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoustics, Speech, Signal Proc., vol. 32, no. 2, pp. 236–243, 1984.
[22] Texas Instruments, Inc. (2011) TMS320C5515 Fixed-Point Digital Signal Processor. [online]. Available: focus.ti.com/lit/ds/symlink/tms320c5515.pdf
[23] Spectrum Digital, Inc. (2010) TMS320C5515 eZdsp USB Stick Technical Reference. [online]. Available: support.spectrumdigital.com/boards/usbstk5515/reva/files/usbstk5515_TechRef_RevA.pdf
[24] Texas Instruments, Inc. (2008) TLV320AIC3204 Ultra Low Power Stereo Audio Codec. [online]. Available: focus.ti.com/lit/ds/symlink/tlv320aic3204.pdf
[25] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001.