Detection of Burst Onset Landmarks in Speech Using Rate of Change of Spectral Moments A. R. Jayan P. S. Rajath Bhat P. C. Pandey {arjayan, rajathbhat, pcpandey}@ee.iitb.ac.in EE Dept, IIT Bombay 30th January, 2011
PRESENTATION OUTLINE
1. Introduction
▪ Speech landmarks
▪ Landmark detection
▪ Clear speech
▪ Automated speech intelligibility enhancement
2. Methodology
▪ Band energy parameters
▪ Spectral moments
▪ Rate of change function
3. Evaluation and results
▪ VCV utterances
▪ Sentences
4. Conclusion
1. INTRODUCTION
Speech landmarks
▪ Regions associated with spectral transitions that contain important information for speech perception
▪ Landmarks and related events [Park, 2008]
Landmark detection
Processing
▪ Extraction of parameters characterizing the landmark
▪ Computation of the rate of change (ROC) of the parameters
▪ Locating the landmark using the ROC(s)
Applications
▪ Intelligibility enhancement
▪ Speech recognition
▪ Vocal tract shape estimation
Clear speech
▪ Speech produced with clear articulation when talking to a hearing-impaired listener or in a noisy environment
▪ More intelligible for
  ▪ Hearing-impaired listeners (~17% higher, Picheny et al., 1985)
  ▪ Listeners in noisy environments (Payton et al., 1994)
  ▪ Non-native listeners (Bradlow and Bent, 2002)
  ▪ Children with learning disabilities (Bradlow et al., 2003)
▪ Pronounced acoustic landmarks
Example: ‘The book tells a story’ (Recordings from http://www.acoustics.org/press/145th/clr-spch-tab.htm)
[Figure: conversational (Conv.) and clear recordings of the sentence]
Automated speech intelligibility enhancement
Automated detection of landmarks
▪ High detection rate with few false detections
▪ Good temporal accuracy (5-10 ms)
▪ Computational efficiency
Modification of speech characteristics
▪ Intensity / duration / spectral modifications around landmarks, with minimal perceptual distortion of the acoustic cues in the speech signal
Problems in stop consonant perception
▪ Transient sound with low intensity
▪ Severely affected by noise / hearing impairment
Stop landmarks
▪ Closure
▪ Burst onset
▪ Onset of voicing
Example: /apa/
Some of the earlier landmark detection techniques
▪ Liu (1996): Rate-of-rise measures of parameters from a set of fixed spectral bands (speech recognition; g, s, b landmarks; 80 TIMIT sentences; detection rate: 84% at 20-30 ms, 50% at 5-10 ms)
▪ Salomon et al. (2002): Temporal parameters related to periodicity, envelope, and spectral fine structure (speech recognition; onsets and offsets of vowels, sonorants, and consonants; 120 TIMIT sentences; detection rate: 90% at 20 ms)
▪ Sainath and Hazan (2006): Sinusoidal model parameters (speech segmentation; 453 TIMIT sentences; word error rates: 20%)
▪ Niyogi & Sondhi (2002): Stop landmark detection using total energy, energy above 3 kHz, and Wiener entropy (speech recognition; stop consonants; 320 TIMIT sentences; detection rate: 90% at 20 ms)
▪ Jayan & Pandey (2009): Stop landmark detection using GMM parameters (speech enhancement; 50 TIMIT sentences; detection rate: 73% at 5 ms)
Improving landmark detection
Parameters
▪ Capturing spectral transitions
▪ Adaptation to speech variability
Rate of change measure
▪ Range of parameter variations
▪ Correlation among parameters
Adaptive time steps
▪ Small time step for abrupt variations
▪ Large time step for slow variations
Objective of the present investigation
▪ Detection of burst landmarks for automated intelligibility enhancement
2. METHODOLOGY
Band energy parameters
▪ Log of spectral peaks in three bands
  ▪ b1: 1.2-2.0 kHz
  ▪ b2: 2.0-3.5 kHz
  ▪ b3: 3.5-5.0 kHz
▪ Magnitude spectrum (10 kHz sampling) computed using a 512-point DFT with a 6 ms Hanning window at 1 frame per ms, and smoothed by a 20-point moving average
▪ Smoothed magnitude spectrum X(n, k) used for calculating the log of the spectral peak in band i (n = time index, k = frequency index)
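The band energy computation described on this slide can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the 20-point moving average is assumed to run along the time index, and the log peak is assumed to be expressed as 20·log10 of the magnitude peak (the slide gives neither detail).

```python
import numpy as np

FS = 10_000               # sampling rate in Hz (from the slide)
NFFT = 512                # DFT size (from the slide)
WIN = 6 * FS // 1000      # 6 ms Hanning window -> 60 samples
HOP = FS // 1000          # 1 frame per ms -> 10 samples
BANDS_HZ = [(1200, 2000), (2000, 3500), (3500, 5000)]  # b1, b2, b3


def smoothed_spectrum(x):
    """Return X(n, k): smoothed magnitude spectrum, one row per 1 ms frame."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(WIN)
    n_frames = 1 + (len(x) - WIN) // HOP
    frames = np.stack([x[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=NFFT, axis=1))
    # Assumption: the 20-point moving average runs along the time index n.
    # mode="same" keeps the frame count only when there are >= 20 frames.
    kernel = np.ones(20) / 20
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, mag)


def band_energies(X):
    """Log of the spectral peak in each band, in dB (assumed 20*log10)."""
    freqs = np.fft.rfftfreq(NFFT, d=1 / FS)
    eps = 1e-12  # avoids log(0) in silent frames
    columns = []
    for lo, hi in BANDS_HZ:
        in_band = (freqs >= lo) & (freqs <= hi)
        columns.append(20 * np.log10(X[:, in_band].max(axis=1) + eps))
    return np.stack(columns, axis=1)  # shape: (frames, 3)
```

For a 100 ms test tone at 2.5 kHz, `band_energies(smoothed_spectrum(x))` returns one row per millisecond, and the b2 column (2.0-3.5 kHz) holds the largest value in the middle of the signal.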
Example: Band energy parameters for /aga/
[Figure: (a) speech waveform, (b) band energies; x-axis: time (ms)]
Spectral moments
Computed from the normalized spectrum (n = time index, k = frequency index, N = DFT size)
▪ Centroid: frequency of energy concentration
▪ Variance: spread of energy around the centroid
▪ Skewness: measure of spectral symmetry
▪ Kurtosis: measure of spectral peakiness
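The four moments can be sketched as follows. The slide's equations are not reproduced in this transcript, so the code uses the textbook definitions under stated assumptions: the magnitude spectrum is normalized to sum to one over the non-negative frequency bins, skewness and kurtosis are the standardized third and fourth central moments, and kurtosis is not offset by 3. The function name and argument layout are hypothetical.

```python
import numpy as np


def spectral_moments(X, fs=10_000, nfft=512):
    """Spectral moments per frame.

    X: array of shape (frames, nfft // 2 + 1) holding a magnitude spectrum.
    Returns an array of shape (frames, 4) with columns
    centroid (Hz), variance (Hz^2), skewness, kurtosis.
    """
    X = np.asarray(X, dtype=float)
    freqs = np.fft.rfftfreq(nfft, d=1 / fs)
    eps = 1e-12  # guards against division by zero
    # Assumption: normalize each frame so it behaves like a distribution.
    p = X / (X.sum(axis=1, keepdims=True) + eps)
    centroid = (p * freqs).sum(axis=1)
    dev = freqs[None, :] - centroid[:, None]
    variance = (p * dev**2).sum(axis=1)
    skewness = (p * dev**3).sum(axis=1) / (variance**1.5 + eps)
    kurtosis = (p * dev**4).sum(axis=1) / (variance**2 + eps)
    return np.stack([centroid, variance, skewness, kurtosis], axis=1)
```

A flat spectrum is a quick sanity check: its centroid sits at half the Nyquist range (2.5 kHz at 10 kHz sampling) and its skewness is zero, since the energy is symmetric around the centroid.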
Example: Band energy parameters & spectral moments for /aga/
[Figure: (a) waveform, with parameter tracks in panels (b), (c), (d); x-axis: time (ms)]
Measures of rate of change
▪ First-difference-based rate of change (ROC)
  K = time step
▪ Mahalanobis-distance-based rate of change (ROC-MD)
  A single measure of the overall variation that accounts for parameter range and correlation effects
  y(n) = parameter set at time n
  K = time step
  Σ = covariance matrix, pre-calculated using the parameter set from segments with energy above a threshold
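A minimal sketch of the two measures follows. The exact equations are missing from this transcript, so the code assumes a backward difference y(n) - y(n-K) and the standard Mahalanobis form sqrt(dᵀ Σ⁻¹ d); the authors may have used a centered difference or a squared distance instead. All function names are hypothetical.

```python
import numpy as np


def roc(y, K):
    """First-difference ROC of one parameter track: y(n) - y(n-K).

    Assumption: backward difference; the first K frames are set to zero.
    """
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    out[K:] = y[K:] - y[:-K]
    return out


def covariance_from_active_frames(Y, energy, threshold):
    """Covariance of the parameter set over frames with energy above a threshold."""
    Y = np.asarray(Y, dtype=float)
    active = np.asarray(energy) > threshold
    return np.atleast_2d(np.cov(Y[active], rowvar=False))


def roc_md(Y, K, cov):
    """Mahalanobis-distance ROC between parameter vectors K frames apart.

    Y: array of shape (frames, n_params); cov: (n_params, n_params).
    """
    Y = np.asarray(Y, dtype=float)
    d = np.zeros_like(Y)
    d[K:] = Y[K:] - Y[:-K]
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular cov
    sq = np.einsum("ni,ij,nj->n", d, cov_inv, d)
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative rounding errors
```

Weighting the difference vector by the inverse covariance is what lets parameters with very different ranges (band energies in dB, centroid in Hz) contribute to a single measure without one dominating, and it discounts changes that are merely correlated copies of each other.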
Detection of voicing offset and onset
▪ Band energy in 0-400 Hz
▪ ROC(n) computed with a time step of 50 ms
▪ Voicing offset [g-]: ROC(n) < -12 dB
▪ Voicing onset [g+]: ROC(n) > +12 dB
Burst onset landmark detection
▪ Most prominent peak in ROC-MD(n) between g- and g+
Example: /aga/
[Figure: (a) waveform, (b) ROC-MD, (c) ROC; x-axis: time (ms)]
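The localization step can be sketched as below for a single vowel-consonant-vowel token. The sketch makes several simplifying assumptions: the threshold crossings are read as "first frame below -12 dB" and "first later frame above +12 dB", the "most prominent peak" is approximated by the global maximum of ROC-MD inside the [g-, g+] interval, and frames are spaced 1 ms apart so that indices double as milliseconds. The function name is hypothetical.

```python
import numpy as np


def locate_burst(roc_low_db, roc_md, thresh_db=12.0):
    """Locate (g_minus, burst, g_plus) frame indices for one VCV token.

    roc_low_db: ROC of the 0-400 Hz band energy (dB), 50 ms time step.
    roc_md:     ROC-MD track on the same frame grid.
    Returns None when no voicing offset/onset pair is found.
    """
    roc_low_db = np.asarray(roc_low_db, dtype=float)
    roc_md = np.asarray(roc_md, dtype=float)

    # Voicing offset g-: first frame where low-band energy falls sharply.
    offsets = np.flatnonzero(roc_low_db < -thresh_db)
    if offsets.size == 0:
        return None
    g_minus = int(offsets[0])

    # Voicing onset g+: first later frame where it rises sharply.
    onsets = np.flatnonzero(roc_low_db[g_minus:] > thresh_db)
    if onsets.size == 0:
        return None
    g_plus = g_minus + int(onsets[0])

    # Assumption: the largest ROC-MD value stands in for the most prominent peak.
    burst = g_minus + int(np.argmax(roc_md[g_minus:g_plus + 1]))
    return g_minus, burst, g_plus
```

Restricting the search to the closure interval between g- and g+ means that large spectral changes elsewhere in the utterance, such as vowel transitions, cannot be mistaken for the burst.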
3. EVALUATION & RESULTS
Effects of rate of change functions and parameters on burst detection
ROC and parameters
1) ROC(BE): Sum of normalized ROCs of [Eb1, Eb2, Eb3]
2) ROC-MD(BE): ROC-MD of [Eb1, Eb2, Eb3]
3) ROC-MD(SM): ROC-MD of [Fc, Fσ, Fk, Fs]
4) ROC-MD(BE,SM): ROC-MD of [Eb1, Eb2, Eb3, Fc, Fσ, Fk, Fs]
Material: VCV utterances, TIMIT sentences
Time steps: 3, 6 ms
Temporal accuracies: 3, 5, 10, 15, 20 ms
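Detection rates at the listed temporal accuracies could be tabulated along the following lines. The scoring rule is an assumption, since the slides do not describe it: each reference burst counts as detected at a given tolerance if the nearest detected landmark lies within that many milliseconds. The function name and the sample values in the usage note are hypothetical.

```python
import numpy as np


def detection_rates(detected_ms, reference_ms, tolerances_ms=(3, 5, 10, 15, 20)):
    """Percentage of reference bursts matched within each temporal tolerance."""
    detected = np.asarray(detected_ms, dtype=float)
    reference = np.asarray(reference_ms, dtype=float)
    if detected.size == 0:
        errors = np.full(reference.shape, np.inf)
    else:
        # Distance from each reference burst to its nearest detection.
        errors = np.abs(reference[:, None] - detected[None, :]).min(axis=1)
    return {tol: 100.0 * float(np.mean(errors <= tol)) for tol in tolerances_ms}
```

With reference bursts at 100, 200, 300, and 400 ms and detections at 101, 204, 312, and 450 ms, the errors are 1, 4, 12, and 50 ms, so the rates are 25% at 3 ms, 50% at 5 and 10 ms, and 75% at 15 and 20 ms.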
VCV utterances
▪ 6 stop consonants (b, d, g, p, t, k)
▪ 3 vowel contexts (a, i, u)
▪ 10 speakers (5 M, 5 F)
▪ 180 tokens
TIMIT sentences
▪ 5 speakers (2 M, 3 F)
▪ 10 sentences from each speaker
▪ 238 tokens
4. CONCLUSION
▪ Increasing the time step reduced detection accuracy.
▪ The Mahalanobis-distance-based ROC was more effective than the first-difference-based ROC.
▪ Spectral moments were useful as additional parameters for improving burst-onset detection.