Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept, IIT Bombay
3rd February, 2008
PRESENTATION OUTLINE
1. Introduction
• Acoustic properties of clear speech
• Landmark detection
• Need for high time resolution
2. Automated landmark detection with high resolution
• Pass 1
• Pass 2
3. Experimental results
4. Summary and conclusion
1. INTRODUCTION
Acoustic properties of clear speech
• Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
• Examples (conversational vs. clear): ‘the book tells a story’, ‘the boy forgot his book’ - http://www.acoustics.org/press/145th/clr-spch-tab.htm
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners & listening conditions
Acoustic differences between clear and conversational speech
• Sentence level
▪ Reduced speaking rate (conv: 200 wpm, clr: 100 wpm)
▪ Larger variation in fundamental frequency
▪ More pauses, longer pause durations
• Word level
▪ Fewer sound deletions
▪ More sound insertions
• Phonetic level
▪ Context-dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity
Improvement in intelligibility of conversational speech by incorporating properties of clear speech
• Consonant–vowel intensity ratio (CVR) enhancement
▪ Increasing energy of consonant segment
• Consonant duration enhancement
▪ Lengthening CV and VC transitions (burst duration, VOT, formant transition)
• Challenges
▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without changing the overall speaking rate: lengthening of transition regions with a corresponding decrease in steady-state segments
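The CVR enhancement idea can be illustrated with a minimal sketch. It assumes the consonant and vowel boundaries are already known and that intensity is measured as RMS level in dB; the function names (`rms_db`, `cvr_db`, `enhance_cvr`) and the toy signal are hypothetical and not taken from the presentation.

```python
import math

def rms_db(samples):
    """Return the RMS level of a segment in dB (relative to amplitude 1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square + 1e-12)

def cvr_db(consonant, vowel):
    """Consonant-vowel intensity ratio in dB."""
    return rms_db(consonant) - rms_db(vowel)

def enhance_cvr(signal, start, end, gain_db):
    """Scale samples in [start, end) by gain_db; the rest is left untouched."""
    gain = 10.0 ** (gain_db / 20.0)
    return [s * gain if start <= i < end else s for i, s in enumerate(signal)]

# Hypothetical toy signal: a weak "consonant" followed by a stronger "vowel".
consonant = [0.05 * math.sin(2 * math.pi * 0.3 * n) for n in range(100)]
vowel = [0.5 * math.sin(2 * math.pi * 0.02 * n) for n in range(400)]
signal = consonant + vowel

before = cvr_db(signal[:100], signal[100:])
boosted = enhance_cvr(signal, 0, 100, gain_db=6.0)
after = cvr_db(boosted[:100], boosted[100:])
print(f"CVR before: {before:.1f} dB, after: {after:.1f} dB")
```

A gain applied only inside the consonant region raises the CVR by the same number of dB, which is why accurate region boundaries matter: a misplaced boundary amplifies part of the vowel or the closure instead.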
Intelligibility enhancement using properties of clear speech
• Hazan & Simpson, 1998
▪ manually labeled VCV and sentences
▪ intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering
• Colotte & Laprie, 2000
▪ automated method for identifying regions based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
Landmark detection
Speech landmarks
▪ Regions containing important information for speech perception
▪ Associated with spectral transitions
• Landmark types
1. Abrupt-consonantal (AC) - Tight constrictions of primary articulators
2. Abrupt (A) - Fast glottal or velum activity
3. Non-abrupt (N) - Semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V) - Vowel landmarks
• Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)
Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0-0.4, 0.8-1.5, 1.2-2.0, 2.0-3.5, 3.5-5.0, 5.0-8 kHz
▪ Parameter: first difference of maximum energy (log) in each spectral band; time-step = 50 ms at coarse level, 26 ms at fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal, sonorant closures and releases, stop closures and releases
Application: Extraction of features for supporting speech recognition
Detection rate vs. temporal resolution: 88 %, 83 %, 73 %, 44 %
▪ Uses same processing for all types of landmarks
Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy & energy above 3 kHz in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection
Salomon et al., 2002
▪ Hilbert transform based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
▪ wavelet transform based decomposition
▪ energy variations in 6 bands
Need for high temporal resolution and detection rate
• Application dependent
▪ Speech recognition: analysis is performed around landmarks for parameter extraction; needs high accuracy, moderate temporal resolution (20-30 ms)
▪ Intelligibility enhancement: landmark regions are modified; needs high temporal resolution (< 5 ms); some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
• Landmark type
▪ Short-duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require this much resolution, as signal properties remain the same for a long duration
Factors limiting detection rate and temporal resolution
• Effectiveness of parameters in capturing acoustic variations
▪ short-time energy variation in spectral bands: weak bursts may not get detected
▪ centroid frequency: not well defined during low-energy segments
▪ fixed band boundaries: may not adapt to speech variability
• Smoothing performed during parameter extraction
▪ temporal smoothing of the spectrum affects time resolution
• Type of distance measure
▪ first difference operation not optimized for all types of landmarks
▪ a time-step of 10 ms is too large for burst detection
• Effect of noise on parameters
• Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
• Separate detectors are required for each phonetic class
• Each detector must use a method most suited for the phonetic event
Objective
• Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement
2. AUTOMATED LANDMARK DETECTION
Landmark detection using spectral peaks and centroids
Pass 1
• Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ Sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ frame shift 1 ms
• Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroids estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
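The Pass 1 parameter extraction can be sketched as follows. The band edges, sampling rate, FFT length, frame length, and frame shift come from the slide; the Hamming window, the power-weighted centroid definition, and all function names are assumptions made for illustration. A direct DFT is used only to keep the sketch free of dependencies; an FFT routine would normally take its place.

```python
import cmath
import math

FS = 10_000          # sampling frequency, samples/s (from the slide)
NFFT = 512           # FFT length (from the slide)
FRAME_LEN = 60       # 6 ms at 10 k samples/s
FRAME_SHIFT = 10     # 1 ms at 10 k samples/s
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def power_spectrum(frame):
    """Power of the zero-padded NFFT-point DFT for bins 0..NFFT/2."""
    # Hamming window: an assumption, the slide does not name a window.
    size = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (size - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(NFFT // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / NFFT)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def band_peak_and_centroid(power):
    """Return [(peak energy in dB, centroid in Hz)], one pair per band."""
    hz_per_bin = FS / NFFT
    result = []
    for lo, hi in BANDS_HZ:
        bins = [k for k in range(len(power)) if lo <= k * hz_per_bin < hi]
        band = [power[k] for k in bins]
        peak_db = 10.0 * math.log10(max(band) + 1e-12)
        # Power-weighted mean frequency (assumed centroid definition).
        total = sum(band) + 1e-12
        centroid = sum(k * hz_per_bin * power[k] for k in bins) / total
        result.append((peak_db, centroid))
    return result

def analyze(signal):
    """Yield per-frame band parameters, one frame every FRAME_SHIFT samples."""
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = signal[start:start + FRAME_LEN]
        yield band_peak_and_centroid(power_spectrum(frame))

# Toy input: 10 ms of a 1.5 kHz tone, which lies in the 1.2-2.0 kHz band.
tone = [math.sin(2 * math.pi * 1500 * n / FS) for n in range(100)]
frames = list(analyze(tone))
print(frames[0][2])  # (peak dB, centroid Hz) of the 1.2-2.0 kHz band
```

With a 6 ms frame the spectral main lobe is several hundred Hz wide, so the centroid is a coarse frequency estimate; the short frame is the price paid for 1 ms time resolution.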
▪ Peak energy
▪ Centroid frequency
▪ Rate-of-rise (ROR) functions
▪ Transition index: tracks simultaneous variation of energy and centroid; centroids given less weighting in low-energy areas
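The exact expressions for the ROR functions and the transition index are not recoverable from the slide text, so the following is only a sketch of one plausible form. It assumes the ROR is a difference across a given time step, and that the transition index sums, over bands, the energy ROR magnified by a centroid ROR term whose weight shrinks toward zero in low-energy frames. The weighting rule, `floor_db`, and the function names are all hypothetical.

```python
def rate_of_rise(track, step):
    """Difference across `step` frames, roughly centred on frame n (edges clamped)."""
    lead = step // 2
    lag = step - lead
    last = len(track) - 1
    return [track[min(n + lead, last)] - track[max(n - lag, 0)]
            for n in range(len(track))]

def transition_index(peaks_db, centroids_hz, band_widths_hz, step, floor_db=-60.0):
    """Combine energy and centroid RORs over bands (assumed form, not from the slides)."""
    n_frames = len(peaks_db[0])
    index = [0.0] * n_frames
    for peak, cent, width in zip(peaks_db, centroids_hz, band_widths_hz):
        ror_e = rate_of_rise(peak, step)
        ror_f = rate_of_rise(cent, step)
        span = max(max(peak) - floor_db, 1e-9)
        for n in range(n_frames):
            # Weight in [0, 1]: close to 0 when the band energy is near floor_db,
            # so an unreliable centroid contributes little.
            weight = min(max((peak[n] - floor_db) / span, 0.0), 1.0)
            index[n] += abs(ror_e[n]) * (1.0 + weight * abs(ror_f[n]) / width)
    return index

# Toy single-band contours at 1 frame/ms: energy and centroid both jump at frame 50.
peaks_db = [[-50.0] * 50 + [-10.0] * 50]
centroids_hz = [[1300.0] * 50 + [1700.0] * 50]
t_index = transition_index(peaks_db, centroids_hz, [800.0], step=4)
print(t_index.index(max(t_index)))
```

In this toy case the index is zero away from the jump and largest on the high-energy side of it, where the centroid term receives full weight.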
Figure: Peak & centroid contours in the five bands (0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, 3.5-5.0 kHz). Example: /uka/
Figure: Peak & centroid ROR contours in the five bands, time step = 26 ms. Example: /uka/
Figure: Transition index derived from RORs with time step = 26 ms. Example: /uka/
Figure: Transition index derived from RORs with time step = 4 ms (less sensitive to slow transitions). Example: /uka/
Problems
• Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)
• Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2: Analyze landmarks detected in Pass 1 with a small time-step
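The trade-off between the two time steps can be reproduced on a synthetic energy contour. The sketch below assumes a difference-based ROR and a hypothetical 10 dB detection threshold; the contour (one abrupt 30 dB rise and one 30 dB rise spread over 40 ms) is invented for illustration.

```python
def rate_of_rise(track, step):
    """Difference across `step` frames, roughly centred on frame n (edges clamped)."""
    lead = step // 2
    lag = step - lead
    last = len(track) - 1
    return [track[min(n + lead, last)] - track[max(n - lag, 0)]
            for n in range(len(track))]

def regions_above(values, threshold):
    """Return (first, last) frame indices of each run where values exceed threshold."""
    runs, start = [], None
    for n, v in enumerate(values):
        if v > threshold and start is None:
            start = n
        elif v <= threshold and start is not None:
            runs.append((start, n - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs

# Energy contour in dB at 1 frame/ms:
# abrupt 30 dB rise at frame 50, slow 30 dB rise over frames 120-160.
contour = ([0.0] * 50 + [30.0] * 70
           + [30.0 + 0.75 * k for k in range(40)] + [60.0] * 40)

THRESHOLD_DB = 10.0  # hypothetical detection threshold
fine = regions_above(rate_of_rise(contour, 4), THRESHOLD_DB)
coarse = regions_above(rate_of_rise(contour, 26), THRESHOLD_DB)
print("4 ms step: ", fine)
print("26 ms step:", coarse)
```

The 4 ms step responds only to the abrupt rise and confines it to a few frames, while the 26 ms step also flags the slow rise but smears the abrupt one over a region about as wide as the step itself.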
Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by discrete Meyer wavelet
▪ detail (high-frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short-time energy variation
▪ zero-crossing rate
• Compute normalized RORs with a time-step of 3 ms
• Get a new transition index Tez(n) from these RORs
• Relocate landmark to the location corresponding to the peak in Tez(n)
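A rough sketch of the relocation step is given below. It deliberately omits the wavelet decomposition and works on the raw windowed signal, so it illustrates only the energy and zero-crossing part of Pass 2. The 40 ms window and 3 ms time step come from the slide; the 2 ms frame length, the 1 ms hop, the plain sum used to form Tez(n), and the function names are assumptions.

```python
import math
import random

FS = 10_000      # samples/s, as in Pass 1
WINDOW = 400     # 40 ms window around the Pass 1 landmark (from the slide)
FRAME_LEN = 20   # 2 ms frames: an assumption, not given on the slide
HOP = 10         # 1 ms hop: an assumption
STEP = 3         # ROR time-step of 3 ms, expressed in frames (from the slide)

def log_energy(frame):
    """Short-time energy of a frame in dB."""
    return 10.0 * math.log10(sum(x * x for x in frame) / len(frame) + 1e-12)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def normalized_ror(track, step):
    """ROR across `step` frames, scaled so the largest magnitude is 1."""
    lead = step // 2
    lag = step - lead
    last = len(track) - 1
    ror = [track[min(n + lead, last)] - track[max(n - lag, 0)]
           for n in range(len(track))]
    scale = max(abs(r) for r in ror) or 1.0
    return [r / scale for r in ror]

def relocate(signal, landmark):
    """Return a refined landmark position (sample index) near a Pass 1 estimate."""
    begin = max(landmark - WINDOW // 2, 0)
    segment = signal[begin:begin + WINDOW]
    starts = range(0, len(segment) - FRAME_LEN + 1, HOP)
    frames = [segment[s:s + FRAME_LEN] for s in starts]
    ror_e = normalized_ror([log_energy(f) for f in frames], STEP)
    ror_z = normalized_ror([zero_crossing_rate(f) for f in frames], STEP)
    tez = [e + z for e, z in zip(ror_e, ror_z)]  # assumed combination: plain sum
    best = tez.index(max(tez))
    return begin + starts[best] + FRAME_LEN // 2

# Toy signal: quiet low-frequency "closure", then a noisy "burst" at sample 430.
rng = random.Random(0)
closure = [0.001 * math.sin(2 * math.pi * 100 * n / FS) for n in range(430)]
burst = [rng.uniform(-0.5, 0.5) for _ in range(370)]
signal = closure + burst

pass1_estimate = 500                 # 7 ms late, as a coarse detector might be
refined = relocate(signal, pass1_estimate)
print(pass1_estimate, "->", refined)
```

In the method described on the slide, the energy and zero-crossing parameters are taken from the wavelet detail signals instead of the raw waveform, which should make the burst onset stand out more clearly against voiced neighbours.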
3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ Total 54 tokens
Figure panels: Pass 1, Pass 2
Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens
Tables: Detection rates, Localization error
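Detection rate at a given temporal resolution, insertions, and localization error can be computed from manual labels and detected landmark times roughly as follows. The greedy nearest-neighbour matching and the function name `evaluate` are assumptions, and the landmark times are made-up values, not results from the presentation.

```python
def evaluate(reference_ms, detected_ms, tolerance_ms):
    """Match each reference landmark to its nearest unused detection within tolerance.

    Returns (detection rate in %, number of insertions, mean localization error in ms).
    """
    unused = sorted(detected_ms)
    errors = []
    for ref in sorted(reference_ms):
        if not unused:
            break
        nearest = min(unused, key=lambda d: abs(d - ref))
        if abs(nearest - ref) <= tolerance_ms:
            errors.append(abs(nearest - ref))
            unused.remove(nearest)
    detection_rate = 100.0 * len(errors) / len(reference_ms)
    insertions = len(unused)  # detections not matched to any reference landmark
    mean_error = sum(errors) / len(errors) if errors else None
    return detection_rate, insertions, mean_error

# Hypothetical landmark times in ms.
reference = [100, 250, 400, 620]
detected = [103, 262, 399, 500, 640]

for tol in (5, 20):
    rate, ins, err = evaluate(reference, detected, tol)
    print(f"tolerance {tol} ms: rate {rate:.0f} %, insertions {ins}, mean error {err} ms")
```

Tightening the tolerance lowers the detection rate and turns poorly localized detections into insertions, which is the detection rate vs. temporal resolution trade-off reported in the tables.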
4. SUMMARY & CONCLUSION
Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement for 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
Possible reasons
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ use of 40 ms window in Pass 2, may need modification
▪ errors in the manual labels
Future work: Evaluation of the method in presence of noise