
Detection of Acoustic Landmarks with High Resolution for Speech Processing A. R. Jayan




  1. Detection of Acoustic Landmarks with High Resolution for Speech Processing A. R. Jayan P. C. Pandey V. K. Pandey {arjayan, pcpandey,vinod}@ee.iitb.ac.in EE Dept, IIT Bombay 3rd February, 2008

  2. PRESENTATION OUTLINE
     1. Introduction
        • Acoustic properties of clear speech
        • Landmark detection
        • Need for high time resolution
     2. Automated landmark detection with high resolution
        • Pass 1
        • Pass 2
     3. Experimental results
     4. Summary and conclusion

  3. 1. INTRODUCTION
     Acoustic properties of clear speech
     • Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
     • Examples: http://www.acoustics.org/press/145th/clr-spch-tab.htm
       [Figure: conversational vs. clear speech examples: ‘the book tells a story’, ‘the boy forgot his book’]
     Intelligibility of clear speech
     ▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
     ▪ More intelligible for different classes of listeners and listening conditions

  4. Acoustic differences between clear and conversational speech
     • Sentence level
       ▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
       ▪ Larger variation in fundamental frequency
       ▪ More pauses and longer pause durations
     • Word level
       ▪ Fewer sound deletions
       ▪ More sound insertions
     • Phonetic level
       ▪ Context-dependent, non-linear increase in segment durations
       ▪ More targeted vowel formants
       ▪ Increase in consonant intensity

  5. Improvement in intelligibility of conversational speech by incorporating properties of clear speech
     • Consonant–vowel intensity ratio (CVR) enhancement
       ▪ Increasing the energy of the consonant segment
     • Consonant duration enhancement
       ▪ Lengthening CV and VC transitions (burst duration, VOT, formant transition)
     • Challenges
       ▪ Accurate detection of regions for modification
       ▪ Analysis-modification-synthesis with low processing artifacts
       ▪ Processing without changing the overall speaking rate: lengthening transition regions with a corresponding decrease in steady-state segments

  6. Intelligibility enhancement using properties of clear speech
     • Hazan & Simpson, 1998
       ▪ manually labeled VCV syllables and sentences
       ▪ intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
       ▪ spectral modification by filtering
     • Colotte & Laprie, 2000
       ▪ automated method for identifying regions, based on mel-cepstral analysis
       ▪ stops and unvoiced fricatives amplified by +4 dB
       ▪ transition segments time-scaled by 1.8, 2.0 (TD-PSOLA)
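The intensity modifications listed above amount to scaling a labeled segment by a fixed gain in decibels. The following is a minimal sketch in Python, assuming the segment boundaries are already known. The helper names `amplify_segment` and `energy_db` and the sample values are hypothetical and do not come from the cited studies.

```python
import math

def amplify_segment(samples, start, end, gain_db):
    """Return a copy of samples with samples[start:end] scaled by gain_db decibels."""
    gain = 10.0 ** (gain_db / 20.0)  # amplitude ratio corresponding to a dB gain
    out = list(samples)
    for i in range(start, end):
        out[i] *= gain
    return out

def energy_db(samples):
    """Total energy of a segment in dB (segment must not be all zeros)."""
    return 10.0 * math.log10(sum(x * x for x in samples))

# Made-up signal: a weak "consonant" (first 4 samples) followed by a strong "vowel".
signal = [0.1, -0.1, 0.1, -0.1, 0.5, -0.5, 0.5, -0.5]
boosted = amplify_segment(signal, 0, 4, 6.0)  # +6 dB on the consonant only
cvr_change_db = energy_db(boosted[:4]) - energy_db(signal[:4])
```

Because only the consonant segment is scaled, the consonant–vowel intensity ratio rises by the applied gain while the vowel is left untouched.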

  7. Landmark detection
     Speech landmarks
     ▪ Regions containing important information for speech perception
     ▪ Associated with spectral transitions
     Landmark types
     1. Abrupt-consonantal (AC): tight constrictions of primary articulators
     2. Abrupt (A): fast glottal or velum activity
     3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
     4. Vocalic (V): vowel landmarks
     Relative occurrence: abrupt (~68%), vocalic (~29%), non-abrupt (~3%)

  8. Landmarks

  9. Liu, 1996
     ▪ Based on energy variation in 6 spectral bands: 0–0.4, 0.8–1.5, 1.2–2.0, 2.0–3.5, 3.5–5.0, 5.0–8 kHz
     ▪ Parameter: first difference of the maximum (log) energy in each spectral band, with time step = 50 ms at the coarse level and 26 ms at the fine level
     ▪ Matching of peaks across bands for locating boundaries
     ▪ Detects glottal landmarks, sonorant closures and releases, and stop closures and releases
     Application: extraction of features for supporting speech recognition

  10. [Figure: detection rate vs. temporal resolution; plotted detection rates 88 %, 83 %, 73 %, 44 %]
      ▪ Uses the same processing for all types of landmarks

  11. Niyogi & Sondhi, 2002
      ▪ for stop consonants
      ▪ total energy and energy above 3 kHz, in log scale
      ▪ measure of spectral flatness
      ▪ non-linear operator optimized for burst detection
      Salomon et al., 2002
      ▪ Hilbert-transform-based envelope to extract temporal parameters
      ▪ spectral information
      ▪ adaptive time steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
      Alani & Deriche, 1999
      ▪ wavelet-transform-based decomposition
      ▪ energy variations in 6 bands

  12. Need for high temporal resolution and detection rate
      Application dependent
      • Speech recognition: analysis is performed around landmarks for parameter extraction
        ▪ high accuracy
        ▪ moderate temporal resolution (20–30 ms)
      • Intelligibility enhancement: landmark regions are modified
        ▪ high temporal resolution (< 5 ms)
        ▪ some tolerance to detection errors, but low tolerance to insertions, as insertions may introduce distortions
      Landmark type
        ▪ Short-duration events (bursts) need high time resolution
        ▪ Voicing onsets/offsets may not require as much resolution, as signal properties remain the same for a long duration

  13. Factors limiting detection rate and temporal resolution
      • Effectiveness of parameters in capturing acoustic variations
        ▪ short-time energy variation in spectral bands: a weak burst may not get detected
        ▪ centroid frequency: not well defined during low-energy segments
        ▪ fixed band boundaries: may not adapt to speech variability
      • Smoothing performed during parameter extraction
        ▪ temporal smoothing of the spectrum affects time resolution
      • Type of distance measure
        ▪ first-difference operation not optimized for all types of landmarks
        ▪ a time step of 10 ms is too large for burst detection
      • Effect of noise on parameters

  14. ▪ Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
      ▪ Separate detectors are required for each phonetic class
      ▪ Each detector must use a method most suited for the phonetic event
      Objective
      Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement

  15. 2. AUTOMATED LANDMARK DETECTION

  16. Landmark detection using spectral peaks and centroids
      Pass 1
      Spectrum divided into five non-overlapping bands
      ▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
      ▪ sampling frequency 10 k samples/s
      ▪ 512-point FFT on 6 ms frames
      ▪ frame shift 1 ms
      Parameters
      ▪ maximum energy in each spectral band, every 1 ms
      ▪ band centroids estimated in each band, every 1 ms
      ▪ features similar to formant peaks and formant frequencies
      ▪ can be estimated easily
      ▪ not much affected by noise
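The Pass 1 parameters can be sketched roughly as follows, using the band edges, sampling rate, frame length, FFT size, and 1 ms update interval given on the slide. The Hamming window, the zero-padding of each 60-sample frame to 512 points, the power-weighted centroid definition, and the function name `band_peaks_and_centroids` are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

FS = 10_000        # sampling frequency, samples/s (from the slide)
FRAME_LEN = 60     # 6 ms frame at 10 kHz
HOP = 10           # one frame every 1 ms
NFFT = 512         # FFT size; frames are zero-padded (assumption)
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_peaks_and_centroids(x):
    """Return (peak energy in dB, centroid in Hz), each of shape (frames, bands)."""
    x = np.asarray(x, dtype=float)
    window = np.hamming(FRAME_LEN)              # window choice is an assumption
    freqs = np.fft.rfftfreq(NFFT, d=1.0 / FS)   # bin frequencies, 0 to FS/2
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    peaks_db = np.zeros((n_frames, len(BANDS_HZ)))
    centroids = np.zeros((n_frames, len(BANDS_HZ)))
    for t in range(n_frames):
        frame = x[t * HOP : t * HOP + FRAME_LEN] * window
        power = np.abs(np.fft.rfft(frame, NFFT)) ** 2
        for b, (lo, hi) in enumerate(BANDS_HZ):
            mask = (freqs >= lo) & (freqs < hi)
            p = power[mask]
            # Maximum spectral energy in the band, on a log scale.
            peaks_db[t, b] = 10.0 * np.log10(p.max() + 1e-12)
            # Power-weighted mean frequency within the band.
            centroids[t, b] = np.sum(freqs[mask] * p) / (np.sum(p) + 1e-12)
    return peaks_db, centroids

# Usage on a synthetic 1 kHz tone, 100 ms long.
tone = np.sin(2 * np.pi * 1000.0 * np.arange(1000) / FS)
peaks, cents = band_peaks_and_centroids(tone)
```

For the tone, the strongest band is the 0.4–1.2 kHz band in every frame and its centroid sits close to 1 kHz, which is the formant-like behaviour the slide describes.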

  17. • Peak energy
      • Centroid frequency
      • Rate-of-rise (ROR) functions
      • Transition index
        ▪ tracks simultaneous variation of energy and centroid
        ▪ centroids given less weighting in low-energy areas

  18. [Figure: peak and centroid contours for /uka/ in the bands 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, and 3.5–5.0 kHz]

  19. [Figure: peak and centroid ROR contours for /uka/ in the bands 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, and 3.5–5.0 kHz; time step = 26 ms]

  20. [Figure: transition index for /uka/, derived from RORs with time step = 26 ms]

  21. [Figure: transition index for /uka/, derived from RORs with time step = 4 ms]
      ▪ Less sensitive to slow transitions

  22. Problems
      Large time step (> 20 ms)
      ▪ detects with less temporal accuracy
      ▪ also detects slowly varying events (higher detection rate)
      Small time step (< 5 ms)
      ▪ detects abrupt transitions with good resolution
      ▪ misses slow transitions
      Pass 2: analyze landmarks detected in Pass 1 with a small time step
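The effect of the time step on temporal accuracy can be illustrated with a rough sketch. The transcript does not preserve the rate-of-rise or transition-index formulas, so everything here is an assumption: the ROR is taken as a symmetric difference over `step` frames, and the index sums the normalized energy ROR with the normalized centroid ROR, the latter scaled by a per-band energy weight so that centroids count less in low-energy frames. The names `rate_of_rise`, `transition_index`, and `width_above_half_max` are hypothetical.

```python
import numpy as np

def rate_of_rise(contour, step):
    """Symmetric difference over `step` frames (step >= 2), edges held constant."""
    contour = np.asarray(contour, dtype=float)
    half = step // 2
    padded = np.pad(contour, ((half, half), (0, 0)), mode="edge")
    # padded[n + 2*half] is contour[n + half]; padded[n] is contour[n - half].
    return padded[2 * half:] - padded[: len(contour)]

def transition_index(peaks_db, centroids_hz, step):
    """Illustrative index: energy ROR plus energy-weighted centroid ROR, summed over bands."""
    peaks_db = np.asarray(peaks_db, dtype=float)
    centroids_hz = np.asarray(centroids_hz, dtype=float)
    ror_e = np.abs(rate_of_rise(peaks_db, step))
    ror_c = np.abs(rate_of_rise(centroids_hz, step))
    ror_e = ror_e / (ror_e.max(axis=0, keepdims=True) + 1e-12)
    ror_c = ror_c / (ror_c.max(axis=0, keepdims=True) + 1e-12)
    # Weight in [0, 1]: low-energy frames contribute little centroid information.
    floor = peaks_db.min(axis=0, keepdims=True)
    span = peaks_db.max(axis=0, keepdims=True) - floor
    weight = (peaks_db - floor) / (span + 1e-12)
    return np.sum(ror_e + weight * ror_c, axis=1)

def width_above_half_max(ti):
    """Number of frames where the index exceeds half of its maximum."""
    return int(np.sum(ti > 0.5 * ti.max()))

# Synthetic contours at 1 frame per ms: an abrupt onset at frame 50 in two bands.
frames, onset = 100, 50
peaks = np.full((frames, 2), -60.0)
peaks[onset:] = -20.0
cents = np.tile([300.0, 800.0], (frames, 1))
cents[onset:, 1] = 1000.0

ti_fine = transition_index(peaks, cents, step=4)     # 4 ms time step
ti_coarse = transition_index(peaks, cents, step=26)  # 26 ms time step
```

With these synthetic contours both indices peak around the onset, but the 26 ms step spreads the peak over many more frames than the 4 ms step, which is the loss of temporal accuracy noted above. The sketch does not model the second half of the trade-off, the small step missing slow transitions.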

  23. Improving temporal resolution: Pass 2
      ▪ 40 ms window centered around burst landmarks detected in Pass 1
      ▪ decomposed to 6 levels by discrete Meyer wavelet
      ▪ detail (high-frequency) contents in the lower two levels used for localizing bursts
      Parameters
      ▪ short-time energy variation
      ▪ zero-crossing rate
      Compute normalized RORs with a time step of 3 ms
      Get a new transition index Tez(n)
      Relocate landmark to the location corresponding to the peak in Tez(n)
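A simplified sketch of the relocation step follows. It keeps the 40 ms window, the short-time energy and zero-crossing-rate parameters, and the 3 ms ROR time step, but it omits the wavelet decomposition and works on the raw signal. The transcript does not preserve the expression for Tez(n), so the sum of the two normalized positive RORs used here is an assumption, as are the 3 ms frame length, the 1 ms hop, and the names `short_time_features`, `normalized_rise`, and `refine_landmark`.

```python
import numpy as np

FS = 10_000  # samples/s, as in Pass 1

def short_time_features(seg, frame, hop):
    """Per-frame log energy (dB) and zero-crossing rate."""
    n_frames = 1 + (len(seg) - frame) // hop
    energy_db = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for t in range(n_frames):
        f = seg[t * hop : t * hop + frame]
        energy_db[t] = 10.0 * np.log10(np.sum(f * f) + 1e-12)
        zcr[t] = np.mean(np.signbit(f[1:]) != np.signbit(f[:-1]))
    return energy_db, zcr

def normalized_rise(contour, k):
    """Positive part of the k-frame first difference, scaled to a maximum of 1."""
    ror = np.zeros(len(contour))
    ror[k:] = contour[k:] - contour[:-k]
    ror = np.maximum(ror, 0.0)   # keep rises only (assumption: burst onsets)
    peak = ror.max()
    return ror / peak if peak > 0 else ror

def refine_landmark(x, coarse_idx, win_ms=40, frame_ms=3, hop_ms=1, step_ms=3):
    """Move a coarse landmark (sample index) to the peak of an assumed Tez(n)."""
    x = np.asarray(x, dtype=float)
    half = (win_ms * FS // 1000) // 2
    start = max(0, coarse_idx - half)
    seg = x[start : coarse_idx + half]
    frame = frame_ms * FS // 1000
    hop = hop_ms * FS // 1000
    energy_db, zcr = short_time_features(seg, frame, hop)
    k = max(1, step_ms // hop_ms)
    tez = normalized_rise(energy_db, k) + normalized_rise(zcr, k)  # assumed form
    return start + int(np.argmax(tez)) * hop

# Synthetic token: silence, then a noise burst starting at sample 1000.
rng = np.random.default_rng(0)
x = np.zeros(2000)
x[1000:] = 0.1 * rng.standard_normal(1000)
refined = refine_landmark(x, coarse_idx=1120)  # coarse estimate is 12 ms late
```

On this synthetic token the refined index lands within a few milliseconds of the true onset. Real closures are not exactly silent, so the behaviour on speech would depend on the wavelet detail signal that this sketch leaves out.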

  24. Relocating stop landmarks

  25. Relocating stop landmarks

  26. Relocating stop landmarks

  27. 3. EXPERIMENTAL RESULTS
      Test material: VCV syllables
      ▪ 2 speakers (1 male, 1 female)
      ▪ 3 stop consonants (/p/, /t/, /k/)
      ▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
      ▪ total 54 tokens
      [Figure: results for Pass 1 and Pass 2]

  28. Test material: TIMIT sentences
      ▪ 5 speakers (2 male, 3 female)
      ▪ 10 sentences per speaker
      ▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
      ▪ total 418 tokens
      [Tables: detection rates; localization error]
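Detection rate at a given temporal resolution and localization error can be computed by matching detected landmarks to manual labels within a tolerance. The slides do not describe the scoring procedure, so the greedy nearest-neighbour matching below is one plausible approach rather than the authors' method. The function name `score_landmarks` and the landmark times are made up for illustration.

```python
def score_landmarks(reference_ms, detected_ms, tolerance_ms):
    """Greedy one-to-one matching of detections to reference landmarks.

    Returns (detection rate, number of insertions, list of localization errors in ms).
    """
    unused = sorted(detected_ms)
    errors = []
    for ref in sorted(reference_ms):
        if not unused:
            break
        nearest = min(unused, key=lambda d: abs(d - ref))
        if abs(nearest - ref) <= tolerance_ms:
            errors.append(abs(nearest - ref))
            unused.remove(nearest)   # each detection may match only one reference
    detection_rate = len(errors) / len(reference_ms)
    insertions = len(unused)         # detections left without a matching reference
    return detection_rate, insertions, errors

# Hypothetical landmark times in milliseconds.
reference = [100, 250, 400, 620]
detected = [103, 262, 398, 500]

rate_5, ins_5, err_5 = score_landmarks(reference, detected, tolerance_ms=5)
rate_20, ins_20, err_20 = score_landmarks(reference, detected, tolerance_ms=20)
```

Tightening the tolerance from 20 ms to 5 ms turns the detection at 262 ms from a hit into an insertion, which is why detection rates are reported per temporal resolution.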

  29. 4. SUMMARY & CONCLUSION
      Pass 2 improves temporal resolution of stop landmarks
      ▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement at 5 ms resolution
      ▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
      Possible reasons
      ▪ reduced closure duration in sentences
      ▪ unreleased bursts
      ▪ errors in Pass 1 may be above 30 ms
      ▪ use of 40 ms window in Pass 2, which may need modification
      ▪ errors in the manual labels
      Future work: evaluation of the method in the presence of noise
