Detection of Acoustic Landmarks with High Resolution for Speech Processing
A. R. Jayan, P. C. Pandey, V. K. Pandey
{arjayan, pcpandey, vinod}@ee.iitb.ac.in
EE Dept, IIT Bombay
3rd February, 2008

PRESENTATION OUTLINE
• 1. Introduction
▪ Acoustic properties of clear speech
▪ Landmark detection
▪ Need for high time resolution
• 2. Automated landmark detection with high resolution
▪ Pass 1
▪ Pass 2
• 3. Experimental results
• 4. Summary and conclusion

1. INTRODUCTION
Acoustic properties of clear speech
• Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener, or in noisy environments
• Examples (conversational and clear versions): ‘the book tells a story’, ‘the boy forgot his book’ - http://www.acoustics.org/press/145th/clr-spch-tab.htm
Intelligibility of clear speech
▪ Picheny et al., 1985: ~17% more intelligible than conversational speech
▪ More intelligible for different classes of listeners and listening conditions

Acoustic differences between clear and conversational speech
• Sentence level
▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
▪ Larger variation in fundamental frequency
▪ More pauses, with longer pause durations
• Word level
▪ Fewer sound deletions
▪ More sound insertions
• Phonetic level
▪ Context-dependent, non-linear increase in segment durations
▪ More targeted vowel formants
▪ Increase in consonant intensity

Improvement in intelligibility of conversational speech by incorporating properties of clear speech
• Consonant–vowel intensity ratio (CVR) enhancement
▪ Increasing the energy of the consonant segment
• Consonant duration enhancement
▪ Lengthening CV and VC transitions (burst duration, VOT, formant transition)
• Challenges
▪ Accurate detection of regions for modification
▪ Analysis-modification-synthesis with low processing artifacts
▪ Processing without altering the overall speaking rate: lengthening of transition regions with a corresponding decrease in steady-state segments

Intelligibility enhancement using properties of clear speech
• Hazan & Simpson, 1998
▪ manually labeled VCV syllables and sentences
▪ intensity modification: stop burst +12 dB, frication +6 dB, nasal +6 dB
▪ spectral modification by filtering
• Colotte & Laprie, 2000
▪ automated method for identifying regions based on mel-cepstral analysis
▪ stops and unvoiced fricatives amplified by +4 dB
▪ transition segments time-scaled by factors of 1.8 and 2.0 (TD-PSOLA)
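The intensity modifications listed above amount to applying a gain, specified in dB, to a labeled segment. A minimal sketch of that step follows, assuming the segment boundaries are already known as sample indices; the function name `amplify_segment` and the short linear ramps at the segment edges are illustrative assumptions, not details taken from the cited studies.

```python
import numpy as np

def amplify_segment(signal, start, end, gain_db, ramp=16):
    """Return a copy of `signal` with samples [start, end) boosted by `gain_db`.

    `ramp` (in samples) is a hypothetical smoothing length used to avoid
    clicks at the segment edges; it is an assumption of this sketch.
    """
    gain = 10.0 ** (gain_db / 20.0)          # dB -> linear amplitude factor
    envelope = np.ones(len(signal))
    envelope[start:end] = gain
    r = min(ramp, (end - start) // 2)
    if r > 0:
        up = np.linspace(1.0, gain, r, endpoint=False)
        envelope[start:start + r] = up       # fade the gain in
        envelope[end - r:end] = up[::-1]     # fade the gain out
    return np.asarray(signal, dtype=float) * envelope

if __name__ == "__main__":
    x = np.ones(1000)
    y = amplify_segment(x, 400, 600, gain_db=12.0)   # e.g. a stop burst, +12 dB
    print(20 * np.log10(y[500] / x[500]))            # gain in dB at segment centre
```

A gain of +12 dB corresponds to an amplitude factor of about 4, and +6 dB to about 2, which is why burst amplification of this size is easily audible.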

Landmark detection
Speech landmarks
• Regions containing important information for speech perception
• Associated with spectral transitions
Landmark types
• 1. Abrupt-consonantal (AC): tight constrictions of primary articulators
• 2. Abrupt (A): fast glottal or velum activity
• 3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
• 4. Vocalic (V): vowel landmarks
• Abrupt (~68%), Vocalic (~29%), Non-abrupt (~3%)

Landmarks

Liu, 1996
▪ Based on energy variation in 6 spectral bands: 0–0.4, 0.8–1.5, 1.2–2.0, 2.0–3.5, 3.5–5.0, 5.0–8.0 kHz
▪ Parameter: first difference of the maximum log energy in each spectral band; time-step = 50 ms at the coarse level, 26 ms at the fine level
▪ Matching of peaks across bands for locating boundaries
▪ Detects glottal landmarks, sonorant closures and releases, stop closures and releases
▪ Application: extraction of features for supporting speech recognition

[Table: detection rate vs. temporal resolution; detection rates 88 %, 83 %, 73 %, 44 %]
▪ Uses the same processing for all types of landmarks

Niyogi & Sondhi, 2002
▪ for stop consonants
▪ total energy and energy above 3 kHz, in log scale
▪ measure of spectral flatness
▪ non-linear operator optimized for burst detection
Salomon et al., 2002
▪ Hilbert transform based envelope to extract temporal parameters
▪ spectral information
▪ adaptive time-steps (5 ms for burst onset, 30 ms for frication, 2 × pitch period for periodic regions)
Alani & Deriche, 1999
▪ wavelet transform based decomposition
▪ energy variations in 6 bands

Need for high temporal resolution and detection rate
• Application dependent
▪ Speech recognition: analysis is performed around landmarks for parameter extraction; needs high accuracy and moderate temporal resolution (20-30 ms)
▪ Intelligibility enhancement: landmark regions are modified; needs high temporal resolution (< 5 ms); some tolerance to missed detections, but low tolerance to insertions, as insertions may introduce distortions
• Landmark type
▪ Short-duration events (bursts) need high time resolution
▪ Voicing onsets/offsets may not require this much resolution, as signal properties remain the same for a long duration

Factors limiting detection rate and temporal resolution
• Effectiveness of parameters in capturing acoustic variations
▪ short-time energy variation in spectral bands: a weak burst may not get detected
▪ centroid frequency: not well defined during low-energy segments
▪ fixed band boundaries: may not adapt to speech variability
• Smoothing performed during parameter extraction
▪ temporal smoothing of the spectrum affects time resolution
• Type of distance measure
▪ first difference operation is not optimized for all types of landmarks
▪ a time-step of 10 ms is too large for burst detection
• Effect of noise on parameters

• Acoustic cues for the different phonetic events are distributed non-homogeneously in the time-frequency plane
• Separate detectors are required for each phonetic class
• Each detector must use a method most suited for the phonetic event
Objective
• Automated detection of landmarks for stop consonants with high temporal resolution, for applications in speech intelligibility enhancement

2. AUTOMATED LANDMARK DETECTION

Landmark detection using spectral peaks and centroids
Pass 1
• Spectrum divided into five non-overlapping bands
▪ 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
▪ sampling frequency 10 k samples/s
▪ 512-point FFT on 6 ms frames
▪ frame shift 1 ms
• Parameters
▪ maximum energy in each spectral band, every 1 ms
▪ band centroid estimated in each band, every 1 ms
▪ features similar to formant peaks and formant frequencies
▪ can be estimated easily
▪ not much affected by noise
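The Pass 1 parameter extraction can be sketched as follows. The band edges, sampling rate, FFT size, frame length, and frame shift are the ones stated above; the Hamming window, the dB scaling, the power-weighted centroid definition, and the function name `band_peaks_and_centroids` are assumptions of this sketch, since the slides do not specify them.

```python
import numpy as np

FS = 10_000  # samples/s, as stated on the slide
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_peaks_and_centroids(x, fs=FS, frame_ms=6, shift_ms=1, nfft=512):
    """Per-frame peak energy (dB) and centroid frequency (Hz) in each band."""
    frame_len = int(fs * frame_ms / 1000)      # 60 samples at 10 kHz
    shift = int(fs * shift_ms / 1000)          # 10 samples at 10 kHz
    window = np.hamming(frame_len)             # assumption: window type not given
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    n_frames = 1 + (len(x) - frame_len) // shift
    peaks_db = np.zeros((n_frames, len(BANDS_HZ)))
    centroids = np.zeros((n_frames, len(BANDS_HZ)))
    for i in range(n_frames):
        frame = x[i * shift:i * shift + frame_len] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2   # zero-padded to nfft
        for b, (lo, hi) in enumerate(BANDS_HZ):
            idx = (freqs >= lo) & (freqs < hi)
            p = power[idx]
            peaks_db[i, b] = 10 * np.log10(p.max() + 1e-12)
            # power-weighted mean frequency within the band
            centroids[i, b] = np.sum(freqs[idx] * p) / (p.sum() + 1e-12)
    return peaks_db, centroids

if __name__ == "__main__":
    t = np.arange(int(0.1 * FS)) / FS
    tone = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone
    peaks, cents = band_peaks_and_centroids(tone)
    print(peaks.shape, cents[:, 1].mean())
```

With a 6 ms frame the spectral resolution is coarse, so the centroid acts as a smoothed stand-in for a formant frequency rather than a precise estimate.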

• Peak energy
• Centroid frequency
• Rate-of-rise (ROR) functions
• Transition index
▪ tracks simultaneous variation of energy and centroid
▪ centroids given less weighting in low-energy areas

[Figure: peak and centroid contours for /uka/ in the bands 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, and 3.5-5.0 kHz]

[Figure: peak and centroid ROR contours for /uka/ in the bands 0-0.4, 0.4-1.2, 1.2-2.0, 2.0-3.5, and 3.5-5.0 kHz; time step = 26 ms]

[Figure: transition index for /uka/, derived from RORs with time step = 26 ms]

[Figure: transition index for /uka/, derived from RORs with time step = 4 ms]
▪ Less sensitive to slow transitions

Problems
• Large time step (> 20 ms)
▪ detects with less temporal accuracy
▪ also detects slowly varying events (higher detection rate)
• Small time step (< 5 ms)
▪ detects abrupt transitions with good resolution
▪ misses slow transitions
Pass 2: analyze the landmarks detected in Pass 1 with a small time-step
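The trade-off between the two time steps can be illustrated on a synthetic energy contour. The sketch below assumes the ROR is a symmetric first difference over the chosen time step, sampled every 1 ms; the exact ROR definition used in the slides is not given, and the contour and the 9 dB threshold are made-up illustration values.

```python
import numpy as np

def rate_of_rise(contour, step_ms, frame_shift_ms=1):
    """First difference of `contour` over `step_ms`, centred on each frame.

    Assumption: ror[n] = c[n + k - k//2] - c[n - k//2], with edge padding.
    """
    contour = np.asarray(contour, dtype=float)
    k = max(1, int(round(step_ms / frame_shift_ms)))
    half = k // 2
    padded = np.pad(contour, (half, k - half), mode="edge")
    return padded[k:] - padded[:-k]

def demo_contour():
    """Synthetic energy contour (dB): abrupt 30 dB step, then a slow 30 dB rise."""
    c = np.zeros(500)
    c[100:] = 30.0                                        # abrupt rise at 100 ms
    c[300:360] += np.linspace(0, 30, 60, endpoint=False)  # slow rise, 0.5 dB/ms
    c[360:] += 30.0
    return c

if __name__ == "__main__":
    c = demo_contour()
    threshold = 9.0                                       # hypothetical, in dB
    for step in (26, 4):
        ror = rate_of_rise(c, step)
        print(step, "ms:",
              "frames above threshold near abrupt rise =",
              int((ror[50:150] > threshold).sum()),
              "| slow rise detected =", bool(ror[300:360].max() > threshold))
```

With the large step, both events exceed the threshold but the abrupt rise is smeared over as many frames as the step length; with the small step the abrupt rise is sharply localized and the slow rise falls below the threshold, which is the behavior the two-pass scheme exploits.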

Improving temporal resolution: Pass 2
▪ 40 ms window centered around burst landmarks detected in Pass 1
▪ decomposed to 6 levels by the discrete Meyer wavelet
▪ detail (high-frequency) contents in the lower two levels used for localizing bursts
Parameters
▪ short-time energy variation
▪ zero-crossing rate
• Compute normalized RORs with a time-step of 3 ms
• Get a new transition index Tez(n) from the normalized RORs
• Relocate the landmark to the location corresponding to the peak in Tez(n)
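A rough sketch of the Pass 2 idea follows. It is not the method on the slide: the discrete Meyer wavelet decomposition is replaced here by a plain first-difference high-pass filter to keep the example dependency-free, and the combination of the two normalized RORs into a single index (a product of their positive parts) is an assumption, because the defining equation for Tez(n) is not present in this transcript. The function names and the 3 ms analysis frame are likewise hypothetical.

```python
import numpy as np

def short_time_features(x, fs, frame_ms=3.0, shift_ms=1.0):
    """Short-time log energy (dB) and zero-crossing count, one value per shift."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // shift
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        f = x[i * shift:i * shift + frame_len]
        energy[i] = 10 * np.log10(np.sum(f ** 2) + 1e-10)
        zcr[i] = np.count_nonzero(np.signbit(f[1:]) != np.signbit(f[:-1]))
    return energy, zcr, shift

def normalized_ror(p, k):
    """Positive part of p[n] - p[n-k], scaled so its maximum is 1."""
    ror = np.zeros_like(p)
    ror[k:] = p[k:] - p[:-k]
    ror = np.maximum(ror, 0.0)            # keep rises only
    m = ror.max()
    return ror / m if m > 0 else ror

def refine_landmark(x, fs, coarse_idx, win_ms=40, step_ms=3):
    """Relocate a coarse burst landmark (sample index) within a 40 ms window."""
    half = int(fs * win_ms / 2000)
    start = max(0, coarse_idx - half)
    seg = np.asarray(x[start:coarse_idx + half], dtype=float)
    # Stand-in for the wavelet detail signal: first-difference high-pass.
    hp = np.diff(seg, prepend=seg[0])
    energy, zcr, shift = short_time_features(hp, fs)
    k = step_ms                           # frames, since the frame shift is 1 ms
    t_ez = normalized_ror(energy, k) * normalized_ror(zcr, k)  # assumed form
    return start + int(np.argmax(t_ez)) * shift

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 10_000
    x = np.zeros(3000)
    x[1000:1300] = 0.3 * rng.standard_normal(300)   # synthetic burst at sample 1000
    print(refine_landmark(x, fs, coarse_idx=1100))  # coarse estimate is 10 ms late
```

The returned index is the start of the analysis frame at which the index peaks, so the achievable resolution in this sketch is bounded by the 1 ms frame shift.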

Relocating stop landmarks

3. EXPERIMENTAL RESULTS
Test material: VCV syllables
▪ 2 speakers (1 male, 1 female)
▪ 3 stop consonants (/p/, /t/, /k/)
▪ 3 initial and 3 final vowel contexts (/a/, /i/, /u/)
▪ total 54 tokens
[Figure: results for Pass 1 and Pass 2]

Test material: TIMIT sentences
▪ 5 speakers (2 male, 3 female)
▪ 10 sentences per speaker
▪ closure and burst onsets of /b/, /d/, /g/, /p/, /t/, /k/
▪ total 418 tokens
[Figures: detection rates; localization error]
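Detection rate at a given temporal resolution can be computed by counting the reference landmarks that have a detected landmark within the tolerance. The sketch below assumes a simple nearest-neighbour match, in which one detection may match more than one reference; the matching rule used for the reported results is not stated in the slides, and the landmark times in the example are invented for illustration.

```python
import numpy as np

def detection_rate(reference_ms, detected_ms, tolerance_ms):
    """Percentage of reference landmarks with a detection within the tolerance."""
    reference = np.asarray(reference_ms, dtype=float)
    detected = np.asarray(detected_ms, dtype=float)
    if len(reference) == 0 or len(detected) == 0:
        return 0.0
    hits = 0
    for r in reference:
        nearest_error = np.min(np.abs(detected - r))   # closest detection to r
        if nearest_error <= tolerance_ms:
            hits += 1
    return 100.0 * hits / len(reference)

if __name__ == "__main__":
    # Hypothetical landmark times in ms (not data from the presentation).
    reference = [100, 250, 400, 620]
    detected = [103, 262, 401, 650]
    for tol in (5, 10, 20, 30):
        print(f"within {tol:2d} ms: {detection_rate(reference, detected, tol):.0f} %")
```

Reporting the rate at several tolerances, as here, separates whether a landmark was found at all from how precisely it was localized.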

4. SUMMARY & CONCLUSION
• Pass 2 improves temporal resolution of stop landmarks
▪ Significant improvement in stop burst localization in VCV syllables: 30% improvement at 5 ms resolution
▪ Marginal improvement in sentences: 4% improvement for stop landmarks at 10 ms resolution
• Possible reasons for the marginal improvement in sentences
▪ reduced closure duration in sentences
▪ unreleased bursts
▪ errors in Pass 1 may be above 30 ms
▪ the 40 ms window used in Pass 2 may need modification
▪ errors in the manual labels
• Future work: evaluation of the method in the presence of noise
