New Acoustic-Phonetic Correlates
Sorin Dusan and Larry Rabiner
sdusan@caip.rutgers.edu
Center for Advanced Information Processing
Rutgers University
Piscataway, New Jersey, U.S.A.
ASAT Meeting, Rutgers University, NJ, Oct. 13, 2006
OUTLINE
If more knowledge from speech perception and acoustic-phonetic studies is integrated into ASR, these systems should perform better. Two types of acoustic-phonetic correlates are evaluated in this study, with links to studies of vowel and consonant perception:
• Evaluate the distribution of information (that is, the relevance for vowel classification) of various acoustic patterns and features:
  • Static MFCC features outside the currently accepted vowel (phoneme) boundaries
  • Segmental durational features
  • Dynamic MFCC features with two slopes
• Evaluate the temporal correlation between maximum spectral transition positions and phone boundaries:
  • Compute the spectral transition measure (STM) using static MFCC features and find its peaks for the training part of TIMIT, which contains 172,460 between-phone boundaries.
  • Analyze the deviation between these peaks and the phone boundaries.
METHODS
• Evaluate the vowel information within and outside vowel boundaries, following the listener studies of Strange et al. (1976) and Furui (1986), but using automatic ML classification of 9 vowels coarticulated in three left- and three right-consonant contexts. Evaluate 8 acoustic patterns:
  • Spectral feature vector at the center of the vowel in CV or VC biphones.
  • Spectral feature vector at 20 ms after vowel onset in CV biphones or 20 ms before vowel offset in VC biphones.
  • Spectral feature vector at the CV or VC transition position.
  • A vector containing the overall slope of each spectral feature, computed on a 40 ms interval centered at the CV or VC transition position.
  • A vector containing the slopes of each spectral feature, computed on 20 ms intervals on the left and on the right side of the given CV or VC transition position. This vector can discriminate between monotonic and non-monotonic spectral transitions between phonemes (Dusan, 2005).
  • Spectral feature vector at the center of the preceding consonant in CV biphones or the following consonant in VC biphones.
  • A vector containing the vowel and consonant durations in CV or VC biphones. This vector accounts for both the intrinsic duration of vowels and the effect of coarticulation with consonants on vowel duration.
  • Spectral feature vector at the beginning of the consonant in CV biphones or at the end of the consonant in VC biphones.
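The difference between the overall-slope pattern and the two-slope pattern can be illustrated with a minimal Python sketch. It assumes a 10 ms frame step, so that 20 ms on each side of the transition covers three frames (including the transition frame) and the 40 ms interval covers five. The slides do not say how the slopes are fitted; a least-squares line is assumed here, and the names `slope` and `slope_patterns` are hypothetical.

```python
import numpy as np


def slope(segment: np.ndarray) -> np.ndarray:
    """Least-squares slope per feature (units per frame) of a (frames, features) segment."""
    t = np.arange(segment.shape[0], dtype=float)
    t -= t.mean()
    centered = segment - segment.mean(axis=0)
    return (t[:, None] * centered).sum(axis=0) / (t ** 2).sum()


def slope_patterns(mfcc: np.ndarray, boundary: int, half_width: int = 2):
    """Return (overall_slope, double_slope) around a transition frame.

    half_width=2 frames is 20 ms per side at the assumed 10 ms frame step.
    The caller must ensure that boundary is at least half_width frames away
    from both ends of the utterance.
    """
    left = mfcc[boundary - half_width: boundary + 1]
    right = mfcc[boundary: boundary + half_width + 1]
    whole = mfcc[boundary - half_width: boundary + half_width + 1]
    overall = slope(whole)                                 # one slope per feature
    double = np.concatenate([slope(left), slope(right)])   # left slopes, then right slopes
    return overall, double


# Toy trajectories: feature 0 is V-shaped (non-monotonic), feature 1 is a ramp.
frames = np.arange(9, dtype=float)
toy = np.stack([np.abs(frames - 4.0), frames], axis=1)
overall, double = slope_patterns(toy, boundary=4)
```

For the V-shaped feature the overall slope is zero, while the two-slope vector holds a negative left slope and a positive right slope. This is the sense in which the two-slope pattern can separate non-monotonic from monotonic transitions.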
METHODS
• Investigate the relation between the perceptual critical points (Furui, 1986) for consonant and syllable identification and the phone boundaries by analyzing the temporal correlation between the maximum spectral transition positions and the phone boundaries:
  • Compute the spectral transition measure (STM) from the dynamic (delta) MFCC features. The dynamic features are computed using the first 10 static MFCC features (excluding the energy).
  • Find the peaks of the STM for the training part of TIMIT, which contains 172,460 between-phone boundaries.
  • Compute the deviation between the positions of the peaks and the phone boundaries.
  • Quantify this deviation in bins of 0, 10, 20, 30, and 40 ms.
• If the STM peaks are in close proximity to phone boundaries, then the perceptual critical points are also in close proximity to phone boundaries, which could have implications for ASR.
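The STM steps above can be sketched in Python as follows. The slides do not give the exact STM formula, so this sketch assumes a common formulation: the delta features are regression slopes over a short window, and the STM is the mean of their squares across the 10 coefficients. The window length, the edge padding, the simple peak rule, and the function names are all assumptions made for illustration.

```python
import numpy as np


def delta_features(mfcc: np.ndarray, half_window: int = 2) -> np.ndarray:
    """Regression-slope (delta) coefficients for a (frames, coeffs) array."""
    n_frames = mfcc.shape[0]
    # Repeat the first and last frames so every frame has a full window.
    padded = np.pad(mfcc, ((half_window, half_window), (0, 0)), mode="edge")
    offsets = np.arange(-half_window, half_window + 1)
    denom = float((offsets ** 2).sum())
    delta = np.zeros(mfcc.shape, dtype=float)
    for k in offsets:
        start = half_window + k
        delta += k * padded[start: start + n_frames]
    return delta / denom


def stm(mfcc: np.ndarray, half_window: int = 2) -> np.ndarray:
    """Assumed STM: mean squared delta across coefficients, one value per frame."""
    d = delta_features(mfcc, half_window)
    return (d ** 2).mean(axis=1)


def find_peaks_simple(x: np.ndarray, min_height: float = 0.0) -> np.ndarray:
    """Indices of local maxima of x that exceed min_height."""
    mid = x[1:-1]
    is_peak = (mid > x[:-2]) & (mid >= x[2:]) & (mid > min_height)
    return np.flatnonzero(is_peak) + 1


# Toy input: 10 coefficients that move from 0 to 1 around frame 10.
toy = np.zeros((20, 10))
toy[10] = 0.5
toy[11:] = 1.0
toy_stm = stm(toy)
toy_peaks = find_peaks_simple(toy_stm)
```

On this toy input the STM has a single peak at the frame where the spectrum changes fastest. A real system would likely need a height threshold or smoothing to suppress spurious peaks; those choices are not described in the slides.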
DISTRIBUTION OF INFORMATION
Figure 1. Vowel classification scores in left- and right-consonant contexts for the 8 patterns.
Figure 2. Vowel classification scores for the static MFCC patterns in left- and right-consonant contexts.
Figure 3. Vowel classification scores in left-consonant contexts for combinations of all 8 patterns. Figure annotation: 5.8% (~38% relative error reduction).
Figure 4. Vowel classification scores in right-consonant contexts for combinations of all 8 patterns. Figure annotation: 6.9% (~37% relative error reduction).
STM PEAKS AND PHONE BOUNDARIES
Frame step = 10 ms
Figure 5. Example 1: (a) speech with manual phone boundaries, (b) STM with automatically detected phone boundaries, (c) STM and missed boundaries, (d) STM and inserted boundaries.
Table 1. Results of the automatic phone boundary detection based on the STM function. Approximately 85% of the manually located phone boundaries are automatically detected.
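The notions of detected, missed, and inserted boundaries can be made concrete with a small Python sketch. The slides do not state how STM peaks were paired with manual boundaries, so the greedy nearest-peak rule and the `tolerance_ms` parameter below are assumptions made only for illustration, as is the function name.

```python
def match_boundaries(manual_ms, detected_ms, tolerance_ms=20.0):
    """Greedy one-to-one pairing of detected peaks with manual boundaries.

    Returns (matches, missed, inserted):
      matches  - list of (manual, detected) pairs within tolerance_ms
      missed   - manual boundaries with no detected peak close enough
      inserted - detected peaks left unpaired
    The pairing rule and the tolerance are assumptions, not the authors' method.
    """
    remaining = sorted(detected_ms)
    matches, missed = [], []
    for m in sorted(manual_ms):
        if remaining:
            nearest = min(remaining, key=lambda d: abs(d - m))
            if abs(nearest - m) <= tolerance_ms:
                matches.append((m, nearest))
                remaining.remove(nearest)
                continue
        missed.append(m)
    return matches, missed, remaining


# Toy example with times in milliseconds.
manual = [100, 200, 300]
detected = [110, 250, 305, 400]
matches, missed, inserted = match_boundaries(manual, detected)
detection_rate = len(matches) / len(manual)
```

In the toy example two of the three manual boundaries are detected, one is missed, and two peaks count as insertions. With the counts reported in the slides, 145,950 detected out of 172,460 manual boundaries is about 84.6%, consistent with the "approximately 85%" stated above.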
STM PEAKS AND PHONE BOUNDARIES
Frame step = 10 ms
Figure 6. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries.
Figure 7. Normalized histogram showing the absolute deviation between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located boundaries.
• An analysis of the time difference between the 145,950 automatically detected boundaries and the corresponding 145,950 manually located phone boundaries is shown in Table 2.
Table 2. Percentage of the 145,950 detected boundaries that are within various time spans of the manually located phone boundaries.
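The kind of tabulation behind such a table can be sketched in Python. The sketch assumes the percentages are cumulative (a boundary within 10 ms is also within 20 ms), which matches how the results are phrased in the conclusions. The function name and the toy deviations are hypothetical and are not the study's data.

```python
import numpy as np


def cumulative_within(deviations_ms, spans_ms=(0, 10, 20, 30, 40)):
    """Percentage of absolute deviations that fall within each time span."""
    dev = np.abs(np.asarray(deviations_ms, dtype=float))
    return {span: 100.0 * float(np.mean(dev <= span)) for span in spans_ms}


# Toy signed deviations in ms (detected minus manual); with a 10 ms frame
# step the deviations are multiples of 10 ms.
toy_deviations = [0, -10, 10, 20, -30, 50, 0, 10, 40, 20]
table = cumulative_within(toy_deviations)
```

Because the frame step is 10 ms, the "within 0 ms" row counts peaks that fall in the same frame as the manual boundary.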
CONCLUSIONS
• The new acoustic patterns (two located outside vowel boundaries, the double-slope dynamic pattern at the boundary, and the durational pattern) contain significant vowel information.
• There is more information about vowel identity at the beginning (in CV) or end (in VC) of the adjacent consonants than at the center of these consonants.
• The STM peaks are in close proximity to phone boundaries: 27% within 0 ms, 70% within 10 ms, 89% within 20 ms, 95% within 30 ms, and 97% within 40 ms of the manually located phone boundaries.
• The current study complements Furui’s perceptual study and shows that the phone boundaries are in close proximity to the maximum spectral transition positions and thus to the perceptual critical points.