Recognition of Voice Onset Time for Use in Detecting Pronunciation Variation

Recognition of Voice Onset Time for Use in Detecting Pronunciation Variation • Project Description • What is Voice Onset Time (VOT)? • Physical Realization • Linguistic Significance • Motivation for studying VOT • Methodology for automatically analyzing VOT contrasts • Evaluation Method • Results • Discussion

Project Description • Automatically distinguish whether a voiceless stop consonant is pronounced with a native or accented pronunciation, based on voice onset time characteristics. • Use data from the Tball corpus: ESL children doing oral reading tasks. • Evaluate different methods of accomplishing this: • State duration measurements • Explicit modeling of aspiration • Model probability discrimination

What is VOT? • Voice onset time is defined for stops • e.g. /p,b,t,d,k,g/ • It is the interval between the release of closure of an articulator (the transient “burst”) and the start of voicing. • VOT has a continuum of values: • When the start of voicing precedes the release of closure for a stop, the VOT takes on a negative value. • When the release of closure and onset of voicing are coincident, VOT is zero. • When voicing comes after release of closure, VOT is positive.
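The sign convention above can be written as a small function. This is only a minimal sketch: the names `vot_ms` and `vot_category`, and the tolerance `tol_ms` used to treat release and voicing as coincident, are illustrative assumptions and do not come from the slides.

```python
def vot_ms(burst_time_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds: voicing onset time minus burst release time."""
    return (voicing_onset_s - burst_time_s) * 1000.0


def vot_category(vot: float, tol_ms: float = 1.0) -> str:
    """Label a VOT value (in ms) as negative, zero, or positive.

    tol_ms is a hypothetical tolerance: values within +/- tol_ms of zero
    are treated as release and voicing being coincident.
    """
    if vot < -tol_ms:
        return "negative"  # voicing starts before the release of closure
    if vot > tol_ms:
        return "positive"  # voicing starts after the release of closure
    return "zero"          # release and voicing are (nearly) coincident
```

For example, a burst at 0.100 s followed by voicing at 0.145 s gives a VOT of about 45 ms, which falls in the positive category.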

Physical Realization of VOT • Stop consonants are produced with a closure of the vocal tract at a specific point, the place of articulation. • During the closure, there is a buildup of sub-laryngeal pressure. • When the closure is released, there is a transient burst of air, frication due to turbulence at the place of articulation, and aspiration noise from turbulence at the glottis. • Voicing may occur before, during, or after the closure.

Linguistic Significance of VOT • VOT distinguishes consonants with the same place of articulation (/p/ vs. /b/, /t/ vs. /d/, etc.) • However, different languages use different VOT intervals in contrasts (e.g. “taco”, “pasta”). • English voiceless stops: VOT = +40-50 ms • Spanish voiceless stops: VOT = near zero • English voiced stops: VOT = near zero • Spanish voiced stops: negative VOT (voicing before release of closure)

Linguistic Significance Cont'd • In English, voiceless stops have a long VOT at the beginning of a word and before stressed vowels, so aspiration is a perceptual cue to word boundaries and stress. • Since the frication and aspiration during the VOT are due to a buildup of pressure from the lungs, they may correspond with emphasis.

Motivation for Studying VOT • This study was motivated by a desire to determine whether a phone was pronounced with a non-standard pronunciation • Other reasons to study VOT • It is an important contrastive feature • It gives information about stress • It gives information about word segmentation • It may give information about emphasis

Methodology • Baseline: use duration measurements from a forced alignment. • Insert an /h/ symbol in the transcriptions with standard pronunciation, train accordingly, and decode the test files to see if the /h/ phone is recognized. • Cut out the phones of interest from the audio file, train separate models and a combined model, and evaluate the likelihood of the separate models with respect to the combined model.
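The first and third approaches can be sketched in a few lines. This is a hedged illustration, not the implementation used in the study: the alignment format `(phone, start_seconds, end_seconds)`, the function names, the class labels, and the specific way of normalizing by the combined model (subtracting log-likelihoods) are all assumptions, since the slides do not give these details.

```python
from typing import List, Tuple

VOICELESS_STOPS = {"p", "t", "k"}

# Assumed alignment entry format: (phone_label, start_seconds, end_seconds).
Segment = Tuple[str, float, float]


def stop_durations_ms(alignment: List[Segment]) -> List[Tuple[str, float]]:
    """Collect (phone, duration in ms) for every voiceless stop in an alignment."""
    return [
        (phone, (end - start) * 1000.0)
        for phone, start, end in alignment
        if phone in VOICELESS_STOPS
    ]


def classify_by_duration(duration_ms: float, threshold_ms: float) -> str:
    """Baseline: treat a long aligned stop segment as evidence of long VOT."""
    return "long_vot" if duration_ms >= threshold_ms else "short_vot"


def classify_by_likelihood(ll_short: float, ll_long: float, ll_combined: float) -> str:
    """Third approach (assumed form): normalize each class model's
    log-likelihood by the combined model's log-likelihood, then pick
    the class with the larger normalized score."""
    score_short = ll_short - ll_combined
    score_long = ll_long - ll_combined
    return "long_vot" if score_long > score_short else "short_vot"
```

The duration threshold would be tuned per phone, since the three stops differ in their typical VOT.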

Methodology (cont'd) • The data was transcribed by ear with special symbols for non-standard pronunciations. • Because the data for non-standard pronunciations was sparse, the symbol for dental /t/ was included as short VOT. • Standard 3-state HMM models • 4 mixtures, T-state silence model • Different frame rates were tested • Bootstrap and flat start methods were tested

Evaluation Method • The evaluation metric was the error rate for each class, evaluated separately. • This was necessary because there were far fewer instances of the non-standard pronunciations. • When using thresholds, the point of equal error rate for both classes was used. • This was necessary because moving the threshold would tilt the error rate toward one class or the other.
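The equal-error-rate operating point can be found by sweeping candidate thresholds and keeping the one where the two per-class error rates are closest. The sketch below assumes that each token has a scalar score (for instance a duration in ms) and that higher scores indicate long VOT. The function names and the choice to take candidate thresholds from the observed scores are illustrative assumptions.

```python
from typing import List, Tuple


def error_rates(short_scores: List[float], long_scores: List[float],
                threshold: float) -> Tuple[float, float]:
    """Per-class error rates when scores >= threshold are labeled long VOT."""
    short_err = sum(s >= threshold for s in short_scores) / len(short_scores)
    long_err = sum(s < threshold for s in long_scores) / len(long_scores)
    return short_err, long_err


def equal_error_threshold(short_scores: List[float],
                          long_scores: List[float]) -> Tuple[float, float, float]:
    """Return (threshold, short_err, long_err) at the candidate threshold
    where the two class error rates are closest to each other."""
    candidates = sorted(set(short_scores) | set(long_scores))
    best = None
    for th in candidates:
        short_err, long_err = error_rates(short_scores, long_scores, th)
        gap = abs(short_err - long_err)
        if best is None or gap < best[0]:
            best = (gap, th, short_err, long_err)
    _, th, short_err, long_err = best
    return th, short_err, long_err
```

Because each error rate is normalized by its own class size, the small non-standard class counts as much as the large standard class when picking the threshold.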

Results • Baseline method error rates: • p: 55%, t: 23%, k: 29% • p: 19%, t: 20%, k: 48% using duration of 3rd HMM state • With aspiration model: • Short VOT / Long VOT • p: 5% / 36% • t: 11% / 38% • k: 57% / 17% • With probability comparison: • p: 36% / 4% • t: 0% / 5% • k: 0% / 6% • (trained on test data; over-trained?)

Discussion • Studies have noted that VOT is ordered /k/ > /t/ > /p/ • This could explain why the baseline gets poor results for /p/ • and why the aspiration model predicts the short VOT class best for /p,t/ but predicts the long VOT class best for /k/ • Roughly, each method increased in difficulty. • The results improved from the baseline, but the last approach (comparing probabilities) may have been over-trained. • Comparing probabilities may be easier to extend to other pronunciation modeling tasks.

Discussion • Increasing the frame rate didn't help much. • Don't use a 1 ms frame rate unless you want to test your patience. • If an initial consonant has a short VOT, this does not necessarily imply a non-standard accent. • Words like “today” and “together” have stress on the 2nd syllable, so the VOT of the initial consonant is shorter even for standard pronunciation.

Conclusion • When classifying stop consonants based on VOT characteristics, different approaches work better on different stops. • Measuring the duration of the stop state works reasonably well for /t,k/ because they have a longer VOT than /p/. • Detecting insertion of an aspiration model during decoding works well for /p,t/ but not /k/, which has too many false positives. • Comparing phone probabilities worked well except for unaspirated /p/.

Future Work • Since VOT is a timing-related phenomenon, it may help to explicitly model the state duration density in the HMMs. • Other optimization criteria might be better suited than maximum likelihood estimation to train models for this purpose.
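One way to see why explicit duration modeling might help: a standard HMM state with self-transition probability a implies a geometric duration distribution, P(d) = a^(d-1) (1 - a), whose most likely duration is always a single frame. The sketch below just evaluates that implied distribution. The function names are illustrative, and nothing here reflects the specific models trained in this study.

```python
from typing import List


def implicit_duration_pmf(self_loop_prob: float, max_frames: int) -> List[float]:
    """Duration distribution implied by a standard HMM state:
    P(d) = a**(d - 1) * (1 - a) for d = 1..max_frames, where a is the
    self-transition probability."""
    a = self_loop_prob
    return [(a ** (d - 1)) * (1.0 - a) for d in range(1, max_frames + 1)]


def expected_duration_frames(self_loop_prob: float) -> float:
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop_prob)
```

Since this distribution decreases monotonically with duration, it cannot place its peak at, say, the 40-50 ms typical of English voiceless stops, which is the kind of limitation an explicit state duration density would address.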
