An Exemplar-based Approach to Automatic Burst Detection in Voiceless Stops YAO YAO UC BERKELEY YAOYAO@BERKELEY.EDU http://linguistics.berkeley.edu/~yaoyao JULY 25, 2008
Overview • Background • Data • Methodology • Algorithm • Tuning the model • Testing • Results • General Discussion
Background • Purpose of the study • To find the point of burst in a word-initial voiceless stop (i.e. [p], [t], [k]) • [Figure: timeline of a stop showing closure, release, and vowel onset] • Existing approach • Detecting the point of maximal energy change (cf. Niyogi and Ramesh, 1998; Liu, 1996)
Background • Our approach • Compare the spectrogram of the target token at each time point against that of fricatives and silence • Assess how “fricative-like” and “silence-like” the spectrogram is at each time point • Find the point where “fricative-ness” suddenly rises and “silence-ness” suddenly drops → point of burst
Background • Our approach (cont’d) • What do we need? • Spectral features of a given time frame • Spectral templates of fricatives and silence • Specific to speaker and the recording environment • Measure and compare fricative-ness and silence-ness • An algorithm to find the most likely point for release • Advantage • Easy to implement • No worries about change in the environment and individual differences
Data • Buckeye corpus (Pitt, M. et al. 2005) • 40 speakers • All residents of Columbus, Ohio • Balanced in gender and age • One-hour interview • Transcribed at word and phone level • 19 speakers used in the current study • Target tokens • Transcribed word-initial voiceless stops (i.e. [p], [t], [k])
Methodology: spectral measures • Spectral vector • 20ms Hamming window • Mel scale • 1 × 60 array • Spectral template • Speaker-specific, phone-specific • Ignore tokens shorter than average duration of that phone of the speaker • For the remaining tokens • Calculate a spectral vector for the middle 20ms window • Average over the spectral vectors
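The spectral measures above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the slides only specify a 20 ms Hamming window, the Mel scale, and a 1 × 60 vector, so the triangular filterbank, the FFT size, the log-power scaling, and the names `mel_filterbank`, `spectral_vector`, and `spectral_template` are all assumptions. The signal is assumed to be at least one window long.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands, n_fft, sr):
    """Triangular filters equally spaced on the mel scale (assumed design)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_bands + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_bands, freqs.size))
    for i in range(n_bands):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def spectral_vector(signal, sr, center_s, win_s=0.020, n_bands=60, n_fft=1024):
    """60-band mel spectrum (in dB) of a 20 ms Hamming window centered at center_s."""
    n = int(round(win_s * sr))
    start = int(round(center_s * sr)) - n // 2
    start = max(0, min(start, len(signal) - n))  # keep the window inside the signal
    frame = signal[start:start + n] * np.hamming(n)
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # ASSUMPTION: log-power scaling; the slides do not state the amplitude scale.
    return 10.0 * np.log10(mel_filterbank(n_bands, n_fft, sr) @ power + 1e-10)

def spectral_template(signal, sr, tokens):
    """Average mid-token spectral vector for one phone of one speaker.

    tokens: list of (start_s, end_s) pairs; tokens shorter than the mean
    duration are ignored, as described on the slide.
    """
    durations = np.array([end - start for start, end in tokens])
    keep = [t for t, d in zip(tokens, durations) if d >= durations.mean()]
    vectors = [spectral_vector(signal, sr, (s + e) / 2.0) for s, e in keep]
    return np.mean(vectors, axis=0)
```

One template would be built per speaker for each fricative and for silence, so that differences in voice and recording environment are absorbed by the templates themselves.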
Methodology: spectral template [a] of F01 [f] of F01 Silence of F01
Methodology: similarity scores • Similarity between spectral vectors x and u • Distance: $D_{x,u}$ = [formula missing] • Similarity: $S_{x,u} = e^{-0.005\,D_{x,u}}$ • Comparing the given acoustic data against any spectral template of that speaker • Step size = 5 ms
Similarity scores • Formulae: $D_{x,t}$ = [formula missing]; $S_{x,t} = e^{-0.005\,D_{x,t}}$ • Step size = 5 ms • [Figure: [s] score and <sil> score over time]
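The similarity score can be sketched in a few lines. The distance formula did not survive in the slides, so the Euclidean distance used here is only a stand-in assumption; the exponential mapping with rate 0.005 is the one given above. The function names are hypothetical.

```python
import numpy as np

def similarity(x, u, rate=0.005):
    """S = exp(-rate * D) between a spectral vector x and a template u."""
    # ASSUMPTION: Euclidean distance stands in for D, whose formula
    # is not preserved in the slides.
    d = np.linalg.norm(np.asarray(x, float) - np.asarray(u, float))
    return float(np.exp(-rate * d))

def score_track(frames, template):
    """Similarity of each frame (one spectral vector per 5 ms step) to a template."""
    return np.array([similarity(f, template) for f in frames])
```

A score of 1 means the frame is identical to the template, and the score decays toward 0 as the distance grows. Running `score_track` once with a fricative template and once with the silence template gives the two curves that the release-point search operates on.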
Methodology: finding the release point • Basic idea • Near the release point • Fricative similarity score rises • Silence similarity score drops • [Figure: timeline of a stop showing closure, release, and vowel onset] • Q1: Which fricative to use? • Q2: Which period of rise or drop to pick?
Methodology: finding the release point • Slope is a better predictor than absolute score value • The end point of a period with maximal slope → the release point • Which fricative? • [sh] score is more consistent than other fricatives • [Figure: similarity scores for [h], [s], [sh], and <sil>]
Methodology: finding the release point • [Figure: similarity scores for [h], [s], [sh], and <sil> in two tokens: initial [t] in “doing” and initial [k] in “countries”]
Methodology: finding the release point • Original algorithm • Find the end point of the period of fastest increase in the [sh] score • Find the end point of the period of fastest decrease in the <sil> score • Return the midpoint of the two end points as the point of release • If either or both end points cannot be found within the duration of the stop, return NULL.
Methodology: finding the release point • Select two speakers’ data to tune the model • Hand-tag the release point for all tokens in the test set. • If the stop doesn’t appear to have a release point on the spectrogram, mark it as a problematic case and take the end point of the stop as the release point when calculating error.
Methodology: problematic cases • No burst • No closure • Weak and double release (?) • [Figure: [sh] and <sil> score curves for each case]
Methodology: finding the release point • Calculate the difference between the hand-tagged release point and the estimated one (i.e. the error) for each case. • The RMS (root mean square) of the error is used to measure the performance of the algorithm.
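Methodology: error analysis • Error = real release − estimate • F07 (n = 231 tokens): RMS = 7.22 ms; 4.85 ms after adding 5 ms to the estimate • M08 (n = 261 tokens): RMS = 13.11 ms; 14 ms after adding 5 ms to the estimate • [Figure: distribution of real release − estimate for F07 and M08]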
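The error measure is simple enough to state directly. A minimal sketch, with the hypothetical parameter `offset` standing for the constant 5 ms shift added to the estimate (use consistent units for all arguments):

```python
import math

def rms_error(hand_tagged, estimated, offset=0.0):
    """Root mean square of (hand-tagged release - (estimate + offset))."""
    errors = [h - (e + offset) for h, e in zip(hand_tagged, estimated)]
    return math.sqrt(sum(err * err for err in errors) / len(errors))
```

Calling `rms_error(tagged, estimates)` and `rms_error(tagged, estimates, offset=0.005)` on the same tokens (times in seconds) gives the plain RMS and the RMS(+5ms) figures respectively.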
Methodology: tuning the algorithm • 1st Rejection Rule: a target token will be rejected if the changes in scores are not drastic enough. • [Figure: example [sh] and <sil> scores with an insignificant rise; the token is rejected]
Methodology: tuning the algorithm • Applying 1st Rejection Rule • Rejecting 4 cases in F07 • RMS(+5ms) = 4.19 ms • Rejecting 28 cases in M08 • Covering most of the problematic cases • RMS(+5ms) = 9.27 ms • [Figure: error analysis in M08 after 1st rejection rule; RMS(+5ms) 14 ms → 9.27 ms]
Methodology: tuning the algorithm • Still a problem… • Multiple releases • Each might correspond to a rise/drop of the scores • [Figure: [sh] and <sil> scores for initial [k] in “cause” of M08]
Methodology: tuning the algorithm • 2nd Rejection Rule: a target token will be dropped if the points found in the [sh] and <sil> scores are too far apart (>20 ms). • Partly solves the multiple release problem • The ideal way would be to identify all candidate release points and return the first one.
Methodology: tuning the algorithm • Applying 2nd Rejection Rule • Rejecting 3 cases in F07 • RMS(+5ms) = 3.22 ms • Rejecting 20 cases in M08 • Only 2 problematic cases remain • RMS(+5ms) = 3.44 ms • [Figure: error analysis in M08 after 2nd rejection rule; RMS(+5ms) 9.26 ms → 3.44 ms] • Compare: optimal error is 2.5 ms given the 5 ms step size…
Methodology: tuning the algorithm • F07: rejection rate 3.03% • M08: rejection rate 15.05%
Methodology: testing the algorithm • Select a random sample of 50 tokens from all speakers • Hand-tag the release point • Use the current algorithm together with two rejection rules to find the estimated release. • Compare the hand-tagged point and the estimated one • 4 rejected by the 1st rule (3 were legitimate) • 3 rejected by the 2nd rule (2 were legitimate) • 43 accepted cases. RMS(error) <5ms
Methodology: summary • 1. Calculate the <silence> score and the [sh] score • 2. Calculate the slope of the <silence> score and of the [sh] score • 3. In a labeled voiceless stop span, (i) find the time point of largest positive slope in the [sh] score and store it in p1; (ii) find the time point of smallest negative slope in the <silence> score and store it in p2 • 4. If p1 = null or p2 = null, reject the case • 5. If slope(p1) < 0.02 and slope(p2) > 0.04, reject the case • 6. If |p1 – p2| >= 0.02 s, reject the case • 7. Otherwise, return (p1 + p2)/2 + 0.005
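The summary procedure can be sketched as below. Several details are assumptions rather than facts from the slides: the slope is taken as the score difference per 5 ms step, the "end point" of the steepest one-step period is the later of its two samples, and the second threshold of the 1st rejection rule is read as −0.04, since a threshold of +0.04 could never be met by a negative slope. The function name `find_release` is hypothetical.

```python
import numpy as np

def find_release(times, sh_score, sil_score,
                 min_rise=0.02, min_drop=0.04, max_gap_s=0.02, shift_s=0.005):
    """Return the estimated release time in seconds, or None if rejected.

    times, sh_score, sil_score: equal-length sequences sampled every 5 ms
    within one labeled voiceless stop.
    """
    times = np.asarray(times, float)
    # ASSUMPTION: slope = change in score per 5 ms step.
    sh_slope = np.diff(np.asarray(sh_score, float))
    sil_slope = np.diff(np.asarray(sil_score, float))
    if sh_slope.size == 0:
        return None
    i1 = int(np.argmax(sh_slope))   # steepest rise in the [sh] score
    i2 = int(np.argmin(sil_slope))  # steepest drop in the <sil> score
    # p1 or p2 is null: no rise or no drop inside the stop.
    if sh_slope[i1] <= 0 or sil_slope[i2] >= 0:
        return None
    # 1st rejection rule (ASSUMPTION: threshold on the drop read as -0.04).
    if sh_slope[i1] < min_rise and sil_slope[i2] > -min_drop:
        return None
    p1 = times[i1 + 1]  # end point of the steepest rise
    p2 = times[i2 + 1]  # end point of the steepest drop
    # 2nd rejection rule; the small tolerance guards against float rounding.
    if abs(p1 - p2) >= max_gap_s - 1e-9:
        return None
    return (p1 + p2) / 2.0 + shift_s
```

For example, with a [sh] score that jumps between 15 ms and 20 ms and a <sil> score that falls between 20 ms and 25 ms, the function returns 0.0275 s: the midpoint of the two end points plus the 5 ms shift.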
Results: grand means • Rejection rates (2 rules combined) • Vary from 3.03% to 30.5% (mean = 13.3%, sd = 8.6%) across speakers. • VOT and closure duration
General Discussion • Echoing previous findings • Byrd (1993): Closure duration and VOT in read speech • Shattuck-Hufnagel & Veilleux (2007): 13% of missing landmarks in spontaneous speech
General Discussion • Future work • Fine-tune the 2nd rejection rule • Generalize the exemplar-based method to other automatic phonetic processing problems?
Acknowledgement • Anonymous speakers • Buckeye corpus developers • Prof. Keith Johnson • Members of the phonology lab at UC Berkeley • Thank you! Any comments are welcome.
References • Byrd, D. (1993) 54,000 American stops. UCLA Working Papers in Phonetics. No. 83, pp. 97-116. • Johnson, K. (2006) Acoustic attribute scoring: A preliminary report. • Liu, S. (1996) Landmark detection for distinctive feature-based speech recognition. J. Acoust. Soc. Amer. Vol. 100, pp. 3417-3430. • Niyogi, P., Ramesh, P. (1998) Incorporating voice onset time to improve letter recognition accuracies. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '98. Vol. 1, pp. 13-16. • Pitt, M. et al. (2005) The Buckeye Corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication. Vol. 45, pp. 90-95. • Shattuck-Hufnagel, S., Veilleux, N.M. (2007) Robustness of acoustic landmarks in spontaneously-spoken American English. Proceedings of International Congress of Phonetic Science 2007, Saarbrücken, August 2007. • Zue, V.W. (1976) Acoustic characteristics of stop consonants: A controlled study. Sc.D. thesis. MIT, Cambridge, MA.