NGASR 2011 Summer Workshop. Speaker: 林奇嶽

Detection of Burst Onset Using Random Forest Technique and Its Application to Voice Onset Time Estimate

Outline
• Burst Onset Detection
  • Burst onset
  • Feature representation
  • Random forest (RF)
  • Experimental results
• Voice Onset Time Estimate
  • Voice onset time (VOT)
  • Proposed HMM+RF system
  • Experimental results
• Conclusion

Section I Burst Onset Detection

Burst Onset: Fundamental phonetics
• A stop or an affricate consonant consists of the following speech events:
  • Closure: air flow is completely blocked by certain articulators in the vocal tract (voice bar or silence).
  • Release: the blockage is suddenly released, resulting in a puff of air rushing out of the mouth.
  • Aspiration (stop) or frication (affricate)
• The most salient event is the onset of the release, which is commonly termed the burst onset.

Burst Onset: Fundamental phonetics
• The burst onset could be the shortest event in a speech signal.
• A sudden increase of all-band energy appears as a stripe pattern in a Fourier-based spectrogram. This all-band energy dies out immediately.
[Figure: example utterance “don’t carry”]

Burst Onset: Fundamental phonetics
• To detect burst onsets in continuous speech, we focus on a small spectro-temporal patch containing a “closure-burst transition”.
[Figure: example utterance “don’t carry”]

Feature representation: Two-dimensional Cepstral Coefficient
• Two-dimensional cepstral coefficients (TDCC) are used to encode such a “closure-burst transition”.
• In deriving TDCC for each spectro-temporal patch, we perform two discrete cosine transforms (DCTs) to compact the transition information into a small set of coefficients.
  • 1st DCT: cepstral analysis (along the frequency axis)
  • 2nd DCT: dynamic behavior of the coefficients from the first DCT (along the time axis)
• Between the two DCTs is a cepstral mean subtraction (CMS).
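
The DCT, CMS, DCT chain described on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the patch layout (a list of frames, each a list of log-spectral values), the unnormalized DCT-II, and the truncation sizes `n_cep` and `n_dyn` are all assumptions made for the example.

```python
import math

def dct2(x):
    """Unnormalized DCT-II of a sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def tdcc(patch, n_cep=3, n_dyn=2):
    """patch: list of frames, each a list of log-spectral magnitudes.
    n_cep and n_dyn are illustrative truncation sizes, not the talk's values."""
    # 1st DCT along the frequency axis: one cepstral vector per frame.
    cep = [dct2(frame)[:n_cep] for frame in patch]
    # Cepstral mean subtraction: remove each coefficient's mean over the patch.
    n_frames = len(cep)
    means = [sum(c[q] for c in cep) / n_frames for q in range(n_cep)]
    cep = [[c[q] - means[q] for q in range(n_cep)] for c in cep]
    # 2nd DCT along the time axis: dynamics of each cepstral trajectory.
    rows = [dct2([c[q] for c in cep])[:n_dyn] for q in range(n_cep)]
    # Flatten row by row into a single feature vector.
    return [v for row in rows for v in row]
```

In this sketch a patch whose frames are all identical yields an all-zero vector, since CMS removes everything that does not change over time.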

Feature representation: Two-dimensional Cepstral Coefficient
• Similarity of dynamic feature derivation between the conventional regression formula and TDCC.
[Figure: two panels, “Derivative coeff.” and “Accelerative coeff.”, each plotting coefficient value against relative frame distance]

Feature representation: Derive TDCC from a spectro-temporal patch
• Each frame in a patch is an LPC-derived spectrum.
  • Frame length: 10 ms (160 samples)
  • Frame shift: 2 ms (32 samples)
  • LP analysis with an order of 24. The LPC-derived spectrum is obtained with a 512-point DFT.
• Coefficients are extracted in a row-major fashion: 55 coefficients are extracted to form a 55x1 vector.
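
The frame sizes on this slide follow from the 16 kHz sampling rate of the speech material. The short sketch below reproduces the arithmetic and lists the frame start positions for one second of audio; the helper names are hypothetical.

```python
def ms_to_samples(ms, sample_rate_hz=16000):
    """Convert a duration in milliseconds to a sample count."""
    return int(round(ms * sample_rate_hz / 1000))

def frame_starts(n_samples, frame_len, frame_shift):
    """Start index of every full frame that fits in the signal."""
    return list(range(0, n_samples - frame_len + 1, frame_shift))

frame_len = ms_to_samples(10)    # 10 ms frame length
frame_shift = ms_to_samples(2)   # 2 ms frame shift
starts = frame_starts(16000, frame_len, frame_shift)  # one second of audio
```

With a 2 ms shift, consecutive frames overlap by 8 ms, which is what gives the detector its fine time resolution.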

Feature representation: Waveform and Feature Plane
[Figure: closure-burst transition patterns for detecting burst onsets]

Random forest: Fundamental
• A random forest (RF) combines the following techniques:
  • An ensemble of classifiers
    • RF is an ensemble of tree classifiers.
  • Bootstrapping and aggregating (bagging)
    • Generate multiple training sets for the tree classifiers.
    • The final decision is made by a plurality vote (majority vote).
  • Random subspace
    • Introduce randomness during node splitting.

Random forest: Fundamental
• RF construction procedure
  • Bootstrap a training set for each tree classifier.
  • Grow one tree and add it to the forest. This step terminates when the specified number of trees is reached.
  • While searching for an optimal cut, consider only a few dimensions. Repeat this whenever a node needs a split.
  • Grow the tree to its maximal size without any posterior pruning (highest purity).
• During testing, each tree in the forest hypothesizes a class for the input vector. A final decision is then made by a plurality vote.

Random forest: Fundamental
[Figure: random forest construction, with the following annotations]
• Bootstrap the training data.
• For a D-dimensional vector, randomly select d dimensions to search for an optimal split, where d ≈ sqrt(D).
• Each node achieves the highest purity. There is no posterior pruning.
• Each tree classifier is fully grown and then added to the ensemble.
• Repeat the procedure several times to construct more tree classifiers.
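
The construction procedure above can be sketched in plain Python. This is an illustrative toy, not the talk's implementation: the Gini impurity split criterion is an assumption (the slides do not name the criterion), all function names are hypothetical, and class labels are assumed to be strings or integers.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, dims):
    """Search only the given dimensions for the cut with the lowest weighted Gini."""
    best = None
    for d in dims:
        for t in sorted({x[d] for x in X})[:-1]:
            left = [lab for x, lab in zip(X, y) if x[d] <= t]
            right = [lab for x, lab in zip(X, y) if x[d] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, d, t)
    return best

def grow_tree(X, y, d_sub, rng):
    """Grow without pruning until a node is pure or cannot be split."""
    if len(set(y)) == 1:
        return y[0]
    dims = rng.sample(range(len(X[0])), d_sub)  # random subspace at this node
    split = best_split(X, y, dims)
    if split is None:  # simplification: stop if the sampled dimensions are constant
        return Counter(y).most_common(1)[0][0]
    _, d, t = split
    li = [i for i, x in enumerate(X) if x[d] <= t]
    ri = [i for i, x in enumerate(X) if x[d] > t]
    return (d, t,
            grow_tree([X[i] for i in li], [y[i] for i in li], d_sub, rng),
            grow_tree([X[i] for i in ri], [y[i] for i in ri], d_sub, rng))

def grow_forest(X, y, n_trees, d_sub, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], d_sub, rng))
    return forest

def tree_predict(tree, x):
    while isinstance(tree, tuple):  # internal nodes are (dim, threshold, left, right)
        d, t, left, right = tree
        tree = left if x[d] <= t else right
    return tree

def forest_predict(forest, x):
    """Plurality vote over the trees."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]
```

Each tree sees a different bootstrap sample and a different random subspace at every node, which is what decorrelates the trees before their votes are aggregated.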

Random forest: Broad phonetic category of manners
• Articulatory manners
  • stop, affricate, fricative, nasal, semivowel, vowel, non-speech
• “Stop” is further divided into
  • Voiced-stop burst
  • Voiceless-stop burst
  • Stop-aspiration
• “burst”: voiced-stop burst, voiceless-stop burst
• “non-burst”: all other classes

Random forest: Imbalanced training data
• The problem of imbalanced training data
  • The numbers of training vectors from different manners are highly imbalanced.
  • #Vowel >> #Fricative > … > #Stop (#Burst)
• Conventional bootstrap causes problems.
  • Most of the training vectors are selected from the majority classes such as “Vowel” and “Fricative”.
  • The target class “Burst”, however, may not be sampled sufficiently. The resulting tree classifier therefore lacks the discriminative power to detect burst onsets.

Random forest: Asymmetric Bootstrap
• Generate balanced training data
  • Over-sampling the “burst” class
  • Down-sampling the other classes
• The procedure repeats several times
[Figure: training vectors from the classes “burst”, “fricative”, and “vowel” are resampled into sets of bootstrapped training data]
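
One way to realize such a balanced resampling is to draw the same number of vectors, with replacement, from every class. The sketch below is only an illustration under that assumption: the function name and the `per_class` parameter are hypothetical, and the slides do not state how many vectors are drawn per class.

```python
import random
from collections import defaultdict

def asymmetric_bootstrap(labels, per_class, rng):
    """Return indices with `per_class` draws (with replacement) from every class."""
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    picked = []
    for lab in sorted(by_class):
        # Small classes are over-sampled, large classes are down-sampled.
        picked.extend(rng.choices(by_class[lab], k=per_class))
    return picked
```

Calling this once per tree gives every tree classifier its own balanced training set.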

Random forest: Detect burst onsets
• For each input vector, the forest votes for its class
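
The voting step can be sketched as below. The use of the winning vote share as a confidence value and the `min_share` threshold are assumptions made for illustration; the slides do not define how the detector's confidence is computed. The burst class names follow the broad phonetic categories listed earlier.

```python
from collections import Counter

BURST_CLASSES = {"voiced-stop burst", "voiceless-stop burst"}

def plurality_vote(tree_votes):
    """Return the most-voted class and its share of the votes."""
    label, n = Counter(tree_votes).most_common(1)[0]
    return label, n / len(tree_votes)

def is_burst(tree_votes, min_share=0.5):
    """Flag a frame as a burst onset if a burst class wins with enough votes.
    min_share is a hypothetical confidence threshold."""
    label, share = plurality_vote(tree_votes)
    return label in BURST_CLASSES and share >= min_share
```

Raising the threshold rejects more false alarms at the cost of more missed detections.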

Random forest: Detect burst onsets
[Figure: detector output along the frame axis; values shown: 0, 3.68, 0, 3.90, 0]

Experimental Results: Speech materials
• TIMIT corpus (English read speech)
  • Microphone speech, 16 kHz sampling rate, 16-bit PCM format.
  • 630 speakers, including 438 males and 192 females
  • 8 different dialect regions in the US (DR1~DR8)
  • Training set: 462 speakers (326M, 136F)
  • Testing set: 168 speakers (112M, 56F)
• Each speaker spoke 10 sentences:
  • 2 SA sentences: fixed contexts
  • 5 SX sentences: phonetically compact
  • 3 SI sentences: phonetically diverse

Experimental Results: Speech materials
• TIMIT corpus
  • Training data are from four speakers in DR1.
  • Training data for the “burst” class are exclusively from stops.
  • Testing data are all utterances from the TIMIT TEST set (6991 stops, 631 affricates).

Experimental Results: RF-based burst onset detector
• Random forest settings
  • Training dataset: 4 speakers from TIMIT DR1
  • Broad phonetic category of articulatory manners
    • nine classes
  • Apply asymmetric bootstrap to balance the training data
  • 56-dim feature vector (D=56), including 55 TDCCs and 1 average log-energy of the patch.
  • The detector consists of 30 trees
  • The dimension of the random subspace during node splitting is d=8
  • No posterior tree pruning

Experimental Results: Precision of detection
• Median: 3.1 ms
• Interdecile range: 12.6 ms
• Precision: Voiceless > Voiced
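
For reference, the two statistics above can be computed from a list of detection deviations as sketched here. The interdecile range is the distance between the 10th and 90th percentiles; the interpolation method and the function name are assumptions of this sketch, and the input values in any usage are up to the reader.

```python
import statistics

def median_and_interdecile_range(deviations_ms):
    """Median and (90th percentile - 10th percentile) of the deviations."""
    deciles = statistics.quantiles(deviations_ms, n=10, method="inclusive")
    return statistics.median(deviations_ms), deciles[-1] - deciles[0]
```

Both statistics are robust to a handful of gross errors, which makes them a reasonable summary for onset deviations.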

Dental fricative // Dental fricative // Experimental ResultsSources of false alarm • Most onsets of dental fricatives are detected as having burst onsets, and they are hard to be rejected. • Other sources are fricatives and pause segments.

Experimental Results: Missed detection rate
• The missed detection rate increases as the confidence threshold increases.
  • Stops: 5.1% → 6.5%
  • Affricates: 13.6% → 15.8%

Experimental Results: Comparison of different RF settings
• D: number of feature dimensions
• d: number of randomly selected dimensions in node splitting

Experimental Results: Comparison of various learning machines
• Accuracy: RF  SVM > GMM
• Execution time: RF  GMM >> SVM
• SVM kernel: RBF  LIN

Experimental Results: Comparison of various amounts of training data
• Training data are from dialect region one (DR1).
• SVM-RBF starts to surpass RF as more data are included.
• SVM-RBF takes far more time in training and testing.

Summary
• The proposed RF-based detector is able to efficiently detect burst onsets in continuous speech.
  • The detector needs only a small amount of training data.
  • Experimental results demonstrate its applicability.
• The proposed asymmetric bootstrap technique can resolve the problem of imbalanced training data.

Section II Voice Onset Time Estimate

Voice Onset Time
• Voice onset time (VOT) was proposed in the 1960s. It was expected to effectively distinguish between English /b, d, g/ and /p, t, k/.
  • Other cues are “voicing”, “articulatory force”, and “aspiration”.
• VOT is defined as the time difference between the burst onset and the voicing onset.

Voice Onset Time: Two examples of VOT
[Figure: two examples, labeled “borrow” and “tim”]

Voice Onset Time
• VOT can be classified into several categories
  • Voicing Lead: VOT is negative-valued
  • Voicing Coincide: VOT is about zero
  • Voicing Lag: VOT is positive-valued
• Distributions of VOT differ from language to language.
  • Two-modal: English, Spanish, Mandarin, Dutch
  • Three-modal: Korean, Thai
  • Four-modal: Hindi
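
The definition and the three categories can be written down directly. In this sketch the sign convention (voicing onset minus burst onset) follows from “Voicing Lead: VOT is negative-valued”; the width of the “about zero” band, `zero_band_ms`, is a hypothetical parameter, since the slides give no numeric boundary.

```python
def vot_ms(burst_onset_ms, voicing_onset_ms):
    """VOT as the voicing onset time minus the burst onset time."""
    return voicing_onset_ms - burst_onset_ms

def vot_category(vot, zero_band_ms=5.0):
    """Map a VOT value to one of the three categories.
    zero_band_ms is an illustrative tolerance for 'about zero'."""
    if vot < -zero_band_ms:
        return "voicing lead"
    if vot > zero_band_ms:
        return "voicing lag"
    return "voicing coincide"
```

For example, a burst at 120 ms followed by voicing at 180 ms gives a VOT of 60 ms, a voicing lag.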

Voice Onset Time: Existing automatic methods to estimate VOT
• Automatic VOT estimation methods include
  • Forced alignment performed by an HMM phone recognizer (HMM-FA)
    • Pros: efficient, suitable for large corpora
    • Cons: aligned boundaries normally do not coincide with the onsets
  • Onset detector for burst and voicing onsets (OD)
    • Pros: estimated onset locations are more accurate
    • Cons: only suitable for isolated words
  • Combination of the two (HMM-FA+OD)
    • Has the pros of both previous methods

Proposed HMM+RF System: Flowchart of the system

Proposed HMM+RF System: System overview
• The proposed system consists of two parts:
  • Forced alignment based on HMM
    • Roughly locate stop consonants in continuous speech.
    • The aligned boundaries typically do not align with the true onset locations.
  • Onset detection based on random forest
    • For each aligned stop consonant, the detector searches its neighborhood for its burst and voicing onsets.

Proposed HMM+RF System: HMM-based phone recognizer
• HMM-based phone recognizer
  • Training dataset: the whole TIMIT training set
  • 48 context-independent English phones
  • HMM topology: three-state left-to-right HMM; each state has eight Gaussian components.
  • ML training + EM algorithm
    • Five iterations of embedded training are run each time the number of Gaussian components is doubled.
  • 13-dim MFCC + 1-dim log-energy, plus their derivative and accelerative coefficients.

Proposed HMM+RF System: RF-based onset detector
• Random forest based onset detector
  • Training dataset: 4 speakers from the TIMIT training set
  • Broad phonetic category of articulatory manners
    • Burst → burst onset
    • Vocalic → voicing onset
  • 56-dim TDCC-based feature vector
  • The detector consists of 30 trees
  • The dimension of the random subspace during node splitting is 8
  • No posterior tree pruning
  • Apply asymmetric bootstrap to balance the training data from the broad phonetic categories.

Proposed HMM+RF System: More details about the onset detector
• Burst onset detection
  • The procedure is the same as described in Section I.
• Voicing onset detection
  • The first frame of a detected ‘vocalic’ segment following a detected burst onset is regarded as the voicing onset.
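
The voicing onset rule can be sketched as a scan over per-frame class labels. The label strings and function name are hypothetical, and the sketch covers only the initial detection rule, not any later adjustment; the 2 ms frame shift is the one given in Section I.

```python
FRAME_SHIFT_MS = 2  # frame shift from Section I

def voicing_onset_frame(frame_labels, burst_frame):
    """Index of the first 'vocalic' frame after the burst onset, or None."""
    for i in range(burst_frame + 1, len(frame_labels)):
        if frame_labels[i] == "vocalic":
            return i
    return None

# Hypothetical per-frame detector output around one stop consonant.
labels = ["closure"] * 3 + ["burst"] + ["aspiration"] * 4 + ["vocalic"] * 5
burst = labels.index("burst")
voicing = voicing_onset_frame(labels, burst)
vot_estimate_ms = (voicing - burst) * FRAME_SHIFT_MS
```

Because the onsets are found on a 2 ms frame grid, the resulting VOT estimate is quantized to multiples of 2 ms in this sketch.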

Proposed HMM+RF System: More details about the onset detector
• Voicing onset adjustment procedure
  • (a) Aspiration or release portion: is large
  • (b) Vocalic portion: is small
  • in the region between (a) and (b) will be large

Proposed HMM+RF System: More details about the onset detector
• An example of voicing onset adjustment

Experimental Results: Evaluation dataset
• Subset of the TIMIT testing set
  • 3,784 stop consonants in 968 distinct words.
  • 2,344 word-initial stop consonants and 1,440 word-medial stop consonants.
  • The selected stop consonants are left-context independent, but right-context dependent.

Experimental Results: Evaluation dataset
• The list of eligible succeeding vowels in the experiment.
  • Ht. (Vowel Height): Low, Mid-Low, Mid-High, High
  • Bk. (Vowel Backness): Front, Central, Back

Experimental Results: Performance Evaluation
• Four systems to be compared
  • HMM-FA-PL
    • HMM Forced Alignment at Phone Level
  • HMM-FA-PL+OD
    • HMM-FA-PL with Onset Detection
  • HMM-FA-SL
    • HMM Forced Alignment at State Level
  • HMM-FA-SL+OD
    • HMM-FA-SL with Onset Detection

Experimental Results: Performance Evaluation
• The measure is the absolute temporal deviation between an estimated VOT and its true value.
• The deviations are presented in terms of cumulative relative frequency distributions.
• Four tolerances: 5 ms, 10 ms, 15 ms, and 20 ms
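
The evaluation measure can be sketched as the fraction of stops whose absolute VOT deviation falls within each tolerance. Treating the tolerance as inclusive (deviation ≤ tolerance) is an assumption of this sketch, and the function name is hypothetical.

```python
def within_tolerance(estimated_ms, true_ms, tolerances_ms=(5, 10, 15, 20)):
    """Cumulative relative frequency of absolute VOT deviations per tolerance."""
    devs = [abs(e - t) for e, t in zip(estimated_ms, true_ms)]
    return {tol: sum(d <= tol for d in devs) / len(devs) for tol in tolerances_ms}
```

Each value is one point on the cumulative relative frequency curve, so the values never decrease as the tolerance grows.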

Experimental Results: VOT estimates in voiced and voiceless stops
• VOT estimates of voiced stops with HMM-FA-PL are very poor.
  • HMM topology limitation
• HMM-FA-SL significantly improves the estimates.
• The effect of an additional onset detection is remarkable.

Experimental Results: 3D-histograms of estimate deviations
• HMM-FA-SL corrects the estimate deviation of the burst onset in HMM-FA-PL.
• With additional OD, the estimates of burst and voicing onsets are both enhanced.

Experimental Results: Performance Comparison
[Table: absolute deviation of estimation]
* RS: Reassigned Spectrum
** Stouten & Van hamme (2009) employed the RS technique to estimate VOT.

Experimental Results: Performance in Detail
• VOTs of the voiced velar stop /g/ are estimated less accurately than those of the other five stops.
• On average, VOTs of velar stops (/g/, /k/) are estimated less accurately.
• VOTs of word-medial voiced stops are estimated less accurately than those of their word-initial counterparts.
  • This is caused by failed detection of the burst onset.
• In contrast, the estimates for voiceless stops in word-medial and word-initial positions are statistically the same.
