Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR
Xin Lei, Mei-Yuh Hwang and Mari Ostendorf
SSLI-LAB, Department of Electrical Engineering, University of Washington, Seattle, WA 98195

Overview
Motivation
• Tone plays a crucial role in Mandarin speech
• Tone depends on the F0 contour over a time span much longer than a single frame; frame-level F0 and its derivatives capture only part of the tonal pattern
Our approach
• Use an MLP with a longer input window to extract tone-related posteriors as a better model of tone
• Incorporate the tone-related posteriors into the feature vector

Tone-related Posterior Feature Extraction
• Two types of features are considered:
  • Tone posterior feature
    • MLP targets: five tones plus a no-tone target
    • MLP input: 9-frame window of MFCC+F0 features
  • Toneme posterior feature
    • MLP targets: 62 speech tonemes plus one silence target, one laughter target and one target for all other noise
    • MLP output reduced to 25 dimensions by PCA

Speech Recognition Results
• Tone posterior feature results (non-crossword models)
• Toneme posterior feature results (non-crossword models)
• Cross-word system results

Five Tones in Mandarin
• High-level (tone 1), high-rising (tone 2), low-dipping (tone 3), high-falling (tone 4) and neutral (tone 5)
• Characterized by the syllable-level F0 contour
• Confusion caused by tones: 烟 (yan1, cigarette), 盐 (yan2, salt), 眼 (yan3, eye), 厌 (yan4, hate)
• Common tone modeling techniques
  • Tonal acoustic units (such as tonal syllables or tonemes)
  • Adding pitch and its derivatives to the feature vector

Experiment Setup
• Corpora
  • Mandarin conversational telephone speech collected by HKUST
  • Training set: train04, 251 phone calls, 57.7 hours of speech
  • Test set: dev04, 24 calls totaling 2 hours, manually segmented
• Front-end features
  • Standard MFCC/PLP features from the SRI Decipher front-end
  • 3-dim pitch and delta features from ESPS get_f0
  • Pitch post-processed by the SRI graphtrack tool (reduces doubling and halving errors, and smooths the contour)
  • Various MLP-generated tone-related posterior features
• Decoding setup
  • 7-class unsupervised MLLR adaptation
  • Rescoring a lattice of word hypotheses with a trigram language model

Conclusions
• Tone posterior features by themselves outperform F0
• Tone posteriors and plain F0 features complement each other, though with some overlap
• Toneme posteriors yield more improvement than tone posteriors, since they carry both tonal and segmental information
• Combining tone-related posterior features with F0 gives a 2-2.5% absolute improvement in character error rate (CER)

MLP Training Results
• MLPs trained with the ICSI QuickNet package
• Features: 9-frame window of MFCC + F0
• Tone/toneme labels obtained from forced alignment
• 10% of the data held out for cross validation
• Cross-validation frame accuracies reported for MLP training; for reference, frame accuracy for 46 English phones is around 67%

Tandem Features
• Hermansky et al. proposed the tandem approach: use log MLP posterior outputs as the input features for the Gaussian mixture models of a conventional speech recognizer
• References
  • H. Hermansky et al., "Tandem connectionist feature extraction for conventional HMM systems", ICASSP 2000
  • N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition", ICASSP 2004

Future Work
• Context-dependent tone ("tri-tone") classification
• Tri-tone clustering
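The 9-frame MFCC+F0 input window fed to the MLP can be formed by stacking each frame with its four neighbours on either side. A minimal NumPy sketch (the function name `stack_context` and the edge-padding-by-repetition choice are illustrative assumptions, not details from the poster):

```python
import numpy as np

def stack_context(frames, context=4):
    """Stack each frame with +/- `context` neighbouring frames
    (a 9-frame window for context=4), repeating the first and last
    frame to pad the utterance edges."""
    n, d = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    # Each row concatenates the 2*context+1 frames around frame i.
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(n)])
```

With 4 frames of context on each side this yields the 9-frame window; each stacked row is then one MLP input vector.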
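The tandem processing described above (log MLP posteriors, reduced to 25 dimensions with PCA before being appended to the acoustic features) can be sketched as follows; the name `log_posterior_pca` is hypothetical, and the eigendecomposition-based PCA is just one way to implement the reduction:

```python
import numpy as np

def log_posterior_pca(posteriors, out_dim=25, eps=1e-10):
    """Tandem-style feature extraction: take the log of MLP posterior
    outputs (rows sum to 1) and reduce them to out_dim dims with PCA."""
    logp = np.log(posteriors + eps)            # log posteriors, (frames, targets)
    centered = logp - logp.mean(axis=0)        # PCA assumes zero-mean data
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:out_dim]]
    return centered @ top                      # (frames, out_dim)
```

The resulting 25-dim vectors would then be appended to the standard MFCC+F0 feature vector of the Gaussian mixture system.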
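The doubling/halving correction itself is performed by SRI's graphtrack tool; the sketch below only illustrates the underlying idea, snapping octave-jumped F0 values back toward the median pitch of the voiced frames (the function name and the global-median heuristic are assumptions, not graphtrack's algorithm):

```python
import numpy as np

def fix_octave_errors(f0):
    """Illustrative octave-error repair (not SRI graphtrack): replace a
    voiced F0 value by its half or double when that lies closer to the
    median F0 of the voiced frames. Unvoiced frames are marked f0 <= 0."""
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0
    ref = np.median(f0[voiced])                # utterance-level reference pitch
    for i in np.flatnonzero(voiced):
        for cand in (f0[i] / 2.0, f0[i] * 2.0):
            if abs(cand - ref) < abs(f0[i] - ref):
                f0[i] = cand                   # snap the octave jump back
    return f0
```

A real tracker would additionally smooth the corrected contour and use local rather than global context.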