
Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR


Presentation Transcript


1. Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR
Xin Lei, Mei-Yuh Hwang and Mari Ostendorf
SSLI-LAB, Department of Electrical Engineering, University of Washington, Seattle, WA 98195

Overview

Motivation
• Tone plays a crucial role in Mandarin speech.
• Tone depends on the F0 contour over a time span much longer than a single frame; frame-level F0 and its derivatives capture only part of the tonal patterns.

Our approach
• Use an MLP with a longer input window to extract tone-related posteriors as a better model of tone.
• Incorporate the tone-related posteriors into the feature vector.

Tone-related Posterior Feature Extraction
• Overall configuration: two types of features are considered (a toy sketch of the extraction pipeline follows the transcript).
• Tone posterior features
  • MLP targets: the five tones plus a no-tone target
  • MLP input: a 9-frame window of MFCC+F0 features
• Toneme posterior features
  • MLP targets: 62 speech tonemes plus one silence target, one for laughter, and one for all other noise
  • The MLP outputs are reduced to 25 dimensions with PCA

Speech Recognition Results
• Tone posterior feature results (non-crossword models) [table not reproduced in this transcript]
• Toneme posterior feature results (non-crossword models) [table not reproduced in this transcript]
• Cross-word system results [table not reproduced in this transcript]

Five Tones in Mandarin
• The five tones: high-level (tone 1), high-rising (tone 2), low-dipping (tone 3), high-falling (tone 4) and neutral (tone 5).
• Tones are characterized by the syllable-level F0 contour.
• Confusions caused by tone: 烟 (yan1, cigarette), 盐 (yan2, salt), 眼 (yan3, eye), 厌 (yan4, hate).
• Common tone modeling techniques: tonal acoustic units (such as tonal syllables or tonemes), and adding pitch and its derivatives to the feature vector.

Experiment Setup
• Corpora
  • Mandarin conversational telephone speech collected by HKUST
  • Training set: train04, 251 phone calls, 57.7 hours of speech
  • Test set: dev04, 24 calls totaling 2 hours, manually segmented
• Front-end features
  • Standard MFCC/PLP features from the SRI Decipher front end
  • 3-dimensional pitch and delta features from ESPS get_f0
  • Pitch features post-processed with the SRI pitch tool graphtrack (reduces doubling and halving errors, and smooths the contour)
  • Various MLP-generated tone-related posterior features
• Decoding setup
  • 7-class unsupervised MLLR adaptation
  • Rescoring a lattice of word hypotheses with a trigram language model

Conclusions
• Tone posterior features by themselves outperform F0.
• Tone posteriors and the plain F0 features complement each other, though with some overlap.
• Toneme posteriors give more improvement than tone posteriors, since they carry both tonal and segmental information.
• Combining tone-related posterior features with F0 yields a 2-2.5% absolute improvement in CER.

MLP Training Results
• MLPs trained with the ICSI QuickNet package.
• Features: a 9-frame window of MFCC + F0.
• Tone/toneme labels obtained from forced alignment (a toy label-mapping sketch also follows the transcript).
• 10% of the data held out for cross-validation.
• Cross-validation results on MLP training [table not reproduced in this transcript]. For reference, frame accuracy for 46 English phones is around 67%.

Tandem Features
• Hermansky et al. proposed the tandem approach: use log MLP posterior outputs as input features for the Gaussian mixture models of a conventional speech recognizer (a sketch of building tandem features this way follows the transcript).
• References
  • H. Hermansky et al., "Tandem connectionist feature extraction for conventional HMM systems", ICASSP 2000.
  • N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition", ICASSP 2004.

Future Work
• Context-dependent tone ("tri-tone") classification
• Tri-tone clustering
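
Sketch 1: posterior extraction. As a rough illustration of the extraction step described in the poster (a 9-frame input window of MFCC+F0 frames fed to an MLP with five tones plus a no-tone target), the sketch below stacks a context window of per-frame features and passes it through a single-hidden-layer MLP. The dimensions, weight matrices, and function names are illustrative placeholders, not the configuration the authors actually trained (they used the ICSI QuickNet package).

    import numpy as np

    def stack_context(feats, context=4):
        """Stack +/-`context` neighboring frames around each frame.
        feats: (T, D) array of per-frame features (e.g., MFCC + F0).
        Returns a (T, (2*context+1)*D) array; edges are padded by repetition."""
        T, D = feats.shape
        padded = np.vstack([np.repeat(feats[:1], context, axis=0),
                            feats,
                            np.repeat(feats[-1:], context, axis=0)])
        return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

    def mlp_posteriors(x, W1, b1, W2, b2):
        """One-hidden-layer MLP: sigmoid hidden layer, softmax output.
        Returns per-frame posteriors over the MLP targets
        (six targets for the tone MLP: five tones + no-tone)."""
        h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # hidden activations
        z = h @ W2 + b2
        z -= z.max(axis=1, keepdims=True)          # numerical stability for softmax
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    # Toy example with made-up sizes: 42-dim MFCC+F0 frames, 9-frame window, 6 targets.
    T, D, H, K = 100, 42, 500, 6
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((T, D))
    W1, b1 = rng.standard_normal(((2 * 4 + 1) * D, H)) * 0.01, np.zeros(H)
    W2, b2 = rng.standard_normal((H, K)) * 0.01, np.zeros(K)

    post = mlp_posteriors(stack_context(frames, context=4), W1, b1, W2, b2)
    print(post.shape)   # (100, 6): per-frame tone posteriors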
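Sketch 2: tandem feature construction. The tandem idea cited in the poster (log MLP posteriors used as additional GMM input features) and the 25-dimensional PCA applied to the toneme posteriors can be sketched as follows. Here `post` stands in for the (T, 65) toneme posterior matrix, the PCA is a plain eigendecomposition without whitening, and all names and dimensions are assumptions for illustration rather than the paper's exact recipe.

    import numpy as np

    def log_posteriors(post, floor=1e-10):
        """Log of MLP posteriors (floored to avoid log(0)), as used in tandem features."""
        return np.log(np.maximum(post, floor))

    def pca_project(x, n_components):
        """Project features onto the top principal components."""
        xc = x - x.mean(axis=0)
        cov = xc.T @ xc / (len(xc) - 1)
        vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
        top = vecs[:, ::-1][:, :n_components]      # top components first
        return xc @ top

    # post: (T, K) toneme posteriors; mfcc_f0: (T, D) standard front-end features.
    rng = np.random.default_rng(1)
    T, K, D = 100, 65, 42
    post = rng.dirichlet(np.ones(K), size=T)       # stand-in for MLP outputs
    mfcc_f0 = rng.standard_normal((T, D))

    toneme_feat = pca_project(log_posteriors(post), n_components=25)
    tandem = np.hstack([mfcc_f0, toneme_feat])     # augmented feature vector
    print(tandem.shape)                            # (100, 67)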
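Sketch 3: tone labels for MLP training. The MLP targets come from forced alignment of the tonal transcription; one trivial way to map pinyin syllables with trailing tone digits (as in the yan1/yan2/yan3/yan4 example) to the five tone classes plus a no-tone label might look like the following. This label scheme is an assumption for illustration, not the authors' alignment procedure.

    def tone_label(syllable):
        """Map a pinyin syllable with a trailing tone digit (e.g. 'yan1', 'yan4')
        to a tone class 1-5; anything without a tone digit (silence, noise)
        gets the no-tone label 0."""
        if syllable and syllable[-1] in "12345":
            return int(syllable[-1])
        return 0  # no-tone target

    print([tone_label(s) for s in ["yan1", "yan2", "yan3", "yan4", "ma5", "<sil>"]])
    # [1, 2, 3, 4, 5, 0]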
