Incorporating Tone-related MLP Posteriors in the Feature Representation for Mandarin ASR
Xin Lei, Mei-Yuh Hwang and Mari Ostendorf
SSLI-LAB, Department of Electrical Engineering, University of Washington, Seattle, WA 98195

Overview
Motivation
• Tone plays a crucial role in Mandarin speech
• Tone depends on the F0 contour over a time span much longer than a single frame; frame-level F0 and its derivatives capture only part of the tonal pattern
Our approach
• Use an MLP with a longer input window to extract tone-related posteriors as a better model of tone
• Incorporate the tone-related posteriors into the feature vector

Tone-related Posterior Feature Extraction
• Two types of features are considered:
  • Tone posterior feature
    • MLP targets: five tones plus a no-tone target
    • MLP input: 9-frame window of MFCC+F0 features
  • Toneme posterior feature
    • MLP targets: 62 speech tonemes plus one silence target, one laughter target and one target for all other noise
    • MLP output reduced to 25 dimensions by PCA

Speech Recognition Results
• Tone posterior feature results (non-crossword models)
• Toneme posterior feature results (non-crossword models)
• Cross-word system results

Five Tones in Mandarin
• High-level (tone 1), high-rising (tone 2), low-dipping (tone 3), high-falling (tone 4) and neutral (tone 5)
• Characterized by the syllable-level F0 contour
• Confusion caused by tones: 烟 (yan1, cigarette), 盐 (yan2, salt), 眼 (yan3, eye), 厌 (yan4, hate)
• Common tone modeling techniques
  • Tonal acoustic units (such as tonal syllables or tonemes)
  • Adding pitch and its derivatives to the feature vector

Experiment Setup
• Corpora
  • Mandarin conversational telephone speech collected by HKUST
  • Training set: train04, 251 phone calls, 57.7 hours of speech
  • Test set: dev04, 24 calls totaling 2 hours, manually segmented
• Front-end features
  • Standard MFCC/PLP features from the SRI Decipher front-end
  • 3-dim pitch and delta features from ESPS get_f0
  • Pitch post-processed by the SRI graphtrack tool (reduces doubling and halving errors, and smooths the contour)
  • Various MLP-generated tone-related posterior features
• Decoding setup
  • 7-class unsupervised MLLR adaptation
  • Rescoring a lattice of word hypotheses with a trigram language model

Conclusions
• Tone posterior features by themselves outperform F0
• Tone posteriors and plain F0 features complement each other, though with some overlap
• Toneme posteriors yield more improvement than tone posteriors, since they carry both tonal and segmental information
• Combining tone-related posterior features with F0 gives a 2-2.5% absolute improvement in character error rate (CER)

MLP Training Results
• MLPs trained with the ICSI QuickNet package
• Features: 9-frame window of MFCC + F0
• Tone/toneme labels obtained from forced alignment
• 10% of the data held out for cross validation
• Cross-validation frame accuracies reported for MLP training; for reference, frame accuracy for 46 English phones is around 67%

Tandem Features
• Hermansky et al. proposed the tandem approach: use log MLP posterior outputs as the input features for the Gaussian mixture models of a conventional speech recognizer
• References
  • H. Hermansky et al., "Tandem connectionist feature extraction for conventional HMM systems", ICASSP 2000
  • N. Morgan et al., "TRAPping conversational speech: Extending TRAP/Tandem approaches to conversational telephone speech recognition", ICASSP 2004

Future Work
• Context-dependent tone ("tri-tone") classification
• Tri-tone clustering
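The 9-frame MFCC+F0 input window fed to the MLP can be formed by stacking each frame with its four neighbours on either side. A minimal NumPy sketch (the function name `stack_context` and the edge-padding-by-repetition choice are illustrative assumptions, not details from the poster):

```python
import numpy as np

def stack_context(frames, context=4):
    """Stack each frame with +/- `context` neighbouring frames
    (a 9-frame window for context=4), repeating the first and last
    frame to pad the utterance edges."""
    n, d = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    # Each row concatenates the 2*context+1 frames around frame i.
    return np.stack([padded[i:i + 2 * context + 1].reshape(-1)
                     for i in range(n)])
```

With 4 frames of context on each side this yields the 9-frame window; each stacked row is then one MLP input vector.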
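The tandem processing described above (log MLP posteriors, reduced to 25 dimensions with PCA before being appended to the acoustic features) can be sketched as follows; the name `log_posterior_pca` is hypothetical, and the eigendecomposition-based PCA is just one way to implement the reduction:

```python
import numpy as np

def log_posterior_pca(posteriors, out_dim=25, eps=1e-10):
    """Tandem-style feature extraction: take the log of MLP posterior
    outputs (rows sum to 1) and reduce them to out_dim dims with PCA."""
    logp = np.log(posteriors + eps)            # log posteriors, (frames, targets)
    centered = logp - logp.mean(axis=0)        # PCA assumes zero-mean data
    cov = centered.T @ centered / len(centered)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:out_dim]]
    return centered @ top                      # (frames, out_dim)
```

The resulting 25-dim vectors would then be appended to the standard MFCC+F0 feature vector of the Gaussian mixture system.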
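The doubling/halving correction itself is performed by SRI's graphtrack tool; the sketch below only illustrates the underlying idea, snapping octave-jumped F0 values back toward the median pitch of the voiced frames (the function name and the global-median heuristic are assumptions, not graphtrack's algorithm):

```python
import numpy as np

def fix_octave_errors(f0):
    """Illustrative octave-error repair (not SRI graphtrack): replace a
    voiced F0 value by its half or double when that lies closer to the
    median F0 of the voiced frames. Unvoiced frames are marked f0 <= 0."""
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = f0 > 0
    if not voiced.any():
        return f0
    ref = np.median(f0[voiced])                # utterance-level reference pitch
    for i in np.flatnonzero(voiced):
        for cand in (f0[i] / 2.0, f0[i] * 2.0):
            if abs(cand - ref) < abs(f0[i] - ref):
                f0[i] = cand                   # snap the octave jump back
    return f0
```

A real tracker would additionally smooth the corrected contour and use local rather than global context.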