Dimensionality Reduction Methods for HMM Phonetic Recognition
Hongbing Hu, Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
hongbing.hu@binghamton.edu, zahorian@binghamton.edu
[Block diagrams: Original Features → neural network → Network Outputs → PCA → Dimensionality Reduced Features / Dimensionality Reduced Outputs]

• Introduction
• Accurate Automatic Speech Recognition (ASR) requires:
• Highly discriminative features that incorporate nonlinear frequency scales and time dependency
• Low-dimensionality feature spaces
• Efficient recognition models (HMMs and neural networks)
• Neural Network Based Dimensionality Reduction
• Neural networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
• An NN is combined with an HMM recognizer to form a hybrid NN/HMM recognition model

• NLDA Reduction
• Overview
• In Nonlinear Discriminant Analysis (NLDA), a multilayer neural network performs a nonlinear transformation of the input speech features
• Phone models are built for the transformed features using HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
• PCA performs a Karhunen-Loeve (KL) transform to reduce the correlation of the network outputs

• NLDA1 Method
• Dimensionality reduced features are obtained at the output layer of the neural network; feature dimensionality is further reduced by PCA
• Node nonlinearity (activation function): in the feature transformation, the output layer is linear and the other layers use a sigmoid nonlinearity; in NLDA1 training, all layers are nonlinear

• NLDA2 Method
• Reduced features are the outputs of the network's middle hidden layer; the reduced dimensionality is determined by the number of middle nodes, giving flexibility in the reduced feature dimensionality
• Linear PCA is used only for feature decorrelation
• Nonlinearity: all nonlinear layers are used in both the feature transformation and network training

• Training with "Don't Cares"
• For 3-state models, train using "Don't Cares":
• For the 1st portion of a phone, the target is "1" for state 1 and "Don't Care" for states 2 and 3
• For the 2nd portion, the target is "1" for state 2 and "Don't Care" for states 1 and 3
• For the 3rd portion, the target is "1" for state 3 and "Don't Care" for states 1 and 2

• Experimental Setup
• TIMIT Database ('SI' and 'SX' sentences only)
• 48-phoneme set mapped down from the 62-phoneme set
• Training data: 3696 sentences (460 speakers)
• Testing data: 1344 sentences (168 speakers)
• DCTC/DCSC Features
• A total of 78 features (13 DCTCs × 6 DCSCs) were computed
• 10 ms frames with 2 ms frame spacing, and 8 ms block spacing with 1 s block length
• HMM
• 3-state left-to-right Markov models with no skip
• 48 monophone HMMs were created using HTK (ver. 3.4)
• Language model: phone bigram information from the training data
• Neural Networks in NLDA
• 3 hidden layers: 500-36-500 nodes
• Input layer: 78 nodes, corresponding to the feature dimensionality
• Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state-level targets

• Training Neural Networks
• Phone Level Targets
• Each NN output corresponds to a specific phone
• Straightforward to implement using a phonetically labeled training database
• But why should an NN output be forced to the same value for the entire phone?
• State Level Targets
• Each NN output corresponds to a single state of a phone HMM
• But how to determine state boundaries?
• Estimate using a percentage of the total length
• Use an initial training iteration, then Viterbi alignment

• Experimental Results
• Control Experiment
• Compare the original DCTC/DCSC features with the PCA and LDA reduced features
• Use various numbers of mixtures in the HMMs
• Accuracies using the original, PCA, and LDA reduced features (20 and 36 dimensions)
• NLDA Experiment
• Evaluate NLDA1 and NLDA2 with and without PCA
• 48-dimensional phoneme-level targets used
• The features were reduced to 36 dimensions
• Accuracies using the NLDA1 and NLDA2 reduced features, and the reduced features without the PCA processing
• State Level Target Experiment 1
• Compare the state-level targets with and without "Don't Cares"
• 144-dimensional state-level targets used
• State boundaries obtained using the fixed state length method (3 states, ratio 1:4:1)
• Network training: 8×10⁶ weight updates
• Accuracies using the state-level targets with and without "Don't Cares"
• State Level Target Experiment 2
• Use a fixed length ratio and the Viterbi alignment for the state targets
• State-level targets with "Don't Cares" used
• Targets obtained using a fixed length ratio (3 states, ratio 1:4:1) and the Viterbi alignment
• Network training: 4×10⁷ weight updates
• Accuracies, where "(R)" indicates a fixed length ratio and "(A)" the Viterbi forced alignment
• Literature Comparison
• Recognition accuracy based on TIMIT

• Conclusions
• Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
• The NLDA methods are able to produce a low-dimensional, effective representation of speech features
• The NLDA2 reduced features achieved a substantial improvement over the original features
• The original 78-dimensional features yield the highest accuracy of 73.2% using 64-mixture HMMs
• The state-level targets with "Don't Cares" result in higher accuracies
• The middle-layer outputs of a network result in more effective features in a reduced space
• The accuracies improved by about 2% with PCA
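The NLDA2 pipeline above — take the middle-layer outputs of the 78-500-36-500-48 network as the reduced features, then apply PCA purely for decorrelation — can be sketched as follows. This is an illustrative sketch only: the random weights stand in for a trained network, and the function names are hypothetical, not from the poster.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the poster: 78 inputs, 500-36-500 hidden, 48 outputs.
sizes = [78, 500, 36, 500, 48]
# Random weights stand in for a trained network (illustration only).
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]

def nlda2_features(x):
    """Forward pass up to the middle (36-node) hidden layer.
    In NLDA2 all layers are nonlinear, so the middle layer is sigmoid too."""
    h = x
    for w in weights[:2]:          # input -> 500 -> 36
        h = sigmoid(h @ w)
    return h                       # 36-dimensional reduced features

def pca_decorrelate(feats):
    """PCA used only for decorrelation: rotate onto the eigenvectors of
    the covariance matrix without discarding any dimensions."""
    centered = feats - feats.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs

frames = rng.standard_normal((1000, 78))      # stand-in for DCTC/DCSC frames
reduced = pca_decorrelate(nlda2_features(frames))
print(reduced.shape)                           # prints (1000, 36)
```

The decorrelated 36-dimensional vectors are what would then be modeled by the diagonal-covariance GMM states of the HMM recognizer, which is why decorrelation (rather than further dimensionality reduction) is the goal of the PCA step here.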
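The "Don't Cares" state-level targets with the fixed 1:4:1 length ratio can likewise be sketched. The helper names and the mask-based error formulation are assumptions for illustration; the poster does not specify how "Don't Cares" were implemented during backpropagation.

```python
import numpy as np

def state_targets(num_frames, phone_index, num_phones=48):
    """Build 3-state targets for one phone segment using the fixed
    1:4:1 state length ratio (hypothetical helper, for illustration).
    Returns a (num_frames, 3*num_phones) target matrix and a mask that
    is 1 where the target is enforced and 0 at "Don't Care" positions."""
    targets = np.zeros((num_frames, 3 * num_phones))
    mask = np.zeros_like(targets)
    # Frame boundaries for the 1:4:1 split of the segment.
    b1 = num_frames * 1 // 6
    b2 = num_frames * 5 // 6
    for start, end, state in [(0, b1, 0), (b1, b2, 1), (b2, num_frames, 2)]:
        mask[start:end, :] = 1.0                 # enforce all outputs...
        dont_care = [phone_index * 3 + s for s in range(3) if s != state]
        mask[start:end, dont_care] = 0.0         # ...except the two sibling states
        targets[start:end, phone_index * 3 + state] = 1.0   # active state -> "1"
    return targets, mask

def masked_error(outputs, targets, mask):
    # Squared error with "Don't Care" positions contributing nothing,
    # so backpropagation ignores them.
    return np.sum(mask * (outputs - targets) ** 2)

# Example: a 12-frame segment of phone #5 (columns 15-17 are its 3 states).
t, m = state_targets(12, phone_index=5)
print(t.shape)   # prints (12, 144)
```

Note that only the two sibling states of the current phone are "Don't Cares"; the 141 outputs belonging to other phones are still trained toward 0, which matches the poster's description of the targets.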