An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training Mr. Yik-Cheung Tam Dr. Brian Mak
Outline • Motivation • Overview of MCE training • Problem using N-best hypotheses • Alternative:1-nearest hypothesis • What? • Why? • How? • Evaluation • Conclusion
MCE Overview • The MCE loss function: l(d(X)), where l(.) is a 0–1 soft error-counting function (Sigmoid). • Distance measure: d(X), comparing the correct string's score against a competitor score G(X). • G(X) may be computed using the N-best hypotheses. • Gradient descent is used to obtain a better estimate of the HMM parameters.
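The loss and distance equations were figures in the original slides; the following is a reconstruction in the standard MCE form (the symbols γ, θ, η and g_j are the usual MCE notation, assumed rather than taken from the deck):

```latex
% Sigmoid loss (0-1 soft error count); gamma = slope, theta = offset
\ell(d(X)) = \frac{1}{1 + e^{-\gamma\, d(X) + \theta}}

% Misclassification distance: competitor score vs. correct-string score g_c
d(X) = -\, g_c(X;\Lambda) + G(X)

% Competitor score from the N best incorrect hypotheses (eta-smoothed max)
G(X) = \frac{1}{\eta}\,\log\!\Bigl[\frac{1}{N}\sum_{j=1}^{N} e^{\eta\, g_j(X;\Lambda)}\Bigr]
```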
Problem Using N-best Hypotheses • When d(X) gets large enough, it falls outside the steep trainable region of the Sigmoid.
What is 1-nearest Hypothesis? • d(1-nearest) <= d(1-best) • The idea can be generalized to N-nearest hypotheses.
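The contrast with the usual 1-best choice can be sketched as follows. This is a hedged illustration, assuming "1-nearest" means the incorrect hypothesis whose score is closest to the correct string's; all strings and score values are made up.

```python
# A hedged sketch of two ways to pick the competing hypothesis for MCE.
# 'scores' maps hypothesis strings to discriminant (log-likelihood) scores;
# names and numbers are illustrative, not from the slides.

def one_best_competitor(scores, correct):
    """Highest-scoring incorrect hypothesis (the usual N-best choice)."""
    return max((h for h in scores if h != correct), key=lambda h: scores[h])

def one_nearest_competitor(scores, correct):
    """Incorrect hypothesis whose score is closest to the correct string's,
    keeping the distance d(X) small (one reading of '1-nearest')."""
    g_c = scores[correct]
    return min((h for h in scores if h != correct),
               key=lambda h: abs(scores[h] - g_c))

scores = {"one two": -120.0,   # correct transcription
          "oh two": -100.0,    # top (misrecognized) hypothesis
          "one too": -118.0}   # close competitor
print(one_best_competitor(scores, "one two"))     # -> oh two
print(one_nearest_competitor(scores, "one two"))  # -> one too
```

In this toy example d(1-best) = 20 but d(1-nearest) = 2, illustrating d(1-nearest) <= d(1-best).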
Using 1-nearest Hypothesis • Keeps the training data inside the steep trainable region of the Sigmoid.
How to Find 1-nearest Hypothesis? • Method 1 (exact approach): stack-based N-best decoder. • Drawback: N may be very large => memory problem; need to limit the size of N. • Method 2 (approximated approach): modify the Viterbi algorithm with a special pruning scheme.
Approximated 1-nearest Hypothesis • Notation: • V(t+1, j): accumulated score at time t+1 and state j • a_ij: transition probability from state i to state j • b_j(o_{t+1}): observation probability at time t+1 and state j • V_c(t+1): accumulated score of the Viterbi path of the correct string at time t+1 • Beam(t+1): beam width applied at time t+1
Approximated 1-nearest Hypothesis (.) • There exists some “nearest” path in the search space (shaded area).
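The modified Viterbi pass can be sketched as below. This is a minimal illustration, assuming the pruning rule keeps only states whose accumulated score V(t+1, j) lies within Beam(t+1) of the correct string's score V_c(t+1); the symmetric beam window and the data layout are assumptions, not the paper's exact scheme.

```python
# Sketch of the approximated 1-nearest search: a plain Viterbi recursion
# with an extra pruning rule tied to the correct string's Viterbi score.

NEG_INF = float("-inf")

def pruned_viterbi(log_a, log_b, v_correct, beam):
    """log_a[i][j]: log transition prob i -> j; log_b[t][j]: log observation
    prob of frame t in state j; v_correct[t]: V_c(t); beam[t]: Beam(t)."""
    T, S = len(log_b), len(log_a)
    V = [[NEG_INF] * S for _ in range(T)]
    V[0] = list(log_b[0])  # assume uniform initial state probabilities
    for t in range(T - 1):
        for j in range(S):
            best = max(V[t][i] + log_a[i][j] for i in range(S))
            score = best + log_b[t + 1][j]
            # the special pruning scheme: drop paths far from the correct path
            if abs(score - v_correct[t + 1]) <= beam[t + 1]:
                V[t + 1][j] = score
    return V
```

Pruned states keep score -inf, so only paths near the correct path (the shaded search region on the slide) survive to the end of the utterance.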
Corpus: Aurora • Noisy connected digits derived from TIDIGITS. • Multi-condition training (train on noisy conditions): • {subway, babble, car, exhibition} x {clean, 20, 15, 10, 5 dB} (5 noise levels) • 8,440 training utterances. • Testing (test on matched noisy conditions): • Same as above, plus additional samples at 0 and –5 dB (7 noise levels) • 28,028 testing utterances.
System Configuration • Standard 39-dimensional MFCCs (cep + D + DD) • 11 whole-word digit HMMs (0–9, oh) • 16 states, 3 Gaussians per state • 3-state silence HMM, 6 Gaussians per state • 1-state short-pause HMM tied to the 2nd state of the silence model • Baum–Welch training to obtain the initial HMMs • Corrective MCE training on the HMM parameters
System Configuration (.) • Compare 3 kinds of competing hypotheses: • 1-best hypothesis • Exact 1-nearest hypothesis • Approx. 1-nearest hypothesis • Sigmoid parameters: • Various slopes γ (γ controls the slope of the Sigmoid) • Offset θ = 0
Experiment I: Effect of Sigmoid Slope • Learning rate = 0.05, with different γ: • γ = 0.1 (best test performance) • γ = 0.5 (steeper) • γ = 0.02, 0.004 (flatter) • WER results: • Baseline: 12.71% • 1-best: 11.01% • Approx. 1-nearest: 10.71% • Exact 1-nearest: 10.45%
Effective Amount of Training Data • Soft error < 0.95 is defined as "effective". • The 1-nearest approach has more effective training data when the Sigmoid slope is relatively steep: • Exact 1-nearest: 67% • Approx. 1-nearest: 51% • 1-best: 40%
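The effectiveness cutoff can be made concrete with a small sketch. With a steep Sigmoid, tokens with large d(X) saturate (soft error near 1) and contribute no gradient; flattening the slope brings them back. The distance values below are made up for illustration; only γ's role and the 0.95 cutoff come from the slides.

```python
import math

# Illustrative count of "effective" training data: a training token is
# effective when its Sigmoid soft error is below 0.95.

def soft_error(d, gamma, theta=0.0):
    """0-1 soft error count: Sigmoid of the distance d(X)."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def effective_fraction(distances, gamma):
    return sum(soft_error(d, gamma) < 0.95 for d in distances) / len(distances)

distances = [-5.0, 2.0, 10.0, 40.0, 120.0]  # hypothetical d(X) values
print(effective_fraction(distances, gamma=0.1))    # steep slope -> 0.6
print(effective_fraction(distances, gamma=0.004))  # flat slope  -> 1.0
```

The same mechanism explains why a 1-nearest competitor (small d) keeps more tokens effective under a steep slope.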
Experiment II: Compensation With More Training Iterations • With 100% effective training data, apply more training iterations: • γ = 0.004, learning rate = 0.05 • Result: slow improvement compared with the best case (exact 1-nearest with γ = 0.1).
Experiment II: Compensation Using a Larger Learning Rate • Use a larger learning rate (0.05 -> 1.25) • Fix γ = 0.004 (100% effective training data) • Result: the 1-nearest approach is better than the 1-best approach after compensation.
Using a Larger Learning Rate (.) • Training performance: MCE loss versus # of training iterations (curves for 1-best, approx. 1-nearest, and exact 1-nearest).
Using a Larger Learning Rate (..) • Test performance: WER versus # of training iterations: • 1-best: 11.55% • Approx. 1-nearest: 10.70% • Exact 1-nearest: 10.79%
Conclusion • The 1-best and 1-nearest methods were compared in MCE training: • effect of the Sigmoid slope; • compensation when using a flat Sigmoid. • The 1-nearest method is better than the 1-best approach. • More trainable data are available with the 1-nearest approach. • The approx. and exact 1-nearest methods yield comparable performance.