Discriminative Training in Speech Processing

Filipp Korkmazsky, LORIA

Content
• Bayes Decision Theory and Discriminative Training
• Minimum Classification Error (MCE) Training
• Generalized Probabilistic Descent (GPD) Algorithm
• MCE Training versus Maximum Mutual Information (MMI) Training
• Discriminative Training for Speech Recognition
• Discriminative Training for Speaker Verification
• Discriminative Training for Feature Extraction
• Discriminative Training of Language Models
• Discriminative Training for Speech/Music Classification
• Conclusions

Bayes Decision Theory and Discriminative Training
• Main assumption of Bayes decision theory: the joint probability functions of the observation X and the class labels are known.
• Decision cost function: equation (1), followed by equations (2) to (5).
• MAP decision: equation (6).
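For reference, a standard textbook formulation of this setup is sketched below. The notation (M classes C_i, costs e_{ij}, conditional risk R, decision rule C(X)) is an assumption and is not guaranteed to match equations (1) to (6) of the slides one for one.

```latex
% Assumed notation: M classes C_1..C_M; e_{ij} is the cost of deciding C_i
% when the true class is C_j; C(X) is the decision rule.
\begin{align}
R(C_i \mid X) &= \sum_{j=1}^{M} e_{ij}\, P(C_j \mid X)
  && \text{(conditional risk)} \\
\mathcal{L} &= \int R\bigl(C(X) \mid X\bigr)\, p(X)\, dX
  && \text{(expected cost of the decision rule)} \\
e_{ij} &= \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}
  && \text{(zero-one cost)} \\
R(C_i \mid X) &= \sum_{j \neq i} P(C_j \mid X) = 1 - P(C_i \mid X)
  && \text{(risk under zero-one cost)} \\
C(X) &= C_i \quad \text{if } P(C_i \mid X) = \max_{j} P(C_j \mid X)
  && \text{(MAP decision)}
\end{align}
```

Under the zero-one cost, minimizing the conditional risk for every X is the same as choosing the class with the largest posterior, which is why the MAP rule is optimal when the true probabilities are known.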

Why is the MAP decision not optimal for real speech data?
• The probability distribution of speech data is usually unknown, and a postulated HMM approximation of this distribution does not yield a MAP-optimal solution.
• Even if the HMM were the correct distribution for speech, the lack of training data often prevents accurate modeling of the probability distributions of competing speech classes near their boundaries.

[Figure: real and postulated distributions for Class I and Class II]

Discriminative Functions
• Discriminative functions
• Classification error

Minimum Classification Error (MCE) Training
• Classification error for X

[Figure: sigmoid-shaped classification error, with axis ticks at 1, 0.5 and 0]
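A minimal sketch of how a smoothed classification error for a single observation can be computed, assuming the commonly used MCE form: a misclassification measure that compares the correct class score with a soft maximum over the competing scores, passed through a sigmoid. The function names and the constants `eta`, `gamma` and `theta` are hypothetical and are not taken from the slides.

```python
import math

def misclassification_measure(scores, correct, eta=2.0):
    """Return d(X): positive suggests an error, negative a correct decision.

    scores  : discriminant values g_i(X), e.g. class log-likelihoods
    correct : index of the true class
    eta     : assumed smoothing constant; a large eta approaches the
              score of the single best competitor
    """
    competitors = [g for i, g in enumerate(scores) if i != correct]
    # Soft maximum over competitors: (1/eta) * log(mean(exp(eta * g))),
    # computed relative to the largest score for numerical stability.
    top = max(competitors)
    mean_exp = sum(math.exp(eta * (g - top)) for g in competitors) / len(competitors)
    soft_max = top + math.log(mean_exp) / eta
    return -scores[correct] + soft_max

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth approximation of the 0/1 classification error, between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

# Example: three class log-likelihoods for one observation.
scores = [-10.0, -12.0, -15.0]
d_right = misclassification_measure(scores, correct=0)  # best class is the true one
d_wrong = misclassification_measure(scores, correct=2)  # worst class is the true one
print(d_right, sigmoid_loss(d_right))
print(d_wrong, sigmoid_loss(d_wrong))
```

Because the loss is differentiable in the scores, it can be minimized by gradient methods, unlike the raw error count.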

Generalized Probabilistic Descent (GPD) Algorithm
Quantities in the GPD update rule:
• a positive definite matrix
• the set of HMMs at step t of the GPD algorithm
• a speech sample (sentence, word, phone, frame) at step t of the GPD algorithm
Example: Gaussian mean correction by the GPD algorithm
• the mean for HMM i, state j, Gaussian mixture k, and a given feature dimension at step t of the GPD algorithm
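The mean correction can be illustrated on a toy problem. The sketch below assumes two classes modeled by one-dimensional, unit-variance Gaussians instead of HMMs, takes the positive definite matrix to be the identity, and uses a sigmoid loss; `gpd_step`, `step_size` and `gamma` are hypothetical names chosen for this illustration.

```python
import math

def gpd_step(means, x, label, step_size=0.1, gamma=1.0):
    """One GPD update of two class means for a single labeled sample.

    means : [mu_0, mu_1], means of unit-variance 1-D Gaussians (toy model)
    x     : the training sample presented at this step
    label : index (0 or 1) of the true class of x
    Returns the updated means and the loss before the update.
    """
    rival = 1 - label
    # Log-likelihoods up to a constant shared by both classes.
    g = [-0.5 * (x - mu) ** 2 for mu in means]
    d = -g[label] + g[rival]                      # misclassification measure
    loss = 1.0 / (1.0 + math.exp(-gamma * d))     # sigmoid loss
    slope = gamma * loss * (1.0 - loss)           # derivative of loss w.r.t. d
    new_means = list(means)
    # Gradient descent step (identity matrix assumed):
    # the correct mean moves toward x, the rival mean moves away from x.
    new_means[label] += step_size * slope * (x - means[label])
    new_means[rival] -= step_size * slope * (x - means[rival])
    return new_means, loss

means = [0.0, 1.0]
means, loss_before = gpd_step(means, x=0.6, label=0)
_, loss_after = gpd_step(means, x=0.6, label=0)
print(means, loss_before, loss_after)
```

The factor `loss * (1 - loss)` is largest when the sample lies near the decision boundary, so such samples produce the largest corrections.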

MCE Training versus Maximum Mutual Information Training

Maximizing mutual information corresponds to minimizing a special type of classification error. Unlike the general MCE procedure, maximizing mutual information does not give larger correction values to the parameters at the class boundaries. Minimizing the classification error gives better class separation at the class boundaries because of the shape of the sigmoid function.
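The difference can be made concrete for a two-class case with equal priors, where d is the rival log-likelihood minus the correct log-likelihood. Under that assumption, the per-sample MMI objective is -log P(correct | X) = log(1 + exp(d)), whose derivative with respect to d is a sigmoid, while the MCE sigmoid loss has a bell-shaped derivative centered at the boundary d = 0. The sketch below compares the two weights; the function names and `gamma` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mce_weight(d, gamma=1.0):
    """Derivative of the sigmoid loss with respect to d (peaks at d = 0)."""
    loss = sigmoid(gamma * d)
    return gamma * loss * (1.0 - loss)

def mmi_weight(d):
    """Derivative of -log P(correct | X) with respect to d.

    Assumes two classes with equal priors; equals 1 - P(correct | X),
    so it grows monotonically with d.
    """
    return sigmoid(d)

for d in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"d={d:+.1f}  MCE weight={mce_weight(d):.4f}  MMI weight={mmi_weight(d):.4f}")
```

In this setting the MMI weight keeps increasing for samples far on the wrong side of the boundary, whereas the MCE weight concentrates the correction on samples close to it.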

Discriminative Training for Speech Recognition
1. Discriminative training is based on comparing the likelihood scores estimated for single speech units (phones, words).
Examples:
• E-set vocabulary recognition (W. Chou, 1992), speaker-independent recognition (100 speakers)
  • ML training: 76% phone recognition accuracy
  • MCE/GPD training: 88% phone recognition accuracy
• Broadcast news phone string recognition (Korkmazsky, 2003)
  • ML training: 61.93% phone recognition accuracy
  • MCE/GPD training: 65.11% phone recognition accuracy

2. Discriminative training is based on comparing the likelihood scores estimated for strings of speech units (sentences).
Quantities compared: a true word string, and one of the N alternative word strings.
Examples:
• Recognition of connected digit strings of unknown length (Wu Chou, 1993)
  • ML training: 1.4% string error rate
  • MCE/GPD training: 0.95% string error rate
• Wireless noisy data digit string recognition (Korkmazsky, 1997)
  • ML training: 2.6% word error rate
  • MCE/GPD training: 1.4% word error rate
  • Generalized HMM MCE/GPD training: 1.0% word error rate

Discriminative Training for Speaker Verification
• Models: true talker and impostor HMMs
• Decision rule with two outcomes: X represents a true talker, or X represents an impostor
• A verification threshold

E[A]: the expectation of A
Example: speaker verification on a database consisting of 43 speakers, each having 5 training sentences (Korkmazsky, 1996)
• ML training: 4.40% equal error rate
• MCE/GPD training: 2.50% equal error rate
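The threshold decision and the equal error rate quoted above can be illustrated with a small sketch. It assumes that each utterance is summarized by one score, for example the log-likelihood ratio between the true talker model and the impostor models, and that the claim is accepted when the score reaches the threshold. The function names and the sample scores are hypothetical.

```python
def verify(score, threshold):
    """Accept the claimed identity when the score reaches the threshold."""
    return score >= threshold

def error_rates(true_scores, impostor_scores, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold."""
    fr = sum(not verify(s, threshold) for s in true_scores) / len(true_scores)
    fa = sum(verify(s, threshold) for s in impostor_scores) / len(impostor_scores)
    return fr, fa

def equal_error_rate(true_scores, impostor_scores):
    """Approximate EER: scan the observed scores as candidate thresholds
    and average FR and FA where they are closest to each other."""
    candidates = sorted(set(true_scores) | set(impostor_scores))
    fr, fa = min(
        (error_rates(true_scores, impostor_scores, t) for t in candidates),
        key=lambda rates: abs(rates[0] - rates[1]),
    )
    return (fr + fa) / 2.0

# Made-up scores for illustration only.
true_scores = [2.0, 1.5, 0.5, 3.0]
impostor_scores = [-1.0, 0.0, 1.0, -2.0]
print(equal_error_rate(true_scores, impostor_scores))
```

Raising the threshold trades false acceptances for false rejections; the equal error rate is the operating point where the two rates coincide.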

Discriminative Training for Feature Extraction
[Diagram: Feature Extractor, Acoustic Model, Language Model, Discriminative Training]

Examples:
• Discriminative filter bank design (Biem, Katagiri, 1996): Central filter bank frequencies were adjusted by MCE/GPD training. First, 128 FFT spectral coefficients were converted to 16 Mel spectrum coefficients using a conventional frequency scale. The models for 5 Japanese vowels were represented by frequency templates. Recognition accuracy in this experiment was 80.91%. After MCE/GPD adjustment of the central band frequencies, accuracy increased to 82.45%.
• Discriminative training of the lifter coefficients (Biem, Juang, 1997): Lifter coefficients weight the quefrency values after the cosine transform. Lifter weights were trained by adjusting neural network coefficients using the MCE criterion. The error rate for 5 Japanese vowels was reduced from 14.5% to 11.3%.
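To show what the lifter weights act on, here is a minimal sketch of liftering: a cosine transform of a log spectrum followed by an element-wise weighting of the resulting coefficients. Using an unnormalized DCT-II as the cosine transform is an assumption, as are the function names and the example values; the training of the weights themselves is not shown.

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II, used here as the cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def lifter(cepstrum, weights):
    """Weight each quefrency value by its lifter coefficient."""
    return [w * c for w, c in zip(weights, cepstrum)]

# Made-up log filter bank energies and lifter weights for illustration.
log_spectrum = [1.0, 2.0, 3.0, 2.0]
weights = [0.5, 1.0, 1.0, 0.8]
cepstrum = dct_ii(log_spectrum)
print(lifter(cepstrum, weights))
```

In a discriminative setup the entries of `weights` would be treated as trainable parameters and adjusted to reduce the classification loss rather than fixed in advance.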

Discriminative Training of Language Models
(Zhen Chen, Kai-Fu Lee (1999); Jeff Kuo, Hui Jiang (2002))
• correct word sequence

Discriminative correction of the bigram probabilities for all word pairs:
• the number of times a word pair appears in the word sequence
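A minimal sketch of such a correction, assuming the general MCE/GPD gradient form: each bigram log-probability is raised in proportion to how often the pair occurs in the correct word sequence and lowered in proportion to its weighted count in the competing sequences. This is not the exact formula from the slide; `correct_bigrams`, the competitor weights, and the constants are hypothetical, and renormalization and back-off handling are omitted.

```python
import math
from collections import Counter

def bigram_counts(words):
    """Count how many times each adjacent word pair appears in a sequence."""
    return Counter(zip(words, words[1:]))

def correct_bigrams(log_probs, reference, competitors, d, step_size=0.1, gamma=1.0):
    """Return a copy of log_probs after one discriminative correction.

    log_probs   : dict mapping (w1, w2) to a bigram log-probability
    reference   : the correct word sequence
    competitors : list of (word sequence, weight) pairs, weights summing to 1
    d           : misclassification measure for the utterance
    """
    loss = 1.0 / (1.0 + math.exp(-gamma * d))
    slope = gamma * loss * (1.0 - loss)
    ref_counts = bigram_counts(reference)
    comp_counts = [(bigram_counts(words), weight) for words, weight in competitors]
    updated = dict(log_probs)
    for pair in updated:  # pairs absent from the model are left out of this sketch
        expected = sum(weight * counts[pair] for counts, weight in comp_counts)
        updated[pair] += step_size * slope * (ref_counts[pair] - expected)
    return updated

# Made-up model and hypotheses for illustration.
model = {("fly", "to"): -1.0, ("fly", "two"): -1.0}
new_model = correct_bigrams(
    model,
    reference=["fly", "to", "boston"],
    competitors=[(["fly", "two", "boston"], 1.0)],
    d=0.0,
)
print(new_model)
```

Pairs that occur equally often in the correct and competing sequences receive no correction, so only the bigrams that distinguish the hypotheses are changed.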

DARPA Communicator Project (air travel reservation system)
Baseline language model: 900 unigrams and 41K bigrams
Baseline LM perplexity = 34; perplexity after DT = 35

                                          Baseline LM   After DT
Word error rate, training sentences       19.7%         17.5%
Word error rate, test sentences           19.7%         19.0%
Sentence error rate, training sentences   30.9%         26.4%
Sentence error rate, test sentences       30.9%         29.0%

Discriminative Training for Speech/Music Classification (Korkmazsky, 2003)
• Speech class: speech, speech with music in the background, speech with song in the background
• Nonspeech class: music, song, noise (aspiration, cough, laugh)

Quantities used in training:
• block classification error
• frame classification error
• the total number of frames in the block
• a set of 6 GMMs
Results:
• Frame labeling accuracy for ML-trained GMMs: 90.5%
• Frame labeling accuracy for MCE-trained GMMs: 92.7%

Conclusions
• Maximum likelihood training often does not provide optimal speech classification because the real distribution of speech data is unknown.
• Discriminative training usually improves speech classification over ML training.
• Discriminative training may provide recognition performance comparable to ML training while using a smaller number of model parameters.
• Many newer classification methods (such as SVMs or boosting) are discriminative.
