This paper discusses the adaptation of semi-continuous Hidden Markov Models (HMMs) using maximum likelihood methods. It also explores the application of Probabilistic Latent Semantic Analysis (PLSA) for adaptation in both speech recognition and information retrieval systems. The evaluation results show the effectiveness of the proposed adaptation techniques.
Maximum Likelihood Adaptation of Semi-Continuous HMMs by Latent Variable Decomposition of State Distributions LTI Student Research Symposium 2004 Antoine Raux Work done in collaboration with Rita Singh
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
HMMs for Speech Recognition • Generative probabilistic model of speech • States represent sub-phonemic units • In general, 2 types of parameters: • Temporal aspect: transition probabilities • Spectral aspect: output distributions (means, variances, mixing weights of mixtures of Gaussians) • 2 broad types of structure: • Continuous Density • Semi-Continuous
Continuous Density HMMs • [Diagram: each state Si has its own private Gaussians N(mi1,vi1), N(mi2,vi2), N(mi3,vi3), mixed with state-specific weights wi1 = P(Ci1|S=Si), wi2, wi3; likewise for states Sj and Sk.]
Semi-Continuous HMMs • [Diagram: all states share a single codebook of Gaussians N(m1,v1)…N(m7,v7); each state Si keeps only its own mixture weights wi1…wi7 over the shared codebook.]
SCHMMs vs CDHMMs • Less powerful (CDHMMs do better with large amounts of training data) • BUT faster to compute (fewer Gaussian evaluations, as sketched below) and train well on less data • Training of the codebook and of the mixture weights can be decoupled
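To make the computational difference concrete, here is a small illustrative sketch (my own toy code, assuming numpy and scipy are available; not the actual recognizer implementation): a CDHMM evaluates every state's private Gaussians, whereas an SCHMM evaluates the shared codebook once per frame and then only mixes those values with per-state weights.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cdhmm_state_likelihood(x, means, covs, weights):
    """CDHMM output density: each state evaluates its own private Gaussians."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def schmm_state_likelihoods(x, codebook_means, codebook_covs, state_weights):
    """SCHMM output densities: the shared codebook Gaussians are evaluated
    once per frame, then every state mixes those values with its own
    weight vector (state_weights has shape [n_states, n_codewords])."""
    codebook_pdfs = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                              for m, c in zip(codebook_means, codebook_covs)])
    return state_weights @ codebook_pdfs
```

With thousands of tied states, evaluating the codebook once per frame instead of once per state is where the "fewer Gaussian computations" advantage of the semi-continuous structure comes from.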
Acoustic Adaptation • Both CDHMMs and SCHMMs need a large amount of data for training • Such amounts are not always available for some conditions (domain, speakers, environment) • Acoustic Adaptation: modify models trained on a large amount of data to match different conditions using a small amount of data
Model-based (ML) Adaptation • Tie the parameters of different states so that all states can be adapted with little data • Typical method: Maximum Likelihood Linear Regression (MLLR) used to adapt means and variances of CDHMMs
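For reference (standard background on MLLR, not specific to this work), the mean update applies a shared affine transform, estimated by maximum likelihood from the adaptation data, to every Gaussian mean in a regression class:

adapted mean: m̂ = A·m + b (A and b shared by all Gaussians in the class)

Because many Gaussians share the same A and b, a few minutes of adaptation data are enough to move them all at once; the next slide argues that no such simple linear tying is available for SCHMM mixture weights.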
Adapting Mixture Weights • Problem: MLLR does not work for the mixture weights of SCHMMs • Weights are not evenly distributed (their sum always equals 1) • Standard clustering algorithms are ineffective • Goal: tie states with similar weight distributions
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Parallel with Information Retrieval • Typical problem in Information Retrieval: identify similar documents • Documents can be represented as distributions over the vocabulary: tie documents with similar word distributions
Word Document Representation • [Diagram: each document Di is represented as a distribution over the words Word1…Word7, with weights wi1…wi7; likewise for documents Dj and Dk.]
Problems with Word Document Representation • Word distribution for a document is sparse • Ambiguous words, synonyms… • Cannot reliably compare distributions to compare documents
PLSA for IR • Solution proposed by Hofmann (1999): Probabilistic Latent Semantic Analysis • Express documents and words as distributions over a latent variable (topic?) • The latent variable takes a small number of values compared to the number of words/documents • Similar to standard LSA, but guarantees proper probability distributions
PLSA for IR • [Diagram: words Word1…Word7 are linked to documents Di, Dj, Dk through latent topics Z1…Z4, with wz11 = P(Word1|Z=Z1) on the word side and wdi1 = P(Z1|D=Di) on the document side.]
PLSA Decomposition • Decompose the joint probability (independence assumption!): P̂(d,w) = Σz P(z) P(d|z) P(w|z) • P̂(d,w) lies on a sub-space of the probability simplex (the PLS-space) • Estimate the parameters with the EM algorithm so as to minimize the KL-divergence between P(d,w) and P̂(d,w)
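A minimal numerical sketch of the symmetric PLSA EM updates (my own illustrative code, not Hofmann's or the authors' implementation), factoring a document-by-word count matrix as P̂(d,w) = Σz P(z) P(d|z) P(w|z):

```python
import numpy as np

def plsa_decompose(counts, n_topics, n_iter=50, seed=0):
    """EM for symmetric PLSA. `counts` is a (n_docs, n_words) matrix of
    co-occurrence counts n(d, w); returns P(z), P(d|z), P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_given_z = rng.dirichlet(np.ones(n_docs), size=n_topics)   # (Z, D)
    p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # (Z, W)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (Z, D, W)
        joint = (p_z[:, None, None]
                 * p_d_given_z[:, :, None]
                 * p_w_given_z[:, None, :])
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate the three factors from expected counts.
        exp_counts = post * counts[None, :, :]
        p_z = exp_counts.sum(axis=(1, 2))
        p_d_given_z = exp_counts.sum(axis=2)
        p_w_given_z = exp_counts.sum(axis=1)
        p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z /= p_z.sum()
    return p_z, p_d_given_z, p_w_given_z
```

Each EM iteration is guaranteed not to increase the KL-divergence between the empirical P(d,w) and the low-rank reconstruction, which is exactly the "projection onto the PLS-space" used later for the mixture weights.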
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Back to Speech Recognition… • [Diagram: the SCHMM again — shared codebook Gaussians N(m1,v1)…N(m7,v7), with each state Si holding its own mixture weights wi1…wi7.]
PLSA for SCHMMs • [Diagram: codebook Gaussians N(m1,v1)…N(m7,v7) are linked to states Si, Sj, Sk through latent variables Z1…Z4, with wz11 = P(C1|Z=Z1) on the codeword side and wsi1 = P(Z1|S=Si) on the state side.]
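Reading the arrow labels off the diagram above, a state's mixture weight for codeword Ck is recovered by marginalizing over the latent variable (my rendering, in the deck's notation):

wik = P(Ck|S=Si) = Σz P(z|S=Si) · P(Ck|Z=z)

so only (#states × #topics) + (#topics × #codewords) parameters remain instead of a full #states × #codewords weight table, which is what makes adaptation with little data feasible.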
Adaptation through PLSA • [Flow diagram:] Large database → train SCHMM (Baum-Welch) → Transitions I, Means I, Variances I, Weights I • Decompose Weights I using PLSA → P(Z) I, P(C|Z) I, P(S|Z) I • Small database → retrain SCHMM (Baum-Welch) → Transitions II, Means II, Variances II, Weights II • Decompose Weights II using PLSA → P(Z) I/II, P(C|Z) I/II, P(S|Z) I/II • Recompose Weights → Transitions II, Means II, Variances II, Weights III
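As a rough sketch of the recomposition step in the flow above (again my own illustrative code, reusing the hypothetical `plsa_decompose` function from the PLSA Decomposition slide, with states playing the role of documents and codewords the role of words; which factors are carried over from the original models versus re-estimated is exactly what the evaluation below compares):

```python
import numpy as np

def recompose_weights(p_z, p_s_given_z, p_c_given_z):
    """Rebuild the (n_states, n_codewords) mixture-weight table from the
    PLSA factors: P(c|s) = sum_z P(z) P(s|z) P(c|z) / P(s)."""
    joint_sc = np.einsum('z,zs,zc->sc', p_z, p_s_given_z, p_c_given_z)
    return joint_sc / (joint_sc.sum(axis=1, keepdims=True) + 1e-12)

# Hypothetical usage (variable names and topic count are illustrative):
# weights_II: raw mixture weights re-estimated by Baum-Welch on the small
# adaptation corpus, one row per state.
# p_z, p_s_given_z, p_c_given_z = plsa_decompose(weights_II, n_topics=64)
# weights_III = recompose_weights(p_z, p_s_given_z, p_c_given_z)
```

Passing the re-estimated weights through the low-rank PLSA factorization and back is what produces the smoothing effect mentioned in the conclusion.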
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Evaluation Experiment • Training data/Original models • 50 hours of calls to the Communicator system • Mostly native speakers • 4000 states, 256 Gaussian components • Adaptation data • 3 hours of calls to the Let’s Go system • Non-native speakers • Evaluation data • 449 utterances (20 min) from calls to Let’s Go • Non-native speakers
Evaluation results • [Chart: baseline models — Transitions I, Means I, Variances I, Weights I]
Evaluation results • [Chart: models retrained on the adaptation data — Transitions II, Means II, Variances II, Weights II]
Evaluation results • [Chart: PLSA-adapted models — Transitions II, Means II, Variances II, Weights III] • Best result: readapt everything!
Reestimating all three distributions: P(Z), P(C|Z), and P(S|Z) • [Flow diagram, as before: train the SCHMM on the large database and decompose Weights I with PLSA; retrain on the small database, re-estimate all three PLSA factors, and recompose them into Weights III.]
Conclusion • PLSA ties the states of SCHMMs by introducing a latent variable • PLSA adaptation improves recognition accuracy • The best method is equivalent to smoothing the retrained weight distributions by projecting them onto the PLS-space • Future direction: learn the PLSA parameters directly during Baum-Welch training
Thank you… Questions?