This paper discusses the adaptation of semi-continuous Hidden Markov Models (HMMs) using maximum likelihood methods. It also explores the application of Probabilistic Latent Semantic Analysis (PLSA) for adaptation in both speech recognition and information retrieval systems. The evaluation results show the effectiveness of the proposed adaptation techniques.
Maximum Likelihood Adaptation of Semi-Continuous HMMs by Latent Variable Decomposition of State Distributions LTI Student Research Symposium 2004 Antoine Raux Work done in collaboration with Rita Singh
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
HMMs for Speech Recognition • Generative probabilistic model of speech • States represent sub-phonemic units • In general, 2 types of parameters: • Temporal aspect: transition probabilities • Spectral aspect: output distributions (means, variances, mixing weights of mixtures of Gaussians) • 2 broad types of structure: • Continuous Density • Semi-Continuous
Continuous Density HMMs • [Diagram: each state Si has its own private Gaussians N(mi1,vi1), N(mi2,vi2), N(mi3,vi3), mixed with state-specific weights wi1 = P(Ci1|S=Si), wi2, wi3; likewise for states Sj and Sk.]
Semi-Continuous HMMs • [Diagram: all states share a single codebook of Gaussians N(m1,v1)…N(m7,v7); each state Si keeps only its own mixture weights wi1…wi7 over the shared codebook.]
SCHMMs vs CDHMMs • Less powerful (CDHMMs do better with large amounts of training data) • BUT faster to compute (fewer Gaussian evaluations, as sketched below) and train well on less data • Training of the codebook and of the mixture weights can be decoupled
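To make the computational difference concrete, here is a small illustrative sketch (my own toy code, assuming numpy and scipy are available; not the actual recognizer implementation): a CDHMM evaluates every state's private Gaussians, whereas an SCHMM evaluates the shared codebook once per frame and then only mixes those values with per-state weights.

```python
import numpy as np
from scipy.stats import multivariate_normal

def cdhmm_state_likelihood(x, means, covs, weights):
    """CDHMM output density: each state evaluates its own private Gaussians."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def schmm_state_likelihoods(x, codebook_means, codebook_covs, state_weights):
    """SCHMM output densities: the shared codebook Gaussians are evaluated
    once per frame, then every state mixes those values with its own
    weight vector (state_weights has shape [n_states, n_codewords])."""
    codebook_pdfs = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                              for m, c in zip(codebook_means, codebook_covs)])
    return state_weights @ codebook_pdfs
```

With thousands of tied states, evaluating the codebook once per frame instead of once per state is where the "fewer Gaussian computations" advantage of the semi-continuous structure comes from.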
Acoustic Adaptation • Both CDHMMs and SCHMMs need a large amount of data for training • Such amounts are not always available for some conditions (domain, speakers, environment) • Acoustic Adaptation: modify models trained on a large amount of data to match different conditions using a small amount of data
Model-based (ML) Adaptation • Tie the parameters of different states so that all states can be adapted with little data • Typical method: Maximum Likelihood Linear Regression (MLLR) used to adapt means and variances of CDHMMs
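For reference (standard background on MLLR, not specific to this work), the mean update applies a shared affine transform, estimated by maximum likelihood from the adaptation data, to every Gaussian mean in a regression class:

adapted mean: m̂ = A·m + b (A and b shared by all Gaussians in the class)

Because many Gaussians share the same A and b, a few minutes of adaptation data are enough to move them all at once; the next slide argues that no such simple linear tying is available for SCHMM mixture weights.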
Adapting Mixture Weights • Problem: MLLR does not work for the mixture weights of SCHMMs • Weights are not evenly distributed (their sum always equals 1) • Standard clustering algorithms are ineffective • Goal: tie states with similar weight distributions
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Parallel with Information Retrieval • Typical problem in Information Retrieval: identify similar documents • Documents can be represented as distributions over the vocabulary: tie documents with similar word distributions
Word Document Representation • [Diagram: each document Di is represented as a distribution over the words Word1…Word7, with weights wi1…wi7; likewise for documents Dj and Dk.]
Problems with Word Document Representation • Word distribution for a document is sparse • Ambiguous words, synonyms… • Cannot reliably compare distributions to compare documents
PLSA for IR • Solution proposed by Hofmann (1999): Probabilistic Latent Semantic Analysis • Express documents and words as distributions over a latent variable (topic?) • The latent variable takes a small number of values compared to the number of words/documents • Similar to standard LSA, but guarantees proper probability distributions
PLSA for IR • [Diagram: words Word1…Word7 are linked to documents Di, Dj, Dk through latent topics Z1…Z4, with wz11 = P(Word1|Z=Z1) on the word side and wdi1 = P(Z1|D=Di) on the document side.]
PLSA Decomposition • Decompose the joint probability (independence assumption!): P̂(d,w) = Σz P(z) P(d|z) P(w|z) • P̂(d,w) lies on a sub-space of the probability simplex (the PLS-space) • Estimate the parameters with the EM algorithm so as to minimize the KL-divergence between P(d,w) and P̂(d,w)
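A minimal numerical sketch of the symmetric PLSA EM updates (my own illustrative code, not Hofmann's or the authors' implementation), factoring a document-by-word count matrix as P̂(d,w) = Σz P(z) P(d|z) P(w|z):

```python
import numpy as np

def plsa_decompose(counts, n_topics, n_iter=50, seed=0):
    """EM for symmetric PLSA. `counts` is a (n_docs, n_words) matrix of
    co-occurrence counts n(d, w); returns P(z), P(d|z), P(w|z)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_given_z = rng.dirichlet(np.ones(n_docs), size=n_topics)   # (Z, D)
    p_w_given_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # (Z, W)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (Z, D, W)
        joint = (p_z[:, None, None]
                 * p_d_given_z[:, :, None]
                 * p_w_given_z[:, None, :])
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate the three factors from expected counts.
        exp_counts = post * counts[None, :, :]
        p_z = exp_counts.sum(axis=(1, 2))
        p_d_given_z = exp_counts.sum(axis=2)
        p_w_given_z = exp_counts.sum(axis=1)
        p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z /= p_z.sum()
    return p_z, p_d_given_z, p_w_given_z
```

Each EM iteration is guaranteed not to increase the KL-divergence between the empirical P(d,w) and the low-rank reconstruction, which is exactly the "projection onto the PLS-space" used later for the mixture weights.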
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Back to Speech Recognition… • [Diagram: the SCHMM again — shared codebook Gaussians N(m1,v1)…N(m7,v7), with each state Si holding its own mixture weights wi1…wi7.]
PLSA for SCHMMs • [Diagram: codebook Gaussians N(m1,v1)…N(m7,v7) are linked to states Si, Sj, Sk through latent variables Z1…Z4, with wz11 = P(C1|Z=Z1) on the codeword side and wsi1 = P(Z1|S=Si) on the state side.]
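Reading the arrow labels off the diagram above, a state's mixture weight for codeword Ck is recovered by marginalizing over the latent variable (my rendering, in the deck's notation):

wik = P(Ck|S=Si) = Σz P(z|S=Si) · P(Ck|Z=z)

so only (#states × #topics) + (#topics × #codewords) parameters remain instead of a full #states × #codewords weight table, which is what makes adaptation with little data feasible.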
Adaptation through PLSA • [Flow diagram:] Large database → train SCHMM (Baum-Welch) → Transitions I, Means I, Variances I, Weights I • Decompose Weights I using PLSA → P(Z) I, P(C|Z) I, P(S|Z) I • Small database → retrain SCHMM (Baum-Welch) → Transitions II, Means II, Variances II, Weights II • Decompose Weights II using PLSA → P(Z) I/II, P(C|Z) I/II, P(S|Z) I/II • Recompose Weights → Transitions II, Means II, Variances II, Weights III
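As a rough sketch of the recomposition step in the flow above (again my own illustrative code, reusing the hypothetical `plsa_decompose` function from the PLSA Decomposition slide, with states playing the role of documents and codewords the role of words; which factors are carried over from the original models versus re-estimated is exactly what the evaluation below compares):

```python
import numpy as np

def recompose_weights(p_z, p_s_given_z, p_c_given_z):
    """Rebuild the (n_states, n_codewords) mixture-weight table from the
    PLSA factors: P(c|s) = sum_z P(z) P(s|z) P(c|z) / P(s)."""
    joint_sc = np.einsum('z,zs,zc->sc', p_z, p_s_given_z, p_c_given_z)
    return joint_sc / (joint_sc.sum(axis=1, keepdims=True) + 1e-12)

# Hypothetical usage (variable names and topic count are illustrative):
# weights_II: raw mixture weights re-estimated by Baum-Welch on the small
# adaptation corpus, one row per state.
# p_z, p_s_given_z, p_c_given_z = plsa_decompose(weights_II, n_topics=64)
# weights_III = recompose_weights(p_z, p_s_given_z, p_c_given_z)
```

Passing the re-estimated weights through the low-rank PLSA factorization and back is what produces the smoothing effect mentioned in the conclusion.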
Outline • CDHMMs, SCHMMs, and Adaptation • A Little Visit to IR • PLSA Adaptation Scheme • Evaluation
Evaluation Experiment • Training data/Original models • 50 hours of calls to the Communicator system • Mostly native speakers • 4000 states, 256 Gaussian components • Adaptation data • 3 hours of calls to the Let’s Go system • Non-native speakers • Evaluation data • 449 utterances (20 min) from calls to Let’s Go • Non-native speakers
Evaluation results • [Chart: baseline models — Transitions I, Means I, Variances I, Weights I]
Evaluation results • [Chart: models retrained on the adaptation data — Transitions II, Means II, Variances II, Weights II]
Evaluation results • [Chart: PLSA-adapted models — Transitions II, Means II, Variances II, Weights III] • Best result: readapt everything!
Reestimating all three distributions: P(Z), P(C|Z), and P(S|Z) • [Flow diagram, as before: train the SCHMM on the large database and decompose Weights I with PLSA; retrain on the small database, re-estimate all three PLSA factors, and recompose them into Weights III.]
Conclusion • PLSA ties the states of SCHMMs by introducing a latent variable • PLSA adaptation improves recognition accuracy • The best method is equivalent to smoothing the retrained weight distributions by projecting them onto the PLS-space • Future direction: learn the PLSA parameters directly during Baum-Welch training
Thank you… Questions?