Bayesian Adaptation in HMM Training and Decoding Using a Mixture of Feature Transforms Stavros Tsakalidis and Spyros Matsoukas
Motivation • Goal: Develop a training procedure that overcomes SAT limitations • SAT limitations: • A point estimate of the transform is found for each speaker • Speaker clusters remain fixed throughout training • SAT modeling accuracy depends on the selection of the clusters • Transforms are not integrated in the training/decoding procedure • The transforms estimated on the training set are not used in decoding • A new set of ML transforms is estimated for the test-set speakers • Potential mismatch when using discriminatively trained SAT
Bayesian Speaker Adaptive Training (BSAT) • Use a discrete distribution over transforms rather than a point estimate • Decoding criterion: p(X | λ) = Σ_k P(k) p(X | λ, W_k) • Decoding: weighted sum of the likelihoods under each transform W_k in the mixture, weighted by the transform priors P(k) • Acoustic model training: Estimate the set of transforms, the transform priors and the HMM parameters
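As a rough, non-authoritative illustration of this criterion (not code from the paper), the sketch below computes a per-frame BSAT log-likelihood by marginalizing over the CMLLR transforms; the transform list, priors, and diagonal-covariance GMM parameters are hypothetical names introduced here for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def diag_gauss_loglik(x, mean, var):
    """Log-density of x under a single diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def bsat_frame_loglik(x, transforms, log_priors, means, variances, log_weights):
    """
    Per-frame BSAT log-likelihood (illustrative):
        log p(x) = log sum_k P(k) * |A_k| * p(A_k x + b_k | state GMM)
    transforms : list of (A_k, b_k) CMLLR feature transforms
    log_priors : log transform priors, one per transform
    means, variances, log_weights : diagonal GMM parameters of one HMM state
    """
    per_transform = []
    for (A, b), log_pk in zip(transforms, log_priors):
        y = A @ x + b                                  # transformed feature
        log_jac = np.linalg.slogdet(A)[1]              # Jacobian term log|A_k|
        gmm = logsumexp([w + diag_gauss_loglik(y, m, v)
                         for w, m, v in zip(log_weights, means, variances)])
        per_transform.append(log_pk + log_jac + gmm)
    return logsumexp(per_transform)                    # weighted sum over transforms
```

The log|A_k| term is the Jacobian that a feature-space (CMLLR) transform introduces when the likelihood of the original observation is evaluated through the transform.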
BSAT Details • Transforms are shared across all utterances • Transforms are not speaker-dependent as in SAT • No need to group the utterances into speaker clusters • Avoids locally optimal solutions related to speaker clustering • BSAT can be trained under Maximum Likelihood or discriminative criteria • Discriminatively trained transforms are used directly in decoding • BSAT treats the mixture of transforms similarly to gaussian mixture training • Transforms and gaussians are built incrementally • Transform splitting is an open question; it is not as easy as gaussian splitting
Transform Splitting • Challenge: Find meaningful perturbations of a transform to obtain initial estimates • Workaround: Cluster the utterances and estimate an initial transform for each cluster • Bias clustering • Idea: Group utterances that have similar transforms • Challenge: Estimating a full transform matrix per utterance is impractical • Solution: Estimate a bias term for each utterance • Cluster the biases and estimate an initial transform for each cluster • Issue: Biases have almost zero variance due to mean normalization of the features • Feature clustering • Use a K-Means procedure to cluster the utterances, with each object corresponding to one utterance • Estimate an initial transform for each cluster (a sketch follows below)
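A minimal sketch of the feature-clustering initialization, under the assumption that each utterance is summarized by its mean feature vector (the exact per-utterance representation is not specified here); all names are illustrative rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def feature_cluster_utterances(utterance_features, num_clusters):
    """
    K-means initialization for transform splitting (illustrative):
    each utterance is reduced to one summary vector, and the resulting
    clusters define the data used to estimate one initial CMLLR
    transform per cluster.
    utterance_features : list of (T_i, D) arrays, one per utterance
    Returns a cluster label for each utterance.
    """
    # One summary point per utterance (assumption: mean over frames)
    summaries = np.stack([feats.mean(axis=0) for feats in utterance_features])
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(summaries)
```

Each cluster's utterances would then be pooled to estimate one initial transform, providing the perturbed starting points that simple transform splitting does not.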
BSAT Estimation Procedure • Start with a single identity CMLLR transform • Split transform(s) • Update transforms and transform priors (3 iterations) • Update gaussian parameters • Repeat the splitting/update steps until the desired number of transforms is reached
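The loop below is only a schematic rendering of this procedure; the split and update steps are passed in as hypothetical callables because their exact estimation formulas are not given here, and the model is assumed to expose its feature dimensionality.

```python
import numpy as np

def bsat_train(model, data, steps, target_num_transforms, em_iters=3):
    """
    Schematic BSAT estimation loop (a sketch, not the authors' code).
    `steps` bundles hypothetical callables:
        steps["split"], steps["update_transforms"], steps["update_gaussians"]
    """
    feature_dim = model["feature_dim"]
    # Start from a single identity CMLLR transform [A | b] with prior 1.0
    transforms = [(np.eye(feature_dim), np.zeros(feature_dim))]
    priors = np.array([1.0])

    while len(transforms) < target_num_transforms:
        # Grow the mixture, e.g. by clustering utterances for initial estimates
        transforms, priors = steps["split"](transforms, priors, data)

        # A few EM-style passes refining the transforms and their priors
        for _ in range(em_iters):
            transforms, priors = steps["update_transforms"](
                model, transforms, priors, data)

        # Re-estimate the HMM gaussian parameters given the transform mixture
        model = steps["update_gaussians"](model, transforms, priors, data)

    return model, transforms, priors
```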
Experiment Setup • Training/Test set • Training: 150 hrs of Arabic BN data • Test: bnat05 test set • Acoustic model • Seed BSAT estimation from a well-trained SI model • 12 mixtures per state, 1762 states, 24K total Gaussians • Decoding procedure • Find the 1-best hypothesis from the baseline unadapted decoding • Select the transform that gives the highest likelihood on the 1-best hypothesis • Rescore lattice created by the baseline unadapted decoding
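A hedged sketch of the transform-selection step of this decoding procedure, assuming a hypothetical frame_loglik_fn that scores one frame against one HMM state under a CMLLR transform (including the Jacobian term):

```python
import numpy as np

def select_transform(transforms, frame_loglik_fn, frames, one_best_states):
    """
    Decoding-time transform selection (illustrative): score the 1-best
    hypothesis from the unadapted pass under every transform in the
    mixture and keep the one with the highest total log-likelihood.
    frame_loglik_fn(x, state, A, b) is assumed to return
    log|A| + log p(A x + b | state) for a single frame.
    """
    totals = []
    for A, b in transforms:
        totals.append(sum(frame_loglik_fn(x, s, A, b)
                          for x, s in zip(frames, one_best_states)))
    best = int(np.argmax(totals))
    # The chosen transform is then applied when rescoring the lattice
    # produced by the baseline unadapted decoding.
    return transforms[best]
```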
BSAT Training • [Figure: training likelihood for the two clustering procedures; numbers in boxes indicate the number of transforms] • Both clustering procedures yield comparable likelihood
BSAT Decoding • Both BSAT systems yield comparable WER • BSAT: 1% absolute gain using only 16 transforms • SI: 0.9% absolute gain from doubling the number of parameters • WER reaches a plateau as the number of transforms in the mixture increases • [Figure: WER results; numbers in boxes indicate the number of transforms]
Conclusions • Integrated the transforms into the training/decoding procedure • Discriminatively trained transforms can be used in decoding • Preliminary results show that BSAT improves SI model performance with as few as 16 transforms • Future work • Improve transform splitting • Apply transform splitting concurrently with gaussian splitting • Use top-N transforms in decoding • Use MLLR transforms rather than CMLLR transforms • Use discriminative estimation criteria rather than ML