Bayesian Adaptation in HMM Training and Decoding Using a Mixture of Feature Transforms

Stavros Tsakalidis and Spyros Matsoukas

Presentation Transcript


  1. Bayesian Adaptation in HMM Training and Decoding Using a Mixture of Feature Transforms
     Stavros Tsakalidis and Spyros Matsoukas

  2. Motivation
     • Goal: Develop a training procedure that overcomes SAT limitations
     • SAT limitations:
        • A point estimate of the transform for each speaker is found
        • Speaker clusters remain fixed throughout training
        • SAT modeling accuracy depends on the selection of the clusters
        • Transforms are not integrated in the training/decoding procedure
           • The transforms in the training-set are not used in decoding
           • A new set of ML transforms is estimated for the test set speakers
           • Potential mismatch when using discriminatively trained SAT

  3. Bayesian Speaker Adaptive Training (BSAT)
     • Use a discrete distribution rather than a point estimate
     • Decoding criterion: (equation not preserved in this transcript)
     • Decoding: Weighted sum of likelihoods under each transform in the mixture
     • Acoustic model training: Estimate set of transforms, transform priors and HMM parameters
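The "weighted sum of likelihoods" on this slide can be illustrated with a minimal sketch. It assumes the sum is taken per utterance, i.e. p(O | W) = sum_k pi_k * p(O | W, T_k), which the transcript does not confirm because the original equation is missing. The function names and the numbers below are hypothetical; the per-transform log-likelihoods would come from the acoustic model.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def mixture_log_likelihood(log_priors, log_likes):
    """Return log sum_k pi_k * p(O | W, T_k).

    log_priors[k] is log pi_k (the transform prior) and log_likes[k] is the
    log-likelihood of the utterance under transform k. Both are assumed
    inputs; this sketch does not compute them.
    """
    return log_sum_exp([lp + ll for lp, ll in zip(log_priors, log_likes)])


# Hypothetical values: three transforms with priors 0.5, 0.3, 0.2.
priors = [0.5, 0.3, 0.2]
log_likes = [-1200.0, -1195.0, -1210.0]
score = mixture_log_likelihood([math.log(p) for p in priors], log_likes)
```

Working in the log domain matters here because utterance-level likelihoods are far too small to exponentiate directly.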

  4. BSAT Details
     • Transforms are shared across all utterances
        • Transforms are not speaker-dependent as in SAT
        • No need to group the utterances into speaker clusters
        • Avoid locally optimal solutions related to speaker clustering
     • BSAT can be trained under Maximum Likelihood or discriminative criteria
        • Discriminatively trained transforms are used directly in decoding
     • BSAT treats the mixture of transforms similarly to Gaussian mixture training
        • Transforms and Gaussians are built incrementally
        • Transform splitting is an open question; not as easy as Gaussian splitting

  5. Transform Splitting
     • Challenge: Find meaningful perturbations of a transform to obtain initial estimates
     • Workaround: Cluster the utterances and estimate initial transforms for each cluster
     • Bias clustering
        • Idea: Group utterances that have similar transforms
        • Challenge: Estimation of a full matrix is impractical
        • Solution: Estimate a bias term for each utterance
        • Cluster biases and estimate an initial transform for each cluster
        • Issue: Biases have almost zero variance due to mean normalization of features
     • Feature clustering
        • Use a K-Means procedure to cluster the utterances
        • Each object in K-Means corresponds to an utterance
        • Estimate an initial transform for each cluster
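The feature-clustering step can be sketched with a plain K-Means over utterances. The slides do not say how an utterance is summarized as a single object, so this sketch assumes each utterance has already been reduced to one fixed-length vector; the `kmeans` function, its parameters, and the toy vectors are all hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (one vector per utterance) into k groups.

    Returns (assignments, centroids). Initial centroids are k distinct
    points drawn with a seeded random generator for repeatability.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each utterance goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy utterance-level vectors forming two well-separated groups.
utterance_vectors = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
assignments, centroids = kmeans(utterance_vectors, k=2)
```

Each resulting cluster would then supply the utterances used to estimate one initial transform, as the slide describes.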

  6. BSAT Estimation Procedure
     • Start with a single identity CMLLR transform
     • Split transform(s)
     • Update transforms and transform priors
     • 3 iterations
     • Update Gaussian parameters
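The slides state that BSAT treats the mixture of transforms like Gaussian mixture training, so one plausible reading of "update transform priors" is the usual EM weight update: compute each transform's posterior per utterance, then average. The sketch below shows only that assumed prior update; it is not confirmed by the transcript, and the function names and numbers are hypothetical.

```python
import math


def transform_posteriors(log_priors, log_likes):
    """Posterior P(transform k | utterance) from log priors and log-likelihoods."""
    joint = [lp + ll for lp, ll in zip(log_priors, log_likes)]
    m = max(joint)
    unnormalized = [math.exp(j - m) for j in joint]  # shift by max for stability
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def update_priors(priors, per_utterance_log_likes):
    """One EM-style update: new prior k = average posterior of transform k.

    priors must all be strictly positive. per_utterance_log_likes[u][k] is
    the log-likelihood of utterance u under transform k (assumed given).
    """
    log_priors = [math.log(p) for p in priors]
    accumulators = [0.0] * len(priors)
    for log_likes in per_utterance_log_likes:
        posteriors = transform_posteriors(log_priors, log_likes)
        for k, gamma in enumerate(posteriors):
            accumulators[k] += gamma
    count = len(per_utterance_log_likes)
    return [a / count for a in accumulators]


# Hypothetical log-likelihoods for three utterances under two transforms.
old_priors = [0.5, 0.5]
log_likes = [[-100.0, -110.0], [-105.0, -100.0], [-90.0, -120.0]]
new_priors = update_priors(old_priors, log_likes)
```

Updating the transform matrices themselves requires CMLLR sufficient statistics and is outside the scope of this sketch.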

  7. Experiment Setup
     • Training/Test set
        • Training: 150 hrs of Arabic BN data
        • Test: bnat05 test set
     • Acoustic model
        • Seed BSAT estimation from a well-trained SI model
        • 12 mixtures per state, 1762 states, 24K total Gaussians
     • Decoding procedure
        • Find the 1-best hypothesis from the baseline unadapted decoding
        • Select the transform that gives the highest likelihood on the 1-best hypothesis
        • Rescore lattice created by the baseline unadapted decoding
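The transform-selection step of the decoding procedure can be sketched as an argmax over the mixture's transforms. A CMLLR transform maps a feature vector x to Ax + b and contributes a log|det A| term per frame to the log-likelihood. Everything else here is an assumption for illustration: `score_fn` is a hypothetical stand-in for the frame log-likelihood along the 1-best alignment, and the caller supplies log|det A| precomputed.

```python
def apply_affine(A, b, x):
    """Return A x + b for a matrix A (list of rows), bias b, and vector x."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


def select_transform(transforms, frames, score_fn):
    """Return the index of the transform with the highest total log-likelihood.

    transforms is a list of (A, b, log_det) tuples, where log_det is
    log|det A| supplied by the caller. score_fn(x) is a hypothetical
    per-frame log-likelihood of a transformed frame.
    """
    best_index, best_score = None, float("-inf")
    for k, (A, b, log_det) in enumerate(transforms):
        total = sum(score_fn(apply_affine(A, b, x)) + log_det for x in frames)
        if total > best_score:
            best_index, best_score = k, total
    return best_index


def toy_score(x):
    """Toy stand-in: unit-variance Gaussian at the origin, constant dropped."""
    return -0.5 * sum(v * v for v in x)


identity = [[1.0, 0.0], [0.0, 1.0]]
transforms = [
    (identity, [0.0, 0.0], 0.0),    # identity transform
    (identity, [-1.0, -1.0], 0.0),  # pure bias shift
]
frames = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]
best = select_transform(transforms, frames, toy_score)
```

Picking a single best transform is cheaper than summing over the whole mixture, and the selected transform is then used to rescore the lattice from the unadapted pass.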

  8. BSAT Training
     [Figure: numbers in boxes indicate the number of transforms]
     • Both clustering procedures yield comparable likelihood

  9. BSAT Decoding
     [Figure: numbers in boxes indicate the number of transforms]
     • Both BSAT systems yield comparable WER
     • BSAT: 1% absolute gain using only 16 transforms
     • SI: 0.9% absolute gain by doubling the number of parameters
     • WER reaches a plateau as the number of transforms in the mixture increases

  10. Conclusions
     • Integrated the transforms into the training/decoding procedure
     • Discriminatively trained transforms can be used in decoding
     • Preliminary results show that BSAT improves SI model performance with as few as 16 transforms
     • Future work
        • Improve transform splitting
        • Apply transform splitting concurrently with Gaussian splitting
        • Use top-N transforms in decoding
        • Use MLLR transforms rather than CMLLR transforms
        • Use discriminative estimation criteria rather than ML
