LECTURE 26: UNSUPERVISED DISCRIMINATIVE ADAPTATION
• Objectives: Adaptation Challenges • Historical Perspective • Discriminative Linear Transforms • Discriminative Mapping Transforms • Optimization Criteria • Experimental Results
• Resources: KY: Unsupervised Training
• URL: .../publications/courses/ece_8423/lectures/current/lecture_26.ppt
• MP3: .../publications/courses/ece_8423/lectures/current/lecture_26.mp3
Motivation
• Today’s lecture reviews an interesting paper presented at the 2008 International Conference on Acoustics, Speech and Signal Processing:
• K. Yu, M.J.F. Gales and P.C. Woodland, “Unsupervised Discriminative Adaptation Using Discriminative Mapping Transforms,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 4273-4276, Las Vegas, Nevada, U.S.A., March 2008.
• Though this paper focuses on speech processing, the techniques described can be applied to any data set with a similar structure: speaker-dependent data (data common to a single experimental condition) and speaker-independent data (multiple independent samples drawn across experimental conditions).
• The key contribution of this paper is a method for estimating a discriminative transform that improves performance on unsupervised data, a problem common to most pattern recognition applications.
• The core idea is to separate discrimination from ML-based adaptation.
Historical Perspective on Adaptation
• Linear transforms (e.g., MLLR) are the most common form of adaptation for applications involving limited data.
• ML estimation must be used for unsupervised adaptation; discriminative transforms can be estimated on training data (where transcriptions are known) but are rarely appropriate for real applications.
• Performance gains for discriminative models in unsupervised adaptation have been limited:
• One approach is to use the recognizer to generate the most likely transcription of the data and then use that transcription as supervision.
• If the accuracy of the recognizer is high, this can work well.
• However, discriminative training/adaptation is very sensitive to transcription errors, so in practice this method fails exactly when it is needed most: when recognition performance is low and adaptation is required to improve it.
• Confidence measures and lattices can be used to minimize the effects of transcription errors, but this also reduces the impact of adaptation:
• The data you most need to adapt on is the data with the lowest confidence or poorest recognition performance.
• Hence, discriminative adaptation has had limited success and has provided only marginal improvements over ML in practice.
Discriminative Mapping Transform
• A number of methods for combining ML adaptation with discriminatively trained models have been explored, including “speaker-adaptive” transforms and feature-based adaptation (to be discussed later). These approaches attempt to combine a good speaker-independent, discriminatively trained model with unsupervised speaker-dependent transformations.
• A criterion mapping function is defined as an attempt to map a transform estimated under one training criterion to a transform appropriate for another criterion, via some form of mapping (linear or nonlinear).
• In this work, the goal is to map a speaker-specific, ML-estimated linear transform to be more similar to a Minimum Phone Error (MPE) discriminatively trained transform.
• Recall that MPE is one of the three discriminative training techniques we have discussed; it minimizes phone-level (sub-word sequence) errors.
• A linear transform will be estimated and used for this mapping. This transform is referred to as a discriminative mapping transform (DMT).
• In theory this approach can be applied to any form of linear transform, whether of means, covariances, or features.
• Here, the focus will be on adaptation of the means.
Linear Transforms
• MLLR estimates a linear transform of the mean: μ̂ = Aμ + b = Wξ, where ξ = [1, μᵀ]ᵀ is the extended mean vector and W = [b A] (a minimal sketch of applying this transform follows below).
• The parameters of this transform are estimated using ML.
• We have explored ways to estimate this transform discriminatively:
• We have explored an MMI approach to estimating the transform, and briefly mentioned MCE and MPE approaches that are based on loss functions.
• To evaluate the loss function, we need to know the transcription, H(s). This approach has worked well for supervised adaptation, but not as well for unsupervised adaptation.
• We have seen that MMI-trained models can be combined with ML-trained adaptation to produce reasonable gains in performance.
• MMI can be applied to features, models (e.g., Gaussian mixtures), or most other forms of transformations.
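The following is a minimal sketch of applying an MLLR mean transform in NumPy. The function and variable names (adapt_means, W, means) are illustrative choices, not from the paper or these slides.

```python
import numpy as np

def adapt_means(means: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply an MLLR mean transform W = [b A] to every Gaussian mean.

    means : (num_components, dim) array of component means.
    W     : (dim, dim + 1) transform; column 0 is the bias b.

    Returns the adapted means mu_hat = A @ mu + b for each component.
    """
    num_components, _ = means.shape
    # Extended mean vectors xi = [1, mu^T]^T, stacked as rows.
    xi = np.hstack([np.ones((num_components, 1)), means])  # (n, dim + 1)
    return xi @ W.T                                        # (n, dim)

# Usage: an identity transform (b = 0, A = I) leaves the means unchanged.
dim = 3
means = np.random.randn(10, dim)
W_identity = np.hstack([np.zeros((dim, 1)), np.eye(dim)])
assert np.allclose(adapt_means(means, W_identity), means)
```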
Discriminative Mapping Transforms
• Criterion mapping functions (CMFs) use the same approach, but introduce a speaker-independent transformation of the speaker-specific transforms.
• The CMF attempts to map ML-trained transforms into discriminative transforms (a transform applied on top of another transform).
• The mapping function is trained in a speaker-independent manner so that it can be applied to any speaker.
• One form of this mapping that has been investigated is a linear transform acting on vec(W), where vec(·) stacks the transformation matrix into a vector.
• Restricting the mapping matrix Hdl to a block-diagonal structure simplifies this transformation and reduces the number of parameters: each row of the speaker’s transform is then mapped independently.
• For mean adaptation, this further simplifies; the mapped transform can be viewed as a speaker-independent affine transform composed with the speaker’s ML transform (see the sketch below).
• The speaker-independent transform can be estimated in a manner similar to our previous DLT approach.
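Below is a minimal sketch of the mean-adaptation case, viewing the DMT as a speaker-independent affine map (A_d, b_d) composed with each speaker's ML-estimated MLLR transform (A_ml, b_ml). The names and the composition view are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def compose_transforms(A_d, b_d, A_ml, b_ml):
    """Return (A, b) such that A @ mu + b == A_d @ (A_ml @ mu + b_ml) + b_d.

    (A_d, b_d)   : speaker-independent DMT, trained discriminatively.
    (A_ml, b_ml) : speaker-specific transform, estimated with ML.
    """
    return A_d @ A_ml, A_d @ b_ml + b_d

# Usage: the composed transform adapts a mean in a single step.
dim = 3
A_ml, b_ml = np.eye(dim) * 1.1, np.full(dim, 0.2)   # speaker-specific (ML)
A_d, b_d = np.eye(dim) * 0.9, np.zeros(dim)         # speaker-independent DMT
A, b = compose_transforms(A_d, b_d, A_ml, b_ml)
mu = np.random.randn(dim)
assert np.allclose(A @ mu + b, A_d @ (A_ml @ mu + b_ml) + b_d)
```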
Optimization Criterion
• The training criterion for the DMT follows the DLT approach: a discriminative (MPE-style) loss is computed on the training data for each speaker, using that speaker’s ML-derived transform mapped through the shared DMT, and summed over all speakers. The advantage of this is that we have much more supervised data with which to estimate the discriminative transform.
• So, in training, this approach essentially adjusts the mapping transform to minimize the loss function while holding fixed the ML parameter estimates derived from the training data. In recognition mode, this transform is used in conjunction with the ML-adapted estimates from unsupervised adaptation (a sketch of the training loop follows below).
• Hence, we are attempting to decouple adaptation and discrimination, hoping that the discriminative aspects are independent of the change in environment.
• Instead of training a single transformation over all data, multiple transformation matrices can be estimated using a process similar to regression classes in MLLR. For example, the mean vectors can be clustered and a transformation matrix built for each cluster. Alternatively, the clustering can be based on decision trees, phonetic characteristics, or both.
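A minimal sketch of this training procedure, under the composition view above. Here grad_loss is a hypothetical placeholder for the gradient of an MPE-style discriminative criterion evaluated with the known training transcriptions; the toy quadratic loss exists only to make the sketch runnable.

```python
import numpy as np

def train_dmt(ml_transforms, train_data, grad_loss, dim,
              lr=0.05, num_iters=200):
    """Estimate a speaker-independent DMT (A_d, b_d) by gradient descent.

    ml_transforms : dict speaker -> (A_ml, b_ml), fixed ML estimates.
    train_data    : dict speaker -> supervised data for that speaker.
    grad_loss     : callable(data, A, b) -> (dL/dA, dL/db) for the composed
                    transform (A, b); a stand-in for an MPE-style criterion.
    """
    A_d, b_d = np.eye(dim), np.zeros(dim)  # start at the identity mapping
    for _ in range(num_iters):
        gA, gb = np.zeros_like(A_d), np.zeros_like(b_d)
        # Criterion: sum of per-speaker losses, each computed with the
        # shared DMT composed on that speaker's fixed ML transform.
        for spk, (A_ml, b_ml) in ml_transforms.items():
            A, b = A_d @ A_ml, A_d @ b_ml + b_d
            dA, db = grad_loss(train_data[spk], A, b)
            gA += dA @ A_ml.T + np.outer(db, b_ml)  # chain rule into A_d
            gb += db
        A_d -= lr * gA
        b_d -= lr * gb
    return A_d, b_d

# Toy usage: quadratic surrogate loss pulling adapted means toward targets.
rng = np.random.default_rng(0)
dim = 2
speakers = {f"s{i}": (np.eye(dim) + 0.1 * rng.standard_normal((dim, dim)),
                      rng.standard_normal(dim)) for i in range(4)}
data = {s: (rng.standard_normal(dim), rng.standard_normal(dim))
        for s in speakers}  # (mean, target) pairs per speaker

def grad_quadratic(sample, A, b):
    mu, target = sample
    r = A @ mu + b - target      # residual of the adapted mean
    return np.outer(r, mu), r    # dL/dA, dL/db for L = 0.5 * ||r||^2

A_d, b_d = train_dmt(speakers, data, grad_quadratic, dim)
```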
Experimentation
• General framework: 5,000 speakers and 256 hours of training data; 144 speakers and 6 hours of adaptation/test data.
• Performance as a function of the number of transforms (results table in the original slides).
• Performance as a function of the source of the supervision hypothesis (results table in the original slides).
Summary
• Introduced a hybrid approach to adaptation that attempts to separate discrimination from ML adaptation.
• Described a new framework for robust discriminative unsupervised adaptation: a speaker-independent criterion mapping function (CMF), estimated during training, maps the ML-estimated speaker-dependent transforms to a more discriminative form.
• The transform is not highly sensitive to the adaptation hypotheses, which is a major issue with standard discriminative estimation of linear transforms.
• A simple initial implementation of the CMF based on linear transforms was described; this is referred to as a discriminative mapping transform (DMT).
• Future work: a number of alternative transforms and applications will be investigated.
• Next: unsupervised adaptation using large amounts of data.