Speaker Adaptation for Vowel Classification Xiao Li Electrical Engineering Dept.
Outline • Introduction • Background on statistical classifiers • Proposed adaptation strategies • Experiments and results • Conclusion
Application • “Vocal Joystick” (VJ) • Human-computer interaction for people with motor impairments • Acoustic parameters – energy, pitch, vowel quality, discrete sounds • Vowel classification • Vowels /ae/ (bat); /aa/ (bought); /uh/ (boot); /iy/ (beat) • Control motion direction
Features • Formants • Peaks in the spectrum • Low dimension (F1, F2, F3, F4 + dynamics) • Hard to estimate • Mel-frequency cepstral coefficients (MFCC) • Cosine transform of the log mel spectrum • High dimension (26 including deltas) • Easy to compute • Our choice – MFCCs (see the sketch below)
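As a rough illustration, the 26-dimensional features could be computed along these lines. This is a minimal sketch assuming the librosa library and a 13-static + 13-delta split (the slides only state 26 dimensions including deltas); the file name and sample rate are placeholders.

```python
# Sketch: 26-dimensional MFCC+delta features for one utterance.
import numpy as np
import librosa

y, sr = librosa.load("vowel.wav", sr=16000)          # file and rate are illustrative
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static coefficients (assumed split)
delta = librosa.feature.delta(mfcc)                  # 13 delta coefficients
features = np.vstack([mfcc, delta]).T                # (frames, 26) feature matrix
```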
User-Independent vs. User-Dependent • User-independent models • NOT optimized for a specific speaker • Easy to get a large training set • User-dependent models • Optimized for a specific speaker • Difficult to get a large training set
Adaptation • What is adaptation? • Adapting user-independent models to a specific user, using a small set of user-dependent data • Adaptation methodology for vowel classification • Train speaker-independent vowel models • Ask a speaker to articulate a few seconds of vowels for each class • Adapt the classifier on this small amount of speaker-dependent data
Outline • Introduction • Background on statistical classifiers • Proposed adaptation strategies • Experiments and results • Conclusion
Gaussian mixture models (GMM) • Generative models • Training objective – maximum likelihood (EM) on training samples O1:T • Classification • Compute the likelihood score for each class and choose the one with the highest likelihood • Limitations • Each class model is trained using only the data from that class • Constraints on the form of the discriminant functions
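The objective is \( \lambda^{*} = \arg\max_{\lambda} \log p(O_{1:T} \mid \lambda) \), maximized per class via EM. A minimal sketch of this one-GMM-per-vowel scheme, assuming scikit-learn; the component count and covariance type are illustrative choices, not from the slides:

```python
# Sketch: train one GMM per vowel class by EM, classify by highest likelihood.
from sklearn.mixture import GaussianMixture

def train_gmms(class_data, n_components=4):
    """class_data: dict mapping vowel label -> (frames, dim) array."""
    return {label: GaussianMixture(n_components=n_components,
                                   covariance_type="diag").fit(X)
            for label, X in class_data.items()}

def classify(gmms, X):
    """Average per-frame log-likelihood under each class model; pick the best."""
    scores = {label: gmm.score_samples(X).mean() for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```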
Neural Networks (NN) • Three-layer perceptron • # input nodes – feature dimension × window size • # hidden nodes – chosen empirically • # output nodes – # of classes • Training objective • Minimum relative entropy against the targets yk • Classification • Compare the output values • Advantages • Discriminative training • Nonlinearity • Features taken from multiple frames
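A minimal sketch of such a windowed three-layer classifier, assuming scikit-learn (whose MLPClassifier minimizes cross-entropy, which matches the relative-entropy objective up to a constant for one-hot targets). The window size and hidden width here are illustrative, not the slides' values:

```python
# Sketch: stack a context window of frames at the input of a 3-layer MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_window(X, window=7):
    """Concatenate each frame with its neighbors: (T, d) -> (T, d * window)."""
    half = window // 2
    padded = np.pad(X, ((half, half), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(X)] for i in range(window)])

net = MLPClassifier(hidden_layer_sizes=(50,), activation="logistic", max_iter=500)
# net.fit(stack_window(X_train), y_train)
# predictions = net.predict(stack_window(X_test))
```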
NN-SVM Hybrid Classifier • Idea – replace the hidden-to-output layer of the NN with linear-kernel SVMs • Training objective • Maximum margin • Theoretical guarantee on the test-error bound • Classification • Compare the output values of the binary classifiers • Advantages • Compared to a pure NN: optimal solution in the last layer • Compared to a pure SVM: efficiently handles features from multiple frames; no need to choose a kernel
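A sketch of the hybrid under the same assumptions as the MLP sketch above: the trained net's input-to-hidden mapping is frozen as a feature extractor, and one-vs-rest linear SVMs replace the output layer. The manual forward pass assumes the logistic hidden activation used above:

```python
# Sketch: linear SVMs on top of the frozen input-to-hidden mapping.
from scipy.special import expit          # logistic sigmoid
from sklearn.svm import LinearSVC

def hidden_activations(net, X):
    """Forward pass through the first (input-to-hidden) layer only."""
    return expit(X @ net.coefs_[0] + net.intercepts_[0])

svm = LinearSVC()                        # one-vs-rest linear-kernel SVMs
# svm.fit(hidden_activations(net, stack_window(X_train)), y_train)
# predictions = svm.predict(hidden_activations(net, stack_window(X_test)))
```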
Outline • Introduction • Background on statistical classifiers • Proposed adaptation strategies • Experiments and results • Conclusion
MLLR for GMM Adaptation • Maximum Likelihood Linear Regression • Apply a linear transformation to the Gaussian means • The same transformation is shared by all Gaussians in a class's mixture • The covariance matrices can be adapted in a similar fashion, though this is less effective
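In standard MLLR notation (following Leggetter and Woodland, which the slide presumably uses), each Gaussian mean \( \mu_m \) in a class is replaced by an affine transform shared across that class's mixture components:

\[
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m,
\qquad
\xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix},
\qquad
W = \begin{bmatrix} b & A \end{bmatrix}.
\]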
MLLR Formulas • Objective – maximum likelihood on adaptation samples O1:T • Set the first-order derivative to zero • The transform W is obtained by solving a linear equation
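The formulas themselves did not survive in this transcript; a standard reconstruction, writing \( \gamma_m(t) \) for the posterior of mixture component \( m \) at frame \( t \), is

\[
W^{*} = \arg\max_{W} \; \log p(O_{1:T} \mid \lambda_W),
\]

and setting the first-order derivative with respect to \( W \) to zero gives

\[
\sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1}\, o_t\, \xi_m^{\top}
\;=\;
\sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1}\, W\, \xi_m\, \xi_m^{\top},
\]

which is linear in the entries of \( W \).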
NN Adaptation • Idea – fix the nonlinear mapping and adapt only the last layer (a linear classifier) • Adaptation objective – minimum relative entropy • Start from the original weights • Gradient-descent update formulas (reconstructed below)
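The update formulas likewise did not survive; for a softmax output \( z \), one-hot targets \( y \), and fixed hidden activations \( h \), the standard relative-entropy (cross-entropy) gradient for the hidden-to-output weights is

\[
\frac{\partial E}{\partial w_{kj}}
= \sum_{t} \bigl( z_k(t) - y_k(t) \bigr)\, h_j(t),
\qquad
w_{kj} \leftarrow w_{kj} - \eta\, \frac{\partial E}{\partial w_{kj}},
\]

with learning rate \( \eta \) and the original weights as the starting point.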
NN-SVM Classifier Adaptation • Idea – again fix the nonlinear mapping and adapt the last layer • Adaptation objective – maximum margin • Adaptation procedure (sketched below) • Keep the support vectors of the training data • Combine these support vectors with the adaptation data • Retrain the linear-kernel SVMs of the last layer
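A sketch of this retraining step with scikit-learn, assuming hidden-layer activations and labels (H_train, y_train, H_adapt, y_adapt are illustrative names, not from the slides). SVC with a linear kernel exposes the support-vector indices directly:

```python
# Sketch: pool the original support vectors with the adaptation data and refit.
import numpy as np
from sklearn.svm import SVC

def adapt_svm(svm, H_train, y_train, H_adapt, y_adapt):
    """Refit linear-kernel SVMs on original support vectors plus adaptation data."""
    sv_X = H_train[svm.support_]             # support vectors of the training set
    sv_y = y_train[svm.support_]
    H_pool = np.vstack([sv_X, H_adapt])      # append speaker-dependent data
    y_pool = np.concatenate([sv_y, y_adapt])
    return SVC(kernel="linear").fit(H_pool, y_pool)
```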
Outline • Introduction • Background on statistical classifiers • Proposed adaptation strategies • Experiments and results • Conclusion
Database • Pure vowel recordings with varying energy and pitch • Duration – long, short • Energy – loud, normal, quiet • Pitch – rising, level, falling • Statistics • Training set – 10 speakers • Test set – 5 speakers • 4, 8, or 9 vowel classes • 18 utterances (2000 samples) per vowel per speaker
Adaptation and Evaluation Sets • 6-fold cross-validation for each speaker • The 18 utterances are divided into 6 subsets • We adapt on each subset in turn and evaluate on the rest • This yields 6 accuracy scores per vowel, from which we compute the mean and standard deviation • Results are averaged over the 5 speakers
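A sketch of this protocol for a single speaker; adapt_and_score is a hypothetical helper that adapts on one subset and returns accuracy on the rest. Note the roles are inverted relative to usual cross-validation: the small fold is the adaptation set.

```python
# Sketch: 6-fold adapt/evaluate split over one speaker's 18 utterances.
import numpy as np
from sklearn.model_selection import KFold

utterance_ids = np.arange(18)                  # one speaker's utterances
scores = []
for eval_idx, adapt_idx in KFold(n_splits=6).split(utterance_ids):
    # KFold's small "test" fold (3 utterances) is used for adaptation,
    # the remaining 15 utterances for evaluation.
    scores.append(adapt_and_score(adapt_idx, eval_idx))  # hypothetical helper
print(np.mean(scores), np.std(scores))         # per-speaker mean and deviation
```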
Speaker-Independent Classifiers • Individual scores vary considerably across speakers • With an NN window of 1 frame, performance is similar to the GMM
Conclusion • For speaker-independent models, the NN classifier (with multi-frame input) works well • For speaker-adapted models, the NN classifier is effective, and the NN-SVM hybrid achieves the best performance so far