Regularized Adaptation: Theory, Algorithms and Applications

Regularized Adaptation: Theory, Algorithms and Applications Xiao Li Electrical Engineering Department University of Washington

Goal • When a target distribution pad(x,y) is close to but different from the training distribution ptr(x,y) • A classifier optimized for ptr(x,y) needs to be adapted for pad(x,y), using a small amount of labeled adaptation data

Roadmap • Introduction • Theoretical results • A Bayesian fidelity prior for adaptation • Generalization error bounds • Regularized adaptation algorithms • SVM and MLP adaptation • Experiments on vowel and object classification • Application to the Vocal Joystick • Conclusions and future work

Inductive Learning • Given • a set of m samples (xi, yi) ~p(x, y) • a decision function space F: X  {±1} • Goal • learn a decision function that minimizes the expected error • In practice • minimize the empirical error • while applying a regularization strategy to achieve good generalization performance

Why Is Regularization Helpful? • Learning theory says • Vapnik’s VC bound(a frequentist view): ΦVC dimension of F • Occam’s Razor bound(a Bayesian view): Φ π( f )for countable function space • “Accuracy-regularization” • We want to minimize the empirical error as well as the capacity • Support vector machines (a frequentist view) • Bayesian model selection (a Bayesian view)

Adaptive Learning (Adaptation) • Two related yet different distributions • Training • Target (test-time) • Given • An unadapted model • Adaptation data (labeled) • Goal • Learn an adapted model that is as close as possible to our desired model • Notes • Adaptation data Dm may be very limited • Original data for training f tr is not preserved for adaptation

Scenarios • Customization • Speech recognition: speaker adaptation • Handwriting recognition: writer adaptation • Language processing: domain adaptation • Evolutionary environment • Spam filter

Practical Work on Adaptation • Gaussian mixture models (GMMs) • MAP (Gauvain 94); MLLR (Leggetter 95) • Support vector machines (SVMs) • Boosting-like approach (Matic 93) • Weighted combination of old support vectors and adaptation data (Wu 04) • Multi-layer perceptrons (MLPs) • Shared “internal representation” in transfer learning (Baxter 95, Caruana 97, Stadermann 05) • Linear input network (Neto 95) • Conditional maximum entropy models • Gaussian prior (Chelba 04)

This Work Seeks Answers to … • A unified and principled approach to adaptation • Applicable to a variety of classifiers • Amenable to variations in the amount of adaptation data • Quantitative relationships between • The generalization error bound (or sample complexity bound) and • The divergence between training and target distributions

Bayesian Fidelity Prior • Adaptation objective • Remp( f ) – empirical error on the adaptation data • Pfid( f )– Bayesian “fidelity prior” • Fidelity prior • How likely a classifier is given a training distribution rather than a training set – key difference from Baxter 97 • Reflects the “fidelity” to the unadapted model • Related to the KL-divergence

Generative Models • Generative models • F – a space of generative models p(x, y | f) • Classification • Posterior • Assume f tr and f ad are the true models generating the training and target distributions respectively, i.e. • Note that this assumption is justified when the function space contains the true models and if we use the log likelihood loss standard prior, chosen before training

Fidelity Prior for Generative Models • Key result where β > 0 • Implication • Fidelity prior at the desired model We are more likely to learn our desired model using the fidelity prior than using the standard prior

Instantiations • To compute the fidelity prior • Assuming a “uniform” standard prior, this prior is determined by the KL-divergence • We can use an upper bound on the KL-divergence (hence a lower bound on the prior) • Gaussian models • The fidelity prior is a normal-Wishart distribution • Mixture models • An upper bound on the KL-divergence (using log sum inequality) • Hidden Markov models • An upper bound on the KL-divergence (Silva 06)

Discriminative Models • A unified view of SVMs, MLPs, CRFs and etc. • Affine classifiers in a transformed space f = ( w, b ) • Classification • Conditional distribution (for binary case)

Discriminative Models (cont.) • Conditional models • F – a space of conditional models p( y | x, f ) • Classification • Posterior • Assume f tr and f ad are the true models describing the training and target conditionaldistributions respectively, i.e.

Fidelity Prior for Conditional Models • Again a divergence where β > 0 • What if we do not know ptr(x, y) • We seek an upper bound on the KL-divergence and hence a lower bound on the prior • Key result where

Roadmap • Introduction • Theoretical results • A Bayesian fidelity prior for adaptation • Generalization error bounds • Regularized adaptation algorithms • SVM and MLP adaptation • Experiments on vowel and object classification • Learning and adaptation in the Vocal Joystick • Conclusions and future work

Occam’s Razor Bound for Adaptation • (McAllester 98) For a countable function space, for any prior π( f ) • Apply for adaptation …

Increasing KL Bound using standard prior Bounds using fidelity prior sample sizem

PAC-Bayesian Bounds for Adaptation • (McAllester 98) For both countable and uncountable function spaces, for any prior p( f ) and posterior q( f ) • Choice of prior p( f ) and posterior q( f ) • Use pfid( f )or its related forms as prior • Choose q( f )to have the same parametric form • Examples • Gaussian models • Linear classifier

Roadmap • Introduction • Theoretical results • A Bayesian fidelity prior for adaptation • Generalization error bounds • Regularized adaptation algorithms • SVM and MLP adaptation • Experiments on vowel and object classification • Application the Vocal Joystick • Conclusions and future work

Algorithms Derived from the Fidelity Prior • Generative Models • Relate to MAP adaptation • Conditional Models • In particular for log linear models We focus on SVMs and MLPs

Regularized SVM Adaptation • Optimization objective • Globally optimal solution • Regularized – fixing old support vectors and their coefficients • Extended regularized – update coefficients of old support vectors as well, with two separate constraints in the dual space

Algorithms in Comparison • Unadapted • Retrained • Use adaptation data only • Boosted (Matic 93) • Select adaptation data misclassified by the unadapted model • Combine with old support vectors • Bootstrapped (proposed in thesis) • Train a seed classifier using adaptation data only • Select old support vectors correctly classified by the seed classifier • Combine with adaptation data • Regularizedand Extended regularized

Regularized MLP Adaptation • Optimization objective for a multi-class, two-layer MLP • Wh2o andWi2h – the hidden-to-output and input-to-hidden layer weight matrix respectively • ||W|| = tr(WTW) • Remp( f ) – cross-entropy, corresponding to logistic loss • Locally optimal solution found using back-propagation

Algorithms in Comparison • Unadapted • Retrained • Start from randomly initialized weights and train with weight decay • Linear input network (Neto 95) • Add a linear transformation in the input space • Retrained speaker-independent (Neto 95) • Start from unadapted; train both layers • Retrained last layer (Baxter 95, Caruana 97, Stadermann 05) • Start from unadapted; only train the last layer • Retrained first layer (proposed in thesis) • Start from unadapted; only train the first layer • Regularized • Many algorithms above can be considered as special cases of regularized

Experimental Paradigm • Procedure • Train an unadapted model on training set • Adapt (with supervision) and evaluate via n-fold CV on test set • Select regularization coefficients on the dev set • Corpora • VJ vowel dataset (Kilanski 06) • NORB image dataset (LeCun 04)

VJ Vowel Dataset • Task • 8 vowel classes • Frame-level classification error rate • Speaker adaptation • Data allocation • Training set – 21 speakers, 420K samples For SVM, we randomly selected 80K samples for training • Test set – 10 speakers, 200 samples • Dev set – 4 speakers, 80 samples • Features • 182 dimensions – 7 frames of MFCC+delta features

SVM Adaptation • RBF kernel (std=10) optimized for training and fixed for adaptation • Mean and std. dev of frame-level error rates over 10 speakers redare the best and those not significantly different from the best at p<0.001 level

MLP Adaptation (I) • 50 hidden nodes • Mean and std. dev over 10 speakers

MLP Adaptation (II) • Varying number of vowel classes available in adaptation data

NORB Image Dataset • Task • 5 object classes • Classify each image • Lighting condition adaptation • Data allocation • Training set – 2700 samples • Test set – 2700 samples • Features • 32x32 raw images

SVM Adaptation • RBF kernel (std=500) optimized for training and fixed for adaptation • Mean and std. dev of classification error rates over 6 lighting conditions redare the best and those not significantly different from the best at p<0.001 level

MLP Adaptation • 30 hidden nodes • Mean and std. dev over 6 lighting conditions

Why the Vocal Joystick • Hands-free computer interfaces • Eye-gaze tracking – high cost • Head-mouse or chin-joystick – low bandwidth • Speech interfaces – inability for continuous control • The Vocal Joystick (Bilmes 05) • Can control both continuous and discrete tasks. • Utilizes intensity, vowel quality, pitch and discrete sound identity • The VJ-mouse control scheme Joint work with Graduate students: J. Malkin, S. Harada and K. Kilanski Faculty members: J. Bilmes, R. Wright, K. Kirchhoff, J. Landay, P. Dowden and H. Chizeck

The Pattern Recognition Module Graphical Model Pitch Tracking Pitch Regularized MLP Adaptation Two-Layer MLP Adaptive Filter Intensity Zero-cross Autocovaraince MFCCs Vowel Detection Vowel Classification Vowel posteriors Regularized GMM Adaptation Phoneme HMMs Discrete Sound Detection Discrete Sound Recognition Discrete sound ID

Vowel Classifier Adaptation • Classifier • Two-layer MLP with 4 or 8 outputs • Trained on the VJ-vowel dataset • Adaptation • Motivation: overlap of different vowel classes articulated from different speakers • Regularized adaptation for MLPs • Improvements over unadapted classifier (with 2 secs per vowel): 4-class 7.6%0.2%; 8-class: 32.0%8.2%

Discrete Sound Recognizer Adaptation • Recognizer • Phoneme HMMs • Trained on TIMIT • Rejection • “In-vocabulary”: unvoiced consonants • Garbage utterances: vowels, breathing, extraneous speech • Heuristics for rejection: duration, zero-crossing, pitch, posteriors • Adaptation • Motivation: consonant articulation mismatch between TIMIT data and the VJ application • Regularized adaptation for GMMs • Obtain better rejection thresholds

Conclusions and Future Work • Key contributions • A Bayesian fidelity prior and generalization error bounds • Regularized adaptation algorithms derived from the fidelity prior • Application to the Vocal Joystick • Future work • Theoretical work: adaptation error bounds for classifiers in an uncountable function space • Algorithmic work: kernel adaptation • The Vocal Joystick: discrete sound data collection and training a better baseline

Thank You

Regularized Adaptation: Theory, Algorithms and Applications