This presentation introduces a sparse modeling approach to speech recognition using kernel machines. The use of support vector machines (SVMs) and relevance vector machines (RVMs) as acoustic models in large vocabulary speech recognition systems is discussed, highlighting their superior performance compared to Gaussian mixture-based HMMs. Particular attention is given to training algorithms that scale to large corpora and to the presenter's research on kernel machines as replacements for the Gaussian distributions in hidden Markov acoustic models.
A Sparse Modeling Approach to Speech Recognition Using Kernel Machines
Jon Hamaker (hamaker@isip.msstate.edu)
Institute for Signal and Information Processing, Mississippi State University
Abstract Statistical techniques based on Hidden Markov models (HMMs) with Gaussian emission densities have dominated the signal processing and pattern recognition literature for the past 20 years. However, HMMs suffer from an inability to learn discriminative information and are prone to over-fitting and over-parameterization. Recent work in machine learning has focused on models, such as the support vector machine (SVM), that automatically control generalization and parameterization as part of the overall optimization process. SVMs have been shown to provide significant improvements in performance on small pattern recognition tasks compared to a number of conventional approaches. SVMs, however, require ad hoc (and unreliable) methods to couple them to probabilistic learning machines. Probabilistic Bayesian learning machines, such as the relevance vector machine (RVM), are fairly new approaches that attempt to overcome the deficiencies of SVMs by explicitly accounting for sparsity and statistics in their formulation. In this presentation, we describe both of these modeling approaches in brief. We then describe our work to integrate these as acoustic models in large vocabulary speech recognition systems. Particular attention is given to algorithms for training these learning machines on large corpora. In each case, we find that both SVM and RVM-based systems perform better than Gaussian mixture-based HMMs in open-loop recognition. We further show that the RVM-based solution performs on par with the SVM system while using an order of magnitude fewer parameters. We conclude with a discussion of the remaining hurdles for providing this technology in a form amenable to current state-of-the-art recognizers.
Bio Jon Hamaker is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Mississippi State University under the supervision of Dr. Joe Picone. He has been a senior member of the Institute for Signal and Information Processing (ISIP) at MSU since 1996. Mr. Hamaker's research work has revolved around automatic structural analysis and optimization methods for acoustic modeling in speech recognition systems. His most recent work has been in the application of kernel machines as replacements for the underlying Gaussian distribution in hidden Markov acoustic models. His dissertation work compares the popular support vector machine with the relatively new relevance vector machine in the context of a speech recognition system. Mr. Hamaker has co-authored 4 journal papers (2 under review), 22 conference papers, and 3 invited presentations during his graduate studies at MS State (http://www.isip.msstate.edu/publications). He also spent two summers as an intern at Microsoft in the recognition engine group.
Outline • The acoustic modeling problem for speech • Current state-of-the-art • Discriminative approaches • Structural optimization and Occam’s Razor • Support vector classifiers • Relevance vector classifiers • Coupling vector machines to ASR systems • Scaling relevance vector methods to “real” problems • Extensions of this work
ASR Problem [Figure: system block diagram – input speech → acoustic front-end → search → recognized utterance, with statistical acoustic models p(A|W) and a language model p(W) feeding the search; the acoustic models are the focus of this work] • Front-end maintains information important for modeling in a reduced parameter set • Language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams) • Search engine uses knowledge sources and models to choose among competing hypotheses
Acoustic Confusability • Requires reasoning under uncertainty! • Regions of overlap represent classification error • Reduce overlap by introducing acoustic and linguistic context [Figure: comparison of "aa" in "lOck" and "iy" in "bEAt" for SWB]
Probabilistic Formulation • To deal with the uncertainty, we typically formulate speech as a probabilistic problem: P(W|A) = P(A|W) P(W) / P(A) • Objective: Minimize the word error rate by maximizing P(W|A) • Approach: Maximize P(A|W) during training • Components: • P(A|W): Acoustic Model • P(W): Language Model • P(A): Acoustic probability (ignored during maximization)
Acoustic Modeling - HMMs [Figure: left-to-right HMM state sequences (s0–s4) for the words THREE, TWO, FIVE, EIGHT] • HMMs model temporal variation in the transition probabilities of the state machine • GMM emission densities are used to account for variations in speaker, accent, and pronunciation • Sharing model parameters is a common strategy to reduce complexity
Maximum Likelihood Training • Data-driven modeling supervised only by a word-level transcription • Approach: maximum likelihood estimation • The EM algorithm is used to improve our estimates: • Guaranteed convergence to a local maximum • No guard against overfitting! • Computationally efficient training algorithms (Forward-Backward) have been crucial (see the sketch below) • Decision trees are used to optimize parameter sharing, minimize system complexity, and integrate additional linguistic knowledge
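A minimal log-domain sketch of the forward recursion that underlies Forward-Backward training, assuming a single entry state and generic per-frame emission scores (e.g. from Gaussian mixture densities). The function and argument names are illustrative, not the ISIP trainer's.

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglike(log_trans, log_emit):
    """Forward pass of the Forward-Backward algorithm, in log space.

    log_trans : (S, S) array of log transition probabilities log a_ij
    log_emit  : (T, S) array of per-frame log emission likelihoods log b_j(o_t)
    Returns log p(O | model), the quantity that maximum likelihood (EM)
    training increases on each iteration."""
    T, S = log_emit.shape
    alpha = np.full(S, -np.inf)
    alpha[0] = log_emit[0, 0]                  # assume a single entry state (state 0)
    for t in range(1, T):
        # alpha_t(j) = [ sum_i alpha_{t-1}(i) * a_ij ] * b_j(o_t), computed in logs
        alpha = log_emit[t] + logsumexp(alpha[:, None] + log_trans, axis=0)
    return logsumexp(alpha)                    # sum over possible final states
```

The backward pass and the EM accumulator updates follow the same pattern; only the forward likelihood is shown here since it is the quantity the slide's convergence guarantee refers to.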
Drawbacks of Current Approach • ML convergence does not translate to optimal classification • Error from incorrect modeling assumptions • Finding the optimal decision boundary requires only one parameter!
Drawbacks of Current Approach • Data not separable by a hyperplane – a nonlinear classifier is needed • Gaussian MLE models tend toward the center of mass – overtraining leads to poor generalization
Acoustic Modeling • Acoustic Models Must: • Model the temporal progression of the speech • Model the characteristics of the sub-word units • We would also like our models to: • Optimally trade-off discrimination and representation • Incorporate Bayesian statistics (priors) • Make efficient use of parameters (sparsity) • Produce confidence measures of their predictions for higher-level decision processes
Paradigm Shift - Discriminative Modeling • Discriminative Training (Maximum Mutual Information Estimation) • Essential idea: Maximize the posterior of the correct transcription, P(A|W) P(W) / Σ_W' P(A|W') P(W') • Maximize numerator (ML term), minimize denominator (discriminative term) • Discriminative Modeling (e.g. ANN Hybrids – Bourlard and Morgan)
Research Focus • Our Research: replace the Gaussian likelihood computation with a machine that incorporates notions of • Discrimination • Bayesian statistics (prior information) • Confidence • Sparsity • All while maintaining computational efficiency
ANN Hybrids [Figure: ANN estimating P(c1|o) … P(cn|o) from an input feature vector] • Architecture: ANN provides flexible, discriminative classifiers for emission probabilities that avoid HMM independence assumptions (can use wider acoustic context) • Trained using Viterbi iterative training (hard decision rule) or can be trained to learn Baum-Welch targets (soft decision rule) • Shortcomings: • Prone to overfitting: require cross-validation to determine when to stop training • Need methods to automatically penalize overfitting • No substantial recognition improvements over HMM/GMM
Structural Optimization [Figure: training set error and open-loop error plotted against model complexity; the optimum lies where the open-loop error is minimized] • Structural optimization often guided by an Occam's Razor approach • Trading goodness of fit against model complexity • Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination
Structural Risk Minimization • Expected risk: R(α) = ∫ ½ |y − f(x, α)| dP(x, y) – not possible to estimate P(x, y) directly • Empirical risk: R_emp(α) = (1/2N) Σ_i |y_i − f(x_i, α)| • The two are related through the VC dimension, h: with probability 1 − η, R(α) ≤ R_emp(α) + sqrt( ( h (log(2N/h) + 1) − log(η/4) ) / N ) • The VC dimension is a measure of the complexity of the learning machine • A higher VC dimension gives a looser bound on the actual risk – thus penalizing a more complex model (Vapnik) • Approach: choose the machine that gives the least upper bound on the actual risk [Figure: bound on the expected risk = empirical risk + VC confidence, plotted against VC dimension; the optimum balances the two terms]
Support Vector Machines [Figure: separable two-class data with hyperplanes C0–C2 and margin boundaries H1, H2; C0 is the optimal classifier with normal vector w] • Hyperplanes C0–C2 achieve zero empirical risk; C0 generalizes optimally • The data points that define the boundary are called support vectors • Optimization for separable data: • Hyperplane: w · x + b = 0 • Constraints: y_i (w · x_i + b) ≥ 1 for all i • Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin); only a small portion of the training points become support vectors • Final classifier: f(x) = sign( Σ_i α_i y_i (x_i · x) + b )
SVMs as Nonlinear Classifiers • Data for practical applications typically not separable using a hyperplane in the original input feature space • Transform data to a higher dimension where a hyperplane classifier is sufficient to model the decision surface • Kernels are used for this transformation • Final classifier: f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )
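A small sketch of how the kernelized decision function is evaluated, assuming the support vectors, the dual coefficients (α_i y_i), and the bias b have already been obtained from quadratic-programming training or an off-the-shelf SVM package. The RBF kernel with gamma = 0.5 matches the kernel used in the experiments later in the talk; the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(sv, x, gamma=0.5):
    """K(sv_i, x) = exp(-gamma * ||sv_i - x||^2) for every stored support vector."""
    return np.exp(-gamma * np.sum((sv - x) ** 2, axis=-1))

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Signed distance in the kernel-induced space:
       f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b."""
    return float(np.dot(dual_coefs, rbf_kernel(support_vectors, x, gamma)) + bias)

def svm_classify(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Hard class label from the sign of the decision value."""
    return 1 if svm_decision(x, support_vectors, dual_coefs, bias, gamma) >= 0.0 else -1
```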
SVMs for Non-Separable Data • No hyperplane could achieve zero empirical risk (in any dimension space!) • Recall the SRM principle: trade off empirical risk and model complexity • Relax the optimization constraints to allow errors on the training set: y_i (w · x_i + b) ≥ 1 − ξ_i with ξ_i ≥ 0, minimizing ½‖w‖² + C Σ_i ξ_i • A new parameter, C, must be estimated to optimally control the trade-off between training set errors and model complexity
SVM Drawbacks • Uses a binary (yes/no) decision rule • Generates a distance from the hyperplane, but this distance is often not a good measure of our "confidence" in the classification • Can produce a "probability" as a function of the distance (e.g. using sigmoid fits), but these estimates are often inadequate • Number of support vectors grows linearly with the size of the data set • Requires estimation of the trade-off parameter, C, via held-out sets
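For concreteness, here is a minimal sketch of the kind of sigmoid fit the slide refers to: mapping held-out SVM distances to an approximate posterior P(y=1|f) = 1/(1 + exp(A·f + B)). This is a simplification of Platt's procedure (which also regularizes the targets and uses a second-order optimizer), not the implementation used in the talk's system.

```python
import numpy as np

def fit_sigmoid(decision_values, labels, lr=0.01, n_iters=5000):
    """Fit P(y=1 | f) = 1 / (1 + exp(A*f + B)) to held-out SVM distances by
    minimizing cross-entropy with plain gradient descent.  labels are in {0, 1}."""
    f = np.asarray(decision_values, dtype=float)
    t = np.asarray(labels, dtype=float)
    A, B = -1.0, 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        grad_A = np.mean((t - p) * f)      # d(cross-entropy)/dA
        grad_B = np.mean(t - p)            # d(cross-entropy)/dB
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

def svm_posterior(f, A, B):
    """Map an SVM distance to an (approximate) class posterior."""
    return 1.0 / (1.0 + np.exp(A * f + B))
```

The weakness the slide points out is visible here: the posterior is an after-the-fact curve fit on held-out data, not something the SVM training criterion itself produces.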
Evidence Maximization • Build a fully specified probabilistic model – incorporate prior information/beliefs as well as a notion of confidence in predictions • MacKay posed a special form for regularization in neural networks – sparsity • Evidence maximization: evaluate candidate models based on their “evidence”, P(D|Hi) • Structural optimization by maximizing the evidence across all candidate models! • Steeped in Gaussian approximations
Evidence Framework • Evidence approximation: P(D|H_i) ≈ P(D|w_MP, H_i) × P(w_MP|H_i) σ_w|D (best-fit likelihood × Occam factor) • First term: likelihood of the data given the best-fit parameter set • Second term: a penalty that measures how well our posterior model fits our prior assumptions • We can set the prior in favor of sparse, smooth models! [Figure: prior P(w|H_i) and posterior P(w|D,H_i) over the weights w, illustrating the Occam factor as the ratio of posterior width to prior width]
Relevance Vector Machines • A kernel-based learning machine: y(x) = σ( Σ_i w_i K(x, x_i) + w_0 ) • Incorporates an automatic relevance determination (ARD) prior over each weight (MacKay): p(w|α) = Π_i N(w_i | 0, α_i⁻¹) • A flat (non-informative) prior over α completes the Bayesian specification
Relevance Vector Machines • The goal in training becomes finding the most probable weights and hyperparameters, {w_MP, α_MP} = argmax_{w,α} P(w, α | t) • Estimation of the "sparsity" parameters is inherent in the optimization – no need for a held-out set! • A closed-form solution to this maximization problem is not available; rather, we iteratively re-estimate α and the weight posterior
Laplace’s Method • Fix α and estimate w (e.g. gradient descent) to find the posterior mode w_MP • Use the Hessian of the log posterior to approximate the covariance of a Gaussian posterior of the weights centered at w_MP • With w_MP and Σ as the mean and covariance, respectively, of the Gaussian approximation, we find α by re-estimating α_i = γ_i / w_MP,i², where γ_i = 1 − α_i Σ_ii • Method is O(N²) in memory and O(N³) in time
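A compact sketch of the fix-α / re-estimate loop the slide describes, assuming Tipping's standard ARD update α_i ← γ_i / w_MP,i². This is illustrative, not the ISIP implementation, and the names are hypothetical.

```python
import numpy as np

def rvm_laplace_step(Phi, t, alpha, w, newton_iters=25):
    """One outer iteration of RVM classification training via Laplace's method:
    (1) find the posterior mode w_MP for fixed alpha by Newton/IRLS steps,
    (2) take the inverse Hessian at the mode as the Gaussian covariance Sigma,
    (3) re-estimate each alpha_i from gamma_i = 1 - alpha_i * Sigma_ii.

    Phi   : (N, M) design matrix of kernel values (plus a bias column)
    t     : (N,) targets in {0, 1}
    alpha : (M,) current ARD hyperparameters
    w     : (M,) current weights"""
    A = np.diag(alpha)
    for _ in range(newton_iters):              # inner loop: locate w_MP
        y = 1.0 / (1.0 + np.exp(-Phi @ w))     # sigmoid outputs
        grad = Phi.T @ (t - y) - alpha * w     # gradient of the log posterior
        B = y * (1.0 - y)                      # logistic weights
        H = Phi.T @ (Phi * B[:, None]) + A     # negative Hessian (M x M)
        w = w + np.linalg.solve(H, grad)       # Newton step
    Sigma = np.linalg.inv(H)                   # the O(M^3) inversion the talk refers to
    gamma = 1.0 - alpha * np.diag(Sigma)       # how well-determined each weight is
    new_alpha = gamma / (w ** 2 + 1e-12)       # ARD update; huge alpha => weight pruned
    return w, new_alpha, Sigma
```

Since M starts at one parameter per training example, the inversion is what makes naive training O(N³) in time and motivates the scaling algorithms later in the talk.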
RVMs Compared to SVMs • RVM – Data: class labels (0, 1); Goal: learn the posterior, P(t=1|x); Structural optimization: hyperprior distribution encourages sparsity; Training: iterative – O(N³) • SVM – Data: class labels (-1, 1); Goal: find the optimal decision surface under constraints; Structural optimization: trade-off parameter that must be estimated; Training: quadratic – O(N²)
Experimental Progression • Proof of concept on speech classification data • Coupling classifiers to ASR system • Reduced-set tests on Alphadigits task • Algorithms for scaling up RVM classifiers • Further tests on Alphadigits task (still not the full training set though!) • New work aiming at larger data sets and HMM decoupling
Vowel Classification • Deterding Vowel Data: 11 vowels spoken in “h*d” context; 10 log area parameters; 528 train, 462 SI test
Coupling to ASR [Figure: a phone segment of k frames (from the phone sequence "hh aw aa r y uw") split into three regions of 0.3·k, 0.4·k, and 0.3·k frames; the mean of each region forms the segmental feature vector] • Data size: 30 million frames of data in the training set • Solution: segmental phone models (region means, as sketched below) • Source for segmental data: • Solution: use the HMM system in a bootstrap procedure • Could also build a segment-based decoder • Probabilistic decoder coupling: • SVMs: sigmoid-fit posterior • RVMs: naturally probabilistic
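A sketch of the segmental feature computation implied by the 0.3k / 0.4k / 0.3k split in the figure. The actual ISIP front end may append additional information (e.g. segment duration); this shows only the region-mean idea.

```python
import numpy as np

def segmental_features(frames):
    """Map a variable-length phone segment (k frames x d mel-cepstral features)
    to a fixed-length vector: average the frames in three regions covering
    roughly 30% / 40% / 30% of the segment, then concatenate the region means."""
    frames = np.asarray(frames, dtype=float)
    k, d = frames.shape
    if k < 3:                                   # degenerate segment: reuse the overall mean
        m = frames.mean(axis=0)
        return np.concatenate([m, m, m])
    b1, b2 = int(round(0.3 * k)), int(round(0.7 * k))
    return np.concatenate([frames[:b1].mean(axis=0),
                           frames[b1:b2].mean(axis=0),
                           frames[b2:].mean(axis=0)])
```

Because every segment maps to the same fixed dimensionality, the kernel machines can score variable-length phone hypotheses proposed by the HMM bootstrap system.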
Coupling to ASR System [Pipeline: features (mel-cepstra) → HMM recognition → N-best list + segment information → segmental converter → segmental features → hybrid decoder → hypothesis]
Alphadigit Recognition • OGI Alphadigits: continuous, telephone-bandwidth letters and numbers ("A19B4E") • Reduced training set size for the RVM comparison: 2000 training segments per phone model • Could not, at this point, run larger sets efficiently • 3329 utterances using 10-best lists generated by the HMM decoder • SVM and RVM system architectures are nearly identical: RBF kernels with gamma = 0.5 • SVM requires the sigmoid posterior estimate to produce likelihoods – sigmoid parameters estimated from a large held-out set
SVM Alphadigit Recognition • HMM system is cross-word state-tied triphones with 16 mixtures of Gaussian models • SVM system has monophone models with segmental features • System combination experiment yields another 1% reduction in error
SVM/RVM Alphadigit Comparison • RVMs yield a large reduction in the parameter count while attaining superior performance • Computational cost for RVMs lies mainly in training, but it is still prohibitive for larger sets
Scaling Up • Central to RVM training is the inversion of an M×M Hessian matrix: an O(N³) operation initially, since the model starts with one parameter per training example • Solutions: • Constructive approach: start with an empty model and iteratively add candidate parameters; M is typically much smaller than N • Divide-and-conquer approach: divide the complete problem into a set of sub-problems and iteratively refine the candidate parameter set according to the sub-problem solutions; M is user-defined
Constructive Approach • Tipping and Faul (MSR-Cambridge) • Define the marginal likelihood contribution of a single basis function in terms of its sparsity factor s_i and quality factor q_i: ℓ(α_i) = ½ [ log α_i − log(α_i + s_i) + q_i² / (α_i + s_i) ] • ℓ(α_i) has a unique maximum with respect to α_i: α_i = s_i² / (q_i² − s_i) if q_i² > s_i, and α_i = ∞ otherwise • These results give a set of rules for adding vectors to the model, removing vectors from the model, or updating parameters in the model
Constructive Approach Algorithm

Prune all parameters
While not converged
  For each parameter:
    If parameter is pruned: checkAddRule
    Else: checkPruneRule, checkUpdateRule
  End
  Update model
End

• Begin with all weights set to zero and iteratively construct an optimal model without evaluating the full N×N inverse • Formulated for RVM regression – can have oscillatory behavior for classification • Rule subroutines require the full design matrix – an N×N storage requirement
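The add / re-estimate / prune rules can be made concrete for the regression case the slide says the method was formulated for. The sketch below assumes a fixed, known noise variance and a simple cyclic visiting order; `fast_rvm_regression` and its arguments are hypothetical names, and this is an illustration of the idea rather than the Tipping-Faul or ISIP implementation.

```python
import numpy as np

def fast_rvm_regression(Phi, t, noise_var=0.01, n_iters=200):
    """Constructive RVM regression: start with every alpha_i at infinity (all
    weights pruned) and repeatedly add, re-estimate, or prune one basis
    function using its sparsity factor s_i and quality factor q_i."""
    N, M = Phi.shape
    beta = 1.0 / noise_var
    alpha = np.full(M, np.inf)

    # bootstrap: seed the model with the best-aligned basis function
    proj = (Phi.T @ t) ** 2 / np.sum(Phi ** 2, axis=0)
    i0 = int(np.argmax(proj))
    alpha[i0] = np.sum(Phi[:, i0] ** 2) / max(proj[i0] - noise_var, 1e-9)

    for it in range(n_iters):
        active = np.isfinite(alpha)
        Phi_a = Phi[:, active]
        Sigma = np.linalg.inv(beta * Phi_a.T @ Phi_a + np.diag(alpha[active]))

        # S_i, Q_i for every basis function (only the active block is inverted)
        proj_t = beta ** 2 * (Phi_a @ (Sigma @ (Phi_a.T @ t)))      # (N,)
        proj_Phi = beta ** 2 * (Phi_a @ (Sigma @ (Phi_a.T @ Phi)))  # (N, M)
        Q = beta * (Phi.T @ t) - Phi.T @ proj_t
        S = beta * np.sum(Phi ** 2, axis=0) - np.sum(Phi * proj_Phi, axis=0)

        # convert to the single-basis factors s_i, q_i
        s, q = S.copy(), Q.copy()
        a = alpha[active]
        s[active] = a * S[active] / (a - S[active])
        q[active] = a * Q[active] / (a - S[active])

        # apply the add / re-estimate / prune rule to one candidate per pass
        i = it % M
        theta = q[i] ** 2 - s[i]
        if theta > 0:
            alpha[i] = s[i] ** 2 / theta          # checkAddRule / checkUpdateRule
        elif active[i] and active.sum() > 1:
            alpha[i] = np.inf                     # checkPruneRule: basis i is irrelevant

    active = np.isfinite(alpha)
    Phi_a = Phi[:, active]
    Sigma = np.linalg.inv(beta * Phi_a.T @ Phi_a + np.diag(alpha[active]))
    mu = beta * Sigma @ Phi_a.T @ t
    return np.where(active)[0], mu   # relevance-vector indices and their weights
```

Note that the full design matrix Phi is still held in memory to score every candidate, which is exactly the N×N storage issue the slide raises.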
Iterative Reduction Algorithm [Figure: relevance vectors from subset 0 are pooled with new candidates to form subset J; training alternates across subsets over iterations I, I+1] • O(M³) in run time and O(M×N) in memory, where M is a user-defined parameter • Assumes that if P(w_k = 0 | w_{I,J}, D) is 1, then P(w_k = 0 | w, D) is also 1! Optimality?
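A sketch of the divide-and-conquer loop: keep a working set of at most M points, retain only the relevance vectors after each training round, and top the working set back up from the candidate pool. The trainer itself is passed in as a callable (e.g. a loop built on the Laplace step sketched earlier); the names here are hypothetical.

```python
import numpy as np

def iterative_reduction(X, y, train_rvm, chunk_size=1000, seed=0):
    """Divide-and-conquer RVM training sketch.

    `train_rvm(X_sub, y_sub)` is any RVM trainer that returns the indices
    (into X_sub) of the relevance vectors it retained."""
    rng = np.random.default_rng(seed)
    pool = rng.permutation(len(X))             # randomly ordered candidate pool
    keep = np.array([], dtype=int)             # indices (into X) of current RVs
    pos = 0
    while pos < len(pool):
        room = max(chunk_size - len(keep), 1)  # how many fresh candidates fit
        fresh = pool[pos:pos + room]
        pos += len(fresh)
        working = np.concatenate([keep, fresh])
        rv_local = np.asarray(train_rvm(X[working], y[working]), dtype=int)
        keep = working[rv_local]               # survivors carry over to the next subset
    return keep                                # final relevance-vector indices
```

Each training round only ever sees a chunk-sized problem, which is where the user-defined M in the slide's complexity figures comes from; the price is the independence assumption on the discarded weights noted above.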
Alphadigit Recognition • Data increased to 10,000 training vectors • The reduction method has been trained on up to 100k vectors (on a toy task); this is not feasible for the constructive method
Summary • First to apply kernel machines as acoustic models • Comparison of two machines that apply structural optimization to learning: SVM and RVM • Performance exceeds that of the HMM baseline, but still relies on quite a bit of HMM interaction • Algorithms that scale to larger data sizes are key
Decoupling the HMM • Still want to use segmental data (data size) • Want the kernel machine acoustic model to determine an optimal segmentation though • Need a new decoder • Hypothesize each phone for each possible segment • Pruning is a huge issue • Stack decoder is beneficial • Status: In development
Improved Iterative Algorithm [Figure: candidate pool split into subsets; relevance vectors from subset 0 are carried into training on subset 1] • Same principle of operation • One pass over the data – much faster! • Status: equivalent performance on all benchmarks – running on Alphadigits now
Active Learning for RVMs • Idea: given the current model, iteratively choose a subset of points from the full training set that will improve system performance • Problem #1: "performance" is typically defined as classifier error rate (e.g. boosting). What about the accuracy of the posterior estimate? • Problem #2: for kernel machines, an added training point can: • Assist in bettering the model performance • Become part of the model itself! How do we determine which points should be added? • Look to work in Gaussian processes (Lawrence, Seeger, Herbrich, 2003)
Extensions • Not ready for prime time as an acoustic model • How else might we use the same techniques for speech? • Online Speech/Noise Classification? • Requires adaptation methods • Application of automatic relevance determination to model selection for HMMs?
Acknowledgments • Collaborators: Aravind Ganapathiraju and Joe Picone at Mississippi State • Consultants: Michael Tipping (MSR-Cambridge) and Thorsten Joachims (now at Cornell)