Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 5: Generalization Error; Support Vector Machines
• Observation Vector Summary Statistic; Principal Components Analysis (PCA)
• Risk Minimization
  • If Posterior Probability is known: MAP is optimal
  • Example: Linear Discriminant Analysis (LDA)
  • When true Posterior is unknown: Generalization Error
  • VC Dimension, and bounds on Generalization Error
• Lagrangian Optimization
• Linear Support Vector Machines
  • The SVM Optimality Metric
  • Lagrangian Optimization of SVM Metric
  • Hyper-parameters & Over-training
• Kernel-Based Support Vector Machines
  • Kernel-based classification & optimization formulas
  • Hyperparameters & Over-training
  • The Entire Regularization Path of the SVM
• High-Dimensional Linear SVM
  • Text classification using indicator functions
  • Speech acoustic classification using redundant features
What is an Observation?
• An observation can be:
  • A vector created by “vectorizing” many consecutive MFCC or mel-spectra
  • A vector including MFCC, formants, pitch, PLP, auditory model features, …
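A minimal sketch of the “vectorize many consecutive frames” idea (the stack_frames helper and the frame counts below are illustrative, not taken from the lecture):

```python
import numpy as np

def stack_frames(features, context=8):
    """Vectorize consecutive frames around each time index.

    features: (T, D) array of per-frame features (e.g., MFCCs or mel-spectra).
    Returns a (T, (2*context+1)*D) array whose row t is the concatenation of
    frames t-context ... t+context (edges are zero-padded).
    """
    T, D = features.shape
    padded = np.vstack([np.zeros((context, D)), features, np.zeros((context, D))])
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Example: 100 frames of 39-dimensional MFCCs -> 100 observations of 17*39 = 663 dims
mfcc = np.random.randn(100, 39)
X = stack_frames(mfcc, context=8)
print(X.shape)  # (100, 663)
```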
Plotting the Observations, Part I: Scatter Plots and Histograms
Problem: Where is the Information in a 1000-Dimensional Vector?
Summary Statistics: Matrix Notation
(Figure: example vectors labeled y = −1 and y = +1)
Plotting the Observations, Part II: Principal Components Analysis
What Does PCA Extract from the Spectrogram? Plot: “PCAGram”
• Each 1024-dimensional principal component → 32×32 spectrogram, plotted as an image:
  • 1st principal component (not shown) measures total energy of the spectrogram
  • 2nd principal component: E(after landmark) − E(before landmark)
  • 3rd principal component: E(at the landmark) − E(surrounding syllables)
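A rough sketch of how such a “PCAGram” could be produced (the 32×32 patch size follows the slide; the data below is a random placeholder):

```python
import numpy as np

# X: M observations, each a 1024-dimensional vectorized 32x32 spectrogram patch
M, K = 500, 1024
X = np.random.randn(M, K)            # placeholder data

# Principal components = eigenvectors of the sample covariance,
# obtained here from the SVD of the mean-removed data matrix.
X0 = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X0, full_matrices=False)

pc1_image = Vt[0].reshape(32, 32)    # a "PCAGram": one component displayed as an image
projections = X0 @ Vt[:3].T          # coordinates of each token on the first 3 PCs
```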
Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio
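In symbols, writing the two classes as y = ±1, the MAP rule can be expressed as a sign test on the log likelihood ratio plus the log prior ratio:

```latex
\hat{y}(x) = \operatorname{sign}\!\left(
  \log \frac{p(x \mid y=+1)}{p(x \mid y=-1)}
  + \log \frac{P(y=+1)}{P(y=-1)}
\right)
```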
Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize R_emp(v, b))
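One common definition of the empirical risk of a linear classifier sign(vᵀx + b) is its training error rate; this particular 0–1 form is an assumption, and the lecture’s exact definition may differ:

```latex
R_{\mathrm{emp}}(v, b) \;=\; \frac{1}{M} \sum_{m=1}^{M}
  \mathbf{1}\!\left[\, y_m \left( v^{\top} x_m + b \right) < 0 \,\right]
```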
A Serious Problem: Over-Training
(Figure: the minimum-error projection of the training data, and the same projection applied to new test data)
Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to overtrain)
Lagrangian Optimization: Inequality Constraint
• Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
  • g(v*) = 0: the g(v) = 0 curve is tangent to the f(v) = f_min curve at v = v*
  • g(v*) > 0: v* minimizes f(v)
(Diagram from Osborne, 2004: contours of f(v), the regions g(v) < 0 and g(v) > 0, the constraint boundary g(v) = 0, the unconstrained minimum, and the constrained solution v*)
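The two solution types correspond to the standard Karush-Kuhn-Tucker conditions for this problem, stated here for reference (the lecture’s notation may differ slightly): either the multiplier α is zero and the constraint is inactive (g(v*) > 0), or the constraint is active (g(v*) = 0).

```latex
\nabla_v \big( f(v) - \alpha\, g(v) \big) = 0, \qquad
\alpha \ge 0, \qquad g(v) \ge 0, \qquad \alpha\, g(v) = 0
```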
Three Types of Vectors
• Interior vector: α = 0
• Margin support vector: 0 < α < C
• Error or partial error: α = C
(From Hastie et al., NIPS 2004)
Quadratic Programming
(Figure: iteration i of the QP solution, showing coordinates α_{i1} and α_{i2}, the constraint value C, and the unconstrained optimum α_i*)
• α_{i2} is off the margin; truncate to α_{i2} = 0.
• α_{i1} is still a margin candidate; solve for it again in iteration i+1.
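The “truncate” step is just clipping the unconstrained one-dimensional solution back into the box constraint 0 ≤ α ≤ C. A toy sketch of this idea, using generic dual coordinate descent for a linear SVM without a bias term (this is not the lecture’s exact algorithm, and the equality constraint that comes with a bias term is ignored here):

```python
import numpy as np

def dual_coordinate_descent(X, y, C=1.0, epochs=20):
    """Toy dual coordinate descent for a linear SVM (no bias term)."""
    M, K = X.shape
    alpha = np.zeros(M)
    v = np.zeros(K)                                   # v = sum_m alpha_m y_m x_m
    for _ in range(epochs):
        for m in range(M):
            grad = y[m] * (v @ X[m]) - 1.0            # d/d(alpha_m) of the dual objective
            new_alpha = alpha[m] - grad / (X[m] @ X[m])   # unconstrained 1-D minimizer
            new_alpha = np.clip(new_alpha, 0.0, C)        # truncate into the box [0, C]
            v += (new_alpha - alpha[m]) * y[m] * X[m]
            alpha[m] = new_alpha
    return alpha, v

# Toy usage with random placeholder data
X = np.random.randn(200, 10); y = np.random.choice([-1, 1], 200)
alpha, v = dual_coordinate_descent(X, y, C=1.0)
```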
Choosing the Hyper-Parameter to Avoid Over-Training (Wang, Presentation at CLSP workshop WS04)
(Figure: SVM test corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels)
Choosing the Hyper-Parameter to Avoid Over-Training
• Recall that v = Σ_m α_m y_m x_m
• Therefore, |v| < (C Σ_m |x_m|)^(1/2) < (C M max|x_m|)^(1/2)
• Therefore, the width of the margin is constrained to 1/|v| > (C M max|x_m|)^(−1/2), and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors
• Recommended solution:
  • Normalize x_m so that max|x_m| ≈ 1 (e.g., using libsvm)
  • Set C ≈ 1/M
  • If desired, adjust C up or down by a factor of 2, to see if the error rate on independent development test data will decrease
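A hedged sketch of this recipe using scikit-learn rather than the libsvm command-line tools named on the slide (the data below is a random placeholder):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training / development data
X_train = np.random.randn(2000, 39)
y_train = np.random.choice([-1, 1], size=2000)
X_dev = np.random.randn(500, 39)
y_dev = np.random.choice([-1, 1], size=500)

# Normalize so that max_m |x_m| is approximately 1
scale = np.linalg.norm(X_train, axis=1).max()
X_train, X_dev = X_train / scale, X_dev / scale

# Set C near 1/M, then adjust up/down by factors of 2 using development-set error
M = len(X_train)
best = None
for C in [0.25 / M, 0.5 / M, 1.0 / M, 2.0 / M, 4.0 / M]:
    clf = LinearSVC(C=C).fit(X_train, y_train)
    err = 1.0 - clf.score(X_dev, y_dev)
    if best is None or err < best[0]:
        best = (err, C)
print("best dev error %.3f at C = %.2e" % best)
```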
With Two Hyperparameters, Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Optimum Value of C Depends on γ (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
The Entire Regularization Path of the SVM: Algorithm (Hastie, Zhu, Tibshirani, and Rosset, NIPS 2004)
• Start with λ large enough (C small enough) that all training tokens are partial errors (α_m = C). Compute the solution to the quadratic programming problem in this case, including inversion of XᵀX or XXᵀ.
• Reduce λ (increase C) until the initial event occurs: two partial error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint. The slopes dα_m/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have α_m = C. Calculate the new values of dα_m/dλ for these two training vectors.
• Iteratively find the next breakpoint. The next breakpoint occurs when one of the following happens:
  • A value of α_m that was on the margin leaves the margin, i.e., the piece-wise-linear function α_m(λ) hits α_m = 0 or α_m = C.
  • One or more interior points enter the margin, i.e., in the QP problem, α_m = 0 becomes the unconstrained solution rather than just the constrained solution.
  • One or more partial error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution.
One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
• Download SVMPath code from Trevor Hastie’s web page
• Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
• For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of α_m.
• Test the SVMs on a separate development test database: for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error.
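The same model-selection loop can be approximated without the exact path algorithm by sweeping a grid of (C, γ) values and scoring each on the development set. A sketch with scikit-learn, using random placeholder data and a plain logarithmic grid in place of the true C-breakpoints that SVMPath would return:

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.random.randn(1000, 39); y_train = np.random.choice([-1, 1], 1000)
X_dev = np.random.randn(300, 39);    y_dev = np.random.choice([-1, 1], 300)

K = X_train.shape[1]
gammas = [f / K for f in (0.01, 0.1, 1.0, 10.0, 100.0)]  # a few orders of magnitude around 1/K
Cs = np.logspace(-3, 3, 25)                              # stand-in for the C-breakpoints

best = None
for g in gammas:
    for C in Cs:
        clf = SVC(C=C, gamma=g, kernel="rbf").fit(X_train, y_train)
        err = 1.0 - clf.score(X_dev, y_dev)
        if best is None or err < best[0]:
            best = (err, C, g)
print("dev error %.3f at C = %.3g, gamma = %.3g" % best)
```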
Results, RBF SVM
(Figure: SVM test corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels. Wang, WS04 Student Presentation, 2004)
Motivation: “Project it Yourself”
• The purpose of a nonlinear SVM:
  • f(x) contains higher-order polynomial terms in the elements of x.
  • By combining these higher-order polynomial terms, Σ_m y_m α_m K(x, x_m) can create a more flexible boundary than can Σ_m y_m α_m xᵀx_m.
  • The flexibility of the boundary does not lead to generalization error: the regularization term λ|v|² avoids generalization error.
• A different approach:
  • Augment x with higher-order terms, up to a very large dimension. These terms can include:
    • Polynomial terms, e.g., x_i x_j
    • N-gram terms, e.g., (x_i at time t AND x_j at time t)
    • Other features suggested by knowledge-based analysis of the problem
  • Then: apply a linear SVM to the higher-dimensional problem
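A minimal sketch contrasting the two routes, assuming scikit-learn (the data, dimensions, and hyperparameter values here are illustrative placeholders):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC, SVC

# Random placeholder data standing in for acoustic feature vectors
X = np.random.randn(1000, 10)
y = np.random.choice([-1, 1], 1000)

# Kernel route: the RBF SVM forms the nonlinear boundary implicitly
rbf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)

# "Project it yourself" route: explicitly add second-order terms x_i * x_j,
# then train a plain linear SVM in the augmented space
X_aug = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lin = LinearSVC(C=1.0 / len(X_aug)).fit(X_aug, y)
print(X.shape, "->", X_aug.shape)   # 10 dims -> 65 dims (10 linear + 10 squares + 45 cross terms)
```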
Example #1: Acoustic Classification of Stop Place of Articulation
• Feature dimension: K = 483 per 10 ms
  • MFCCs+d+dd, 25 ms window: K = 39 per 10 ms
  • Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond: K = 40 per 10 ms
  • Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10 per 10 ms
  • Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42 per 10 ms
  • Rate-place model of neural response fields in the cat auditory cortex: K = 352 per 10 ms
• Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions
• Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn’t help.
• RBF SVM still outperforms linear SVM, but only by 1%
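A minimal sketch of how the per-frame streams could be concatenated and then stacked over ±3 frames to form the 3381-dimensional observation (the stream contents are random placeholders; only the dimensions follow the slide):

```python
import numpy as np

T = 200  # number of 10 ms frames in some utterance
streams = {
    "mfcc_d_dd": np.random.randn(T, 39),
    "spectral_shape": np.random.randn(T, 40),
    "music_formants": np.random.randn(T, 10),
    "acoustic_phonetic": np.random.randn(T, 42),
    "rate_place": np.random.randn(T, 352),
}

# Concatenate the streams frame by frame: 39 + 40 + 10 + 42 + 352 = 483 dims per 10 ms
frames = np.hstack(list(streams.values()))           # shape (T, 483)

# Stack 7 consecutive frames (t-3 ... t+3) around a landmark frame t
def stack(frames, t, context=3):
    return frames[t - context : t + context + 1].reshape(-1)   # 7 * 483 = 3381 dims

x = stack(frames, t=100)
print(x.shape)                                        # (3381,)
```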