My Slides
• Support vector machines (brief intro)
• WS04: What we accomplished
• WS04: Organizational lessons
• AVICAR video corpus: current status
Support Vector Machines as (sort of) compared to Neural Networks*
* Difficult to do, because they have never been compared head-to-head on any speech task!
SVM = Regularized Nonlinear Discriminant
• Kernel: transform to an infinite-dimensional Hilbert space
• The only way in which an SVM differs from an RBF-NN is THE TRAINING CRITERION:
  c = argmin( training_error(c) + λ / width(margin(c)) )
• The SVM extracts a discriminant dimension, in which either a posterior or a likelihood is then modeled (see the sketch below):
  • (Bourlard/Morgan hybrid) Niyogi & Burges, 2002: posterior PDF = sigmoid model in the discriminant dimension, OR
  • (BDFK tandem) Borys & Hasegawa-Johnson, 2005: likelihood = mixture Gaussian in the discriminant dimension
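To make the "discriminant dimension" concrete, here is a minimal sketch, assuming scikit-learn and toy data (neither is named in the slides). It trains an RBF-kernel SVM, treats its decision value g(x) as the discriminant dimension, and fits both models mentioned above: a sigmoid posterior and a per-class Gaussian-mixture likelihood.

```python
# Sketch only: scikit-learn and the toy data are assumptions, not part of the original slides.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (200, 10)),   # class 0 observation vectors
               rng.normal(+1.0, 1.0, (200, 10))])  # class 1 observation vectors
y = np.array([0] * 200 + [1] * 200)

# RBF-kernel SVM: the kernel implicitly maps x into a high-dimensional Hilbert space;
# the margin-based training criterion (error + regularization) picks the discriminant.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The scalar decision value g(x) is the "discriminant dimension".
g = svm.decision_function(X).reshape(-1, 1)

# Option 1 (hybrid-style): posterior P(class=1 | x) = sigmoid(a*g(x) + b), fit on g.
posterior_model = LogisticRegression().fit(g, y)
posterior = posterior_model.predict_proba(g)[:, 1]

# Option 2 (tandem-style): class-conditional likelihood p(g(x) | class) as a Gaussian mixture on g.
gmm_per_class = {c: GaussianMixture(n_components=2, random_state=0).fit(g[y == c])
                 for c in (0, 1)}
loglik_class1 = gmm_per_class[1].score_samples(g)
```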
Advantages of SVMs w.r.t. NNs
• Accuracy:
  • An SVM generalizes much better from small training data sets (training tokens > 6× observation vector size); see the sketch after this slide
  • As training data size increases, the accuracies of the NN and the SVM converge
    • Theoretically, and in some practical experiments too
  • Like a 3-layer MLP, an RBF-SVM is a universal approximator
• Fast training: nearly quadratic optimality criterion
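The small-training-set claim is easy to spot-check. Below is a minimal sketch, assuming scikit-learn, synthetic data, and arbitrary hyper-parameters (all assumptions, not part of the slides): both classifiers are trained on a deliberately small set, roughly 8× the observation dimension, and compared on held-out data.

```python
# Sketch only: library, data, and hyper-parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# 20-dimensional observation vectors; keep only 160 training tokens (~8x the dimension).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=160, random_state=0)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0).fit(X_train, y_train)

print("SVM test accuracy:", svm.score(X_test, y_test))
print("MLP test accuracy:", mlp.score(X_test, y_test))
```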
Disadvantages of SVMs w.r.t. NNs
• No practical way to train with a very large training set
  • Complexity = O(N²): training is either fast or impossible
• High computational complexity at test time
  • Solution: Burges' reduced set method (an extra training step; currently available only in Matlab)
• Accuracy: unless you optimize the hyper-parameters, accuracy is good but not great
  • Exhaustive hyper-parameter search is very slow (a grid-search sketch follows this slide)
  • The "theoretically correct" hyper-parameters give good, but not great, accuracy
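To illustrate why exhaustive hyper-parameter search is costly, here is a minimal sketch assuming scikit-learn (an assumption, not a tool named in the slides): a grid search over the RBF-SVM's C and gamma re-solves the roughly O(N²) quadratic program once per grid point and per cross-validation fold.

```python
# Sketch only: library, grid values, and fold count are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 5 x 5 grid with 3-fold cross-validation = 75 separate SVM trainings,
# each roughly quadratic in the number of training tokens.
param_grid = {"C": np.logspace(-1, 3, 5), "gamma": np.logspace(-3, 1, 5)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```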
Disadvantages of SVMs w.r.t. NNs
• The real problem: we need phonetically labeled training data
• "Embedded re-estimation" experiment (a tandem setup of this kind is sketched below):
  • Pre-trained SVM outputs were used as HMM input (tandem system)
  • The RBF weights were re-estimated, together with the HMM parameters, to maximize the likelihood of the training data
  • Result: training-data likelihood and WRA (word recognition accuracy) increased
  • Result: test-data WRA decreased
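For reference, a tandem front end of the kind used in that experiment can be set up as follows. This is a minimal sketch; scikit-learn, hmmlearn, and the toy data are assumptions, and the joint re-estimation of RBF weights with HMM parameters is not shown.

```python
# Sketch only: libraries and data are assumptions; no joint SVM/HMM re-estimation is performed here.
import numpy as np
from sklearn.svm import SVC
from hmmlearn import hmm

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 12))              # toy acoustic frames
labels = (frames[:, 0] > 0).astype(int)          # toy frame-level phonetic labels

# Pre-trained RBF-SVM; its decision values become the tandem observation stream.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(frames, labels)
tandem_feats = svm.decision_function(frames).reshape(-1, 1)

# Gaussian HMM trained on the tandem features (stands in for the recognizer's acoustic model).
acoustic_model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
acoustic_model.fit(tandem_feats)
print("training-data log-likelihood:", acoustic_model.score(tandem_feats))
```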
WS04: SVM/DBN hybrid recognizer Word LIKE A Canonical Form … Tongue closed Tongue Mid Tongue front Tongue open … Surface Form Tongue front Semi-closed Tongue Front Tongue open … Manner Glide Front Vowel Place Palatal … SVM Outputs p( gPGR(x) | palatal glide release) p( gGR(x) | glide release ) x: Multi-Frame Observation including Spectrum, Formants, & Auditory Model …
WS04 Organizational Lessons: What Worked
• Innovative experiments, made possible by people who really wanted to be doing what they were doing
  • Result: the published ideas were interesting to many people
• Parallel SVM classification experiments allowed us to test many different SVM definitions
  • Result: classification errors mostly below 20% by the end of the workshop
• Parallel recognizer test experiments (the DBN/SVM hybrid was one; MaxEnt-based lattice rescoring was another)
  • Result: both achieved a small (non-significant) WER reduction over the baseline
WS04 Organizational Lessons: What Didn't Work
• Software bottleneck between the SVMs and the recognizers: only one tool could apply an SVM to every frame in a speech file, and only one person knew how to use it.
• Too many experimental variables: should SVMs be trained using (1) all frames, or (2) only landmark frames? The DBN expects #1. The HMM works best if manner features use #1 and place features use #2. And the DBN? Impossible to test in six weeks.
• Apples & oranges: SVM-only classifier outputs in cases #1 and #2 were incomparable, so no test short of full DBN integration was meaningful.
WS04 Organizational Lessons: What Didn't Work
• Unbeatable baseline: the goal was to rescore the output of the SRI recognizer in order to reduce WER, i.e., to find acoustic information not already used by the baseline recognizer.
  • What information is "not already used" by a phone-based ANN/HMM hybrid system? Hard to say.
  • When an experiment fails: why?
• Better: use an open-source baseline (not state of the art, but that's OK), and construct test systems in a continuum between the baseline and the target.
AVICAR: Recording Hardware
• 8 mics, pre-amps, wooden baffle. Best place = sun visor.
• 4 cameras, glare shields, adjustable mounting. Best place = dashboard.
• The system is not permanently installed; mounting requires 10 minutes.
AVICAR: Data Summary
• 100 talkers
• 5 noise conditions:
  • Engine idling
  • 35 mph, windows closed / windows open
  • 55 mph, windows closed / windows open
• 4 types of utterances:
  • Isolated digits
  • Phone numbers
  • Isolated letters (E-set = articulation test)
  • TIMIT sentences
• Public release: 16 schools & companies (but I don't know how many are using it)
AVICAR: Labeling & Recognition
• Manual lip segmentation: 36 images
• Automatic face tracking: nearly perfect
• Automatic lip tracking: not so good
• Manual audio segmentation: sentence boundaries
• Audio enhancement:
  • Audio digit WRA: 97, 89, 87, 84, 78%
AVICAR: Data Problems
• DivX encoding keeps the database under 300 GB, but…
• DivX => poor edge quality in some images
• Amelioration plan: re-transfer from the tapes at high quality, and offer the resulting huge data size to those who want it.