Visual Articulatory Feature Classification, and A Comparison of MLPs and SVMs
Ozgur Cetin, International Computer Science Institute
Goals
• Assess the performance of visual features for articulatory feature classification
• Compare the performance of multi-layer perceptrons (MLPs) with support vector machines (SVMs) for articulatory feature classification on audio
Data
• MIT webcam digits corpus
  • Around 53 minutes of data; 45 minutes used for training and the remaining 8 minutes for testing
  • 21 unique phones
• Audio observation vectors
  • MFCCs and their 1st- and 2nd-order derivatives
  • 41 dimensions, extracted every 10 msec
• Visual observation vectors
  • DCT coefficients computed on a histogram-equalized, medium-sized ROI around the lips; upsampled to match the audio frame rate
  • Speech recognition experiments used the first 35 dimensions; here the first 41 dimensions are used to match the audio feature dimensionality
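The slides do not say how the video-rate DCT features are upsampled to the 10 msec audio rate; a minimal sketch of one plausible scheme (per-dimension linear interpolation in NumPy; the function name and the interpolation choice are assumptions, not necessarily what was done):

```python
import numpy as np

def upsample_to_audio_rate(visual_feats, n_audio_frames):
    """Interpolate video-rate features (T_vid, D) to n_audio_frames rows."""
    t_vid = np.linspace(0.0, 1.0, num=len(visual_feats))
    t_aud = np.linspace(0.0, 1.0, num=n_audio_frames)
    # np.interp is 1-D, so interpolate each feature dimension separately
    return np.stack([np.interp(t_aud, t_vid, visual_feats[:, d])
                     for d in range(visual_feats.shape[1])], axis=1)
```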
Articulatory Feature Set 1
• The feature set we decided to use for acoustic modeling (restricted to 21 phones)
  PL1: sil alv den lab lab-den none rho vel
  DG1: sil app clo fric vow
  DG2: sil vow
  NAS: sil - +
  ROU: sil - +
  GLO: sil vl voi
  VOW: sil ah ao ax ay1 ay2 eh ey1 ey2 ih iy uw nil
  HT:  sil high low mid mid-h mid-l v-hi nil
  FRT: sil bk frt mid mid-f nil
Articulatory Feature Set 2
• Based on the vocal tract variables of articulatory phonology
• These may be more meaningful to infer from video than Set 1
  LIP-LOC:  pro, lab, den
  LIP-OPEN: clo, crit, nar, wide
  TT-LOC:   den, alv, p-a, ret
  TT-OPEN:  clo, crit, nar, m-n, mid, wide
  TB-LOC:   pal, vel, uvu, phar
  TB-OPEN:  clo, crit, nar, m-n, mid, wide
  VEL:      clo, open
  GLOT:     clo, crit, wide
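As a purely illustrative example of what a phone-to-feature table for Set 2 could look like (the actual mapping is not in these slides; the entries below are guesses based on standard articulatory phonology, using two phones that occur in the digits vocabulary):

```python
# Hypothetical phone-to-articulatory mapping for Feature Set 2.
# Illustrative values only, not the table used in the talk.
PHONE_TO_FEATS = {
    'n': {'TT-LOC': 'alv', 'TT-OPEN': 'clo',
          'VEL': 'open', 'GLOT': 'crit'},   # voiced alveolar nasal ("nine")
    's': {'TT-LOC': 'alv', 'TT-OPEN': 'crit',
          'VEL': 'clo', 'GLOT': 'wide'},    # voiceless alveolar fricative ("six")
}
```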
MLP-based Classification
• Train a separate neural network for each articulatory variable:
  • Input layer: 9-frame window centered on the current frame (41 x 9 = 369 inputs)
  • Hidden layer: 100 hidden units (set so that the number of parameters is around 5% of the data)
  • Output layer: 1-of-N mapping for each articulatory variable (targets obtained from phonetic forced alignments and phone-to-articulatory mappings)
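The actual experiments used Quicknet (next slide); as a rough scikit-learn sketch of the same architecture (the arrays `feats` and `labels` are hypothetical placeholders, and `early_stopping` stands in for the cross-validation monitoring):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(feats, radius=4):
    """Stack a 9-frame window around each frame: (T, 41) -> (T, 369)."""
    T, _ = feats.shape
    padded = np.pad(feats, ((radius, radius), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * radius + 1)])

# feats: (T, 41) audio features; labels: (T,) articulatory targets
# from forced alignments (both hypothetical here)
X = stack_context(feats)
mlp = MLPClassifier(hidden_layer_sizes=(100,), early_stopping=True)
mlp.fit(X, labels)                    # gradient-based training
posteriors = mlp.predict_proba(X)     # per-frame class posteriors
```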
MLP-based Classification (2)
• Training using gradient descent
  • Performance monitored on a cross-validation set
  • Very fast (less than one minute for this data)
• Quicknet and Pfile utilities
  • Mean/variance normalization (pfile_normutts)
  • Training (qnstrn)
  • Testing (qnsfwd)
  • Easy to use and available from www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html
SVM-based Classification
• The same processing stages as the MLP-based classifier, except:
  • A 5-frame window at offsets (-4, -2, 0, 2, 4) around the current frame, to reduce dimensionality (see the sketch below)
  • Training is slow, so only 2500-5000 data points are used (less than 2% of all available data)
  • Points are randomly selected so as to balance the classes
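A minimal sketch of the sparse 5-frame stacking (the offsets come from the slide; the function name is an assumption):

```python
import numpy as np

def stack_skip_context(feats, offsets=(-4, -2, 0, 2, 4)):
    """Stack frames at the given offsets: (T, 41) -> (T, 205)."""
    T, _ = feats.shape
    pad = max(abs(o) for o in offsets)
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode='edge')
    return np.hstack([padded[pad + o: pad + o + T] for o in offsets])
```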
SVM-based Classification (2)
• Training and testing using Joachims' SVMmulticlass toolkit
  • svm_multiclass_learn and svm_multiclass_classify
  • Based on the SVMlight quadratic optimizer
  • Very easy to use, with a small memory footprint
  • But it does not seem to scale well with the amount of training data, and is slow (1-2 days for this data set)
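The experiments used SVMmulticlass; for illustration only, here is an analogous setup in scikit-learn (class-balanced subsampling plus an RBF-kernel multiclass SVM; `X5`, `y`, and the per-class count are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

def balanced_subsample(X, y, per_class, seed=0):
    """Randomly keep at most `per_class` frames from each class."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c),
                   size=min(per_class, np.sum(y == c)),
                   replace=False)
        for c in np.unique(y)])
    return X[idx], y[idx]

# X5: (T, 205) skip-window features from above; y: (T,) targets
Xs, ys = balanced_subsample(X5, y, per_class=500)  # ~2500-5000 points total
svm = SVC(kernel='rbf', C=1.0, gamma=1.0)          # RBF kernel, as in the slides
svm.fit(Xs, ys)
```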
Results: Visual Articulatory Classification using Feature Set 1
• Both audio and visual classifiers are MLPs
• Performance of the audio MLPs is very good
• Visual MLPs perform better than the a-priori classifier for PL2, NAS, ROU, and GLO
• Still, their performance lags well behind the audio MLPs
Results: Visual Articulatory Classification using Feature Set 2
• All classifiers are again MLPs
• Apart from LIP-OPEN, visual MLPs do not fare better than the a-priori classifier
• Class priors are high because silence dominates the data
Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 1)
• 1 minute of data is too small to learn much from
• SVMs use an RBF kernel (γ = 1)
  • Tried several values of C and γ
• SVMs do not seem competitive based on these very limited experiments
Results: SVM- vs. MLP-based Classification using Feature Set 2 (γ = 0.1)
• 1 minute of data is too small to learn much from, but SVMs seem to fare better than MLPs
• SVMs are not competitive with MLPs when using all the data
• γ = 0.1 is much better than γ = 1 (i.e., more training examples contribute during testing)
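For reference (a standard fact, not stated on the slide), the RBF kernel is

K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)

so a smaller γ gives each support vector a wider radius of influence: more training examples contribute non-negligibly to each test-time decision, which is the effect noted above.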
Conclusions: Visual Articulatory Classification
• Visual articulatory classification performance is not very good
  • Mostly close to, or the same as, the a-priori classifier
• Possible directions:
  • Exploit the dependency between articulatory variables
  • Try different visual features, or a different data set?
• The ASR performance of the current visual features is not very good either (around 80% WER)
Conclusions: SVM-based Classification
• SVMs seem competitive for small amounts of data, but not with MLPs trained on all the data
• The main problems I found were the training time and the amount of data SVMs can handle
• Alternatives:
  • Instead of frame-based classification, try sequence-based classification, e.g. classify each phone segment using a Fisher kernel (see the sketch below)
  • This would significantly reduce the amount of data, and it is what we actually do in practice
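A minimal sketch of the Fisher-kernel idea: map each variable-length phone segment to one fixed-length vector via the gradient of a generative model's log-likelihood, then classify segments with an SVM. The diagonal-covariance GMM construction below (gradients with respect to the means only) and all names are assumptions, not necessarily the variant the talk has in mind:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(frames, gmm):
    """Map a segment (T, D) to one fixed-length (K*D,) Fisher score vector."""
    gamma = gmm.predict_proba(frames)                    # (T, K) responsibilities
    diff = frames[:, None, :] - gmm.means_[None, :, :]   # (T, K, D)
    # d/d(mu_k) log p(X) = sum_t gamma_t(k) * (x_t - mu_k) / sigma_k^2
    U = np.einsum('tk,tkd->kd', gamma, diff / gmm.covariances_[None, :, :])
    return U.ravel() / len(frames)

# all_frames: (N, D) training frames; segments: list of (T_i, D) phone segments
gmm = GaussianMixture(n_components=16, covariance_type='diag').fit(all_frames)
X_seg = np.stack([fisher_vector(seg, gmm) for seg in segments])
# X_seg can now be fed to an SVM: one example per phone segment, not per frame
```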