Kernel-Based Detectors and Fusion of Phonological Attributes Brett Matthews Mark Clements
Outline
• Frame-Based Detection
  • One-vs-all detectors
  • Context-dependent framewise detection
  • Probabilistic outputs
• Kernel-Based Attribute Detection
  • SVM
  • Least-Squares SVM
• Evaluating Probabilistic Estimates
  • Naïve Bayes combinations
  • Hierarchical manner classification
• Detector Fusion
  • Genetic programming
Frame-Based Detection
• One-vs-all classifiers
  • Manner of articulation: vowel, fricative, stop, nasal, glide/semivowel, silence
  • Place of articulation: dental, labial, coronal, palatal, velar, glottal, back, front
  • Vowel manners: high, mid, low, back, round
• Framewise detection (see the sketch after this list)
  • 10 ms frame rate
  • 12 MFCCs + energy
  • 8 context-dependent frames
• Classifier types & posterior probabilities
  • Artificial neural nets: probabilistic outputs
  • Kernel-based classifiers
    • SVM: empirically determined posterior probabilities
    • LS-SVMs: probabilistic outputs
[Figure: one-vs-all attribute detectors (vowel, silence, dental, velar, voicing) feeding event fusion]
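For concreteness, a minimal sketch of how the frame-level feature vectors described above might be assembled: each 10 ms frame carries 12 MFCCs plus energy (13 dimensions), and 8 context frames are stacked around the center frame. The symmetric 4+4 window split is an assumption; the slides do not specify how the 8 context frames are arranged.

```python
import numpy as np

def stack_context(feats, left=4, right=4):
    """Stack context frames around each center frame.

    feats: (n_frames, 13) array of 12 MFCCs + energy per 10 ms frame.
    Returns (n_frames, 13 * (left + 1 + right)) context-dependent features.
    Edge frames are padded by repeating the first/last frame.
    """
    n, d = feats.shape
    padded = np.vstack([np.repeat(feats[:1], left, axis=0),
                        feats,
                        np.repeat(feats[-1:], right, axis=0)])
    return np.hstack([padded[i:i + n] for i in range(left + 1 + right)])

# Example: 100 frames of 13-dim features -> 100 x 117 context features
x = stack_context(np.random.randn(100, 13))
```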
Kernel-Based Classifiers
• Support Vector Machines (SVM)
• LS-SVM classifier
  • Kernel-based classifier like the SVM
  • Least-squares formulation
  • Probabilistic output scores
  • LS-SVMlab package (Katholieke Universiteit Leuven)
  • Same decision function as the SVM (shown below)
  • Subject to equality constraints instead of inequality constraints
  • No margin optimization; the solution comes from a linear system
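Both classifiers share the same kernel decision function; in its standard form (stated here from the common SVM/LS-SVM literature rather than copied from the slides),

$$f(x) = \operatorname{sign}\Big( \sum_{k=1}^{N} \alpha_k \, y_k \, K(x, x_k) + b \Big)$$

where the support values \( \alpha_k \) come from the margin optimization for the SVM and from a linear system for the LS-SVM.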
Least-Squares SVMs
• "Support vectors" α found by solving a linear system (shown below)
• Kernel functions: linear, polynomial, RBF
• Probabilistic outputs
  • Bayesian inference for posterior probabilities
  • Moderated outputs can be directly interpreted as posterior probabilities
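The linear system referred to above has, in the standard LS-SVM classification formulation (Suykens & Vandewalle; written here from the literature rather than copied from the slides), the form

$$\begin{bmatrix} 0 & y^{\top} \\ y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad \Omega_{kl} = y_k \, y_l \, K(x_k, x_l),$$

with the three kernel choices listed above:

$$K(x,z) = x^{\top} z \ \text{(linear)}, \quad K(x,z) = (x^{\top} z + c)^d \ \text{(polynomial)}, \quad K(x,z) = \exp\!\big(-\|x - z\|^2 / 2\sigma^2\big) \ \text{(RBF)}.$$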
Evaluating Probabilistic Estimates
• Reliability and accuracy of probabilistic scores
• Initial fusion experiments
  • Hierarchical manner classification (LS-SVM, SVM)
  • Naïve Bayes combination for phone detection (LS-SVM, SVM, ANN)
[Figure: reliability plots for LS-SVM and SVM]
Hierarchical Combinations
• Probabilistic phonetic-feature hierarchy for classifying frames into 6 manner classes
• Train binary detectors on each split in the hierarchy: 5 detectors, 6 classes
  • silence vs. speech
  • sonorant | speech
  • vowel | sonorant
  • stop | non-sonorant
  • semivowel | sonorant consonant
• Leaf posteriors are products along the path, e.g. (see the sketch below):
  P(fricative | x) = (1 − P(stop | non-sonorant)) · (1 − P(sonorant | speech)) · P(speech | x)
[Figure: fricative detection vs. ground truth]
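A minimal sketch of the path-product combination, following the five splits listed above; the only product the slides spell out is the fricative one, so the remaining branches are filled in from the stated hierarchy:

```python
def manner_posteriors(p_speech, p_son, p_vowel, p_stop, p_semi):
    """Combine 5 binary detector posteriors into 6 manner-class posteriors.

    Arguments are the conditional posteriors at each split of the hierarchy:
      p_speech = P(speech | x)
      p_son    = P(sonorant | speech)
      p_vowel  = P(vowel | sonorant)
      p_stop   = P(stop | non-sonorant)
      p_semi   = P(semivowel | sonorant consonant)
    """
    return {
        "silence":   1.0 - p_speech,
        "vowel":     p_speech * p_son * p_vowel,
        "semivowel": p_speech * p_son * (1 - p_vowel) * p_semi,
        "nasal":     p_speech * p_son * (1 - p_vowel) * (1 - p_semi),
        "stop":      p_speech * (1 - p_son) * p_stop,
        "fricative": p_speech * (1 - p_son) * (1 - p_stop),
    }
```

By construction the six posteriors sum to one, which is what makes the five binary detectors sufficient for a six-class decision.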
Hierarchical Combinations
• Reliability of posterior probabilities (see the sketch below)
  • Plot probabilistic estimates of combinations vs. observed frequencies
  • Hierarchical combinations are much more reliable for SVM than for LS-SVM
• Classification accuracy
  • Higher classification accuracy for SVMs, especially for fricatives
• Upper-bound comparison
  • One-vs-all classifiers trained directly for each class
  • Combinations nearly as accurate as one-vs-all classifiers
  • LS-SVM combinations perform poorly for semivowel and nasal
[Figures: reliability plots for LS-SVM (combined) and SVM (combined); table: classification accuracy (%) for vowel, stop, fricative, semivowel/glide, nasal, silence]
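A minimal sketch of the reliability evaluation described above: bin the predicted posteriors and compare each bin's mean prediction with the observed frequency of the positive class. The bin count and equal-width layout are assumptions, not taken from the slides.

```python
import numpy as np

def reliability_curve(p_pred, y_true, n_bins=10):
    """Mean predicted posterior vs. observed frequency, per probability bin.

    p_pred: predicted posteriors in [0, 1]; y_true: binary labels (0/1).
    A well-calibrated detector yields points near the diagonal.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, bins) - 1, 0, n_bins - 1)
    occupied = [b for b in range(n_bins) if np.any(idx == b)]
    mean_pred = np.array([p_pred[idx == b].mean() for b in occupied])
    obs_freq = np.array([y_true[idx == b].mean() for b in occupied])
    return mean_pred, obs_freq
```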
Naïve Bayes Combinations
• One-vs-all frameworks are desired; phonetic hierarchies are cumbersome
• Phone detection
  • Combine phonological attribute scores with a Naïve Bayes product, e.g. (see the sketch below):
    P(/f/ | x) = P(labial | x) · P(fricative | x) · (1 − P(voicing | x))
• Initial experiments in evaluating probabilities
  • Compare accuracy and reliability of probabilistic outputs for ANN, SVM and LS-SVM
  • Limited training data (LS-SVM is limited to 3000 samples by memory restrictions)
  • Detect phones with combinations of relevant phonetic attributes
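A sketch of the Naïve Bayes product over attribute posteriors. Only the /f/ combination is given in the slides; the attribute map below is otherwise hypothetical and would in practice list the relevant attributes for every phone.

```python
# Hypothetical attribute map: which attribute posteriors define each phone,
# and whether the phone requires the attribute present (+1) or absent (-1).
PHONE_ATTRIBUTES = {
    "/f/": [("labial", +1), ("fricative", +1), ("voicing", -1)],
    "/v/": [("labial", +1), ("fricative", +1), ("voicing", +1)],
}

def phone_posterior(phone, attr_probs):
    """Naive Bayes product of attribute posteriors for one frame.

    attr_probs: dict mapping attribute name -> P(attribute | x).
    """
    p = 1.0
    for attr, sign in PHONE_ATTRIBUTES[phone]:
        p *= attr_probs[attr] if sign > 0 else 1.0 - attr_probs[attr]
    return p

# Example: P(/f/ | x) = P(labial | x) * P(fric | x) * (1 - P(voicing | x))
print(phone_posterior("/f/", {"labial": 0.8, "fricative": 0.9, "voicing": 0.1}))
```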
Naïve Bayes Combinations
• Phone detection
  • Compare combined attributes with direct training on phones as an upper bound
• ROC stats (see the sketch below)
  • SVMs best for attribute detection
  • Mixed results for NB combinations; no clear winner between LS-SVM and SVM
  • Direct training outperforms combinations
• Reliability
  • Naïve Bayes combinations give poor reliability for all detector types
• Rare phones & vowels
  • For /v/, /ng/ and /oy/, improvements in EER and AUC across detector types
  • Most vowels saw improvements as well
[Figures: ROC stats; direct vs. combined]
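The EER and AUC figures of merit used above can be read off the ROC curve; a minimal sketch using scikit-learn (an assumption, the original toolchain is not stated in the slides):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(y_true, scores):
    """Equal error rate (where FPR = FNR = 1 - TPR) and area under the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]
    return eer, roc_auc_score(y_true, scores)
```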
Genetic Programming
• Evolutionary algorithm for tree-structured feature "creation" (extraction)
• Maximize a fitness function across a number of generations (iterations)
• Operations like crossover and mutation control the evolution of the algorithm
• Trees are algebraic networks
  • Inputs are multi-dimensional features
  • Tree nodes are unary or binary mathematical operators (+, −, ×, (·)², log)
  • Algebraic networks are simpler and more transparent than neural nets
• GPLab package from Universidade de Coimbra, Portugal
  • http://gplab.sourceforge.net
Genetic Programming
• Trained GP trees on SVM outputs
  • Develop algebraic networks for combining detector outputs
  • Produce a 1-D feature from a nonlinear combination of detector outputs (see the sketch below)
  • Choose the fitness function, set of node operators, tree depth, etc. to maximize separation
[Figure: GP tree combining attribute and phone detector outputs (vowel, silence, dental, velar, voicing, /aa/, /ae/, /zh/) into a 1-D feature]
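To make the tree-as-algebraic-network idea concrete, a hypothetical sketch of evaluating one such tree over detector outputs. The node operator set is the one listed above; the example tree itself is invented for illustration, not one of the evolved trees.

```python
import math

# A GP individual as a nested tuple: (operator, child, ...) or a detector name.
OPS = {
    "+":   lambda a, b: a + b,
    "-":   lambda a, b: a - b,
    "*":   lambda a, b: a * b,
    "sq":  lambda a: a * a,
    "log": lambda a: math.log(max(a, 1e-10)),  # protected log, a common GP convention
}

def evaluate(tree, detector_outputs):
    """Recursively evaluate a GP tree into a 1-D feature for one frame."""
    if isinstance(tree, str):                  # leaf: a detector output
        return detector_outputs[tree]
    op, *children = tree
    return OPS[op](*(evaluate(c, detector_outputs) for c in children))

# Invented example tree: log(P(fric)^2) + (P(labial) - P(voicing))
tree = ("+", ("log", ("sq", "fricative")), ("-", "labial", "voicing"))
print(evaluate(tree, {"fricative": 0.9, "labial": 0.8, "voicing": 0.1}))
```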
Genetic Programming
• The full system is complex for speech recognition (a tree plus a classifier for each phone), but the GP trees themselves provide insights for combination
  • Fitness function
  • Tree node operators
  • Important features
• Initial results
  • Mixed: good separation for some phones, not good for most
  • GP trees select attributes of interest and discard others
  • Still in progress
[Figures: 1-D GP feature distributions for /oy/ and /th/]
Summary
• Evaluating posterior probabilities
  • ANNs, SVMs, LS-SVMs
  • SVMs are best for reliability and accuracy
  • With limited training data, rare phones may benefit from overlapping phonetic classes
• Genetic programming for detector fusion
  • Small, transparent algebraic networks for combining attribute detectors
  • GP trees select relevant attributes, but there is much room for improvement
  • Limiting tree node operators and selecting fitness functions should provide insights into detector fusion
Extras
[Figures: feature-space correlation matrices (1)-(3)]
[Table: training data; the kernel function K and the range of kernel parameters]
Extras
Determine w and b by solving the optimization problem

$$\min_{w,\,b,\,e}\; J(w, e) = \tfrac{1}{2}\, w^{\top} w + \tfrac{\gamma}{2} \sum_{k=1}^{N} e_k^2$$

subject to

$$y_k \big( w^{\top} \varphi(x_k) + b \big) = 1 - e_k, \qquad k = 1, \ldots, N,$$

where \( \tfrac{1}{2} w^{\top} w \) is the generalization/regularization term, \( e_k \) is the regression error for training sample k, and the positive scale parameter \( \gamma \) expresses the trade-off between generalization and training-set error.
Extras
• Support Vector Machines
  • Good performance, but the majority of training points became support vectors
  • Posterior probabilities