TANDEM OBSERVATION MODELS
Introduction
• Tandem is a method that uses the predictions of an MLP as observation vectors in generative models, e.g., HMMs
• Extensively used in the ICSI/SRI systems: 10-20% improvement for English, Arabic, and Mandarin
• Most previous work derived tandem features from phone MLPs (e.g., Hermansky et al. '00; Morgan et al. '05)
• We explore tandem features based on articulatory MLPs
- Similar to the approach in Kirchhoff '99
• Questions
- Are articulatory tandems better than phonetic ones?
- Are factored observation models for tandem and acoustic (e.g., PLP) observations better than observation concatenation approaches?
Tandem Processing Steps
[Pipeline: MLP outputs → logarithm → principal component analysis → speaker mean/variance normalization → tandem feature]
• MLP posteriors are processed to make them more Gaussian-like
• There are 8 articulatory MLPs; their outputs are concatenated at the input (64 dims)
• PCA reduces the dimensionality to 26 (95% of the total variance)
• This 26-dimensional vector is used as the acoustic observation in an HMM or some other model
• Tandem features are usually used in combination w/ a standard feature, e.g., PLP (a sketch of the pipeline follows below)
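The steps above can be condensed into a short sketch. This is a minimal illustration, not the authors' code: the `posteriors` and `speaker_ids` inputs, the frame-level shapes, and the plain-numpy PCA are all assumptions.

```python
# Illustrative sketch of the tandem pipeline: log of MLP posteriors,
# PCA down to 26 dims, then per-speaker mean/variance normalization.
# Input shapes are assumptions, not taken from the slides.
import numpy as np

def tandem_features(posteriors, speaker_ids, n_components=26, eps=1e-10):
    """posteriors: (n_frames, 64) concatenated outputs of the 8 articulatory
    MLPs; speaker_ids: (n_frames,) speaker label for each frame."""
    x = np.log(posteriors + eps)                 # make posteriors more Gaussian-like

    # PCA via eigendecomposition of the covariance (26 dims ~ 95% of variance)
    x_centered = x - x.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(x_centered, rowvar=False))
    basis = eigvecs[:, ::-1][:, :n_components]   # top principal components
    y = x_centered @ basis

    # per-speaker mean/variance normalization
    out = np.empty_like(y)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        out[idx] = (y[idx] - y[idx].mean(axis=0)) / (y[idx].std(axis=0) + eps)
    return out
```

The resulting 26-dimensional vectors are then either appended to the PLP features or modeled as a separate stream, as described on the next slide.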
Tandem Observation Models
[Figure: concatenated observations, where the state emits the appended (tandem, PLP) vector, vs. factored observations, where p(X, Y|Q) = p(X|Q) p(Y|Q) with separate tandem and PLP streams]
• Feature concatenation: simply append tandems to PLPs
- All of the standard modeling methods apply to this meta observation vector (e.g., MLLR, MMIE, and HLDA)
• Factored models: tandem and PLP distributions are factored at the HMM state output distributions
- Potentially more efficient use of free parameters, especially if the streams are conditionally independent
- Can use, e.g., separate triphone clusters for each observation stream (see the sketch after this list)
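To make the contrast concrete, here is a minimal sketch using a single Gaussian per state (the real systems use mixture models); the feature dimensions and toy frames are assumptions.

```python
# Toy contrast of concatenated vs. factored observation models for one
# HMM state Q, using single Gaussians; dimensions are assumptions.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
tandem = rng.normal(size=26)   # tandem frame X (26 dims)
plp = rng.normal(size=39)      # PLP frame Y (39 dims, assumed)

# Concatenation: one density over the appended 65-dim meta vector
concat_pdf = multivariate_normal(mean=np.zeros(65), cov=np.eye(65))
ll_concat = concat_pdf.logpdf(np.concatenate([tandem, plp]))

# Factoring: p(X, Y|Q) = p(X|Q) p(Y|Q), two separate stream densities
tandem_pdf = multivariate_normal(mean=np.zeros(26), cov=np.eye(26))
plp_pdf = multivariate_normal(mean=np.zeros(39), cov=np.eye(39))
ll_factored = tandem_pdf.logpdf(tandem) + plp_pdf.logpdf(plp)
```

With identical per-stream parameters the two log-likelihoods coincide; the factored form is the block-diagonal special case of the concatenated model. The practical gain is that each stream can then get its own parameter budget, e.g., its own mixture counts or its own triphone clustering.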
Articulatory vs. Phone Tandems
• Monophones on the 500-word vocabulary task w/o alignments; feature-concatenated PLP/tandem models
• All tandem systems are significantly better than PLP alone
• Articulatory tandems are as good as phone tandems
• Articulatory tandems from Fisher-trained MLPs (1776 hrs) outperform those from SVB-trained MLPs (3 hrs)
Concatenation vs. Factoring
• Monophone models w/o alignments
• All tandem results are significant over the PLP baseline
• Consistent improvements from factoring; statistically significant on the 500-word task
Triphone Experiments
• 500-word vocabulary task w/o alignments
• PLP × Tandem factoring uses separate decision trees for PLP and tandem, as well as factored pdf's
• A significant improvement from factoring over the feature-concatenation approach
• All pairwise differences between results are statistically significant
Observation Factoring and Weight Tuning
[Figure: two factored models. Factored tandem: the phone state generates PLPs plus one stream of KLT'ed log MLP outputs, kept separate from the PLP outputs. Fully factored tandem: the phone state generates PLPs plus one stream per articulatory MLP (dg1, pl1, rd, ...), each being the log outputs of a separate MLP. Stream dimensions after KLT account for 95% of the variance.]
Weight Tuning
[Figure: weight tuning for the factored and fully factored models. Starting point: MLP stream weight = 1, with the language model tuned for PLP weight = 1; weight tuning in progress. A sketch of per-stream weighting follows.]
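In stream-weighted models of this kind, the weights enter as exponents on the per-stream likelihoods, i.e., the log-likelihoods are combined as a weighted sum. A minimal sketch, with made-up log-likelihood values and stream names (dg1, pl1, rd) taken from the figure:

```python
# Weighted combination of per-stream log-likelihoods for one HMM state.
# The numeric log-likelihood values below are purely illustrative.

def weighted_loglik(stream_logliks, stream_weights):
    """stream_logliks / stream_weights: dicts keyed by stream name."""
    return sum(stream_weights[s] * ll for s, ll in stream_logliks.items())

# Factored tandem: PLP stream plus one KLT'ed MLP stream (MLP weight = 1)
ll = weighted_loglik({"plp": -41.2, "tandem": -27.8},
                     {"plp": 1.0, "tandem": 1.0})

# Fully factored: PLP stream plus one stream per articulatory MLP
ll_full = weighted_loglik({"plp": -41.2, "dg1": -3.1, "pl1": -4.0, "rd": -2.2},
                          {"plp": 1.0, "dg1": 1.0, "pl1": 1.0, "rd": 1.0})
```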
Summary
• Tandem features w/ PLPs outperform PLPs alone for both monophones and triphones
- 8-13% relative improvements (statistically significant)
• Articulatory tandems are as good as phone tandems
- Further comparisons w/ phone MLPs trained on Fisher
• Factored models look promising (significant results on the 500-word vocabulary task)
- Further experiments w/ tying and initialization
- Judiciously selected dependencies between the factored vectors, instead of complete independence