TANDEM ACOUSTIC MODELING IN LARGE-VOCABULARY RECOGNITION Daniel P.W. Ellis1, Rita Singh2, and Sunil Sivadas3 2001 ICASSP 2012/10/22 Presented by 汪逸婷
Outline • Introduction • The SPINE1 tandem system • Experimental results • Discussion • Conclusions
1. Introduction • Neural networks (NNs) • When used to estimate the posterior probabilities of a closed set of subword units, they allow discriminative training in a natural and efficient manner (see the sketch below). • They also make few assumptions about the statistics of the input features. • They have been found to cope well with highly correlated and unevenly distributed features (such as spectral energy features).
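A minimal sketch of what such a posterior-estimating network computes per frame: a one-hidden-layer MLP with a softmax output over a closed set of context-independent phone classes. The layer sizes, the 9-frame input context, and the tanh hidden units are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the paper's exact network): a one-hidden-layer MLP whose
# softmax output estimates per-frame posterior probabilities over a closed set
# of context-independent phone classes. All sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

N_INPUT = 9 * 13    # e.g. 9 stacked frames of 13 PLP coefficients (assumed)
N_HIDDEN = 500      # hidden layer size (assumed)
N_PHONES = 46       # number of CI phone classes (assumed)

W1 = rng.normal(scale=0.01, size=(N_HIDDEN, N_INPUT))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.01, size=(N_PHONES, N_HIDDEN))
b2 = np.zeros(N_PHONES)

def phone_posteriors(x):
    """Forward pass: one acoustic frame (with context) -> posterior vector."""
    h = np.tanh(W1 @ x + b1)      # nonlinear hidden layer
    z = W2 @ h + b2
    z -= z.max()                  # numerical stability
    p = np.exp(z)
    return p / p.sum()            # softmax output = phone posteriors

# Training (not shown) would minimize framewise cross-entropy against phone
# labels, which is what makes this front end discriminative.
```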
1. Introduction • Gaussian mixture models (GMMs) • Often used to build independent distribution models for each subword unit. • Work best when supplied with low-dimensional, decorrelated input features (see the post-processing sketch below). • On small tasks: NN-HMM > GMM-HMM. • On large-vocabulary tasks (DARPA / NIST evaluations): NN-HMM << GMM-HMM. • Equivalent adaptation is much more difficult for NN-based systems.
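The tandem idea pairs the two: the network's posteriors are post-processed so that they resemble the low-dimensional, decorrelated features that GMMs model well. A minimal sketch, assuming log compression followed by a PCA/KLT rotation (the number of dimensions kept and the plain-PCA transform are illustrative assumptions):

```python
# Minimal sketch of tandem post-processing: take the network's per-frame
# posteriors, compress with a log (to make them more Gaussian-like), then
# decorrelate with a PCA/KLT rotation so diagonal-covariance GMMs can model them.
import numpy as np

def tandem_features(posteriors, n_keep=24, eps=1e-8):
    """posteriors: (n_frames, n_phones) NN outputs -> decorrelated tandem features."""
    logp = np.log(posteriors + eps)          # log-posterior "features"
    mu = logp.mean(axis=0)
    centered = logp - mu
    cov = np.cov(centered, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)     # KLT / PCA basis
    order = np.argsort(eigval)[::-1][:n_keep]
    return centered @ eigvec[:, order]       # decorrelated, reduced features

# These features are then fed to a conventional GMM-HMM recognizer in place of
# (or alongside) standard MFCC/PLP features.
```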
1. Introduction • Tandem system for the NRL SPINE1 task. • Questions: • Do GMM systems still outperform NN-based systems on large tasks? • Does the NN feature preprocessor continue to confer an advantage in larger tasks involving more contextual variability? • Are model adaptation schemes such as MLLR (sketched below) effective in the new feature space defined by the network outputs?
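For the adaptation question, a highly simplified sketch of global MLLR mean adaptation. Hard frame-to-Gaussian assignments, diagonal covariances, and enough adaptation data to keep the accumulators well-conditioned are simplifying assumptions; a real system would use soft occupation counts from forced alignment.

```python
# Highly simplified global MLLR mean adaptation: estimate one affine transform W
# that maps every Gaussian's extended mean [1; mu] closer to the adaptation data.
import numpy as np

def global_mllr_means(frames, assign, means, variances):
    """
    frames:    (T, d) adaptation observations
    assign:    (T,)   index of the Gaussian each frame is assigned to (hard labels)
    means:     (G, d) current Gaussian means
    variances: (G, d) diagonal covariances
    returns    (G, d) adapted means  W @ [1; mu_g]
    """
    T, d = frames.shape
    xi = np.hstack([np.ones((means.shape[0], 1)), means])   # extended means (G, d+1)
    W = np.zeros((d, d + 1))
    for i in range(d):                                       # solve one row of W at a time
        G_i = np.zeros((d + 1, d + 1))
        k_i = np.zeros(d + 1)
        for t in range(T):
            g = assign[t]
            w = 1.0 / variances[g, i]                        # precision weighting
            G_i += w * np.outer(xi[g], xi[g])
            k_i += w * frames[t, i] * xi[g]
        W[i] = np.linalg.solve(G_i, k_i)                     # assumes G_i is nonsingular
    return xi @ W.T                                          # adapted means for all Gaussians

# The tandem question is whether such a linear transform remains effective when
# the "features" are post-processed network outputs rather than MFCC/PLP.
```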
1. Introduction • Corpus: • NRL SPINE1. • 5,000-word vocabulary. • Utterances are predominantly noisy. • Signal-to-noise ratios ranging from 5 dB to 20 dB. • Data consists of human-human dialogs in a battleship game. • WERs on noisy digits tasks were at 1% or below in the best cases, whereas even the very best systems on SPINE1 are at about 25%.
4. Discussion • The large improvement is largely eliminated for the context-dependent models. • MLLR results in a greater improvement for the tandem features than for MFC and PLP features (the improvements are comparable for context-dependent models). • The advantages in modeling CD classes gained by the net's feature-space remapping appear to be largely nullified.
5. Conclusions • The tandem approach, in which a neural network trained to estimate posterior probabilities of CI units serves as a feature preprocessor, can achieve significant reductions in WER. • Further work is needed to extend these benefits to CD models.