
Learning Long-Term Temporal Features for Conversational Speech Recognition



Presentation Transcript


  1. Learning Long-Term Temporal Features for Conversational Speech Recognition: A Comparative Study. Barry Chen, Qifeng Zhu, & Nelson Morgan (with many thanks to Hynek Hermansky and Andreas Stolcke). AMI/Pascal/IM2/M4 Workshop

  2. Log-Critical Band Energies: Conventional Feature Extraction
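The log-critical-band energies are the common starting point for both front ends: a short-time power spectrum is integrated over critical-band-like filters and compressed with a log. As an illustration only, here is a minimal NumPy sketch assuming 8 kHz telephone speech, 25 ms windows with a 10 ms hop, and mel-spaced triangular filters standing in for the critical bands; the exact analysis used in the talk may differ.

```python
import numpy as np

def log_critical_band_energies(signal, sr=8000, win_ms=25, hop_ms=10,
                               n_bands=15, n_fft=256):
    # Frame the signal: 25 ms Hamming windows every 10 ms.
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
    spectra = np.abs(np.fft.rfft(frames * np.hamming(win), n_fft)) ** 2

    # Triangular mel filterbank as a stand-in for Bark-scaled critical bands.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_bands, n_fft // 2 + 1))
    for b in range(1, n_bands + 1):
        lo, mid, hi = bins[b - 1], bins[b], bins[b + 1]
        fbank[b - 1, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[b - 1, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # One log energy per band per frame: the input to both front ends.
    return np.log(spectra @ fbank.T + 1e-10)

# Example: ~2 s of (random) "speech" -> (n_frames, 15) log band energies.
lcbe = log_critical_band_energies(np.random.randn(16000))
```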

  3. Log-Critical Band Energies: TRAPS/HATS Feature Extraction

  4. What is a TRAP? • TempoRal Patterns (TRAPs) were developed by our colleagues at OGI: Sharma, Jain, Sivadas, and Hermansky (the last two now at IDIAP) • A TRAP is a narrow-frequency speech energy pattern over a period of time (0.5 – 1 second) • TRAPs use neural networks trained to estimate phone posteriors (as in Qifeng Zhu's talk) • Hidden Activation TRAPS (HATS): a variant built from the band networks' hidden-layer activations rather than their outputs
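To make the TRAP idea concrete, the sketch below builds per-band TRAP input vectors: for each critical band, a roughly half-second trajectory of that band's log energy centered on the current frame, with per-trajectory mean/variance normalization. The 51-frame context and the normalization are assumptions typical of TRAP systems, not necessarily the exact configuration used here.

```python
import numpy as np

def trap_vectors(lcbe, context=25):
    """lcbe: (n_frames, n_bands) log critical-band energies.
    Returns (n_frames - 2*context, n_bands, 2*context + 1): one ~half-second
    trajectory per band, centered on each frame."""
    n_frames, n_bands = lcbe.shape
    traps = np.stack(
        [lcbe[t - context:t + context + 1].T
         for t in range(context, n_frames - context)]
    )
    # Per-trajectory mean/variance normalization, as is common for TRAPs.
    traps = traps - traps.mean(axis=2, keepdims=True)
    traps = traps / (traps.std(axis=2, keepdims=True) + 1e-8)
    return traps

# Example, reusing the log band energies from the previous sketch:
# traps = trap_vectors(lcbe)   # each traps[:, b, :] feeds band b's network
```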

  5. TRAPS/HATS Motivation • Psychoacoustics -> long time scales matter • Mutual information -> relevant information extends beyond 100 ms • Potential robustness to speech degradations

  6. Learn Everything in One Step?

  7. Learn in Individual Bands?

  8. One-Stage Approach

  9. 2-Stage Linear->Nonlinear Approaches

  10. 2-Stage MLP-Based Approaches
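A rough sketch of the two-stage MLP structure, with illustrative layer sizes: each critical band gets its own small MLP trained on that band's trajectory, and a second-stage merger network combines all bands. TRAPS feeds the merger with the band networks' output posteriors; HATS feeds it with their hidden-layer activations. The weights below are random placeholders, and the dimensions (60 hidden units, 46 phone classes) are assumptions, not values taken from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class BandMLP:
    """Stage 1: a small per-band network (placeholder weights; in the real
    system each band MLP is trained on phone targets)."""
    def __init__(self, n_in=51, n_hidden=60, n_phones=46, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_phones))

    def forward(self, trajectory):
        hidden = sigmoid(trajectory @ self.W1)   # HATS taps this layer
        posteriors = softmax(hidden @ self.W2)   # TRAPS taps this layer
        return hidden, posteriors

def stage1_features(traps, band_mlps, use_hidden=True):
    """traps: (frames, bands, 51). Concatenate either the hidden activations
    (HATS) or the posteriors (TRAPS) of every band MLP; the result is the
    input to the stage-2 merger MLP."""
    idx = 0 if use_hidden else 1
    feats = [band_mlps[b].forward(traps[:, b, :])[idx]
             for b in range(traps.shape[1])]
    return np.concatenate(feats, axis=1)

# Example: 15 band MLPs over the TRAP vectors from the previous sketch.
# merger_input = stage1_features(traps, [BandMLP(seed=b) for b in range(15)])
```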

  11. Two Questions TRAPS and HATS are two-stage approaches to learning long-term temporal features: • Are these constrained approaches better than an unconstrained approach? • Are the non-linear transformations of the critical-band trajectories necessary?

  12. Experimental Setup • Training: ~68 hours of conversational telephone speech from English CallHome, Switchboard I, and Switchboard Cellular • Testing: 2001 Hub-5 Evaluation Set (Eval2001), 2,255,609 frames and 62,890 words • Back-end recognizer: SRI's Decipher system, first-pass decoding with a bigram language model and within-word triphone acoustic models (thanks to Andreas Stolcke)

  13. Frame Classification

  14. Standalone Feature System • Transform the MLP outputs: a log transform to make the features more Gaussian, then PCA for decorrelation • Similar to the Tandem setup introduced by Hermansky, Ellis, and Sharma (discussed by Qifeng Zhu on Monday) • Use the transformed MLP outputs as front-end features for the SRI recognizer
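A minimal sketch of that Tandem-style transform, assuming the merger MLP's phone posteriors are already computed: apply a log, mean-center, and project onto the leading principal components. The number of retained dimensions is an assumption for illustration.

```python
import numpy as np

def tandem_features(posteriors, keep_dims=25):
    """posteriors: (frames, n_phones) merger-MLP outputs."""
    logp = np.log(posteriors + 1e-10)               # log -> more Gaussian
    logp = logp - logp.mean(axis=0, keepdims=True)  # center before PCA
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:keep_dims]    # keep leading components
    return logp @ eigvec[:, order]                  # decorrelated front-end features
```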

  15. ASR with Standalone Temporal Features

  16. Combination with PLP+ • SRI's 2003 PLP front end is 12th-order PLP with three sets of deltas; HLDA then reduces the 52 dimensions to 39 • Concatenate the PCA-truncated MLP features with the HLDA(PLP+3d) features
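A small sketch of the combination step: the Tandem features are simply appended, frame by frame, to the 39-dimensional HLDA(PLP+3d) vector. The HLDA projection itself lives inside the SRI front end and is represented here only by its output dimensionality.

```python
import numpy as np

def combine_features(hlda_plp, mlp_feats):
    """hlda_plp: (frames, 39) HLDA(PLP+3d); mlp_feats: (frames, k) Tandem
    features from the sketch above. The two streams must be frame-aligned."""
    assert hlda_plp.shape[0] == mlp_feats.shape[0]
    return np.hstack([hlda_plp, mlp_feats])   # (frames, 39 + k)
```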

  17. Combo w/ PLP Baseline Features

  18. Interpretation • The learning constraints introduced by the two-stage approach outperform unconstrained learning • The non-linear discriminant transform of HATS is better than linear discriminant transforms (LDA, or the HATS hidden layer taken before the sigmoid) • Like TRAPS, HATS is complementary to more conventional features and combines synergistically with PLP/MLP (Tandem) features

  19. The End
