
Adaptive Acoustic Model for Speech Variations Using Hybrid ANN/HMM

This study explores adapting hybrid ANN/HMM acoustic models to speech variations. It covers Linear Input Network (LIN) and Linear Hidden Network (LHN) adaptation, and addresses catastrophic forgetting through Conservative Training, with results on several adaptation scenarios. Adaptation targets include specific speakers, speaking styles, audio channels, acoustic environments, and vocabularies, using data from voice applications and ASR data logs. The study compares LIN and LHN adaptation of a Speaker/Task Independent MLP and reports consistent word error rate reductions. The conclusions highlight that LHN outperforms LIN across adaptation tasks, and that Conservative Training mitigates catastrophic forgetting while improving speaker adaptation performance.


Presentation Transcript


  1. Adapting Hybrid ANN/HMM to Speech Variations Stefano Scanzio, Pietro Laface Politecnico di Torino Dario Albesano, Roberto Gemello, and Franco Mana

  2. Acoustic Model Adaptation • Adaptation tasks • Linear Input Network • Linear Hidden Network • Catastrophic forgetting • Conservative Training • Results on several adaptation tasks

  3. Adapted Models • Starting from task-independent models, ANN adaptation produces acoustic models adapted to: • Specific speaker • Speaking style (spontaneous speech, regional accents) • Audio channel (telephone, cellular, microphone) • Environment (car, office, …) • Specific vocabulary • Adaptation data come from the Voice Application and the ASR Data Log

  4. Linear Input Network (LIN) adaptation • A trainable linear layer is inserted between the speech parameters and the input layer of the Speaker/Task Independent MLP, whose output layer estimates the emission probabilities of the acoustic-phonetic units

  5. Linear Hidden Network (LHN) adaptation • A trainable linear layer is inserted after a hidden layer of the MLP (e.g., between hidden layer 1 and hidden layer 2); the input layer, speech parameters, and output emission probabilities are unchanged
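The placement of the two adaptation layers can be sketched in a few lines of numpy. This is a hypothetical minimal illustration, not the authors' implementation: layer sizes and activations are made up, and the adaptation matrices are initialized to the identity so that, before adaptation, the network behaves exactly like the speaker-independent one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class HybridMLP:
    """Sketch of a 2-hidden-layer MLP with LIN/LHN adaptation layers.
    Dimensions are illustrative, not taken from the paper."""
    def __init__(self, d_in=39, d_h=100, d_out=10, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen speaker/task-independent weights.
        self.W1 = rng.standard_normal((d_in, d_h)) * 0.1
        self.W2 = rng.standard_normal((d_h, d_h)) * 0.1
        self.W3 = rng.standard_normal((d_h, d_out)) * 0.1
        # Adaptation layers start as identities: with no adaptation,
        # the network output is unchanged.
        self.lin = np.eye(d_in)  # LIN: linear transform of the input
        self.lhn = np.eye(d_h)   # LHN: linear transform after hidden layer 1

    def forward(self, x):
        x = x @ self.lin                  # LIN (trained during adaptation)
        h1 = sigmoid(x @ self.W1)
        h1 = h1 @ self.lhn                # LHN (trained during adaptation)
        h2 = sigmoid(h1 @ self.W2)
        return h2 @ self.W3               # pre-softmax emission scores
```

During adaptation only `lin` and/or `lhn` would receive gradient updates while `W1`–`W3` stay frozen, which is what makes the transformation cheap to estimate from small adaptation sets.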

  6. Catastrophic forgetting • Acquiring new information can damage previously learned information when the new data do not adequately represent the knowledge included in the original training data • This effect is evident when the adaptation data contain no examples for a subset of the output classes • The problem is more severe in the ANN framework than in the Gaussian Mixture HMM framework

  7. Catastrophic forgetting • The back-propagation algorithm penalizes classes with no adaptation examples by setting their target value to zero for every adaptation frame • Thus, during adaptation, the weights of the ANN will be biased • to favor the activations of the classes with samples in the adaptation set • to weaken the other classes

  8. Toy experiment: 16-class training • Network: 2 input nodes, 2 hidden layers of 20 nodes, 16 output nodes • Training set: 2500 patterns per class • Before adaptation: error rate 1.5% on class 6, 6.7% on class 7; total error rate 4.1% • The adaptation set includes 5000 patterns belonging only to classes 6 and 7

  9. Adaptation of 2 classes • After standard adaptation: error rate 0% on class 6, 2.0% on class 7 • But the total error rate rises to 16.9%: improving the two adapted classes degrades the classes with no adaptation examples

  10. Conservative Training target assignment policy • Px: class present in the adaptation set; Mx: missing class; P2 is the class corresponding to the current input frame • Standard policy (targets for M1 P1 P2 P3 M2): 0.00 0.00 1.00 0.00 0.00 • Conservative Training policy: 0.03 0.00 0.95 0.00 0.02 • The targets of the missing classes are set to their posterior probabilities computed by the original (unadapted) network
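The Conservative Training target assignment on this slide can be written as a small function. This is a sketch reconstructed from the slide's example (function and argument names are made up): missing classes keep the posterior computed by the original network, adaptation-set classes get zero, and the correct class absorbs the remaining probability mass.

```python
import numpy as np

def conservative_targets(orig_posteriors, correct_class, adaptation_classes):
    """Conservative Training target assignment (sketch).
    - Classes missing from the adaptation set keep the posterior
      computed by the ORIGINAL (unadapted) network.
    - Adaptation-set classes get target 0, except the correct class,
      which gets 1 minus the total mass given to the missing classes.
    Standard training would instead use a one-hot target."""
    n = len(orig_posteriors)
    targets = np.zeros(n)
    missing = [c for c in range(n) if c not in adaptation_classes]
    for c in missing:
        targets[c] = orig_posteriors[c]
    targets[correct_class] = 1.0 - targets[missing].sum()
    return targets

# Slide example: classes M1, P1, P2, P3, M2; P2 (index 2) is correct.
post = np.array([0.03, 0.00, 0.95, 0.00, 0.02])
targets = conservative_targets(post, correct_class=2,
                               adaptation_classes={1, 2, 3})
```

Because the missing classes keep non-zero targets, back-propagation no longer pushes their activations toward zero on every adaptation frame, which is the mechanism behind the forgetting described on slides 6–7.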

  11. Adaptation of 2 classes: standard vs. Conservative Training • Standard adaptation: error rate 0% on class 6, 2% on class 7; total error rate 16.9% • Conservative Training adaptation: error rate 2.2% on class 6, 5.2% on class 7; total error rate 10.2%

  12. Adaptation tasks • Application data adaptation: Directory Assistance • 9325 Italian city names • 53713 training + 3917 test utterances • Vocabulary adaptation: Command words • 30 command words • 6189 training + 3094 test utterances • Channel-Environment adaptation: Aurora-3 • 2951 training + 654 test utterances • Speaker adaptation: WSJ0 • 8 speakers, 16 kHz • 40 test + 40 train sentences per speaker

  13. Results on different tasks (%WER)

  14. Mitigation of Catastrophic Forgetting using Conservative Training • Tests using adapted models on Italian continuous speech (% WER)

  15. Networks used in Speaker Adaptation Task • STD (Standard) • 2-hidden-layer hybrid MLP-HMM model • 273 input features (39 parameters × 7 context frames) • IMP (Improved) • Uses a wider input window spanning a time context of 25 frames • Includes an additional hidden layer
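The 273-dimensional input of the STD network comes from stacking a 7-frame context window of 39-dimensional feature vectors (39 × 7 = 273). A minimal sketch of that stacking, assuming edge frames are padded by repetition (the actual padding scheme is not stated on the slide):

```python
import numpy as np

def stack_context(features, context=7):
    """Stack a symmetric window of `context` frames (3 left, center,
    3 right for context=7) into one input vector per frame, padding
    the sequence edges by repeating the first/last frame. With 39-dim
    features and a 7-frame window this yields 273-dim inputs."""
    half = context // 2
    T, _ = features.shape
    padded = np.concatenate([
        np.repeat(features[:1], half, axis=0),   # left edge padding
        features,
        np.repeat(features[-1:], half, axis=0),  # right edge padding
    ])
    return np.stack([padded[t:t + context].reshape(-1) for t in range(T)])
```

The IMP network's 25-frame window would be obtained the same way with `context=25`, giving a correspondingly wider input layer.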

  16. Results on WSJ0 Speaker Adaptation Task

  17. Conclusions • LHN adaptation outperforms LIN adaptation • Linear transformations at different levels produce different positive effects • LIN+LHN performs better than LHN alone • In adaptation tasks with missing classes, Conservative Training • reduces the catastrophic forgetting effect, preserving performance on a generic task • improves performance in speaker adaptation when few sentences are available
