Adapting Hybrid ANN/HMM to Speech Variations. Stefano Scanzio, Pietro Laface Politecnico di Torino Dario Albesano, Roberto Gemello, and Franco Mana. Acoustic Model Adaptation. Adaptation tasks Linear Input Network Linear Hidden Network Catastrophic forgetting Conservative Training
Adapting Hybrid ANN/HMM to Speech Variations Stefano Scanzio, Pietro Laface Politecnico di Torino Dario Albesano, Roberto Gemello, and Franco Mana
Acoustic Model Adaptation • Adaptation tasks • Linear Input Network • Linear Hidden Network • Catastrophic forgetting • Conservative Training • Results on several adaptation tasks
Acoustic model adaptation • Specific speaker • Speaking style (spontaneous, regional accents) • Audio channel (telephone, cellular, microphone) • Environment (car, office, …) • Specific vocabulary [Diagram labels: Voice Application, ASR, Data Log, ANN Adaptation, Task independent models, Adapted Models]
Linear Input Network (LIN) adaptation [Diagram labels, top to bottom: Acoustic phonetic units (Emission Probabilities), Speaker/Task Independent MLP, LIN, Input layer (Speech parameters)]
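The LIN idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the usual LIN setup, which the slide does not spell out: the linear layer starts as the identity, and the speaker/task-independent weights stay frozen while only the LIN is trained. The toy network `si_mlp`, its weights `W_SI` and `B_SI`, and all numeric values are hypothetical.

```python
import math

def affine(W, b, x):
    """Linear layer without non-linearity: W @ x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical frozen speaker/task-independent network: 2 inputs -> 3 classes.
W_SI = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
B_SI = [0.0, 0.1, -0.1]

def si_mlp(x):
    """Stand-in for the unchanged speaker/task-independent MLP."""
    return softmax(affine(W_SI, B_SI, x))

def lin_forward(x, W_lin, b_lin):
    """Transform the speech parameters with the LIN, then run the frozen MLP."""
    return si_mlp(affine(W_lin, b_lin, x))

x = [0.4, -0.7]                       # one frame of (toy) speech parameters
identity_lin = [[1.0, 0.0], [0.0, 1.0]]

baseline = si_mlp(x)
unadapted = lin_forward(x, identity_lin, [0.0, 0.0])            # identity LIN
adapted = lin_forward(x, [[1.1, 0.0], [0.2, 0.9]], [0.05, 0.0])  # "trained" LIN
```

With the identity LIN the outputs match the original network, so adaptation starts from the unadapted behaviour. Any other LIN weights change the posteriors without touching `W_SI` or `B_SI`.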
Linear Hidden Network (LHN) [Diagram labels, top to bottom: Acoustic phonetic units (Emission Probabilities), LHN, Hidden layer 2, Hidden layer 1, Input layer (Speech parameters)]
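A comparable sketch for the LHN, under the same caveats: it assumes the linear layer is inserted after the last hidden layer, starts as the identity, and is the only part trained during adaptation. The toy network below has a single hidden layer for brevity, and every weight in it is hypothetical.

```python
import math

def affine(W, b, x):
    """Linear layer without non-linearity: W @ x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def sigmoid_layer(W, b, x):
    """Affine map followed by a logistic non-linearity."""
    return [1.0 / (1.0 + math.exp(-a)) for a in affine(W, b, x)]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical frozen network: 2 inputs -> 3 hidden units -> 2 output classes.
W_H = [[0.8, -0.4], [0.1, 0.9], [-0.6, 0.3]]
B_H = [0.0, 0.1, -0.2]
W_O = [[1.0, -1.0, 0.5], [-0.7, 0.6, 0.2]]
B_O = [0.0, 0.0]

def forward(x, W_lhn=None, b_lhn=None):
    """Forward pass; if LHN weights are given, they remap the hidden activations."""
    h = sigmoid_layer(W_H, B_H, x)
    if W_lhn is not None:
        h = affine(W_lhn, b_lhn, h)   # LHN sits between hidden and output layers
    return softmax(affine(W_O, B_O, h))

x = [0.4, -0.7]
identity_lhn = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

baseline = forward(x)
unadapted = forward(x, identity_lhn, [0.0, 0.0, 0.0])
adapted = forward(x, [[1.2, 0.0, 0.0], [0.0, 0.8, 0.1], [0.0, 0.0, 1.0]],
                  [0.0, 0.05, 0.0])
```

The LHN operates on the internal representation computed by the hidden layers rather than on the raw speech parameters, so its size depends on the width of the hidden layer, not on the input dimension.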
Catastrophic forgetting • Acquiring new information can damage previously learned information if the new data do not adequately represent the knowledge included in the original training data • This effect is evident when the adaptation data do not contain examples for a subset of the output classes • The problem is more severe in the ANN framework than in the Gaussian Mixture HMM framework
Catastrophic forgetting • The back-propagation algorithm penalizes classes with no adaptation examples by setting their target value to zero for every adaptation frame • Thus, during adaptation, the weights of the ANN will be biased • to favor the activations of the classes with samples in the adaptation set • to weaken the other classes.
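The bias can be seen directly in the output-layer gradient. The following is a minimal sketch, assuming a softmax output layer trained with cross-entropy (the slides do not state the loss), for which the gradient with respect to logit k is p_k - t_k. The logits and the choice of class 3 as the missing class are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_gradient(logits, targets):
    """Gradient of cross-entropy w.r.t. each logit for a softmax layer: p_k - t_k."""
    p = softmax(logits)
    return [pk - tk for pk, tk in zip(p, targets)]

# Four classes; suppose class 3 never appears in the adaptation set.
logits = [0.2, 1.0, -0.5, 0.3]
one_hot = [0.0, 1.0, 0.0, 0.0]     # current frame belongs to class 1

grad = logit_gradient(logits, one_hot)
# grad[1] is negative: gradient descent raises the logit of the observed class.
# grad[3] equals p_3, which is strictly positive: the missing class is pushed
# down on this frame, and on every other adaptation frame as well.
```

Because a class with no adaptation examples always has a zero target, its gradient has the same sign on every frame, so nothing in the adaptation set ever pulls it back up.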
16-class training • 2 input nodes • 20 x 2 hidden nodes • 16 output nodes • 2500 patterns per class • Error rates shown for classes 6 and 7: 1.5% and 6.7% • Total Error Rate: 4.1% • The adaptation set includes 5000 patterns belonging only to classes 6 and 7 [Figure: decision regions of the trained network, with classes 6 and 7 highlighted]
Adaptation of 2 classes • Error rates shown for classes 6 and 7: 0% and 2.0% • Total Error Rate: 16.9% [Figure: decision regions after adaptation, with classes 6 and 7 highlighted]
Conservative Training target assignment • Px: class in the adaptation set • Mx: missing class • P2 is the class corresponding to the current input frame • Classes: M1 P1 P2 P3 M2 • Standard target assignment policy: 0.00 0.00 1.00 0.00 0.00 • Conservative Training target assignment policy: 0.03 0.00 0.95 0.00 0.02 (the targets of the missing classes M1 and M2 are the posterior probabilities computed using the original network)
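One reading of this policy, sketched in Python: each missing class keeps the posterior that the original network assigns to it, the class of the current frame receives the remaining probability mass, and the other classes present in the adaptation set get zero. The function name and the original-network posteriors in the example are hypothetical, chosen so that the resulting targets match the 0.03 / 0.00 / 0.95 / 0.00 / 0.02 row on the slide.

```python
def conservative_targets(original_posteriors, correct_class, missing_classes):
    """Build the target vector for one adaptation frame.

    original_posteriors: outputs of the unadapted network for this frame
    correct_class:       index of the class the frame is labelled with
    missing_classes:     indices of classes absent from the adaptation set
    """
    targets = [0.0] * len(original_posteriors)
    missing_mass = 0.0
    for k in missing_classes:
        # A missing class keeps what the original network already believed.
        targets[k] = original_posteriors[k]
        missing_mass += original_posteriors[k]
    # The labelled class takes whatever probability mass is left.
    targets[correct_class] = 1.0 - missing_mass
    return targets

# Class order: M1, P1, P2, P3, M2 (indices 0..4); the frame belongs to P2.
original = [0.03, 0.10, 0.80, 0.05, 0.02]   # hypothetical posteriors
targets = conservative_targets(original, correct_class=2, missing_classes=[0, 4])
print([round(t, 2) for t in targets])
```

The targets still sum to one, and the missing classes no longer receive a zero target on every frame, which is what the standard policy would give them.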
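Adaptation of 2 classes (6 and 7) • Standard adaptation: error rates shown for classes 6 and 7: 0% and 2%; Total Error Rate: 16.9% • Conservative Training adaptation: error rates shown for classes 6 and 7: 2.2% and 5.2%; Total Error Rate: 10.2% [Figure: decision regions after standard and Conservative Training adaptation]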
Adaptation tasks • Application data adaptation: Directory Assistance • 9325 Italian city names • 53713 training + 3917 test utterances • Vocabulary adaptation: Command words • 30 command words • 6189 training + 3094 test utterances • Channel-Environment adaptation: Aurora-3 • 2951 training + 654 test utterances • Speaker adaptation: WSJ0 • 8 speakers, 16 kHz • 40 test + 40 train sentences
Mitigation of Catastrophic Forgetting using Conservative Training • Tests using adapted models on Italian continuous speech (% WER)
Networks used in Speaker Adaptation Task • STD (Standard) • 2 hidden layer hybrid MLP-HMM model • 273 input features (39 parameters and 7 context frames) • IMP (Improved) • Uses a wider input window spanning a time context of 25 frames • Includes an additional hidden layer
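The 273 input features of the STD network follow from stacking 7 frames of 39 parameters each (39 x 7 = 273). A minimal sketch of such stacking, assuming a symmetric window around the current frame and repetition of the first and last frames at the utterance edges; neither detail is stated on the slide, and the function name and toy frames are hypothetical.

```python
def stack_context(frames, context=7):
    """Concatenate each frame with its neighbours into one input vector.

    frames:  list of per-frame parameter vectors (all of the same length)
    context: odd number of frames in the window, centred on the current frame
    """
    half = context // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)   # clamp: repeat edge frames
            window.extend(frames[idx])
        stacked.append(window)
    return stacked

# Ten toy frames of 39 parameters each; frame t is filled with the value t.
frames = [[float(t)] * 39 for t in range(10)]
stacked = stack_context(frames, context=7)
print(len(stacked), len(stacked[0]))
```

The same function with `context=25` would give the wider time context mentioned for the IMP network, though the slide does not say how many parameters per frame that network uses.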
Conclusions • LHN adaptation outperforms LIN adaptation • Linear transformations at different levels produce different positive effects • LIN+LHN performs better than LHN • In adaptation tasks with missing classes, Conservative Training • reduces the catastrophic forgetting effect, preserving the performance on another generic task • improves the performance in speaker adaptation with few available sentences