Torino Meeting – 9-10 March 2006 Advances in WP2 www.loquendo.com
Activities on WP2 since last meeting • Study of innovative NN adaptation methods • Models: Linear Hidden Networks • Test on project adaptation corpora: • WSJ0 Adaptation component • WSJ1 Spoke-3 component • Hiwire Non-Native Corpus
LIN Adaptation for HMM/NN • LIN means “linear input network” • LIN is a classical technique for speaker and channel adaptation in HMM/NN [Neto 1996] • The LIN is placed before an MLP already trained in a speaker-independent way (SI-MLP) • The input space is rotated by a linear transform, so that the target conditions move closer to the training conditions • The linear transform is implemented with a linear neural network inserted between the input layer and the 1st hidden layer
LIN Adaptation • [Figure: network diagram, bottom to top: speech signal parameters, input layer, LIN, 1st hidden layer, 2nd hidden layer, output layer (emission probabilities of the acoustic-phonetic units). The LIN is inserted into the speaker-independent MLP (SI-MLP).]
LIN Training • The global SI-MLP+LIN system is trained with speech material from the target speaker • The LIN is initialized with an identity matrix, so the adapted network initially behaves exactly like the SI-MLP • LIN weights are trained with error back-propagation through the global net • The original NN weights are kept frozen
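The LIN training recipe above can be sketched in a few lines of NumPy. This is only an illustrative toy under stated assumptions, not the Loquendo implementation: the SI-MLP is a random stand-in with a single hidden layer, the layer sizes, learning rate, step count and the synthetic "speaker" distortion `M` are all hypothetical, and plain full-batch gradient descent replaces whatever optimizer was actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical toy sizes; a real SI-MLP is far larger.
N_IN, N_HID, N_OUT = 6, 8, 4

# Random stand-in for a trained SI-MLP. These weights stay frozen.
W1 = rng.normal(scale=0.5, size=(N_IN, N_HID))
W2 = rng.normal(scale=0.5, size=(N_HID, N_OUT))

def si_mlp(x):
    """Return hidden activations and class posteriors of the frozen SI-MLP."""
    h = sigmoid(x @ W1)
    return h, softmax(h @ W2)

def cross_entropy(p, y):
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))

def train_lin(x, y, steps=1000, lr=0.01):
    """Train only the LIN matrix A; the error is back-propagated through the whole net."""
    A = np.eye(N_IN)                      # identity initialization
    onehot = np.eye(N_OUT)[y]
    for _ in range(steps):
        u = x @ A                         # LIN output feeds the SI-MLP input
        h, p = si_mlp(u)
        d_logits = (p - onehot) / len(y)  # softmax + cross-entropy gradient
        d_h = d_logits @ W2.T             # back through the output weights
        d_pre = d_h * h * (1.0 - h)       # back through the sigmoid
        d_u = d_pre @ W1.T                # back through the first weight layer
        A -= lr * (x.T @ d_u)             # update the LIN only
    return A

# Synthetic adaptation data: labels come from the SI-MLP on clean features,
# then a hypothetical linear distortion M plays the role of the target speaker.
x_si = rng.normal(size=(400, N_IN))
y = si_mlp(x_si)[1].argmax(axis=1)
M = np.eye(N_IN) + 0.3 * rng.normal(size=(N_IN, N_IN))
x_spk = x_si @ M

W1_ref, W2_ref = W1.copy(), W2.copy()
loss_before = cross_entropy(si_mlp(x_spk)[1], y)
A = train_lin(x_spk, y)
loss_after = cross_entropy(si_mlp(x_spk @ A)[1], y)
print(f"cross-entropy before LIN: {loss_before:.3f}, after LIN: {loss_after:.3f}")
```

Because the LIN starts as the identity, the first forward pass reproduces the SI-MLP exactly, and only `A` changes during adaptation.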
LHN Adaptation • LHN means “linear hidden network” • The activations of the last hidden layer are linearly transformed to improve acoustic matching of the adaptation material • The activation values of a hidden layer represent an internal structure of the input pattern in a space more suitable for classification and adaptation • The linear transform is implemented with a linear neural network layer inserted between the last hidden layer and the output layer
LHN Adaptation • [Figure: network diagram, bottom to top: speech signal parameters, input layer, 1st hidden layer, 2nd hidden layer, LHN, output layer (emission probabilities of the acoustic-phonetic units). The LHN is inserted into the speaker-independent MLP (SI-MLP).]
LHN Training • The global SI-MLP+LHN system is trained with speech material from the target speaker • The LHN is initialized with an identity matrix • LHN weights are trained with error back-propagation through the last layer of weights • The original NN weights are kept frozen
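A comparable toy sketch for LHN training follows. Again, this is an assumption-laden illustration rather than the authors' code: the SI-MLP is a random single-hidden-layer stand-in, and the sizes, learning rate, step count and synthetic distortion `M` are hypothetical. It shows the practical difference from LIN: since the LHN sits above the frozen hidden layers, their activations can be computed once, and the error only has to be propagated back through the output weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical toy sizes; a real SI-MLP is far larger.
N_IN, N_HID, N_OUT = 6, 8, 4

# Random stand-in for a trained SI-MLP. These weights stay frozen.
W1 = rng.normal(scale=0.5, size=(N_IN, N_HID))
W2 = rng.normal(scale=0.5, size=(N_HID, N_OUT))

def hidden(x):
    """Activations of the (last) hidden layer of the frozen SI-MLP."""
    return sigmoid(x @ W1)

def output(h):
    """Output layer of the frozen SI-MLP applied to hidden-space vectors."""
    return softmax(h @ W2)

def cross_entropy(p, y):
    return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))

def train_lhn(h, y, steps=1000, lr=0.01):
    """Train only the LHN matrix B, placed between hidden layer and output layer."""
    B = np.eye(N_HID)                     # identity initialization
    onehot = np.eye(N_OUT)[y]
    for _ in range(steps):
        p = output(h @ B)                 # LHN output feeds the output layer
        d_logits = (p - onehot) / len(y)  # softmax + cross-entropy gradient
        d_v = d_logits @ W2.T             # back through the last weight layer only
        B -= lr * (h.T @ d_v)             # update the LHN only
    return B

# Synthetic adaptation data: labels come from the SI-MLP on clean features,
# then a hypothetical linear distortion M plays the role of the target speaker.
x_si = rng.normal(size=(400, N_IN))
y = output(hidden(x_si)).argmax(axis=1)
M = np.eye(N_IN) + 0.3 * rng.normal(size=(N_IN, N_IN))
x_spk = x_si @ M

h_spk = hidden(x_spk)                     # computed once: lower layers are frozen
W2_ref = W2.copy()
loss_before = cross_entropy(output(h_spk), y)
B = train_lhn(h_spk, y)
loss_after = cross_entropy(output(h_spk @ B), y)
print(f"cross-entropy before LHN: {loss_before:.3f}, after LHN: {loss_after:.3f}")
```

In this setting the adaptation objective is convex in `B`, because the logits are linear in the LHN weights; that is a property of the sketch, not a claim made on the slides.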
Paper at ICASSP 2006 • “Adaptation of Hybrid ANN/HMM Models Using Linear Hidden Transformations and Conservative Training” • Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface and Renato De Mori
WSJ0 LIN-LHN Adaptation • Train: standard WSJ0 SI-84 train set, 16 kHz • SI Test: 8 speakers and ~40 sentences for each speaker • Vocabulary: 5K words, with a standard bigram LM • Adaptation: the same 8 speakers as in the SI test, with 40 adaptation sentences for each of them
WSJ1 – SPOKE-3 LIN-LHN Adaptation • Spoke-3 is the standard WSJ1 case study for evaluating adaptation to non-native speakers • There are 10 non-native speakers (40 adaptation sentences and ~40 test sentences for each speaker) • Train: standard WSJ0 SI-84 train set, 16 kHz • Vocabulary: 5K words, with a standard bigram LM • Sample sentence shown on the slide: “THE FEMALE PRODUCES A LITTER OF TWO TO FOUR YOUNG IN NOVEMBER AND DECEMBER”
Comments on WSJ0 – WSJ1 Results • LIN does work for speaker adaptation: • Error reduction (E.R.) of 10.5% on WSJ0 and 14.2% on WSJ1 • However, in some cases performance with LIN does not improve, or even degrades • LHN is a more powerful method: • E.R. of 20.0% on WSJ0 and 43.5% on WSJ1 • With LHN, performance always improves
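The E.R. figures above are presumably relative error reductions with respect to the unadapted speaker-independent system; the slides do not spell out the formula, so this reading is an assumption. Under that assumption, the computation is as follows (the error rates in the example are hypothetical and are not results from the slides):

```python
def error_reduction(baseline_error, adapted_error):
    """Relative error reduction in percent (assumed meaning of E.R.)."""
    if baseline_error <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_error - adapted_error) / baseline_error

# Hypothetical word error rates in percent, for illustration only.
er = error_reduction(baseline_error=12.0, adapted_error=9.0)
print(f"E.R. = {er:.1f}%")
```

A negative value would indicate that adaptation made recognition worse.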
Hiwire Non-Native Corpus (1) • The database consists of English sentences uttered by non-native speakers • The speakers are of French, Italian, Greek and Spanish origin (plus an additional set of extra-European speakers) • The uttered sentences belong to a command language used by aircraft pilots • The vocabulary contains 134 words • Each speaker recorded one list of 100 sentences
Hiwire Non-Native Corpus (2) • Corpus composition: • French speakers: 31 • Italian speakers: 20 • Greek speakers: 20 • Spanish speakers: 10 • World speakers: 10
Experimental conditions • Starting models: • standard Loquendo ASR EN-US • Telephone models (8 kHz) • Training set: LDC Macrophone • Adaptation: first 50 utterances of each speaker • Test: last 50 utterances of each speaker • LM: Hiwire grammar (134 words voc.) • Signal proc.: down-sampling to 8 kHz
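The per-speaker adaptation/test split described above (first 50 utterances for adaptation, last 50 for test) can be sketched as below. The function name and the utterance identifiers are hypothetical; the actual corpus file naming is not given on the slides.

```python
def split_adaptation_test(utterances, n_adapt=50):
    """Use the first n_adapt utterances for adaptation and the rest for test."""
    if len(utterances) < n_adapt:
        raise ValueError("speaker has fewer utterances than n_adapt")
    return list(utterances[:n_adapt]), list(utterances[n_adapt:])

# Hypothetical utterance IDs for one speaker with a list of 100 sentences.
speaker_utts = [f"utt_{i:03d}" for i in range(1, 101)]
adapt_set, test_set = split_adaptation_test(speaker_utts)
print(len(adapt_set), len(test_set), adapt_set[-1], test_set[0])
```

Keeping the two halves disjoint for each speaker ensures that the adapted models are never scored on sentences they were adapted on.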
Results on Hiwire corpus • Recognition model: ANN/HMM • Adaptation Model: LIN - LHN
Discussion • Adaptation of the acoustic models also improves recognition for non-native speakers • State-of-the-art LIN is a feasible and practical way to adapt hybrid NN-HMM models • LHN (transformation of hidden-layer activations) is a new NN adaptation method introduced in the project • LHN outperforms LIN
Workplan • Selection of suitable benchmark databases (m6) • Baseline set-up for the selected databases (m8) • LIN adaptation method implemented and experimented on the benchmarks (m12) • Experimental results on Hiwire database with LIN (m18) • Innovative NN adaptation methods and algorithms for acoustic modeling and experimental results (m21) • Further advances on new adaptation methods (m24) • Unsupervised Adaptation: algorithms and experimentation (m33)