
SYSTEM ARCHITECTURE

INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A MODULAR LARGE VOCABULARY SPEECH RECOGNITION SYSTEM IN TELEPHONE ENVIRONMENT
A. Gallardo-Antolín, J. Ferreiros, J. Macías-Guarasa, R. de Córdoba and J.M. Pardo
Grupo de Tecnología del Habla. Universidad Politécnica de Madrid. Spain


Presentation Transcript


INCORPORATING MULTIPLE-HMM ACOUSTIC MODELING IN A MODULAR LARGE VOCABULARY SPEECH RECOGNITION SYSTEM IN TELEPHONE ENVIRONMENT
A. Gallardo-Antolín, J. Ferreiros, J. Macías-Guarasa, R. de Córdoba and J.M. Pardo
Grupo de Tecnología del Habla. Universidad Politécnica de Madrid. Spain
e-mail: gallardo@tsc.uc3m.es, {jfl, macias, cordoba, pardo}@die.upm.es

SUMMARY
• Previous work on this topic at EUROSPEECH'99:
  • Flexible large vocabulary (up to 10000 words)
  • Speaker independent, isolated word
  • Telephone speech
  • Two-stage, bottom-up strategy
  • Neural networks used as a novel approach to estimate the preselection list length
• In this paper:
  • Integrating multiple acoustic models (gender-specific models) in this system

SYSTEM ARCHITECTURE
[Figure: system architecture]

EXPERIMENTAL SETUP
• VESTEL database, a realistic telephone speech corpus:
  • Training set: 5810 utterances
    • 46.74 % male speakers
    • 53.26 % female speakers
  • Test set: 1434 utterances (vocabulary-independent task)
  • Vocabularies of 2000, 5000 and 10000 words

TRAINING MULTIPLE-HMM ACOUSTIC MODELS
We have developed two different methods for training gender-dependent models:
• Independent training
  • Each set of models is trained using only the part of the training database assigned to it.
• Joint training
  • Each set of models is trained using all the utterances contained in the training database.
  • A weighting function controls the influence of each utterance on the modeling of each set.
  • PA and PB are the likelihoods of the utterance given the sets of models A and B, respectively.
  • wA and wB are the weights applied to the reestimation formulae in the training stage.
  • a is an adjustment factor that allows more training data to be assigned to a particular set.

INCORPORATING MULTIPLE-HMMs IN THE RECOGNITION STAGE
• Alternatives for PSBU
  • Combined-sets: phonetic strings are composed by concatenating models coming from any set.
  • Single-set: phonetic strings are forced to be generated by only one set of models (the one that produces the best score).
• Alternatives for LA
  • Shared-costs
    • All the allophones behave the same way in the LA stage, even if they have been generated from different sets of models.
    • Both sets share the same confusion matrix.
  • Set-dependent costs
    • Costs are gender-dependent.
    • One square confusion matrix per set (for the "single-set" PSBU strategy).
    • A single rectangular confusion matrix (for the "combined-sets" PSBU strategy).

EXPERIMENTAL RESULTS
• Independent-training vs. Joint-training
  • Discrete HMMs. Dictionary of 2000 words.
  • Combination 2 in the recognition stage.
• Alternatives for PSBU and LA modules
  • Discrete HMMs. Dictionary of 2000 words.
  • Independent-training for multiple DHMMs.
• Single-SCHMM vs. Multiple-SCHMM
  • Independent-training for multiple SCHMMs.
  • Single-set in PSBU module + Shared-costs in LA module.
  • Dictionaries of 2000, 5000 and 10000 words.

CONCLUSIONS
• Using multiple acoustic models per phonetic unit increases the robustness of acoustic modeling in difficult tasks.
• Best results are obtained using:
  • Training stage: Independent-training.
  • PSBU module: Single-set.
  • LA module: Shared-costs.
• A relative error reduction of around 25 % is achieved with multiple-SCHMM modeling compared to the single-SCHMM system for a dictionary of 10000 words.
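
The joint-training weighting described under TRAINING MULTIPLE-HMM ACOUSTIC MODELS can be illustrated with a small sketch. The transcript does not preserve the actual weighting function, so the normalized form used here, wA = a·PA / (a·PA + PB) and wB = 1 - wA, is only an assumption chosen to match the stated roles of PA, PB, wA, wB and a. The function name and the use of log-likelihoods as inputs are likewise hypothetical.

```python
import math


def joint_training_weights(log_p_a, log_p_b, a=1.0):
    """Return (w_a, w_b) for one training utterance.

    ASSUMED form (not taken from the poster):
        w_a = a * P_A / (a * P_A + P_B),  w_b = 1 - w_a
    where P_A and P_B are the utterance likelihoods under model sets A and B,
    and a > 0 is the adjustment factor (a > 1 shifts more data towards set A).
    Inputs are log-likelihoods, because HMM likelihoods underflow easily.
    """
    if a <= 0:
        raise ValueError("adjustment factor a must be positive")

    # Scores in the log domain: log(a * P_A) and log(P_B).
    score_a = math.log(a) + log_p_a
    score_b = log_p_b

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(score_a, score_b)
    exp_a = math.exp(score_a - top)
    exp_b = math.exp(score_b - top)

    w_a = exp_a / (exp_a + exp_b)
    return w_a, 1.0 - w_a


if __name__ == "__main__":
    # Hypothetical log-likelihoods for one utterance under sets A and B.
    w_a, w_b = joint_training_weights(-1000.0, -1010.0)
    print(f"w_A = {w_a:.5f}, w_B = {w_b:.5f}")
```

Under this assumed form, each utterance would contribute to the reestimation of both sets in proportion to its weights, and setting a above or below 1 would bias the split towards one set, in line with the role the poster gives to the adjustment factor.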
