Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo Grupo de Tecnología del Habla Departamento de Ingeniería Electrónica E.T.S.I. Telecomunicación Universidad Politécnica de Madrid macias@die.upm.es
Overview • Introduction • System architecture • Motivation • Databases & Dictionaries • Experimental results • Conclusions and future work
Abstract • In LVSRS, classifying utterances as correctly or incorrectly recognized is of major interest • Preliminary study on: • Word-level confidence estimation • Multiple features: acoustical and lexical decoders • Neural network based scheme
Introduction (I) • ASR systems rank the output hypotheses according to scores • Confidence in the proposed decoding is not a direct byproduct of the process • A lot of work in recent years: • Acoustic and linguistic features • Single or multiple sets of parameters • Direct estimation, LDA, NNs, etc.
Introduction (and II) • Traditionally • Acoustic features alone show poor results (likelihoods are not comparable across utterances) • The literature centers on methods to convert the HMM decoding probabilities into useful confidence measures: • likelihoods (normalized versions) • LM probabilities • n-best decoding lists
System Architecture • Hypothesis-verification strategy • [Diagram: Hypothesis (Rough Analysis): Intermediate Unit Generation, Lexical Access; Verification (Detailed Analysis): Verification Module] • We work in the Hypothesis Module
Detailed Architecture: Hypothesis • [Diagram: Speech → Preprocessing & VQ processes → Phonetic String Build-Up → Phonetic string → Lexical Access → List of Candidate Words. Resources shown: VQ books, HMMs, Durats., Dicts, Indexes, Align. costs]
Motivation • Studies on variable preselection list length estimation systems • # of words to pass to the verification stage • Direct correlation with confidence estimation: • If the proposed list length is small → high confidence • Initial application to the hypothesis module only
Databases & Dictionaries • Part of the VESTEL database • Training • 5820 utterances. 3011 speakers. • Testing • 2536 utterances (vocabulary dependent). 2255 speakers • 1434 utterances (vocabulary independent). 1351 speakers • Dictionaries: 10000 words (vocabulary dependent and independent)
Baseline experiment • Directly using the features (normalized to range 0..1) • Baseline features: • Acoustic log-likelihood (and normalized versions) • Lexical access cost for the 1st candidate • Standard deviation of lexical access costs • Not very good results • Best one with Std Deviation
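The baseline above can be illustrated with a minimal sketch: a single feature is min-max normalized to the range 0..1 and compared against a threshold. The function names, the "higher score means more confident" convention and the example threshold are assumptions made for illustration; the slides do not specify them.

```python
def min_max_normalize(values):
    """Rescale a list of feature values linearly to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values equal, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classification_rate(scores, correct, threshold):
    """Fraction of utterances whose accept/reject decision matches the label.

    Assumption: an utterance is accepted when its score is >= threshold,
    and `correct` holds True for correctly recognized utterances.
    """
    hits = sum((s >= threshold) == c for s, c in zip(scores, correct))
    return hits / len(scores)


# Toy usage with made-up numbers (not data from the experiments).
raw_feature = [2.0, 4.0, 6.0]
labels = [False, True, True]
scores = min_max_normalize(raw_feature)
rate = classification_rate(scores, labels, threshold=0.5)
```

Sweeping `threshold` over 0..1 and keeping the best rate gives one simple way to compare individual features.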
Baseline distributions • [Figure, two panels: Std Deviation LA; Acoustic likelihood (normalized)]
Neural Network estimator • Used successfully in preselection list length estimation • Able to combine parameters w/o effort • 3-layer MLP • Wide range of topology alternatives, coding schemes and features: • Direct parameters • Normalized • Lexical Access costs distribution
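As a rough illustration of a 3-layer MLP (input, one hidden layer, output) used as a confidence estimator, the sketch below shows only the forward pass. The 8 inputs match the final parameter count given in the slides; the hidden layer size, sigmoid activations, single output unit and random initialization are hypothetical choices, since the slides do not give the topology.

```python
import math
import random

N_FEATURES = 8  # the final system uses 8 parameters (from the slides)
N_HIDDEN = 5    # hypothetical: hidden layer size is not given in the slides


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in, n_hidden, seed=0):
    """Small random weights, zero biases (illustrative initialization)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def mlp_confidence(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: features (each in 0..1) -> confidence in (0, 1)."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Toy usage: an untrained network applied to a made-up feature vector.
params = init_weights(N_FEATURES, N_HIDDEN)
confidence = mlp_confidence([0.5] * N_FEATURES, *params)
```

Training (for example by backpropagation on correct/incorrect labels) is omitted here; the point is only that the network combines several normalized features into a single score.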
NN based experiments • Maximum correct classification rates: 70-75% for the three datasets (reasonable, taking into account the preselection rates achieved: 46.95%, 30.14% and 42.47%) • Best single feature: Standard deviation of the lexical access cost measured over the list of the first 10 candidates (0.1% of the dictionary size) • Final system uses 8 parameters (lexical and acoustical-based)
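The best single feature described above can be sketched as follows. The sketch assumes that a lower lexical access cost means a better candidate and uses the population standard deviation; neither detail is stated in the slides, and `lexical_cost_std` is a hypothetical name.

```python
import statistics


def lexical_cost_std(costs, top_n=10):
    """Standard deviation of the lexical access costs of the best top_n candidates.

    Assumptions: lower cost = better candidate; population standard deviation.
    With a 10000-word dictionary, top_n=10 is 0.1% of the dictionary size.
    """
    best = sorted(costs)[:top_n]
    return statistics.pstdev(best)


# Toy usage with made-up costs (not data from the experiments).
flat_list = [1.0] * 10
spread_list = [1.0, 3.0]
flat_std = lexical_cost_std(flat_list)
spread_std = lexical_cost_std(spread_list, top_n=2)
```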
Final distributions • [Figure, two panels: Not using NN; Using NN]
Final distributions • [Figure, two panels: Using NN; Not using NN]
Additional results • EER (equal error rate): • 30% for PERFDV • 25% for PEIV1000 and PRNOK5TR • Optimum threshold very close to the scale midpoint • Correct rejection rates for given false rejection rates: [table not preserved]
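For reference, an equal error rate of the kind quoted above can be estimated from confidence scores as sketched below: the threshold is swept over the observed scores and the point where false rejection and false acceptance rates are closest is reported. The accept rule (score >= threshold), the function names and the toy data are assumptions for illustration, not the evaluation procedure of the original experiments.

```python
def error_rates(scores, correct, threshold):
    """Return (false rejection rate, false acceptance rate) at a threshold.

    Assumption: an utterance is accepted when score >= threshold.
    """
    n_correct = sum(1 for c in correct if c)
    n_incorrect = len(correct) - n_correct
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("need both correct and incorrect utterances")
    false_rej = sum(1 for s, c in zip(scores, correct) if c and s < threshold)
    false_acc = sum(1 for s, c in zip(scores, correct) if not c and s >= threshold)
    return false_rej / n_correct, false_acc / n_incorrect


def equal_error_rate(scores, correct):
    """Return (approximate EER, threshold) by sweeping the observed scores."""
    best = None
    for t in sorted(set(scores)):
        frr, far = error_rates(scores, correct, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, t)
    return best[1], best[2]


# Toy usage with made-up scores (not data from the experiments).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [False, False, True, True]
eer, eer_threshold = equal_error_rate(toy_scores, toy_labels)
```

On real data the two error curves rarely cross exactly at an observed score, so the returned value is the average of the two rates at the closest point.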
Conclusions • Introduced a word-level confidence estimation system based on NNs and a combination of lexical and acoustical features • The NN improved on the results obtained using the features directly • The best parameter is lexical-based and consistent with the acoustical versions reported in the literature (standard deviation is similar to likelihood ratios and n-best related features)
Future work • Extend the comparison of the NN vs. non-NN system to the full feature set • Extend the work to the verification module (experiments already carried out show good results) • Extend the approach to CRS (phrase-level confidence)