Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo

Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System
Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo
Grupo de Tecnología del Habla, Departamento de Ingeniería Electrónica, E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid
macias@die.upm.es

Overview • Introduction • System architecture • Motivation • Databases & Dictionaries • Experimental results • Conclusions and future work

Abstract • In large vocabulary speech recognition systems (LVSRS), classifying utterances as correctly or incorrectly recognized is of major interest • Preliminary study on: • Word-level confidence estimation • Multiple features: acoustical and lexical decoders • Neural network based scheme

Introduction (I) • ASR systems rank the output hypotheses according to scores • Confidence in the proposed decoding is not a direct byproduct of the process • A lot of work in recent years: • Acoustic and linguistic features • Single or multiple sets of parameters • Direct estimation, LDA, NNs, etc.

Introduction (II) • Traditionally • Acoustic features alone show poor results (likelihoods are not comparable across utterances) • The literature centers on describing methods to convert the HMM decoding probabilities into useful confidence measures: • Likelihoods (normalized versions) • LM probabilities • N-best decoding lists

System Architecture • Hypothesis-verification strategy • [Diagram labels: Hypothesis, Verification, Intermediate Unit Generation, Lexical Access, Verification Module, Rough Analysis, Detailed Analysis] • We work on the hypothesis module

Detailed Architecture (hypothesis module) • [Diagram blocks: Speech, Preprocessing & VQ processes, Phonetic String Build-Up, Phonetic string, Lexical Access, List of Candidate Words] • [Diagram resources: VQ books, HMMs, Durations, Dictionaries, Indexes, Alignment costs]

Motivation • Studies on systems that estimate a variable preselection list length • Number of words to pass to the verification stage • Direct correlation with confidence estimation: • If the proposed list length is small → high confidence • Initial application to the hypothesis module only

Databases & Dictionaries • Part of the VESTEL database • Training • 5820 utterances. 3011 speakers. • Testing • 2536 utterances (vocabulary dependent). 2255 speakers • 1434 utterances (vocabulary independent). 1351 speakers • Dictionaries: 10000 words (vocabulary dependent and independent)

Baseline experiment • Directly using the features (normalized to the range 0..1) • Baseline features: • Acoustic log-likelihood (and normalized versions) • Lexical access cost for the 1st candidate • Standard deviation of lexical access costs • Results are not very good • The best result is obtained with the standard deviation
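The slide says the features are normalized to the range 0..1 but does not say how. A minimal sketch, assuming plain min-max scaling over a set of utterances (the function name `min_max_normalize` is hypothetical, and whether a larger value means more or less confidence depends on the feature):

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale raw feature values linearly into the range 0..1.

    Assumption: min-max scaling; the slides only state the target range.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# One raw feature value per utterance (made-up numbers for illustration).
raw = [2.0, 4.0, 6.0]
normalized = min_max_normalize(raw)
```

Each normalized value can then be compared against a threshold to accept or reject the recognized word, which is what "directly using the features" would amount to under this assumption.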

Baseline distributions • [Figure panels: standard deviation of lexical access costs; acoustic likelihood (normalized)]

Baseline distributions

Neural Network estimator • Used successfully in preselection list length estimation • Able to combine parameters without effort • 3-layer MLP • Tested a wide range of topology alternatives, coding schemes and features: • Direct parameters • Normalized • Lexical access costs distribution
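To make the 3-layer MLP concrete, here is a minimal forward-pass sketch. The sigmoid activations, the single output unit, the layer sizes and the function name `mlp_confidence` are all assumptions; the slides do not specify the topology or the coding scheme that was finally used.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mlp_confidence(
    features: Sequence[float],
    w_hidden: Sequence[Sequence[float]],  # one weight row per hidden unit
    b_hidden: Sequence[float],
    w_out: Sequence[float],               # one weight per hidden unit
    b_out: float,
) -> float:
    """Input layer -> hidden layer -> one output unit in the range 0..1."""
    hidden: List[float] = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Two input features, two hidden units, untrained (all-zero) weights.
score = mlp_confidence([0.2, 0.7], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0)
```

With trained weights, the output would be read as a confidence score and thresholded to accept or reject the recognized word.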

NN based experiments • Maximum correct classification rates: 70-75% for the three datasets (reasonable, taking into account the preselection rates achieved: 46.95%, 30.14% and 42.47%) • Best single feature: Standard deviation of the lexical access cost measured over the list of the first 10 candidates (0.1% of the dictionary size) • Final system uses 8 parameters (lexical and acoustical-based)
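The best single feature, the standard deviation of the lexical access cost over the first 10 candidates, could be computed along these lines. This is a sketch under two assumptions not stated in the slides: that a lower cost means a better candidate, and that the population (not sample) standard deviation is used. The name `cost_std_top_n` is hypothetical.

```python
import statistics
from typing import Sequence


def cost_std_top_n(costs: Sequence[float], n: int = 10) -> float:
    """Standard deviation of the n best lexical access costs.

    Assumptions: lower cost = better candidate; population standard deviation.
    """
    top = sorted(costs)[:n]
    if len(top) < 2:
        # A single candidate has no spread.
        return 0.0
    return statistics.pstdev(top)


# Made-up costs for a short candidate list.
spread = cost_std_top_n([5.0, 1.0, 3.0, 100.0], n=2)
```

With a 10000-word dictionary, n = 10 corresponds to the 0.1% of the dictionary size quoted above.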

Final distributions • [Figure panels: not using NN; using NN]

Final distributions • [Figure panels: using NN; not using NN]

Additional results • Equal error rate (EER): • 30% for PERFDV • 25% for PEIV1000 and PRNOK5TR • Optimum threshold very close to the scale midpoint • Correct rejection rates for given false rejection rates: [table not recovered]
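For reference, the EER is the operating point where the false rejection rate (correct words rejected) equals the false acceptance rate (incorrect words accepted). A minimal sketch of how it can be estimated from confidence scores, assuming higher scores mean higher confidence; the function names and the toy data are illustrative, not taken from the experiments:

```python
from typing import Sequence, Tuple


def error_rates(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false rejection rate, false acceptance rate) at a threshold.

    labels[i] is True when utterance i was correctly recognized.
    A word is accepted when its score is >= threshold.
    """
    correct = [s for s, ok in zip(scores, labels) if ok]
    wrong = [s for s, ok in zip(scores, labels) if not ok]
    fr = sum(s < threshold for s in correct) / len(correct)
    fa = sum(s >= threshold for s in wrong) / len(wrong)
    return fr, fa


def equal_error_rate(
    scores: Sequence[float], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Scan candidate thresholds; return (EER estimate, threshold used)."""
    best = None
    for t in sorted(set(scores)):
        fr, fa = error_rates(scores, labels, t)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, (fr + fa) / 2.0, t)
    return best[1], best[2]


# Toy data: three correctly and three incorrectly recognized words.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
toy_labels = [True, True, True, False, False, False]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
```

The correct rejection rate at a given threshold is one minus the false acceptance rate returned by `error_rates`.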

Conclusions • Introduced a word-level confidence estimation system based on NNs and a combination of lexical and acoustical features • The NN improved on the results obtained using the features directly • The best parameter is lexical-based and consistent with the acoustic-based versions reported in the literature (standard deviation is similar to likelihood ratios and n-best related features)

Future work • Extend the comparison of the NN vs. non-NN system to the full feature set • Extend the work to the verification module (experiments already carried out show good results) • Extend the approach to CRS (phrase-level confidence)

ROC Curves (NN vs. non NN)

Any questions?
