
Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo

Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System. Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo. Grupo de Tecnología del Habla Departamento de Ingeniería Electrónica


Presentation Transcript


  1. Acoustical and Lexical Based Confidence Measures for a Very Large Vocabulary Telephone Speech Hypothesis-Verification System Javier Macías-Guarasa, Javier Ferreiros, Rubén San-Segundo, Juan M. Montero and José M. Pardo Grupo de Tecnología del Habla Departamento de Ingeniería Electrónica E.T.S.I. Telecomunicación Universidad Politécnica de Madrid macias@die.upm.es

  2. Overview • Introduction • System architecture • Motivation • Databases & Dictionaries • Experimental results • Conclusions and future work

  3. Abstract • In LVSRS, classifying utterances as correctly or incorrectly recognized is of major interest • Preliminary study on: • Word-level confidence estimation • Multiple features: acoustical and lexical decoders • Neural-network-based scheme

  4. Introduction (I) • ASR systems rank the output hypotheses according to scores • Confidence in the proposed decoding is not a direct byproduct of the process • A lot of work in recent years: • Acoustic and linguistic features • Single or multiple sets of parameters • Direct estimation, LDA, NNs, etc.

  5. Introduction (and II) • Traditionally • Acoustic features alone show poor results (likelihoods are not comparable across utterances) • Literature centered on describing methods to convert the HMM-decoded probabilities into useful confidence measures: • likelihoods (normalized versions) • LM probabilities • n-best decoding lists

  6. System Architecture • Hypothesis-verification strategy • [Block diagram labels: Hypothesis (Intermediate Unit Generation, Lexical Access; Rough Analysis); Verification (Verification Module; Detailed Analysis)] • We work in the Hypothesis Module

  7. Detailed Architecture • Hypothesis • [Block diagram labels: Speech; Preprocessing & VQ processes; Phonetic String Build-Up; Phonetic string; Lexical Access; List of Candidate Words. Resources: VQ books, HMMs, Durats., Dicts, Indexes, Align. costs]

  8. Motivation • Studies on variable preselection list length estimation systems • # of words to pass to the verification stage • Direct correlation with confidence estimation: • If proposed list length is small → high confidence • Initial application to hypothesis module only

  9. Databases & Dictionaries • Part of the VESTEL database • Training • 5820 utterances. 3011 Speakers. • Testing • 2536 utters. (vocabulary dependent). 2255 spks • 1434 utters. (vocabulary independent). 1351 spks • Dictionaries: 10000 (VD&I) words

  10. Baseline experiment • Directly using the features (normalized to range 0..1) • Baseline features: • Acoustic log-likelihood (and normalized versions) • Lexical access cost for the 1st candidate • Standard deviation of lexical access costs • Not very good results • Best one with Std Deviation
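The baseline idea above (use a single feature, scaled to 0..1, directly as a confidence score) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the min-max scaling, the 0.5 threshold, the "higher means more confident" direction and the sample values are all assumptions made for the example.

```python
def min_max_normalize(values):
    """Scale raw feature values linearly to the range 0..1 (assumed scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values identical, nothing to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def classify(normalized, threshold=0.5):
    """Accept a word (True) when its normalized feature reaches the threshold.

    The threshold and the direction of the comparison are hypothetical.
    """
    return [v >= threshold for v in normalized]


# Made-up raw feature values, one per recognized word.
raw = [2.0, 4.0, 6.0, 10.0]
norm = min_max_normalize(raw)
decisions = classify(norm)
```

For a feature where low values indicate confidence, the comparison would simply be reversed.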

  11. Baseline distributions • [Figure panels: Std Deviation LA; Acoustic likelihood (normalized)]

  12. Baseline distributions

  13. Neural Network estimator • Used successfully in preselection list length estimation • Able to combine parameters w/o effort • 3-layer MLP • Wide range of topology alternatives, coding schemes and features: • Direct parameters • Normalized • Lexical Access costs distribution
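As a rough picture of how a 3-layer MLP (input, one hidden layer, one output) can combine several features into a single confidence value, here is a forward-pass sketch in plain Python. The sigmoid activation, the layer sizes, the random weights and the function names are assumptions for illustration; the slides do not specify the topology or training procedure.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of an input -> hidden -> single-output MLP.

    w_hidden holds one weight row per hidden unit; w_out holds one
    weight per hidden unit. Returns a confidence-like score in (0, 1).
    """
    hidden = [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical sizes: 3 input features, 4 hidden units, untrained random weights.
random.seed(0)
n_in, n_hidden = 3, 4
w_hidden = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 0.0

score = mlp_forward([0.2, 0.7, 0.5], w_hidden, b_hidden, w_out, b_out)
```

In practice the weights would be learned from labelled correct/incorrect recognitions; the sketch only shows how heterogeneous features are merged into one score.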

  14. NN based experiments • Maximum correct classification rates: 70-75% for the three datasets (reasonable, taking into account the preselection rates achieved: 46.95%, 30.14% and 42.47%) • Best single feature: Standard deviation of the lexical access cost measured over the list of the first 10 candidates (0.1% of the dictionary size) • Final system uses 8 parameters (lexical and acoustical-based)
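The best single feature named above, the standard deviation of the lexical access cost over the first 10 candidates, could be computed along these lines. This is a sketch under assumptions: it treats lower cost as a better candidate and uses the population standard deviation; the slides do not state either choice, and `cost_std_dev` is a hypothetical helper name.

```python
import statistics


def cost_std_dev(costs, n_best=10):
    """Standard deviation of the n_best lowest lexical access costs.

    Assumes lower cost means a better candidate; uses the population
    standard deviation (statistics.pstdev).
    """
    top = sorted(costs)[:n_best]
    return statistics.pstdev(top)


# Made-up costs for one utterance's candidate list.
example_costs = [2, 4, 4, 4, 5, 5, 7, 9]
spread = cost_std_dev(example_costs)
```

A list whose top candidates have nearly identical costs yields a spread near zero, while a list with clearly separated costs yields a larger value.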

  15. Final distributions • [Figure panels: Not using NN; Using NN]

  16. Final distributions • [Figure panels: Using NN; Not using NN]

  17. Additional results • EER: • 30% for PERFDV • 25% for PEIV1000 and PRNOK5TR • Optimum threshold very close to the scale midpoint • Correct rejection rates for given False rejection:
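For readers unfamiliar with the EER figure quoted above, the sketch below shows one common way to estimate it: sweep a threshold over the confidence scale and pick the point where the false rejection and false acceptance rates are closest. The score lists, the 0..1 scale, the step count and the function names are illustrative assumptions, not data or code from the study.

```python
def error_rates(correct_scores, incorrect_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold."""
    # False rejection: a correctly recognized word scored below the threshold.
    fr = sum(s < threshold for s in correct_scores) / len(correct_scores)
    # False acceptance: an incorrectly recognized word scored at or above it.
    fa = sum(s >= threshold for s in incorrect_scores) / len(incorrect_scores)
    return fr, fa


def equal_error_rate(correct_scores, incorrect_scores, steps=100):
    """Sweep thresholds over 0..1; return (eer, threshold) where FR and FA are closest."""
    best = None
    for i in range(steps + 1):
        t = i / steps
        fr, fa = error_rates(correct_scores, incorrect_scores, t)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, (fr + fa) / 2, t)
    return best[1], best[2]


# Made-up confidence scores for correctly and incorrectly recognized words.
correct = [0.9, 0.8, 0.7, 0.4]
incorrect = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(correct, incorrect)
```

With these toy scores the two error rates meet at 25%, at a threshold between 0.4 and 0.6.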

  18. Conclusions • Introduced a word-level confidence estimation system based on NNs and a combination of lexical and acoustical features • The NN was shown to improve on the results obtained using the features directly • The best parameter is lexical-based and consistent with the acoustical versions reported in the literature (standard deviation is similar to likelihood ratios and n-best related features)

  19. Future work • Extend the comparison of the NN vs. non-NN system to the full feature set • Extend the work to the verification module (experiments already carried out show good results) • Extend the approach to CRS (phrase-level confidence)

  20. ROC Curves (NN vs. non NN)

  21. Any questions?
