Automatic Speech Recognition Studies
Guy Brown, Amy Beeston and Kalle Palomäki
PERCEPTUAL COMPENSATION | EPSRC 12-MONTH PROGRESS MEETING
Overview
• Aims
• The articulation index (AI) corpus
• Phone recogniser
• Results on sir/stir subset of AI corpus
• Future plans
Aims
• Aim to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
• Should be compatible with Watkins et al. findings but also validated on a ‘real world’ ASR task:
  • wider vocabulary
  • range of reverberation conditions
  • variety of speech contexts
  • naturalistic speech, rather than interpolated stimuli
  • consider phonetic confusions in reverberation in general
Progress to date
• Current work has focused on implementing a baseline ASR system for the articulation index (AI) corpus, which meets the requirements for speech material stated on the previous slide.
• So far we have results for phone recognition on a small test set without any ‘constancy’ processing.
• Planning an evaluation that compares phonetic confusions made by listeners and by the ASR system on the same test.
The articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania); available via the LDC.
• Intended for speech-recognition-in-noise experiments similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins et al.:
  • English (American)
  • target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
  • target syllables are embedded in a context sentence drawn from a limited vocabulary
Details of the AI corpus
• Includes all “valid” English diphone (CV, VC) syllables.
• Triphone syllables (CVC, CCV, VCC) chosen according to frequency in the Switchboard corpus
  • correlated with syllable frequency in casual conversation.
• 12 male speakers, 8 female speakers.
• Approximately 2000 syllables common to all speakers.
• Small amount (10 min) of conversational data.
• All speech data sampled at 16 kHz.
AI corpus examples
• Target syllable preceded by two context words and followed by one context word:
  CW1 CW2 SYL CW3
• CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words respectively.
• Examples:
  they recognise sir entirely
  people ponder stir second
Phone recogniser
• Monophone recogniser implemented and trained on the TIMIT corpus.
• Based on HTK scripts by Tony Robinson [1].
• Front-end: speech encoded as 12 cepstral coefficients + energy + deltas + accelerations (39 features); see the sketch below.
• Cepstral mean normalisation applied.
• 3 emitting states per phone model; observations modelled by a 20-component Gaussian mixture per state.
• Approximately 58% phone accuracy on the TIMIT test set.
[1] http://www.cantabResearch.com/HTKtimit.html
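A minimal sketch of this front-end in Python, using librosa in place of HTK's HCopy; the library choice, the file name and the use of c0 as a stand-in for the energy term are assumptions, not details of the actual system.

import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    # 13 static coefficients: c0 (approximates log energy) + 12 cepstra
    y, sr = librosa.load(wav_path, sr=sr)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    deltas = librosa.feature.delta(static)           # first derivatives
    accels = librosa.feature.delta(static, order=2)  # second derivatives
    feats = np.vstack([static, deltas, accels])      # (39, n_frames)
    # cepstral mean normalisation over the utterance
    feats -= feats.mean(axis=1, keepdims=True)
    return feats.T                                   # (n_frames, 39)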
Training and testing
• Trained on the TIMIT training set.
• The models still need adapting to the AI corpus material; work in progress.
• Removed allophones from the TIMIT labels (as is usual) to give a 41-phone set.
• Short pause and silence models included.
• For testing on the AI corpus, word-level transcriptions were expanded into phone sequences using the Switchboard-ICSI pronunciation dictionary (see the sketch below).
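A minimal sketch of the expansion step; the dictionary file format (one "WORD ph1 ph2 ..." entry per line) and the keep-first-pronunciation rule are assumptions rather than properties of the Switchboard-ICSI dictionary.

def load_dictionary(path):
    # maps each word to its first listed pronunciation (a phone list)
    lexicon = {}
    with open(path) as f:
        for line in f:
            word, *phones = line.split()
            lexicon.setdefault(word.upper(), phones)
    return lexicon

def expand(words, lexicon):
    # concatenate per-word phone sequences into one utterance-level sequence
    return [ph for w in words for ph in lexicon[w.upper()]]

# e.g. expand("THEY RECOGNIZE SIR ENTIRELY".split(), lexicon)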
Experiments
• Initial experiments were done with a subset of AI corpus utterances in which the target syllable is “sir” or “stir”.
• Small test set of 40 utterances.
Experiment 1: Fletcher-style paradigm
• A recogniser grammar was used in which:
  • the sets of context words CW1, CW2 and CW3 are specified;
  • the target syllable is any sequence of two or three phones.
• Corresponds to a task in which the listener knows that the context words are drawn from a limited set.
• The recogniser grammar is a (rather unconventional) mix of word-level and phone-level labels.
Experiment 1: recogniser grammar

$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;

$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD | OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE | DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER | CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW | SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW | WITNESS;

$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY | DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE | EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY | TYPICALLY | USUALLY | WISELY;

$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L | M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;

(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
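For reference, a sketch of how a grammar like this would typically be compiled and decoded with HTK; the file names are placeholders and the decoder options will differ in practice.

HParse gram.txt wdnet
    (compiles the grammar into a word network)
HVite -H hmmdefs -S test.scp -i results.mlf -w wdnet dict phonelist
    (decodes the test utterances against the network; dict maps grammar
    symbols, including the bare phone labels, onto the phone models)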
Experiment 1: results
• Overall 47.5% correct at the word level (sir/stir).
• Context words were not correctly recognised in some cases, with a knock-on effect on recognition of the target syllable.
• Examples: [recogniser output transcriptions not reproduced]
Experiment 2: constrained sir/stir
• A recogniser grammar was used in which:
  • the sets of context words CW1, CW2 and CW3 are specified;
  • the target syllable is constrained to “sir” or “stir”;
  • canonical pronunciations of “sir” and “stir” are assumed (i.e. “sir” = /s er/ and “stir” = /s t er/).
• Corresponds to a Watkins-style task, except that the context words vary and are drawn from a limited set.
• Utterances were either presented clean or convolved with the left or right channel of the L-shaped room or corridor BRIRs (see the sketch below).
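A minimal sketch of the reverberation step in Python; the soundfile/scipy route, the file names and the peak normalisation are assumptions about the processing chain.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read('utterance.wav')      # clean AI corpus utterance
brir, fs_b = sf.read('lshaped_5m.wav')     # stereo BRIR, columns = left/right
assert fs == fs_b
wet = fftconvolve(speech, brir[:, 0])      # left-channel condition
wet /= np.max(np.abs(wet)) + 1e-9          # avoid clipping on write
sf.write('utterance_rev_left.wav', wet, fs)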
Experiment 2: recogniser grammar
• Recogniser grammar was

$test = SIR | STIR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )

with $cw1, $cw2 and $cw3 defined as before.
Results: L-shaped room, left channel
Results: L-shaped room, right channel
Results: corridor, left channel
Results: corridor, right channel
Conclusions
• The phone recogniser works well when constrained to recognise “sir”/“stir” only (95% correct).
• The recognition rate falls as reverberation increases, as expected.
• The fall in performance is not only due to “stir” being reported as “sir”, which is the confusion expected from human studies.
• There are some effects of BRIR channel on performance: the right channel of the corridor BRIR is less problematic, most likely due to a strong early reflection in that channel for the 5 m condition.
Plans for next period: experiments
• The AI corpus lends itself to experiments in which target and context are varied as in the Watkins et al. experiments.
• Suggestion: compare listener and ASR phone confusions under conditions in which the whole utterance is reverberated, and when reverberation is added to the target syllable only (see the sketch after this list).
• Possible problems:
  • Relatively insensitive design? Will the effect of reverberation be sufficient to show up as consistent phone confusions?
  • Are the contexts long enough? (Some contexts are as short as 0.5 s.)
  • As shown in the baseline studies, the recogniser does not necessarily make the same mistakes as human listeners.
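A sketch of how the proposed listener/ASR comparison could be tabulated, assuming reference and response phones have already been aligned (insertions and deletions are glossed over, and the data below are toy values).

from collections import Counter

def confusions(aligned_pairs):
    # aligned_pairs: iterable of (reference_phone, response_phone)
    return Counter(aligned_pairs)

human = confusions([('s', 's'), ('t', 't')])  # toy data
asr = confusions([('s', 's'), ('t', 'd')])    # toy data
# cells where the recogniser and the listeners disagree most
diffs = {k: asr[k] - human[k] for k in set(asr) | set(human)}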
AI corpus sir/stir stimuli
• Utterances similar to the sir/stir format.
• Wider variety of speakers/contexts (but still a limited vocabulary).
• Targets mostly nonsense, but some real words (e.g. sir/stir).
• Reverberated (by Amy) according to the sir/stir paradigm, with near-near, near-far and far-far conditions.
Widening the sir/stir paradigm towards the ASR environment
• Introduce different stop consonants first: s {t,p,k} ir (see the grammar sketch below).
• Look for confusions in place of articulation.
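A possible grammar fragment for the widened task, following the Experiment 1 conventions; this is an assumption about how the task might be set up, not an implemented grammar (the optional $stop also admits the plain “sir” case).

$stop = T | P | K;
( !ENTER $cw1 $cw2 S [$stop] ER $cw3 !EXIT )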
Test words from AI corpus
• We could record our own: sigh, sty, spy, sky (sky is missing).
Questions for Tony
• Generally, would this sort of thing work?
• Is the initial delay in the BRIR kept?
• How should the AI corpus signals be level-normalised when mixed reverberation distances are used?
• How should the ordering of stimuli be controlled?
Plans: system development
• Currently the ASR system is trained on TIMIT; we expect improvement if it is adapted to the AI corpus material.
• We only have word-level transcriptions for the AI corpus, so phone labels must be obtained by forced alignment (see the sketch below).
• We will try the efferent model as a front end for recognition of reverberated speech; however:
  • it may not be sufficiently general, having been developed/tuned only for the sir/stir task;
  • that said, we have shown elsewhere that efferent suppression is effective in improving ASR performance in additive noise;
  • there is some relationship between the efferent model and successful engineering approaches.
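A sketch of the forced-alignment step using HTK's HVite, where -a aligns against the word-level MLF and -m records the model (phone) boundaries; the file names are placeholders for the AI corpus material.

HVite -a -m -b sil -C config -H hmmdefs -i phones_aligned.mlf -I words.mlf -S ai_corpus.scp dict phonelist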
Plans: system development (continued)
• The current efferent model is related to the engineering approach of Thomas et al. (2008): “the effect of reverberation is reduced when features are extracted from gain normalized temporal envelopes of long duration in narrow subbands” (see the sketch below).
• Our efferent model also applies gain control over long-duration windows (and will work in narrow bands).
• The model currently produces a spectral representation but could be modified to give cepstral features for ASR.
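A rough sketch in Python of the idea quoted above: extract the temporal envelope in one narrow subband and normalise its gain over a long window. The band edges, 1 s window and Hilbert envelope are illustrative assumptions, not the efferent model itself.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def normalised_subband_envelope(x, fs, lo=500.0, hi=700.0, win_s=1.0):
    # narrow subband via a zero-phase band-pass filter
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    band = sosfiltfilt(sos, x)
    env = np.abs(hilbert(band))                  # temporal envelope
    # long-duration local gain estimate
    win = int(win_s * fs)
    gain = np.convolve(env, np.ones(win) / win, mode='same')
    return env / (gain + 1e-9)                   # gain-normalised envelope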
Plans: other approaches
• Parallel search over room acoustics and word models?
  • How would context effects be included in such a scheme?
• On-line selection of word models trained in dry or reverberant conditions, according to context characteristics?
• Recognition within individual bands, i.e. train a recogniser for each band and combine posterior probabilities (see the sketch below):
  • may allow modelling of the Watkins et al. 8-band results;
  • performance of multiband systems is generally lower than conventional ASR.
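A minimal sketch of the posterior-combination step, with per-band phone posteriors merged by a product (log-domain sum) rule; the combination rule and array shapes are illustrative choices.

import numpy as np

def combine_band_posteriors(band_posteriors):
    # band_posteriors: list of (n_frames, n_classes) arrays, one per band
    log_sum = sum(np.log(p + 1e-12) for p in band_posteriors)
    merged = np.exp(log_sum - log_sum.max(axis=1, keepdims=True))
    return merged / merged.sum(axis=1, keepdims=True)  # renormalise per frame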
Lunch