Automatic Speech Recognition Studies
Guy Brown, Amy Beeston and Kalle Palomäki
PERCEPTUAL COMPENSATION | EPSRC 12-MONTH PROGRESS MEETING
Overview
• Aims
• The articulation index (AI) corpus
• Phone recogniser
• Results on sir/stir subset of AI corpus
• Future plans
Aims
• Aim to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
• Should be compatible with Watkins et al. findings but also validated on a ‘real world’ ASR task:
  • wider vocabulary
  • range of reverberation conditions
  • variety of speech contexts
  • naturalistic speech, rather than interpolated stimuli
  • consider phonetic confusions in reverberation in general
Progress to date
• Current work has focused on implementing a baseline ASR system for the articulation index (AI) corpus, which meets the requirements for speech material stated on the previous slide.
• So far we have results for phone recognition on a small test set without any ‘constancy’ processing.
• Planning an evaluation that compares phonetic confusions made by listeners and by the ASR system on the same test.
The articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania); available via the LDC.
• Intended for speech-recognition-in-noise experiments similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins et al.:
  • English (American)
  • target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
  • target syllables are embedded in a context sentence drawn from a limited vocabulary
Details of the AI corpus
• Includes all “valid” English diphone (CV, VC) syllables.
• Triphone syllables (CVC, CCV, VCC) chosen according to frequency in the Switchboard corpus
  • correlated with syllable frequency in casual conversation.
• 12 male speakers, 8 female speakers.
• Approximately 2000 syllables common to all speakers.
• Small amount (10 min) of conversational data.
• All speech data sampled at 16 kHz.
AI corpus examples
• Target syllable preceded by two context words and followed by one context word:
  CW1 CW2 SYL CW3
• CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words respectively.
• Examples:
  they recognise sir entirely
  people ponder stir second
Phone recogniser
• Monophone recogniser implemented and trained on the TIMIT corpus.
• Based on HTK scripts by Tony Robinson [1].
• Front-end: speech encoded as 12 cepstral coefficients + energy + deltas + accelerations (39 features); see the sketch below.
• Cepstral mean normalisation applied.
• 3 emitting states per phone model; observations modelled by a 20-component Gaussian mixture per state.
• Approximately 58% phone accuracy on the TIMIT test set.
[1] http://www.cantabResearch.com/HTKtimit.html
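A minimal sketch of this front-end in Python, using librosa in place of HTK's HCopy; the library choice, the file name and the use of c0 as a stand-in for the energy term are assumptions, not details of the actual system.

import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    # 13 static coefficients: c0 (approximates log energy) + 12 cepstra
    y, sr = librosa.load(wav_path, sr=sr)
    static = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    deltas = librosa.feature.delta(static)           # first derivatives
    accels = librosa.feature.delta(static, order=2)  # second derivatives
    feats = np.vstack([static, deltas, accels])      # (39, n_frames)
    # cepstral mean normalisation over the utterance
    feats -= feats.mean(axis=1, keepdims=True)
    return feats.T                                   # (n_frames, 39)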
Training and testing
• Trained on the TIMIT training set.
• The models still need adapting to the AI corpus material; work in progress.
• Removed allophones from the TIMIT labels (as is usual) to give a 41-phone set.
• Short pause and silence models included.
• For testing on the AI corpus, word-level transcriptions were expanded into phone sequences using the Switchboard-ICSI pronunciation dictionary (see the sketch below).
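A minimal sketch of the expansion step; the dictionary file format (one "WORD ph1 ph2 ..." entry per line) and the keep-first-pronunciation rule are assumptions rather than properties of the Switchboard-ICSI dictionary.

def load_dictionary(path):
    # maps each word to its first listed pronunciation (a phone list)
    lexicon = {}
    with open(path) as f:
        for line in f:
            word, *phones = line.split()
            lexicon.setdefault(word.upper(), phones)
    return lexicon

def expand(words, lexicon):
    # concatenate per-word phone sequences into one utterance-level sequence
    return [ph for w in words for ph in lexicon[w.upper()]]

# e.g. expand("THEY RECOGNIZE SIR ENTIRELY".split(), lexicon)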
Experiments
• Initial experiments were done with a subset of AI corpus utterances in which the target syllable is “sir” or “stir”.
• Small test set of 40 utterances.
Experiment 1: Fletcher-style paradigm
• A recogniser grammar was used in which:
  • the sets of context words CW1, CW2 and CW3 are specified;
  • the target syllable is any sequence of two or three phones.
• Corresponds to a task in which the listener knows that the context words are drawn from a limited set.
• The recogniser grammar is a (rather unconventional) mix of word-level and phone-level labels.
Experiment 1: recogniser grammar

$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;

$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD | OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE | DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER | CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW | SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW | WITNESS;

$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY | DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE | EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY | TYPICALLY | USUALLY | WISELY;

$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L | M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;

(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
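For reference, a sketch of how a grammar like this would typically be compiled and decoded with HTK; the file names are placeholders and the decoder options will differ in practice.

HParse gram.txt wdnet
    (compiles the grammar into a word network)
HVite -H hmmdefs -S test.scp -i results.mlf -w wdnet dict phonelist
    (decodes the test utterances against the network; dict maps grammar
    symbols, including the bare phone labels, onto the phone models)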
Experiment 1: results
• Overall 47.5% correct at the word level (sir/stir).
• Context words were not correctly recognised in some cases, with a knock-on effect on recognition of the target syllable.
• Examples: [recogniser output transcriptions not reproduced]
Experiment 2: constrained sir/stir
• A recogniser grammar was used in which:
  • the sets of context words CW1, CW2 and CW3 are specified;
  • the target syllable is constrained to “sir” or “stir”;
  • canonical pronunciations of “sir” and “stir” are assumed (i.e. “sir” = /s er/ and “stir” = /s t er/).
• Corresponds to a Watkins-style task, except that the context words vary and are drawn from a limited set.
• Utterances were either presented clean or convolved with the left or right channel of the L-shaped room or corridor BRIRs (see the sketch below).
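A minimal sketch of the reverberation step in Python; the soundfile/scipy route, the file names and the peak normalisation are assumptions about the processing chain.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, fs = sf.read('utterance.wav')      # clean AI corpus utterance
brir, fs_b = sf.read('lshaped_5m.wav')     # stereo BRIR, columns = left/right
assert fs == fs_b
wet = fftconvolve(speech, brir[:, 0])      # left-channel condition
wet /= np.max(np.abs(wet)) + 1e-9          # avoid clipping on write
sf.write('utterance_rev_left.wav', wet, fs)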
Experiment 2: recogniser grammar
• Recogniser grammar was

$test = SIR | STIR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )

with $cw1, $cw2 and $cw3 defined as before.
Results: L-shaped room, left channel
Results: L-shaped room, right channel
Results: corridor, left channel
Results: corridor, right channel
Conclusions
• The phone recogniser works well when constrained to recognise “sir”/“stir” only (95% correct).
• The recognition rate falls as reverberation increases, as expected.
• The fall in performance is not only due to “stir” being reported as “sir”, which is the confusion expected from human studies.
• There are some effects of BRIR channel on performance: the right channel of the corridor BRIR is less problematic, most likely due to a strong early reflection in that channel for the 5 m condition.
Plans for next period: experiments
• The AI corpus lends itself to experiments in which target and context are varied as in the Watkins et al. experiments.
• Suggestion: compare listener and ASR phone confusions under conditions in which the whole utterance is reverberated, and when reverberation is added to the target syllable only (see the sketch after this list).
• Possible problems:
  • Relatively insensitive design? Will the effect of reverberation be sufficient to show up as consistent phone confusions?
  • Are the contexts long enough? (Some contexts are as short as 0.5 s.)
  • As shown in the baseline studies, the recogniser does not necessarily make the same mistakes as human listeners.
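A sketch of how the proposed listener/ASR comparison could be tabulated, assuming reference and response phones have already been aligned (insertions and deletions are glossed over, and the data below are toy values).

from collections import Counter

def confusions(aligned_pairs):
    # aligned_pairs: iterable of (reference_phone, response_phone)
    return Counter(aligned_pairs)

human = confusions([('s', 's'), ('t', 't')])  # toy data
asr = confusions([('s', 's'), ('t', 'd')])    # toy data
# cells where the recogniser and the listeners disagree most
diffs = {k: asr[k] - human[k] for k in set(asr) | set(human)}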
AI corpus sir/stir stimuli
• Utterances similar to the sir/stir format.
• Wider variety of speakers/contexts (but still a limited vocabulary).
• Targets mostly nonsense, but some real words (e.g. sir/stir).
• Reverberated (by Amy) according to the sir/stir paradigm, with near-near, near-far and far-far conditions.
Widening the sir/stir paradigm towards the ASR environment
• Introduce different stop consonants first: s {t,p,k} ir (see the grammar sketch below).
• Look for confusions in place of articulation.
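A possible grammar fragment for the widened task, following the Experiment 1 conventions; this is an assumption about how the task might be set up, not an implemented grammar (the optional $stop also admits the plain “sir” case).

$stop = T | P | K;
( !ENTER $cw1 $cw2 S [$stop] ER $cw3 !EXIT )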
Test words from AI corpus
• We could record our own: sigh, sty, spy, sky (sky is missing).
Questions for Tony
• Generally, would this sort of thing work?
• Is the initial delay in the BRIR kept?
• How should the AI corpus signals be level-normalised when mixed reverberation distances are used?
• How should the ordering of stimuli be controlled?
Plans: system development
• Currently the ASR system is trained on TIMIT; we expect improvement if it is adapted to the AI corpus material.
• We only have word-level transcriptions for the AI corpus, so phone labels must be obtained by forced alignment (see the sketch below).
• We will try the efferent model as a front end for recognition of reverberated speech; however:
  • it may not be sufficiently general, having been developed/tuned only for the sir/stir task;
  • that said, we have shown elsewhere that efferent suppression is effective in improving ASR performance in additive noise;
  • there is some relationship between the efferent model and successful engineering approaches.
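A sketch of the forced-alignment step using HTK's HVite, where -a aligns against the word-level MLF and -m records the model (phone) boundaries; the file names are placeholders for the AI corpus material.

HVite -a -m -b sil -C config -H hmmdefs -i phones_aligned.mlf -I words.mlf -S ai_corpus.scp dict phonelist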
Plans: system development (continued)
• The current efferent model is related to the engineering approach of Thomas et al. (2008): “the effect of reverberation is reduced when features are extracted from gain normalized temporal envelopes of long duration in narrow subbands” (see the sketch below).
• Our efferent model also applies gain control over long-duration windows (and will work in narrow bands).
• The model currently produces a spectral representation but could be modified to give cepstral features for ASR.
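A rough sketch in Python of the idea quoted above: extract the temporal envelope in one narrow subband and normalise its gain over a long window. The band edges, 1 s window and Hilbert envelope are illustrative assumptions, not the efferent model itself.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def normalised_subband_envelope(x, fs, lo=500.0, hi=700.0, win_s=1.0):
    # narrow subband via a zero-phase band-pass filter
    sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
    band = sosfiltfilt(sos, x)
    env = np.abs(hilbert(band))                  # temporal envelope
    # long-duration local gain estimate
    win = int(win_s * fs)
    gain = np.convolve(env, np.ones(win) / win, mode='same')
    return env / (gain + 1e-9)                   # gain-normalised envelope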
Plans: other approaches
• Parallel search over room acoustics and word models?
  • How would context effects be included in such a scheme?
• On-line selection of word models trained in dry or reverberant conditions, according to context characteristics?
• Recognition within individual bands, i.e. train a recogniser for each band and combine posterior probabilities (see the sketch below):
  • may allow modelling of the Watkins et al. 8-band results;
  • performance of multiband systems is generally lower than conventional ASR.
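A minimal sketch of the posterior-combination step, with per-band phone posteriors merged by a product (log-domain sum) rule; the combination rule and array shapes are illustrative choices.

import numpy as np

def combine_band_posteriors(band_posteriors):
    # band_posteriors: list of (n_frames, n_classes) arrays, one per band
    log_sum = sum(np.log(p + 1e-12) for p in band_posteriors)
    merged = np.exp(log_sum - log_sum.max(axis=1, keepdims=True))
    return merged / merged.sum(axis=1, keepdims=True)  # renormalise per frame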
Lunch