This study tests large vocabulary automatic speech recognition (ASR) in the "stir-sir" paradigm using a newly trained English-English ASR system. The experiments adapt the ASR models to different conditions and evaluate free recognition performance. The results suggest that further improvements are needed to approach the performance of human listeners.
Experiments on the "stir-sir" paradigm using large vocabulary ASR
Kalle Palomäki
Adaptive Informatics Research Centre, Helsinki University of Technology
Introduction
• Aim: test large vocabulary ASR in the "stir-sir" paradigm
• Motivation: large vocabulary ASR has learned phoneme models close to those of humans
• ASR: a newly trained English-English large vocabulary recogniser
  • Trained on read Wall Street Journal articles
  • Sampling rate: 16 kHz
ASR details
• Standard features: Mel-frequency cepstral coefficients (MFCCs) + power + deltas + accelerations
• Triphone HMMs with acoustic likelihoods modelled by Gaussian mixture models
• Supervised adaptation using constrained maximum likelihood linear regression (CMLLR)
  • Can be formulated as a linear feature transformation
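Since CMLLR can be expressed as a feature-space transform, it amounts to mapping each feature frame x to Ax + b. A minimal sketch, assuming 39-dimensional MFCC+delta feature frames; the transform matrix A and bias b here are placeholder values (a real system estimates them from adaptation data by maximising likelihood under the HMM):

```python
import numpy as np

def apply_cmllr(features: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a CMLLR-style feature transform: each frame x becomes A @ x + b."""
    return features @ A.T + b

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 39))  # 100 frames of 39-dim MFCC+delta features
A = np.eye(39)                           # identity transform as an illustrative stand-in
b = np.zeros(39)                         # zero bias as an illustrative stand-in
adapted = apply_cmllr(frames, A, b)
print(adapted.shape)
```

With the identity transform the features pass through unchanged; in practice A and b shift the model's feature space towards the adaptation condition (e.g. near-near or far-far).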
Experiments
• Three quantities measured:
  • Free recognition result
  • Recognizer's choice between "next_you'll_get_sir_to_click_on" and "next_you'll_get_stir_to_click_on"
  • Temporally averaged log-probability of "t"
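The forced-choice measure above can be sketched as comparing the total log-likelihoods of the two sentence hypotheses, and the "t" measure as a temporal average of per-frame log-probabilities. The per-frame values below are illustrative placeholders; a real system would obtain them from HMM alignment:

```python
import numpy as np

def choose_sentence(loglik_sir, loglik_stir):
    """Pick the hypothesis with the higher total log-likelihood."""
    return "sir" if sum(loglik_sir) > sum(loglik_stir) else "stir"

def avg_logprob_t(frame_logprobs_t):
    """Temporally averaged log-probability over the frames aligned to 't'."""
    return float(np.mean(frame_logprobs_t))

# Placeholder per-frame log-likelihoods for the two sentence hypotheses.
loglik_sir = [-3.1, -2.8, -3.0]
loglik_stir = [-3.5, -3.2, -3.4]
print(choose_sentence(loglik_sir, loglik_stir))   # sir
print(avg_logprob_t([-4.0, -3.0, -2.0]))          # -3.0
```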
Experiments
• Experiment 1: "dry" models with no adaptation
• Experiment 2: "dry" models adapted to the matching condition
  • Near-near adapted with near-near, far-far adapted with far-far
  • Supervised adaptation with utterances at the ends of the continuum
• Experiment 3: "dry" models adapted to both "near-near" and "far-far"
  • Supervised adaptation with utterances at the ends of the continuum
Exp. 1: "dry" models, no adaptation
• Free recognition:
  • Near-near: "nantz two-a-days so far", "nursing care so far"
  • Far-far: "nantz th", "NMS death", ""
• Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and a silence model:
  • Near-near: change between conditions 08 and 09
  • Far-far: everything recognised as silence
Exp. 2: "dry" models, adapted to the matching condition
• Free recognition:
  • Near-near: "next month though the khon"
  • Far-far: "next he'll throw the khon"
• Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and a silence model:
  • Near-near: change between conditions 03 and 04
  • Far-far: "sir" all the time
Exp. 3: "dry" models, adapted to both
• Free recognition:
  • Near-near: "next month though the khon"
  • Far-far: "next month khon" or "nantz khon"
• Choice between "next_you'll_get_sir_to_click_on", "next_you'll_get_stir_to_click_on" and a silence model:
  • Switches erratically between the two sentences
Discussion & future directions
• Results currently "unconvincing":
  • Poor free recognition performance, especially in the far-far condition
  • May be hard to obtain sensitivity similar to that of human listeners
• Tricks to get around the poor performance:
  • Cooke (2006) uses a priori masks in order to find glimpses of speech
  • Choose between two sentences rather than free recognition
  • Measure log-probability instead of recognition performance
• How to model compensation, which is the main issue
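The a priori mask idea from Cooke (2006) can be sketched as a binary spectro-temporal mask that keeps only "glimpses" where speech dominates noise. The 0 dB threshold and the toy spectrogram values below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def a_priori_mask(speech_spec: np.ndarray, noise_spec: np.ndarray,
                  threshold_db: float = 0.0) -> np.ndarray:
    """Binary mask: 1 where local SNR (speech vs. noise power) exceeds the threshold."""
    eps = 1e-10  # avoid log of zero
    local_snr_db = 10.0 * np.log10((speech_spec + eps) / (noise_spec + eps))
    return (local_snr_db > threshold_db).astype(float)

# Toy 2x2 power spectrograms: glimpses survive where speech power > noise power.
speech = np.array([[4.0, 0.1],
                   [0.2, 9.0]])
noise = np.array([[1.0, 1.0],
                  [1.0, 1.0]])
mask = a_priori_mask(speech, noise)
print(mask)
```

Recognition would then be restricted to the masked (glimpsed) regions, sidestepping the frames where the recogniser's acoustic models are dominated by noise.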