Speech recognition in MUMIS
Mirjam Wester, Judith Kessens & Helmer Strik
Intro
• Objective: automatic speech recognition of football commentaries
• SPEX transcribed two matches for two languages (Dutch and English):
  • England - Germany (Eng-Dld)
  • Yugoslavia - The Netherlands (Yug-Ned)
• Commentaries and stadium noise are mixed
Data Conversion
• SPEX transcription:
  • text grid:
    • orthographic transcription
    • chunk alignment; chunk = a segment of speech of about 2 to 3 seconds
  • CD with one large wav file
• Split according to chunk alignments
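The splitting step can be sketched as follows. This is a minimal illustration, not the tool used in the project: it assumes the chunk alignments have already been read from the text grid into a list of (start, end) times in seconds, and the function name `split_wav` and the output file naming are hypothetical.

```python
import os
import wave

def split_wav(wav_path, chunks, out_dir):
    """Write one WAV file per (start_sec, end_sec) chunk and return the paths.

    `chunks` is assumed to come from the chunk alignments in the text grid.
    """
    paths = []
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(chunks):
            first = int(round(start * rate))          # first frame of the chunk
            count = int(round(end * rate)) - first    # number of frames to copy
            src.setpos(first)
            frames = src.readframes(count)
            out_path = os.path.join(out_dir, f"chunk_{i:04d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)   # same channels, sample width and rate
                dst.writeframes(frames) # header frame count is fixed on close
            paths.append(out_path)
    return paths
```

For example, `split_wav("match.wav", [(0.0, 2.4), (2.4, 5.1)], "chunks")` would write two files, one per chunk, assuming the directory `chunks` exists.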
Examples of data
• Yug-Ned Dutch
• Yug-Ned English
• Eng-Dld Dutch
• Eng-Dld English
Statistics
English matches have two commentators, Dutch only one. Overlapping segments have been disregarded.
Training
Dutch:
• Yug-Ned ¾ of CD (19 min speech)
• France Telecom Noise Reduction (FTNR)
English:
• Yug-Ned ¾ of CD (28 min speech)
• FTNR
For more information on the France Telecom Noise Reduction tool, see: B. Noé, J. Sienel, D. Jouvet, L. Mauuary, L. Boves, J. de Veth & F. de Wet, “Noise Reduction for Noise Robust Feature Extraction for Distributed Speech Recognition”, in Proc. of Eurospeech ’01.
Test
Dutch:
• Yug-Ned ¼ of CD
• 626 chunks, 1577 words
• lexicon and language model based on complete Yug-Ned match
English:
• Yug-Ned ¼ of CD
• 636 chunks, 2641 words
• lexicon and language model based on complete Yug-Ned match
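The ¾ / ¼ division and the lexicon built on the complete match can be illustrated with a small sketch. Several things here are assumptions rather than facts from the slides: that the split is contiguous (first ¾ for training), that the language model is a bigram model, and all function names. The point it shows is that a lexicon built on the complete match leaves no out-of-vocabulary words in the test part.

```python
from collections import Counter

def split_chunks(chunks, train_fraction=0.75):
    """Assumed contiguous split: first part for training, rest for testing."""
    cut = int(len(chunks) * train_fraction)
    return chunks[:cut], chunks[cut:]

def build_lexicon(transcripts):
    """Sorted list of the distinct lower-cased words in the transcriptions."""
    return sorted({w for line in transcripts for w in line.lower().split()})

def bigram_counts(transcripts):
    """Bigram counts with sentence-boundary markers (bigram order is an assumption)."""
    counts = Counter()
    for line in transcripts:
        words = ["<s>"] + line.lower().split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts

def oov_rate(lexicon, transcripts):
    """Fraction of word tokens in the transcriptions that are not in the lexicon."""
    words = [w for line in transcripts for w in line.lower().split()]
    known = set(lexicon)
    return sum(w not in known for w in words) / len(words)
```

With the lexicon taken from the complete match, `oov_rate(build_lexicon(all_chunks), test_chunks)` is zero by construction, which makes the reported results somewhat optimistic compared with an unseen match.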
Dutch – Polyphone
• Data consists of phonetically rich sentences
• Phone models were trained on:
  • Polyphone all speakers
  • Polyphone male speakers
  • Polyphone male speakers + MUMIS noise
• Polyphone as bootstrap for segmentation of MUMIS material
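Adding MUMIS noise to Polyphone speech can be sketched as additive mixing at a chosen signal-to-noise ratio. The slides do not say how the noise was added, so the additive model, the target SNR parameter and the function names below are all assumptions made for illustration.

```python
import math

def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

def add_noise(speech, noise, snr_db):
    """Mix noise into speech, scaling the noise so the mix has the requested SNR (dB)."""
    # Repeat the noise when it is shorter than the speech, then trim to length.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[:len(speech)]
    # Choose the gain so that power(speech) / power(gain * noise) = 10^(snr_db / 10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Here `speech` and `noise` are plain lists of float samples at the same sampling rate; a stretch of stadium noise without commentary would play the role of `noise`.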
Cross tests (Dutch & English)
• train on ¾ Yug-Ned, test on ¼ Eng-Dld
• train on ¾ Eng-Dld, test on ¼ Yug-Ned
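MUMIS models (Dutch) [results table; columns: Yug-Ned test, Eng-Dld test]
MUMIS models (English) [results table; columns: Yug-Ned test, Eng-Dld test]
MUMIS models (Dutch+English) [results table; columns: Yug-Ned test, Eng-Dld test]
Function words vs content words [results table; rows: word type; columns: Dutch data, English data]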
Discussion
• WERs are high
• Noise?
  • FTNR leads to higher SNR, but WERs do not improve substantially
• Not enough training data?
  • Polyphone for training/bootstrapping does not lead to lower WERs than training on MUMIS data
  • Noisifying Polyphone with MUMIS noise gives encouraging results
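For reference, the word error rate (WER) discussed here is conventionally the number of substitutions, deletions and insertions needed to turn the recognizer output into the reference transcription, divided by the number of reference words. The sketch below uses the standard word-level edit distance; it is not the scoring tool used in the project, and the function name is hypothetical.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions count as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.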
Discussion continued
• Function words comprise about 50% of the data and cause a great deal of the errors
• Names are recognized very well
• Function words not necessary for information extraction (?)
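The function-word share and the idea of keeping only content words can be illustrated as follows. The word list is a tiny hypothetical English sample, not the list used for the analysis, and both function names are made up for this sketch.

```python
# Hypothetical, illustrative function-word list (not the one used in the study).
FUNCTION_WORDS = {"the", "a", "of", "to", "and", "is", "in", "it"}

def function_word_share(transcripts, function_words=FUNCTION_WORDS):
    """Fraction of word tokens that are function words."""
    tokens = [w for line in transcripts for w in line.lower().split()]
    return sum(w in function_words for w in tokens) / len(tokens)

def content_words(text, function_words=FUNCTION_WORDS):
    """Drop function words, keeping the content words in their original order."""
    return [w for w in text.split() if w.lower() not in function_words]
```

If information extraction only needs names and other content words, scoring could be restricted to the output of `content_words` for both reference and hypothesis.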
Future work
• Steps to noise-robust speech recognition:
  • model/speaker adaptation
  • combinations of noisified Polyphone models and FTNR
• Other issues:
  • transcription of more data
  • English, Dutch and German
  • preference: specific games? radio? TV?
  • generic football-specific language model
  • confidence measures?
Future work continued
Questions:
• What type of output from ASR is needed?
  • word-graph
  • n-best list
  • top of the list
  • word spotting? only content words?
• For research purposes: is it possible to obtain data that has not been mixed (noise + commentary)?
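To make the output options concrete, here is a rough sketch of how an n-best list relates to "top of the list" and to word spotting. The representation of the n-best list as (score, text) pairs with higher scores being better is an assumption, as are the function names; a real recognizer's output format may differ.

```python
def top_hypothesis(nbest):
    """Return the text of the best-scoring hypothesis ("top of the list")."""
    return max(nbest, key=lambda item: item[0])[1]

def spot_keywords(nbest, keywords):
    """For each keyword found, return the best score of a hypothesis containing it."""
    found = {}
    for score, text in nbest:
        for word in set(text.lower().split()) & keywords:
            found[word] = max(score, found.get(word, float("-inf")))
    return found
```

For example, with an n-best list of three hypotheses, `spot_keywords(nbest, {"kluivert", "goal"})` reports only the keywords that occur somewhere in the list, so a keyword missing from the top hypothesis can still be recovered from a lower-ranked one.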