The Development of the AMI System for the Transcription of Speech in Meetings
Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals
July 12, 2005, MLMI Edinburgh
Outline
• Multi-site development
• Development strategy
• Resources
• Modelling
• System integration
• Results
• Conclusions
Thomas Hain / July 12, 2005 MLMI – The AMI ASR Meeting System
AMI ASR around the globe
Multi-site development
• Large vocabulary ASR is complex and requires considerable resources
• Split development effort across multiple sites
  • DICT
  • LM
  • CORE
  • ADAPT
  • AUDIO-PREPROC
• Central storage and compute resources
• Communication: frequent telephone conferences, internet chat, “working phone calls” (VoIP), multiple workshops, wiki
Development paradigm
• Resource building
  • Dictionary
  • LM
  • Acoustic data
• Resource driven
  • Bootstrap from conversational telephone speech (CTS)
• Generic technology selection
  • Pick generic techniques with maximum gain
  • VTLN, HLDA, MPE, CN
• Task specific components
  • Front-ends
  • Language models
Resources
• Meeting resources “sparse”
  • Corpora: ICSI, ISL, NIST (LDC, VT)
  • The AMI corpus (initial parts)
  • 100 hours of meeting data
• Language model data
  • Broadcast News (220MW)
  • Web data - CTS/AMI/Meeting (600MW)
  • Meetings (ICSI/ISL/NIST/AMI)
  • CTS (Swbd/Fisher)
  • …
• Dictionary
  • Edinburgh UNISYN
Dictionary
• Baseline dictionary based upon UNISYN (Fitt, 2000) with 114,876 words
• Semi-automatic generation of pronunciations
  • Part-word pronunciations initially guessed automatically from the existing pronunciations
  • Automatic CART based letter-to-sound conversion trained from UNISYN
  • Hand correction/checking of all automatic hypotheses
• Words were all converted to British spellings
• An additional 11,595 words were added using a combination of automatic and manual generation
• Pronunciation probabilities (estimated from alignment of the training data)
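
The last bullet can be illustrated with a relative-frequency estimate. This is a minimal sketch, assuming the alignment yields one (word, pronunciation) pair per aligned token; the function name and the phone strings in the usage note are hypothetical and not taken from the AMI dictionary.

```python
from collections import Counter, defaultdict

def pronunciation_probabilities(aligned_pairs):
    """Relative frequency of each pronunciation variant per word.

    aligned_pairs: iterable of (word, pronunciation) tuples, one per
    aligned token (assumed output format of a forced alignment).
    """
    counts = defaultdict(Counter)
    for word, pron in aligned_pairs:
        counts[word][pron] += 1
    return {
        word: {pron: c / sum(prons.values()) for pron, c in prons.items()}
        for word, prons in counts.items()
    }
```

For example, three aligned tokens of "either", two with "ay dh ax" and one with "iy dh ax", would give the first variant probability 2/3.
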
Vocabulary
[Table: OOV rates, vocabulary source vs. test data]
• Out-of-vocabulary (OOV) rates with padding to 50k words from general Broadcast News data
• No need for a specific vocabulary!
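
How an OOV rate with vocabulary padding is computed can be sketched as follows. The function names are hypothetical, and the padding rule (keep all in-domain words, then fill up with the most frequent background words) is an assumption about what "padding" means here.

```python
def pad_vocabulary(in_domain_words, background_by_frequency, target_size):
    """Keep every in-domain word, then add background words
    (most frequent first) until target_size is reached."""
    vocab = list(dict.fromkeys(in_domain_words))  # de-duplicate, keep order
    seen = set(vocab)
    for word in background_by_frequency:
        if len(vocab) >= target_size:
            break
        if word not in seen:
            vocab.append(word)
            seen.add(word)
    return vocab

def oov_rate(test_tokens, vocabulary):
    """Percentage of running test words that are not in the vocabulary."""
    if not test_tokens:
        raise ValueError("test set is empty")
    vocab = set(vocabulary)
    missing = sum(1 for word in test_tokens if word not in vocab)
    return 100.0 * missing / len(test_tokens)
```

With a target size of 50,000 and a Broadcast News frequency list as the background, this mirrors the setup named on the slide.
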
Language modelling
[Table: perplexities, language model source vs. test data]
• Interpolated trigram language models on meeting data, optimised for each domain (on independent dev data)
• Perplexity results
  • Meeting-resource-specific models outperform general models
  • Translates into a 0.5% absolute word error rate improvement
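
Optimising interpolation weights on dev data is commonly done with EM. The sketch below assumes each component language model has already assigned a probability to every dev-set word; the data layout and function names are hypothetical and not a description of the tools used for the AMI system.

```python
import math

def interpolate(weights, component_probs):
    """Mix component probabilities word by word: p(w) = sum_i lambda_i * p_i(w)."""
    return [sum(l * p for l, p in zip(weights, word)) for word in component_probs]

def perplexity(probs):
    """Perplexity from per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

def em_weights(component_probs, iterations=50):
    """Estimate interpolation weights that increase dev-data likelihood.

    component_probs: one tuple per dev word, holding the probability
    each component model assigns to that word.
    """
    n = len(component_probs[0])
    weights = [1.0 / n] * n
    for _ in range(iterations):
        expected = [0.0] * n
        for word in component_probs:
            mix = sum(l * p for l, p in zip(weights, word))
            for i in range(n):
                # posterior that component i generated this word
                expected[i] += weights[i] * word[i] / mix
        weights = [c / len(component_probs) for c in expected]
    return weights
```

Running `em_weights` once per meeting domain on that domain's dev data gives one weight set per domain, which is one reading of "optimised for each domain".
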
Acoustic modelling
• Standard HMM based framework
  • Decision tree state clustered triphones
  • Hidden Markov model toolkit (HTK)
  • Maximum likelihood training
  • Approx. 70k Gaussians per model set
• MAP adaptation from CTS models
  • Bandwidth problem: CTS is narrowband data (4kHz), meetings are recorded at 8kHz bandwidth
  • Developed MLLR/MAP
• Front-end feature transform
  • SHLDA = Smoothed Heteroscedastic Linear Discriminant Analysis
  • Typically 1.5% WER improvement
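
The idea behind MAP adaptation of a Gaussian mean can be sketched with the standard textbook update, which interpolates between the prior mean and the adaptation-data mean. This is not the combined MLLR/MAP scheme named on the slide; the function name and the default value of `tau` are assumptions.

```python
def map_update_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP re-estimate of one Gaussian mean:
    (tau * prior_mean + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)

    prior_mean  : mean vector of the prior model (e.g. CTS-trained)
    frames      : adaptation feature vectors x_t
    occupancies : posterior gamma_t of this Gaussian for each frame
    tau         : prior weight (hypothetical value, a tuning parameter)
    """
    total = sum(occupancies)
    dim = len(prior_mean)
    weighted = [sum(g * x[d] for g, x in zip(occupancies, frames)) for d in range(dim)]
    return [(tau * prior_mean[d] + weighted[d]) / (tau + total) for d in range(dim)]
```

With little adaptation data the result stays close to the prior mean; with a lot of data it approaches the data mean.
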
Speaker/channel adaptation
• CMN/CVN (channel)
• Vocal tract length normalisation (VTLN)
  • Maximum likelihood
  • Training & test
  • Typically 3-4% WER gain
• MLLR
  • Mean and variance
  • Transforms for speech and silence
  • Typically 1-2% improvement
[Figure: histograms of warp factors, female/male]
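
Cepstral mean and variance normalisation (CMN/CVN) is the simplest item in this list to show in code. A minimal sketch, assuming the statistics are computed over all frames of one channel or segment; the function name is hypothetical.

```python
import math

def cmn_cvn(features, eps=1e-8):
    """Normalise each feature dimension to zero mean and unit variance.

    features: list of frames, each a list of floats, taken from one
    channel or segment (the normalisation scope is an assumption).
    """
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n)
        for d in range(dim)
    ]
    # eps guards against division by zero for constant dimensions
    return [[(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)] for f in features]
```
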
Front-ends
• Meeting recordings with a variety of source types
• Microphone locations
  • Close talking: head-mounted/lapel
  • Distant: “arbitrary location”, various array configurations
• Requires: speech activity detection, speaker “grouping”, speaker and location tracking
• Objective: achieve “close-talking” performance with distant microphones
• Enhancement type approach for simplicity
IHM front-end processing
• Signal enhancement (cross-talk suppression): LMS echo cancellation
• Speech activity detection (SAD): using a multi-layer perceptron (MLP)
[Diagram: x (IHM channel) and Yk (remaining IHM channels) → cross-talk suppression → x’ (enhanced signal) → feature extraction → 36 dim feature vector → MLP classification → Viterbi decoder with smoothing parameters (insertion penalty, minimum duration)]
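
The smoothing step after MLP classification can be sketched as a two-state Viterbi pass over the frame posteriors. Only a switching penalty is modelled here, standing in for the insertion penalty; the minimum-duration constraint is left out. The function name and penalty value are assumptions.

```python
import math

def smooth_sad(speech_posteriors, switch_penalty=5.0):
    """Return a 0/1 (non-speech/speech) label per frame.

    speech_posteriors: per-frame probability of speech, e.g. from an MLP.
    switch_penalty   : log-domain cost of changing state (hypothetical value).
    """
    if not speech_posteriors:
        return []
    eps = 1e-10
    # log scores per frame for state 0 (non-speech) and state 1 (speech)
    scores = [(math.log(max(1.0 - p, eps)), math.log(max(p, eps)))
              for p in speech_posteriors]
    best = list(scores[0])
    backpointers = []
    for frame in scores[1:]:
        new, ptr = [0.0, 0.0], [0, 0]
        for j in (0, 1):
            stay = best[j]
            switch = best[1 - j] - switch_penalty
            if stay >= switch:
                new[j], ptr[j] = stay + frame[j], j
            else:
                new[j], ptr[j] = switch + frame[j], 1 - j
        best = new
        backpointers.append(ptr)
    # trace back from the better final state
    state = 0 if best[0] >= best[1] else 1
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

A larger penalty removes short isolated speech detections while keeping longer speech regions.
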
IHM cross-talk suppression
• Multiple-reference LMS adaptive filtering with 256 tap FIR filter
• Adaptation is frozen during periods of speech activity
• Automatic correction for channel timing misalignment
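
A multiple-reference LMS canceller with frozen adaptation can be sketched in plain Python. This is an illustration of the technique, not the AMI implementation: the step size `mu` and the `freeze` mask are assumptions, and timing-misalignment correction is not included.

```python
def lms_crosstalk_cancel(target, references, taps=256, mu=0.01, freeze=None):
    """Subtract an FIR-filtered estimate of each reference channel
    from the target channel and return the residual signal.

    target     : samples of the channel to enhance
    references : list of sample sequences from the other channels
    freeze     : optional per-sample booleans; True stops weight updates
                 (e.g. while the target speaker is active)
    """
    weights = [[0.0] * taps for _ in references]
    output = []
    for n, x in enumerate(target):
        # latest `taps` samples of each reference, newest first, zero-padded
        windows = [[ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
                   for ref in references]
        estimate = sum(w * s
                       for ws, win in zip(weights, windows)
                       for w, s in zip(ws, win))
        error = x - estimate
        output.append(error)
        if freeze is None or not freeze[n]:
            for ws, win in zip(weights, windows):
                for k in range(taps):
                    ws[k] += mu * error * win[k]
    return output
```

Freezing the update while the target speaker talks keeps the filter from adapting to, and then cancelling, the wanted speech.
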
Multiple distant microphones
• Gain calibration
  • Simple gain calibration is performed in which the maximum amplitude of each audio channel is normalised
• Noise removal
  • Noise spectrum of each input channel is estimated
  • A Wiener filter is applied to each channel to remove stationary noise
• Delay estimation
  • Computed per frame
  • Scale factors: ratio of energy
  • Delay: peak finding in cross-correlation
• Beamformer
  • Beamformer filters use the superdirective technique with the noise correlation matrix estimated above
[Diagram: gain calibration → noise removal → delay estimation → beamformer]
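
The delay estimation step can be sketched as a peak search over the cross-correlation of one frame from two channels. The function names, the sign convention for the lag, and the direction of the energy ratio are assumptions made for illustration.

```python
def estimate_delay(reference, channel, max_lag):
    """Lag (in samples) at which `channel` best matches `reference`.

    A positive result means `channel` lags behind `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(channel):
                score += r * channel[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def energy_scale(reference, channel):
    """Scale factor as the ratio of frame energies (reference over channel)."""
    e_ref = sum(x * x for x in reference)
    e_ch = sum(x * x for x in channel)
    return e_ref / e_ch if e_ch > 0 else 0.0
```

Calling both functions once per frame gives the per-frame delays and scale factors that a beamformer would consume.
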
Towards a model set
• Model initialisation (WER on ICSI only)
• More training data
System architecture
[Diagram: front-end (IHM/MDM) → modified audio, segments, speaker info → first pass recognition → first recognition result → adaptation → lattice generation → word lattices → LM rescoring → final word level result]
Results on rt05seval
• CTS-adapted ML models, unadapted, trigram LM (first pass)
  • MDM segmentation provided by ICSI/SRI
  • REF denotes reference segmentation and speaker labels
• The performance of the full system on the above AMI subset is 30.9% for IHM and 35.1% for MDM
• BUT: difference on REF remains
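
Word error rate, the metric used throughout these slides, is the word-level edit distance divided by the number of reference words. The sketch below is a minimal version without the text normalisation or segment alignment that an official scoring tool performs; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.
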
Conclusions
• Multi-site development!
• Competitive ASR system in 10-11 months
• Meeting domains inhomogeneous?
• Good improvements with VTLN/SHLDA/MLLR
• Pre-processing needs to be sorted!
• Reasonable performance on Seminar data