The Development of the AMI System for the Transcription of Speech in Meetings

Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals
July 12, 2005, MLMI Edinburgh

Outline
• Multi-site development
• Development strategy
• Resources
• Modelling
• System integration
• Results
• Conclusions
Thomas Hain / July 12, 2005 MLMI – The AMI ASR Meeting System

AMI ASR around the globe

Multi-site development
• Large vocabulary ASR is complex and requires considerable resources
• Split development effort across multiple sites
• DICT
• LM
• CORE
• ADAPT
• AUDIO-PREPROC
• Central storage and compute resources
• Communication: frequent telephone conferences, internet chat, “working phone calls” (VoIP), multiple workshops, wiki

Development paradigm
• Resource building
• Dictionary
• LM
• Acoustic data
• Resource driven
• Bootstrap from conversational telephone speech (CTS)
• Generic technology selection
• Pick generic techniques with maximum gain
• VTLN, HLDA, MPE, CN
• Task specific components
• Front-ends
• Language models

Resources
• Meeting resources “sparse”
• Corpora: ICSI, ISL, NIST (LDC, VT)
• The AMI corpus (initial parts)
• 100 hours of meeting data
• Language model data
• Broadcast News (220MW)
• Web-data - CTS/AMI/Meeting (600MW)
• Meetings (ICSI/ISL/NIST/AMI)
• CTS (Swbd/Fisher)
• …
• Dictionary
• Edinburgh UNISYN

Dictionary
• Baseline dictionary based upon UNISYN (Fitt, 2000) with 114,876 words
• Semi-automatic generation of pronunciations
• Part-word pronunciations initially automatically guessed from the existing pronunciations
• Automatic CART-based letter-to-sound conversion trained from UNISYN
• Hand correction/checking of all automatic hypotheses
• Words were all converted to British spellings
• An additional 11,595 words were added using a combination of automatic and manual generation
• Pronunciation probabilities (estimated from alignment of the training data)

Vocabulary
[Table: vocabulary source vs. test data]
• Out-of-vocabulary (OOV) rates with padding to 50k words from general Broadcast News data.
• No need for specific vocabulary!
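As a rough illustration of how an OOV rate with vocabulary padding can be measured, the sketch below builds a word list from in-domain text, pads it with the most frequent words from a background corpus, and counts test tokens that fall outside it. The function names and the toy data are hypothetical, and the padding rule (most frequent background words first) is an assumption; the slides do not describe the exact procedure.

```python
from collections import Counter


def build_padded_vocab(in_domain_tokens, padding_tokens, size):
    """Start from all in-domain words, then pad with the most frequent
    background words until the vocabulary reaches `size` entries.
    If the in-domain words alone exceed `size`, nothing is added.
    (Assumed padding rule, not taken from the slides.)"""
    vocab = set(in_domain_tokens)
    for word, _count in Counter(padding_tokens).most_common():
        if len(vocab) >= size:
            break
        vocab.add(word)
    return vocab


def oov_rate(vocab, test_tokens):
    """Percentage of running test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for word in test_tokens if word not in vocab)
    return 100.0 * misses / len(test_tokens)


if __name__ == "__main__":
    meeting = "the meeting starts now".split()
    news = "the news the weather the news today".split()
    vocab = build_padded_vocab(meeting, news, size=6)
    print(oov_rate(vocab, "the weather today is fine now".split()))
```

In practice the same measurement would be repeated for each vocabulary source and each test set to fill a table like the one on this slide.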

Language modelling
[Table: language model source vs. test data]
• Interpolated trigram language models on meeting data, optimised for each domain (on independent dev data)
• Perplexity results
• Meeting-resource-specific models outperform general models
• Translates into 0.5% absolute word error rate (WER) improvement
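Interpolation weights for a mixture of language models are commonly optimised on held-out data with the EM algorithm. The sketch below assumes each component model has already been run over the dev set, producing one probability per dev token; the function names and the input layout are illustrative assumptions, not the AMI tooling.

```python
import math


def em_interpolation_weights(streams, iterations=50):
    """Estimate linear interpolation weights by EM.

    streams[k][t] is the probability that component model k assigns to
    dev token t given its history (all values must be > 0)."""
    num_models = len(streams)
    num_tokens = len(streams[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        counts = [0.0] * num_models
        for t in range(num_tokens):
            mix = sum(w * s[t] for w, s in zip(weights, streams))
            for k in range(num_models):
                # Posterior probability that model k generated token t.
                counts[k] += weights[k] * streams[k][t] / mix
        weights = [c / num_tokens for c in counts]
    return weights


def perplexity(streams, weights):
    """Perplexity of the interpolated model on the same token stream."""
    num_tokens = len(streams[0])
    log_sum = 0.0
    for t in range(num_tokens):
        log_sum += math.log(sum(w * s[t] for w, s in zip(weights, streams)))
    return math.exp(-log_sum / num_tokens)


if __name__ == "__main__":
    # Toy per-token probabilities from two hypothetical component models.
    dev = [[0.2, 0.1, 0.3, 0.25], [0.01, 0.02, 0.01, 0.05]]
    w = em_interpolation_weights(dev)
    print(w, perplexity(dev, w))
```

Running the estimation separately on dev data from each meeting domain gives one weight set per domain, which is one way to read "optimised for each domain".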

Acoustic modelling
• Standard HMM based framework
• Decision tree state clustered triphones
• Hidden Markov model toolkit (HTK)
• Maximum likelihood training
• Approx. 70k Gaussians per model set
• MAP adaptation from CTS models
• Bandwidth problem: CTS is narrowband data (4kHz), while meetings are recorded at 8kHz bandwidth
• Developed MLLR/MAP
• Front-end feature transform
• SHLDA = Smoothed Heteroscedastic Linear Discriminant Analysis
• Typically 1.5% WER improvement
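For readers unfamiliar with MAP adaptation, the textbook update for a single Gaussian mean interpolates between the prior mean (here, from the CTS models) and the occupancy-weighted average of the adaptation data. The sketch below shows only that mean update; the prior weight `tau` and its default value are assumptions, and the slides do not state which parameters were adapted or how the bandwidth mismatch was handled.

```python
def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP estimate of a Gaussian mean.

    prior_mean  : mean vector of the prior (e.g. CTS-trained) Gaussian
    frames      : adaptation feature vectors
    occupancies : posterior occupancy of this Gaussian for each frame
    tau         : prior weight; larger values trust the prior more
                  (hypothetical default, not from the slides)
    """
    dim = len(prior_mean)
    total_occ = sum(occupancies)
    weighted_sum = [
        sum(g * x[d] for g, x in zip(occupancies, frames)) for d in range(dim)
    ]
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + total_occ)
        for d in range(dim)
    ]


if __name__ == "__main__":
    print(map_adapt_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], [1.0, 1.0], tau=2.0))
```

With little adaptation data the estimate stays close to the prior mean, and with more data it moves towards the data mean.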

Speaker/channel adaptation
• CMN/CVN (channel)
• Vocal tract length normalisation (VTLN)
• maximum likelihood
• training & test
• Typically 3-4% WER gain
• MLLR
• Mean and variance
• Transforms for speech and silence
• Typically 1-2% improvement
[Figure: histograms of warp factors, female/male]
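Cepstral mean and variance normalisation (CMN/CVN) can be sketched as a per-dimension standardisation over the frames of one channel or segment. The function name and the choice to normalise over the whole input at once are assumptions made for illustration; the slides do not say over which unit the statistics were computed.

```python
import math


def cmn_cvn(frames, eps=1e-8):
    """Normalise each feature dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors from one channel/segment.
    eps guards against division by zero for constant dimensions."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]


if __name__ == "__main__":
    print(cmn_cvn([[1.0, 10.0], [3.0, 30.0]]))
```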

Front-ends
• Meeting recordings with a variety of source types
• Microphone locations
• Close talking: head-mounted/lapel
• Distant: “arbitrary location”, various array configurations
• Requires: speech activity detection, speaker “grouping”, speaker and location tracking.
• Objective: Achieve “close-talking” performance with distant microphones
• Enhancement type approach for simplicity

IHM front-end processing
• Signal enhancement (cross-talk suppression): LMS echo cancellation
• Speech activity detection (SAD): using a Multi-Layer Perceptron (MLP)
[Diagram blocks: cross-talk suppression, feature extraction, MLP classification, Viterbi decoder]
[Diagram labels: x (IHM channel), Yk (remaining IHM channels), x’ (enhanced signal), 36 dim feature vector, smoothing parameters (insertion penalty, minimum duration)]

IHM cross-talk suppression
• Multiple-reference LMS adaptive filtering with a 256-tap FIR filter
• Adaptation is frozen during periods of speech activity
• Automatic correction for channel timing misalignment
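The idea of multiple-reference adaptive cross-talk cancellation can be sketched as follows: one FIR filter per reference channel predicts the cross-talk in the primary channel, the prediction is subtracted, and the filters are updated only when the primary speaker is not talking. This sketch uses the normalised LMS (NLMS) update for numerical stability, which is a substitution; the step size, the function name, and the per-sample activity flags are assumptions, and timing misalignment correction is not modelled. Pure Python is far too slow for real audio at 256 taps, so treat this as an illustration only.

```python
def lms_crosstalk_suppress(primary, references, speech_active,
                           taps=256, mu=0.1, eps=1e-8):
    """Subtract an adaptive FIR estimate of cross-talk from `primary`.

    primary       : samples of the channel to be enhanced
    references    : list of sample lists, one per remaining channel
    speech_active : per-sample booleans; adaptation is frozen where True
    """
    weights = [[0.0] * taps for _ in references]
    output = [0.0] * len(primary)
    for t in range(len(primary)):
        # Most recent `taps` samples of each reference, zero-padded at the start.
        windows = [
            [ref[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
            for ref in references
        ]
        estimate = sum(
            w * x for ws, xs in zip(weights, windows) for w, x in zip(ws, xs)
        )
        error = primary[t] - estimate
        output[t] = error
        if not speech_active[t]:
            # NLMS step: normalise by the joint power of all reference windows.
            power = sum(x * x for xs in windows for x in xs) + eps
            step = mu * error / power
            for ws, xs in zip(weights, windows):
                for k in range(taps):
                    ws[k] += step * xs[k]
    return output


if __name__ == "__main__":
    ref = [((i * 37) % 11 - 5) / 5.0 for i in range(500)]
    primary = [0.0] + [0.5 * x for x in ref[:-1]]  # pure cross-talk, delayed
    cleaned = lms_crosstalk_suppress(primary, [ref], [False] * 500, taps=4, mu=0.5)
    print(sum(x * x for x in cleaned[-100:]))
```

Freezing the update during speech keeps the filters from adapting to (and cancelling) the wanted speaker.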

Multiple distant microphones
• Gain calibration
• Simple gain calibration is performed in which the maximum amplitude of each audio channel is normalised.
• Noise removal
• Noise spectrum of each input channel is estimated
• A Wiener filter is applied to each channel to remove stationary noise.
• Delay estimation
• Computed per frame
• Scale factors: ratio of energy
• Delay: peak finding in cross-correlation
• Beamformer
• Beamformer filters are computed with a superdirective technique, using the noise correlation matrix estimated above
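The delay estimation step (peak finding in the cross-correlation, plus an energy-ratio scale factor) can be illustrated for a single frame and a single channel pair. The function names, the time-domain brute-force search, and the direction of the energy ratio (channel over reference) are assumptions; the slides do not specify these details.

```python
def estimate_delay(reference, channel, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation,
    i.e. the lag for which channel[t] best matches reference[t - lag]."""
    best_lag, best_score = 0, float("-inf")
    n = len(reference)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(n):
            s = t - lag
            if 0 <= s < n:
                score += channel[t] * reference[s]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def energy_ratio(reference, channel, eps=1e-12):
    """Frame energy of `channel` relative to `reference`
    (assumed direction of the ratio)."""
    ref_energy = sum(x * x for x in reference)
    ch_energy = sum(x * x for x in channel)
    return ch_energy / (ref_energy + eps)


if __name__ == "__main__":
    ref = [((i * 37) % 11 - 5) / 5.0 for i in range(64)]
    ch = [0.0, 0.0] + [0.5 * x for x in ref[:-2]]  # delayed, attenuated copy
    print(estimate_delay(ref, ch, max_lag=5), energy_ratio(ref, ch))
```

Applied frame by frame to every microphone against a chosen reference microphone, these estimates give the per-channel delays and scale factors that a beamformer needs.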

Towards a model set
• Model initialisation (WER on ICSI only)
• More training data

System architecture
• Front-end (IHM/MDM): modified audio, segments, speaker info
• First pass recognition: first recognition result
• Adaptation
• Lattice generation: word lattices
• LM rescoring: final word level result

Results on rt05seval
• CTS-adapted ML models, unadapted, trigram LM (first pass)
MDM segmentation provided by ICSI/SRI. REF denotes reference segmentation and speaker labels.
• The performance of the full system on the above AMI subset is 30.9% for IHM and 35.1% for MDM.
• BUT: difference on REF remains
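Word error rate is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal version on whitespace-separated strings; official evaluations use dedicated scoring tools with text normalisation and segment alignment, none of which is modelled here, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return 100.0 * previous[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the meeting starts at noon",
                          "the meeting start at at noon"))
```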

Conclusions
• Multi-site development!
• Competitive ASR system in 10-11 months
• Meeting domains inhomogeneous?
• Good improvements with VTLN/SHLDA/MLLR
• Pre-processing needs to be sorted!
• Reasonable performance on Seminar data
