
The Development of the AMI System for the Transcription of Speech in Meetings





  1. The Development of the AMI System for the Transcription of Speech in Meetings Thomas Hain, Lukas Burget, John Dines, Iain McCowan, Giulia Garau, Martin Karafiat, Mike Lincoln, Darren Moore, Vincent Wan, Roeland Ordelman, Steve Renals July 12, 2005 MLMI Edinburgh

  2. Outline
  • Multi-site development
  • Development strategy
  • Resources
  • Modelling
  • System integration
  • Results
  • Conclusions

  3. AMI ASR around the globe

  4. Multi-site development
  • Large vocabulary ASR is complex and requires considerable resources
  • Development effort split across multiple sites: DICT, LM, CORE, ADAPT, AUDIO-PREPROC
  • Central storage and compute resources
  • Communication: frequent telephone conferences, internet chat, “working phone calls” (VoIP), multiple workshops, WIKI

  5. Development paradigm
  • Resource building: dictionary, LM, acoustic data
  • Resource driven: bootstrap from conversational telephone speech (CTS)
  • Generic technology selection: pick generic techniques with maximum gain (VTLN, HLDA, MPE, CN)
  • Task-specific components: front-ends, language models

  6. Resources
  • Meeting resources “sparse”
  • Corpora: ICSI, ISL, NIST (LDC, VT); the AMI corpus (initial parts); 100 hours of meeting data
  • Language model data:
    • Broadcast News (220MW)
    • Web-data - CTS/AMI/Meeting (600MW)
    • Meetings (ICSI/ISL/NIST/AMI)
    • CTS (Swbd/Fisher)
    • …
  • Dictionary: Edinburgh UNISYN

  7. Dictionary
  • Baseline dictionary based upon UNISYN (Fitt, 2000) with 114,876 words
  • Semi-automatic generation of pronunciations:
    • Part-word pronunciations initially guessed automatically from the existing pronunciations
    • Automatic CART-based letter-to-sound conversion trained from UNISYN
    • Hand correction/checking of all automatic hypotheses
  • All words converted to British spellings
  • An additional 11,595 words added using a combination of automatic and manual generation
  • Pronunciation probabilities estimated from alignment of the training data (see the sketch below)
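
  A minimal sketch of estimating pronunciation probabilities as relative frequencies of the variants chosen in a forced alignment of the training data, with a small floor. The function name, input format and floor value are illustrative assumptions, not the actual AMI tooling.

```python
# Hypothetical sketch: pronunciation probabilities from forced-alignment
# counts. `aligned_prons` is assumed to hold (word, pronunciation_variant)
# pairs read from an alignment of the acoustic training data.
from collections import Counter, defaultdict

def pron_probs(aligned_prons, floor=1e-3):
    counts = defaultdict(Counter)
    for word, variant in aligned_prons:
        counts[word][variant] += 1
    probs = {}
    for word, variants in counts.items():
        total = sum(variants.values())
        # relative frequency per variant, floored so no variant gets zero
        probs[word] = {v: max(c / total, floor) for v, c in variants.items()}
    return probs
```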

  8. Vocabulary
  [Table: out-of-vocabulary rates by vocabulary source and test data]
  • Out-of-vocabulary (OOV) rates with padding to 50k words from general Broadcast News data (OOV computation sketched below)
  • No need for a task-specific vocabulary!
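
  For reference, a minimal sketch of how an out-of-vocabulary rate is computed: the fraction of running words in the test transcripts not covered by the recognition vocabulary. The names and lower-casing convention are assumptions.

```python
# Illustrative OOV-rate computation over running words in a test set.
def oov_rate(test_words, vocabulary):
    vocab = {w.lower() for w in vocabulary}
    misses = sum(1 for w in test_words if w.lower() not in vocab)
    return misses / len(test_words)   # fraction of OOV tokens
```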

  9. Language modelling
  [Table: perplexity by language model source and test data]
  • Interpolated trigram language models, optimised for each meeting domain on independent dev data (weight tuning sketched below)
  • Perplexity results: meeting-resource-specific models outperform general models
  • Translates into a 0.5% absolute word error rate improvement
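
  A sketch of the standard EM procedure for tuning linear interpolation weights of several component language models to minimise perplexity on held-out dev data. `streams[i][t]` is assumed to hold the probability component LM i assigns to the t-th dev word; those probabilities would come from an LM toolkit, not from this snippet.

```python
import math

# EM re-estimation of interpolation weights on held-out dev data.
def interpolate_weights(streams, iters=50):
    n = len(streams)                       # number of component LMs
    num_words = len(streams[0])
    w = [1.0 / n] * n                      # start from uniform weights
    for _ in range(iters):
        acc = [0.0] * n
        for t in range(num_words):
            mix = sum(w[i] * streams[i][t] for i in range(n))
            for i in range(n):
                acc[i] += w[i] * streams[i][t] / mix   # posterior of LM i
        w = [a / num_words for a in acc]   # updated weights
    log_prob = sum(math.log(sum(w[i] * streams[i][t] for i in range(n)))
                   for t in range(num_words))
    perplexity = math.exp(-log_prob / num_words)
    return w, perplexity
```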

  10. Acoustic modelling
  • Standard HMM-based framework: decision-tree state-clustered triphones, Hidden Markov Model Toolkit (HTK)
  • Maximum likelihood training; approx. 70k Gaussians per model set
  • MAP adaptation from CTS models (mean update sketched below)
  • Bandwidth problem: CTS is narrowband data (4kHz), meetings are recorded at 8kHz bandwidth
  • Developed MLLR/MAP
  • Front-end feature transform: SHLDA (Smoothed Heteroscedastic Linear Discriminant Analysis); typically 1.5% WER improvement
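
  A rough sketch of the MAP update for Gaussian mean vectors, the kind of update used when adapting CTS-trained models to meeting data. The prior weight tau and the variable names are illustrative, and the full system adapts more than the means.

```python
import numpy as np

# MAP adaptation of a single Gaussian mean:
#   mu_map = (tau * mu_prior + sum_t gamma_t x_t) / (tau + sum_t gamma_t)
def map_adapt_mean(prior_mean, gamma, obs_sum, tau=10.0):
    """prior_mean: (d,) mean of the CTS-trained Gaussian.
    gamma: total occupation count on the adaptation data.
    obs_sum: (d,) posterior-weighted sum of adaptation feature vectors."""
    return (tau * np.asarray(prior_mean) + np.asarray(obs_sum)) / (tau + gamma)
```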

  11. Speaker/channel adaptation
  • CMN/CVN (channel); sketched below
  • Vocal tract length normalisation (VTLN): maximum likelihood, applied in training & test; typically 3-4% WER gain
  • MLLR: mean and variance transforms for speech and silence; typically 1-2% improvement
  [Figure: histograms of VTLN warp factors for female/male speakers]
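
  A minimal sketch of cepstral mean and variance normalisation over one channel (or speaker segment), as in the first bullet; the epsilon guard is an assumption.

```python
import numpy as np

# CMN/CVN: shift each cepstral dimension to zero mean and scale it to
# unit variance over one channel/segment.
def cmn_cvn(features):
    """features: (num_frames, num_coeffs) array for one channel."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8   # guard against zero variance
    return (features - mean) / std
```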

  12. Front-ends
  • Meeting recordings come from a variety of source types and microphone locations
    • Close-talking: head-mounted/lapel
    • Distant: “arbitrary location”, various array configurations
  • Requires: speech activity detection, speaker “grouping”, speaker and location tracking
  • Objective: achieve “close-talking” performance with distant microphones
  • Enhancement-type approach for simplicity

  13. IHM front-end processing
  • Signal enhancement (cross-talk suppression): LMS echo cancellation on the IHM channel x, using the remaining IHM channels Yk as references, producing the enhanced signal x’
  • Feature extraction: x’ converted to a 36-dimensional feature vector
  • Speech activity detection (SAD): MLP classification followed by a Viterbi decoder with smoothing parameters (insertion penalty, minimum duration); see the smoothing sketch below
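
  An illustrative sketch of the smoothing step in the SAD block: frame-level speech/non-speech log scores (e.g. log MLP posteriors) are decoded with a two-state Viterbi pass. The real system uses an insertion penalty and a minimum-duration constraint; this sketch approximates both with a single state-switch penalty, and the values are made up.

```python
import numpy as np

# Two-state Viterbi smoothing of frame-level speech/non-speech scores.
def viterbi_smooth(speech_logprob, nonspeech_logprob, switch_penalty=5.0):
    T = len(speech_logprob)
    emit = np.vstack([nonspeech_logprob, speech_logprob])   # emit[state, frame]
    scores = np.zeros((2, T))
    back = np.zeros((2, T), dtype=int)
    scores[:, 0] = emit[:, 0]
    for t in range(1, T):
        for s in (0, 1):                                    # 0 = non-speech, 1 = speech
            stay = scores[s, t - 1]
            switch = scores[1 - s, t - 1] - switch_penalty  # penalise state changes
            back[s, t] = s if stay >= switch else 1 - s
            scores[s, t] = max(stay, switch) + emit[s, t]
    labels = np.zeros(T, dtype=int)                         # backtrace
    labels[T - 1] = int(scores[1, T - 1] > scores[0, T - 1])
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[labels[t], t]
    return labels                                           # 1 = speech per frame
```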

  14. IHM cross-talk suppression
  • Multiple-reference LMS adaptive filtering with a 256-tap FIR filter (single-reference sketch below)
  • Adaptation is frozen during periods of speech activity
  • Automatic correction for channel timing misalignment
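
  A hedged, single-reference sketch of normalised LMS cross-talk suppression: another IHM channel is filtered and subtracted from the target channel, and the filter is only updated while the target speaker is inactive, mirroring the frozen-adaptation point above. The real front-end uses multiple references; the step size and the single-reference simplification are assumptions.

```python
import numpy as np

# NLMS cross-talk suppression with a 256-tap FIR filter, adaptation
# frozen while the target channel contains speech.
def lms_crosstalk_suppress(target, reference, speech_active, taps=256, mu=0.1):
    """target, reference: 1-D numpy arrays of equal length;
    speech_active: per-sample boolean array marking target-speaker speech."""
    w = np.zeros(taps)
    out = np.array(target, dtype=float)       # first `taps` samples pass through
    for n in range(taps, len(target)):
        x = reference[n - taps:n][::-1]       # most recent reference samples
        e = target[n] - w @ x                 # subtract estimated cross-talk
        out[n] = e
        if not speech_active[n]:              # only adapt when target speaker is silent
            w += mu * e * x / (x @ x + 1e-8)  # normalised LMS update
    return out
```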

  15. Multiple distant microphones
  • Gain calibration: the maximum amplitude of each audio channel is normalised
  • Noise removal: the noise spectrum of each input channel is estimated and a Wiener filter is applied to remove stationary noise
  • Delay estimation: computed per frame; scale factors from the ratio of channel energies; delay from peak finding in the cross-correlation (sketched below)
  • Beamforming: superdirective beamformer filters using the noise correlation matrix estimated above
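
  A sketch of per-frame delay estimation by peak picking in the cross-correlation between a microphone channel and a reference channel, computed via the FFT; the frame length, search range, and choice of reference channel are assumptions.

```python
import numpy as np

# Delay (in samples) of mic_frame relative to ref_frame, taken from the
# peak of their cross-correlation.
def estimate_delay(ref_frame, mic_frame, max_lag=200):
    n = len(ref_frame) + len(mic_frame) - 1
    nfft = 1 << (n - 1).bit_length()          # zero-pad to a power of two
    xcorr = np.fft.irfft(np.fft.rfft(mic_frame, nfft) *
                         np.conj(np.fft.rfft(ref_frame, nfft)), nfft)
    # reorder so that index i corresponds to lag i - max_lag
    xcorr = np.concatenate([xcorr[-max_lag:], xcorr[:max_lag + 1]])
    return int(np.argmax(xcorr)) - max_lag
```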

  16. Towards a model set
  • Model initialisation (WER on ICSI only)
  • More training data

  17. System architecture
  • Front-end (IHM/MDM) → modified audio, segments, speaker info
  • First-pass recognition → first recognition result
  • Adaptation
  • Lattice generation → word lattices
  • LM rescoring → final word-level result
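
  A high-level Python sketch of the decoding flow on this slide, written as a generic pipeline of stage callables; the actual front-end, decoder, adaptation and rescoring components are separate tools and are not shown here.

```python
# Hypothetical wiring of the multi-pass architecture: each stage is
# passed in as a callable standing in for the real component.
def run_system(audio, front_end, first_pass, adapt, gen_lattices, rescore):
    segments = front_end(audio)            # IHM/MDM processing: audio, segments, speaker info
    hyp1 = first_pass(segments)            # first recognition result
    transforms = adapt(segments, hyp1)     # speaker/channel adaptation
    lattices = gen_lattices(segments, transforms)
    return rescore(lattices)               # final word-level result after LM rescoring
```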

  18. Results on rt05seval
  • CTS-adapted ML models, unadapted, trigram LM (first pass)
  • MDM segmentation provided by ICSI/SRI; REF denotes reference segmentation and speaker labels
  • The full system achieves 30.9% WER for IHM and 35.1% for MDM on the above AMI subset
  • BUT: the difference on REF remains

  19. Conclusions
  • Multi-site development!
  • Competitive ASR system in 10-11 months
  • Meeting domains inhomogeneous?
  • Good improvements with VTLN/SHLDA/MLLR
  • Pre-processing needs to be sorted!
  • Reasonable performance on Seminar data
