Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System

Andreas Stolcke, Xavier Anguera, Kofi Boakye, Özgür Çetin, František Grézl, Adam Janin, Arindam Mandal, Barbara Peskin, Chuck Wooters, Jing Zheng

International Computer Science Institute, Berkeley, CA, USA
SRI International, Menlo Park, CA, USA
Technical University of Catalonia, Barcelona, Spain
Brno University of Technology, Czech Republic
University of Washington, Seattle, WA, USA

NIST Meeting Recognition Workshop
Overview
• Data and tasks
• Audio preprocessing
  – New MDM delay-sum processing
  – New IHM cross-talk suppression
• SRI decoding architecture
• Acoustic modeling
  – Baseline improvement
  – Model adaptation + MLP feature adaptation
  – CTS/BN model combination
• Language modeling
• Lecture recognition system
• Overall results
• Conclusions and future work
Data Sets
• eval05: NIST RT-05S conference meetings
  – One previously unseen meeting source (VT)
• eval04: NIST RT-04S conference meetings
  – Unbiased test set
  – Includes the lapel channel in the ICSI meeting
• dev04a: RT-04S devtest meetings + 2 AMI excerpts
  – Used for development and tuning, in spite of lapel mics in the CMU and LDC meetings
• Meeting training data:
  – AMI (35 meetings, 16 hours)
  – CMU (17 meetings, 11 hours): lapel personal mics, no distant mics
  – ICSI (73 meetings, 74 hours)
  – NIST (15 meetings, 14 hours)
• Acoustic background training data:
  – CTS (Switchboard + Fisher, 2300 hours)
  – BN (Hub-4 + TDT2 + TDT4, 900 hours)
Evaluation Tasks
Conference room meetings:
• MDM: multiple distant microphones (primary)
• IHM: individual headset microphones (required contrast)
• SDM: single distant microphone (optional)
Lecture room meetings (in addition to the above):
• MSLA: multiple source-localization arrays (optional)
• MM3A: multiple Mark III microphone arrays (optional); available only for CHIL devtest data (single array)
Results reported are for conference room meetings unless noted otherwise (lecture results at the end).
Development Strategy: Base System
[Diagram: the SRI CTS system trains acoustic models/features and language models on telephone training data, telephone training transcripts, and web data; decoding then applies these models to telephone test data, producing hypotheses such as "I I think that uh…".]
Development Strategy: Meeting System
[Diagram: the CTS/BN acoustic models and CTS language models of the SRI CTS system are adapted with meeting training data, meeting web data, and (for the lecture system) proceedings data, yielding meeting-adapted acoustic models/features and language models. Meeting test data is preprocessed (noise filtering, segmentation, and array processing for MDM only) before decoding, producing hypotheses such as "First on the agenda…".]
Acoustic Preprocessing

Recognition
• Distant microphones:
  – Noise reduction using Wiener filtering on all input channels (see the sketch after this slide)
  – Delay-sum beamforming of all channels into a single enhanced channel (MDM)
  – Waveform segmentation (speech/nonspeech HMM decoding)
  – Segment clustering (for cepstral normalization and unsupervised adaptation)
• Close-talking (personal) microphones:
  – No noise reduction (tried it, no gains)
  – Waveform segmentation
  – Crosstalk detection and filtering

Training
• Distant microphones:
  – Eliminate overlapping speech (based on personal-mic word alignment times)
  – Noise filtering
  – No delay-sum processing
  – Models trained on all distant channels
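As a rough illustration of the noise-reduction step, here is a minimal single-channel Wiener-gain sketch. It is a simplification under stated assumptions: the slides say only "Wiener filtering", so the frame sizes, noise-PSD estimate, and overlap-add details below are illustrative choices, not the eval system's.

```python
# Minimal Wiener-gain noise reduction sketch (illustrative only; frame
# sizes and the noise estimate are assumptions, not the eval system's).
import numpy as np

def wiener_denoise(x, frame=512, hop=256, noise_frames=10):
    """Attenuate stationary noise with a per-bin Wiener gain SNR/(SNR+1)."""
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i*hop:i*hop+frame] * win for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    # Noise PSD from the first few frames, assumed to be speech-free
    noise_psd = np.mean(np.abs(spec[:noise_frames])**2, axis=0)
    snr = np.maximum(np.abs(spec)**2 / (noise_psd + 1e-12) - 1.0, 0.0)
    spec *= snr / (snr + 1.0)                 # Wiener gain per frequency bin
    out = np.fft.irfft(spec, n=frame, axis=1)
    # Weighted overlap-add resynthesis
    y = np.zeros(len(x)); wsum = np.zeros(len(x))
    for i in range(n):
        y[i*hop:i*hop+frame] += out[i] * win
        wsum[i*hop:i*hop+frame] += win**2
    return y / np.maximum(wsum, 1e-8)
```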
Improved MDM Delay-Summing
• RT-04S: delay-summing after segmentation
  – One time delay of arrival (TDOA) per segment
• RT-05S: delay-summing before segmentation (see the sketch after this slide)
  – Smoothed TDOA estimation every 250 ms
  – SDM channel used as the reference for all other channels
  – GCC-PHAT, window size 500 ms
  – Details in the diarization system talk and paper
• Results on eval04 (multi-channel meetings only), RT-04S models:
  – 6.5% relative gain between the old and new MDM processing
  – Mostly from the improved delay-sum, not from segmentation
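A minimal sketch of GCC-PHAT delay estimation and delay-and-sum combination, under stated assumptions: integer-sample alignment, a fixed lag search window, and none of the 250 ms smoothing described above.

```python
# GCC-PHAT TDOA + delay-and-sum sketch (assumptions: integer-sample
# alignment, no TDOA smoothing; the eval system smooths estimates every
# 250 ms and uses the SDM channel as the reference).
import numpy as np

def gcc_phat_delay(ref, sig, fs, max_delay_s=0.02):
    """Estimate the delay (in seconds) of `sig` relative to `ref`."""
    n = len(ref) + len(sig)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    m = int(max_delay_s * fs)
    cc = np.concatenate((cc[-m:], cc[:m + 1]))  # lags -m .. +m
    return (np.argmax(np.abs(cc)) - m) / fs

def delay_and_sum(channels, fs, ref_idx=0):
    """Align each channel to the reference and average into one channel."""
    ref = channels[ref_idx]
    out = np.zeros(len(ref))
    for ch in channels:
        d = int(round(gcc_phat_delay(ref, ch, fs) * fs))
        out += np.roll(ch, -d)                # crude circular-shift alignment
    return out / len(channels)
```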
IHM Crosstalk Filtering
• Energy-based detection of foreground vs. background speech (see the sketch after this slide):
  – Subtract the minimum energy (noise floor) from each channel
  – For each channel, subtract the average energy of all other channels
  – Threshold at zero
  – Intersect foreground segments with the speech/nonspeech detector output
• Results using RT-05S eval models:
  – The crosstalk filter reduces the error difference to the reference segmentation by 1/3, mostly by reducing insertion errors
  – The eval05 ICSI result got worse with filtering (20.5 -> 24.5 WER)
  – The eval05 NIST meeting had a speakerphone and 3 empty channels: 34.5 WER vs. 21.4 with reference segmentation
  – Using the Sheffield segmentation on the NIST meeting: 28.3 WER
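A minimal sketch of the energy-based crosstalk detector described above; the frame size and the absence of smoothing are simplifying assumptions, and the intersection with the speech/nonspeech detector output is omitted.

```python
# Energy-based crosstalk detection sketch (frame size and lack of
# smoothing are assumptions; the speech/nonspeech intersection is omitted).
import numpy as np

def foreground_mask(channels, frame=400):
    """Per-frame foreground mask for each IHM channel.
    channels: (n_channels, n_samples) array, one headset mic per row."""
    n_ch, n_samp = channels.shape
    n_fr = n_samp // frame
    e = np.log(np.sum(channels[:, :n_fr*frame]
                      .reshape(n_ch, n_fr, frame)**2, axis=2) + 1e-12)
    e -= e.min(axis=1, keepdims=True)         # subtract per-channel noise floor
    mask = np.empty(e.shape, dtype=bool)
    for c in range(n_ch):
        others = np.delete(e, c, axis=0).mean(axis=0)
        mask[c] = (e[c] - others) > 0.0       # threshold at zero
    return mask
```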
"Fast" Decoding Architecture
[Diagram: an MFC non-crossword decoding pass generates thin lattices, which a PLP crossword pass rescores; confusion-network combination of the two yields the 3xRT output.]
• Lattice rescoring: 4-gram LM
• N-best rescoring: 4-gram LM, word duration model, pause duration model
• Runtime: 3xRT on a 3.4 GHz Xeon CPU
Full System Architecture
[Diagram: the full system interleaves MFC non-crossword, MFC crossword, and PLP crossword decoding passes with cross-adaptation, generating thin and thick lattices; confusion-network combination produces the 3xRT and 20xRT outputs.]
• Runtime: 12xRT for CTS (with Gaussian shortlists); 25xRT on meetings (no Gaussian shortlists)
Acoustic Features and Models
• MFCC within- and crossword triphone models
  – Augmented with 2 x 5 voicing features (5 frames around the current frame)
  – Augmented with 25-dimensional Tandem/HATS phone-posterior features estimated by a multilayer perceptron (MLP features)
• PLP crossword triphone models
• Normalization and adaptation (a CMN/CVN sketch follows this slide):
  – CMN + CVN, VTLN
  – HLDA
  – CMLLR (SAT) in training and test (except in the first decoding)
  – MLLR with a phone loop in the first MFCC and PLP decodings
  – MLLR cross-adaptation in subsequent steps
• Baseline RT-04F models trained on 1400 hours of CTS data
  – All Hub-5 (Switchboard, CallHome English)
  – Every other Fisher utterance (complementary sets for MFCC and PLP models)
  – Gender-dependent
• Also: PLP models trained on 900 hours of TDT BN data
  – Gender-independent
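For concreteness, a minimal per-cluster CMN + CVN sketch (the simplest link in the normalization chain; VTLN, HLDA, and CMLLR are not shown):

```python
# Per-cluster cepstral mean and variance normalization (CMN + CVN) sketch.
import numpy as np

def cmn_cvn(feats):
    """Normalize an (n_frames, n_dims) feature matrix, computed over one
    segment cluster, to zero mean and unit variance per dimension."""
    mu = feats.mean(axis=0)
    sd = feats.std(axis=0) + 1e-12
    return (feats - mu) / sd
```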
Baseline Model Improvements
• Compared the RT-04S baseline CTS models to the RT-04F CTS models
• Model improvements:
  – Added Fisher training data
  – MLP features
  – MPE training instead of MMIE
  – Decision-tree-based triphone clustering instead of bottom-up clustering
• Improvement on Fisher CTS data: 28% relative WER reduction
• Results using the RT-05S eval language model: [table not recovered from the slide]
Acoustic Adaptation
• Model adaptation using MAP, Gaussian means only (see the update formula after this slide)
• MLP feature adaptation:
  – 3 additional backprop iterations on the CTS MLP, using meeting data
  – Due to lack of time, adapted using only ICSI + CMU + NIST close-talking data
  – The new MLP is used for both IHM and MDM feature generation
• Model and MLP adaptation give similar gains, and the gains are partly additive
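The means-only MAP update is standard (sketched here from the common formulation, not quoted from the paper): each Gaussian mean is interpolated between its CTS prior and the meeting-data statistics,

$$\hat{\mu}_m = \frac{\tau\,\mu_m^{\mathrm{CTS}} + \sum_t \gamma_m(t)\,x_t}{\tau + \sum_t \gamma_m(t)}$$

where $\gamma_m(t)$ is the occupancy of Gaussian $m$ at frame $t$ on the adaptation data and $\tau$ is the relevance factor controlling how quickly the adaptation data overrides the prior.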
MLP Feature Portability
• How well do MLP features trained on CTS generalize?
• MLP features improve the system by 2% absolute (10% relative) on RT-04F CTS eval data
• Comparing effectiveness using meeting-adapted CTS models and MLP features: the improvement on eval05 is comparable to CTS
Combining CTS and BN Base Models
• CTS models match the meeting domain in speaking style
• BN models are better matched to the acoustic properties:
  – Signal bandwidth
  – (Some) noisy environments
  – Distant microphones
• Use meeting-adapted BN models in the PLP decoding
  – Substantial gains on distant microphones
  – Inconsistent gains on close-talking mics (not used in the eval system)
• Bug: the BN models were adapted only to male speakers (costs only 0.1%)
Discriminative MAP Adaptation
• Preserve the discriminative properties of the MPE base models during adaptation
• Best results using the MMIE criterion on training data with I-smoothing toward the MPE base parameters (MMI-MAP; Povey et al., Eurospeech 2003); a schematic update formula follows this slide
• The MDM eval system used ML-MAP models (due to lack of time)
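Schematically (an assumed textbook form, not the paper's exact equations), the extended Baum-Welch mean update with I-smoothing toward the MPE base parameters is

$$\hat{\mu}_m = \frac{\sum_t \left(\gamma_m^{\mathrm{num}}(t) - \gamma_m^{\mathrm{den}}(t)\right) x_t + D_m\,\mu_m + \tau\,\mu_m^{\mathrm{MPE}}}{\sum_t \left(\gamma_m^{\mathrm{num}}(t) - \gamma_m^{\mathrm{den}}(t)\right) + D_m + \tau}$$

where $\gamma^{\mathrm{num}}$ and $\gamma^{\mathrm{den}}$ are numerator- and denominator-lattice occupancies, $D_m$ is the per-Gaussian smoothing constant, and $\tau$ weights the pull toward the MPE base mean; see Povey et al. (2003) for the exact MMI-MAP derivation.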
Test-time MLLR Improvement
• After the evaluation, unsupervised MLLR was improved
• Replaced the hand-designed phonetic regression class tree with a more standard data-driven decision tree (generated in model training)
• Part of Arindam Mandal's UW thesis work on speaker-dependent MLLR class prediction (Eurospeech 2005)
Conference Meeting LMs
• Linearly interpolated mixture N-gram LMs (a weight-estimation sketch follows this slide):
  – Multiword bigram for lattice generation
  – Multiword trigram for lattice decoding
  – Word-based 4-gram for rescoring
  – All LMs entropy-pruned
• Conference meeting LM components:
  (1) Switchboard CTS transcripts (6.5M words)
  (2) Fisher CTS (23M)
  (3) Hub-4 and TDT4 BN transcripts (140M)
  (4) AMI, CMU, ICSI, and NIST meeting transcripts (1M)
  (5) Web data selected to match meeting (268M) and Fisher (530M) transcripts
• Perplexity optimized on held-out 2004 (non-AMI) data
• Vocabulary: 54K words
  – All words in Switchboard and RT-04S meeting transcripts
  – All non-singletons in Fisher and AMI devtest
  – OOV rates: 0.40% on eval04, 0.19% on AMI devtest
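A minimal sketch of how the interpolation weights can be estimated by EM on held-out data (an illustration; in practice a toolkit such as SRILM does this, and the per-token component probabilities here are assumed to be precomputed):

```python
# EM estimation of LM interpolation weights on held-out data (sketch;
# comp_probs is assumed precomputed: comp_probs[k, t] = P_k(w_t | history)).
import numpy as np

def estimate_mixture_weights(comp_probs, iters=50):
    """Return weights lambda maximizing held-out likelihood of the
    interpolated LM sum_k lambda_k * P_k."""
    k, _ = comp_probs.shape
    lam = np.full(k, 1.0 / k)
    for _ in range(iters):
        weighted = lam[:, None] * comp_probs
        post = weighted / weighted.sum(axis=0, keepdims=True)  # E-step
        lam = post.mean(axis=1)                                # M-step
    return lam

# Held-out perplexity of the interpolated LM:
#   ppl = np.exp(-np.mean(np.log(lam @ comp_probs)))
```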
Lecture Meeting LMs
• Similar to the conference meeting LM, but:
  – Added TED oral transcripts (0.1M words)
  – Added speech conference proceedings (28M)
  – Removed Fisher web data
  – Added web data based on speech conference proceedings (120M)
• Vocabulary: added 3781 words from conference proceedings
  – OOV rate on CHIL devtest: 0.18%
• Perplexity optimized on CHIL devtest
  – Jackknifed for development testing
  – Full devtest used for the eval system
• No CHIL transcripts used for N-gram training
LM Results (see the paper for perplexity results)
• Conference WER reduced by 1.2% compared to the 2004 LM, due to:
  – AMI training data
  – New web data
  – Additional Fisher data
• Lecture WER reduced by 4.6% without using CHIL N-gram data
• Web data is still useful for reducing WER:
  – Conference meetings: 1.0-1.1%
  – Lecture meetings: 0.7% (redundant with the proceedings data?)
• No significant gains from source-specific LMs
  – But note: the best results for CMU and LDC were achieved using the CTS LM!
Overall Results: Conference Meetings
• Relative WER reduction on eval04 data:
  – 17.4% for MDM
  – 16.9% for IHM
  – (SDM results not directly comparable due to system differences)
• A narrowing IHM/MDM gap?
  – Only a 17% relative difference on eval05
  – But unchanged on eval04, in spite of the improved MDM preprocessing and modeling
• AMI data is different in character, but not much in system performance
  – Possible issues with array mics
• Performed well on the "blind" test
  – Virginia Tech WER better than average
• NIST IHM results due to unusual meeting properties
[Chart: WER by source (RT-05S); data not recovered from the slide.]
Lecture System Development
• Development on CHIL Jan 2005 data (NIST devtest)
• No CHIL data used for acoustic training or adaptation
  – Reused meeting-adapted models
  – Transcripts used only for LM mixture-weight estimation (jackknifed testing)
• Model score weights not optimized
• Cross-talk filtering never tested (just ran it)
  – Devtest data had only one IHM channel
Findings:
• Best to use a single speaker cluster for the distant mics
  – The lecture speaker dominates, and/or the clustering is not good enough
• Delay-sum processing on tabletop mics is worse than the single mic with the best SNR
  – SNRs vary greatly between tabletop mics
Lecture Recognition Results
• IHM results comparable to conference meetings
• Distant-mic WER is more than 10% absolute worse
  – Lecture room acoustics are more challenging than the conference room
  – Error rates comparable to results reported by the CHIL team
• MDM better than SDM on devtest, but not on eval
  – MDM uses the single mic with the best SNR
• MSLA: little gain on devtest, but works much better on eval
• MM3A somewhat better than MSLA on devtest
• Where is the eval data, John?
Conclusions
• Considerable progress in meeting recognition: 17% relative WER reduction on conference meetings
• Leveraged LVCSR progress in CTS and BN
  – 2-3% absolute gain on MDM by combining CTS- and BN-based models
• Improved preprocessing for MDM (delay-sum) and IHM (cross-talk suppression)
• Effective adaptation of all modeling components:
  – Features (MLP-based)
  – Acoustic models
  – Language models (web data is important)
• The system generalized well to:
  – A new conference meeting type with training data (AMI)
  – A meeting source and domain without training data (VT)
  – The lecture task (adapting only the language model)
Future Work
• Fix the things we didn't have time for:
  – MMI-MAP for distant-mic models
  – Adapting MLPs for distant-mic features
• For IHM: forget recognition (it already works well); solve the speaker segmentation problem!
• Do a better job on lecture recognition:
  – Use TED acoustic data
  – Get MDM to work better than SDM
  – Estimate model weights (LM weight, insertion penalty, …) properly
• Explore feature-mapping techniques (e.g., MLLR) to reduce the mismatch of the background training data
• Adapt models to non-native speakers of American English
  – Germans, Brits, Scots, …
  – Cf. Arlo Faria's poster, MLME-05
• More generally, more adaptation to meeting type and content
Thank You!