The SRI 2006 Spoken Term Detection System
Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang
Speech Technology & Research Laboratory, SRI International, Menlo Park, CA
Outline
• STD system overview
• STT systems
  • BNews system description
  • CTS system description
  • ConfMtg system description
• Indexing
  • N-gram index from word lattices
  • NNet-based posterior estimation
• Retrieval
• Time and memory requirements
• ATWV results
• Future work
SRI STD System
[Block diagram: audio → STT → word lattices → INDEXER → N-gram index with posteriors (the indexing step); search terms and the index feed the RETRIEVER, which outputs terms with times and probabilities]
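To make the data flow concrete, here is a minimal sketch of what one entry in the N-gram index could carry; the field names are illustrative, not SRI's actual on-disk format:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    """One hypothesized N-gram occurrence extracted from the word lattices."""
    ngram: str        # 1- to 5-gram of words, e.g. "spoken term detection"
    waveform: str     # name of the source audio file
    channel: int      # audio channel within the waveform
    start: float      # start time in seconds
    end: float        # end time in seconds
    posterior: float  # lattice posterior probability of this occurrence
```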
English BN STT System
• Single front end: PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-independent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
  • Acoustic training: Hub4, TDT2+TDT4+TDT4a, BNr1234 subset
  • MLE training: 3300 hours; MPE training: 1700 hours
  • 2500 x 200 Gaussians for nonCW triphones
  • 3000 x 160 Gaussians for CW triphones
• Word bigram and 5-gram LMs trained on Hub4, TDT, BNr1234 transcripts, Hub4 LM training data, and NABN (cutoff date Nov. 30, 2003)
  • 62k words, 29M bigrams, 27M trigrams, 15M 4-grams, 2.4M 5-grams
• Duration rescoring (word-specific phone durations)
• Two-pass decoding
  • First decoding stage: unadapted nonCW models with bigram LM
  • CW models adapted to the nonCW output after 5-gram LM and duration-model lattice rescoring
  • Lattice-constrained decoding with MLLR-adapted, SAT, CW models
English BN STT System
[Flow diagram: PLP MPE nonCW models generate 2-gram lattices, rescored into 4-gram lattices (hyps for MLLR); adapted PLP MPE CW models produce 5-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output]
• Runtimes:
  • 2.5xRT for unadapted lattices
  • 5.4xRT for adapted lattices
• ~10% relative WER improvement after adaptation
• Both decoding stages use Gaussian shortlists.
English CTS STT System
• Two front ends:
  • MFCC + voicing + MLP features (52 + 10 + 25 → 39 + 25 dim)
  • PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-dependent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
  • Acoustic training: all Hub5 + Fisher training data
  • 2500 x 128 x 2 Gaussians for nonCW triphones
  • 3000 x 128 x 2 Gaussians for CW triphones
• Prosodic rescoring (word-specific phone durations, pause trigram)
• Word bigram and 4-gram LMs
  • Interpolated + pruned LMs trained on CTS, BN, and Web data
  • 48k words, 16M bigrams, 16M trigrams, 12M 4-grams
• First lattice generation uses phone-loop MLLR, nonCW MFCC models, and a 2-gram LM
• Second, constrained lattice generation uses cross-adapted CW SAT PLP models.
English Meeting STT System [Stolcke et al., MLMI'05; Janin et al., MLMI'06]
• Based on the CTS system architecture (2-pass system)
• Combination of CTS (narrow-band) and BN (wide-band) base models
• Acoustic models adapted to distant-mic meeting recordings using MMI-MAP
• MLP features adapted for meeting recordings by incremental training
• Mixture language model trained on meetings, CTS, and Web data
• System used in the RT-06S meeting evaluation, co-developed with ICSI
English CTS & ConfMtg STT Systems
[Flow diagram: MFCC-MLP MPE nonCW models generate 2-gram lattices, rescored into 3-gram lattices (hyps for MLLR); cross-adapted PLP MPE CW models produce adapted 4-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output]
• CTS runtime:
  • 1.8xRT for unadapted lattices
  • 2.5xRT for adapted lattices
• ConfMtg runtime:
  • 5.4xRT for unadapted lattices
  • 6.8xRT for adapted lattices
• CTS system uses Gaussian shortlists in the first pass only
• ConfMtg system does not use shortlists.
English STT Result Summary (WER)

         eval02   eval03   dev04    eval04s   STD-dev06
BN       10.7%    10.5%    ---      ---       23.2%
CTS      23.7%    24.0%    17.0%    ---       17.4%
ConfMtg  ---      ---      36.9%    37.2%     44.2%

• STD-dev06 WER measured using references constructed from RTTM files
  • Systematic differences compared to standard STT references
  • For example, BN scoring does not exclude commercial segments
• Note: the STT systems were not especially tuned for STD; configurations were inherited from STT evaluations.
Indexing of Word Lattices
• SRILM lattice-tool dumps all word 1-grams to 5-grams in the lattices, along with side information:
  • Posterior probabilities based on normalized recognizer scores
  • Start/end times, channel, waveform name
  • 0.5s time tolerance to merge identical N-grams with different times
  • Pronunciations (to detect OOV words; not used yet)
• N-grams with posterior < 0.001 are omitted to keep the index size reasonable (merging and pruning are sketched below)
• Index = term occurrence table sorted by N-gram
• Indexing function incorporated in SRILM release 1.5.1
  • lattice-tool -write-ngram-index option
  • Downloadable from www.speech.sri.com/projects/srilm/
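A minimal sketch of the merging and pruning just described, reusing the `IndexEntry` record from the overview sketch. In the real system this happens inside SRILM's lattice-tool; that merged occurrences pool their posterior mass is an assumption here:

```python
TIME_TOLERANCE = 0.5   # seconds; identical N-grams closer than this are merged
MIN_POSTERIOR = 0.001  # occurrences below this posterior are dropped

def merge_and_prune(entries):
    """Collapse near-duplicate N-gram occurrences, then prune low posteriors."""
    # Sort so that duplicates of the same N-gram end up adjacent.
    entries.sort(key=lambda e: (e.ngram, e.waveform, e.channel, e.start))
    merged = []
    for e in entries:
        prev = merged[-1] if merged else None
        if (prev is not None
                and (prev.ngram, prev.waveform, prev.channel)
                    == (e.ngram, e.waveform, e.channel)
                and abs(prev.start - e.start) < TIME_TOLERANCE):
            prev.posterior += e.posterior  # assumption: merged paths pool their mass
        else:
            merged.append(e)
    return [e for e in merged if e.posterior >= MIN_POSTERIOR]
```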
Score Calibration
• A neural net maps posteriors to unbiased STD scores
• Input features: audio source (bnews/cts/confmtg), LM joint probability, LM N-gram length, #words, duration, lattice posterior
• Used LNKnet software to train an MLP that predicts the correctness of a hypothesized term (1 hidden layer with 10 nodes)
  • Cross-entropy objective function
• Neural net trained using the dev06 term list
  • Training on raw data improved Occurrence-Weighted Value, but not Actual Term-Weighted Value
  • Also required re-tuning the posterior threshold
• Resample training data to approximate ATWV (sketched below)
  • Downsample/upsample within occurrences of each term to obtain an equal number of training samples per term
  • A posterior threshold of 0.5 ended up being optimal for ATWV (at least on the training data)
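A sketch of the per-term resampling idea. The slides do not give the exact sample counts, so `per_term` is a hypothetical knob, and each training sample is assumed to be a (term, features, is_correct) triple:

```python
import random

def resample_per_term(samples, per_term=100, seed=0):
    """Balance training data so every term contributes equally,
    mirroring ATWV's equal weighting of terms."""
    rng = random.Random(seed)
    by_term = {}
    for term, features, is_correct in samples:
        by_term.setdefault(term, []).append((term, features, is_correct))
    balanced = []
    for occurrences in by_term.values():
        if len(occurrences) >= per_term:
            balanced.extend(rng.sample(occurrences, per_term))     # downsample
        else:
            balanced.extend(rng.choices(occurrences, k=per_term))  # upsample with replacement
    rng.shuffle(balanced)
    return balanced
```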
Searching & Retrieval
• Convert the search terms into a sorted list
• Run the Unix "join" command between the index produced in the indexing step and the term list (a Python equivalent is sketched below)
• YES/NO decision based on the 0.5 posterior threshold
• Run time is almost independent of the size of the search list (it depends on the index size)
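A rough Python equivalent of the sort-then-join retrieval step, reusing the field names from the earlier `IndexEntry` sketch (the actual system runs the Unix join command over text files):

```python
from bisect import bisect_left

def retrieve(index_entries, search_terms, threshold=0.5):
    """Look up each search term in the sorted N-gram index and make
    a YES/NO decision from its calibrated posterior."""
    index_entries = sorted(index_entries, key=lambda e: e.ngram)
    keys = [e.ngram for e in index_entries]  # parallel sorted key list
    results = []
    for term in sorted(set(search_terms)):
        i = bisect_left(keys, term)  # first index row >= term
        while i < len(keys) and keys[i] == term:
            e = index_entries[i]
            decision = "YES" if e.posterior >= threshold else "NO"
            results.append((term, e.waveform, e.start, e.posterior, decision))
            i += 1
    return results
```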
Time and Memory Requirements

                            BN (3h)       CTS (3h)      ConfMtg (2h)
STT run time                58560 s       26760 s       40440 s
Index from lattices         493 s (total)
NNet run time               2711 s (total)
Index size (#terms / MB)    944K / 74 MB  602K / 37 MB  530K / 37 MB
Search time for all terms   13 s (total)

• The system was run on a 3.4 GHz Intel hyperthreading CPU with 3 GB RAM
• Both index size and search time can be reduced significantly by keeping only candidates with high posterior
• STT run times were incorrectly measured in the submitted system description.
STD Results (Occ.WV / ATWV)

                    Thres.  dev06        dryrun06     Extra dev    eval06
BN       No NNet    0.3     0.914/0.850  0.906/0.802  0.887/0.801
         With NNet  0.5     0.914/0.865  0.905/0.818  0.889/0.817  --- /0.824
CTS      No NNet    0.3     0.881/0.692  0.860/0.615  0.792/0.681
         With NNet  0.5     0.881/0.714  0.860/0.660  0.800/0.712  --- /0.665
ConfMtg  No NNet    0.3     0.585/0.275  0.566/0.205  0.631/0.462
         With NNet  0.5     0.515/0.427  0.491/0.358  0.536/0.461  --- /0.255
All      No NNet    0.3     0.821/0.787  0.802/0.700  0.790/0.687
         With NNet  0.5     0.804/0.817  0.782/0.739  0.784/0.718

• Extra dev consists of RT02, RT03 (BN+CTS), dev04 (CTS+ConfMtg), RT04s (ConfMtg)
• Difficult to debug eval06 (no references were given), but the meetings result appears much lower than on the dev sets
• Possibly an overtrained neural net on the meetings condition
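For reference, a sketch of how the term-weighted value metric is computed, following the NIST STD 2006 definition as I understand it (β ≈ 999.9 comes from the evaluation's cost/value ratio and term prior; treat the exact denominators as assumptions):

```python
BETA = 999.9  # NIST STD-06 weight on false alarms

def atwv(decisions, ref_counts, speech_seconds):
    """Actual Term-Weighted Value over all terms in the reference.
    decisions[term] = (n_correct_yes, n_false_alarm_yes);
    ref_counts[term] = number of true occurrences of the term."""
    per_term = []
    for term, n_true in ref_counts.items():
        n_corr, n_fa = decisions.get(term, (0, 0))
        p_miss = 1.0 - n_corr / n_true
        # Non-target trials are approximated by the seconds of speech searched.
        p_fa = n_fa / (speech_seconds - n_true)
        per_term.append(1.0 - (p_miss + BETA * p_fa))
    return sum(per_term) / len(per_term)
```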
Future Work
• The current system does not cover detection of terms containing OOV words. Possible approaches:
  • Map unknown search terms to the known vocabulary (OGI work; gave about 2-3% improvement on BNews)
  • Use phone recognition and phone-based indexing for OOVs
  • A hybrid word+graphone recognizer outputs both words and "graphone" units that can match OOVs (Bisani & Ney 2005)
• Improve the score mapper
  • A bigger dev set is needed to avoid overtraining
  • Other models (decision tree, logistic regression)
• Found some mismatch between the ASR vocabulary and the term lists; apply normalization rules to fix common problems (about 0.3% relative improvement with a few simple rules)
• Tune the STT systems for indexing speed