The SRI 2006 Spoken Term Detection System
Dimitra Vergyri, Andreas Stolcke, Ramana Rao Gadde, Wen Wang
Speech Technology & Research Laboratory, SRI International, Menlo Park, CA
Outline
• STD system overview
• STT systems
  • BNews system description
  • CTS system description
  • ConfMtg system description
• Indexing
  • N-gram index from word lattices
  • NNet-based posterior estimation
• Retrieval
• Time and memory requirements
• ATWV results
• Future work
SRI STD System
[Block diagram: audio → STT → word lattices → INDEXER → N-gram index with posteriors (the indexing step); search terms and the index feed the RETRIEVER, which outputs terms with times and probabilities]
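To make the data flow concrete, here is a minimal sketch of what one entry in the N-gram index could carry; the field names are illustrative, not SRI's actual on-disk format:

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    """One hypothesized N-gram occurrence extracted from the word lattices."""
    ngram: str        # 1- to 5-gram of words, e.g. "spoken term detection"
    waveform: str     # name of the source audio file
    channel: int      # audio channel within the waveform
    start: float      # start time in seconds
    end: float        # end time in seconds
    posterior: float  # lattice posterior probability of this occurrence
```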
English BN STT System
• Single front end: PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-independent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
  • Acoustic training: Hub4, TDT2+TDT4+TDT4a, BNr1234 subset
  • MLE training: 3300 hours; MPE training: 1700 hours
  • 2500 x 200 Gaussians for nonCW triphones
  • 3000 x 160 Gaussians for CW triphones
• Word bigram and 5-gram LMs trained on Hub4, TDT, BNr1234 transcripts, Hub4 LM training data, and NABN (cutoff date Nov. 30, 2003)
  • 62k words, 29M bigrams, 27M trigrams, 15M 4-grams, 2.4M 5-grams
• Duration rescoring (word-specific phone durations)
• Two-pass decoding
  • First decoding stage: unadapted nonCW models with bigram LM
  • CW models adapted to the nonCW output after 5-gram LM and duration-model lattice rescoring
  • Lattice-constrained decoding with MLLR-adapted, SAT, CW models
English BN STT System
[Flow diagram: PLP MPE nonCW models generate 2-gram lattices, rescored into 4-gram lattices (hyps for MLLR); adapted PLP MPE CW models produce 5-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output]
• Runtimes:
  • 2.5xRT for unadapted lattices
  • 5.4xRT for adapted lattices
• ~10% relative WER improvement after adaptation
• Both decoding stages use Gaussian shortlists.
English CTS STT System
• Two front ends:
  • MFCC + voicing + MLP features (52 + 10 + 25 → 39 + 25 dim)
  • PLP (52 → 39 dim)
• HLDA, feature-space SAT
• Gender-dependent acoustic modeling
• Decision-tree clustered within-word and cross-word triphones
• MLE followed by alternating MPE-MMIE acoustic training
  • Acoustic training: all Hub5 + Fisher training data
  • 2500 x 128 x 2 Gaussians for nonCW triphones
  • 3000 x 128 x 2 Gaussians for CW triphones
• Prosodic rescoring (word-specific phone durations, pause trigram)
• Word bigram and 4-gram LMs
  • Interpolated + pruned LMs trained on CTS, BN, and Web data
  • 48k words, 16M bigrams, 16M trigrams, 12M 4-grams
• First lattice generation uses phone-loop MLLR, nonCW MFCC models, and a 2-gram LM
• Second, constrained lattice generation uses cross-adapted CW SAT PLP models.
English Meeting STT System [Stolcke et al., MLMI'05; Janin et al., MLMI'06]
• Based on the CTS system architecture (2-pass system)
• Combination of CTS (narrow-band) and BN (wide-band) base models
• Acoustic models adapted to distant-mic meeting recordings using MMI-MAP
• MLP features adapted for meeting recordings by incremental training
• Mixture language model trained on meetings, CTS, and Web data
• System used in the RT-06S meeting evaluation, co-developed with ICSI
English CTS & ConfMtg STT Systems
[Flow diagram: MFCC-MLP MPE nonCW models generate 2-gram lattices, rescored into 3-gram lattices (hyps for MLLR); cross-adapted PLP MPE CW models produce adapted 4-gram lattices. Legend: decoding/rescoring step; hyps for MLLR or output; lattice generation/use; lattice or 1-best output]
• CTS runtime:
  • 1.8xRT for unadapted lattices
  • 2.5xRT for adapted lattices
• ConfMtg runtime:
  • 5.4xRT for unadapted lattices
  • 6.8xRT for adapted lattices
• CTS system uses Gaussian shortlists in the first pass only
• ConfMtg system does not use shortlists.
English STT Result Summary (WER)

         eval02   eval03   dev04    eval04s   STD-dev06
BN       10.7%    10.5%    ---      ---       23.2%
CTS      23.7%    24.0%    17.0%    ---       17.4%
ConfMtg  ---      ---      36.9%    37.2%     44.2%

• STD-dev06 WER measured using references constructed from RTTM files
  • Systematic differences compared to standard STT references
  • For example, BN scoring does not exclude commercial segments
• Note: the STT systems were not especially tuned for STD; configurations were inherited from STT evaluations.
Indexing of Word Lattices
• SRILM lattice-tool dumps all word 1-grams to 5-grams in the lattices, along with side information:
  • Posterior probabilities based on normalized recognizer scores
  • Start/end times, channel, waveform name
  • 0.5s time tolerance to merge identical N-grams with different times
  • Pronunciations (to detect OOV words; not used yet)
• N-grams with posterior < 0.001 are omitted to keep the index size reasonable (merging and pruning are sketched below)
• Index = term occurrence table sorted by N-gram
• Indexing function incorporated in SRILM release 1.5.1
  • lattice-tool -write-ngram-index option
  • Downloadable from www.speech.sri.com/projects/srilm/
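A minimal sketch of the merging and pruning just described, reusing the `IndexEntry` record from the overview sketch. In the real system this happens inside SRILM's lattice-tool; that merged occurrences pool their posterior mass is an assumption here:

```python
TIME_TOLERANCE = 0.5   # seconds; identical N-grams closer than this are merged
MIN_POSTERIOR = 0.001  # occurrences below this posterior are dropped

def merge_and_prune(entries):
    """Collapse near-duplicate N-gram occurrences, then prune low posteriors."""
    # Sort so that duplicates of the same N-gram end up adjacent.
    entries.sort(key=lambda e: (e.ngram, e.waveform, e.channel, e.start))
    merged = []
    for e in entries:
        prev = merged[-1] if merged else None
        if (prev is not None
                and (prev.ngram, prev.waveform, prev.channel)
                    == (e.ngram, e.waveform, e.channel)
                and abs(prev.start - e.start) < TIME_TOLERANCE):
            prev.posterior += e.posterior  # assumption: merged paths pool their mass
        else:
            merged.append(e)
    return [e for e in merged if e.posterior >= MIN_POSTERIOR]
```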
Score Calibration
• A neural net maps posteriors to unbiased STD scores
• Input features: audio source (bnews/cts/confmtg), LM joint probability, LM N-gram length, #words, duration, lattice posterior
• Used LNKnet software to train an MLP that predicts the correctness of a hypothesized term (1 hidden layer with 10 nodes)
  • Cross-entropy objective function
• Neural net trained using the dev06 term list
  • Training on raw data improved Occurrence-Weighted Value, but not Actual Term-Weighted Value
  • Also required re-tuning the posterior threshold
• Resample training data to approximate ATWV (sketched below)
  • Downsample/upsample within occurrences of each term to obtain an equal number of training samples per term
  • A posterior threshold of 0.5 ended up being optimal for ATWV (at least on the training data)
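A sketch of the per-term resampling idea. The slides do not give the exact sample counts, so `per_term` is a hypothetical knob, and each training sample is assumed to be a (term, features, is_correct) triple:

```python
import random

def resample_per_term(samples, per_term=100, seed=0):
    """Balance training data so every term contributes equally,
    mirroring ATWV's equal weighting of terms."""
    rng = random.Random(seed)
    by_term = {}
    for term, features, is_correct in samples:
        by_term.setdefault(term, []).append((term, features, is_correct))
    balanced = []
    for occurrences in by_term.values():
        if len(occurrences) >= per_term:
            balanced.extend(rng.sample(occurrences, per_term))     # downsample
        else:
            balanced.extend(rng.choices(occurrences, k=per_term))  # upsample with replacement
    rng.shuffle(balanced)
    return balanced
```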
Searching & Retrieval
• Convert the search terms into a sorted list
• Run the Unix "join" command between the index produced in the indexing step and the term list (a Python equivalent is sketched below)
• YES/NO decision based on the 0.5 posterior threshold
• Run time is almost independent of the size of the search list (it depends on the index size)
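A rough Python equivalent of the sort-then-join retrieval step, reusing the field names from the earlier `IndexEntry` sketch (the actual system runs the Unix join command over text files):

```python
from bisect import bisect_left

def retrieve(index_entries, search_terms, threshold=0.5):
    """Look up each search term in the sorted N-gram index and make
    a YES/NO decision from its calibrated posterior."""
    index_entries = sorted(index_entries, key=lambda e: e.ngram)
    keys = [e.ngram for e in index_entries]  # parallel sorted key list
    results = []
    for term in sorted(set(search_terms)):
        i = bisect_left(keys, term)  # first index row >= term
        while i < len(keys) and keys[i] == term:
            e = index_entries[i]
            decision = "YES" if e.posterior >= threshold else "NO"
            results.append((term, e.waveform, e.start, e.posterior, decision))
            i += 1
    return results
```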
Time and Memory Requirements

                            BN (3h)       CTS (3h)      ConfMtg (2h)
STT run time                58560 s       26760 s       40440 s
Index from lattices         493 s (total)
NNet run time               2711 s (total)
Index size (#terms / MB)    944K / 74 MB  602K / 37 MB  530K / 37 MB
Search time for all terms   13 s (total)

• The system was run on a 3.4 GHz Intel hyperthreading CPU with 3 GB RAM
• Both index size and search time can be reduced significantly by keeping only candidates with high posterior
• STT run times were incorrectly measured in the submitted system description.
STD Results (Occ.WV / ATWV)

                    Thres.  dev06        dryrun06     Extra dev    eval06
BN       No NNet    0.3     0.914/0.850  0.906/0.802  0.887/0.801
         With NNet  0.5     0.914/0.865  0.905/0.818  0.889/0.817  --- /0.824
CTS      No NNet    0.3     0.881/0.692  0.860/0.615  0.792/0.681
         With NNet  0.5     0.881/0.714  0.860/0.660  0.800/0.712  --- /0.665
ConfMtg  No NNet    0.3     0.585/0.275  0.566/0.205  0.631/0.462
         With NNet  0.5     0.515/0.427  0.491/0.358  0.536/0.461  --- /0.255
All      No NNet    0.3     0.821/0.787  0.802/0.700  0.790/0.687
         With NNet  0.5     0.804/0.817  0.782/0.739  0.784/0.718

• Extra dev consists of RT02, RT03 (BN+CTS), dev04 (CTS+ConfMtg), RT04s (ConfMtg)
• Difficult to debug eval06 (no references were given), but the meetings result appears much lower than on the dev sets
• Possibly an overtrained neural net on the meetings condition
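For reference, a sketch of how the term-weighted value metric is computed, following the NIST STD 2006 definition as I understand it (β ≈ 999.9 comes from the evaluation's cost/value ratio and term prior; treat the exact denominators as assumptions):

```python
BETA = 999.9  # NIST STD-06 weight on false alarms

def atwv(decisions, ref_counts, speech_seconds):
    """Actual Term-Weighted Value over all terms in the reference.
    decisions[term] = (n_correct_yes, n_false_alarm_yes);
    ref_counts[term] = number of true occurrences of the term."""
    per_term = []
    for term, n_true in ref_counts.items():
        n_corr, n_fa = decisions.get(term, (0, 0))
        p_miss = 1.0 - n_corr / n_true
        # Non-target trials are approximated by the seconds of speech searched.
        p_fa = n_fa / (speech_seconds - n_true)
        per_term.append(1.0 - (p_miss + BETA * p_fa))
    return sum(per_term) / len(per_term)
```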
Future Work
• The current system does not cover detection of terms containing OOV words. Possible approaches:
  • Map unknown search terms to the known vocabulary (OGI work; gave about 2-3% improvement on BNews)
  • Use phone recognition and phone-based indexing for OOVs
  • A hybrid word+graphone recognizer outputs both words and "graphone" units that can match OOVs (Bisani & Ney 2005)
• Improve the score mapper
  • A bigger dev set is needed to avoid overtraining
  • Other models (decision tree, logistic regression)
• Found some mismatch between the ASR vocabulary and the term lists; apply normalization rules to fix common problems (about 0.3% relative improvement with a few simple rules)
• Tune the STT systems for indexing speed