This collection of papers explores discriminative training methods for improving speech recognition systems, focusing on large-scale tasks and unreliable transcripts. Techniques like Large Margin Estimation and Classification Error Minimization are discussed and experimentally evaluated.
ICASSP 2005 Survey: Discriminative Training (6 papers) Presenter: Jen-Wei Kuo
Outline • Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition – Cambridge University • Discriminative Training of CDHMMs for Maximum Relative Separation Margin – York University • Statistical Performance Analysis of MCE/GPD Learning in Gaussian Classifiers and Hidden Markov Models – BBN • Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts – JHU • Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers – NTT • Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition – Microsoft Speech Lab. NTNU
Discriminative Training of CDHMMs for Maximum Relative Separation Margin Chaojun Liu, Hui Jiang, Xinwei Li York University, Canada ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Reference • Large Margin HMMs for Speech Recognition • Xinwei Li, Hui Jiang, Chaojun Liu • York University, Canada • ICASSP’05 - Speech and Audio Processing Applications session
Large Margin Estimation (LME) of HMM • The constraint alone cannot guarantee that a solution exists
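The LME criterion referenced above is usually written as follows; the slide's formulas were lost in extraction, so the notation below is a standard reconstruction rather than a verbatim copy:

```latex
% Separation margin of training token X_i with true label W_i:
d(X_i) = \log p(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} \log p(X_i \mid \lambda_W)
% LME maximizes the smallest margin over the support token set S,
% subject to all support tokens being correctly classified:
\tilde{\lambda} = \arg\max_{\lambda} \min_{X_i \in S} d(X_i)
\quad \text{s.t. } d(X_i) > 0 \ \ \forall X_i \in S
% The constraint d(X_i) > 0 is what cannot be guaranteed to be satisfiable.
```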
Iterative Localized Optimization • Step 1. Based on the current model, choose the support token that satisfies the above constraints and gives the minimum margin. • Step 2. Update the model using GPD. • Step 3. If the convergence conditions are not met, go to Step 1.
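The three steps above can be sketched as a loop; this is a toy illustration, not the paper's implementation. The real margin is computed from HMM log-likelihoods, whereas here a scalar "model" and the hypothetical margin 1 - (model - t)^2 stand in, so for tokens {0, 1} the maximin solution is model = 0.5. Step 3's convergence test is replaced by a fixed iteration budget.

```python
def margin(model, t):
    # Toy stand-in for the HMM separation margin of token t.
    return 1.0 - (model - t) ** 2

def margin_gradient(model, t):
    # Gradient of the toy margin with respect to the model parameter.
    return -2.0 * (model - t)

def iterative_localized_optimization(model, tokens, step=0.02, max_iter=500):
    for _ in range(max_iter):
        # Step 1: support token = token with the smallest positive margin
        positive = [t for t in tokens if margin(model, t) > 0]
        if not positive:
            break
        support = min(positive, key=lambda t: margin(model, t))
        # Step 2: GPD update -- gradient ascent on the support token's margin
        model = model + step * margin_gradient(model, support)
        # Step 3: here, simply iterate up to the fixed budget
    return model
```

Running it from model = 0.1 drives the model toward the maximin point between the two tokens, where the two margins are balanced.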
Experimental Results • English E-set vocabulary of OGI ISOLET database
Large Relative Margin Estimation (LRME) of HMM
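The LRME formulas on this slide were lost in extraction. A plausible reconstruction, which should be checked against the original paper, normalizes the separation margin by the true model's log-likelihood so that margins are comparable across utterances of different lengths and scales:

```latex
% Hedged reconstruction (notation assumed, not copied from the slide):
\tilde{d}(X_i) =
\frac{\log p(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} \log p(X_i \mid \lambda_W)}
     {\left| \log p(X_i \mid \lambda_{W_i}) \right|}
% LRME then maximizes the minimum relative margin over the support set,
% in place of the absolute margin used by LME.
```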
Experimental Results • English E-set vocabulary of OGI ISOLET database and Alphabet set
Conclusion • Main concepts • Criterion: maximum large margin; maximum large relative margin • Support token: an utterance with a relatively small positive margin
Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts Lambert Mathias*, Girija Yegnanarayanan+, Juergen Fritsch+ (*JHU, +Multimodal Technologies, Inc.) ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Introduction • This paper presents a method for automatically generating training transcripts from medical reports. • Medical domain • A practically unlimited amount of speech data is available for each speaker • These data have no verbatim transcripts, only final reports • Medical final reports • Dictated by physicians and other healthcare professionals • Grammatical errors corrected • Disfluencies and repetitions removed • Non-dictated sentence and paragraph boundaries added • Dictated paragraphs reordered • Can still be exploited as an information source for generating training transcripts
Introduction • Central idea of this paper • Step 1. Transform the reports into spoken-form transcripts (Partially Reliable Transcripts, PRT) • Step 2. Identify reliable regions in the transcripts • Step 3. Apply ML/MMI acoustic training, using a proposed frame-based filtering approach for lattice-based MMI • Step 4. Results show that MMI outperforms ML
Partially Reliable Transcripts • Step 1. Normalize the medical reports to a common format • Step 2. Generate a report-specific FSG for each of the available medical reports • Step 3. Use the normalized medical reports to train a LM • Step 4. Generate the orthographic transcripts using the LM and the best available AM • Step 5. Annotate the orthographic transcripts by aligning them against the corresponding report-specific FSGs • Step 6. Parse the orthographic transcripts with the report-specific FSG using a robust parser that allows INS, DEL and SUB • Step 7. If a word is an INS, DEL or SUB, mark the frames of its underlying phone sequence as “unreliable”; otherwise mark them “reliable” • Step 8. Use the reliable segments to retrain the AMs • Step 9. Go to Step 4.
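Steps 6-7 above can be sketched as follows. The edit-operation representation and the function name are assumptions for illustration; the real system works on the robust parser's alignment output against the report-specific FSG.

```python
def mark_frame_reliability(ops):
    """ops: list of (edit_type, start_frame, end_frame) tuples, where
    edit_type is 'MATCH', 'INS', 'DEL' or 'SUB'. Returns a dict mapping
    frame index -> True (reliable) / False (unreliable)."""
    reliability = {}
    for edit_type, start, end in ops:
        reliable = edit_type == "MATCH"   # INS/DEL/SUB frames are unreliable
        for frame in range(start, end):
            reliability[frame] = reliable
    return reliability

# Toy example: a substituted word makes frames 30-44 unreliable.
flags = mark_frame_reliability([
    ("MATCH", 0, 30),    # word aligned exactly against the FSG
    ("SUB",   30, 45),   # substituted word: its frames are unreliable
    ("MATCH", 45, 80),
])
```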
MMI Training with Frame Filtering • Approach 1 • Step 1. Mark each arc in the MMI training lattices as RELIABLE or UNRELIABLE • Step 2. Accumulate counts (num and den) only on the RELIABLE arcs • Approach 2 (frame filtering) • Step 1. Mark each frame as “reliable” or “unreliable” • Step 2. This allows partially reliable words to be included in training
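A minimal sketch of Approach 2: per-frame occupancies on a lattice arc contribute to the MMI accumulators only for frames flagged reliable, so a partially reliable word still contributes its reliable frames. The data layout here is a simplification of real lattice statistics.

```python
def accumulate_mmi(arcs, reliable, num_acc, den_acc):
    """arcs: list of (is_numerator, start_frame, end_frame, gamma) with a
    constant per-frame occupancy gamma; reliable: frame index -> bool."""
    for is_num, start, end, gamma in arcs:
        for t in range(start, end):
            if not reliable.get(t, False):
                continue                      # frame filtering: skip unreliable frames
            acc = num_acc if is_num else den_acc
            acc[t] = acc.get(t, 0.0) + gamma

# Toy example: only frames 0-4 are reliable, so frames 5-9 accumulate nothing.
num_acc, den_acc = {}, {}
reliable = {t: t < 5 for t in range(10)}
accumulate_mmi([(True, 0, 10, 0.5), (False, 0, 10, 0.25)], reliable, num_acc, den_acc)
```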
Experimental Results
Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers Erik McDermott and Shigeru Katagiri NTT Communication Science Laboratories ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Introduction • Special features of this paper • MCE training with Quickprop optimization • SOLON, NTT’s WFST-based recognizer • Uses a time-synchronous beam search strategy and has been applied to LMs with vocabularies of up to 1.8 million words • Context-dependent model design using decision trees • Corpus of Spontaneous Japanese (CSJ) lecture speech transcription task (about 190 hrs) • Name recognition on 22k names • Word recognition with a 30k-word vocabulary
Corpus for Name Recognition • Name recognition (40 hrs from CSJ) • 35500 utterances (39 hrs) for training • Contain 22320 names (16547 family names and 5744 given names) • 6428 utterances for testing • Contain OOVs • WFST • Weight pushing, network optimization • 489756 nodes • 1349430 arcs
WFST Recognizer • Four strategies to generate denominator statistics for MCE training • Triphone loop • Similar to free syllable recognition in Mandarin • Bigram triphone LM • Full-WFST LM + flat transcripts • Full 22k LM (22320 names in vocabulary) • The transcription is represented as a WFST obtained by composing the full WFST with the transcribed word sequence • Lattice-WFST + flat transcripts • The lattice is first generated by the MLE-trained model • Faster than Full-WFST (about 800 arcs each on average vs. 1349430 arcs) • Lattice-WFST + rich transcripts • Add all possible fillers to the transcription grammar
Experimental Results
Experimental Results • Use of the Lp-norm and N-best incorrect candidates
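The Lp-norm combination of N-best incorrect candidates mentioned above can be sketched with the standard MCE misclassification measure and sigmoid loss; this is a generic textbook form, not NTT's exact implementation, and the parameter names eta and gamma are the usual smoothing constants.

```python
import math

def mce_loss(g_correct, g_competitors, eta=2.0, gamma=1.0):
    """g_correct: log-score of the reference; g_competitors: log-scores of
    the N-best incorrect candidates. Returns the smoothed 0/1 loss."""
    n = len(g_competitors)
    # Lp/soft-max combination of the competitor scores: as eta grows this
    # approaches the single best incorrect candidate.
    anti = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in g_competitors) / n)
    d = -g_correct + anti                      # d > 0 means misclassified
    return 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid-smoothed loss
```

A well-classified utterance (reference far above all competitors) gives a loss near 0; a misclassified one gives a loss near 1, and the gradient flows through all N-best candidates rather than only the top one.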
Word Recognition • Word recognition corpus and experimental results • 154000 utterances (190 hrs) for training • 10 lecture speeches (130 minutes in total) for testing • 30k words in vocabulary • WFST • Trigram LM • 6138702 arcs • MCE training • Beam search with a unigram LM (about 3-5x real time) • 494845 arcs
Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition Bo Liu12, Hui Jiang3, Jian-Lai Zhou1, Ren-Hua Wang2 (1Microsoft Research Asia, 2University of Science and Technology of China, 3York University) ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Reference • A Dynamic In-Search Discriminative Training Approach for Large Vocabulary Speech Recognition • Hui Jiang, Olivier Siohan, Frank K. Soong, Chin-Hui Lee • Bell Labs, Lucent Technologies • ICASSP’02 – Discriminative Training in Speech Recognition session
Competing Token Collection • For each frame t • For each active word arc w • Perform a backtrace to obtain the partial path • HMM alignment • For each HMM m • Calculate the overlap rate • If overlap rate < threshold and Likelihood(m) < Likelihood(Ref), collect m as a competing token • End • End • End
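The inner test of the loop above can be sketched as follows. The segment representation, `overlap_rate` definition, and the threshold value are illustrative assumptions; the condition itself (low overlap with the reference and a likelihood below the reference) mirrors the slide.

```python
def overlap_rate(seg_a, seg_b):
    """Fraction of seg_a covered by seg_b; segments are (start, end) frames."""
    inter = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    return inter / max(1, seg_a[1] - seg_a[0])

def collect_competing_tokens(hyp_segments, ref_segment, ref_loglik,
                             overlap_thresh=0.5):
    """hyp_segments: list of (model_id, (start, end), loglik) from the active
    word arcs. A segment becomes a competing token when it overlaps the
    reference segment too little AND scores below the reference."""
    tokens = []
    for model_id, seg, loglik in hyp_segments:
        if overlap_rate(seg, ref_segment) < overlap_thresh and loglik < ref_loglik:
            tokens.append((model_id, seg))
    return tokens

# Toy example: only "b" satisfies both conditions against the reference.
cts = collect_competing_tokens(
    [("a", (10, 20), -4.0), ("b", (0, 8), -10.0), ("c", (0, 8), -3.0)],
    ref_segment=(10, 20), ref_loglik=-5.0)
```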
Experimental Results • Corpus • DARPA Communicator task (Travel Reservation Application)
Introduction • Discriminative criterion at the phone level: the Least Phone Competing Tokens (LPCT) criterion • Defined over a speech segment O and a phone a • In terms of Competing Tokens (CT) and True Tokens (TT)
Off-line Token Collection • Discriminative criterion at the phone level • True Token (TT) • First, forced alignment is performed • Every segment in the reference is treated as a TT • Competing Token (CT) • Generate a word lattice • Annotate phone boundaries on each word arc • Choose whether each phone arc becomes a CT • 1. its maximum overlap with the same phone in the reference exceeds a threshold • 2. the log-likelihood difference exceeds a threshold • 3. add the phone arc (segment and phone id) to the CT set • LPCT = token collection + MCE/GPD
Least Phone Competing Tokens Criterion (LPCT) • Experimental Results • Resource Management database
Experimental Results • Switchboard database
Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition K. C. Sim and M. J. F. Gales University of Cambridge ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Background for Precision Modeling • Problem • How to model correlations between feature dimensions without the number of parameters growing quadratically with the dimension • Common solution • A diagonal covariance approximation is employed • Structured precision matrix approximations (rank-1 bases, R = 1) • n = d: STC model • d < n <= d(d+1)/2: EMLLT model • General symmetric bases: SPAM model
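The three structured approximations named above are special cases of one basis superposition of the precision matrix; the formula below is the standard form from the precision-modelling literature, with symbols reconstructed rather than copied from the slide:

```latex
% Precision matrix of Gaussian component m as a weighted sum of n shared bases:
P_m = \Sigma_m^{-1} = \sum_{i=1}^{n} \lambda_m^{(i)} S_i
% S_i: globally shared symmetric basis matrices of rank R,
% \lambda_m^{(i)}: component-specific expansion coefficients.
% R = 1, n = d: STC;  R = 1, d < n \le d(d+1)/2: EMLLT;
% general symmetric S_i: SPAM.
```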
Research Progress