This collection of papers explores discriminative training methods for improving speech recognition systems, focusing on large-scale tasks and unreliable transcripts. Techniques like Large Margin Estimation and Classification Error Minimization are discussed and experimentally evaluated.
ICASSP 2005 Survey: Discriminative Training (6 papers) Presenter: Jen-Wei Kuo
Outline • Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition – Cambridge University • Discriminative Training of CDHMMs for Maximum Relative Separation Margin – York University • Statistical Performance Analysis of MCE/GPD Learning in Gaussian Classifiers and Hidden Markov Models – BBN • Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts – JHU • Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers – NTT • Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition – Microsoft Speech Lab. NTNU
Discriminative Training of CDHMMs for Maximum Relative Separation Margin Chaojun Liu, Hui Jiang, Xinwei Li York University, Canada ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Reference • Large Margin HMMs for Speech Recognition • Xinwei Li, Hui Jiang, Chaojun Liu • York University, Canada • ICASSP’05 - Speech and Audio Processing Applications session
Large Margin Estimation (LME) of HMM • The constraint alone cannot guarantee that a solution exists
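The LME criterion referenced above is usually written as follows; the slide's formulas were lost in extraction, so the notation below is a standard reconstruction rather than a verbatim copy:

```latex
% Separation margin of training token X_i with true label W_i:
d(X_i) = \log p(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} \log p(X_i \mid \lambda_W)
% LME maximizes the smallest margin over the support token set S,
% subject to all support tokens being correctly classified:
\tilde{\lambda} = \arg\max_{\lambda} \min_{X_i \in S} d(X_i)
\quad \text{s.t. } d(X_i) > 0 \ \ \forall X_i \in S
% The constraint d(X_i) > 0 is what cannot be guaranteed to be satisfiable.
```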
Iterative Localized Optimization • Step 1. Based on the current model, choose the support token that satisfies the above constraints and gives the minimum margin. • Step 2. Update the model using GPD. • Step 3. If the convergence conditions are not met, go to Step 1.
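The three steps above can be sketched as a loop; this is a toy illustration, not the paper's implementation. The real margin is computed from HMM log-likelihoods, whereas here a scalar "model" and the hypothetical margin 1 - (model - t)^2 stand in, so for tokens {0, 1} the maximin solution is model = 0.5. Step 3's convergence test is replaced by a fixed iteration budget.

```python
def margin(model, t):
    # Toy stand-in for the HMM separation margin of token t.
    return 1.0 - (model - t) ** 2

def margin_gradient(model, t):
    # Gradient of the toy margin with respect to the model parameter.
    return -2.0 * (model - t)

def iterative_localized_optimization(model, tokens, step=0.02, max_iter=500):
    for _ in range(max_iter):
        # Step 1: support token = token with the smallest positive margin
        positive = [t for t in tokens if margin(model, t) > 0]
        if not positive:
            break
        support = min(positive, key=lambda t: margin(model, t))
        # Step 2: GPD update -- gradient ascent on the support token's margin
        model = model + step * margin_gradient(model, support)
        # Step 3: here, simply iterate up to the fixed budget
    return model
```

Running it from model = 0.1 drives the model toward the maximin point between the two tokens, where the two margins are balanced.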
Experimental Results • English E-set vocabulary of OGI ISOLET database
Large Relative Margin Estimation (LRME) of HMM
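The LRME formulas on this slide were lost in extraction. A plausible reconstruction, which should be checked against the original paper, normalizes the separation margin by the true model's log-likelihood so that margins are comparable across utterances of different lengths and scales:

```latex
% Hedged reconstruction (notation assumed, not copied from the slide):
\tilde{d}(X_i) =
\frac{\log p(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} \log p(X_i \mid \lambda_W)}
     {\left| \log p(X_i \mid \lambda_{W_i}) \right|}
% LRME then maximizes the minimum relative margin over the support set,
% in place of the absolute margin used by LME.
```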
Experimental Results • English E-set vocabulary of OGI ISOLET database and Alphabet set
Conclusion • Main concepts • Criterion: maximum large margin; maximum large relative margin • Support token: an utterance with a relatively small positive margin
Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts Lambert Mathias*, Girija Yegnanarayanan+, Juergen Fritsch+ (*JHU, +Multimodal Technologies, Inc.) ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Introduction • This paper presents a method for automatically generating training transcripts from medical reports. • Medical domain • A practically unlimited amount of speech data is available for each speaker • These data have no verbatim transcripts, only final reports • Medical final reports • Dictated by physicians and other healthcare professionals • Grammatical errors corrected • Disfluencies and repetitions removed • Non-dictated sentence and paragraph boundaries added • Dictated paragraphs reordered • Can still be exploited as an information source for generating training transcripts
Introduction • Central idea of this paper • Step 1. Transform the reports into spoken-form transcripts (Partially Reliable Transcripts, PRT) • Step 2. Identify reliable regions in the transcripts • Step 3. Apply ML/MMI acoustic training, using a proposed frame-based filtering approach for lattice-based MMI • Step 4. Results show that MMI outperforms ML
Partially Reliable Transcripts • Step 1. Normalize the medical reports to a common format • Step 2. Generate a report-specific FSG for each of the available medical reports • Step 3. Use the normalized medical reports to train a LM • Step 4. Generate the orthographic transcripts using the LM and the best available AM • Step 5. Annotate the orthographic transcripts by aligning them against the corresponding report-specific FSGs • Step 6. Parse the orthographic transcripts with the report-specific FSG using a robust parser that allows INS, DEL and SUB • Step 7. If a word is an INS, DEL or SUB, mark the frames of its underlying phone sequence as “unreliable”; otherwise mark them “reliable” • Step 8. Use the reliable segments to retrain the AMs • Step 9. Go to Step 4.
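Steps 6-7 above can be sketched as follows. The edit-operation representation and the function name are assumptions for illustration; the real system works on the robust parser's alignment output against the report-specific FSG.

```python
def mark_frame_reliability(ops):
    """ops: list of (edit_type, start_frame, end_frame) tuples, where
    edit_type is 'MATCH', 'INS', 'DEL' or 'SUB'. Returns a dict mapping
    frame index -> True (reliable) / False (unreliable)."""
    reliability = {}
    for edit_type, start, end in ops:
        reliable = edit_type == "MATCH"   # INS/DEL/SUB frames are unreliable
        for frame in range(start, end):
            reliability[frame] = reliable
    return reliability

# Toy example: a substituted word makes frames 30-44 unreliable.
flags = mark_frame_reliability([
    ("MATCH", 0, 30),    # word aligned exactly against the FSG
    ("SUB",   30, 45),   # substituted word: its frames are unreliable
    ("MATCH", 45, 80),
])
```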
MMI Training with Frame Filtering • Approach 1 • Step 1. Mark each arc in the MMI training lattices as RELIABLE or UNRELIABLE • Step 2. Accumulate counts (num and den) only on the RELIABLE arcs • Approach 2 (frame filtering) • Step 1. Mark each frame as “reliable” or “unreliable” • Step 2. This allows partially reliable words to be included in training
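A minimal sketch of Approach 2: per-frame occupancies on a lattice arc contribute to the MMI accumulators only for frames flagged reliable, so a partially reliable word still contributes its reliable frames. The data layout here is a simplification of real lattice statistics.

```python
def accumulate_mmi(arcs, reliable, num_acc, den_acc):
    """arcs: list of (is_numerator, start_frame, end_frame, gamma) with a
    constant per-frame occupancy gamma; reliable: frame index -> bool."""
    for is_num, start, end, gamma in arcs:
        for t in range(start, end):
            if not reliable.get(t, False):
                continue                      # frame filtering: skip unreliable frames
            acc = num_acc if is_num else den_acc
            acc[t] = acc.get(t, 0.0) + gamma

# Toy example: only frames 0-4 are reliable, so frames 5-9 accumulate nothing.
num_acc, den_acc = {}, {}
reliable = {t: t < 5 for t in range(10)}
accumulate_mmi([(True, 0, 10, 0.5), (False, 0, 10, 0.25)], reliable, num_acc, den_acc)
```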
Experimental Results
Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers Erik McDermott and Shigeru Katagiri NTT Communication Science Laboratories ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Introduction • Special features of this paper • MCE training with Quickprop optimization • SOLON, NTT’s WFST-based recognizer • Uses a time-synchronous beam search strategy and has been applied to LMs with vocabularies of up to 1.8 million words • Context-dependent model design using decision trees • Corpus of Spontaneous Japanese (CSJ) lecture speech transcription task (about 190 hrs) • Name recognition on 22k names • Word recognition with a 30k-word vocabulary
Corpus for Name Recognition • Name recognition (40 hrs from CSJ) • 35500 utterances (39 hrs) for training • Contain 22320 names (16547 family names and 5744 given names) • 6428 utterances for testing • Contain OOVs • WFST • Weight pushing, network optimization • 489756 nodes • 1349430 arcs
WFST Recognizer • Four strategies to generate denominator statistics for MCE training • Triphone loop • Similar to free syllable recognition in Mandarin • Bigram triphone LM • Full-WFST LM + flat transcripts • Full 22k LM (22320 names in vocabulary) • The transcription is represented as a WFST obtained by composing the full WFST with the transcribed word sequence • Lattice-WFST + flat transcripts • The lattice is first generated by the MLE-trained model • Faster than Full-WFST (about 800 arcs each on average vs. 1349430 arcs) • Lattice-WFST + rich transcripts • Add all possible fillers to the transcription grammar
Experimental Results
Experimental Results • Use of the Lp-norm and N-best incorrect candidates
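The Lp-norm combination of N-best incorrect candidates mentioned above can be sketched with the standard MCE misclassification measure and sigmoid loss; this is a generic textbook form, not NTT's exact implementation, and the parameter names eta and gamma are the usual smoothing constants.

```python
import math

def mce_loss(g_correct, g_competitors, eta=2.0, gamma=1.0):
    """g_correct: log-score of the reference; g_competitors: log-scores of
    the N-best incorrect candidates. Returns the smoothed 0/1 loss."""
    n = len(g_competitors)
    # Lp/soft-max combination of the competitor scores: as eta grows this
    # approaches the single best incorrect candidate.
    anti = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in g_competitors) / n)
    d = -g_correct + anti                      # d > 0 means misclassified
    return 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid-smoothed loss
```

A well-classified utterance (reference far above all competitors) gives a loss near 0; a misclassified one gives a loss near 1, and the gradient flows through all N-best candidates rather than only the top one.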
Word Recognition • Word recognition corpus and experimental results • 154000 utterances (190 hrs) for training • 10 lecture speeches (130 minutes in total) for testing • 30k words in vocabulary • WFST • Trigram LM • 6138702 arcs • MCE training • Beam search with a unigram LM (about 3-5x real time) • 494845 arcs
Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition Bo Liu12, Hui Jiang3, Jian-Lai Zhou1, Ren-Hua Wang2 (1Microsoft Research Asia, 2University of Science and Technology of China, 3York University) ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Reference • A Dynamic In-Search Discriminative Training Approach for Large Vocabulary Speech Recognition • Hui Jiang, Olivier Siohan, Frank K. Soong, Chin-Hui Lee • Bell Labs, Lucent Technologies • ICASSP’02 – Discriminative Training in Speech Recognition session
Competing Token Collection • For each frame t • For each active word arc w • Perform a backtrace to obtain the partial path • HMM alignment • For each HMM m • Calculate the overlap rate • If overlap rate < threshold and Likelihood(m) < Likelihood(Ref), collect m as a competing token • End • End • End
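The inner test of the loop above can be sketched as follows. The segment representation, `overlap_rate` definition, and the threshold value are illustrative assumptions; the condition itself (low overlap with the reference and a likelihood below the reference) mirrors the slide.

```python
def overlap_rate(seg_a, seg_b):
    """Fraction of seg_a covered by seg_b; segments are (start, end) frames."""
    inter = max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    return inter / max(1, seg_a[1] - seg_a[0])

def collect_competing_tokens(hyp_segments, ref_segment, ref_loglik,
                             overlap_thresh=0.5):
    """hyp_segments: list of (model_id, (start, end), loglik) from the active
    word arcs. A segment becomes a competing token when it overlaps the
    reference segment too little AND scores below the reference."""
    tokens = []
    for model_id, seg, loglik in hyp_segments:
        if overlap_rate(seg, ref_segment) < overlap_thresh and loglik < ref_loglik:
            tokens.append((model_id, seg))
    return tokens

# Toy example: only "b" satisfies both conditions against the reference.
cts = collect_competing_tokens(
    [("a", (10, 20), -4.0), ("b", (0, 8), -10.0), ("c", (0, 8), -3.0)],
    ref_segment=(10, 20), ref_loglik=-5.0)
```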
Experimental Results • Corpus • DARPA Communicator task (Travel Reservation Application)
Introduction • Discriminative criterion at the phone level: the Least Phone Competing Tokens (LPCT) criterion • Defined over a speech segment O and a phone a • In terms of Competing Tokens (CT) and True Tokens (TT)
Off-line Token Collection • Discriminative criterion at the phone level • True Token (TT) • First, forced alignment is performed • Every segment in the reference is treated as a TT • Competing Token (CT) • Generate a word lattice • Annotate phone boundaries on each word arc • Choose whether each phone arc becomes a CT • 1. its maximum overlap with the same phone in the reference exceeds a threshold • 2. the log-likelihood difference exceeds a threshold • 3. add the phone arc (segment and phone id) to the CT set • LPCT = token collection + MCE/GPD
Least Phone Competing Tokens Criterion (LPCT) • Experimental Results • Resource Management database
Experimental Results • Switchboard database
Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition K. C. Sim and M. J. F. Gales University of Cambridge ICASSP’05 - Discriminative Training Presenter: Jen-Wei Kuo
Background for Precision Modeling • Problem • How to model correlations between feature dimensions without the number of parameters growing quadratically with the dimension • Common solution • A diagonal covariance approximation is employed • Structured precision matrix approximations (rank-1 bases, R = 1) • n = d: STC model • d < n <= d(d+1)/2: EMLLT model • General symmetric bases: SPAM model
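The three structured approximations named above are special cases of one basis superposition of the precision matrix; the formula below is the standard form from the precision-modelling literature, with symbols reconstructed rather than copied from the slide:

```latex
% Precision matrix of Gaussian component m as a weighted sum of n shared bases:
P_m = \Sigma_m^{-1} = \sum_{i=1}^{n} \lambda_m^{(i)} S_i
% S_i: globally shared symmetric basis matrices of rank R,
% \lambda_m^{(i)}: component-specific expansion coefficients.
% R = 1, n = d: STC;  R = 1, d < n \le d(d+1)/2: EMLLT;
% general symmetric S_i: SPAM.
```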
Research Progress