
ICASSP 2005 Survey: Discriminative Training (6 papers)


Presentation Transcript


1. ICASSP 2005 Survey: Discriminative Training (6 papers)
Presenter: Jen-Wei Kuo

2. Outline
• Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition – Cambridge University
• Discriminative Training of CDHMMs for Maximum Relative Separation Margin – York University
• Statistical Performance Analysis of MCE/GPD Learning in Gaussian Classifiers and Hidden Markov Models – BBN
• Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts – JHU
• Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers – NTT
• Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition – Microsoft

3. Discriminative Training of CDHMMs for Maximum Relative Separation Margin
Chaojun Liu, Hui Jiang, Xinwei Li (York University, Canada)
ICASSP’05 – Discriminative Training session
Presenter: Jen-Wei Kuo

4. Reference
• "Large Margin HMMs for Speech Recognition," Xinwei Li, Hui Jiang, Chaojun Liu, York University, Canada, ICASSP’05 – Speech and Audio Processing Applications session

5. Large Margin Estimation (LME) of HMM
• The margin constraint alone cannot guarantee that a solution exists (the problem that motivates the relative margin of slides 10–11); a reconstructed sketch of the criterion follows.
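
The slide's equations did not survive extraction. The formulation below is reconstructed from the companion paper cited on slide 4 ("Large Margin HMMs for Speech Recognition"), with F denoting the log-domain discriminant score of a model; it is a sketch of the standard LME criterion, not the slide verbatim.

```latex
% Separation margin of training token X_i with correct label W_i:
d(X_i) = \mathcal{F}(X_i \mid \lambda_{W_i}) - \max_{W \neq W_i} \mathcal{F}(X_i \mid \lambda_{W})

% LME maximizes the minimum margin over the support set S of
% correctly classified tokens with small positive margin:
\tilde{\lambda} = \arg\max_{\lambda}\; \min_{X_i \in S} d(X_i),
\qquad S = \{\, X_i : 0 \le d(X_i) \le \gamma \,\}
```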

6. Iterative Localized Optimization
• Step 1. Based on the current model, choose the support token that satisfies the above constraints and yields the minimum margin.
• Step 2. Update the model on that token using GPD.
• Step 3. If the convergence conditions are not met, go to Step 1.
A code sketch of this loop follows.
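
A minimal Python sketch of the loop above, assuming caller-supplied margin and GPD-update routines (the slides give neither); the names, defaults, and stopping rule are illustrative, not the authors' implementation.

```python
def train_lme(model, tokens, margin, gpd_update,
              gamma=1.0, max_iters=100, tol=1e-4):
    """Iterative localized optimization: repeatedly raise the
    minimum separation margin of the current support token."""
    prev = float("-inf")
    for _ in range(max_iters):
        # Step 1: support tokens have a small positive margin.
        support = [t for t in tokens if 0.0 <= margin(model, t) <= gamma]
        if not support:
            break
        worst = min(support, key=lambda t: margin(model, t))
        # Step 2: one GPD update on the minimum-margin token.
        model = gpd_update(model, worst)
        # Step 3: stop once the minimum margin no longer improves.
        cur = margin(model, worst)
        if abs(cur - prev) < tol:
            break
        prev = cur
    return model
```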

7. Experimental Results
• English E-set vocabulary of the OGI ISOLET database

8. Experimental Results

9. Experimental Results

10. Large Relative Margin Estimation (LRME) of HMM
• LRME replaces the absolute separation margin with a relative (normalized) margin, addressing the existence problem noted on slide 5. (The slide's equations were not captured in the transcript.)

11. Large Relative Margin Estimation (LRME) of HMM

12. Experimental Results
• English E-set vocabulary of the OGI ISOLET database and the alphabet set

13. Experimental Results

14. Experimental Results

15. Conclusion
• Main concepts
  • Criterion: Maximum Large Margin (LME) and Maximum Large Relative Margin (LRME)
  • Support token: an utterance with a relatively small positive margin

16. Discriminative Training of Acoustic Models Applied to Domains with Unreliable Transcripts
Lambert Mathias*, Girija Yegnanarayanan+, Juergen Fritsch+ (*JHU, +Multimodal Technologies, Inc.)
ICASSP’05 – Discriminative Training session
Presenter: Jen-Wei Kuo

17. Introduction
• This paper presents a method for automatically generating training transcripts from medical reports.
• Medical domain
  • An essentially unlimited amount of speech data is available for each speaker.
  • The speech has no verbatim transcripts, only final reports.
• Medical final reports
  • Produced by physicians and other healthcare professionals
  • Grammatical errors corrected
  • Disfluencies and repetitions removed
  • Non-dictated sentence and paragraph boundaries added
  • Dictated paragraphs reordered
  • Can still be exploited as an information source for generating training transcripts

18. Introduction
• Central idea of the paper
  • Step 1. Transform the reports into spoken-form transcripts (Partially Reliable Transcripts, PRT).
  • Step 2. Identify reliable regions in the transcripts.
  • Step 3. Apply ML/MMI acoustic training.
    • A frame-based filtering approach is proposed for lattice-based MMI.
  • Step 4. The results show that MMI outperforms ML.

19. Partially Reliable Transcripts
• Step 1. Normalize the medical reports to a common format.
• Step 2. Generate a report-specific FSG for each of the available medical reports.
• Step 3. Use the normalized medical reports to train an LM.
• Step 4. Generate orthographic transcripts using the LM and the best available AM.
• Step 5. Annotate the orthographic transcripts by aligning them against the corresponding report-specific FSGs.
• Step 6. Parse the orthographic transcripts with the report-specific FSG, using a robust parser that allows insertions (INS), deletions (DEL), and substitutions (SUB).
• Step 7. If a word is an INS, DEL, or SUB, mark the frames of its underlying phone sequence as "unreliable"; otherwise mark them "reliable".
• Step 8. Use the reliable segments to retrain the AMs.
• Step 9. Go to Step 4.
A sketch of the frame-marking step (Step 7) follows this list.
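
A minimal sketch of Step 7's frame marking, assuming each aligned word carries its frame span and an edit label from the robust parser; the data shapes are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    word: str
    start_frame: int   # inclusive
    end_frame: int     # exclusive (DEL carries no hypothesis frames,
                       # so start_frame == end_frame for deletions)
    edit: str          # "MATCH", "INS", "DEL", or "SUB" from the parser

def mark_reliability(words, num_frames):
    """Return a per-frame mask: True = reliable, False = unreliable."""
    mask = [True] * num_frames
    for w in words:
        if w.edit in ("INS", "DEL", "SUB"):
            # Frames under a mismatched word are unreliable (Step 7).
            for t in range(w.start_frame, w.end_frame):
                mask[t] = False
    return mask
```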

20. MMI Training with Frame Filtering
• Approach 1
  • Step 1. Mark each arc in the MMI training lattices as RELIABLE or UNRELIABLE.
  • Step 2. Accumulate counts (numerator and denominator) only on the RELIABLE arcs.
• Approach 2 (frame filtering)
  • Step 1. Mark each frame as "reliable" or "unreliable".
  • Step 2. This allows partially reliable words to be included in training.
A sketch of frame-filtered accumulation follows this list.
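
A hedged sketch of Approach 2: during lattice-based MMI accumulation, each frame's statistics are gated by the reliability mask, so a word that is only partly unreliable still contributes its reliable frames. The accumulator shape and arc fields are assumptions, not the paper's code.

```python
from collections import defaultdict

def accumulate_mmi(arcs, posteriors, feats, reliable):
    """Frame-filtered MMI accumulation (sketch).
    arcs:       list of (pdf_id, start_frame, end_frame, is_numerator)
    posteriors: per-arc occupancy posterior, same length as arcs
    feats:      one feature value (or vector) per frame
    reliable:   per-frame boolean mask from the PRT alignment
    Returns (numerator, denominator) zeroth/first-order stats per pdf."""
    num = defaultdict(lambda: [0.0, 0.0])  # pdf_id -> [count, feat_sum]
    den = defaultdict(lambda: [0.0, 0.0])
    for (pdf, t0, t1, is_num), gamma in zip(arcs, posteriors):
        acc = num if is_num else den
        for t in range(t0, t1):
            if not reliable[t]:
                continue  # gate out unreliable frames
            acc[pdf][0] += gamma
            acc[pdf][1] += gamma * feats[t]
    return num, den
```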

21. Experimental Results

22. Experimental Results

23. Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers
Erik McDermott and Shigeru Katagiri (NTT Communication Science Laboratories)
ICASSP’05 – Discriminative Training session
Presenter: Jen-Wei Kuo

24. Introduction
• Special features of this paper
  • MCE training with Quickprop optimization
  • SOLON, NTT's WFST-based recognizer: uses a time-synchronous beam search strategy and has been applied to LMs with vocabularies of up to 1.8 million words
  • Context-dependent model design using decision trees
  • Corpus of Spontaneous Japanese (CSJ) lecture-speech transcription task (about 190 hours)
  • Name recognition with 22k names
  • Word recognition with a 30k-word vocabulary
A sketch of the Quickprop update rule follows this list.
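
The slides name Quickprop but give no details. As background, Fahlman's Quickprop rule fits a per-parameter parabola to the two most recent gradients and steps toward its minimum. The sketch below follows that standard published rule, not necessarily the paper's exact variant; the lr and mu defaults are illustrative.

```python
def quickprop_step(w, grad, prev_grad, prev_step, mu=1.75, lr=0.01):
    """One per-parameter Quickprop update (scalar sketch).
    grad/prev_grad: dE/dw at the current and previous iterations;
    prev_step: the previously applied update; mu: growth limit."""
    if prev_step != 0.0 and prev_grad != grad:
        # Secant step toward the minimum of the fitted parabola.
        step = grad / (prev_grad - grad) * prev_step
        # Limit step growth relative to the previous step.
        limit = mu * abs(prev_step)
        step = max(-limit, min(limit, step))
    else:
        # First iteration (or flat gradient): plain gradient descent.
        step = -lr * grad
    return w + step, step
```

In the paper this optimizer drives the smoothed MCE loss; only the per-parameter update is shown here.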

25. Corpus for Name Recognition
• Name recognition (40 hours from CSJ)
  • 35,500 utterances (39 hours) for training, containing 22,320 names (16,547 family names and 5,744 given names)
  • 6,428 utterances for testing (containing OOVs)
• WFST
  • Weight pushing and network optimization applied
  • 489,756 nodes; 1,349,430 arcs

26. WFST Recognizer
• Four strategies for generating denominator statistics for MCE training
  • Triphone loop
    • Similar to free syllable recognition in Mandarin
    • Bigram triphone LM
  • Full-WFST LM + flat transcripts
    • Full 22k LM (22,320 names in the vocabulary)
    • The transcription is represented as a WFST obtained by composing the full WFST with the transcribed word sequence (see the sketch after this list)
  • Lattice-WFST + flat transcripts
    • The lattice is first generated by the MLE-trained model
    • Faster than the full WFST (about 800 arcs on average vs. 1,349,430 arcs)
  • Lattice-WFST + rich transcripts
    • All possible fillers are added to the transcription grammar
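
A toy sketch of restricting a recognition network to the transcribed word sequence via a product construction. This is acceptor-level composition only, assuming no epsilon transitions and no output labels; real toolkits such as OpenFst handle both, and the function and data layout here are illustrative, not SOLON's API.

```python
from collections import defaultdict, deque

def linear_acceptor(words):
    """Linear automaton for the transcript: state i --words[i]--> i+1."""
    return {(i, w): i + 1 for i, w in enumerate(words)}, len(words)

def compose_with_transcript(arcs, start, finals, words):
    """Product construction (sketch): keep only paths of the network
    labeled with the transcript word sequence.
    arcs: dict state -> list of (label, weight, next_state)."""
    lin, lin_final = linear_acceptor(words)
    out = defaultdict(list)
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        s, i = queue.popleft()
        for label, weight, ns in arcs.get(s, []):
            ni = lin.get((i, label))
            if ni is None:
                continue  # transcript does not allow this label here
            out[(s, i)].append((label, weight, (ns, ni)))
            if (ns, ni) not in seen:
                seen.add((ns, ni))
                queue.append((ns, ni))
    accepting = {(f, lin_final) for f in finals}
    return out, (start, 0), accepting
```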

27. Experimental Results

28. Experimental Results

29. Experimental Results
• Use of the Lp norm and N-best incorrect candidates

30. Word Recognition
• Word recognition corpus and experimental results
  • 154,000 utterances (190 hours) for training
  • 10 lecture speeches, 130 minutes in total
  • 30k-word vocabulary
• WFST
  • Trigram LM; 6,138,702 arcs
• MCE training
  • Beam search with a unigram LM (about 3–5× real time); 494,845 arcs

31. Discriminative Training based on the Criterion of Least Phone Competing Tokens for Large Vocabulary Speech Recognition
Bo Liu¹², Hui Jiang³, Jian-Lai Zhou¹, Ren-Hua Wang²
¹Microsoft Research Asia, ²University of Science and Technology of China, ³York University
ICASSP’05 – Discriminative Training session
Presenter: Jen-Wei Kuo

32. Reference
• "A Dynamic In-Search Discriminative Training Approach for Large Vocabulary Speech Recognition," Hui Jiang, Olivier Siohan, Frank K. Soong, Chin-Hui Lee, Bell Labs, Lucent Technologies, ICASSP’02 – Discriminative Training in Speech Recognition session

33. Competing Token Collection
• For each frame t:
  • For each active word arc w:
    • Backtrace to obtain the partial path
    • HMM alignment
    • For each HMM m:
      • Calculate the overlap rate
      • If overlap rate < threshold and Likelihood(m) < Likelihood(Ref), collect m as a competing token
A runnable sketch of this loop follows this list.
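
A hedged, runnable restatement of the in-search collection loop above. The data structures (active arcs with a backtrace method, the reference alignment, likelihood fields) are placeholders for the decoder internals the slide assumes, and the test mirrors the slide's stated condition.

```python
def collect_competing_tokens(frames, reference, overlap_thresh):
    """In-search competing-token collection (sketch of slide 33).
    frames: for each frame t, the list of active word arcs; each arc
    exposes backtrace_alignment() -> [(hmm_id, start, end, loglik)].
    reference: forced-alignment segments (hmm_id, start, end, loglik)."""
    competing = []
    for active_arcs in frames:
        for arc in active_arcs:
            for hmm, start, end, loglik in arc.backtrace_alignment():
                ref = best_overlap(reference, start, end)
                if ref is None:
                    continue
                ov = overlap((start, end), (ref[1], ref[2]))
                # Slide's test: low overlap with the reference segment
                # and likelihood below the reference's.
                if ov < overlap_thresh and loglik < ref[3]:
                    competing.append((hmm, start, end))
    return competing

def best_overlap(reference, start, end):
    """Reference segment overlapping [start, end) the most, if any."""
    scored = [(overlap((start, end), (s, e)), (h, s, e, ll))
              for h, s, e, ll in reference]
    return max(scored, key=lambda x: x[0])[1] if scored else None

def overlap(a, b):
    """Fraction of span a covered by span b."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / max(1, a[1] - a[0])
```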

34. Experimental Results
• Corpus: DARPA Communicator task (travel reservation application)

35. Introduction
• Discriminative criterion at the phone level: the Least Phone Competing Tokens (LPCT) criterion
• Given a speech segment O and a phone a, the criterion contrasts:
  • Competing Tokens (CT)
  • True Tokens (TT)
(See slide 36 for the operational definitions.)

36. Off-line Token Collection
• Discriminative criterion at the phone level
• True Tokens (TT)
  • First, forced alignment is performed.
  • Every segment in the reference is treated as a TT.
• Competing Tokens (CT)
  • Generate a word lattice.
  • Annotate phone boundaries on each word arc.
  • Choose whether each phone arc is a CT:
    • 1. Max overlap with the same phone in the reference > threshold
    • 2. Log-likelihood difference > threshold
    • 3. If so, add the phone arc (segment and phone id) to the CT set
• LPCT = token collection + MCE/GPD
A sketch of the CT test follows this list.
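
A hedged restatement of the off-line collection recipe, mirroring the slide's stated tests literally; note the transcript leaves the inequality directions and the sign of the likelihood difference ambiguous, so this is one plausible reading rather than the paper's verified logic.

```python
def collect_tokens(ref_segments, lattice_phone_arcs,
                   overlap_thresh, loglik_thresh):
    """Off-line token collection (sketch following slide 36's wording).
    ref_segments / lattice_phone_arcs: (phone, start, end, loglik).
    Every reference segment is a True Token; a lattice phone arc
    becomes a Competing Token when both thresholded tests pass."""
    true_tokens = list(ref_segments)
    competing = []
    for phone, start, end, loglik in lattice_phone_arcs:
        # Test 1: maximum overlap with same-phone reference segments.
        same = [(s, e, ll) for p, s, e, ll in ref_segments if p == phone]
        if not same:
            continue
        ov, ref_ll = max(((overlap((start, end), (s, e)), ll)
                          for s, e, ll in same), key=lambda x: x[0])
        # Test 2: log-likelihood difference (taken here as arc minus
        # reference; the transcript does not fix the sign).
        if ov > overlap_thresh and (loglik - ref_ll) > loglik_thresh:
            competing.append((phone, start, end))
    return true_tokens, competing

def overlap(a, b):
    """Fraction of span a covered by span b."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / max(1, a[1] - a[0])
```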

37. Least Phone Competing Tokens Criterion (LPCT)
• Experimental results: Resource Management database

38. Least Phone Competing Tokens Criterion (LPCT)

39. Experimental Results
• Switchboard database

40. Experimental Results

41. Adaptation of Precision Matrix Models on Large Vocabulary Continuous Speech Recognition
K. C. Sim and M. J. F. Gales (University of Cambridge)
ICASSP’05 – Discriminative Training session
Presenter: Jen-Wei Kuo

42. Background for Precision Modeling
• Problem
  • How to model correlations in the features as the dimensionality increases
• Solutions
  • Approximate diagonal covariance matrices (the common baseline)
  • Structured precision matrix approximations → SPAM model
    • With rank-one bases (R = 1):
      • n = d → STC model
      • d < n ≤ d(d+1)/2 → EMLLT model
A sketch of the shared basis expansion follows this list.
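
The slide's formulas were lost in extraction. The following reconstructs the standard basis-superposition form behind the SPAM/STC/EMLLT hierarchy from the precision-matrix modeling literature; the symbols follow that literature rather than the slide verbatim.

```latex
% Precision matrix of Gaussian component m as a superposition of
% n shared symmetric basis matrices S_i with per-component weights:
P_m = \Sigma_m^{-1} = \sum_{i=1}^{n} \lambda_m^{(i)} S_i

% Special cases via rank-one bases S_i = a_i a_i^{\top} (R = 1):
%   n = d                -> STC (semi-tied covariance)
%   d < n \le d(d+1)/2   -> EMLLT
% General symmetric S_i  -> SPAM
```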

43. Research Progress
