  1. A Tutorial of HMM Tool Kit (HTK) Hongbing Hu Electrical and Computer Engineering Binghamton University 12/02/2008

  2. Outline • The Fundamentals of Speech Recognition • An Overview of HTK • A Tutorial Example of Using HTK • Further Topics on HTK

  3. The Fundamentals of Speech Recognition • Speech Recognition Architecture • Bayes Rule • Hidden Markov Models (HMMs) • Gaussian Mixture Models • Viterbi Decoding • Continuous Speech Recognition

  4. Speech Recognition Architecture • Speech Waveform → Feature Extraction → Speech Features → Classification (Recognition) by the recognizer (HMM/NN) → Phonemes (e.g. ai n iy d sil e) → Words (e.g. "I need a")

  5. Bayes Rule • P(wi | O) = P(O | wi) P(wi) / P(O), i.e. Posterior = (Likelihood × Prior) / Evidence • Prior: the probability of the recognition unit (word, phoneme, ...), determined by language modeling • Evidence: a scaling factor determined by the prior and likelihood which guarantees that the posteriors sum to 1 • Likelihood: the likelihood of a sequence of speech vectors O given a candidate unit wi; it can be estimated by HMMs

  6. Hidden Markov Models (HMMs) • The speech vectors of a unit are assumed to be generated by a Markov model • The overall probability is calculated as the product of the transition and output probabilities • The likelihood can be approximated by considering only the most likely state sequence
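
In standard HMM notation (as used in the HTK Book), with transition probabilities a_ij and output distributions b_j(·), this product over an unknown state sequence X = x(1),...,x(T) gives

P(O \mid M) = \sum_{X} a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}

and the likelihood is approximated by keeping only the most likely state sequence:

\hat{P}(O \mid M) = \max_{X} \Big\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \Big\}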

  7. Gaussian Mixture Models • Output distributions are represented by Gaussian mixture models • Each observation vector can be split into a number of independent data streams (S), each modelled by a mixture of Ms Gaussian components • Each mixture component is a multivariate Gaussian
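
Concretely, following the HTK Book's notation, the output distribution of state j is

b_j(o_t) = \prod_{s=1}^{S} \Big[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st};\, \mu_{jsm}, \Sigma_{jsm}) \Big]^{\gamma_s}

where c_{jsm} are the mixture weights, \mathcal{N}(\cdot;\mu,\Sigma) is a multivariate Gaussian density, and \gamma_s is the stream weight.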

  8. Viterbi Decoding • Find the best path through a matrix representing the states of the HMM and the frames of speech
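
The search uses the standard Viterbi recursion: writing \phi_j(t) for the maximum likelihood of observing o_1,...,o_t and being in state j at time t,

\phi_j(t) = \max_i \big\{ \phi_i(t-1)\, a_{ij} \big\}\, b_j(o_t)

computed in practice with log probabilities to avoid numerical underflow.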

  9. Continuous Speech Recognition • Difficulties • Training data must consist of continuous utterances • The boundaries between units will not be known • Possible solutions • Label the training data by hand • Embedded training: construct a composite HMM by joining the HMMs corresponding to the units of the training utterance

  10. An Overview of HTK • HTK: a toolkit for building Hidden Markov Models • HTK is primarily used for speech recognition research • HTK has also been used for numerous other applications, including research into speech synthesis, character recognition and DNA sequencing • HTK was originally developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED) • Microsoft acquired HTK in 1999, but the toolkit is still freely available in source form

  11. Features of HTK • A widely recognized, state-of-the-art speech recognition toolkit • Supports a variety of audio file formats, e.g. pcm, wav, ..., ALIEN (unknown), etc. • Feature extraction: MFCC, filterbank, PLP, LPC, ..., etc. • Very flexible HMM definition • Training: Viterbi (segmental) training and Forward/Backward (Baum-Welch) re-estimation

  12. Software Architecture • Much of the functionality is built into the library modules • HTK tools are designed to run with a traditional command line style interface • Configuration files control detailed behaviors of tools

  13. HTK Processing Stages • Data Preparation • Training • Testing/Recognition • Analysis

  14. Data Preparation Phase • A set of speech data files and their associated transcriptions are required • Convert the speech data files into an appropriate acoustic feature format (MFCC, LPC) [HCopy, HList] • Convert the associated transcriptions into an appropriate format which consists of the required phone or word labels [HLEd]

  15. HSLab • If the speech needs to be recorded from scratch, HSLab can be used to record it and to manually annotate it with any required transcriptions

  16. HCopy • Parameterize the speech waveforms into a variety of acoustic feature formats by setting the appropriate configuration variables • MFCC: Mel-Frequency Cepstral Coefficients • LPC: Linear Prediction filter Coefficients • LPCEPSTRA: LPC cepstral coefficients • DISCRETE: Vector quantized data

  17. Other Data Preparation Tools • HList • Check the contents of any speech file and the results of any conversions before processing large quantities of speech data • HLEd • A script-driven text editor used to make the required transformations to label files, for example the generation of context-dependent label files • HLStats • Gather and display statistical information on the label files • HQuant • Build a VQ codebook in preparation for building discrete-probability HMM systems

  18. Training Phase • Prototype HMMs • Define the topology required for each HMM by writing a prototype definition • HTK allows HMMs to be built with any desired topology • HMM definitions are stored as simple text files • The actual parameter values in the prototype (means and variances) are ignored; only the topology and the transition probabilities matter

  19. Training Procedure • Isolated Word Training [HInit, HRest] • Flat Start Training [HCompV] • Embedded Training [HERest]

  20. Isolated Word Training • HInit • Reads in all of the bootstrap training data and cuts out all of the examples of a specific phone • Iteratively computes an initial set of parameter values using the segmental k-means training procedure • On the first iteration cycle, the training data are uniformly segmented with respect to the model's state sequence, and means and variances are estimated • On successive iteration cycles, the uniform segmentation is replaced by Viterbi alignment • HRest • Re-estimates the parameters initially computed by HInit • The Baum-Welch re-estimation procedure is used, instead of the segmental k-means procedure used by HInit

  21. Flat Start Training • If the training speech files do not come with sub-word-level boundary information, a flat-start training scheme can be used • HCompV (see the example command below) • All of the phone models are initialized to be identical, with state means and variances set equal to the global speech mean and variance
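
A typical HCompV invocation for flat-start initialization looks like the following (a sketch; the file names config, train.scp and proto and the output directory hmm0 are illustrative, matching the tutorial example later in these slides):

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

Here -m updates the means as well as the variances, and -f 0.01 additionally writes a variance floor macro set to 1% of the global variance.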

  22. Embedded Training • HERest • Performs a single Baum-Welch re-estimation of the whole set of HMMs simultaneously (see the example command below) • The corresponding phone models are concatenated and the forward/backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc. • The accumulated statistics are then used to re-estimate the HMM parameters
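
A typical HERest command looks like this (a sketch; file names such as phones0.mlf, train.scp and the hmm0/hmm1 directories are illustrative):

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
  -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0

-I names the MLF holding the phone-level transcriptions, -t sets the forward/backward pruning thresholds, and -M writes the re-estimated model set to a new directory; the command is normally repeated two or three times per training stage.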

  23. Recognition Phase • HVite • Performs Viterbi-based speech recognition • Dictionary: defines how each word is pronounced • Word networks: either simple word loops in which any word can follow any other word, or directed graphs representing a finite-state task grammar (HBuild and HParse are supplied to create the word networks) • Supports cross-word triphones and can run with multiple tokens to generate lattices containing multiple hypotheses • Can be configured to rescore lattices and perform forced alignments

  24. Analysis Phase • Evaluate the recognizer performance by comparing the recognition results with correct reference transcriptions • HResults • Performs this comparison using dynamic programming to align the recognized and reference transcriptions

  25. HResults • Provides scoring options compatible with those used in the US NIST scoring package • Provides confusion matrices, speaker-by-speaker breakdowns and time-aligned transcriptions

  26. A Tutorial Example • A voice-operated interface for phone dialing • All of the speech is recorded from scratch • A task grammar and a pronunciation dictionary are defined • Example utterances: "Dial three three two six five four", "Dial nine zero four one oh nine", "Phone Woodland", "Call Steve Young"

  27. Processing Steps • Data Preparation • Step 1 – the Task Grammar • Step 2 – the Dictionary • Step 3 – Recording the Data • Step 4 – Creating the Transcription Files • Step 5 – Coding the Data • Training HMMs • Step 6 – Creating Flat Start Monophones • Recognition • Step 7 – Recognizing the Test Data • Step 8 – Evaluating the Performance

  28. Step 1 – the Task Grammar • Simple task grammars are specified with the HTK grammar definition language and stored in a text file (gram); see the sketch below • This high-level representation is provided for user convenience
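
For the dialing task, the grammar file might look roughly like this (a sketch in the HTK grammar definition language, following the style of the HTK Book tutorial; the word lists are shortened):

$digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
$name  = WOODLAND | [ STEVE ] YOUNG;
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name ) SENT-END )

Here | separates alternatives, [ ] marks optional items, < > denotes one or more repetitions, and $ introduces a sub-grammar variable.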

  29. Step 1 – the Task Grammar (cont.) • The actual HTK recognizer requires a lower-level notation, the HTK Standard Lattice Format (SLF) • In an SLF word network, each word instance and each word-to-word transition is listed explicitly
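
The conversion from the high-level grammar to an SLF word network is done with HParse:

HParse gram wdnet

where gram is the grammar file and wdnet is the resulting word network used later by HVite.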

  30. Step 2 – the Dictionary • Function words such as A and TO have multiple pronunciations • The sentence-boundary entries (SENT-START, SENT-END) have the silence model sil as their pronunciation and a null output symbol • The result is stored in the dictionary file (dict); a few example entries are sketched below
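
A few illustrative entries in HTK dictionary format (a sketch; the pronunciations follow the style of the HTK Book tutorial):

A            ah sp
A            ax sp
CALL         k ao l sp
DIAL         d ay ah l sp
PHONE        f ow n sp
SENT-END     []  sil
SENT-START   []  sil

Each line gives a word, an optional output symbol in square brackets, and its pronunciation; sp is a short-pause model appended to every word.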

  31. Step 3 – Recording the Data • Generate prompt sentences from the task grammar (testprompts) • Record the speech
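
The prompts can be generated at random from the word network with HSGen, for example (a sketch using the file names from these slides and the HTK Book tutorial):

HSGen -l -n 200 wdnet dict > testprompts

-n 200 generates 200 sentences and -l numbers each output line; the prompts are then read aloud and recorded, e.g. with HSLab.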

  32. Step 4 – Creating the Transcription Files • Create word-level transcriptions for the recorded utterances and convert them to phone-level transcriptions (see the sketch below)
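
Assuming the word-level transcriptions have been collected into a master label file (e.g. words.mlf), the phone-level transcription can be produced with HLEd and an edit script (the names mkphones0.led, words.mlf and phones0.mlf are illustrative, following the HTK Book tutorial):

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf

-d dict expands each word into its pronunciation from the dictionary, and the edit script typically inserts sil at the sentence boundaries and deletes the optional short-pause (sp) labels for this first pass.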

  33. Step 5 – Coding the Data • Coding can be performed using the tool HCopy, configured to automatically convert its input into feature (MFCC) vectors • The conversion is controlled by a configuration file (config); a sketch is shown below
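
A typical HCopy configuration for MFCC_0 coding might contain entries like the following (a sketch with commonly used values; the exact settings depend on the data):

TARGETKIND     = MFCC_0
TARGETRATE     = 100000.0
WINDOWSIZE     = 250000.0
USEHAMMING     = T
PREEMCOEF      = 0.97
NUMCHANS       = 26
CEPLIFTER      = 22
NUMCEPS        = 12
SAVECOMPRESSED = T
SAVEWITHCRC    = T

TARGETRATE and WINDOWSIZE are given in units of 100 ns, i.e. a 10 ms frame shift and a 25 ms analysis window.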

  34. Step 5 – Coding the Data (cont.) • A script file (codetr.scp) lists each source waveform together with the name of the feature file to be produced
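
The script file and the coding command look roughly like this (a sketch; the paths are illustrative):

data/train/S0001.wav  data/train/S0001.mfc
data/train/S0002.wav  data/train/S0002.mfc
...

HCopy -T 1 -C config -S codetr.scp

-T 1 turns on basic tracing so that each converted file is reported.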

  35. Step 6 – Creating Flat Start Monophones • Define Prototype HMMs • 3-state left-right topology with no skips • Proto HMM files are stored individually (hmm0) • Master Macro File (MMF) • The same prototype definition is repeated for each monophone (e.g. ~h "w", ~h "n", ~h "ah"):

~o <VecSize> 39 <MFCC_0>
~h "w"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  ......
  <State> 4
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>

  36. Step 6 – Creating Monophones (cont.)

  37. Step 7 – Recognizing the Test Data • Viterbi-based Speech Recognition (HVite)

HVite -H hmm0/macros -H hmm0/hmmdefs -S test.scp \
  -l '*' -i recout.mlf -w wdnet \
  -p 0.0 -s 5.0 dict monphones0

  38. Step 8 – Evaluating the Performance • SENT: sentence-level statistics • WORD: word-level statistics

HResults -I testref.mlf monphones0 recout.mlf

  39. Further Topics on HTK • HMM training with labeled speech databases • The TIMIT Database: a total of 6300 sentences (630 speakers) from the 8 major dialect regions of the U.S. • Speech data and labeled transcriptions are provided • Tied-State Triphone HMMs

  40. HMM Training with TIMIT • Data Preparation • Step 1 – the Grammar • Step 2 – the Dictionary • Step 3 – Recording the Data • Step 4 – Creating the Transcription Files • Step 5 – Coding the Data • Training HMMs • Step 6 – Creating Monophones • Recognition • Step 7 – Recognizing the Test Data • Step 8 – Evaluating the Performance • Note: the speech data and labeled transcriptions are already provided by TIMIT

  41. Step 1 – the Grammar • HLStats • A bigram language model can be built from the training label files • HBuild • Constructs word-loop and word-pair grammars automatically • Can incorporate a statistical bigram language model into a network

HLStats -b stats -o wordlist train.mlf
HBuild -n stats wordlist wdnet

Excerpt of the resulting bigram statistics file (stats), in ARPA n-gram format:

\data\
ngram 1=50
ngram 2=1694
\1-grams:
-99.999 !ENTER -3.8430
-1.8048 aa     -1.4168
-1.7979 ae     -1.6101
......
\2-grams:
-0.0001 !ENTER si
-2.8093 aa aa
-2.8093 aa ao
......

  42. Step 2 – the Dictionary • Phone-Level Dictionary • Create a phone-level dictionary for phonetic recognition • The recognition output is phonemes instead of words

Excerpt of the dictionary (dict), in which each phone maps onto itself:

sil sil
iy  iy
v   v
ih  ih
n   n
eh  eh
ix  ix
f   f
......
y   y
en  en
zh  zh
!ENTER [] sil
!EXIT  [] sil

  43. Step 6 – Creating Monophones • Isolated Word Training • Initialization with HInit • Re-estimate the parameters with HRest

HInit -S trainlist -I train.mlf -M hmm0 proto (for each phone: sil, ...)
HRest -S trainlist -I train.mlf -u tmvw -w 3 -i 20 -M hmm1 hmm0 (for each phone: sil, ...)

  44. Triphone HMMs • Context-dependent Triphone HMMs • A powerful subword modeling technique because triphones account for the left and right phonetic contexts
Text:      men need ...
Monophone: sil m ae n sp n iy d ...
Triphone:  sil m+ae m-ae+n ae-n sp n+iy n-iy+d iy-d ...

  45. Convert the monophone transcriptions to triphone transcriptions • Triphones can be made by simply cloning monophones and then re-estimating them using triphone transcriptions • The conversion is driven by two edit scripts: mktri.led (for HLEd) and mktri.hed (for HHEd); see the sketch below
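
The two steps look roughly like this (a sketch following the HTK Book tutorial; the transcription and model-list names such as aligned.mlf, wintri.mlf, triphones1 and the hmm directory numbers are illustrative):

HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1

HLEd rewrites the monophone transcriptions as triphones and, via -n, writes out the list of distinct triphones seen; HHEd then clones the monophone models for every triphone in that list (the CL command in mktri.hed) and ties their transition matrices (TI commands).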

  46. Tied-State Triphone HMMs • Triphone Clustering • There are too many models for the amount of training data available, so some models do not have sufficient data for training • Many triphone contexts are very similar • Decision Tree State Tying (see the sketch below)
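
State tying is performed with HHEd driven by an edit script; a much-abbreviated sketch of such a script (tree.hed) and its invocation, following the HTK Book (the question set, thresholds and file names are illustrative):

RO 100.0 stats
TR 0
QS "R_Nasal" { *+m, *+n, *+ng }
QS "L_Nasal" { m-*, n-*, ng-* }
TB 350.0 "aa_s2" {(aa, *-aa, aa+*, *-aa+*).state[2]}
AU "fulllist"
CO "tiedlist"
ST "trees"

HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1

RO sets the outlier threshold from the state-occupation statistics, QS defines phonetic questions, TB performs the tree-based clustering of the named states, AU synthesizes models for unseen triphones, and CO compacts the model set into the final tied list.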

  47. References • “The HTK Book,” Steve Young, et al., Dec. 2006 • http://htk.eng.cam.ac.uk/docs/docs.shtml • “Introduction to HTK Toolkit,” Berlin Chen, 2004 • http://140.122.185.120/PastCourses/2004-TCFST-Audio%20and%20Speech%20Recognition/Slides/SP2004F_Lecture06_Acoustic%20Modeling-HTK%20Tutorial.pdf • HTK Download • http://htk.eng.cam.ac.uk/download.shtml (User registration required)
