A Tutorial of HMM Tool Kit (HTK) Hongbing Hu Electrical and Computer Engineering Binghamton University 12/02/2008
Outline • The Fundamentals of Speech Recognition • An Overview of HTK • A Tutorial Example of Using HTK • Further Topics on HTK
The Fundamentals of Speech Recognition • Speech Recognition Architecture • Bayes Rule • Hidden Markov Models (HMMs) • Gaussian Mixture Models • Viterbi Decoding • Continuous Speech Recognition
Speech Recognition Architecture
• Pipeline: speech waveform → feature extraction → speech features → classification (recognition) by the recognizer (HMM/NN)
• The recognizer output is a phoneme sequence (e.g., sil ai n iy d ...) and finally a word sequence (e.g., "I need a ...")
Bayes Rule

$P(w_i \mid O) = \dfrac{P(O \mid w_i)\, P(w_i)}{P(O)}$   (Posterior = Likelihood × Prior / Evidence)

• Prior P(w_i): probability of the recognition unit (word, phoneme, ...), determined by language modeling
• Evidence P(O): a scale factor, determined by the priors and likelihoods, that guarantees the posteriors sum to 1
• Likelihood P(O|w_i): the likelihood of the sequence of speech vectors O given a candidate unit w_i; it can be estimated by HMMs
Hidden Markov Models (HMMs)
• The speech vectors of a unit are assumed to be generated by a Markov model
• The overall probability is calculated as the product of the transition and output probabilities (see the formulation below)
• The likelihood can be approximated by considering only the most likely state sequence
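As a sketch of the standard formulation (notation as in the HTK Book), the joint probability of an observation sequence O = o_1 ... o_T and a state sequence X through model M is

$P(O, X \mid M) = a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)}$

where $a_{ij}$ are the transition probabilities and $b_j(o_t)$ the output probabilities. The approximation above replaces the sum over all state sequences by the single best one: $\hat{P}(O \mid M) = \max_X P(O, X \mid M)$.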
Gaussian Mixture Models
• Output distributions are represented by Gaussian mixture models
• Each observation vector can be split into a number of independent data streams (S), each modeled by a mixture of M_s components
• The component densities are multivariate Gaussians (see the formulation below)
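In the HTK Book's notation (a sketch of the standard formulation), the output distribution of state j is a product over the S streams of Gaussian mixtures:

$b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st};\, \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s}$

where $c_{jsm}$ are the mixture weights, $\gamma_s$ the stream weights, and $\mathcal{N}(\cdot;\mu,\Sigma)$ denotes a multivariate Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$.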
Viterbi Decoding • Find the best path through a matrix representing the states of the HMM and the frames of speech
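The search can be sketched as the standard Viterbi recursion: with $\phi_j(t)$ denoting the maximum likelihood of observing vectors $o_1$ to $o_t$ and being in state j at time t,

$\phi_j(t) = \max_i \left\{ \phi_i(t-1)\, a_{ij} \right\} b_j(o_t)$

and the best path is recovered by backtracking from the final state.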
Continuous Speech Recognition
• Difficulties
  • Training data must consist of continuous utterances
  • The boundaries between units are not known
• One solution: label the training data by hand
• Embedded training
  • Construct a composite HMM by joining the HMMs corresponding to the units of each training utterance
An Overview of HTK
• HTK: a toolkit for building hidden Markov models
• HTK is primarily used for speech recognition research
• HTK has also been used for numerous other applications, including research into speech synthesis, character recognition and DNA sequencing
• HTK was originally developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department (CUED)
• Microsoft acquired the HTK license in 1999 but still makes the toolkit freely available in source form
Features of HTK
• A widely recognized, state-of-the-art speech recognition toolkit
• Supports a variety of audio file formats
  • e.g., PCM, WAV, ..., ALIEN (unknown), etc.
• Feature extraction
  • MFCC, filterbank, PLP, LPC, ..., etc.
• Very flexible HMM definition
• Training
  • Viterbi (segmentation)
  • Forward/backward (Baum-Welch)
Software Architecture • Much of the functionality is built into the library modules • HTK tools are designed to run with a traditional command line style interface • Configuration files control detailed behaviors of tools
HTK Processing Stages • Data Preparation • Training • Testing/Recognition • Analysis
Data Preparation Phase • A set of speech data files and their associated transcriptions are required • Convert the speech data files into an appropriate acoustic feature format (MFCC, LPC) [HCopy, HList] • Convert the associated transcriptions into an appropriate format which consists of the required phone or word labels [HLEd]
HSLAB
• If the speech needs to be recorded from scratch, HSLab can be used to record it and to manually annotate it with the required transcriptions
HCopy
• Parameterize the speech waveforms into a variety of acoustic feature formats by setting the appropriate configuration variables
  • MFCC: Mel-frequency cepstral coefficients
  • LPC: linear prediction filter coefficients
  • LPCEPSTRA: LPC cepstral coefficients
  • DISCRETE: vector quantized data
Other Data Preparation Tools
• HList
  • Check the contents of any speech file and the results of any conversions before processing large quantities of speech data
• HLEd
  • A script-driven text editor used to make the required transformations to label files, for example, the generation of context-dependent label files
• HLStats
  • Gather and display statistical information on the label files
• HQuant
  • Build a VQ codebook in preparation for building discrete-probability HMM systems
Training Phase
• Prototype HMMs
  • Define the topology required for each HMM by writing a prototype definition
  • HTK allows HMMs to be built with any desired topology
  • HMM definitions are stored as simple text files
  • The prototype's parameter values (means and variances) are ignored and will be re-estimated; only the transition probabilities matter, since their zero/non-zero pattern fixes the model topology
Training Procedure
• Isolated word training [HInit, HRest]
• Flat start training [HCompV]
• Embedded training [HERest]
Isolated Word Training
• HInit
  • Reads in all of the bootstrap training data and cuts out all of the examples of a specific phone
  • Iteratively computes an initial set of parameter values using the segmental k-means training procedure
    • On the first iteration cycle, the training data are uniformly segmented with respect to the model's state sequence, and means and variances are estimated
    • On successive iteration cycles, the uniform segmentation is replaced by Viterbi alignment
• HRest
  • Re-estimates the parameters initially computed by HInit
  • Baum-Welch re-estimation is used, instead of the segmental k-means procedure used by HInit
Flat Start Training
• If the training speech files are not equipped with sub-word-level boundary information, a flat-start training scheme can be used
• HCompV
  • All of the phone models are initialized to be identical, with state means and variances equal to the global speech mean and variance
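A typical invocation, along the lines of the HTK Book tutorial (file names such as train.scp and proto are illustrative):

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

Here -m updates the means as well as the variances, and -f 0.01 sets a variance floor at 1% of the global variance.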
Embedded Training
• HERest
  • Performs a single Baum-Welch re-estimation of the whole set of HMMs simultaneously
  • The corresponding phone models are concatenated, and the forward/backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc.
  • The accumulated statistics are then used to re-estimate the HMM parameters
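A typical re-estimation pass, along the lines of the HTK Book tutorial (file names illustrative):

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
       -S train.scp -H hmm0/macros -H hmm0/hmmdefs \
       -M hmm1 monophones0

The -t option sets the pruning thresholds for the forward/backward passes, which greatly reduces computation on long utterances.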
Recognition Phase
• HVite
  • Performs Viterbi-based speech recognition
  • Dictionary: defines how each word is pronounced
  • Word networks: either simple word loops, where any word can follow any other word, or directed graphs representing a finite-state task grammar (HBuild and HParse are supplied to create word networks)
  • Supports cross-word triphones and can run with multiple tokens to generate lattices containing multiple hypotheses
  • Can be configured to rescore lattices and perform forced alignments
Analysis Phase
• Evaluate recognizer performance by comparing the recognition results with correct reference transcriptions
• HResults
  • Performs the comparison of recognition results and correct reference transcriptions, using dynamic programming to align them
HResults
• Provides scoring output similar to that used in the US NIST scoring package
• Provides confusion matrices, speaker-by-speaker breakdowns and time-aligned transcriptions
A Tutorial Example
• A voice-operated interface for phone dialing
• All of the speech is recorded from scratch
• A task grammar and a pronunciation dictionary are defined
• Example utterances:
  Dial three three two six five four
  Dial nine zero four one oh nine
  Phone Woodland
  Call Steve Young
Processing Steps • Data Preparation • Step 1 – the Task Grammar • Step 2 – the Dictionary • Step 3 – Recording the Data • Step 4 – Creating the Transcription Files • Step 5 – Coding the Data • Training HMMs • Step 6 – Creating Flat Start Monophones • Recognition • Step 7 – Recognizing the Test Data • Step 8 – Evaluating the Performance
Step 1 – the Task Grammar
• Simple task grammars are specified with the HTK grammar definition language and stored in a file (gram)
• This high-level representation is provided for user convenience; a sketch follows
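For this dialing task, gram might look roughly as follows (HTK grammar notation: angle brackets denote repetition, square brackets denote options; the name list here is illustrative):

$digit = ONE | TWO | THREE | FOUR | FIVE | SIX |
         SEVEN | EIGHT | NINE | OH | ZERO;
$name = [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name ) SENT-END )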
Step 1 – the Task Grammar (cont.)
• The actual HTK recognizer requires a low-level notation, the HTK Standard Lattice Format (SLF)
• In an SLF network, each word instance and each word-to-word transition is listed explicitly
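The conversion is done by HParse, which reads the grammar and writes the SLF word network:

HParse gram wdnet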
Step 2 – the Dictionary
• Function words such as A and TO have multiple pronunciations
• The entries SENT-START and SENT-END have the silence model sil as their pronunciation and null output symbols
• The dictionary is stored in the file dict; example entries follow
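A few illustrative entries, in the style of the HTK Book tutorial (each word is followed by an optional output symbol in brackets and its phone sequence; pronunciations here are illustrative):

CALL        k ao l sp
DIAL        d ay ah l sp
PHONE       f ow n sp
SENT-START  []  sil
SENT-END    []  sil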
Step 3 – Recording the Data
• Generate a list of prompt utterances (testprompts)
• Record the speech for each prompt
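Prompt sentences can be generated automatically from the word network, as in the HTK Book tutorial:

HSGen -l -n 200 wdnet dict > testprompts

where -n 200 generates 200 random sentences from the network and -l numbers them for use as recording prompts.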
Step 4 – Creating the Transcription Files
• Word-level transcriptions are created from the prompts and then expanded into phone-level transcriptions
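The expansion to phone level is typically done with HLEd and an edit script (mkphones0.led in the HTK Book tutorial; file names illustrative):

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf

The -d option names the dictionary used to expand each word into phones, and -i names the output master label file.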
Step 5 – Coding the Data
• Coding can be performed using the tool HCopy, configured to automatically convert its input into feature (MFCC) vectors
• The conversion is controlled by a configuration file (config); a sketch follows
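A typical coding configuration, along the lines of the HTK Book tutorial (values illustrative):

# 10 ms frame rate, 25 ms Hamming window, 12 MFCCs plus C0
SOURCEFORMAT = HTK
TARGETKIND   = MFCC_0
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
CEPLIFTER    = 22
NUMCEPS      = 12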
Step 5 – Coding the Data (cont.)
• A script file (codetr.scp) lists each source waveform together with its target feature file
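Each line of the script pairs a source waveform with its target feature file (paths illustrative), and HCopy is then run over the whole list:

S0001.wav  S0001.mfc
S0002.wav  S0002.mfc
...

HCopy -T 1 -C config -S codetr.scp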
Step 6 – Creating Flat Start Monophones
• Define prototype HMMs
  • 3 emitting states, left-to-right, no skips (5 states in the HTK definition, counting the non-emitting entry and exit states)
  • Prototype HMM files are stored individually (hmm0)
  • A Master Macro File (MMF) holds one identical copy of the definition per monophone ("w", "n", "ah", ...):

~o <VecSize> 39 <MFCC_0>
~h "w"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  ......
  <State> 4
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
Step 7 – Recognizing the Test Data
• Viterbi-based speech recognition (HVite)

HVite -H hmm0/macros -H hmm0/hmmdefs -S test.scp \
      -l '*' -i recout.mlf -w wdnet \
      -p 0.0 -s 5.0 dict monphones0

• -w loads the word network, -i names the output transcription file, and -p and -s set the word insertion penalty and the grammar scale factor
Step 8 – Evaluating the Performance

HResults -I testref.mlf monphones0 recout.mlf

• SENT: sentence-level statistics
• WORD: word-level statistics
Further Topics on HTK
• HMM training with labeled speech databases
  • The TIMIT database: a total of 6,300 sentences (630 speakers) from 8 major dialect regions of the U.S.
  • Speech data and labeled transcriptions are provided
• Tied-state triphone HMMs
HMM Training with TIMIT
• Data Preparation
  • Step 1 – the Grammar
  • Step 2 – the Dictionary
  • Step 3 – Recording the Data
  • Step 4 – Creating the Transcription Files
  • Step 5 – Coding the Data
• Training HMMs
  • Step 6 – Creating Monophones
• Recognition
  • Step 7 – Recognizing the Test Data
  • Step 8 – Evaluating the Performance
• Speech data and labeled transcriptions are already provided by TIMIT
Step 1 – the Grammar
• HLStats
  • Builds a bigram language model from the training transcriptions
• HBuild
  • Constructs word-loop and word-pair grammars automatically
  • Can incorporate a statistical bigram language model into a network

HLStats -b stats -o wordlist train.mlf
HBuild -n stats wordlist wdnet

• Excerpt from the resulting bigram file (stats):

\data\
ngram 1=50
ngram 2=1694
\1-grams:
-99.999 !ENTER -3.8430
-1.8048 aa     -1.4168
-1.7979 ae     -1.6101
......
\2-grams:
-0.0001 !ENTER si
-2.8093 aa aa
-2.8093 aa ao
......
Step 2 – the Dictionary
• Phone-level dictionary
  • Create a phone-level dictionary for phonetic recognition
  • The recognition output is phonemes instead of words
• Excerpt from dict (each phone maps to itself):

sil     sil
iy      iy
v       v
ih      ih
n       n
eh      eh
ix      ix
f       f
......
y       y
en      en
zh      zh
!ENTER  []  sil
!EXIT   []  sil
Step 6 – Creating Monophones
• Isolated word training
  • Initialization with HInit
  • Parameter re-estimation with HRest

HInit -S trainlist -I train.mlf -M hmm0 proto
HRest -S trainlist -I train.mlf -u tmvw -w 3 -i 20 -M hmm1 hmm0

• Both commands are run once per model (sil, ...)
Triphone HMMs
• Context-dependent triphone HMMs
  • A powerful subword modeling technique because they account for the left and right phonetic contexts

Text:      men need ...
Monophone: sil m ae n sp n iy d ...
Triphone:  sil m+ae m-ae+n ae-n sp n+iy n-iy+d iy-d ...
• Convert the monophone transcriptions to triphone transcriptions
• Triphones can then be made by simply cloning monophones and re-estimating them using the triphone transcriptions
• These two steps are driven by the edit scripts mktri.led and mktri.hed (see the commands below)
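Illustrative commands along the lines of the HTK Book tutorial (file names assumed): HLEd converts the phone transcriptions to triphone transcriptions, and HHEd clones the monophone models and ties their transition matrices:

HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
HHEd -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1

Here -n writes the list of all triphones seen in the data to triphones1, which the HHEd script then uses when cloning.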
Tied-State Triphone HMMs
• Triphone clustering
  • There are too many models for the amount of training data available, so some models do not have sufficient data for training
  • Many triphone contexts are very similar
• Decision-tree state tying (a sketch follows)
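A minimal sketch of a tree-based tying script for HHEd (questions, thresholds and file names are illustrative, following the HTK Book's RO/QS/TB commands):

RO 100.0 stats
QS "L_Nasal" { m-*, n-*, ng-* }
QS "R_Nasal" { *+m, *+n, *+ng }
TB 350.0 "aa_s2" {(aa, *-aa, aa+*, *-aa+*).state[2]}

RO loads the state-occupation statistics and sets the outlier threshold, each QS defines a yes/no question on the phonetic context, and TB clusters the named states (here state 2 of all models of aa) and ties the resulting clusters.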
References • “The HTK Book,” Steve Young, et al., Dec. 2006 • http://htk.eng.cam.ac.uk/docs/docs.shtml • “Introduction to HTK Toolkit,” Berlin Chen, 2004 • http://140.122.185.120/PastCourses/2004-TCFST-Audio%20and%20Speech%20Recognition/Slides/SP2004F_Lecture06_Acoustic%20Modeling-HTK%20Tutorial.pdf • HTK Download • http://htk.eng.cam.ac.uk/download.shtml (User registration required)