HTK Tutorial Prepared using HTKBook
Software architecture
• toolkit for Hidden Markov Modeling
• optimized for Speech Recognition
• very flexible and complete
• very good documentation (HTK Book)
• Data Preparation Tools
• Training Tools
• Recognition Tools
• Analysis Tool
General concepts
• Set of programs with a command-line style interface.
• Each tool has a number of required arguments plus optional arguments; the latter are always prefixed by a minus sign:
HFoo -T 1 -f 34.3 -a -s myfile file1 file2
• Options named by a capital letter have the same meaning across all tools. For example, the -T option is always used to control the trace output of an HTK tool.
• In addition to command-line arguments, the operation of a tool can be controlled by parameters stored in a configuration file. For example, if the command
HFoo -C config -f 34.3 -a -s myfile file1 file2
is executed, the tool HFoo will load the parameters stored in the configuration file config during its initialisation procedures.
• The HTK data formats:
audio: many common formats plus HTK binary
features: HTK binary
labels: HTK text (single files or Master Label Files)
models: HTK text or binary (single files or Master Macro Files)
other: HTK text
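A configuration file itself is plain text, one NAME = VALUE pair per line, with # starting a comment (the same layout as the hcopy.conf shown in Step 5 below). A minimal illustrative sketch, with arbitrary example values:
# config – a minimal illustrative configuration file
TARGETKIND = MFCC_0   # parameter kind the tool should produce
TRACE = 1             # tracing level, analogous to -T on the command line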
Data preparation tools
• data manipulation tools:
HCopy – parametrize signals
HQuant – vector quantization
HLEd – label editor
HHEd – model editor (master model file)
HDMan – dictionary editor
HBuild – language model conversion
HParse – lattice file preparation (grammar conversion)
• data visualization tools:
HSLab – speech label manipulation
HList – data display and manipulation
HSGen – generate sentences from a regular grammar
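For example, a parametrized data file can be inspected with HList; a small sketch (the file name is illustrative):
HList -h -e 3 S0001.mfc
# -h prints the file header (sample kind, sample period, etc.),
# -e 3 ends the listing after the first few samples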
Training tools
The actual training process takes place in stages, as illustrated in more detail in Fig. 2.3 of the HTKBook. Firstly, an initial set of models must be created. If there is some speech data available for which the location of the sub-word (i.e. phone) boundaries has been marked, then this can be used as bootstrap data. In this case, the tools HInit and HRest provide isolated-word style training using the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It then iteratively computes an initial set of parameter values using a segmental k-means procedure. On the first cycle, the training data is uniformly segmented, each model state is matched with the corresponding data segments, and then means and variances are estimated. If mixture Gaussian models are being trained, then a modified form of k-means clustering is used. On the second and successive cycles, the uniform segmentation is replaced by Viterbi alignment.
The initial parameter values computed by HInit are then further re-estimated by HRest. Again, the fully labelled bootstrap data is used, but this time the segmental k-means procedure is replaced by Baum-Welch re-estimation. When no bootstrap data is available, a so-called flat start can be used. In this case all of the phone models are initialised to be identical and have state means and variances equal to the global speech mean and variance. The tool HCompV can be used for this.
Once an initial set of models has been created, the tool HERest is used to perform embedded training using the entire training set. HERest performs a single Baum-Welch re-estimation of the whole set of HMM phone models simultaneously. For each training utterance, the corresponding phone models are concatenated and then the forward-backward algorithm is used to accumulate the statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of the training data has been processed, the accumulated statistics are used to compute re-estimates of the HMM parameters. HERest is the core HTK training tool: it is designed to process large databases, it has facilities for pruning to reduce computation, and it can be run in parallel across a network of machines.
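The stages above can be summarized in a command sketch; all file and directory names here are illustrative, and each stage is detailed in the step-by-step part of this tutorial:
# (a) bootstrap route, per phone, using hand-labelled data:
HInit -S boot.list -l ih -L labs -M hmm0 proto     # segmental k-means
HRest -S boot.list -l ih -L labs -M hmm1 hmm0/ih   # isolated Baum-Welch
# (b) flat-start route:
HCompV -S train.list -f 0.01 -m -M hmm0 proto      # global mean/variance
# then, in either case, embedded re-estimation over the whole set:
HERest -S train.list -I words.mlf -H hmm1/hmmdefs -M hmm2 phonelist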
Recognition and analysis tools
• HVite – performs Viterbi-based speech recognition. HVite takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced, and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. HVite can support cross-word triphones, and it can run with multiple tokens to generate lattices containing multiple hypotheses. It can also be configured to rescore lattices and perform forced alignments (see the sketch after this list).
• HResults – uses dynamic programming to align the reference and recognized transcriptions and then count substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and output formats used by HResults are compatible with those used by the US National Institute of Standards and Technology (NIST). As well as global performance measures, HResults can also provide speaker-by-speaker breakdowns, confusion matrices and time-aligned transcriptions. For word-spotting applications, it can also compute Figure of Merit (FOM) scores and Receiver Operating Curve (ROC) information.
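As an illustration of forced alignment, HVite can be run in alignment mode; a hedged sketch with illustrative file names:
HVite -a -m -o SWT -b _SIL_ -I words.mlf -i aligned.mlf -S train.list -H hmmdefs beep.dic phonelist
# -a switches HVite to alignment mode (the network is built from each
#    file's transcription instead of a word lattice),
# -m includes model alignment information in the output,
# -b names the word inserted at utterance boundaries,
# -o SWT controls the output label format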
How to use HTK in 10 easy steps
• Step 1 – Set the task
• Prepare the grammar in BNF format:
[.] optional
{.} zero or more repetitions
(.) block
<.> loop
<<.>> context-dependent loop
.|. alternatives
• Compile the grammar to lattice format:
D:\htk-3.1\bin.win32\HParse location-grammar lg.lat
• location-grammar:
$location = where is | how to find | how to come to;
$ex = sorry | excuse me | pardon;
$intro = can you tell me | do you know;
$address = acton town | admirality arch | baker street | bond street | big ben | blackhorse road | buckingham palace | cambridge | canterbury | charing cross road | covent garden | downing street | ealing | edgware road | finchley road | gloucester road | greenwich | heathrow airport | high street | house of parliament | hyde park | kensington | king's cross | leicester square | marble arch | old street | paddington station | piccadilly circus | portobello market | regent's park | thames river | tower bridge | trafalgar square | victoria station | westminster abbey | whitehall | wimbledon | windsor;
$end = please;
(!ENTER {_SIL_} ({$ex} {into} {$location} $address {$end}) {_SIL_} !EXIT)
How to use HTK in 10 easy steps
• Step 2 – Prepare the pronunciation dictionary
• Find the list of words used in the task – lg.wlist
• Prepare the dictionary by hand, automatically (see the HDMan sketch below), or using a standard pronunciation dictionary (e.g. BEEP for British English)
• Or use the whole BEEP dictionary
where [where] 1.0 w e@
where [where] 1.0 w e@ r
is [is] 1.0 I z
how [how] 1.0 h aU
admirality [admirality] 1.0 { d m @ r @ l i: t i:
palace [palace] 1.0 p { l I s
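If the full BEEP dictionary is available, the task dictionary can also be extracted automatically with HDMan; a sketch, with illustrative output names:
HDMan -w lg.wlist -n lg.phones -l dlog lg.dic beep.dic
# -w limits the output to the words listed in lg.wlist,
# -n writes out the list of phones actually used,
# -l writes a log of the editing process to dlog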
How to use HTK in 10 easy steps
• Step 3 – Record the Training and Test Data
• HTK has a tool for recording prompts, HSLab, but it works under Linux only (see below)
• Usually other programs are used for that
• First generate the prompts, then record them:
D:\htk-3.1\bin.win32\HSGen -l -n 200 lg.lat beep.dic > lg.200
1. how to come to baker street _SIL_ !EXIT
2. ealing please _SIL_ !EXIT
3. heathrow airport !EXIT
4. leicester square _SIL_ !EXIT
5. king's cross please _SIL_ !EXIT
6. hyde park _SIL_ !EXIT
7. _SIL_ greenwich please _SIL_ _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
8. old street !EXIT
9. high street _SIL_ _SIL_ _SIL_ _SIL_ !EXIT
10. whitehall !EXIT
11. old street !EXIT
12. canterbury please !EXIT
13. into edgware road !EXIT
14. whitehall _SIL_ !EXIT
15. whitehall _SIL_ !EXIT
16. finchley road please please please _SIL_ !EXIT
• Record the prompts and store them in the chosen format: 16 kHz, 16-bit, headerless (to match SOURCEFORMAT = NOHEAD in the HCopy configuration of Step 5)
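Under Linux, the recording tool is started as follows and presents a simple waveform/label window; the argument names the file the recording is saved to:
HSLab noname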
How to use HTK in 10 easy steps
• Step 4 – Create the Transcription Files
• In HTK, all transcription files can be merged into one Master Label File (MLF)
• Usually it is enough to have word-level transcriptions
• If phone-level transcriptions are necessary, they can be generated automatically using HLEd (see the sketch below)
#!MLF!#
"*/S0001.lab"
how
to
come
to
baker
street
.
"*/S0002.lab"
ealing
please
.
(etc...)
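A sketch of the phone-level expansion with HLEd; the edit-script and file names are illustrative. The EX command replaces every word by its pronunciation from the dictionary, and IS inserts the silence symbol at both ends of each utterance:
# mkphones.led
EX
IS _SIL_ _SIL_

HLEd -d beep.dic -i phones.mlf mkphones.led words.mlf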
How to use HTK in 10 easy steps
• Step 5 – Parametrize the Data
• Use HCopy to compute MFCC and delta parameters
• Use a config file to set all the options (hcopy.conf, shown below):
HCopy -T 1 -C hcopy.conf -S file.list

### hcopy.conf
### input file specific section
SOURCEFORMAT = NOHEAD
HEADERSIZE = 0
# sample period of 625 x 100 ns = 0.0625 ms corresponds to 16 kHz
SOURCERATE = 625
###
### analysis section
###
# no DC offset correction
ZMEANSOURCE = FALSE
# no random noise added
ADDDITHER = 0.0
# preemphasis
PREEMCOEF = 0.97
# windowing
TARGETRATE = 100000
WINDOWSIZE = 250000
USEHAMMING = TRUE
# fbank analysis
NUMCHANS = 24
LOFREQ = 80
HIFREQ = 7500
# don't take the sqrt:
USEPOWER = TRUE
# cepstrum calculation
NUMCEPS = 12
CEPLIFTER = 22
# energy
ENORMALISE = FALSE
ESCALE = 1.0
RAWENERGY = FALSE
# delta and delta-delta
DELTAWINDOW = 2
ACCWINDOW = 2
SIMPLEDIFFS = FALSE
###
### output file specific section
###
TARGETKIND = MFCC_D_A_0
TARGETFORMAT = HTK
SAVECOMPRESSED = TRUE
SAVEWITHCRC = TRUE
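The file passed with -S is an ordinary script file; for HCopy every line names a source waveform and the feature file to be written (paths illustrative):
S0001.wav S0001.mfc
S0002.wav S0002.mfc
S0003.wav S0003.mfc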
How to use HTK in 10 easy steps
• Step 6 – Create Monophone HMMs
• Define a prototype model and clone it for all phones (see the cloning sketch below):
~o <VecSize> 39 <MFCC_D_A_0> <StreamInfo> 1 39
~h "p"
<BeginHMM>
<NumStates> 5
<State> 2 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 4 <NumMixes> 1
<Stream> 1
<Mixture> 1 1.0000
<Mean> 39
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 39
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<TransP> 5
0.000e+0 1.000e+0 0.000e+0 0.000e+0 0.000e+0
0.000e+0 6.000e-1 4.000e-1 0.000e+0 0.000e+0
0.000e+0 0.000e+0 6.000e-1 4.000e-1 0.000e+0
0.000e+0 0.000e+0 0.000e+0 6.000e-1 4.000e-1
0.000e+0 0.000e+0 0.000e+0 0.000e+0 0.000e+0
<EndHMM>
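Cloning just means writing one copy of the prototype per phone under a new ~h name; HTK has no dedicated tool for this step, so a minimal Unix-shell sketch (assuming the phone list is in lg.phones) could be:
for p in `cat lg.phones`; do
  # rename the ~h macro and write one definition file per phone
  sed "s/~h \"p\"/~h \"$p\"/" proto > hmm0/$p
done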
How to use HTK in 10 easy steps
• Step 7 – Initialize the models
• Use HInit:
HInit -S trainlist -H globals -M dir1 proto
• Firstly, the Viterbi algorithm is used to find the most likely state sequence corresponding to each training example; then the HMM parameters are estimated. As a side-effect of finding the Viterbi state alignment, the log likelihood of the training data can be computed. Hence, the whole estimation process can be repeated until no further increase in likelihood is obtained.
• If no bootstrap data is available, use HCompV for flat-start initialization: it scans a set of data files, computes the global mean and variance, and sets all of the Gaussians in a given HMM to that same mean and variance (see the example below).
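A flat-start example with HCompV (directory and file names illustrative):
HCompV -C hcopy.conf -f 0.01 -m -S trainlist -M dir0 proto
# -f 0.01 also generates a variance floor at 1% of the global variance,
# -m updates the means as well as the variances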
How to use HTK in 10 easy steps
• Step 8 – Isolated Unit Re-Estimation using HRest
• Its operation is very similar to HInit, except that it expects the input HMM definition to have been initialised and it uses Baum-Welch re-estimation in place of Viterbi training.
• Whereas Viterbi training makes a hard decision as to which state each training vector was "generated" by, Baum-Welch makes a soft decision. This can be helpful when estimating phone-based HMMs, since there are no hard boundaries between phones in real speech and using a soft decision may give better results.
HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih
• This will load the HMM definition for /ih/ from dir1, re-estimate the parameters using the speech segments labelled with ih, and write the new definition to directory dir2 (a loop over all phones is sketched below).
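Since HRest handles one model at a time, it is usually wrapped in a loop over the phone list; a Unix-shell sketch reusing the command above:
for p in `cat lg.phones`; do
  HRest -S trainlist -H dir1/globals -M dir2 -l $p -L labs dir1/$p
done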
How to use HTK in 10 easy steps
• Step 9 – Embedded Training using HERest
• HERest embedded training simultaneously updates all of the HMMs in a system using all of the training data.
• On startup, HERest loads in a complete set of HMM definitions. Every training file must have an associated label file which gives a transcription for that file. Only the sequence of labels is used by HERest; any boundary location information is ignored. Thus, these transcriptions can be generated automatically from the known orthography of what was said and a pronunciation dictionary.
• HERest processes each training file in turn. After loading it into memory, it uses the associated transcription to construct a composite HMM which spans the whole utterance. This composite HMM is made by concatenating instances of the phone HMMs corresponding to each label in the transcription. The Forward-Backward algorithm is then applied and the sums needed to form the weighted averages are accumulated in the normal way. When all of the training files have been processed, the new parameter estimates are formed from the weighted sums and the updated HMM set is output.
HERest -t 120.0 60.0 240.0 -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist
(-t sets the pruning beam limits)
• Can also be used to prepare context-dependent models
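In practice HERest is run several times, each pass reading the models produced by the previous one; a sketch of two successive passes (directory names illustrative):
HERest -t 120.0 60.0 240.0 -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist
HERest -t 120.0 60.0 240.0 -S trainlist -I labs -H dir2/hmacs -M dir3 hmmlist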
How to use HTK in 10 easy steps
• Step 10 – Use HVite to recognize utterances and HResults to evaluate the recognition rate
D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -S test.list -C hvite.conf -i recresults.mlf beep.dic wsjcam0.mlist
• A lot of other options can be set (beam width, scale factors, weights, etc.)
• On-line (see the live.conf sketch below):
D:\htk-3.1\bin.win32\HVite -g -w lg.lat -H wsjcam0.mmf -C live.conf beep.dic wsjcam0.mlist
• Statistics of the results:
HResults -I testrefs.mlf tiedlist recout.mlf
====================== HTK Results Analysis ==============
Ref : testrefs.mlf
Rec : recout.mlf
------------------------ Overall Results -----------------
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
==========================================================
N = total number of words, S = substitutions, D = deletions, I = insertions
H = hits = N - S - D, %Corr = H/N, Acc = (H - I)/N
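The live.conf for on-line decoding must switch the input source from files to the audio device; a hedged sketch of the extra parameters, on top of the analysis settings from hcopy.conf:
# live.conf – illustrative direct-audio settings
SOURCEKIND = HAUDIO   # read from the audio device instead of files
SOURCERATE = 625      # 16 kHz sampling
USESILDET = T         # use the built-in speech/silence detector
MEASURESIL = F
OUTSILWARN = T
ENORMALISE = F        # energy normalisation is not possible on-line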
Bye Bye • Thanks for your participation!