Discover the benefits and in-depth features of HTK - a world-recognized tool supporting various input formats and speech recognition technologies. Explore its detailed features like feature extraction, model refinement, adaptation techniques, and more. Learn how to define acoustic units, set up HMM structure, conduct training, and evaluate performance. Dive into tasks like phone dialing and free-syllable decoding, and understand database preparation processes for speech recognition projects.
Benefits of HTK • A world-recognized, state-of-the-art speech recognition toolkit • Supports a variety of different input formats • Supports many different feature types • Supports almost all common speech recognition technologies
Detailed features of HTK • Supports a variety of input formats, e.g. pcm, wav, …, ALIEN (unknown), etc. • Feature extraction: MFCC, filterbank, PLP, LPC, …, etc. • Very flexible HMM definition • Training: Viterbi (segmentation), Forward/Backward (Baum-Welch), single-model re-estimation (feature change)
Detailed features of HTK • HMM system refinement: context-dependent models, parameter tying/clustering, regression class tree (MLLR) • Language: word grammar and network, bigram language model • Decoding: evaluation of recognition results, forced alignment, N-best lists/lattices
Detailed features of HTK • HMM adaptation: MLLR/regression tree, MAP, mean/variance adaptation
HTK procedures • Data/setting preparation • Define acoustic units (phone table) • Define dictionary (words) • Define grammar/network • Collect speech database • Generate transcriptions • Feature extraction • Set configuration file for MFCC feature extraction • Prepare script files (corpus files) • Define HMM structure (prototype) • Training HMM models • Prepare script files (corpus files) • Set configuration file for training, recognition, etc. • Flat start (uniform segmentation) • Viterbi search (forced alignment: segmentation) • Recognition/performance evaluation • Viterbi search (see the command-pipeline sketch below)
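For orientation, the steps above correspond roughly to the following command pipeline. This is only a minimal sketch; file and directory names such as config, proto, hmm0/, train.scp, phonelist are placeholders for this set-up, not fixed HTK names:
HParse gram wdnet                                        # grammar -> word network
HCopy -T 1 -C config -S codetr.scp                       # waveforms -> MFCC feature files
HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto   # flat-start initialization
HERest -C config -I phones.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 phonelist   # Baum-Welch re-estimation (repeat several times)
HVite -C config -H hmmN/macros -H hmmN/hmmdefs -S test.scp -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict phonelist       # recognition
HResults -I ref.mlf phonelist recout.mlf                 # scoring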
HMM/Data Setting • Phone Table • Dictionary • Grammar Rule • Define HMMs
Define Acoustic Units (Phone table) • Using our traditional 100 right-context-dependent (RCD) INITIALs and 40 context-independent (CI) FINALs ……
Dictionary (411-syllable table) • Using our traditional 411 syllables (plus silence) • Format: word followed by its phone list (see the hypothetical fragment below) …… …… ……
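A hypothetical fragment of such a dictionary in HTK format (one syllable per line, followed by its INITIAL/FINAL unit sequence; the entries here are illustrative only, based on the transcription example later in this deck):
tai   t ai
yin   NULL yin
sil   sil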
Task Grammar Rule • Task: Phone Dialing • Dial three two six five four • Dial nine zero four one oh nine • Phone Woodland • Call Steve Young
Generate Task Grammar Rule Network • Task: free-syllable decoding • Define a gram file: $syllable = zhi | chi | ri | a | ……; ( SENT-START < $syllable [sil] > SENT-END ) • Parse gram with HParse to produce wdnet
Free-Syllable Decoding "wdnet" Syntax:
# N = number of nodes (I lines), L = number of arcs (J lines)
N=?  L=?
# node definitions: I = node index, W = word label
I=x  W=www
…
# arc definitions: J = arc index, S = start node, E = end node
J=x  S=y  E=z
…
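For illustration only, a hand-written toy network in this syntax that accepts exactly one syllable, either zhi or chi (the wdnet actually generated by HParse for the full free-syllable loop is much larger):
VERSION=1.0
N=4  L=4
I=0  W=!NULL
I=1  W=zhi
I=2  W=chi
I=3  W=!NULL
J=0  S=0  E=1
J=1  S=0  E=2
J=2  S=1  E=3
J=3  S=2  E=3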
Database Preparation • Collect speech data • Transcribe the collected speech database into a word-level MLF (Master Label File) • Prepare corpus files (training/test set) and script files • Convert the word-level MLF into phone-level labels • Feature extraction (MFCC)
Word/Phone-Level Transcriptions (use HLEd to transform)
Word Master Label File:
#!MLF!#
"*/4_t0062_t0062331.lab"
tai
yin
.
"*/4_t0062_t0062340.lab"
.
.
Phone Master Label File:
#!MLF!#
"*/4_t0062_t0062331.lab"
sil
t
ai
NULL
yin
sil
.
"*/4_t0062_t0062340.lab"
.
.
HLEd edit script:
EX           # expand each word into its phone sequence using the dictionary
IS sil sil   # insert sil at the start and end of every utterance
DE sp        # delete all short-pause (sp) labels
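The script above is applied with HLEd, roughly as follows (the script, dictionary, and MLF names are placeholders for this set-up):
HLEd -l '*' -d dict -i phones.mlf mkphones.led words.mlf
-d dict        : dictionary used by the EX command to expand words into phones
-i phones.mlf  : write the resulting phone-level MLF
-l '*'         : use a wildcard directory pattern in the output label names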
Feature Extraction • HCopy : data copy (with format conversion)
Script files • codetr.scp : each line lists a source waveform and a destination feature file (source  destination)
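A hypothetical codetr.scp (the paths are placeholders) and the HCopy call that processes it:
data/train/t0062331.pcm   mfc/train/t0062331.mfc
data/train/t0062340.pcm   mfc/train/t0062340.mfc
HCopy -T 1 -C config -S codetr.scp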
Feature Extraction Configuration
# byte order
NATURALREADORDER=TRUE
NATURALWRITEORDER=TRUE
# waveform parameters
SOURCEFORMAT=ALIEN
HEADERSIZE=256
SOURCERATE=1250.0
# coding parameters
TARGETKIND=MFCC_E
TARGETRATE=100000.0
SAVECOMPRESSED=F
SAVEWITHCRC=T
WINDOWSIZE=320000.0
# ZMEANSOURCE=T
USEHAMMING=T
PREEMCOEF=0.97
NUMCHANS=20
USEPOWER=F
# normalize the dynamic range of the MFCCs
CEPLIFTER=22
LOFREQ=0
HIFREQ=4000
NUMCEPS=12
ENORMALISE=T
DELTAWINDOW=2
ACCWINDOW=2
HMM Configuration • Config file (command level): command -C config_file • User defaults: > export HCONFIG=my_HTK_config • Built-in defaults: see Chap. 18 of the HTK manual
HMM Prototype Definition
~o <VecSize> 39 <MFCC_Z_E_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2 <NumMixes> 4
    <Mixture> 1 0.25
      <Mean> 39
        ……
      <Variance> 39
        ……
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.5 0.5 0.0 0.0
    0.0 0.0 0.5 0.5 0.0
    0.0 0.0 0.0 0.5 0.5
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
Training Procedure • Model initialization • Flat start (unknown segmentation → uniform segmentation) • Viterbi search (given segmentation) • Forward/Backward (word-level transcriptions only) • Model refinement • Mixture splitting
Configuration file for Training/Test
# byte order
# BYTEORDER=VAX
NATURALREADORDER=TRUE
NATURALWRITEORDER=TRUE
# MFCC parameters
SOURCEFORMAT=HTK
SOURCERATE=100000.0
TARGETKIND=MFCC_E_D_A_Z
TARGETRATE=100000.0
DELTAWINDOW=2
ACCWINDOW=2
Training Corpus • Mat4500_train.scp • Mat4500_train_phones.mlf …… ……
Training flow • Flat start • Viterbi search • Forward/Backward (see the sketch below)
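A minimal flat-start sketch using the corpus files above (the directory names hmm0/, hmm1/ and the phone list name are placeholders; the per-phone cloning step is done with a small script or editor, not by a single HTK tool):
# compute global mean/variance and a variance floor from the training data
HCompV -C config -f 0.01 -m -S Mat4500_train.scp -M hmm0 proto
# clone hmm0/proto once per phone to build hmm0/hmmdefs, then re-estimate:
HERest -C config -I Mat4500_train_phones.mlf -t 250.0 150.0 1000.0 -S Mat4500_train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 phonelist
# repeat HERest for several passes (hmm1 -> hmm2 -> …)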
Utterance Segmentation *.mlf • mat4500_train.mlf (phone-level with segmentation information) …
Silence and short-pause model • sp shares (is tied to) the centre state of the silence model • sil.hed:
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
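The script is applied with HHEd once the sil/sp models are in the model set, for example (model directories and the phone list name are placeholders):
HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed phonelist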
Mixture Splitting Script • MU2.hed ……
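A possible content of MU2.hed, assuming 3 emitting states per model (states 2-4): the MU command doubles the number of Gaussians in every state of every model. It is applied with HHEd (directory names are placeholders), and is normally followed by a few more HERest passes:
MU 2 {*.state[2-4].mix}
HHEd -H hmm/macros -H hmm/hmmdefs -M hmm_2mix MU2.hed phonelist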
Recognition/Evaluation Procedure • Recognition • Evaluation
Test Corpus • Mat4500_test.scp • Mat4500_test.mlf ……
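A possible recognition-and-scoring run over this test corpus (the dictionary, word network, and model directory names are placeholders):
HVite -C config -H hmm/macros -H hmm/hmmdefs -S Mat4500_test.scp -i recout.mlf -w wdnet -p 0.0 -s 5.0 -t 250.0 dict phonelist
HResults -I Mat4500_test.mlf phonelist recout.mlf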
Forced Alignment • Viterbi decoding: HVite using option -a • Gives statistics of the HMM segmentation • Useful for determining the number of mixtures
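A sketch of such a forced-alignment pass (the word-level MLF and output MLF names are hypothetical):
HVite -a -m -C config -H hmm/macros -H hmm/hmmdefs -I Mat4500_train_words.mlf -i aligned.mlf -S Mat4500_train.scp dict phonelist
-a : align against the word-level transcription supplied with -I
-m : include model (and hence segment) boundaries in the output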
Speaker Adaptation - MLLR, MAP • MLLR • In the training phase, generate the state occupation statistics: % HERest -s … • HHEd script:
RN "models"     // rename the HMM set identifier
LS "stats"      // load the state occupation statistics
RC 32 "rtree"   // build a regression class tree with 32 classes
or RC 32 "rtree" {sil.state[2-4].mix}
• Forced alignment of the adaptation data: % HVite … -a … -I adapWords.mlf -m … • Find a global MLLR transform: % HEAdapt -C … -g … -K global.tmf … -I adapPhone.mlf …  (*.tmf : transform model file) • Find MLLR regression-tree transforms: % HEAdapt -C … -J global.tmf -K rc.tmf … -I adapPhone.mlf … • Recognition: % HVite … -J rc.tmf …
MAP adaptation • HEAdapt -C … -j 0.9 … -k … -I adapPhone.mlf … • -j : MAP adaptation weight • -k : apply MLLR before MAP
Further topics • Model/state tying (HMM definition) • Context-dependent models • Fast training/search (beam search) • Insertion/deletion problem: duration constraints, word transition penalty • Word lattice output
Detailed options for the HTK commands • HCompV • Typical arguments: HCompV -C xxx -f 0.01 -m -S *.scp -M output_dir hmm • -m : update the means • -f f : set the variance floor to f × global variance, stored in the hmm macro file as ~o … ~v "varFloor1" <Variance> 38 ……
Detailed options for the HTK commands • HERest • Typical arguments: HERest -C xxx -I *.mlf -t 250.0 150.0 1000.0 -S *.scp -H hmm_macros -H hmm_defs -M output_dir hmmlist • -t f [i l] : set the pruning threshold to f; on failure increase it by i, up to the limit l • -T : tracing option (octal number, command dependent), e.g. 00020 shows occupation counts
Detailed options for the HTK commands • HVite • Typical arguments: HVite -H hmm_macros -H hmm_defs -S *.scp -i output_mlf -w wdnet -p 0.0 -s 5.0 -t 250 dict tiedlist • -t f : beam pruning threshold • -m : show model boundaries • -a : forced alignment (with -I input.mlf) • -p, -s : word insertion penalty, grammar scale factor
Detailed options for the HTK commands • HResults • Typical arguments: HResults -I *.mlf hmmlist answer.mlf • -n : use NIST format • -e s t : label t is made equivalent to label s
Detailed options for the HTK commands • HInit • Typical arguments: HInit -S *.scp -M hmm_macro -H hmm_defs model • HRest • Typical arguments: HRest -S *.scp -M hmm_macro -H hmm_defs model • HSLab • Use WaveSurfer instead.