Chapter 13 Speech Recognition Systems • 13.1 Isolated Word Recognition Systems • 13.2 Connected Speech Recognition Systems • 13.3 Continuous Speech Recognition Systems • 13.4 Word Spotting Systems
13.1 Isolated Word Recognition Systems (1) • 1. Template Matching System by DTW • For a speaker-dependent system, a single set of templates, called the reference set, is enough. For each incoming test word, DTW computes the minimum distance to every template in the set, and the best-matching word is output.
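The template-matching step described above could be sketched roughly as follows. This is a minimal illustration, not the system from the slides: the function names `dtw_distance` and `recognize`, the Euclidean frame distance, the unconstrained step pattern, and the toy one-dimensional "features" are all assumptions made for the example.

```python
import math

def dtw_distance(test, ref):
    """Minimum accumulated frame distance between two feature sequences."""
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning test[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(test[i - 1], ref[j - 1])  # Euclidean frame distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(test, templates):
    """Return the word whose reference template has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))

# Toy reference set: one template per word, one feature value per frame.
templates = {
    "yes": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "no": [(3.0,), (3.0,), (0.0,), (0.0,)],
}
# A slower rendition of "yes": same shape, stretched in time.
test_word = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]
print(recognize(test_word, templates))
```

A practical system would usually add slope constraints and normalize the distance by path length; both are left out here to keep the sketch short.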
Isolated Word Recognition Systems (2) • For a speaker-independent system, several sets of templates are needed. If the number of speakers is very large, the templates must be clustered to limit their number to about 5-10. Of course, the computing overhead will then be large.
Isolated Word Recognition Systems (3) • 2. System by HMM • This type of system is basically speaker independent. For a small vocabulary, every speaker is asked to speak one set of specified words. All of these utterances are used to train the model for each word.
Isolated Word Recognition Systems (4) • Main Steps: • (1) Feature vectors: speech is sampled at 8-16 kHz with 8-16 bits per sample; the processing steps are pre-emphasis (1 - z^-1), framing (160-400 samples per frame), windowing, and LPCC or MFCC analysis (feature vectors of 26-39 dimensions) • (2) HMM structures and number of states: every word has an HMM, and every model has the same number of states (or a different number if necessary). For both Chinese and English, the best number is 3 states per syllable. In general, a left-right model without skips is used.
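The first stages of the feature front end (pre-emphasis, framing, windowing) might look like the sketch below. The pre-emphasis coefficient 0.95, the 256-sample frame with a 128-sample shift, and the Hamming window are assumed, commonly used choices; the slides only give the ranges. The LPCC or MFCC analysis that would follow is not shown.

```python
import math

def pre_emphasize(samples, a=0.95):
    """y[n] = x[n] - a * x[n-1]; a = 0.95 is an assumed value."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - a * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_signal(samples, frame_len=256, shift=128):
    """Split into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, shift)]

def hamming(frame):
    """Apply a Hamming window to one frame (assumed window type)."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

# One second of a 440 Hz tone sampled at 8 kHz, standing in for speech.
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = [hamming(f) for f in frame_signal(pre_emphasize(signal))]
print(len(frames), len(frames[0]))
```

Each windowed frame would then be passed to the cepstral analysis to produce one feature vector per frame.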
Isolated Word Recognition Systems (5) • (3) Discrete HMM vs Continuous HMM: if a discrete HMM is used, the feature vectors must be processed by VQ to obtain discrete labels (or symbols). The best codebook size is 64, 128 or 256. If a continuous HMM is used, the number of mixtures and the elements of the transition matrix need to be selected. A large number of mixtures (>= 5) with a diagonal covariance matrix works better. For b_j(k), a small probability (ε) is better than zero probability. (An example: with ε = 0, err = 12%; with ε = 10^-8 to 10^-3, err = 4%) • (4) Number of speakers for training: 30-100
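Flooring the discrete output probabilities b_j(k) can be sketched as below. The helper name `floor_emissions`, the renormalization step, and the value ε = 1e-5 (picked from inside the 10^-8 to 10^-3 range quoted above) are assumptions for illustration.

```python
import math

def floor_emissions(b_row, eps=1e-5):
    """Raise entries below eps to eps, then renormalize so the row sums to 1."""
    floored = [max(p, eps) for p in b_row]
    total = sum(floored)
    return [p / total for p in floored]

# b_j(k) for one state j: symbols 2 and 3 never occurred in training.
b_j = [0.7, 0.3, 0.0, 0.0]
smoothed = floor_emissions(b_j)

# With the floor, an unseen symbol gives a very low but finite log probability;
# math.log(b_j[2]) on the unfloored row would raise ValueError.
log_prob = math.log(smoothed[2])
print(smoothed, log_prob)
```

Without the floor, a single symbol unseen in training drives the likelihood of the whole word model to zero, which is one plausible reading of why the error drops once ε is nonzero.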
Isolated Word Recognition Systems (6) • (5) Results and conclusions for English digit experiments: 100 speakers (50 male, 50 female); VQ codebook size = 64; number of mixtures = 5; a diagonal covariance matrix was used • Error rates for the same 100 speakers: DTW 0.2%, DTW/VQ 3.5%, HMM/CD 0.2%, HMM/VQ 3.7% • Error rates for the other 200 speakers: 1.55%, 1.55% • Conclusions: 1. With VQ the error is much larger (about 18 times), so VQ is not so good. 2. Test speakers outside the training set do worse than those inside it (about 8 times).
13.2 Connected Speech Recognition Systems (1) • Mainly for connected digits (no grammar) • 1. System using level building and DTW, for speaker-dependent (one set of templates) or speaker-independent (multiple sets of templates) recognition. The reference templates should be linearly expanded or contracted to the same number of frames. • Results and discussion
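Linearly expanding or contracting a template to a fixed number of frames could be done along these lines. The function name `resample_frames` and the use of linear interpolation between neighboring frames are assumptions; the slides do not say how the length normalization was implemented.

```python
def resample_frames(frames, target_len):
    """Linearly expand or contract a template to target_len frames."""
    n = len(frames)
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1)  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo                          # weight of the later frame
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out

# Expand a 2-frame template to 3 frames, contract a 5-frame template to 3.
print(resample_frames([[0.0], [2.0]], 3))
print(resample_frames([[0.0], [1.0], [2.0], [3.0], [4.0]], 3))
```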
Connected Speech Recognition Systems (2) • 2. System by HMM • Every state has an energy probability distribution and a duration probability distribution • Training requires an optimal segmentation of the utterances into digits. It can be done manually or by a k-means procedure. (Tools now exist that require only a little manual work.)
13.3 Continuous Speech Recognition Systems (1) • Large-vocabulary, speaker-independent, continuous speech recognition is the most significant, challenging and widely applicable research topic. • Its error rate is about 50 times that of a small-vocabulary, speaker-dependent, isolated-word system. The latter achieved 0.2%-2.5% in the 1980s, whereas the former is about 10% to 15%.
Continuous Speech Recognition Systems (2) • Acoustic-phonetic layer, word layer and syntactic layer • The acoustic-phonetic layer outputs sub-word units. The acoustic models are based on phones or phonemes. • The word layer specifies the vocabulary and how phones are combined to form the words in the vocabulary.
Continuous Speech Recognition Systems (3) • The syntactic layer specifies how words are combined into sentences (by what kind of rules) • If the domain is specific (narrow), the rules are restrictive and the set of sentences is very limited. Accuracy is high but the application is limited. If the constraints are loosened, accuracy drops but the application becomes wider. Perplexity is used to measure how constraining the rules are.
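Perplexity can be illustrated with a short calculation. The sketch assumes the usual definition, 2 raised to the negative average log2 probability the language model assigns to each word; the function name and the toy probabilities are made up for the example.

```python
import math

def perplexity(word_probs):
    """2 ** (-(1/N) * sum(log2 P(w_i | history))) over the N words of a test text."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)

# Connected digits with no grammar: each of 10 digits is equally likely.
print(perplexity([0.1] * 5))
# A tightly constrained grammar: only two words are possible at each point.
print(perplexity([0.5] * 4))
```

Under this definition the perplexity can be read as the average number of equally likely words the recognizer must choose among at each position, so tighter rules give a lower value.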
Continuous Speech Recognition Systems (4) • Designing the acoustic-phonetic layer • (1) Feature vector design for the acoustic layer • (2) Phone models and their training • CI (context-independent) and CD (context-dependent) models • LCD (left-context-dependent), RCD (right-context-dependent) and triphone models • Clustering applied to the different contexts
Continuous Speech Recognition Systems (5) • (3) Constructing words from phones using a phone network • Substitution, deletion and insertion
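One way to make substitution, deletion and insertion concrete is to align a reference phone sequence with an observed one and count each type of difference. The sketch below uses a standard minimum-edit-distance alignment with unit costs; the function name, the cost choice, and the toy phone strings are assumptions, and this is not necessarily how the phone network in the slides handles them.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Trace back from the end, counting each edit type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1   # a reference phone is missing from the observation
            i -= 1
        else:
            ins += 1    # the observation has an extra phone
            j -= 1
    return subs, dels, ins

reference = "k ae t".split()
print(align_counts(reference, "k ah t".split()))    # one phone replaced
print(align_counts(reference, "k t".split()))       # one phone dropped
print(align_counts(reference, "k ae ae t".split())) # one phone added
```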
Continuous Speech Recognition Systems (6) • Syntactic layer designing • Processing for language models
13.4 Word Spotting Systems (1) • To identify or recognize key words in continuous utterances. • Applications: surveillance, communication and computer input • Detection rate (Figure of Merit, FOM) • False alarm rate (FAR) • The two conflict: raising one tends to raise the other
Word Spotting Systems (2) • ROC (Receiver Operating Characteristic) curve • Filler (garbage) models for non-keyword speech • Basic structure of a word spotting system • Verification of the accepted words to reduce the false alarm rate (by a neural network)
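The trade-off between detection rate and false alarms can be seen by sweeping an acceptance threshold over the scores of putative keyword hits, which is how the points of an ROC curve are obtained. The scores, labels and thresholds below are invented for illustration, and false alarms are reported as raw counts rather than normalized per keyword per hour.

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, return (threshold, detection_rate, false_alarm_count)."""
    n_true = sum(labels)  # number of real keyword occurrences
    points = []
    for th in thresholds:
        accepted = [lab for s, lab in zip(scores, labels) if s >= th]
        hits = sum(accepted)
        false_alarms = len(accepted) - hits
        points.append((th, hits / n_true, false_alarms))
    return points

# Putative hits from a spotter: score and whether it was a real keyword (1) or not (0).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels, thresholds=[0.75, 0.5, 0.2])
for th, rate, fa in points:
    print(th, round(rate, 2), fa)
```

Lowering the threshold finds more of the true keywords but also lets more false alarms through, which is the conflict a later verification stage tries to reduce.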