T-61.184 Informaatiotekniikan erikoiskurssi IV (Special Course in Information Technology IV)
HMMs and Speech Recognition
based on chapter 7 of D. Jurafsky, J. Martin: Speech and Language Processing
Jaakko Peltonen, October 31, 2001
Contents
• speech recognition architecture
• HMM, Viterbi, A*
• speech acoustics & features
• computing acoustic probabilities
• speech synthesis
Speech Recognition Architecture
[Figure: processing pipeline: speech waveform → feature extraction (signal processing) → spectral feature vectors → phone likelihood estimation (Gaussians or neural networks) → phone likelihoods P(o|q) → decoding (Viterbi or stack decoder, consulting an N-gram grammar and an HMM lexicon) → words]
• Application: LVCSR (large-vocabulary continuous speech recognition)
• Large vocabulary: dictionary size 5,000–60,000 words
• Continuous speech (words not separated)
• Speaker-independent
Noisy Channel Model revisited
• acoustic input considered a noisy version of a source sentence
• decoding: find the sentence that most probably generated the input
• problems:
  - metric for selecting best match?
  - efficient algorithm for finding best match?
Bayes revisited
• acoustic input O: symbol sequence
• sentence W: string of words
• best match metric: probability
• Bayes' rule: Ŵ = argmax_W P(W|O) = argmax_W P(O|W) P(W) / P(O) = argmax_W P(O|W) P(W)
• P(O|W): observation likelihood → acoustic model
• P(W): prior probability → language model
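In code, the decision rule amounts to scoring each candidate sentence with both models and taking the argmax, usually in log space to avoid underflow. A minimal sketch, assuming acoustic_logprob and lm_logprob are supplied by the acoustic and language models (both names are hypothetical):

```python
def decode(candidates, acoustic_logprob, lm_logprob):
    # noisy-channel decoding: pick the sentence W maximizing
    # log P(O|W) + log P(W); P(O) is constant over W and can be dropped
    return max(candidates,
               key=lambda W: acoustic_logprob(W) + lm_logprob(W))
```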
Hidden Markov Models (HMMs)
• previously, Markov chains used to model pronunciation
• forward algorithm → phone sequence likelihood
• real input is not symbolic: spectral features
• input symbols do not correspond to machine states
• HMM definition:
  - state set Q
  - observation symbols O ≠ Q
  - transition probabilities A
  - start and end state(s)
  - observation likelihoods B, not limited to 1 and 0
HMMs, continued
[Figure: word model for "need": states start0 → n1 → iy2 → d3 → end4 with transitions a01, a12, a23, a34, self-loops a11, a22, a33, and skip arc a24; each emitting state assigns observation likelihoods b_j(o_t) to the observation sequence o1 o2 … o6]
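One way to write the word model above down as plain Python data, as a sketch: the structure (states, A, B) follows the figure, but every probability is an illustrative placeholder, not a trained value, and the symbolic observations stand in for real spectral features:

```python
# Toy HMM for the word "need", mirroring the figure above.
# States: start(0) -> n(1) -> iy(2) -> d(3) -> end(4).
states = ["start", "n", "iy", "d", "end"]

# Transition probabilities a[i][j]; self-loops model phone duration,
# and the entry for 2 -> 4 is the skip arc a24 from the figure.
A = {
    0: {1: 1.0},                  # a01
    1: {1: 0.6, 2: 0.4},          # a11, a12
    2: {2: 0.5, 3: 0.4, 4: 0.1},  # a22, a23, a24
    3: {3: 0.7, 4: 0.3},          # a33, a34
}

# Observation likelihoods b_j(o_t): probability of observing symbol o
# in state j (symbolic here; real systems use Gaussians or an MLP).
B = {
    1: {"n-like": 0.9, "other": 0.1},
    2: {"iy-like": 0.8, "other": 0.2},
    3: {"d-like": 0.9, "other": 0.1},
}
```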
The Viterbi Algorithm
• word boundaries unknown → segmentation
  [ay d ih s hh er d s ah m th ih ng ax b aw …] → I just heard something about…
• assumption: dynamic programming invariant
  If the ultimate best path for o includes state qi, it includes the best path up to & including qi.
• does not work for all grammars
Viterbi, continued

function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s′ from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s′] * b_s′(o_t)
        if ((viterbi[s′,t+1] = 0) || (new-score > viterbi[s′,t+1])) then
          viterbi[s′,t+1] ← new-score
          back-pointer[s′,t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path

• single automaton: combine single-word networks, add word transition probabilities (= bigram probabilities)
• states correspond to subphones & context (e.g. left/middle/right subphone likelihoods b(ax,aw))
• beam search
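The same algorithm as compact, runnable Python (a sketch of the idea rather than the book's exact figure; the final transition into the end state is omitted for brevity):

```python
def viterbi(obs, A, B, start, end):
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{start: 1.0}]
    back = [{}]
    for t, o in enumerate(obs):
        V.append({}); back.append({})
        for s, p in V[t].items():
            for s2, a in A.get(s, {}).items():
                if s2 == end:
                    continue                     # end state emits nothing
                score = p * a * B[s2].get(o, 0.0)
                if score > V[t + 1].get(s2, 0.0):
                    V[t + 1][s2] = score         # keep only the best path
                    back[t + 1][s2] = s          # remember its predecessor
    best = max(V[-1], key=V[-1].get)             # best final state
    path, s = [best], best
    for t in range(len(obs), 0, -1):             # backtrace
        s = back[t][s]
        path.append(s)
    return path[::-1], V[-1][best]
```

With the toy "need" model above, viterbi(["n-like", "iy-like", "d-like"], A, B, start=0, end=4) returns the state path [0, 1, 2, 3] together with its probability.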
Other Decoders
• Viterbi has problems:
  - computes the most probable state sequence, not word sequence
  - cannot be used with all language models (only bigrams)
• Solution 1: multiple-pass decoding
  - N-best Viterbi: return the N best sentences, re-sort them with a more complex model (see the sketch below)
  - word lattice: return a directed word graph + word observation likelihoods; refine with a more complex model
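A minimal sketch of the second pass of N-best rescoring; nbest, better_lm_logprob and lm_weight are hypothetical names, and hypotheses are assumed to arrive as (words, acoustic_logprob) pairs from the first pass:

```python
def rescore_nbest(nbest, better_lm_logprob, lm_weight=1.0):
    # second pass: re-rank hypotheses produced with a simple (bigram)
    # model using a richer language model
    return max(nbest,
               key=lambda h: h[1] + lm_weight * better_lm_logprob(h[0]))
```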
A* Decoder
• Viterbi uses an approximation of the forward algorithm: max instead of sum
• A* uses the complete forward algorithm → correct observation likelihoods, use any language model
• 'best-first' search of the word sequence tree:
  priority queue of scored paths to extend
• Algorithm:
  1. select highest-priority path (pop queue)
  2. create possible extensions (if none, stop)
  3. calculate scores for extended paths (from forward algorithm and language model)
  4. add scored paths to queue
A* Decoder, continued
[Figure: A* search tree of partial sentences with priority scores: from the start node, first-word candidates include "if" (30), "muscle" (31), "music" (32), "Alice" (40), "was" (29), "wants" (24), "Every" (25), "messy" (25), "walls" (2), "In" (4), "(none)" (1); p(acoustic|if) and p(acoustic|music) are forward probabilities, and word-to-word transitions such as p(music|if) come from the language model]
A* Decoder, continued
• score of word string w is not P(y|w) P(w) (y is the acoustic string)
• reason: a path prefix would always have a higher score than its extensions
• score: A* evaluation function f*(p) = g(p) + h*(p)
  - g(p): score from start to the end of the current string p
  - h*(p): estimated score of the best extension to the end of the utterance
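A sketch of the best-first loop with Python's heapq as the priority queue; extensions, g, h and is_complete are all assumed helpers (extensions yields one-word extensions of a partial sentence, g and h implement g(p) and h*(p) above):

```python
import heapq

def a_star_decode(extensions, g, h, is_complete, start=()):
    queue = [(-(g(start) + h(start)), start)]    # min-heap: negate scores
    while queue:
        _, p = heapq.heappop(queue)              # 1. pop highest-priority path
        if is_complete(p):
            return p                             # first complete path = best
        for p2 in extensions(p):                 # 2. create extensions
            score = g(p2) + h(p2)                # 3. score them
            heapq.heappush(queue, (-score, p2))  # 4. push back onto the queue
    return None
```

For the search to be guaranteed to return the best path, h must never underestimate the score actually achievable by the best completion (the criterion asked about in exercise 2).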
Acoustic Processing of Speech
• wave characteristics: frequency → pitch, amplitude → loudness
• visible information: vowel/consonant, voicing, length, fricatives, stop closure
• spectral features: Fourier spectrum / LPC spectrum
  - peaks characteristic of different sounds → formants
• spectrogram: changes over time
• digitization: sampling, quantization
• processing → cepstral features / PLP features
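A sketch of the framing + Fourier-spectrum step with numpy; the frame length and hop are illustrative values (roughly 25 ms and 10 ms at a 16 kHz sampling rate), not values from the slides:

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    # signal: 1-D numpy array of the digitized waveform;
    # slice it into overlapping windowed frames
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    # magnitude of the Fourier spectrum per frame; formants appear as peaks
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectra + 1e-10)   # rows = time, columns = frequency
```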
Computing Acoustic Probabilities
• simple way: vector quantization (cluster feature vectors & count cluster occurrences)
• continuous approach: calculate a probability density function (pdf) over observations
• Gaussian pdf: trained with the forward-backward algorithm
  - Gaussian mixtures, parameter tying
• multi-layer perceptron (MLP) pdf: trained with error back-propagation
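As a sketch of the continuous approach, the log-likelihood of a feature vector under one diagonal-covariance Gaussian (real systems use mixtures of these, with means and variances coming from forward-backward training):

```python
import math

def gaussian_loglik(o, mean, var):
    # log of a diagonal-covariance Gaussian pdf at feature vector o
    ll = 0.0
    for x, m, v in zip(o, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll
```

A mixture adds a log-sum-exp over weighted components of this form.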
Training a Speech Recognizer
• evaluation metric: word error rate
  1. compute minimum edit distance between hypothesized and correct string
  2. divide by the length of the correct string:
     word error rate = 100 × (substitutions + insertions + deletions) / (words in correct string)
• e.g. correct: "I went to a party"
  hypothesis: "Eye went two a bar tea"
  3 substitutions, 1 insertion → word error rate 80%
• state of the art: word error rate 20% on natural-speech tasks
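The metric in runnable form: a standard dynamic-programming edit distance over words, divided by the reference length (a sketch; real scoring tools also report the individual error types):

```python
def word_error_rate(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[-1][-1] / len(ref)

# word_error_rate("I went to a party", "Eye went two a bar tea") -> 80.0
```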
Embedded Training
• models to be trained:
  - language model: p(wi|wi-1wi-2)
  - observation likelihoods: bj(ot)
  - transition probabilities: aij
  - pronunciation lexicon: HMM state graph
• training data:
  - corpus of speech wavefiles + word transcriptions
  - large text corpus for language model training
  - smaller corpus of phonetically labeled speech
• N-gram language model: trained as in Chapter 6
• HMM lexicon structure: built by hand
  - PRONLEX, CMUdict: "off-the-shelf" pronunciation dictionaries
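For the language-model part, a minimal maximum-likelihood bigram estimator (a sketch: no smoothing, which any real system would add; the sentence markers are a common convention, not from the slides):

```python
from collections import Counter

def train_bigram_lm(sentences):
    # relative-frequency estimate of p(w_i | w_{i-1}) from tokenized sentences
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words[:-1])                  # context counts
        bigrams.update(zip(words[:-1], words[1:]))   # pair counts
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}
```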
Embedded Training, continued
• HMM parameters:
  - initial estimate: equal transition probabilities; observation probabilities bootstrapped (labeled speech → label for each frame → initial Gaussian means / variances)
  - MLP systems: forced Viterbi alignment: features & correct words given → best states → labels for each input → retrain MLP
  - Gaussian systems: forward-backward algorithm: compute forward & backward probabilities, re-estimate a and b; correct words known → prune model
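The forward pass in sketch form (the backward pass is symmetric; together they provide the statistics for re-estimating a and b). It differs from the Viterbi code above only in summing over paths instead of maximizing:

```python
def forward(obs, A, B, start):
    # alpha[t][s] = P(o_1 .. o_t, state s at time t)
    alpha = [{start: 1.0}]
    for o in obs:
        prev, cur = alpha[-1], {}
        for s, p in prev.items():
            for s2, a in A.get(s, {}).items():
                if s2 in B:   # emitting states only
                    cur[s2] = cur.get(s2, 0.0) + p * a * B[s2].get(o, 0.0)
        alpha.append(cur)
    return alpha
```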
Speech Synthesis
• text-to-speech (TTS) system: output is a phone sequence with durations and an F0 pitch contour
• waveform concatenation: based on a recorded speech database, segmented into short units
• simplest: 1 unit / phone; join units & smooth edges
• triphone models: too many combinations → diphones used
• diphones start/end midway through a phone for stability
• does not model pitch & duration changes (prosody)
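A toy version of "join units & smooth edges": concatenate recorded unit waveforms with a short linear cross-fade at each join (the overlap length is an arbitrary assumption):

```python
import numpy as np

def concatenate_units(units, overlap=64):
    # units: list of 1-D waveform arrays, each longer than `overlap`
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = units[0].astype(float)
    for u in units[1:]:
        u = u.astype(float)
        # cross-fade: fade the old tail out while the new head fades in
        out[-overlap:] = out[-overlap:] * (1 - fade_in) + u[:overlap] * fade_in
        out = np.concatenate([out, u[overlap:]])
    return out
```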
Speech Synthesis, continued
• use signal processing to change prosody
• LPC model separates pitch from spectral envelope
  - to modify pitch: generate pulses at the desired pitch, re-excite LPC coefficients → modified wave
  - to modify duration: contract/expand coefficient frames
• TD-PSOLA: frames centered around pitchmarks
  - to change pitch: move pitchmarks closer together / further apart
  - to change duration: duplicate / leave out frames
  - recombine: overlap and add frames
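A much-simplified sketch of the TD-PSOLA duration operation: pick frames (duplicating or skipping them according to the stretch factor) and recombine by overlap-add. Real TD-PSOLA places frames on pitchmarks; this sketch assumes a fixed hop and pre-windowed, equal-length frames:

```python
import numpy as np

def ola_stretch(frames, hop, factor):
    # factor > 1 duplicates frames (longer output), factor < 1 skips them
    picks = np.round(np.arange(0, len(frames), 1.0 / factor)).astype(int)
    picks = picks[picks < len(frames)]
    out = np.zeros((len(picks) - 1) * hop + len(frames[0]))
    for k, i in enumerate(picks):
        out[k * hop: k * hop + len(frames[i])] += frames[i]   # overlap-and-add
    return out
```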
Speech Synthesis, continued
• problems with speech synthesis:
  - 1 example/diphone is insufficient
  - signal processing distortion
  - subtle effects not modeled
• unit selection: collect several examples/unit with different pitch/duration/linguistic situation
• selection method (F0 contour with 3 values/phone, large unit corpus):
  1. find candidates (closest phone, duration & F0), rank them by target cost (closeness)
  2. measure join quality of neighbouring candidates, rank joins by concatenation cost
  - pick the best unit set → more natural speech (see the sketch below)
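A sketch of the selection step as dynamic programming over the candidate lists; target_cost and concat_cost are assumed helpers standing in for the closeness and join-quality measures above:

```python
def select_units(candidates, target_cost, concat_cost):
    # candidates[t]: list of recorded units matching target position t;
    # units must be hashable (e.g. unit ids)
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for t in range(1, len(candidates)):
        nxt = {}
        for u in candidates[t]:
            # cheapest predecessor, counting the join (concatenation) cost
            prev, (c, path) = min(
                best.items(),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u))
            nxt[u] = (c + concat_cost(prev, u) + target_cost(t, u), path + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])[1]
```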
Human Speech Recognition
• PLP analysis inspired by the human auditory system
• lexical access has common properties:
  - frequency
  - parallelism
  - neighborhood effects
  - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
• differences:
  - time-course: human processing is on-line
  - other cues: prosody
Exercises
1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8 on the input [aa n n ax n iy d]. What is the most probable string of words?
2. Suggest two estimate functions h* for use in A* decoding. What criteria should such a function satisfy for the search to work (i.e. to return the best path)?