Speech recognition

Automatic speech recognition
Contents: ASR systems, ASR applications, ASR courses
Presented by Kalle Palomäki
Teaching material: Kalle Palomäki & Mikko Kurimo
About Kalle
• Background: acoustics and audio, auditory brain measurements, hearing models, noise-robust ASR
• PhD 2005 at TKK
• Research experience at:
  • Department of Signal Processing and Acoustics
  • Department of Information and Computer Science, Aalto
  • University of Sheffield, Speech and Hearing group
• Team leader of the noise-robust ASR team, Academy Research Fellow
• Current research themes:
  • Hearing-inspired missing-data approach to noise-robust ASR
  • Sound separation and feature extraction
Goals of today
• Learn what methods are used for automatic speech recognition (ASR)
• Learn about typical ASR applications and the factors that affect ASR performance

Definition: automatic speech recognition (ASR) = transformation of speech audio into text
Orientation
• What are the main challenges faced in automatic speech recognition?
• With a partner, try to think of the three most important ones
ASR tasks and solutions
• Speaking environment and microphone
  • Office: headset or close-talking microphone
  • Telephone speech, mobile
  • Noise, outdoors, microphone far away
• Style of speaking
• Speaker modeling
ASR tasks and solutions
• Speaking environment and microphone
• Style of speaking
  • Isolated words
  • Connected words, small vocabulary
  • Word spotting in fluent speech
  • Continuous speech, large vocabulary
  • Spontaneous speech, ungrammatical
• Speaker modeling
ASR tasks and solutions
• Speaking environment and microphone
• Style of speaking
• Speaker modeling
  • Speaker-dependent models
  • Speaker-independent, average-speaker models
  • Speaker adaptation
Automatic speech recognition
Large-vocabulary continuous speech recognition (LVCSR): a complex pattern recognition system that uses many probabilistic models at different hierarchical levels to transform speech into text.
[Pipeline diagram: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]
What is speech recognition?
Find the most likely word sequence given the acoustic signal and statistical models!
• Acoustic model: defines the sound units, independent of speaker and recording conditions
• Language model: defines the words and how likely they are to occur together
• Lexicon (vocabulary): defines the word set and how the words are formed from sound units
What is speech recognition?
Find the most likely word sequence given the acoustic observations and statistical models
What is speech recognition?
After applying Bayes' rule: find the most likely word sequence given the observations and models,

W* = argmax_W P(W | O) = argmax_W P(O | W) P(W)

where P(O | W) is the acoustic model and P(W) the language model
Preprocessing & features
• Extract the essential information from the signal
• Describe the signal by compact feature vectors computed over short time intervals
[Pipeline diagram repeated, with feature extraction highlighted]
MFCC computation chain:
audio signal s_t(n)
→ |DFT{s_t(n)}|: magnitude spectrogram S_{t,f}
→ auditory frequency resolution: mel spectrogram S_{t,j}
→ compression: log S_{t,j}
→ de-correlation (discrete cosine transform): mel-frequency cepstral coefficients (MFCC)
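The MFCC chain (magnitude DFT, mel filterbank, log compression, DCT de-correlation) can be sketched in Python. This is a minimal NumPy-only illustration, not the implementation used on the course; the frame length, filter count, and coefficient count (400 samples, 26 filters, 13 coefficients) are assumed typical values.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping used for auditory frequency resolution
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=26, n_ceps=13):
    """MFCCs for one short-time frame: |DFT| -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))          # magnitude spectrum |DFT{s_t(n)}|
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # Triangular mel filterbank (auditory frequency resolution)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros((n_mels, len(freqs)))
    for j in range(n_mels):
        lo, mid, hi = hz_pts[j], hz_pts[j + 1], hz_pts[j + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[j] = np.clip(np.minimum(rising, falling), 0.0, None)
    mel_spec = fbank @ spectrum                    # mel spectrogram S_{t,j}
    log_mel = np.log(mel_spec + 1e-10)             # compression: log S_{t,j}
    # De-correlation with a DCT-II, keeping the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return dct @ log_mel

# One windowed frame of a synthetic 440 Hz tone at 16 kHz
frame = np.hamming(400) * np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
print(mfcc_frame(frame, sr=16000).shape)  # (13,)
```

In a full front end this function would be applied to every overlapping frame of the signal, producing one 13-dimensional feature vector per frame.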
Acoustic modeling
• Find basic speech units and their models in the feature space
• Given the features, compute model probabilities
[Pipeline diagram repeated, with acoustic modeling highlighted]
Phonemes
• Basic units of language: written language has letters, spoken language has phonemes
• Wikipedia: “the smallest contrastive linguistic unit which may bring about a change of meaning”
• There are different writing systems, e.g. IPA (the International Phonetic Alphabet)
• Phoneme sets differ from language to language
[Figure: IPA symbols for US English]
Gaussian mixture model (GMM)
[Figures: a 1-dimensional Gaussian mixture model and a general GMM. Pictures by B. Pellom]
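A one-dimensional GMM density is just a weighted sum of Gaussian densities. A minimal sketch; the component weights, means, and variances below are invented for illustration:

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture: a weighted sum of Gaussian components."""
    x = np.asarray(x, dtype=float)
    density = np.zeros_like(x)
    for w, mu, var in zip(weights, means, variances):
        density += w * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return density

# Invented 3-component mixture; the weights must sum to 1
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.0, 3.0]
variances = [1.0, 0.5, 2.0]

# Sanity check: the density integrates to ~1 (simple Riemann sum over a wide interval)
xs = np.linspace(-8.0, 8.0, 1601)
total = gmm_pdf(xs, weights, means, variances).sum() * (xs[1] - xs[0])
print(round(total, 2))  # ~1.0
```

In an acoustic model, one such mixture (in many dimensions, one per feature) is trained per speech unit, and `gmm_pdf` evaluated at a feature vector plays the role of the observation probability b(o).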
Training a GMM classifier
[Figure: collected data, frames labeled with phonemes, e.g. “_ k k k k k ae ae ae ae _ _ _ _ t t t t t _ _ _ _ _ _ _”]
Testing
[Figure: for a test frame, each phoneme GMM outputs a probability, e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum(·) = 1]
How to model a sequence of phonemes (or GMMs)?
[Picture by B. Pellom]
Hidden Markov model (HMM), 3 states
• Transitions between states
• Observation probabilities b(o1), b(o2), b(o3) for the acoustic observations o1, o2, o3, each given by a GMM (GMM1, GMM2, GMM3)
Hidden Markov model (HMM), 1 state
• Observation sequence: O = {o1, o2, o3}
• Observation probability sequence: B = {b(o1), b(o2), b(o3)}
• Sequence probability: P = b(o1) · a11 · b(o2) · a11 · b(o3) · a_out, where a11 is the self-transition probability and a_out the probability of exiting the state
Realistic scenario
[Figure: per-frame GMM outputs (e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum(·) = 1) for the phonemes /_/, /k/, /ae/, /t/, /_/, giving a noisy recognized frame sequence such as “_ _ k t k t k ae ow ae ae _ _ _ _ t k k t t _ _ _ _ _ _ _”]
[Figure: 1-state phoneme HMMs for /_/, /k/, /ae/, /t/, each tied to a GMM, with self-transition and exit probabilities such as 0.8/0.2, 0.79/0.21, 0.9/0.1]
Exercise 1. Calculate the likelihood of the phoneme sequence /k/ /ow/, as in the word “cow”. The observation probabilities, the temporal alignment, and a set of 1-state phoneme HMMs are shown below.
[Figure: alignment of frames to /k/ and /ow/]
Solution. Calculate the likelihood of the phoneme sequence /k/ /ow/, as in the word “cow”, by multiplying the observation and transition probabilities along the alignment:
0.4 · 0.2 · 0.5 · 0.92 · 0.4 · 0.92 · 0.5 = 0.006771
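The likelihood is a plain product of observation and transition probabilities along the alignment; checking the arithmetic of the worked solution:

```python
import math

# Observation and transition probabilities along the alignment for /k/ /ow/,
# in the order they are multiplied in the slide's worked solution
terms = [0.4, 0.2, 0.5, 0.92, 0.4, 0.92, 0.5]
likelihood = math.prod(terms)
print(round(likelihood, 6))  # 0.006771
```

Because such products of many small probabilities underflow quickly, real decoders sum log probabilities instead.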
Context-dependent HMMs
[Figure: triphone HMMs for the phoneme sequence /_/, /k/, /ae/, /t/, /_/]
More on HMMs
• Lecture 12 Feb, “Sentence level processing” by Oskar Kohonen
• Exercise 6, “Hidden Markov Models”
Language modeling
• Gives a prior probability for any word sequence (or phoneme sequence)
• Defines the basic language units (e.g. words)
• Learns statistical models from large text collections
[Pipeline diagram repeated, with language modeling highlighted]
N-gram language model
• A stochastic model of the relations between words: which words often occur close to each other?
• The model predicts the probability distribution of the next word given the previous ones
• Estimated from large text corpora, i.e. millions of words
• Smoothing and pruning are required to learn compact long-span models from sparse training data
• More information in the lecture 26 Feb, “Statistical language models” by Mikko Kurimo
N-gram models
• Trigram = 3-gram: word occurrence depends only on the immediate context, i.e. the two previous words
• A conditional probability of a word given its context: P(w_i | w_{i-2}, w_{i-1})
[Picture by B. Pellom]
Estimation of an N-gram model
• Bigram example: P(“stew” | “eggplant”) = c(“eggplant stew”) / c(“eggplant”)
• This is the maximum-likelihood estimate of the probability of w_i given w_j: P(w_i | w_j) = c(w_j, w_i) / c(w_j)
• c(w_j, w_i) is the count of w_j and w_i occurring together, c(w_j) the count of w_j
• Works well only for frequent bigrams
Data from the Berkeley restaurant corpus (Jurafsky & Martin, 2000, “Speech and Language Processing”). Given the unigram counts, calculate the missing bigram probabilities, e.g.:
1087 / 3437 = .32
6 / 1215 = .0049
3 / 3256 = .00092
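Each bigram probability is one count divided by another; reproducing the three divisions from the exercise (the count pairs are the ones shown on the slide):

```python
# Maximum-likelihood bigram estimates are ratios of counts,
# P(w_i | w_j) = c(w_j, w_i) / c(w_j). The three (bigram count, unigram count)
# pairs below are the numbers from the exercise.
cases = [(1087, 3437), (6, 1215), (3, 3256)]
probs = [c_joint / c_prev for c_joint, c_prev in cases]
print([round(p, 5) for p in probs])  # [0.31626, 0.00494, 0.00092]
```

Rounded to the precision used on the slide, these give .32, .0049, and .00092.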
On N-gram sparsity
• Shakespeare’s complete works have a vocabulary of 29,066 word-form types
• The total number of words is 884,647
• This makes the number of possible bigrams 29,066² ≈ 845 million
• Fewer than 300,000 distinct bigrams are found in the writings
• Conclusion: even a learned bigram model would be very sparse
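The sparsity claim is easy to verify with the slide's numbers:

```python
vocab = 29_066           # word-form types in Shakespeare's complete works
possible = vocab ** 2    # number of possible bigrams
observed = 300_000       # distinct bigrams actually found (upper bound from the slide)
print(possible)                              # 844832356, i.e. roughly 845 million
print(round(observed / possible * 100, 3))   # 0.036, i.e. well under 0.1 % observed
```

So more than 99.9 % of the possible bigrams never occur in the corpus, which is why smoothing is essential.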
Morphemes as language units
• In many languages, words are not suitable as basic units for the language models
• Inflections, prefixes, suffixes, and compound words; Finnish has all of these issues
• The best units carry meaning (e.g. bare letters or syllables are not good)
• → morphemes, or “statistical morphs”: tietä + isi + mme + kö + hän = would + we + really + know
April 28, 2008, http://www.cis.hut.fi/projects/speech/
Lexicon for sub-word units?
• Better coverage: few or no OOVs (out-of-vocabulary words), even for new words
• Phonemes, syllables, morphemes, or stem + endings?
  un + re + late + d + ness
  unrelate + d + ness
  unrelated + ness
• How to split and rebuild words?
More about language models
• Lecture 26 Feb, “Statistical language models” by Mikko Kurimo
• Exercise 3, N-gram language models
Decoding
• Join the acoustic and language model probabilities
• Find the most likely sentence hypothesis by pruning and choosing the best
• Has a significant effect on recognition speed and accuracy
[Pipeline diagram repeated, with the decoder highlighted]
What is speech recognition? (recap)
After applying Bayes' rule, find the most likely word sequence given the observations and models:
W* = argmax_W P(O | W) P(W)
where P(O | W) is the acoustic model and P(W) the language model
Decoding
• The task is to find the most probable word sequence, given the models and the acoustic observations
• Viterbi search: find the most probable state sequence
• An exact search made efficient by dynamic programming and recursion
• For large-vocabulary continuous speech recognition (LVCSR), the search space must be pruned and optimized
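Viterbi search can be sketched as dynamic programming over log probabilities. This is a generic illustration, not the course decoder; the toy 2-state model at the end is invented:

```python
import numpy as np

def viterbi(log_trans, log_obs, log_init):
    """Most probable state sequence via dynamic programming.

    log_trans[i, j]: log P(next state j | state i)
    log_obs[t, j]:   log b_j(o_t), observation log-probability per state
    log_init[j]:     log P(start in state j)
    """
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)          # best log-score ending in each state
    backptr = np.zeros((T, S), dtype=int)     # best predecessor for traceback
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: come from i, go to j
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_obs[t]
    # Trace back the best path from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Invented 2-state toy model with 3 observation frames
log_trans = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
log_init = np.log(np.array([0.6, 0.4]))
print(viterbi(log_trans, log_obs, log_init))  # [0, 0, 1]
```

An LVCSR decoder works on the same principle but over a composed network of phoneme HMMs, the lexicon, and the language model, with aggressive pruning of low-scoring paths.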
N-best lists
• Easy to apply long-span LMs for rescoring
• The differences between the hypotheses are small
• Not a very compact representation
• Tokens can be decoded into a lattice or word-graph structure that shows all the good options
[Picture by B. Pellom]
Word graph representation
[Picture by B. Pellom]
Automatic speech recognition
Contents today: ASR systems today, ASR applications, ASR courses
Typical applications
• User interface by speech
• Dictation
• Speech translation
• Audio information retrieval