Speech recognition

Automatic speech recognition
Contents: ASR systems, ASR applications, ASR courses
Presented by Kalle Palomäki
Teaching material: Kalle Palomäki & Mikko Kurimo
About Kalle
• Background: acoustics and audio, auditory brain measurements, hearing models, noise-robust ASR
• PhD 2005 at TKK
• Research experience at:
  • Department of Signal Processing and Acoustics
  • Department of Information and Computer Science, Aalto
  • University of Sheffield, Speech and Hearing group
• Team leader of the noise-robust ASR team, Academy Research Fellow
• Current research themes:
  • Hearing-inspired missing-data approach to noise-robust ASR
  • Sound separation and feature extraction
Goals of today
• Learn what methods are used for automatic speech recognition (ASR)
• Learn about typical ASR applications and the factors that affect ASR performance

Definition: automatic speech recognition (ASR) = transformation of speech audio into text
Orientation
• What are the main challenges faced in automatic speech recognition?
• With a partner, try to think of the three most important ones
ASR tasks and solutions
• Speaking environment and microphone
  • Office: headset or close-talking microphone
  • Telephone speech, mobile
  • Noise, outdoors, microphone far away
• Style of speaking
• Speaker modeling
ASR tasks and solutions
• Speaking environment and microphone
• Style of speaking
  • Isolated words
  • Connected words, small vocabulary
  • Word spotting in fluent speech
  • Continuous speech, large vocabulary
  • Spontaneous speech, ungrammatical
• Speaker modeling
ASR tasks and solutions
• Speaking environment and microphone
• Style of speaking
• Speaker modeling
  • Speaker-dependent models
  • Speaker-independent, average-speaker models
  • Speaker adaptation
Automatic speech recognition
Large-vocabulary continuous speech recognition (LVCSR): a complex pattern recognition system that uses many probabilistic models at different hierarchical levels to transform speech into text.
[Pipeline diagram: speech signal → feature extraction → acoustic modeling + language modeling → decoder → recognized text]
What is speech recognition?
Find the most likely word sequence given the acoustic signal and statistical models!
• Acoustic model: defines the sound units, independent of speaker and recording conditions
• Language model: defines the words and how likely they are to occur together
• Lexicon (vocabulary): defines the word set and how the words are formed from sound units
What is speech recognition?
Find the most likely word sequence given the acoustic observations and statistical models
What is speech recognition?
After applying Bayes' rule: find the most likely word sequence given the observations and models,

W* = argmax_W P(W | O) = argmax_W P(O | W) P(W)

where P(O | W) is the acoustic model and P(W) the language model
Preprocessing & features
• Extract the essential information from the signal
• Describe the signal by compact feature vectors computed over short time intervals
[Pipeline diagram repeated, with feature extraction highlighted]
MFCC computation chain:
audio signal s_t(n)
→ |DFT{s_t(n)}|: magnitude spectrogram S_{t,f}
→ auditory frequency resolution: mel spectrogram S_{t,j}
→ compression: log S_{t,j}
→ de-correlation (discrete cosine transform): mel-frequency cepstral coefficients (MFCC)
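The MFCC chain (magnitude DFT, mel filterbank, log compression, DCT de-correlation) can be sketched in Python. This is a minimal NumPy-only illustration, not the implementation used on the course; the frame length, filter count, and coefficient count (400 samples, 26 filters, 13 coefficients) are assumed typical values.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping used for auditory frequency resolution
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_mels=26, n_ceps=13):
    """MFCCs for one short-time frame: |DFT| -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame))          # magnitude spectrum |DFT{s_t(n)}|
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # Triangular mel filterbank (auditory frequency resolution)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fbank = np.zeros((n_mels, len(freqs)))
    for j in range(n_mels):
        lo, mid, hi = hz_pts[j], hz_pts[j + 1], hz_pts[j + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[j] = np.clip(np.minimum(rising, falling), 0.0, None)
    mel_spec = fbank @ spectrum                    # mel spectrogram S_{t,j}
    log_mel = np.log(mel_spec + 1e-10)             # compression: log S_{t,j}
    # De-correlation with a DCT-II, keeping the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return dct @ log_mel

# One windowed frame of a synthetic 440 Hz tone at 16 kHz
frame = np.hamming(400) * np.sin(2 * np.pi * 440 * np.arange(400) / 16000)
print(mfcc_frame(frame, sr=16000).shape)  # (13,)
```

In a full front end this function would be applied to every overlapping frame of the signal, producing one 13-dimensional feature vector per frame.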
Acoustic modeling
• Find basic speech units and their models in the feature space
• Given the features, compute model probabilities
[Pipeline diagram repeated, with acoustic modeling highlighted]
Phonemes
• Basic units of language: written language has letters, spoken language has phonemes
• Wikipedia: “the smallest contrastive linguistic unit which may bring about a change of meaning”
• There are different writing systems, e.g. IPA (the International Phonetic Alphabet)
• Phoneme sets differ from language to language
[Figure: IPA symbols for US English]
Gaussian mixture model (GMM)
[Figures: a 1-dimensional Gaussian mixture model and a general GMM. Pictures by B. Pellom]
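A one-dimensional GMM density is just a weighted sum of Gaussian densities. A minimal sketch; the component weights, means, and variances below are invented for illustration:

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture: a weighted sum of Gaussian components."""
    x = np.asarray(x, dtype=float)
    density = np.zeros_like(x)
    for w, mu, var in zip(weights, means, variances):
        density += w * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return density

# Invented 3-component mixture; the weights must sum to 1
weights = [0.5, 0.3, 0.2]
means = [-2.0, 0.0, 3.0]
variances = [1.0, 0.5, 2.0]

# Sanity check: the density integrates to ~1 (simple Riemann sum over a wide interval)
xs = np.linspace(-8.0, 8.0, 1601)
total = gmm_pdf(xs, weights, means, variances).sum() * (xs[1] - xs[0])
print(round(total, 2))  # ~1.0
```

In an acoustic model, one such mixture (in many dimensions, one per feature) is trained per speech unit, and `gmm_pdf` evaluated at a feature vector plays the role of the observation probability b(o).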
Training a GMM classifier
[Figure: collected data, frames labeled with phonemes, e.g. “_ k k k k k ae ae ae ae _ _ _ _ t t t t t _ _ _ _ _ _ _”]
Testing
[Figure: for a test frame, each phoneme GMM outputs a probability, e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum(·) = 1]
How to model a sequence of phonemes (or GMMs)?
[Picture by B. Pellom]
Hidden Markov model (HMM), 3 states
• Transitions between states
• Observation probabilities b(o1), b(o2), b(o3) for the acoustic observations o1, o2, o3, each given by a GMM (GMM1, GMM2, GMM3)
Hidden Markov model (HMM), 1 state
• Observation sequence: O = {o1, o2, o3}
• Observation probability sequence: B = {b(o1), b(o2), b(o3)}
• Sequence probability: P = b(o1) · a11 · b(o2) · a11 · b(o3) · a_out, where a11 is the self-transition probability and a_out the probability of exiting the state
Realistic scenario
[Figure: per-frame GMM outputs (e.g. 0, 0.05, 0.4, 0.05, 0.5; Sum(·) = 1) for the phonemes /_/, /k/, /ae/, /t/, /_/, giving a noisy recognized frame sequence such as “_ _ k t k t k ae ow ae ae _ _ _ _ t k k t t _ _ _ _ _ _ _”]
[Figure: 1-state phoneme HMMs for /_/, /k/, /ae/, /t/, each tied to a GMM, with self-transition and exit probabilities such as 0.8/0.2, 0.79/0.21, 0.9/0.1]
Exercise 1. Calculate the likelihood of the phoneme sequence /k/ /ow/, as in the word “cow”. The observation probabilities, the temporal alignment, and a set of 1-state phoneme HMMs are shown below.
[Figure: alignment of frames to /k/ and /ow/]
Solution. Calculate the likelihood of the phoneme sequence /k/ /ow/, as in the word “cow”, by multiplying the observation and transition probabilities along the alignment:
0.4 · 0.2 · 0.5 · 0.92 · 0.4 · 0.92 · 0.5 = 0.006771
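The likelihood is a plain product of observation and transition probabilities along the alignment; checking the arithmetic of the worked solution:

```python
import math

# Observation and transition probabilities along the alignment for /k/ /ow/,
# in the order they are multiplied in the slide's worked solution
terms = [0.4, 0.2, 0.5, 0.92, 0.4, 0.92, 0.5]
likelihood = math.prod(terms)
print(round(likelihood, 6))  # 0.006771
```

Because such products of many small probabilities underflow quickly, real decoders sum log probabilities instead.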
Context-dependent HMMs
[Figure: triphone HMMs for the phoneme sequence /_/, /k/, /ae/, /t/, /_/]
More on HMMs
• Lecture 12 Feb, “Sentence level processing” by Oskar Kohonen
• Exercise 6, “Hidden Markov Models”
Language modeling
• Gives a prior probability for any word sequence (or phoneme sequence)
• Defines the basic language units (e.g. words)
• Learns statistical models from large text collections
[Pipeline diagram repeated, with language modeling highlighted]
N-gram language model
• A stochastic model of the relations between words: which words often occur close to each other?
• The model predicts the probability distribution of the next word given the previous ones
• Estimated from large text corpora, i.e. millions of words
• Smoothing and pruning are required to learn compact long-span models from sparse training data
• More information in the lecture 26 Feb, “Statistical language models” by Mikko Kurimo
N-gram models
• Trigram = 3-gram: word occurrence depends only on the immediate context, i.e. the two previous words
• A conditional probability of a word given its context: P(w_i | w_{i-2}, w_{i-1})
[Picture by B. Pellom]
Estimation of an N-gram model
• Bigram example: P(“stew” | “eggplant”) = c(“eggplant stew”) / c(“eggplant”)
• This is the maximum-likelihood estimate of the probability of w_i given w_j: P(w_i | w_j) = c(w_j, w_i) / c(w_j)
• c(w_j, w_i) is the count of w_j and w_i occurring together, c(w_j) the count of w_j
• Works well only for frequent bigrams
Data from the Berkeley restaurant corpus (Jurafsky & Martin, 2000, “Speech and Language Processing”). Given the unigram counts, calculate the missing bigram probabilities, e.g.:
1087 / 3437 = .32
6 / 1215 = .0049
3 / 3256 = .00092
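Each bigram probability is one count divided by another; reproducing the three divisions from the exercise (the count pairs are the ones shown on the slide):

```python
# Maximum-likelihood bigram estimates are ratios of counts,
# P(w_i | w_j) = c(w_j, w_i) / c(w_j). The three (bigram count, unigram count)
# pairs below are the numbers from the exercise.
cases = [(1087, 3437), (6, 1215), (3, 3256)]
probs = [c_joint / c_prev for c_joint, c_prev in cases]
print([round(p, 5) for p in probs])  # [0.31626, 0.00494, 0.00092]
```

Rounded to the precision used on the slide, these give .32, .0049, and .00092.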
On N-gram sparsity
• Shakespeare’s complete works have a vocabulary of 29,066 word-form types
• The total number of words is 884,647
• This makes the number of possible bigrams 29,066² ≈ 845 million
• Fewer than 300,000 distinct bigrams are found in the writings
• Conclusion: even a learned bigram model would be very sparse
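The sparsity claim is easy to verify with the slide's numbers:

```python
vocab = 29_066           # word-form types in Shakespeare's complete works
possible = vocab ** 2    # number of possible bigrams
observed = 300_000       # distinct bigrams actually found (upper bound from the slide)
print(possible)                              # 844832356, i.e. roughly 845 million
print(round(observed / possible * 100, 3))   # 0.036, i.e. well under 0.1 % observed
```

So more than 99.9 % of the possible bigrams never occur in the corpus, which is why smoothing is essential.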
Morphemes as language units
• In many languages, words are not suitable as basic units for the language models
• Inflections, prefixes, suffixes, and compound words; Finnish has all of these issues
• The best units carry meaning (e.g. bare letters or syllables are not good)
• → morphemes, or “statistical morphs”: tietä + isi + mme + kö + hän = would + we + really + know
April 28, 2008, http://www.cis.hut.fi/projects/speech/
Lexicon for sub-word units?
• Better coverage: few or no OOVs (out-of-vocabulary words), even for new words
• Phonemes, syllables, morphemes, or stem + endings?
  un + re + late + d + ness
  unrelate + d + ness
  unrelated + ness
• How to split and rebuild words?
More about language models
• Lecture 26 Feb, “Statistical language models” by Mikko Kurimo
• Exercise 3, N-gram language models
Decoding
• Join the acoustic and language model probabilities
• Find the most likely sentence hypothesis by pruning and choosing the best
• Has a significant effect on recognition speed and accuracy
[Pipeline diagram repeated, with the decoder highlighted]
What is speech recognition? (recap)
After applying Bayes' rule, find the most likely word sequence given the observations and models:
W* = argmax_W P(O | W) P(W)
where P(O | W) is the acoustic model and P(W) the language model
Decoding
• The task is to find the most probable word sequence, given the models and the acoustic observations
• Viterbi search: find the most probable state sequence
• An exact search made efficient by dynamic programming and recursion
• For large-vocabulary continuous speech recognition (LVCSR), the search space must be pruned and optimized
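Viterbi search can be sketched as dynamic programming over log probabilities. This is a generic illustration, not the course decoder; the toy 2-state model at the end is invented:

```python
import numpy as np

def viterbi(log_trans, log_obs, log_init):
    """Most probable state sequence via dynamic programming.

    log_trans[i, j]: log P(next state j | state i)
    log_obs[t, j]:   log b_j(o_t), observation log-probability per state
    log_init[j]:     log P(start in state j)
    """
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)          # best log-score ending in each state
    backptr = np.zeros((T, S), dtype=int)     # best predecessor for traceback
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[i, j]: come from i, go to j
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_obs[t]
    # Trace back the best path from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Invented 2-state toy model with 3 observation frames
log_trans = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]))
log_init = np.log(np.array([0.6, 0.4]))
print(viterbi(log_trans, log_obs, log_init))  # [0, 0, 1]
```

An LVCSR decoder works on the same principle but over a composed network of phoneme HMMs, the lexicon, and the language model, with aggressive pruning of low-scoring paths.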
N-best lists
• Easy to apply long-span LMs for rescoring
• The differences between the hypotheses are small
• Not a very compact representation
• Tokens can be decoded into a lattice or word-graph structure that shows all the good options
[Picture by B. Pellom]
Word graph representation
[Picture by B. Pellom]
Automatic speech recognition
Contents today: ASR systems today, ASR applications, ASR courses
Typical applications
• User interface by speech
• Dictation
• Speech translation
• Audio information retrieval