Continuous Word Recognition
[Block diagram: database (text + speech) feeding the training/recognition system, with a scoring stage.]
Continuous Word Recognition • Problems with isolated word recognition: • The word boundaries are not known. • Variability increases: • coarticulation of “words” • speaking rate
Issues in the Choice of Subword Unit Sets • Context independent: phonemes • Few units, easy to train. • A single HMM is too poor to account for the variability due to phone context. • SPHINX: model_architecture/Telefonica.ci.mdef • Context dependent: triphones • Phonemes in a specific left and right context. • A very large number of units to train. • A very large quantity of training data is needed. • SPHINX: model_architecture/Telefonica.untied.mdef
Clustering Acoustic-Phonetic Units • Many phones have similar effects on their neighboring phones; hence, many triphones have very similar Markov states. • A senone is a cluster of similar Markov states. • Advantages: • More training data per shared state. • Less memory used.
Senonic Decision Tree (SDT) • An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions and/or negations of a set of predetermined simple questions.
Decision Tree for Classifying the Second State of the /k/ Triphone
[Figure: decision tree. Root question: Is the left phone (LP) a sonorant or nasal? Further questions include: Is the LP /s, z, sh, zh/? Is the right phone (RP) a back-R? Is the RP voiced? Is the LP a back-L, or (is the left context neither a nasal nor the RP a lax vowel)? The leaves are senones 1–6.]
When Applied to the Word welcome
[Figure: the same decision tree, applied to the /k/ triphone in the word welcome.]
The tree can be constructed automatically by searching, at each node, for the question that gives the largest decrease in entropy (a small sketch of this search appears below). • Sphinx: • Construction: $base_dir/c_scripts/03.bulidtrees • Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree • Once the tree has grown, it needs to be pruned. • Sphinx: • $base_dir/c_scripts/04.bulidtrees • Results: • $base_dir/trees/Telefonica.500/A-0.dtree • $base_dir/Telefonica_arquitecture/Telefonica.500.mdef
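A minimal sketch of the "best question" search described above. All names (`states`, `questions`, the predicates) are hypothetical illustration; SphinxTrain actually scores splits by the likelihood gain of the Gaussian models rather than by a label entropy, but the structure of the search is the same.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy (in bits) of a label distribution given as a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)

def best_question(states, questions):
    """Pick the linguistic question whose yes/no split of the pooled triphone
    states gives the largest decrease in occupancy-weighted entropy.

    states   : list of (context, senone_label) pairs pooled at this node.
    questions: dict mapping a question name to a predicate over the context.
    """
    parent = entropy(Counter(lbl for _, lbl in states))
    best, best_gain = None, 0.0
    for name, pred in questions.items():
        yes = Counter(lbl for ctx, lbl in states if pred(ctx))
        no = Counter(lbl for ctx, lbl in states if not pred(ctx))
        if not yes or not no:
            continue                       # this question does not split the node
        n_yes, n_no = sum(yes.values()), sum(no.values())
        child = (n_yes * entropy(yes) + n_no * entropy(no)) / (n_yes + n_no)
        gain = parent - child              # entropy decrease for this question
        if gain > best_gain:
            best, best_gain = name, gain
    return best, best_gain
```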
Words • Words can be modeled using composite HMMs. • A null transition is used to go from one subword unit to the next.
[Figure: composite word HMM built by chaining subword-unit models (/sil/, /t/, /uw/, /sil/) with null transitions.]
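A sketch of the composition step, assuming each unit is given as an N x (N+1) transition matrix whose last column holds each state's exit probability (an assumed layout for illustration; Sphinx builds such composites from its dictionary and mdef files).

```python
import numpy as np

def concatenate_hmms(unit_matrices):
    """Chain subword-unit HMMs into a composite word HMM.

    Each element of `unit_matrices` is an N x (N+1) matrix: columns 0..N-1 are
    within-unit transition probabilities, column N is each state's exit
    probability.  A null (non-emitting) transition carries the exit probability
    into the first state of the next unit.
    """
    sizes = [A.shape[0] for A in unit_matrices]
    total = sum(sizes)
    A_word = np.zeros((total, total + 1))          # last column = word exit
    offset = 0
    for k, A in enumerate(unit_matrices):
        n = A.shape[0]
        A_word[offset:offset + n, offset:offset + n] = A[:, :n]
        target = offset + n if k + 1 < len(unit_matrices) else total
        # null transition: unit exit -> entry of the next unit (or the word exit)
        A_word[offset:offset + n, target] += A[:, n]
        offset += n
    return A_word
```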
Continuous Speech Training
[Block diagram: database (text + speech) feeding the training system, with a scoring stage.]
For each utterance to be trained, the subword units are concatenated to form the word models. • Sphinx: Dictionary • $base_dir/training_input/dict.txt • The parameters can then be estimated using the forward-backward reestimation formulas already described.
The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.
Language Models for Large Vocabulary Speech Recognition • Goal: • Provide an estimate of the probability of a word sequence (w1 w2 w3 ... wQ) for the given recognition task. • This can be decomposed as follows:
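In symbols, this is the standard chain-rule decomposition (Q is the number of words):

$$P(w_1 w_2 \cdots w_Q) = \prod_{j=1}^{Q} P(w_j \mid w_1, w_2, \ldots, w_{j-1})$$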
Since it is impossible to reliably estimate these full conditional probabilities, in practice an N-gram language model is used. • Reliable estimates are obtained for N=1 (unigram), N=2 (bigram) or possibly N=3 (trigram).
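The N-gram approximation truncates each word's history to its N−1 predecessors:

$$P(w_j \mid w_1, \ldots, w_{j-1}) \approx P(w_j \mid w_{j-N+1}, \ldots, w_{j-1})$$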
Examples: • Unigram: • P(Maria loves Pedro)=P(Maria)P(loves)P(Pedro) • Bigram: • P(Maria|<sil>)P(loves|Maria)P(Pedro|loves)P(</sil>|Pedro)
CMU-Cambridge Language Modeling Tools • $base_dir/c_scripts/languageModelling
Trigram probabilities are estimated from counts:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}$$

where C(w_{i-2} w_{i-1}) is the total number of times the sequence w_{i-2} w_{i-1} was observed, and C(w_{i-2} w_{i-1} w_i) is the total number of times the sequence w_{i-2} w_{i-1} w_i was observed.
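A minimal sketch of this maximum-likelihood estimate. The sentence-boundary markers `<s>` / `</s>` and the function name are illustrative choices, not part of the original material.

```python
from collections import Counter

def trigram_mle(corpus):
    """Maximum-likelihood trigram estimates from a list of tokenised sentences."""
    tri, bi = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            bi[(words[i - 2], words[i - 1])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1

    def p(w, w2, w1):
        # P(w | w2, w1) = C(w2 w1 w) / C(w2 w1)
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0

    return p

# e.g. p = trigram_mle([["Maria", "loves", "Pedro"]]); p("Pedro", "Maria", "loves")
```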
Unigram P(w_i)
[Figure: recognition network of word HMMs connected in parallel; each word model is entered with its unigram probability P(w_i).]
Bigram P(w_i | w_{i-1})
[Figure: bigram recognition network. Each word's acoustic model is bracketed by null states; the exit null states expand and merge into the entry null states of the following word models with bigram probabilities.]
Viterbi Algorithm
[Figure: the same recognition network, decoded with the Viterbi algorithm.]
Each state j holds a token t with: • t.prob_j — the accumulated log probability, and • t.start_j — the frame at which the token in state j entered the model. A word-boundary record stores w.model and w.start.
1. Initialisation.
2. For every state of every model, propagate tokens using t.prob_j + log(a_ij).
3. Update every internal state of every model with max(t.prob_j + log(a_ij)).
4. Update the exit state N of every model.
5. Find the model with the highest log probability.
6. Update the word-boundary (limits) table.
7. Copy the winning token into state 1 of each model.
[Figure: trellis of token log probabilities and start frames across the word models.]
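A simplified sketch of these seven steps over parallel word HMMs. The data layout (`logA`, `logB`, `loglm`, state 0 as entry and state N−1 as exit) is assumed for illustration; it mirrors the slide's steps, not the Sphinx decoder.

```python
from dataclasses import dataclass

@dataclass
class Token:
    logp: float   # accumulated log probability (t.prob)
    start: int    # frame at which the token entered the model (t.start)

NEG_INF = float("-inf")

def token_passing(models, observations, loglm):
    """models: word -> (logA, logB); logA[i][j] = log transition probability,
    logB(j, o) = log emission probability; loglm(w) = log unigram score."""
    # 1. Initialisation: a live token in the entry state of every model
    tokens = {w: [Token(0.0, 0)] + [Token(NEG_INF, 0) for _ in range(len(A) - 1)]
              for w, (A, _) in models.items()}
    history = []                                   # word-boundary (limits) table
    for t, o in enumerate(observations):
        best_word, best_tok = None, Token(NEG_INF, 0)
        for w, (A, logB) in models.items():
            old = tokens[w]
            new = [Token(NEG_INF, 0) for _ in range(len(A))]
            # 2-3. propagate: max over predecessors of t.prob + log(a_ij)
            for j in range(1, len(A) - 1):
                i, logp = max(((i, old[i].logp + A[i][j]) for i in range(len(A))),
                              key=lambda x: x[1])
                new[j] = Token(logp + logB(j, o), old[i].start)
            # 4. exit state N collects the best internal token
            i, logp = max(((i, new[i].logp + A[i][-1]) for i in range(1, len(A) - 1)),
                          key=lambda x: x[1])
            new[-1] = Token(logp, new[i].start)
            tokens[w] = new
            # 5. best exiting token over all models, weighted by the language model
            if new[-1].logp + loglm(w) > best_tok.logp:
                best_word, best_tok = w, Token(new[-1].logp + loglm(w), new[-1].start)
        # 6. record the word boundary, 7. re-enter every model with the winner
        history.append((best_word, best_tok.start))
        for w in models:
            tokens[w][0] = Token(best_tok.logp, t + 1)
    return history
```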
For N Words
[Figure: the recognition network extended to N word models in parallel.]
Recovering the uttered words. • Array w has the same length as the number of observations. • This array holds the word-boundary (limits) information. • At the end of the utterance, w.model[M] stores the last HMM in the recognised sequence, and the predecessor models are obtained by tracking back through the array.
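A small sketch of that backtracking step, assuming the boundary array has the form produced by the hypothetical `token_passing` sketch above (each entry is the best word ending at that frame together with its start frame):

```python
def backtrack(w):
    """Recover the recognised word sequence from the boundary array `w`,
    where w[t] = (model, start_frame) is the best word ending at frame t."""
    words, t = [], len(w) - 1
    while t >= 0:
        model, start = w[t]
        words.append(model)
        t = start - 1          # jump to the frame just before this word began
    return list(reversed(words))
```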
Finite State Syntax (FSS) • With token passing, a finite state syntax is straightforward to implement.
[Figure: FSS network for phrases of the form “from {Monterrey, Frankfort, Houston} to {Monterrey, Frankfort, Houston}”.]
Large Vocabulary Continuous Speech Recognition • Problems with isolated word recognition: • Word models do not easily account for variations in word pronunciation across different dialects, etc. • Solution: • Use subword speech units. • Problems with large vocabulary continuous speech recognition: • The boundaries of the subword units are not known. • Variability increases: • coarticulation of subword units • speaking rate
The MAP (Maximum A Posteriori) probability of the word string W given the observations: • Using Bayes' rule:
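Written out, this is the standard MAP decoding criterion with Bayes' rule applied:

$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}$$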
since P(O) is independent of W, it can be dropped from the maximisation. • Acoustic model: • Subword units: • Phone-Like Units (PLU): 50 • Syllable-Like Units (SLU): 10,000 • Dyads: 2,000 • Acoustic units (clustering): 256–512 • Language model: • Constraints: • Syntactic: • a “parser”, or • an n-gram (n = 2, 3, 4), or • a word-pair grammar • Semantic constraints of the language.
Smoothing • Since many trigrams are rarely observed, even in large amounts of text, smoothing is used as follows:
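The formula itself is not reproduced in the slide text; one common form, given here as an assumed example, is deleted interpolation of trigram, bigram and unigram relative frequencies with λ₁ + λ₂ + λ₃ = 1:

$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\, f(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\, f(w_i \mid w_{i-1}) + \lambda_1\, f(w_i)$$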
Recognition
[Figure: recognition network built from subword-unit HMMs.]
(The token-passing steps 1–7 are the same as in the word-based decoder above, now applied to subword-unit models.)
For N Subword Units
[Figure: the recognition network extended to N subword-unit models in parallel.]
(Recovering the uttered words by backtracking through array w, and the finite state syntax implementation, are the same as described above for word models.)
Initial Estimates • The reestimation equations give parameter values that converge to a local maximum. • Experience has shown that: • the a_ij parameters converge without problems from almost any initial values; • the state output distributions b_j(o_t) (or their parameters, means and variances) need good initial estimates. • Segmental k-means segmentation into states
Segmental k-means segmentation into states • Initial model: chosen randomly, or any available model. • Segment each training utterance into states using the Viterbi algorithm and backtracking. • Re-estimate from the segmentation: • b_j(k) = (number of vectors with codebook label k in state j) / (number of vectors in state j) • μ_j = mean of the vectors assigned to state j • σ_j = variance of the vectors assigned to state j
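A minimal sketch of one re-estimation pass for the continuous-density case, assuming a hypothetical `segment(obs)` helper that returns the Viterbi state index of every frame under the current model:

```python
import numpy as np

def segmental_kmeans_pass(utterances, segment, num_states):
    """Re-estimate per-state means and variances from a Viterbi segmentation.

    utterances: list of T x D observation arrays.
    segment:    function returning the state index of each frame (via backtracking).
    """
    frames = {j: [] for j in range(num_states)}
    for obs in utterances:
        for vec, state in zip(obs, segment(obs)):
            frames[state].append(vec)
    means = {j: np.mean(v, axis=0) for j, v in frames.items() if v}
    variances = {j: np.var(v, axis=0) for j, v in frames.items() if v}
    return means, variances
```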
State Duration Modelling in HMMs • The implicit state-duration distribution of a standard HMM is exponential (geometric), which is inappropriate for most physical signals. • To improve modelling, state duration information has to be incorporated into the mechanics of the HMM. • Here, a heuristic method is used.
Heuristic for incorporating state duration into HMMs • At training: • The segmental k-means algorithm is used. • The state duration probability p_j(d) is calculated from the segmentation. • At recognition: • The Viterbi algorithm gives • the log probability, and • the best state segmentation via backtracking. • The duration of each state is measured from the state segmentation. • A post-processor then increases the log probability as follows:
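The formula itself is on the slide image; the usual form of this heuristic, assuming a duration weight α and measured duration d_j for state j, is:

$$\log \hat{P}(q, O \mid \lambda) = \log P(q, O \mid \lambda) + \alpha \sum_{j=1}^{N} \log p_j(d_j)$$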
Implementation Advice • If the number of models in the decoding process is very large, try to save memory. • Observe that at a given time t, only a limited amount of information is needed: the trellis columns for the current and the previous frame.
[Figure: trellis columns of state probabilities at times t−1 and t.]
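A short sketch of this memory-saving idea for Viterbi scoring, assuming a log-domain transition matrix `logA` and emission function `logB` (score only; full backtracking would still require per-frame backpointers):

```python
import numpy as np

def viterbi_two_columns(logA, logB, observations):
    """Viterbi scoring that keeps only two trellis columns (previous and
    current frame) instead of the full T x N trellis."""
    N = logA.shape[0]
    prev = np.full(N, -np.inf)
    prev[0] = 0.0                                  # start in state 0
    for o in observations:
        cur = np.full(N, -np.inf)
        for j in range(N):
            cur[j] = np.max(prev + logA[:, j]) + logB(j, o)
        prev = cur                                 # discard the older column
    return np.max(prev)
```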