
Continuous Word Recognition

Presentation Transcript


  1. Continuous Word Recognition [overview diagram: text and speech database, training and scoring]

  2. Continuous Word Recognition • Problems with isolated word recognition: • we do not know the word boundaries • increased variability: coarticulation across “words”, speaking rate (speech velocity)

  3. Issues in the choice of subword unit sets • Context independent: phonemes • Few units, easy to train. • A single HMM per phoneme is too poor to account for the variability introduced by the phone context. • SPHINX: model_architecture/Telefonica.ci.mdef • Context dependent: triphones • A phoneme in a specific left and right context. • Very large number of units to train. • Very large amount of training data needed. • SPHINX: model_architecture/Telefonica.untied.mdef
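
A quick way to see the trade-off between the two unit sets is to count them; the sketch below uses a small invented phone inventory (not the Telefonica set) and is purely illustrative.

```python
# Rough comparison of context-independent vs. context-dependent unit counts.
# The phone inventory here is a small illustrative subset, not a real model set.
phones = ["a", "e", "i", "o", "u", "p", "t", "k", "s", "n", "l", "r"]

# Context-independent units: one model per phoneme.
ci_units = list(phones)

# Context-dependent units: one model per triphone "phone(left, right)".
cd_units = [f"{p}({l},{r})" for p in phones for l in phones for r in phones]

print(len(ci_units))   # 12 models
print(len(cd_units))   # 12**3 = 1728 models, before any state tying
```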

  4. Clustering Acoustic-Phonetic Units • Many phones have a similar effect on their neighbouring phones; hence many triphones have very similar Markov states. • A senone is a cluster of similar Markov states. • Advantages: • more training data per parameter • less memory used

  5. Senonic Decision Tree (SDT) • An SDT classifies the Markov states of the triphones represented in the training corpus by asking linguistic questions composed of conjunctions, disjunctions and/or negations of a set of predetermined simple categorical questions.

  6. Linguistic Questions

  7. Decision tree for classifying the second state of a k-triphone [tree diagram; nodes ask linguistic questions such as: Is the left phone (LP) a sonorant or nasal? Is the LP /s, z, sh, zh/? Is the right phone (RP) a back-R? Is the RP voiced? Is the LP a back-L, or is the LP neither a nasal nor the RP a LAX-vowel? The leaves are Senones 1-6.]

  8. The same tree applied to the word “welcome” [tree diagram repeated for the k-triphone occurring in “welcome”]
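
A minimal sketch, in Python, of how such a tree is used: walk the questions from the root and return the senone at the leaf. The phone classes and questions here are invented stand-ins, not the actual tree shown on the slide.

```python
# Illustrative senone lookup by walking a binary decision tree.
# Phone classes and questions are made up for the example.
SONORANTS_AND_NASALS = {"l", "r", "w", "y", "m", "n"}
SIBILANTS = {"s", "z", "sh", "zh"}

def classify_state(left_phone: str, right_phone: str) -> int:
    """Return the senone id for the second state of a /k/-triphone."""
    if left_phone in SONORANTS_AND_NASALS:
        # yes-branch: look at the right context next
        if right_phone in {"uw", "uh", "ow"}:   # stand-in for a "back" vowel class
            return 1
        return 2
    # no-branch: ask about a sibilant left context
    if left_phone in SIBILANTS:
        return 5
    return 6

print(classify_state("l", "ah"))   # e.g. the /k/ in "welcome" -> senone 2
```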

  9. The tree can be constructed automatically by searching, at each node, for the question that yields the maximum entropy decrease. • Sphinx: • Construction: $base_dir/c_scripts/03.bulidtrees • Results: $base_dir/trees/Telefonica.unpruned/A-0.dtree • When the tree grows too large, it needs to be pruned. • Sphinx: • $base_dir/c_scripts/04.bulidtrees • Results: • $base_dir/trees/Telefonica.500/A-0.dtree • $base_dir/Telefonica_arquitecture/Telefonica.500.mdef

  10. Subword unit Models based on HMMs

  11. Words • Words can be modeled using composite HMMs. • A null transition is used to go from one subword unit to the next. [diagram: the subword HMMs /sil/, /uw/, /t/ joined by null transitions to form a word model]
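
A minimal sketch of the composition, assuming each subword unit is stored as a left-to-right transition matrix whose last state is a non-emitting exit; a probability-1 null transition ties the exit of one unit to the entry of the next. The function name and the toy matrix are illustrative.

```python
import numpy as np

def concat_hmms(transition_matrices):
    """Chain left-to-right subword HMMs into one composite word HMM.

    Each input matrix is (n_i x n_i); the exit of unit i is tied to the
    entry of unit i+1 with a null (non-emitting, probability-1) transition.
    """
    sizes = [a.shape[0] for a in transition_matrices]
    composite = np.zeros((sum(sizes), sum(sizes)))
    offset = 0
    for k, a in enumerate(transition_matrices):
        n = a.shape[0]
        composite[offset:offset + n, offset:offset + n] = a
        if k + 1 < len(transition_matrices):
            # null transition: leave the exit state of this unit and
            # enter the first state of the next unit
            composite[offset + n - 1, offset + n] = 1.0
        offset += n
    return composite

# e.g. three 3-state units chained into one word model
unit = np.array([[0.6, 0.4, 0.0],
                 [0.0, 0.7, 0.3],
                 [0.0, 0.0, 0.0]])   # last row is the exit state
word_hmm = concat_hmms([unit, unit, unit])
print(word_hmm.shape)   # (9, 9)
```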

  12. Continuous Speech Training [overview diagram: text and speech database, training and scoring]

  13. For each utterance in the training set, the subword units are concatenated to form the word models. • Sphinx: dictionary • $base_dir/training_input/dict.txt • The parameters can then be estimated using the forward-backward re-estimation formulas already described.
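
A minimal sketch of the concatenation step: look each word of the transcript up in the pronunciation dictionary and chain the resulting units (the entries below are invented, not taken from dict.txt).

```python
# Expand an utterance transcript into its subword-unit sequence using a
# pronunciation dictionary; the composite training HMM is built by chaining
# the HMMs of these units. Dictionary entries are illustrative.
pronunciations = {
    "hola":  ["o", "l", "a"],
    "adios": ["a", "d", "i", "o", "s"],
}

def utterance_units(transcript):
    units = ["sil"]
    for word in transcript.lower().split():
        units.extend(pronunciations[word])
        units.append("sil")          # optional silence between words
    return units

print(utterance_units("hola adios"))
# ['sil', 'o', 'l', 'a', 'sil', 'a', 'd', 'i', 'o', 's', 'sil']
```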

  14. The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm.

  15. [overview diagram: text and speech database, scoring]

  16. Language Models for Large Vocabulary Speech Recognition • Goal: • provide an estimate of the probability of a “word” sequence (w1 w2 w3 ... wQ) for the given recognition task. • This can be solved as follows:

  17. Since it is impossible to reliably estimate all of these conditional probabilities, in practice an N-gram language model is used. • Reliable estimates are obtained for N=1 (unigram), N=2 (bigram) or possibly N=3 (trigram).
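
The equations referred to on slides 16-17 did not survive extraction; the standard chain-rule decomposition and its N-gram truncation being described are:

```latex
\[
P(w_1 w_2 \dots w_Q) = \prod_{j=1}^{Q} P(w_j \mid w_1, \dots, w_{j-1})
                \approx \prod_{j=1}^{Q} P(w_j \mid w_{j-N+1}, \dots, w_{j-1})
\]
```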

  18. Examples: • Unigram: • P(Maria loves Pedro) = P(Maria) P(loves) P(Pedro) • Bigram: • P(Maria loves Pedro) = P(Maria|<sil>) P(loves|Maria) P(Pedro|loves) P(</sil>|Pedro)

  19. CMU-Cambridge Language Modeling Tools • $base_dir/c_scripts/languageModelling

  20. [overview diagram: text and speech database, scoring]

  21. Trigram estimation: $P(W_i \mid W_{i-2}, W_{i-1}) = \dfrac{C(W_{i-2} W_{i-1} W_i)}{C(W_{i-2} W_{i-1})}$, where $C(W_{i-2} W_{i-1})$ is the total number of times the sequence $W_{i-2} W_{i-1}$ was observed and $C(W_{i-2} W_{i-1} W_i)$ is the total number of times the sequence $W_{i-2} W_{i-1} W_i$ was observed.
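
A minimal sketch of this count-and-divide estimate over a toy corpus (the sentences are invented):

```python
from collections import Counter

# Toy corpus; each sentence is padded with sentence markers.
corpus = [["<s>", "maria", "loves", "pedro", "</s>"],
          ["<s>", "maria", "loves", "juan", "</s>"]]

bigram_counts = Counter()
trigram_counts = Counter()
for sent in corpus:
    for i in range(len(sent) - 1):
        bigram_counts[tuple(sent[i:i+2])] += 1
    for i in range(len(sent) - 2):
        trigram_counts[tuple(sent[i:i+3])] += 1

def p_trigram(w, w2, w1):
    """Maximum-likelihood estimate P(w | w2, w1) = C(w2 w1 w) / C(w2 w1)."""
    denom = bigram_counts[(w2, w1)]
    return trigram_counts[(w2, w1, w)] / denom if denom else 0.0

print(p_trigram("pedro", "maria", "loves"))   # 1/2 = 0.5 in this toy corpus
```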

  22. [overview diagram: text and speech database, scoring]

  23. Unigram P(Wi) [diagram: word HMMs connected in parallel, with transition probabilities on the arcs]

  24. Bigram P(Wi | Wi-1) [diagram: the acoustic models of Word 1 ... Word n, each between entry and exit null states; the null states are merged, and the network is expanded according to the bigram probabilities]

  25. Subword unit Models based on HMMs

  26. Viterbi Algorithm [diagram: the same HMM network with transition probabilities]

  27. Each state holds a token t with fields t.prob_j (accumulated log probability) and t.start_j (the frame at which the token in state j entered the model); a word-limits table stores w.model and w.start. • 1. Initialisation. • 2. Compute t.prob_j + log(a_ij) for every state of every model. • 3. Update every internal state of every model with max(t.prob_j + log(a_ij)). • 4. Update state N of every model. • 5. Find the model with the highest log probability. • 6. Update the limits table. • 7. Copy its token into state 1 of each model. [diagram: example token values propagating through a small network]
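
A minimal sketch of these steps in Python for a set of word HMMs. The transition matrices, the emission scorer emit_logprob, and the exact token/limits layout are assumptions made for illustration, not the code behind the slide.

```python
from dataclasses import dataclass, replace

NEG_INF = float("-inf")

@dataclass
class Token:
    prob: float   # accumulated log probability (t.prob_j)
    start: int    # frame at which the token entered the current model (t.start_j)

def token_passing(models, observations, emit_logprob):
    """Connected-word Viterbi decoding by token passing.

    models:       {name: log transition matrix A (N x N); state N-1 is the exit state}
    emit_logprob: function (name, state, obs) -> emission log probability (assumed given)
    Returns the per-frame limits table [(w.model, w.start), ...] and the final score.
    """
    # 1. Initialisation: every model starts with a live token in state 0 at frame 0.
    tokens = {m: [Token(0.0, 0) if j == 0 else Token(NEG_INF, 0) for j in range(len(A))]
              for m, A in models.items()}
    limits, best_exit = [], NEG_INF
    for t, obs in enumerate(observations):
        best_model, best_exit = None, NEG_INF
        for m, A in models.items():
            old, new = tokens[m], []
            for j in range(len(A)):
                # 2-4. propagate: keep the predecessor maximising t.prob_i + log a_ij
                i = max(range(len(A)), key=lambda k: old[k].prob + A[k][j])
                score = old[i].prob + A[i][j]
                if j < len(A) - 1:                    # emitting (internal) state
                    score += emit_logprob(m, j, obs)
                new.append(replace(old[i], prob=score))
            tokens[m] = new
            if new[-1].prob > best_exit:              # 5. best exit token this frame
                best_model, best_exit = m, new[-1].prob
        limits.append((best_model, tokens[best_model][-1].start))   # 6. limits table
        for m in models:                              # 7. re-enter each model
            tokens[m][0] = Token(best_exit, t + 1)
    return limits, best_exit
```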

  28. For N words [diagram: the same HMM network replicated for N word models]

  29. Recovering the uttered words • The array w has the same length as the number of observations. • This array gives the word boundary information. • At the end of the utterance, w.model[M] stores the last recognised HMM, and the predecessor models are obtained by tracking back through the array.
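
A minimal sketch of the trace-back, assuming a per-frame limits table like the one produced in the sketch after slide 27 (layout illustrative):

```python
def recover_words(limits):
    """limits[t] = (model recognised at frame t, frame at which it started).

    Walk backwards from the final frame: each entry names the last model and
    the frame where it began; jump to the frame just before that start and
    repeat until the beginning of the utterance is reached.
    """
    words = []
    t = len(limits) - 1
    while t >= 0:
        model, start = limits[t]
        words.append(model)
        t = start - 1
    return list(reversed(words))

# e.g. a limits table for a 6-frame utterance covering two words
print(recover_words([("uno", 0), ("uno", 0), ("uno", 0),
                     ("dos", 3), ("dos", 3), ("dos", 3)]))
# ['uno', 'dos']
```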

  30. Finite State Syntax (FSS) • With token passing, an FSS is straightforward to implement. [diagram: finite-state network over the words from, to, Houston, Monterrey, Frankfort]

  31. Large Vocabulary Continuous Speech Recognition • Problems with isolated word recognition: • it does not easily account for variations in word pronunciation across different dialects, etc. • Solution: • use subword speech units. • Problems with large vocabulary continuous speech recognition: • we do not know the boundaries of the subword units • increased variability: coarticulation of subword units, speaking rate (speech velocity)

  32. The MAP (Maximum A Posteriori) probability of the word string W given the observations: • Using Bayes' rule:
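
The equation on this slide was lost in extraction; the standard form being described is:

```latex
\[
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
\]
```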

  33. Since P(O) is independent of W, $\hat{W} = \arg\max_W P(O \mid W)\,P(W)$. • Acoustic model: • subword units: • phone-like units (PLU): 50 • syllable-like units (SLU): 10,000 • dyads: 2,000 • acoustic units (clustering): 256-512 • Language model: • constraints: • syntactic: a “parser”, an n-gram (n = 2, 3, 4), or a word-pair grammar • semantic constraints of the language.

  34. Smoothing • Since many trigrams are rarely, if ever, observed, even in large amounts of text, smoothing is used as follows:
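
The formula on slides 34-35 did not survive extraction; the usual linear-interpolation form, whose weights “optimal linear smoothing” tunes on held-out data, is:

```latex
\[
\hat{P}(w_i \mid w_{i-2}, w_{i-1}) =
    \lambda_3\, P_{ML}(w_i \mid w_{i-2}, w_{i-1})
  + \lambda_2\, P_{ML}(w_i \mid w_{i-1})
  + \lambda_1\, P_{ML}(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,\ \lambda_k \ge 0
\]
```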

  35. Optimal Linear Smoothing

  36. Recognition [diagram: the HMM network with transition probabilities]

  37. The same token-passing procedure as on slide 27, now applied to subword-unit models: each state holds a token t (t.prob_j, t.start_j); tokens are propagated with max(t.prob_j + log(a_ij)); state N of every model is updated; the model with the highest log probability is found; the limits table is updated; and the winning token is copied into state 1 of each model. [diagram: example token values propagating through the network]

  38. For N subunits [diagram: the HMM network replicated for N subword-unit models]

  39. Recovering the uttered words • As on slide 29: the array w has the same length as the number of observations and gives the word boundary information; at the end of the utterance, w.model[M] stores the last recognised HMM, and the predecessor models are obtained by tracking back through the array.

  40. Finite State Syntax (FSS) • As on slide 30: with token passing, an FSS is straightforward to implement. [diagram: finite-state network over the words from, to, Houston, Monterrey, Frankfort]

  41. Initial Estimates • The re-estimation equations give parameter values which converge to a local maximum. • Experience has shown that: • the a_ij parameters converge well regardless of the initial estimate, • the state output distributions b_j(o_t) (or their parameters, mean and variance) need good initial estimates. • Segmental k-means segmentation into states.

  42. Segmental k-means segmentation into states • The initial model is chosen randomly or taken from any available model. • The training vectors are segmented into states using the Viterbi algorithm and backtracking. • The parameters are then re-estimated from the vectors assigned to each state: $b_j(k) = \dfrac{\text{number of vectors with codebook index } k \text{ in state } j}{\text{number of vectors in state } j}$, $\mu_j$ = mean of the vectors in state $j$, $\sigma_j$ = variance (covariance) of the vectors in state $j$.
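
A minimal sketch of the per-state re-estimation, assuming the Viterbi state alignment has already been obtained by backtracking (toy data, continuous-density case with diagonal variances):

```python
import numpy as np

def segmental_kmeans_update(features, state_alignment, n_states):
    """Re-estimate per-state Gaussians from a Viterbi state alignment.

    features:        (T, D) array of observation vectors
    state_alignment: length-T array of state indices from Viterbi backtracking
    Returns the new means and (diagonal) variances for each state.
    """
    means, variances = [], []
    for j in range(n_states):
        vectors = features[state_alignment == j]
        means.append(vectors.mean(axis=0))
        variances.append(vectors.var(axis=0))
    return np.array(means), np.array(variances)

# toy example: 6 frames of 2-dimensional features aligned to 2 states
feats = np.array([[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
                  [1.0, 1.1], [0.9, 1.2], [1.1, 0.9]])
align = np.array([0, 0, 0, 1, 1, 1])
mu, var = segmental_kmeans_update(feats, align, n_states=2)
print(mu)    # per-state means
```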

  43. State Duration Modelling in HMMs • For most physical signals, the implicit exponential state-duration distribution is inappropriate. • To improve modelling, state duration information has to be incorporated into the HMM: • incorporate it directly into the mechanics of the HMM, or • use a heuristic method.

  44. Heuristic for incorporating state duration into an HMM • During training: • the segmental k-means algorithm is used, • and the state duration probabilities p_j(d) are computed. • During recognition: • the Viterbi algorithm gives the log probability and the best state segmentation via backtracking, • the duration of each state is measured from that segmentation, • and a post-processor adjusts the log probability as follows:
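
The post-processor formula was lost in extraction; the usual form, with a duration weight here written as $\alpha$ (not named on the slide), is:

```latex
\[
\log \hat{P}(q, O \mid \lambda) = \log P(q, O \mid \lambda)
  + \alpha \sum_{j=1}^{N} \log p_j(d_j)
\]
```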

  45. Implementation advice • If the number of models in the decoding process is very large, try to save memory: observe that at a given time t only a limited amount of information is needed. [diagram: trellis of scores with the column for the current frame highlighted]

  46. [trellis diagram: scores for successive frames]

  47. [trellis diagram: scores for successive frames, continued]
