The Use of Context in Large Vocabulary Speech Recognition • Julian James Odell, March 1995 • Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy • Presenter: Hsu-Ting Wei
Introduction • The use of context dependent models introduces two major problems: • 1. Sparse and uneven training data • 2. The need for an efficient decoding strategy which incorporates context dependencies both within words and across word boundaries
Introduction (cont.) • About problem 1 (ch3) • Construct robust and accurate recognizers using decision tree based clustering techniques • Linguistic knowledge is used • The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries • About problem 2 (ch4~) • The thesis presents a new decoder design which is capable of using these models efficiently • The decoder can generate a lattice of word hypotheses with little computational overhead.
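The decision tree based clustering mentioned for problem 1 can be sketched in miniature. Assuming 1-D Gaussian sufficient statistics per context-dependent state (the stats, question names, and single-Gaussian assumption are illustrative, not the thesis implementation), each split picks the phonetic question that maximizes the log-likelihood gain:

```python
import math

# Each state is (context_phone, (count, sum_x, sum_x_squared)) -- hypothetical
# 1-D sufficient statistics standing in for real acoustic state occupancies.

def cluster_loglik(stats):
    """Log-likelihood of pooling the given states into one ML Gaussian."""
    n = sum(s[0] for s in stats)
    mean = sum(s[1] for s in stats) / n
    var = max(sum(s[2] for s in stats) / n - mean ** 2, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_question(states, questions):
    """Return the (name, gain) of the question whose yes/no split of the
    cluster gives the largest log-likelihood improvement."""
    base = cluster_loglik([st for _, st in states])
    best = None
    for name, members in questions.items():
        yes = [st for ph, st in states if ph in members]
        no = [st for ph, st in states if ph not in members]
        if not yes or not no:          # question does not split the cluster
            continue
        gain = cluster_loglik(yes) + cluster_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best
```

In a real system the split recurses until the best gain falls below a threshold, and unseen triphones are synthesized by walking the tree with their context.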
Ch3. Context dependency in speech • 3.1 Contextual Variation • In order to maximize the accuracy of HMM based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses. • Signal parameterisation • Model structure • Ensure that the between-class variance is higher than the within-class variance
Ch3. Context dependency in speech (cont.) • Most of the variability inherent in speech is due to contextual effects: • Session effects • Speaker effects • Major source of variation • Environmental effects • Controlled by minimizing the background noise and ensuring that the same microphone is used • Local effects • Utterance • Co-articulation, stress, emphasis • By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased.
Ch3. Context dependency in speech (cont.) • Session effects • A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system. • Speaker effects • Gender and age • Dialect • Style • To make an SI system approach the performance of an SD system, we can: • Operate recognizers in parallel • Adapt the recognizer to match the new speaker
Ch3. Context dependency in speech (cont.) • Session effects (cont.) • Operating recognizers in parallel • Disadvantage: • The computational load rises roughly linearly with the number of systems • Advantage: • One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech
Ch3. Context dependency in speech (cont.) • Session effects (cont.) • Adapting the recognizer to match the new speaker • Problem: there may be insufficient data to update the models • It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system (e.g. with MAP or MLLR) to better match the speaker.
Ch3. Context dependency in speech (cont.) • Local effects • Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts. • Ex: "We were away with William in Sea World." → w iy w er… s iy w er
Ch3. Context dependency in speech (cont.) • Local effects • Context Dependent Phonetic Models • In LIMSI • 45 monophone contexts (Festival CMU: 41) • STEAK = sil s t ey k sil • 2071 biphone contexts (Festival CMU: 1364) • STEAK = sil sil-s s-t t-ey ey-k sil • 95221 triphone contexts • STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil • Word Boundaries • Word Internal Context Dependency (Intra-word) • STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil • Cross Word Context Dependency (Inter-word) => can increase accuracy • STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
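The triphone expansions above can be generated mechanically. A minimal sketch (sil handling is simplified to a single surrounding silence, as in the slide examples) producing HTK-style `l-p+r` labels for both the word-internal and the cross-word case:

```python
def expand(words, cross_word):
    """words: list of per-word phone lists, e.g. [["s","t","ey","k"]].
    Returns HTK-style context-dependent phone labels."""
    if cross_word:
        # Contexts flow across word boundaries: flatten, pad with sil.
        seq = ["sil"] + [p for w in words for p in w] + ["sil"]
        mid = [f"{seq[i-1]}-{seq[i]}+{seq[i+1]}" for i in range(1, len(seq) - 1)]
        return ["sil"] + mid + ["sil"]
    out = ["sil"]
    for w in words:
        for i, p in enumerate(w):
            # Word-internal: contexts truncate at word boundaries.
            left = w[i - 1] + "-" if i > 0 else ""
            right = "+" + w[i + 1] if i < len(w) - 1 else ""
            out.append(left + p + right)
    out.append("sil")
    return out
```

For example, `expand([["s","t","ey","k"]], True)` reproduces the triphone STEAK line, while passing `cross_word=False` with the three-word phone lists reproduces the intra-word STEAK AND CHIPS line.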
English dictionary • Festlex CMU - Lexicon (American English) for the Festival Speech Synthesis System (2003-2006) • 40 distinct phones • Example entries: ("hello" nil (((hh ax l) 0) ((ow) 1))) ("world" nil (((w er l d) 1)))
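Each Festlex entry above is an s-expression of word, part-of-speech, and syllables tagged with stress. A naive parser for exactly the format shown (a sketch for illustration, not Festival's own reader, and it ignores the part-of-speech field):

```python
import re

def parse_entry(entry):
    """Parse one Festlex-style entry string into
    (word, [(syllable_phones, stress), ...])."""
    word = re.search(r'\("([^"]+)"', entry).group(1)
    # Each syllable looks like ((hh ax l) 0): phones, then a stress digit.
    sylls = re.findall(r'\(\(([a-z ]+)\) (\d)\)', entry)
    return word, [(phones.split(), int(stress)) for phones, stress in sylls]
```

Concatenating the syllable phone lists gives the flat pronunciation used when building monophone, biphone, or triphone transcriptions.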
English dictionary (cont.) • The LIMSI dictionary phone set (1993) • 45 phones
Linguistic knowledge (cont.) • General questions (e.g. nasals, fricatives, liquids)
Linguistic knowledge (cont.) • Vowel questions
Linguistic knowledge (cont.) • Consonant questions (e.g. fortis: consonants articulated with greater force; lenis: consonants articulated with less effort; apicals; stridents; syllabics; fricatives; affricates)
Linguistic knowledge (cont.) • Questions used in HTK for state tying
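In HTK these questions are written in an HHEd edit script; a typical fragment looks roughly like the following (question names, context sets, and the threshold are illustrative):

```
QS "L_Nasal"  { m-*,n-*,ng-* }
QS "R_Liquid" { *+l,*+r,*+w }
TB 350.0 "ST_s_2_" {("s","*-s+*","s+*","*-s").state[2]}
```

`QS` defines a yes/no question over left (`x-*`) or right (`*+x`) contexts, and `TB` grows the decision tree for the listed states, stopping when the log-likelihood gain of the best split falls below the threshold.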
Ch4. Decoding • This chapter describes several decoding techniques suitable for the recognition of continuous speech using HMMs. • It is concerned with the use of cross word context dependent acoustic models and long span language models. • Ideal decoder • 4.2 Time-Synchronous decoding • 4.2.1 Token passing • 4.2.2 Beam pruning • 4.2.3 N-Best decoding • 4.2.4 Limitations • 4.2.5 Back-Off implementation • 4.3 Best First Decoding • 4.3.1 A* Decoding • 4.3.2 The stack decoder for speech recognition • 4.4 A Hybrid approach
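The token passing (4.2.1) and beam pruning (4.2.2) steps listed above can be sketched for a single HMM: each state holds one token (its best log score), tokens propagate every frame, and tokens falling too far below the frame's best are discarded. This is a minimal sketch, not the thesis decoder, which additionally handles word networks and language model scores:

```python
import math

def viterbi_beam(log_trans, log_obs, beam=10.0):
    """log_trans[i][j]: log transition prob i->j (-inf if disallowed);
    log_obs[t][j]: log output prob of frame t in state j.
    Returns the best final log score, starting from state 0."""
    n = len(log_trans)
    tokens = [0.0] + [-math.inf] * (n - 1)
    for frame in log_obs:
        new = [-math.inf] * n
        for i, tok in enumerate(tokens):
            if tok == -math.inf:
                continue                      # state holds no live token
            for j in range(n):
                score = tok + log_trans[i][j] + frame[j]
                if score > new[j]:
                    new[j] = score            # keep only the best token
        # Beam pruning: kill tokens more than `beam` below the frame best.
        best = max(new)
        tokens = [t if t > best - beam else -math.inf for t in new]
    return max(tokens)
```

With a wide beam this is exact Viterbi; narrowing the beam trades a small risk of search error for a large reduction in active tokens.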
Ch4. Decoding (cont.) 4.1 Requirements • Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance • Acoustic model likelihood • Language model likelihood
Ch4. Decoding (cont.) 4.1 Requirements (cont.) • The ideal decoder would have the following characteristics • Efficiency: Ensure that the system does not lag behind the speaker. • Accuracy: Find the most likely grammatical sequence of words for each utterance. • Scalability: The computation required by the decoder should increase less than linearly with the size of the vocabulary. • Versatility: Allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (n-gram language models + cross-word context dependent models)
Conclusion • Implement the HTK right-biphone and triphone tasks