The Use of Context in Large Vocabulary Speech Recognition • Julian James Odell, March 1995 • Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy • Presenter: Hsu-Ting Wei
Introduction • The use of context dependent models introduces two major problems: • 1. Sparse and uneven training data • 2. The need for an efficient decoding strategy which incorporates context dependencies both within words and across word boundaries
Introduction (cont.) • About problem 1 (ch3) • Construct robust and accurate recognizers using decision tree based clustering techniques • Linguistic knowledge is used • The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries • About problem 2 (ch4~) • The thesis presents a new decoder design which is capable of using these models efficiently • The decoder can generate a lattice of word hypotheses with little computational overhead.
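The decision tree based clustering mentioned for problem 1 can be sketched in miniature. Assuming 1-D Gaussian sufficient statistics per context-dependent state (the stats, question names, and single-Gaussian assumption are illustrative, not the thesis implementation), each split picks the phonetic question that maximizes the log-likelihood gain:

```python
import math

# Each state is (context_phone, (count, sum_x, sum_x_squared)) -- hypothetical
# 1-D sufficient statistics standing in for real acoustic state occupancies.

def cluster_loglik(stats):
    """Log-likelihood of pooling the given states into one ML Gaussian."""
    n = sum(s[0] for s in stats)
    mean = sum(s[1] for s in stats) / n
    var = max(sum(s[2] for s in stats) / n - mean ** 2, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_question(states, questions):
    """Return the (name, gain) of the question whose yes/no split of the
    cluster gives the largest log-likelihood improvement."""
    base = cluster_loglik([st for _, st in states])
    best = None
    for name, members in questions.items():
        yes = [st for ph, st in states if ph in members]
        no = [st for ph, st in states if ph not in members]
        if not yes or not no:          # question does not split the cluster
            continue
        gain = cluster_loglik(yes) + cluster_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best
```

In a real system the split recurses until the best gain falls below a threshold, and unseen triphones are synthesized by walking the tree with their context.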
Ch3. Context dependency in speech • 3.1 Contextual Variation • In order to maximize the accuracy of HMM based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses. • Signal parameterisation • Model structure • Ensure that the between-class variance is higher than the within-class variance
Ch3. Context dependency in speech (cont.) • Most of the variability inherent in speech is due to contextual effects: • Session effects • Speaker effects • Major source of variation • Environmental effects • Controlled by minimizing the background noise and ensuring that the same microphone is used • Local effects • Utterance • Co-articulation, stress, emphasis • By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased.
Ch3. Context dependency in speech (cont.) • Session effects • A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system. • Speaker effects • Gender and age • Dialect • Style • To make an SI system approach the performance of an SD system, we can: • Operate recognizers in parallel • Adapt the recognizer to match the new speaker
Ch3. Context dependency in speech (cont.) • Session effects (cont.) • Operating recognizers in parallel • Disadvantage: • The computational load rises roughly linearly with the number of systems • Advantage: • One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech
Ch3. Context dependency in speech (cont.) • Session effects (cont.) • Adapting the recognizer to match the new speaker • Problem: there may be insufficient data to update the models • It is possible to make use of both techniques: initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system (e.g. with MAP or MLLR) to better match the speaker.
Ch3. Context dependency in speech (cont.) • Local effects • Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts. • Ex: "We were away with William in Sea World." → w iy w er… s iy w er
Ch3. Context dependency in speech (cont.) • Local effects • Context Dependent Phonetic Models • In LIMSI • 45 monophone contexts (Festival CMU: 41) • STEAK = sil s t ey k sil • 2071 biphone contexts (Festival CMU: 1364) • STEAK = sil sil-s s-t t-ey ey-k sil • 95221 triphone contexts • STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil • Word Boundaries • Word Internal Context Dependency (Intra-word) • STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil • Cross Word Context Dependency (Inter-word) => can increase accuracy • STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
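The triphone expansions above can be generated mechanically. A minimal sketch (sil handling is simplified to a single surrounding silence, as in the slide examples) producing HTK-style `l-p+r` labels for both the word-internal and the cross-word case:

```python
def expand(words, cross_word):
    """words: list of per-word phone lists, e.g. [["s","t","ey","k"]].
    Returns HTK-style context-dependent phone labels."""
    if cross_word:
        # Contexts flow across word boundaries: flatten, pad with sil.
        seq = ["sil"] + [p for w in words for p in w] + ["sil"]
        mid = [f"{seq[i-1]}-{seq[i]}+{seq[i+1]}" for i in range(1, len(seq) - 1)]
        return ["sil"] + mid + ["sil"]
    out = ["sil"]
    for w in words:
        for i, p in enumerate(w):
            # Word-internal: contexts truncate at word boundaries.
            left = w[i - 1] + "-" if i > 0 else ""
            right = "+" + w[i + 1] if i < len(w) - 1 else ""
            out.append(left + p + right)
    out.append("sil")
    return out
```

For example, `expand([["s","t","ey","k"]], True)` reproduces the triphone STEAK line, while passing `cross_word=False` with the three-word phone lists reproduces the intra-word STEAK AND CHIPS line.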
English dictionary • Festlex CMU - Lexicon (American English) for the Festival Speech Synthesis System (2003-2006) • 40 distinct phones • Example entries: ("hello" nil (((hh ax l) 0) ((ow) 1))) ("world" nil (((w er l d) 1)))
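Each Festlex entry above is an s-expression of word, part-of-speech, and syllables tagged with stress. A naive parser for exactly the format shown (a sketch for illustration, not Festival's own reader, and it ignores the part-of-speech field):

```python
import re

def parse_entry(entry):
    """Parse one Festlex-style entry string into
    (word, [(syllable_phones, stress), ...])."""
    word = re.search(r'\("([^"]+)"', entry).group(1)
    # Each syllable looks like ((hh ax l) 0): phones, then a stress digit.
    sylls = re.findall(r'\(\(([a-z ]+)\) (\d)\)', entry)
    return word, [(phones.split(), int(stress)) for phones, stress in sylls]
```

Concatenating the syllable phone lists gives the flat pronunciation used when building monophone, biphone, or triphone transcriptions.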
English dictionary (cont.) • The LIMSI dictionary phone set (1993) • 45 phones
Linguistic knowledge (cont.) • General questions (e.g. nasals, fricatives, liquids)
Linguistic knowledge (cont.) • Vowel questions
Linguistic knowledge (cont.) • Consonant questions (e.g. fortis: consonants articulated with greater force; lenis: consonants articulated with less effort; apicals; stridents; syllabics; fricatives; affricates)
Linguistic knowledge (cont.) • Questions used in HTK for state tying
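In HTK these questions are written in an HHEd edit script; a typical fragment looks roughly like the following (question names, context sets, and the threshold are illustrative):

```
QS "L_Nasal"  { m-*,n-*,ng-* }
QS "R_Liquid" { *+l,*+r,*+w }
TB 350.0 "ST_s_2_" {("s","*-s+*","s+*","*-s").state[2]}
```

`QS` defines a yes/no question over left (`x-*`) or right (`*+x`) contexts, and `TB` grows the decision tree for the listed states, stopping when the log-likelihood gain of the best split falls below the threshold.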
Ch4. Decoding • This chapter describes several decoding techniques suitable for the recognition of continuous speech using HMMs. • It is concerned with the use of cross word context dependent acoustic models and long span language models. • Ideal decoder • 4.2 Time-Synchronous decoding • 4.2.1 Token passing • 4.2.2 Beam pruning • 4.2.3 N-Best decoding • 4.2.4 Limitations • 4.2.5 Back-Off implementation • 4.3 Best First Decoding • 4.3.1 A* Decoding • 4.3.2 The stack decoder for speech recognition • 4.4 A Hybrid approach
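The token passing (4.2.1) and beam pruning (4.2.2) steps listed above can be sketched for a single HMM: each state holds one token (its best log score), tokens propagate every frame, and tokens falling too far below the frame's best are discarded. This is a minimal sketch, not the thesis decoder, which additionally handles word networks and language model scores:

```python
import math

def viterbi_beam(log_trans, log_obs, beam=10.0):
    """log_trans[i][j]: log transition prob i->j (-inf if disallowed);
    log_obs[t][j]: log output prob of frame t in state j.
    Returns the best final log score, starting from state 0."""
    n = len(log_trans)
    tokens = [0.0] + [-math.inf] * (n - 1)
    for frame in log_obs:
        new = [-math.inf] * n
        for i, tok in enumerate(tokens):
            if tok == -math.inf:
                continue                      # state holds no live token
            for j in range(n):
                score = tok + log_trans[i][j] + frame[j]
                if score > new[j]:
                    new[j] = score            # keep only the best token
        # Beam pruning: kill tokens more than `beam` below the frame best.
        best = max(new)
        tokens = [t if t > best - beam else -math.inf for t in new]
    return max(tokens)
```

With a wide beam this is exact Viterbi; narrowing the beam trades a small risk of search error for a large reduction in active tokens.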
Ch4. Decoding (cont.) 4.1 Requirements • Ideal decoder: it should find the most likely grammatical hypothesis for an unknown utterance • Acoustic model likelihood • Language model likelihood
Ch4. Decoding (cont.) 4.1 Requirements (cont.) • The ideal decoder would have the following characteristics • Efficiency: Ensure that the system does not lag behind the speaker. • Accuracy: Find the most likely grammatical sequence of words for each utterance. • Scalability: The computation required by the decoder should increase less than linearly with the size of the vocabulary. • Versatility: Allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency (n-gram language models + cross-word context dependent models)
Conclusion • Implement the HTK right-biphone and triphone tasks