
Chapter 7 Speech Recognition Framework


redford


  1. Chapter 7 Speech Recognition Framework • 7.1 The main form and application of speech recognition • 7.2 The main factors of speech recognition • 7.3 The active topics of speech recognition • 7.4 The basic framework of speech recognition system

  2. 7.1 The main form and application of speech recognition (1) • Speech Recognition -- takes a speech string (utterance) as input and produces the corresponding word, word string, or transcription • Speech Understanding -- takes a speech string as input and produces the corresponding response or action • Speaker Recognition -- takes a speech string as input and identifies or verifies the speaker • Language Identification -- takes a speech string as input and identifies which language it is spoken in

  3. The main form and application of speech recognition (2) • Speech Recognition • Speech Navigation: computer operation, speech control, intelligent toys, and parcel dispatch • Speech Dictation: dictation machines, speech dialing, and broadcast recording

  4. The main form and application of speech recognition (3) • Speech Understanding • Speech Service: services for people with disabilities, banking, travel, and transportation, wherever a dialog is needed • Speech Communication: bilingual speech communication and multilingual simultaneous interpretation

  5. The main form and application of speech recognition (4) • Speaker Recognition • Speaker Verification: access to secure departments or programs, banking, and other services • Speaker Identification: user recognition, voice matching when searching for criminals

  6. 7.2 The main factors of speech recognition (1) • Speech Style • Isolated Words (IWR) -- there are obvious pauses (silence) between words, for example names of people or places, or commands • Connected Words (CWR) -- for example continuous digit strings (telephone numbers or data) • Continuous Speech (CSR) -- natural spoken language in sentences (utterances) • Ease of recognition: CSR << CWR << IWR, that is, continuous speech is by far the hardest and isolated words the easiest

  7. The main factors of speech recognition (2) • Speaker Dependent (SD) • A speaker-dependent recognition system can only recognize speech from one speaker or a few speakers. The speech model is trained only on those speakers' speech samples (corpus) • Speaker Independent (SI) • A speaker-independent recognition system can recognize speech from any speaker. The speech model is trained on corpora from many speakers, and speaker adaptation during recognition improves performance. SI recognition is much harder than SD.

  8. The main factors of speech recognition (3) • Vocabulary Size • Small Vocabulary -- several hundred words • Medium Vocabulary -- one thousand to several thousand words • Large Vocabulary -- more than 10,000 words

  9. The main factors of speech recognition (4) • Other factors: • Speech Quality -- microphone or telephone speech, recording environment, speaker's cooperation • Task -- word recognition, transcription, word spotting, dialog, and translation are very different tasks • Domain (specific or generic) and Syntax Constraints (fewer or more)

  10. 7.3 The active topics of speech recognition • Broadcasting Recording Systems • Telephone Dialog Systems • Speaker Adaptation • Noise Reduction • Word Spotting • Language Models Based on Classes

  11. 7.4 The basic framework of speech recognition system (1) • Input -- speech string (utterance) captured through a microphone or telephone, $\{x'(n)\}$ • Preprocessing -- windowing, framing, and pre-emphasis, giving the frames $\{x_i(n)\}$ • Feature Extraction -- a feature vector is calculated frame by frame, giving $\{o_i\}$ • Decision Making -- ranges from simple algorithms such as a minimum-distance classifier to complex ones such as HMMs (statistical acoustic and language models).
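The simplest decision rule named above, the minimum-distance classifier, can be sketched as follows. This is only an illustration: the Euclidean distance, the two-dimensional feature vectors, and the word labels are assumptions, not taken from the slides.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(feature, templates):
    """Return the label whose stored template is closest to `feature`."""
    return min(templates, key=lambda label: euclidean(feature, templates[label]))

# Hypothetical reference vectors, one per word in a tiny vocabulary.
templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
result = classify([0.9, 0.2], templates)
print(result)
```

A real system would compare sequences of feature vectors rather than single vectors, which is where the statistical models come in.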

  12. The basic framework of speech recognition system(2) • Input • Anti-aliasing filter with a 300 Hz to 4 kHz passband • Sampling rate: 8 kHz (telephone speech) to 16 kHz (microphone speech) • Sampling precision: 8 bits (telephone speech) to 16 bits (microphone speech) • Determining where sampling starts and ends (silence detection and the memory buffer to use)
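One common way to decide where the speech starts and ends is to threshold the short-time energy. The sketch below assumes that approach; the slide does not say which silence detector is meant, and the frame length, threshold, and test signal are hypothetical values chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def find_endpoints(samples, frame_len, threshold):
    """Return (start, end) sample indices spanning the first to the last
    frame whose energy exceeds `threshold`, or None if every frame is quiet."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len

# Hypothetical signal: silence, a short burst, then silence again.
signal = [0.0] * 8 + [0.5, -0.5, 0.5, -0.5] * 2 + [0.0] * 8
endpoints = find_endpoints(signal, frame_len=4, threshold=0.01)
print(endpoints)
```

In practice the threshold would be set relative to the background noise level rather than fixed.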

  13. The basic framework of speech recognition system (3) • Preprocessing • Window selection and windowing • Framing -- choice of frame length and frame shift (typically 25 ms and 10 ms) • Pre-emphasis: $y(n) = x(n) - \alpha x(n-1)$ • $\alpha$ is close to 1.0 (0.95 or 0.97); for simplicity it can be 15/16 = 0.9375 • The goal is to boost the high frequencies
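The preprocessing steps can be sketched in a few lines. The pre-emphasis formula and the 25 ms / 10 ms framing come from the slide; the Hamming window is an assumption (the slide only says "window selection"), as are the 8 kHz rate taken from the telephone case and the 440 Hz test tone.

```python
import math

def pre_emphasize(x, alpha=0.97):
    """y(n) = x(n) - alpha * x(n-1); the first sample is passed through."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def split_frames(x, frame_len, frame_shift):
    """Cut x into overlapping frames; a trailing partial frame is dropped."""
    return [x[s:s + frame_len]
            for s in range(0, len(x) - frame_len + 1, frame_shift)]

def hamming(frame):
    """Multiply a frame by a Hamming window (assumed window type)."""
    last = len(frame) - 1
    return [v * (0.54 - 0.46 * math.cos(2 * math.pi * n / last))
            for n, v in enumerate(frame)]

sample_rate = 8000                       # telephone speech
frame_len = sample_rate * 25 // 1000     # 25 ms -> 200 samples
frame_shift = sample_rate * 10 // 1000   # 10 ms -> 80 samples

# Hypothetical input: 100 ms of a 440 Hz tone.
signal = [math.sin(2 * math.pi * 440 * n / sample_rate)
          for n in range(sample_rate // 10)]
frames = [hamming(f)
          for f in split_frames(pre_emphasize(signal), frame_len, frame_shift)]
print(len(frames), len(frames[0]))
```

Each windowed frame is then handed to feature extraction independently.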

  14. The basic framework of speech recognition system(4) • Feature Extraction • There are several ways to obtain the feature vector; only one is given here: MFCC (mel-frequency cepstral coefficients) • The steps to get the MFCC for one frame:
      • 1. Zero-pad the frame and take the FFT to get $X[k]$:
           $$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi n k / N}, \quad k = 0, \ldots, N-1$$
      • 2. Apply $M$ filters. The log-energy $S[m]$ of filter $m$ is computed by weighting the power spectrum $S[k] = |X[k]|^2$ with a filter $H_m[k]$ and summing over $k$:

  15. The basic framework of speech recognition system (5)
      $$S[m] = \log\left[\sum_{k=0}^{N-1} |X[k]|^2\, H_m[k]\right], \quad m = 0, \ldots, M-1$$
      where $H_m[k] \ge 0$ and $\sum_{k=0}^{N-1} H_m[k] = 1$.
      • Typically the $H_m[k]$ are chosen as triangular filters:
      $$H_m[k] = \begin{cases}
      0, & k < f[m-1] \\
      \dfrac{2\,(k - f[m-1])}{(f[m+1] - f[m-1])\,(f[m] - f[m-1])}, & f[m-1] \le k < f[m] \\
      \dfrac{2\,(f[m+1] - k)}{(f[m+1] - f[m-1])\,(f[m+1] - f[m])}, & f[m] \le k \le f[m+1] \\
      0, & k > f[m+1]
      \end{cases}$$
      • With this scaling, $\sum_{k=0}^{N-1} H_m[k] = 1$ for all $m$. The boundary points $f[m]$ are uniformly spaced on the mel scale.
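Steps 1 and 2 can be sketched as below. Several things here are assumptions rather than part of the slides: the mel mapping 1125 ln(1 + f/700), rounding the boundary points down to FFT bins, the small floor inside the logarithm, and the toy sizes (64-point transform, 4 filters). The filters are indexed from 0 in the code, so filter i uses the boundary points f[i], f[i+1], f[i+2].

```python
import cmath
import math

def power_spectrum(frame, n_fft):
    """Return |X[k]|^2 for k = 0..n_fft-1 after zero-padding to n_fft."""
    if len(frame) > n_fft:
        raise ValueError("frame is longer than n_fft")
    x = list(frame) + [0.0] * (n_fft - len(frame))
    power = []
    for k in range(n_fft):
        # Direct O(N^2) DFT for clarity; an FFT gives the same X[k] faster.
        x_k = sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_fft)
                  for n in range(n_fft))
        power.append(abs(x_k) ** 2)
    return power

def hz_to_mel(hz):
    # Assumed mel mapping; the slides do not give one.
    return 1125.0 * math.log(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

def boundary_bins(n_filters, n_fft, sample_rate):
    """Boundary points f[0..n_filters+1] as FFT bins, uniform in mel."""
    step = hz_to_mel(sample_rate / 2.0) / (n_filters + 1)
    bins = [int(n_fft * mel_to_hz(i * step) / sample_rate)
            for i in range(n_filters + 2)]
    if any(a >= b for a, b in zip(bins, bins[1:])):
        raise ValueError("boundary bins collide; use a larger n_fft or fewer filters")
    return bins

def triangular_filter(f_prev, f_mid, f_next, n_bins):
    """H_m[k] as defined above, for k = 0..n_bins-1."""
    width = f_next - f_prev
    h = [0.0] * n_bins
    for k in range(f_prev, f_mid):            # rising edge
        h[k] = 2.0 * (k - f_prev) / (width * (f_mid - f_prev))
    for k in range(f_mid, f_next + 1):        # falling edge, peak at f_mid
        h[k] = 2.0 * (f_next - k) / (width * (f_next - f_mid))
    return h

def log_filterbank_energies(power, bins, floor=1e-12):
    """S[m] = log(sum_k |X[k]|^2 H_m[k]); `floor` avoids log(0) (assumption)."""
    energies = []
    for i in range(len(bins) - 2):
        h = triangular_filter(bins[i], bins[i + 1], bins[i + 2], len(power))
        energies.append(math.log(max(sum(p * w for p, w in zip(power, h)), floor)))
    return energies

# A unit impulse has a flat power spectrum, so every log-energy is about 0.
frame = [1.0] + [0.0] * 39
power = power_spectrum(frame, n_fft=64)
bins = boundary_bins(n_filters=4, n_fft=64, sample_rate=8000)
log_energies = log_filterbank_energies(power, bins)
print(log_energies)
```

The impulse input is a convenient check: because each filter sums to 1, a flat spectrum of ones must give a log-energy of zero in every filter.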

  16. The basic framework of speech recognition system (6) • The mel-frequency cepstrum is the DCT of the $M$ filter outputs:
      $$c[n] = \sum_{m=0}^{M-1} S[m] \cos\left(\frac{\pi n\,(m + 1/2)}{M}\right), \quad n = 0, \ldots, M-1$$
      • $M$ is typically 24 to 40, but only about the first 12 coefficients $c[n]$ are kept. • Besides these 12 coefficients, their first- and second-order differences are often used as feature vector components too. The total number of components is about 36 to 39.
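The DCT step and the difference features can be sketched as follows. The slide does not say which difference formula is used, so a plain frame-to-frame backward difference is assumed here, with zeros for the first frame; the log-energy values are made up for the example.

```python
import math

def dct_cepstrum(log_energies, n_ceps=12):
    """c[n] = sum_m S[m] cos(pi n (m + 1/2) / M), kept for n = 0..n_ceps-1."""
    M = len(log_energies)
    return [
        sum(s * math.cos(math.pi * n * (m + 0.5) / M)
            for m, s in enumerate(log_energies))
        for n in range(n_ceps)
    ]

def differences(seq):
    """Backward difference between consecutive frames (assumed formula)."""
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def add_differences(cepstra):
    """Append first- and second-order differences to each frame's cepstrum."""
    d1 = differences(cepstra)
    d2 = differences(d1)
    return [c + a + b for c, a, b in zip(cepstra, d1, d2)]

# Hypothetical log filter-bank outputs for three frames, M = 24 filters.
frames_log_energies = [[1.0] * 24, [2.0] * 24, [4.0] * 24]
cepstra = [dct_cepstrum(s) for s in frames_log_energies]
features = add_differences(cepstra)
print(len(features[0]))
```

With 12 cepstral coefficients per frame this gives 36 components, the lower end of the range quoted on the slide.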

  17. The basic framework of speech recognition system (7) • Decision Making • This is the last step, which determines what was said in the input speech string. For isolated-word systems, template matching is still used. For connected or continuous speech, statistical models (HMMs and others) are used. They are discussed in detail later.
