
Fundamentals of Speech Signal Processing


fredricke

Presentation Transcript


  1. Fundamentals of Speech Signal Processing

  2. 1.0 Speech Signals

  3. Waveform plots of typical vowel sounds: Voiced (濁音) • panels: tone 1, tone 2, tone 4 • horizontal axis: t

  4. Speech Production and Source Model • Human vocal mechanism • Speech Source Model: excitation u(t) → Vocal tract → speech x(t)

  5. Voiced and Unvoiced Speech • Figure: u(t) and x(t) for voiced speech (pitch period marked) and for unvoiced speech

  6. Waveform plots of typical consonant sounds: Unvoiced (清音) and Voiced (濁音)

  7. Waveform plot of a sentence

  8. Frequency domain spectra of speech signals: Voiced and Unvoiced

  9. Frequency Domain: Voiced • formant frequencies

  10. Frequency Domain: Unvoiced • formant frequencies

  11. Spectrogram

  12. Spectrogram

  13. Formant Frequencies

  14. Formant frequency contours • example utterance: "He will allow a rare lie." • Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang

  15. 2.0 Speech Signal Processing

  16. Speech Signal Processing • Speech Signals • Carrying Linguistic Knowledge and Human Information: Characters, Words, Phrases, Sentences, Concepts, etc. • Double Levels of Information: Acoustic Signal Level / Symbolic or Linguistic Level • Processing and Interaction of the Double-level Information • Major Application Areas • Speech Coding: Digitization and Compression • Considerations: 1) bit rate (bps), 2) recovered quality, 3) computation complexity/feasibility • Voice-based Network Access: User Interface, Content Analysis, User-content Interaction • Figure labels (processing): x(t), LPF, x[n], Processing Algorithms, output • Figure labels (coding): x[n], Processing, x_k, bit stream 110101…, Storage/transmission, Inverse Processing, $\hat{x}[n]$

  17. Sampling of Signals • Figure: continuous-time signal x(t) versus t and its samples x[n] versus n

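  18. Double Levels of Information • Character (字) • Word (詞): 電腦 ("computer") • Sentence (句): 人人用電腦 ("everyone uses computers")

  19. Speech Signal Processing – Processing of Double-Level Information • Example: 今天的天氣非常好 ("The weather today is very good"), shown as characters 今 天 的 天 氣 非 常 好, as phrases 今天的 / 天氣 / 非常 / 好, and as words 今天 / 的 • Speech Signal • Sampling • Processing • Algorithm • Chips or Computers • Linguistic Structure • Linguistic Knowledge • Lexicon • Grammar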

  20. Voice-based Network Access (Internet): User Interface, Content Analysis, User-Content Interaction • User Interface: when keyboards/mice are inadequate • Content Analysis: helps in browsing/retrieval of multimedia content • User-Content Interaction: all text-based interaction can be accomplished by spoken language

  21. User Interface: Wireless Communications Technologies are Creating a Whole Variety of User Terminals • Figure labels: Text Content, Multimedia Content, Internet, Networks • at Any Time, from Anywhere • Smart phones, Hand-held Devices, Notebooks, Vehicular Electronics, Hands-free Interfaces, Home Appliances, Wearable Devices… • Small in Size, Light in Weight, Ubiquitous, Invisible… • Post-PC Era • Keyboard/Mouse, Most Convenient for PCs, are No Longer Convenient: human fingers never shrink, and the application environment has changed • Service Requirements Growing Exponentially • Voice is the Only Interface Convenient for ALL User Terminals at Any Time, from Anywhere, and to the point in one utterance • Speech Processing is the Only Less Mature Part in the Technology Chain

  22. Content Analysis: Multimedia Technologies are Creating a New World of Multimedia Content • Future Integrated Networks • Real-time Information: weather, traffic, flight schedule, stock price, sports scores • Private Services: personal notebook, business databases, home appliances, network entertainment • Intelligent Working Environment: e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce • Knowledge Archives: digital libraries, virtual museums • Special Services: Google, Facebook, YouTube, Amazon • The Most Attractive Form of Network Content will be Multimedia, which usually Includes Speech Information (but Probably not Text) • Multimedia Content is Difficult to Summarize and Show on the Screen, thus Difficult to Browse • The Speech Information, if Included, usually Tells the Subjects, Topics and Concepts of the Multimedia Content, thus Becomes the Key for Browsing and Retrieval • Multimedia Content Analysis based on Speech Information

  23. User-Content Interaction: Wireless and Multimedia Technologies are Creating An Era of Network Access by Spoken Language Processing • Figure labels: Internet, Text Content, Multimedia Content, text information, voice information, voice input/output, Text-to-Speech Synthesis, Spoken and multi-modal Dialogue, Voice-based Information Retrieval, Text Information Retrieval, Multimedia Content Analysis • Network Access is Primarily Text-based today, but almost all Roles of Text can be Accomplished by Speech • User-Content Interaction can be Accomplished by Spoken and Multi-modal Dialogues • Hand-held Devices with Multimedia Functionalities are Commonly used Today • Using Speech Instructions to Access Multimedia Content whose Key Concepts are Specified by Speech Information

  24. 3.0 Speech Coding

  25. Waveform-based Approaches • Pulse-Code Modulation (PCM) • binary representation for each sample x[n] by quantization • Differential PCM (DPCM) • encoding the differences • $d[n] = x[n] - x[n-1]$ • $d[n] = x[n] - \sum_{k=1}^{P} a_k x[n-k]$ • Adaptive DPCM (ADPCM) • with adaptive algorithms • Ref: Haykin, “Communication Systems”, 4th Ed., 3.7, 3.13, 3.14, 3.15
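
The first-order DPCM relation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the uniform quantizer, the `step` parameter, and the function names `dpcm_encode` / `dpcm_decode` are not from the slides, and the encoder predicts from the reconstructed previous sample (a common practical variant of d[n] = x[n] - x[n-1]) so that quantization error does not accumulate at the decoder.

```python
def dpcm_encode(x, step):
    """Encode samples as quantized differences (first-order DPCM sketch)."""
    codes = []
    prediction = 0.0  # decoder-side reconstruction of the previous sample
    for sample in x:
        d = sample - prediction   # d[n] = x[n] - (reconstructed) x[n-1]
        q = round(d / step)       # assumed uniform quantizer: integer code
        codes.append(q)
        prediction += q * step    # mirror what the decoder will rebuild
    return codes


def dpcm_decode(codes, step):
    """Rebuild samples by accumulating the dequantized differences."""
    samples = []
    prediction = 0.0
    for q in codes:
        prediction += q * step
        samples.append(prediction)
    return samples


signal = [0, 3, 7, 8, 6, 2]          # hypothetical sample values
codes = dpcm_encode(signal, step=2)
decoded = dpcm_decode(codes, step=2)
```

Because the encoder tracks the decoder's state, each reconstructed sample stays within half a quantization step of the input; the differences are typically smaller than the samples themselves, which is what allows fewer bits per sample.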

  26. Speech Source Model and Source Coding • Speech Source Model: Excitation Generator (parameters) → u[n] → Vocal Tract Model $G(\omega)$, $G(z)$, $g[n]$ (parameters) → x[n] • $x[n] = u[n] * g[n]$ (convolution) • $X(\omega) = U(\omega)G(\omega)$ • $X(z) = U(z)G(z)$ • digitization and transmission of the parameters will be adequate • at the receiver the parameters can reproduce x[n] with the model • far fewer parameters with much slower variation in time lead to far fewer bits required • the key for low bit rate speech coding

  27. Speech Source Model • Figure: waveform x(t) versus t and sequence a[n] versus n

  28. Speech Source Model and Source Coding • Analysis and Synthesis • High computation requirements are the price for low bit rate

  29. Simplified Speech Source Model • Excitation: random sequence generator (unvoiced) or periodic pulse train generator with period N (voiced), selected by v/u and scaled by gain G, giving u[n] • Vocal Tract Model $G(z)$, $G(\omega)$, $g[n]$ with output x[n] • $G(z) = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$ • Vocal Tract parameters • $\{a_k\}$: LPC coefficients • formant structure of speech signals • A good approximation, though not precise enough • Excitation parameters • v/u: voiced/unvoiced • N: pitch for voiced • G: signal gain • → excitation signal u[n]
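
As a rough sketch of how this simplified model produces a signal, the Python below generates an excitation u[n] (pulse train for voiced, Gaussian noise for unvoiced) and passes it through the all-pole filter G(z) = 1 / (1 - Σ a_k z^{-k}). The function names, the Gaussian noise source, and the coefficient values are illustrative assumptions, not taken from the slides.

```python
import random


def excitation(voiced, n_samples, pitch_period, gain, seed=0):
    """u[n]: periodic pulse train (voiced) or random sequence (unvoiced)."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0
                for n in range(n_samples)]
    rng = random.Random(seed)  # assumed noise source: unit-variance Gaussian
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(u, a):
    """x[n] = u[n] + sum_{k=1}^{P} a[k-1] * x[n-k], i.e. X(z) = U(z) G(z)."""
    x = []
    for n, un in enumerate(u):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x


lpc = [0.5]  # hypothetical single LPC coefficient (P = 1)
voiced_frame = all_pole_filter(excitation(True, 8, 4, 1.0), lpc)
unvoiced_frame = all_pole_filter(excitation(False, 8, 4, 1.0), lpc)
```

Only v/u, N, G and the P coefficients are needed to regenerate a frame, which is why transmitting the parameters instead of the waveform lowers the bit rate.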

  30. LPC Vocoder (Voice Coder) • N by pitch detection • v/u by voicing detection • $\{a_k\}$ can be non-uniformly or vector quantized to reduce the bit rate further • Ref: 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993

  31. Multipulse LPC • poor modeling of u[n] is the main source of quality degradation in the LPC vocoder • u[n] replaced by a sequence of pulses • $u[n] = \sum_{k} b_k \delta[n - n_k]$ • roughly 8 pulses per pitch period • u[n] close to periodic for voiced • u[n] close to random for unvoiced • Estimating $(b_k, n_k)$ is a difficult problem

  32. Multipulse LPC • Estimating $(b_k, n_k)$ is a difficult problem • analysis by synthesis • a large amount of computation is the price paid for better speech quality

  33. Multipulse LPC • Perceptual Weighting • $W(z) = \frac{1 - \sum_{k=1}^{P} a_k z^{-k}}{1 - \sum_{k=1}^{P} a_k c^k z^{-k}}$ • $0 < c < 1$ for perceptual sensitivity • $W(z) = 1$ if $c = 1$ • $W(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$ if $c = 0$ • practically $c \approx 0.8$ • Error Evaluation • $E = \int |X(\omega) - \hat{X}(\omega)|^2 W(\omega)\, d\omega$
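
To make the weighting filter concrete, the sketch below builds the numerator and denominator coefficients of W(z) from a set of LPC coefficients and applies the filter in the time domain. The helper names and the coefficient values are assumptions for illustration; the two limiting cases (c = 1 and c = 0) match the ones listed on the slide.

```python
def weighting_filter_coeffs(a, c):
    """Coefficients of W(z) = (1 - sum a_k z^-k) / (1 - sum a_k c^k z^-k)."""
    num = [1.0] + [-ak for ak in a]
    den = [1.0] + [-ak * c ** k for k, ak in enumerate(a, start=1)]
    return num, den


def iir_filter(num, den, x):
    """den[0]*y[n] = sum_i num[i]*x[n-i] - sum_{j>=1} den[j]*y[n-j]."""
    y = []
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        acc -= sum(den[j] * y[n - j] for j in range(1, len(den)) if n - j >= 0)
        y.append(acc / den[0])
    return y


a = [0.9, -0.2]  # hypothetical LPC coefficients
num, den = weighting_filter_coeffs(a, c=0.8)
weighted_error = iir_filter(num, den, [1.0, 0.0, 0.0, 0.0])
```

With c = 1 the numerator and denominator coincide, so the filter passes the error unchanged; with c = 0 the denominator reduces to 1 and only the numerator remains.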

  34. Multipulse LPC • Error Minimization and Pulse Search • $u[n] = \sum_{k} b_k \delta[n - n_k]$ • $\hat{x}[n] = \sum_{k} b_k g[n - n_k]$ • $E = E(b_1, n_1, b_2, n_2, \ldots)$ • sub-optimal solution: finding 1 pulse at a time
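
The one-pulse-at-a-time search can be sketched as a greedy loop: at each step, try every position, compute the best amplitude for a pulse there, keep the position that removes the most error, and subtract its contribution. This is a simplified illustration, not the procedure from the slides: it uses a plain time-domain squared error (no perceptual weighting), a truncated impulse response `g` with nonzero first tap, and hypothetical function names.

```python
def shifted(g, pos, length):
    """Impulse response g[n - pos], truncated to the frame length."""
    out = [0.0] * length
    for i, gi in enumerate(g):
        if pos + i < length:
            out[pos + i] = gi
    return out


def greedy_pulse_search(x, g, n_pulses):
    """Pick pulses (b_k, n_k) one at a time to minimize the squared error."""
    residual = list(x)
    pulses = []
    for _ in range(n_pulses):
        best = None  # (error reduction, position, amplitude)
        for pos in range(len(x)):
            s = shifted(g, pos, len(x))
            energy = sum(v * v for v in s)
            if energy == 0.0:
                continue
            corr = sum(r * v for r, v in zip(residual, s))
            reduction = corr * corr / energy
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy)
        _, pos, b = best
        pulses.append((b, pos))
        s = shifted(g, pos, len(x))
        residual = [r - b * v for r, v in zip(residual, s)]
    return pulses, residual


g = [1.0, 0.5]                              # hypothetical impulse response
target = [0.0, 2.0, 1.0, 0.0, -1.0, -0.5]   # built from pulses at n=1 and n=4
pulses, residual = greedy_pulse_search(target, g, n_pulses=2)
```

Each added pulse requires re-synthesizing and re-scoring every candidate position, which illustrates why analysis by synthesis trades computation for quality.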

  35. Code-Excited Linear Prediction (CELP) • Use of VQ to Construct a Codebook of Excitation Sequences • a sequence consists of roughly 40 samples • a codebook of 512 ~ 1024 patterns is constructed with VQ • roughly 512 ~ 1024 excitation patterns are perceptually adequate • Excitation Search: analysis by synthesis • 9 ~ 10 bits are needed for the excitation of 40 samples, while the $\{a_k\}$ parameters in G(z) are also vector quantized

  36. Code-Excited Linear Prediction (CELP) • Receiver • $\{a_k\}$ codewords can be transmitted less frequently than the excitation codeword • Ref: Gold and Morgan, “Speech and Audio Signal Processing”, John Wiley & Sons, 2000, Chap 33
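
A minimal sketch of the analysis-by-synthesis codebook search follows: every candidate excitation is passed through the all-pole synthesis filter, scaled by its best gain, and compared with the target frame. The tiny hand-written codebook, the unweighted squared error, and the function names are assumptions for illustration; a real codebook would be trained with VQ and hold 512 to 1024 sequences of about 40 samples. The helper `index_bits` shows where the 9 ~ 10 bits come from.

```python
import math


def synthesize(excitation, a):
    """All-pole synthesis: x[n] = u[n] + sum_k a[k-1] * x[n-k]."""
    x = []
    for n, un in enumerate(excitation):
        acc = un
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * x[n - k]
        x.append(acc)
    return x


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the best-matching excitation."""
    best = (None, 0.0, float("inf"))
    for index, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best[2]:
            best = (index, gain, err)
    return best


def index_bits(codebook_size):
    """Bits needed to transmit one codebook index."""
    return math.ceil(math.log2(codebook_size))


a = [0.5]  # hypothetical LPC coefficient
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
target = [1.5 * v for v in synthesize(codebook[2], a)]
index, gain, err = search_codebook(target, codebook, a)
```

Only the index (plus a gain and the quantized filter parameters) needs to be sent, since the receiver holds the same codebook.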

  37. 4.0 Speech Recognition and Voice-based Network Access

  38. Speech Recognition as a pattern recognition problem • unknown speech signal x(t) → Feature Extraction → feature vector sequence X → Pattern Matching → Decision Making → output word W • training speech y(t) → Feature Extraction → Y → Reference Patterns (used in Pattern Matching)

  39. Basic Approach for Large Vocabulary Speech Recognition • A Simplified Block Diagram: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence • Speech Corpora → Acoustic Model Training → Acoustic Models • Text Corpora → Language Model Construction → Language Model • Lexicon • Example Input Sentence • this is speech • Acoustic Models (聲學模型) • (th-ih-s-ih-z-s-p-ih-ch) • Lexicon • (th-ih-s) → this • (ih-z) → is • (s-p-iy-ch) → speech • Language Model (語言模型) • (this) – (is) – (speech) • P(this) P(is | this) P(speech | this is) • $P(w_i | w_{i-1})$: bi-gram language model • $P(w_i | w_{i-1}, w_{i-2})$: tri-gram language model, etc.
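
The bi-gram language model on this slide can be illustrated with a small counting sketch. Under a bi-gram model, P(speech | this is) is approximated by P(speech | is). The toy corpus, the `<s>` / `</s>` boundary tokens, and the unsmoothed maximum-likelihood estimate are assumptions made for this example; practical systems add smoothing so unseen word pairs do not get zero probability.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and word pairs, with sentence boundary tokens."""
    history_counts, bigram_counts = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts


def sentence_probability(words, history_counts, bigram_counts):
    """Product of P(w_i | w_{i-1}) using maximum-likelihood estimates."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # history never seen in training
        prob *= bigram_counts[(prev, cur)] / history_counts[prev]
    return prob


corpus = [["this", "is", "speech"], ["this", "is", "text"]]  # toy corpus
histories, bigrams = train_bigram(corpus)
p_seen = sentence_probability(["this", "is", "speech"], histories, bigrams)
p_unseen = sentence_probability(["speech", "is"], histories, bigrams)
```

In the toy corpus "is" is followed by "speech" half the time, so that factor alone halves the sentence probability; the decoder combines such scores with the acoustic model scores when searching for the output sentence.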

  40. Observation Sequences

  41. State Transition Probabilities • 1-dim Gaussian Mixtures

  42. Simplified HMM • example observation sequence: RGBGGBBGRRR……

  43. Peripheral Processing for Human Perception

  44. Mel-scale Filter Bank

  45. N-gram: W1 W2 W3 W4 W5 W6 ...... WR • tri-gram: W1 W2 W3 W4 W5 W6 ...... WR
