
Speech in Multimedia



  1. Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007

  2. Outline • Introduction • Topics in speech processing • Speech coding • Speech recognition • Speech synthesis • Speaker verification/recognition • Conclusion

  3. Introduction • Speech is our basic communication tool. • We have long hoped to be able to communicate with machines using speech. (image: C-3PO and R2D2)

  4. Speech Production Model (figures: anatomical structure; mechanical model)

  5. Characteristics of Digital Speech (figures: speech waveform; speech spectrogram)

  6. Voiced and Unvoiced Speech (figure: waveform segmented into silence, unvoiced, and voiced regions)

  7. Short-time Parameters (figures: waveform; short-time power; envelope)

  8. Zero-crossing rate and pitch period (figures)
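
As a rough illustration of these short-time parameters, here is a minimal Python sketch (not from the slides; the frame length and hop size are illustrative choices) that computes per-frame power and zero-crossing rate:

```python
import numpy as np

def short_time_params(x, frame_len=240, hop=80):
    """Per-frame short-time power and zero-crossing rate of signal x (float samples)."""
    powers, zcrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        powers.append(np.mean(frame ** 2))                          # short-time power
        zcrs.append(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)   # crossings per sample
    return np.array(powers), np.array(zcrs)

# Voiced speech tends to show high power and low zero-crossing rate;
# unvoiced speech tends to show low power and a high zero-crossing rate.
```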

  9. Speech Coding • As with images, speech can be compressed to make it smaller and easier to store and transmit. • General compression methods such as DPCM can also be used (a simple sketch follows below). • More compression can be achieved by taking advantage of the speech production model. • There are two classes of speech coders: waveform coders and vocoders.
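
A minimal sketch of first-order DPCM, assuming samples normalized to [-1, 1] and a uniform quantizer with an illustrative step size (neither assumption comes from the slides):

```python
import numpy as np

def dpcm_encode(x, step=0.01):
    """Encode each sample as the quantized difference from the previously
    reconstructed sample, so encoder and decoder stay in sync."""
    codes = np.zeros(len(x), dtype=np.int32)
    prev = 0.0
    for n, sample in enumerate(x):
        diff = sample - prev                  # prediction residual
        codes[n] = int(round(diff / step))    # uniform quantizer
        prev = prev + codes[n] * step         # reconstructed sample
    return codes

def dpcm_decode(codes, step=0.01):
    return np.cumsum(codes.astype(float) * step)   # accumulate decoded differences
```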

  10. LPC Speech Coder (block diagram: the speech buffer supplies frames (frame n, frame n+1, …) to speech analysis, which extracts vocal tract parameters, pitch, a voiced/unvoiced decision, and energy; a parameter quantizer and code generation then produce the code stream)

  11. LPC and the Vocal Tract • Mathematically, speech can be modeled by the following generation model: x(n) = Σ_{p=1..k} a_p x(n−p) + e(n) • {a_1, a_2, …, a_k} are called the Linear Prediction Coefficients (LPC), which can be used to model the shape of the vocal tract. • e(n) is the excitation used to generate the speech.
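
A hedged sketch of estimating the LPC coefficients for one analysis frame, assuming the autocorrelation method with the normal equations solved directly (the slides do not specify the estimation method; the Hamming window and order k = 10 are illustrative choices):

```python
import numpy as np

def lpc_coefficients(frame, k=10):
    """Estimate k LPC coefficients so that x(n) ~= sum_p a_p * x(n-p)."""
    frame = frame * np.hamming(len(frame))                           # taper the frame
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]     # autocorrelation, lags 0..N-1
    R = np.array([[r[abs(i - j)] for j in range(k)] for i in range(k)])  # Toeplitz matrix
    a = np.linalg.solve(R, r[1:k + 1])                               # normal equations R a = r
    return a                                                          # coefficients a_1..a_k
```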

  12. Decoding and Speech Synthesis (block diagram: a voiced/unvoiced (U/V) switch selects either an impulse train generator driven by the pitch period followed by a glottal pulse generator, or a random noise generator; the excitation is scaled by the gain and passed through the vocal tract model and the radiation model to produce speech)
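
A minimal sketch of the decoder idea above: an impulse train (voiced) or white noise (unvoiced) excites the all-pole vocal tract filter. For brevity the glottal pulse and radiation models are folded into that single filter, and the gain, pitch period, and frame length are illustrative values:

```python
import numpy as np
from scipy.signal import lfilter

def lpc_synthesize(a, voiced, pitch_period=80, gain=1.0, n=240):
    """Generate one frame of speech from LPC coefficients a_1..a_k."""
    if voiced:
        excitation = np.zeros(n)
        excitation[::pitch_period] = 1.0      # impulse train at the pitch period
    else:
        excitation = np.random.randn(n)       # white-noise excitation
    # All-pole synthesis filter: denominator [1, -a_1, ..., -a_k], numerator = gain.
    return lfilter([gain], np.concatenate(([1.0], -a)), excitation)
```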

  13. An Example of Synthesizing Speech (figure: a glottal pulse is passed through the vocal tract filter with gain control, consecutive frames are blended in an overlap region, and the result is passed through the radiation filter)

  14. LPC10 (FS1015) • LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps. • LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients. (audio samples: original speech; LPC-decoded speech)

  15. Mixed Excitation LP • For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two. • The newer 2.4 kbps standard (MELP) addresses this problem. (block diagram: bandpass-filtered pulses weighted by w and bandpass-filtered noise weighted by 1−w are summed, scaled by the gain, and passed through the vocal tract model and radiation model to produce speech) (audio samples: original speech; MELP-decoded speech)
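
A simplified sketch of the mixed-excitation idea only, not the MELP standard (MELP mixes pulse and noise per frequency band with adaptive weights): pulses and noise are band-limited and combined with weights w and 1−w. The Butterworth filters, cutoff, and sampling rate are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, lfilter

def mixed_excitation(n=240, pitch_period=80, w=0.7, fs=8000):
    """Weighted mix of band-limited pulses and noise as an LPC excitation."""
    pulses = np.zeros(n)
    pulses[::pitch_period] = 1.0                              # periodic pulse component
    noise = np.random.randn(n)                                # aperiodic noise component
    b_lo, a_lo = butter(4, 1000 / (fs / 2), btype='low')      # keep pulses in the low band
    b_hi, a_hi = butter(4, 1000 / (fs / 2), btype='high')     # keep noise in the high band
    return w * lfilter(b_lo, a_lo, pulses) + (1 - w) * lfilter(b_hi, a_hi, noise)
```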

  16. Hybrid Speech Codecs • For higher bit rates, hybrid speech codecs have advantages over vocoders. • FS1016: CELP (Code-Excited Linear Prediction). • G.723.1: a dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet. • G.729: a CELP-based codec at 8 kbps. (block diagram, analysis by synthesis: candidate codes drive speech synthesis, the synthesized speech is compared "perceptually" with the input speech, and the model parameters that minimize the difference are selected) (audio samples at 5.3 kbps, 6.3 kbps, and 8 kbps)

  17. Speech Recognition • Speech recognition is the foundation of human-computer interaction using speech. • Speech recognition in different contexts: • Speaker-dependent or speaker-independent. • Discrete words or continuous speech. • Small vocabulary or large vocabulary. • Quiet environment or noisy environment. (block diagram: speech passes through a parameter analyzer; a comparison and decision algorithm matches the parameters against reference patterns, guided by a language model, to output words)

  18. How does Speech Recognition Work? Words: grey whales Phonemes: g r ey w ey l z Each phoneme has different characteristics (for example, the power distribution).

  19. Speech Recognition (frame-level phoneme labels: g g r ey ey ey ey w ey ey l l z) How do we "match" the word when there are time and other variations?

  20. Hidden Markov Model (diagram: states S1, S2, S3, each emitting symbols from {a, b, c, …}, connected by transition probabilities such as P12)

  21. Dynamic Programming in Decoding (figure: trellis with states on one axis and time on the other) We can find the path through the trellis that corresponds to the most probable phoneme sequence for generating the observed "feature" sequence (one feature vector extracted from each speech frame).
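
A minimal Viterbi sketch of this dynamic-programming search: it finds the most probable state (phoneme) path for an observation sequence given start, transition, and emission probabilities. The probability matrices here are placeholders, not trained models:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """obs: observation indices; start_p (S,), trans_p (S,S), emit_p (S,V): linear probabilities."""
    n_states = len(start_p)
    delta = np.zeros((len(obs), n_states))             # best score ending in each state
    back = np.zeros((len(obs), n_states), dtype=int)   # backpointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, len(obs)):
        for s in range(n_states):
            scores = delta[t - 1] * trans_p[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] * emit_p[s, obs[t]]
    path = [int(np.argmax(delta[-1]))]                 # best final state
    for t in range(len(obs) - 1, 0, -1):               # trace backpointers to the start
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```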

  22. HMM for a Unigram Language Model (diagram: a start state s0 enters HMM1 (word1), HMM2 (word2), …, HMMn (wordn) with probabilities p1, p2, p3, …)

  23. Speech Synthesis • Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.). • Speech synthesis is widely used in text-to-speech systems and various telephone services. • The easiest and most often used speech synthesis method is waveform concatenation. (figure: increasing the pitch without changing the speed)
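
A minimal sketch of waveform concatenation with a short cross-fade between stored units; unit selection and pitch/duration modification (such as the pitch change shown in the figure) are omitted, and the overlap length is an illustrative choice:

```python
import numpy as np

def concatenate_units(units, overlap=80):
    """Join recorded speech units, cross-fading over `overlap` samples at each joint."""
    out = units[0].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for unit in units[1:]:
        unit = unit.astype(float)
        out[-overlap:] = out[-overlap:] * (1 - fade) + unit[:overlap] * fade  # cross-fade
        out = np.concatenate([out, unit[overlap:]])
    return out
```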

  24. Speaker Recognition • Identifying or verifying the identity of a speaker is an application where computers can exceed human performance. • Vocal tract parameters can be used as features for speaker recognition. (figure: LPC covariance features for speaker one and speaker two)
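
A hedged sketch of one simple way to use vocal tract parameters as a speaker feature: average the per-frame LPC vectors and compare speakers by Euclidean distance. Real systems use covariance-based or statistical models; `lpc_fn` is assumed to be an LPC estimator such as the earlier `lpc_coefficients` sketch:

```python
import numpy as np

def speaker_feature(frames, lpc_fn, k=10):
    """Average LPC vector over the frames of one speaker's utterance."""
    return np.mean([lpc_fn(f, k) for f in frames], axis=0)

def speaker_distance(feat_a, feat_b):
    return float(np.linalg.norm(feat_a - feat_b))   # smaller distance => more similar speakers
```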

  25. Applications (diagram linking speech technologies to applications) • Technologies: speech recognition, speaker recognition, speech coding, text-to-speech synthesis. • Applications: call routing, document input, operator services, voice commands, directory assistance, voice over Internet, fraud control, wireless telephone, document correction, personalized service, speech interface.
