Speech in Multimedia. Hao Jiang Computer Science Department Boston College Oct. 9, 2007. Outline. Introduction Topics in speech processing Speech coding Speech recognition Speech synthesis Speaker verification/recognition Conclusion. Introduction.
Speech in Multimedia Hao Jiang Computer Science Department Boston College Oct. 9, 2007
Outline • Introduction • Topics in speech processing • Speech coding • Speech recognition • Speech synthesis • Speaker verification/recognition • Conclusion
Introduction • Speech is our basic communication tool. • We have long hoped to communicate with machines using speech. [Image: C-3PO and R2-D2]
Speech Production Model Anatomy Structure Mechanical Model
Characteristics of Digital Speech Waveform Speech Spectrogram
Voiced and Unvoiced Speech Silence unvoiced voiced
Short-time Parameters Short-time power Waveform Envelope
Zero crossing rate Pitch period
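The short-time power and zero crossing rate named on these slides can be sketched as follows. This is a minimal illustration, not the lecture's own code: the frame length, hop size, and the helper names `frames`, `short_time_power`, and `zero_crossing_rate` are assumptions chosen for the example.

```python
import math

def frames(signal, frame_len, hop):
    """Split a sample list into overlapping frames (a trailing partial frame is dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Toy input (assumed): a 100 Hz tone sampled at 8 kHz, standing in for voiced speech.
fs = 8000
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
for f in frames(tone, 200, 100)[:2]:
    print(round(short_time_power(f), 3), round(zero_crossing_rate(f), 3))
```

Voiced frames tend to show high power and a low zero crossing rate, while unvoiced frames tend to show the opposite, which is why the two parameters are often used together for a voiced/unvoiced decision.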
Speech Coding • As with images, speech can be compressed to make it smaller and easier to store and transmit. • General compression methods such as DPCM can also be used. • More compression can be achieved by taking advantage of the speech production model. • There are two classes of speech coders: • Waveform coders • Vocoders
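As an illustration of the general-purpose route, here is a minimal DPCM sketch. It assumes the simplest possible predictor (the previous reconstructed sample) and a uniform quantizer with a hypothetical step size `step`; practical coders use better predictors and adaptive quantizers.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the decoder's prediction."""
    codes, prediction = [], 0
    for s in samples:
        code = round((s - prediction) / step)  # quantized prediction error
        codes.append(code)
        prediction += code * step              # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 14, 13, 9]
codes = dpcm_encode(samples, step=2)
print(codes)
print(dpcm_decode(codes, step=2))
```

Because the encoder predicts from the reconstructed value rather than the original one, quantization error does not accumulate: each decoded sample stays within half a step of the input.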
LPC Speech Coder [Block diagram: speech → speech buffer (frame n, frame n+1) → speech analysis (vocal tract parameters, pitch, voiced/unvoiced decision, energy) → parameter quantizer → code generation → code stream]
LPC and Vocal Tract • Mathematically, speech can be described by the following generation model: $x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$ • $\{a_1, a_2, \ldots, a_k\}$ are called Linear Prediction Coefficients (LPC); they model the shape of the vocal tract. • $e(n)$ is the excitation that generates the speech.
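One common way to estimate the prediction coefficients from a frame of speech is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method is used, so the sketch below is an assumption, and the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; return ([a_1..a_order], residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused so that a[p] matches a_p
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Toy signal (assumed): impulse response of x(n) = 0.9 x(n-1) + e(n)
signal = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
print([round(c, 4) for c in coeffs])
```

The coefficients follow the sign convention of the model above, so the recovered first coefficient for the toy signal should be close to the 0.9 used to generate it.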
Decoding and Speech Synthesis [Block diagram: pitch period → impulse train generator → glottal pulse generator (voiced branch); random noise generator (unvoiced branch); U/V switch selects the excitation → gain → vocal tract model → radiation model → speech]
An Example of Synthesizing Speech [Figure panels: glottal pulse; output of the vocal tract filter with gain control, with blending region marked; output of the radiation filter]
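The core of the decoder can be sketched in a few lines: pick an impulse train or noise as the excitation, scale it by the gain, and run it through the all-pole vocal tract filter defined by the LPC coefficients. This is a simplified sketch; the glottal pulse shaping, radiation filter, and blending between frames are deliberately left out, and the function names are illustrative.

```python
import random

def excitation(num_samples, voiced, pitch_period, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

def synthesize(exc, lpc, gain):
    """All-pole filter: x(n) = sum_p a_p * x(n-p) + gain * e(n)."""
    out = []
    for n, e in enumerate(exc):
        past = sum(a * out[n - p] for p, a in enumerate(lpc, start=1) if n - p >= 0)
        out.append(past + gain * e)
    return out

# One voiced frame: 180 samples, pitch period of 80 samples, made-up coefficients.
frame = synthesize(excitation(180, voiced=True, pitch_period=80), [1.2, -0.6], gain=0.5)
print(len(frame), round(max(frame), 3))
```

The coefficients `[1.2, -0.6]` are arbitrary stable values for demonstration, not parameters of any real speech frame.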
LPC10 (FS1015) • LPC10 was the DoD speech coding standard for voice communication at 2.4 kbps. • LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients. [Audio: original speech; LPC decoded speech]
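The frame budget implied by these numbers is easy to work out. The sketch below assumes the standard 8 kHz telephone sampling rate and, for the comparison only, uncompressed 16-bit linear PCM as the input format.

```python
sample_rate_hz = 8000   # narrowband telephone speech (assumed)
frame_ms = 22.5
bit_rate_bps = 2400
pcm_bits_per_sample = 16  # assumption, used only for the comparison below

samples_per_frame = round(sample_rate_hz * frame_ms / 1000)
bits_per_frame = round(bit_rate_bps * frame_ms / 1000)
pcm_bits_per_frame = samples_per_frame * pcm_bits_per_sample

print(samples_per_frame)   # 180 samples in each frame
print(bits_per_frame)      # 54 bits to describe each frame
print(round(pcm_bits_per_frame / bits_per_frame, 1))
```

So all ten coefficients, the pitch, the voicing decision, and the energy of each 180-sample frame have to fit in 54 bits, which is why the parameters are quantized coarsely.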
Mixed Excitation LP • For real speech, the excitation is usually neither a pure pulse train nor pure noise, but a mixture of the two. • The newer 2.4 kbps standard, MELP, addresses this problem. [Block diagram: pulses → bandpass filter → weight w; noise → bandpass filter → weight 1-w; sum → gain → vocal tract model → radiation model → speech] [Audio: original speech; MELP decoded speech]
Hybrid Speech Codecs • At higher bit rates, hybrid speech codecs outperform vocoders. • FS1016: CELP (Code Excited Linear Prediction) • G.723.1: A dual bit rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet. • G.729: CELP based codec at 8 kbps. [Block diagram, analysis by synthesis: speech → model parameter generation → code; the code drives speech synthesis, and a "perceptual" comparison with the input speech guides the parameter choice] [Audio: sound at 5.3 kbps; sound at 6.3 kbps; sound at 8 kbps]
Speech Recognition • Speech recognition is the foundation of human-computer interaction using speech. • Speech recognition in different contexts • Speaker dependent or speaker independent. • Discrete words or continuous speech. • Small vocabulary or large vocabulary. • Quiet environment or noisy environment. [Block diagram: speech → parameter analyzer → comparison and decision algorithm (using reference patterns and a language model) → words]
How does Speech Recognition Work? Words: grey whales Phonemes: g r ey w ey l z Each phoneme has different characteristics (for example, the power distribution).
Speech Recognition g g r ey ey ey ey w ey ey l l z How do we “match” the word when there are time and other variations?
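One classic answer to the time-variation question is dynamic time warping (DTW), which lets a single template element absorb several consecutive frames. Note that this is a simpler stand-in for illustration; the slides go on to use hidden Markov models instead. The 0/1 label distance is also an assumption, since real systems compare feature vectors rather than phoneme labels.

```python
def dtw_distance(a, b, dist=lambda x, y: 0 if x == y else 1):
    """Minimum total cost of aligning sequence a to sequence b with DTW."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the cheapest of: stay on b[j-1], stay on a[i-1], or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

observed = "g g r ey ey ey ey w ey ey l l z".split()
grey_whales = "g r ey w ey l z".split()
other = "g r ey z".split()  # hypothetical competing template

print(dtw_distance(observed, grey_whales))  # 0: only the durations differ
print(dtw_distance(observed, other) > 0)    # True: some frames cannot be matched
```

The recognizer would pick the template with the smallest distance, so stretched or compressed pronunciations still match the right word.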
Hidden Markov Model P12 S1 S2 {a,b,c,…} {a,b,c,…} S3 {a,b,c,…}
Dynamic Programming in Decoding [Figure: trellis of states versus time] Dynamic programming finds the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature vector is extracted from each speech frame).
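The dynamic programming search described here is usually called the Viterbi algorithm. The sketch below runs it on a toy three-state left-to-right model with discrete symbols; all probabilities are made up for illustration, and a real recognizer would score continuous feature vectors instead of the symbols a, b, and c.

```python
import math

def _log(p):
    """Logarithm that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most likely state sequence."""
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {s: (_log(start_p[s]) + _log(emit_p[s][observations[0]]), [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            score, path = max((best[prev][0] + _log(trans_p[prev][s]), best[prev][1])
                              for prev in states)
            new_best[s] = (score + _log(emit_p[s][obs]), path + [s])
        best = new_best
    return max(best.values())

states = ["S1", "S2", "S3"]
start_p = {"S1": 1.0, "S2": 0.0, "S3": 0.0}
trans_p = {"S1": {"S1": 0.5, "S2": 0.5, "S3": 0.0},
           "S2": {"S1": 0.0, "S2": 0.5, "S3": 0.5},
           "S3": {"S1": 0.0, "S2": 0.0, "S3": 1.0}}
emit_p = {"S1": {"a": 0.8, "b": 0.1, "c": 0.1},
          "S2": {"a": 0.1, "b": 0.8, "c": 0.1},
          "S3": {"a": 0.1, "b": 0.1, "c": 0.8}}

log_prob, path = viterbi(list("aabcc"), states, start_p, trans_p, emit_p)
print(path)  # ['S1', 'S1', 'S2', 'S3', 'S3']
```

Working in log-probabilities avoids numerical underflow on long utterances, and keeping only the best path into each state at each time step is what makes the search linear in the number of frames.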
HMM for a Unigram Language Model HMM1 (word1) p1 HMM2 (word2) s0 p2 p3 HMM3 (wordn)
Speech Synthesis • Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.). • Speech synthesis has been widely used for text-to-speech systems and various telephone services. • The easiest and most often used speech synthesis method is waveform concatenation. [Audio example: increasing the pitch without changing the speed]
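Waveform concatenation can be sketched as joining prerecorded units with a short crossfade so that the seams do not click. The slides do not specify how units are joined, so the linear crossfade and the `overlap` parameter below are assumptions for illustration only.

```python
def crossfade_concat(units, overlap):
    """Join sample lists, blending `overlap` samples linearly at each seam.

    Every unit is assumed to be longer than `overlap` samples.
    """
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the incoming unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Two toy "units" with constant amplitude, blended over two samples.
joined = crossfade_concat([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], overlap=2)
print(len(joined), [round(v, 3) for v in joined])
```

Changing pitch without changing speed requires more than this, for example repositioning pitch-synchronous segments before overlapping them, which is beyond this sketch.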
Speaker Recognition • Identifying or verifying the identity of a speaker is an application where computers can outperform humans. • Vocal tract parameters can be used as features for speaker recognition. [Figure: LPC covariance feature for speaker one and speaker two]
Applications Speech recognition Call routing Document input Operator Services Voice Commands Directory Assistance Speaker recognition Speech Coding Voice over Internet Fraud Control Wireless Telephone Document Correction Personalized service Speech Interface Text-to-Speech synthesis