Speech in Multimedia

Hao Jiang, Computer Science Department, Boston College, Oct. 9, 2007

Outline • Introduction • Topics in speech processing • Speech coding • Speech recognition • Speech synthesis • Speaker verification/recognition • Conclusion

Introduction • Speech is our most basic communication tool. • We have long hoped to communicate with machines using speech. (Figure: C3PO and R2D2)

Speech Production Model (Figures: anatomical structure; mechanical model)

Characteristics of Digital Speech (Figures: waveform; speech spectrogram)

Voiced and Unvoiced Speech (Figure: waveform with segments labeled silence, unvoiced, and voiced)

Short-time Parameters (Figures: short-time power; waveform envelope)

Zero crossing rate Pitch period
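The short-time parameters named on these two slides can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function names, the frame and hop sizes, and the choice of an autocorrelation peak as the pitch estimator are all assumptions made for the example.

```python
def frames(signal, frame_len, hop):
    """Split a sequence into overlapping frames; a tail shorter than frame_len is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (zero counts as non-negative)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation inside [min_lag, max_lag]."""
    def r(lag):
        return sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
    return max(range(min_lag, max_lag + 1), key=r)

# Hypothetical test signal: one pulse every 80 samples
# (at an assumed 8 kHz sampling rate this is a 100 Hz pitch).
pulse_train = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
period = pitch_period(pulse_train, 20, 160)
```

Voiced frames tend to show high power and a low zero-crossing rate, while unvoiced frames show the opposite, which is why these two measures are commonly paired for a voiced/unvoiced decision.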

Speech Coding • As with images, speech can be compressed to make it smaller and easier to store and transmit. • General compression methods such as DPCM can also be used. • More compression can be achieved by taking advantage of the speech production model. • There are two classes of speech coders: • Waveform coders • Vocoders
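As a rough illustration of the DPCM idea mentioned above, the sketch below codes the difference between each sample and the previous reconstructed sample. The first-order predictor, the uniform quantizer, and the `step` parameter are assumptions chosen for brevity, not details taken from the slides.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the decoder's last reconstruction."""
    codes, prediction = [], 0
    for s in samples:
        code = round((s - prediction) / step)  # uniform quantizer index
        codes.append(code)
        prediction += code * step              # mirror what the decoder will rebuild
    return codes

def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out, prediction = [], 0
    for c in codes:
        prediction += c * step
        out.append(prediction)
    return out

samples = [0, 3, 7, 12, 14, 13, 9]
decoded = dpcm_decode(dpcm_encode(samples, 2), 2)
```

Because the encoder predicts from its own reconstruction rather than from the original samples, the quantization error stays bounded by half a step and does not accumulate.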

LPC Speech Coder (Block diagram components: speech, speech buffer holding frame n and frame n+1, speech analysis, vocal tract parameters, pitch, voiced/unvoiced decision, energy, parameter quantizer, code generation, code stream)

LPC and Vocal Tract • Mathematically, speech can be described by the following generation model: $x(n) = \sum_{p=1}^{k} a_p \, x(n-p) + e(n)$ • $\{a_1, a_2, \ldots, a_k\}$ are called Linear Prediction Coefficients (LPC); they model the shape of the vocal tract. • $e(n)$ is the excitation that generates the speech.
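One common way to estimate the coefficients of the generation model above is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimator the lecture used, so treat this as a sketch under that assumption; the function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x(n) * x(n - lag), for lag = 0..max_lag."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_order in x(n) ~ sum_p a_p x(n-p).

    Returns (coefficients, residual prediction error energy).
    """
    a = [0.0] * (order + 1)        # a[0] is unused; a[p] multiplies x(n-p)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

def lpc(x, order):
    """Estimate `order` linear prediction coefficients from the samples x."""
    return levinson_durbin(autocorrelation(x, order), order)

# Hypothetical check: the response of x(n) = 0.5 x(n-1) + e(n) to a single impulse.
signal = [0.5 ** n for n in range(60)]
coeffs, residual = lpc(signal, 2)  # coeffs is close to [0.5, 0.0]
```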

Decoding and Speech Synthesis (Block diagram components: pitch period, impulse train generator, glottal pulse generator, random noise generator, voiced/unvoiced (U/V) switch, gain, vocal tract model, radiation model, speech output)
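The decoder in this diagram can be approximated by switching between two excitations and running the result through the all-pole vocal tract filter. The sketch below is deliberately simplified: it leaves out the glottal pulse and radiation models, and the function names and the unit-variance Gaussian noise are assumptions.

```python
import random

def excitation(n_samples, voiced, pitch_period, rng):
    """Impulse train for voiced frames, random noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def synthesize(a, e, gain=1.0):
    """All-pole filter: x(n) = sum_p a[p-1] * x(n-p) + gain * e(n)."""
    x = []
    for n, en in enumerate(e):
        past = sum(a[p - 1] * x[n - p] for p in range(1, len(a) + 1) if n - p >= 0)
        x.append(past + gain * en)
    return x

rng = random.Random(0)
voiced_frame = synthesize([0.5], excitation(8, True, 4, rng))
unvoiced_frame = synthesize([0.5], excitation(8, False, 4, rng), gain=0.1)
```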

An Example of Synthesizing Speech (Figure stages: glottal pulse; pass through the vocal tract filter with gain control; blending region; pass through the radiation filter)

LPC10 (FS1015) • LPC10 was the U.S. DoD speech coding standard for voice communication at 2.4 kbps. • LPC10 works on speech sampled at 8 kHz, using a 22.5 ms frame and 10 LPC coefficients. (Audio samples: original speech; LPC decoded speech)
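The frame size and bit rate are tied together by simple arithmetic. The figure of 54 bits per frame is not on the slide; it is the commonly cited LPC10 frame size and is used here as an assumption, as is the 8 kHz sampling rate.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples covered by one analysis frame."""
    return sample_rate_hz * frame_ms / 1000

def bit_rate_bps(bits_per_frame, frame_ms):
    """Bits per second when one frame of bits is sent every frame_ms milliseconds."""
    return bits_per_frame * 1000 / frame_ms

frame_samples = samples_per_frame(8000, 22.5)  # 180 samples per frame
rate = bit_rate_bps(54, 22.5)                  # 54 bits per frame (assumed) gives 2400 bps
```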

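Mixed Excitation LP • For real speech, the excitation is usually not a pure pulse train or pure noise but a mixture of the two. • The newer 2.4 kbps standard (MELP) addresses this problem. (Block diagram: pulses pass through a bandpass filter and are weighted by w; noise passes through a bandpass filter and is weighted by 1-w; the sum goes through the gain, the vocal tract model, and the radiation model to produce speech. Audio samples: original speech; MELP decoded speech)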
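The weighting in the diagram amounts to a convex mix of the two excitation sources. The sketch below shows only that mixing step for a single band; it omits the bandpass filters, and the function name is illustrative rather than taken from the MELP standard.

```python
def mixed_excitation(pulses, noise, w):
    """Blend a pulse excitation and a noise excitation: w * pulses + (1 - w) * noise."""
    return [w * p + (1 - w) * n for p, n in zip(pulses, noise)]

pulses = [1.0, 0.0, 0.0, 0.0]
noise = [0.5, -0.5, 0.25, -0.25]
mostly_voiced = mixed_excitation(pulses, noise, 0.75)
```

Setting w to 1 recovers the pure pulse excitation of the basic LPC decoder, and setting it to 0 recovers pure noise.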

Hybrid Speech Codecs • At higher bit rates, hybrid speech codecs outperform vocoders. • FS1016: CELP (Code Excited Linear Prediction). • G.723.1: a dual-rate codec (5.3 kbps and 6.3 kbps) for multimedia communication over the Internet. • G.729: a CELP-based codec at 8 kbps. (Block diagram, analysis by synthesis: model parameter generation, speech synthesis, "perceptual" comparison with the input speech, code. Audio samples: 5.3 kbps; 6.3 kbps; 8 kbps)
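Analysis by synthesis means the encoder tries candidate excitations, synthesizes each one, and keeps the candidate whose output is closest to the input. The sketch below is a toy version of that loop: it uses plain squared error where real CELP codecs use a perceptually weighted error, and the tiny codebook and function names are made up for the example.

```python
def synthesize(a, e):
    """All-pole filter: x(n) = sum_p a[p-1] * x(n-p) + e(n)."""
    x = []
    for n, en in enumerate(e):
        past = sum(a[p - 1] * x[n - p] for p in range(1, len(a) + 1) if n - p >= 0)
        x.append(past + en)
    return x

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the codebook entry that best reproduces target."""
    best = None
    for index, code in enumerate(codebook):
        synth = synthesize(a, code)
        energy = sum(s * s for s in synth)
        if energy == 0:
            continue                       # an all-zero entry cannot match anything
        gain = sum(t * s for t, s in zip(target, synth)) / energy  # least-squares gain
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 0, 0]]   # hypothetical excitations
target = [2.0 * s for s in synthesize([0.5], codebook[2])]
index, gain, error = search_codebook(target, codebook, [0.5])
```

Only the winning index and the quantized gain need to be transmitted, since the decoder holds the same codebook.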

Speech Recognition • Speech recognition is the foundation of human-computer interaction using speech. • Speech recognition operates in different contexts: • Speaker dependent or speaker independent. • Discrete words or continuous speech. • Small or large vocabulary. • Quiet or noisy environment. (Block diagram components: speech, parameter analyzer, reference patterns, comparison and decision algorithm, language model, words)

How Does Speech Recognition Work? Words: grey whales. Phonemes: g r ey w ey l z. Each phoneme has different characteristics (for example, the power distribution).

Speech Recognition g g r ey ey ey ey w ey ey l l z How do we “match” the word when there are time and other variations?

Hidden Markov Model (Figure: states S1, S2, and S3, each emitting symbols from {a,b,c,…}; P12 labels the transition from S1 to S2)

Dynamic Programming in Decoding (Figure axes: time, states) We can find the path through the states that corresponds to the most probable phoneme sequence for generating the observed feature sequence (one feature extracted per speech frame).
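The dynamic programming search described here is usually called the Viterbi algorithm. The sketch below works in the log domain on a toy left-to-right model; the three phoneme states, the symbols "a", "b", "c" standing in for quantized frame features, and all probabilities are invented for illustration.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(observations, states, start, trans, emit):
    """Return (most probable state path, its log-probability).

    start[s], trans[q][s], and emit[s][o] are probabilities; missing entries count as 0.
    """
    def lt(q, s):
        return safe_log(trans.get(q, {}).get(s, 0.0))

    def le(s, o):
        return safe_log(emit[s].get(o, 0.0))

    # best[s]: log-probability of the best path that ends in state s at the current frame
    best = {s: safe_log(start.get(s, 0.0)) + le(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda q: best[q] + lt(q, s))
            scores[s] = best[prev] + lt(prev, s) + le(s, obs)
            pointers[s] = prev
        best = scores
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):        # follow the back-pointers to frame 0
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

states = ["g", "r", "ey"]
start = {"g": 1.0}
trans = {"g": {"g": 0.5, "r": 0.5}, "r": {"r": 0.5, "ey": 0.5}, "ey": {"ey": 1.0}}
emit = {"g": {"a": 0.8, "b": 0.1, "c": 0.1},
        "r": {"a": 0.1, "b": 0.8, "c": 0.1},
        "ey": {"a": 0.1, "b": 0.1, "c": 0.8}}
path, log_prob = viterbi(["a", "a", "b", "c", "c"], states, start, trans, emit)
# path is ['g', 'g', 'r', 'ey', 'ey']: repeated frames are absorbed by self-transitions
```

The self-transitions are what let one phoneme state cover a variable number of frames, which is how the model handles differences in speaking rate.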

HMM for a Unigram Language Model (Figure: from the start state s0, transitions with probabilities p1, p2, and p3 lead to HMM1 (word1), HMM2 (word2), and HMM3 (wordn))

Speech Synthesis • Speech synthesis generates (arbitrary) speech with desired properties (pitch, speed, loudness, articulation mode, etc.). • Speech synthesis is widely used in text-to-speech systems and telephone services. • The simplest and most commonly used speech synthesis method is waveform concatenation. (Audio sample: increasing the pitch without changing the speed)
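Waveform concatenation joins prerecorded units end to end. A simple way to avoid clicks at the joints is a short linear cross-fade, sketched below; the cross-fade itself, the `overlap` parameter, and the function name are assumptions, and the sketch does not attempt the pitch modification mentioned on the slide.

```python
def concatenate(units, overlap):
    """Join waveform units, linearly cross-fading `overlap` samples at each joint.

    Every unit must be longer than `overlap` samples.
    """
    out = list(units[0])
    for unit in units[1:]:
        start = len(out) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)    # weight of the incoming unit rises across the joint
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# Hypothetical units: a constant segment fading into silence over two samples.
joined = concatenate([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], 2)
```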

Speaker Recognition • Identifying or verifying the identity of a speaker is an application where computers can outperform humans. • Vocal tract parameters can be used as features for speaker recognition. (Figure: LPC covariance feature for speaker one and speaker two)

Applications Speech recognition Call routing Document input Operator Services Voice Commands Directory Assistance Speaker recognition Speech Coding Voice over Internet Fraud Control Wireless Telephone Document Correction Personalized service Speech Interface Text-to-Speech synthesis
