Multimedia Communications (371) Speech and Image Communications (348)

Multimedia Communications (371), Speech and Image Communications (348)
John Mason, Engineering, Swansea University
EG-348_371_09

Features in speech
[Diagram: acquisition (frame: 20/30 ms, sampling frequency: 8 kHz) → feature extraction → feature vectors X1 ... Xi ... over time]

Speech production
[Diagram: air from the lungs → vocal folds → vocal tract → speech]

Speech production
[Diagram: air from the lungs → vocal folds → vocal tract → speech, modelled as noise → H1(z) → H2(z) → synthesised speech]
• LPC: short- and long-term
• The spectral envelope reflects morphological characteristics of the vocal tract

Features: building of statistical model
[Figure: feature vectors labelled T1 and T2]

VT Shape & Some Vowels - Ladefoged ‘62

Speech Processing - Applications
• Why?
  • Communications
  • Synthesis
  • Recognition: Speech & Speaker
• How?
  • Frame-based
  • Systems approach

Some Books
• Flanagan - ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag - a classic!
• Furui - several books on recognition
• Parsons - ‘Voice and Speech Processing’, McGraw-Hill - one of the first textbooks on computer speech processing
• O’Shaughnessy - ‘Speech Comms - human and machine’, Addison-Wesley
• Rabiner & Juang - ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993
• Ramachandran & Mammone (eds) - ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

Speech Communications
• Person-to-Person
• Person-to-Machine: speech/speaker recognition
• Machine-to-Person: speech synthesis

(Electronic) Speech Communications
Perhaps separated by long distance (or in time)

Telephony & Broadcasting
[Diagram: acoustic air path → transmission path (electronic link) → acoustic air path]

Speech Comms: Telephony
Transmission path (electronic link):
[Diagram: microphone → ADC → analysis → coding → transmitter → channel → receiver → decoding → (re-)synthesis → DAC → loudspeaker]

Speech Bit Rates
[Diagram: message creation → language coding → human acoustic generation → transmission (acoustic space) → human hearing extraction → language decoding → message realisation]
Approx. bit rate in bps (figure labels): tens, hundreds, thousands, tens of thousands

Criteria in Speech Comms.: Quality versus Bit-rate
[Figure: quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps), with CELP, GSM and ADPCM marked]
4 Quality Measures:
• intelligibility
• loudness
• naturalness
• ease-of-listening

Low Bit Rate Speech Coding
Compandent http://www.compandent.com/

Speech Processing
The three main application areas are:
• Speech Comms. (the ‘electronic link’)
• Automatic Speech/Speaker recognition
• Speech Synthesis
Much of the underlying analysis is common, eg linear predictive coding

What does speech look like?
• Dynamic range - for flexibility and robustness
• Time-varying - to convey information

Frame-based Analysis
• To capture time variations:
  • 20-30 ms frames - ‘centi-second’ labelling
• Spectral analysis:
  • FFT
  • Filter-bank
  • Linear Predictive Coding
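The framing step can be sketched as below. This is a minimal illustration, not the course's own code: the function name `split_into_frames` and its parameters are made up here, the 8 kHz sampling rate and 20 ms frame are taken from the earlier slides, and the 10 ms hop is an assumption read from the ‘centi-second’ labelling.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20, hop_ms=10):
    """Cut a sample sequence into overlapping fixed-length frames.

    All names and defaults are illustrative assumptions.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 8000 * 20 / 1000 = 160 samples
    hop_len = sample_rate_hz * hop_ms // 1000      # 80 samples between frame starts
    frames = []
    start = 0
    # Only keep complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

Each returned frame would then go to the spectral analysis (FFT, filter-bank or LPC) independently.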

Speech Analysis/Coding
• Two general cases:
  • Waveform coders
  • Source (voice) coders (vo-coders)
• Source coders, eg linear predictive coding (LPC):
  • Model the source, ie the vocal tract (VT)
  • Linear, time-varying model of VT, plus excitation
[Diagram: excitation e_n (voiced or unvoiced) → H(z) → speech s_n]

Systems Approach
Speech model:
[Diagram: excitation (voiced, with f0, or unvoiced) → vocal tract (time-varying parameters) → speech]

LPC Analysis/Synthesis
• Synthesis:
  • Input: excitation
  • Output: speech
  • [Diagram: e_n, E(z) → H(z), impulse response h_n → s_n, S(z)]
• Analysis:
  • Input: speech
  • Output: excitation
  • [Diagram: s_n, S(z) → 1/H(z) → e_n, E(z)]

‘Perfect’ Analysis/Synthesis
[Diagram: s_n, S(z) → 1/H(z) → e_n, E(z) → H(z) → s_n, S(z)]
Input s_n and output s_n are identical (within arithmetic limits)

Practical Analysis/Synthesis
[Diagram: sending side s_n, S(z) → 1/H(z) → e_n, E(z); transmission; receiving side e_n, E(z) → H(z) → s_n, S(z)]
• Parameters for transmission:
  • Input / excitation e_n
  • Source model H(z)
• Thus Analysis must derive these parameters, and
• Synthesis must use them to re-generate speech

Linear Predictive Coding - LPC
Principle of linear prediction:
• The next value (or sample) in a series, ie at time n, is predicted or estimated by a weighted sum of previous values, ie those at time n-1, n-2, ...
• Thus for a predictor of order p, we have:
$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$

Linear Prediction
Transforming to the z-domain gives:
$\hat{S}(z) = A'(z)\,S(z)$, where $A'(z) = \sum_{i=1}^{p} a_i z^{-i}$

LPC Error Terms
The error is simply the difference between the actual and predicted values:
$e_n = s_n - \hat{s}_n$
[Diagram: s_n is passed through A'(z) to form ŝ_n, which is subtracted from s_n to give e_n]
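As a rough sketch of the error calculation, the function below forms the prediction from the previous p samples and subtracts it from the actual sample. The name `lpc_residual` is illustrative, and treating samples before the start of the sequence as zero is an assumption made to keep the example short.

```python
def lpc_residual(s, a):
    """Return e_n = s_n - sum_{i=1..p} a_i * s_{n-i} for every n.

    s: list of speech samples; a: list [a_1, ..., a_p].
    Samples before the start of s are taken as zero (an assumption).
    """
    p = len(a)
    e = []
    for n in range(len(s)):
        predicted = sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        e.append(s[n] - predicted)
    return e
```

For a sequence that decays by a factor of 0.9 per sample, a first-order predictor with a_1 = 0.9 leaves a residual that is zero everywhere except the first sample, which is the sense in which a good predictor removes redundancy.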

Synthesis
[Diagram: e_n → H(z) → s_n; s_n is formed by adding e_n to the output of A'(z), which is fed from s_n]
Parameters updated at frame rate
NB ‘hat’ of approximation omitted for simplicity

Analysis for Synthesis
[Diagram: analysis s_n, S(z) → 1/H(z) → e_n, E(z), formed by subtracting the output of A'(z) from s_n; synthesis e_n → H(z) → s_n]
• The Analysis and Synthesis must match
• What is needed for the Synthesis?
• Answer: e_n, the excitation, and H(z), the system
• Thus the Analysis must derive these terms (from s_n)
• The speech signal, s_n, is analysed to give e_n and H(z), ie the A'(z) parameters, for transmission

Derivation of LPC Coefficients - A(z)
Recall:
$\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$
where a_i are the p prediction coefficients.
The principle behind LPC is to find a set of p coefficients, a_1, a_2, a_3, ... a_p, which in some sense minimises the error signal e_n over a frame of N speech samples. This leads to a set of p coefficients for each frame.

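Derivation of A(z) - (2)
Minimisation of E_n, the sum of e_n^2 over the frame, is achieved by setting the p partial derivatives to zero:
$\frac{\partial E_n}{\partial a_i} = 0$ for i = 1, 2, ... p
From which:
$\sum_{k=1}^{p} a_k r_{|i-k|} = r_i$
where:
$r_j = \sum_{n} s_n s_{n-j}$
In matrix form:
$[R]\,\mathbf{a} = \mathbf{r}$
or
$\mathbf{a} = [R]^{-1}\,\mathbf{r}$
The matrix [R] is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin’s recursion algorithm being one of the most popular.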
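Durbin's recursion can be sketched as follows. This is an illustrative version under stated assumptions, not necessarily the form used in the course: the function names are made up, r is assumed to hold the autocorrelation values r_0 ... r_p of one frame, and the sign convention matches the predictor above (prediction = sum of a_i times s_{n-i}).

```python
def autocorrelation(s, max_lag):
    """r_j = sum_n s_n * s_{n-j} over one frame, for j = 0 .. max_lag."""
    return [sum(s[n] * s[n - j] for n in range(j, len(s))) for j in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve sum_k a_k * r_|i-k| = r_i (i = 1..p) without inverting [R].

    Returns ([a_1, ..., a_p], final prediction error energy).
    Assumes r[0] > 0; a silent frame would need special handling.
    """
    a = [0.0] * (p + 1)  # a[0] is unused so that a[k] holds a_k
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m
        k_m = (r[m] - sum(a[k] * r[m - k] for k in range(1, m))) / err
        updated = a[:]
        updated[m] = k_m
        for k in range(1, m):
            updated[k] = a[k] - k_m * a[m - k]
        a = updated
        err *= 1.0 - k_m * k_m
    return a[1:], err
```

The recursion builds the order-p solution from the order-(p-1) solution, so the cost grows with p squared rather than p cubed as for general matrix inversion.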

Derivation of A(z) - (3)
• When N is very large, r is the autocorrelation coefficients of s
• s comes from e convolved with h (excitation & vocal tract)
• We are interested here in separating e and h
• The predictor order, p, is small to reflect the short-term periodicities (formants)
• With higher predictor orders we will get the longer-term periodicities (pitch)
• 2 practical problems with evaluating a:
  • matrix singularities in R^-1
  • unstable resultant H(z)
• In practice both are solved by windowing - shaping the frame - eg Hamming
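A minimal sketch of the windowing step is given below. The function names are illustrative, and the Hamming coefficients 0.54 and 0.46 are the usual textbook values rather than something stated on the slide.

```python
import math


def hamming_window(length):
    """w_n = 0.54 - 0.46 * cos(2*pi*n / (length - 1)), for n = 0 .. length-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def windowed_autocorrelation(frame, p):
    """Shape the frame with a Hamming window, then return r_0 .. r_p."""
    w = hamming_window(len(frame))
    x = [frame[n] * w[n] for n in range(len(frame))]
    return [sum(x[n] * x[n - j] for n in range(j, len(x))) for j in range(p + 1)]
```

The window tapers the frame towards zero at both ends, so the autocorrelation values are computed from a signal that is treated as zero outside the frame.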

Speech Signal Characteristics
• Duration
• Dynamic range
• Periodicities:
  • vocal tract
  • pitch
• Frame-based analysis:
  • frame size: quasi-stationary, yet able to capture transitions; typically 20 - 30 ms
  • frame rate: task dependent; more means more bandwidth/computation - up to 100 frames/second

Harmonic Structures and Periodicities
• Harmonic structures & periodicities give potential for data reduction
• LPC is one way of gaining this compression
• Speech has two obvious separate structures:
  • vocal tract resonances
  • pitch

Harmonic Structures and Periodicities
Short-term prediction
[Diagram: excitation e_n (voiced or unvoiced) → H(z), vocal tract, short term → speech s_n; figure labels: Tp, p]

Harmonic Structures and Periodicities
Long-term prediction
[Diagram: excitation ep_n (voiced/unvoiced) → Hlt(z), pitch → e_n → Hst(z), vocal tract → speech s_n; figure labels: Tp, P]

Harmonic Structures and Periodicities
Two structures: short-term (formants) & long-term - pitch (excitation)
[Diagram: ep_n → Hlt(z) → e_n → Hst(z) → s_n, with gain k]
• Frame: eg 20 ms = 160 samples @ 8 kHz
• Hlt(z): a_i, eg p=3
• Hst(z): a_i, eg p=10
NB Representations of these parameters are transmitted
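The long-term (pitch) stage can be illustrated with a deliberately simplified one-tap predictor, whereas the slide suggests something like three taps. Everything here is an assumption for illustration: the function names, the lag search range of 20 to 147 samples, and the choice of the lag that maximises the raw correlation of the short-term residual e_n with a delayed copy of itself.

```python
def long_term_predictor(e, min_lag=20, max_lag=147):
    """Return (Tp, k): the best lag in samples and a one-tap gain.

    e: short-term residual for one frame. The lag range is an assumed
    search range, not a value from the slides.
    """
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, len(e) - 1) + 1):
        corr = sum(e[n] * e[n - lag] for n in range(lag, len(e)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    energy = sum(e[n - best_lag] ** 2 for n in range(best_lag, len(e)))
    gain = best_corr / energy if energy > 0 else 0.0
    return best_lag, gain


def long_term_residual(e, lag, gain):
    """ep_n = e_n - k * e_{n-Tp}; samples before the frame start count as zero."""
    return [e[n] - (gain * e[n - lag] if n >= lag else 0.0) for n in range(len(e))]
```

For a 160-sample frame containing a pulse every 50 samples, this search picks a lag of 50 and a gain of 1, and the long-term residual keeps only the first pulse.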

Practical Coding Systems
• Waveform & Source Coders (Vocoders)
• 2 periodicities/redundancies in source:
  • short-term (formants)
  • long-term - pitch
• Excitation
[Diagram: ep_n → Hlt(z) → e_n → Hst(z) → s_n]

‘Perfect’ Analysis/Synthesis (1)
[Diagram: s_n, S(z) → 1/H(z) → e_n, E(z) → H(z) → s_n, S(z)]
Input s_n and output s_n are identical (within arithmetic limits)

‘Perfect’ Analysis/Synthesis (2)
With $H(z) = \dfrac{1}{1 - A'(z)}$, the analysis filter is $\dfrac{1}{H(z)} = 1 - A'(z)$:
[Diagram: s_n, S(z) → 1 - A'(z) → e_n, E(z) → 1/(1 - A'(z)) → s_n, S(z)]

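‘Perfect’ Analysis/Synthesis (3)
Analysis filter 1 - A'(z): original speech s_n in, residual e_n out
[Diagram: a chain of z^-1 delays gives s_{n-1}, ..., s_{n-i}, ..., s_{n-p}; these are weighted by a_1, ..., a_i, ..., a_p and summed to form ŝ_n, which is subtracted from s_n to give e_n]
Note the minus sign: in Matlab it is combined with a_i
What determines p?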

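‘Perfect’ Analysis/Synthesis (4)
Analysis 1 - A'(z) followed by re-synthesis 1/(1 - A'(z)): original speech s_n → residual e_n → re-synthesised s_n
[Diagram: in both filters a chain of z^-1 delays gives s_{n-1}, ..., s_{n-i}, ..., s_{n-p}, weighted by a_1, ..., a_i, ..., a_p and summed; the analysis side subtracts the sum from s_n, the re-synthesis side adds it to e_n]
Note: no minus sign on the re-synthesis side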
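The ‘perfect’ round trip can be checked numerically with a small sketch. The function names `analyse` and `synthesise` are illustrative, and both filters assume zero samples before the start of the sequence; the only difference between them is the sign applied to the weighted sum, as the slide notes.

```python
def analyse(s, a):
    """Analysis filter 1 - A'(z): e_n = s_n - sum_i a_i * s_{n-i}."""
    p = len(a)
    return [
        s[n] - sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        for n in range(len(s))
    ]


def synthesise(e, a):
    """Synthesis filter 1/(1 - A'(z)): s_n = e_n + sum_i a_i * s_{n-i}."""
    p = len(a)
    s = []
    for n in range(len(e)):
        # The feedback uses the already re-synthesised output samples.
        feedback = sum(a[i - 1] * s[n - i] for i in range(1, p + 1) if n - i >= 0)
        s.append(e[n] + feedback)
    return s
```

Passing any sequence through `analyse` and then `synthesise` with the same coefficients returns the original samples to within floating-point rounding, whatever the coefficient values are; the quality of the coefficients only affects how small the residual is.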

Practical System
[Diagram: s_n, S(z) → 1/H(z) → e_n, E(z) → transmitted data frame → approximated e_n → H(z) → approximated s_n]
Input s_n and output s_n are “similar”
What does the transmitted data frame contain?

Analysis-by-Synthesis: LPAS
• Integrated encoder & decoder at the encoder
[Diagram: LPAS encoder; the output of the basic decoder is subtracted from s_n and the weighted error drives the adaptive encoder]

Log Spectral Estimates
• Comparisons between frames are very important in many situations
• Log spectral estimates are the most common basis for the comparison, D
• In Comms, computation is expensive, so parameter-vector approximations to D are used
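One common way to compare two frames is sketched below, purely as an illustration: the slide's own definition of D is not reproduced in this text, so the RMS difference in dB between two LPC spectral envelopes used here is an assumption, as are the function names and the choice of 64 frequency points.

```python
import cmath
import math


def lpc_log_spectrum(a, num_points=64):
    """Log magnitude in dB of 1/(1 - A'(z)) at num_points frequencies in [0, pi)."""
    spectrum = []
    for m in range(num_points):
        w = math.pi * m / num_points
        denom = 1.0 - sum(a[i] * cmath.exp(-1j * w * (i + 1)) for i in range(len(a)))
        # Assumes no pole lies exactly on the unit circle, so abs(denom) > 0.
        spectrum.append(-20.0 * math.log10(abs(denom)))
    return spectrum


def log_spectral_distance(a1, a2, num_points=64):
    """RMS difference in dB between two LPC envelopes (an assumed form of D)."""
    s1 = lpc_log_spectrum(a1, num_points)
    s2 = lpc_log_spectrum(a2, num_points)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(s1, s2)) / num_points)
```

Evaluating the spectrum at many frequencies for every comparison is the expensive part, which is the motivation for working on the parameter vectors directly.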

Some Standards
Standard   Application         Coder      Bit rate (kb/s)
GSM        European cellular   RPE-LTP    13
FS1016     Secure voice        CELP       4.8
IS54       NA cellular         VSELP      7.95
IS96       NA cellular         QCELP      1-8
JDC-FR     Japanese cellular   VSELP      6.7
JDC-HR     Japanese cellular   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16
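To get a feel for these rates, the sketch below converts a bit rate into the number of bits available per frame. The 20 ms frame is an assumption carried over from the earlier slides; the standards listed do not all use 20 ms frames, so the figures are indicative only, and the function name is illustrative.

```python
def bits_per_frame(rate_kbps, frame_ms=20):
    """Bits available in one frame: kb/s multiplied by ms gives bits.

    frame_ms=20 is an assumed frame length, not a property of each standard.
    """
    return round(rate_kbps * frame_ms)


# Fixed bit rates from the list above, in kb/s.
fixed_rates = {"GSM": 13, "FS1016": 4.8, "IS54": 7.95, "JDC-FR": 6.7, "G.728": 16}
budget = {name: bits_per_frame(rate) for name, rate in fixed_rates.items()}
```

At 13 kb/s, for example, a 20 ms frame carries 260 bits, which must cover the short-term coefficients, the long-term parameters and the excitation.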


