EE2F1 Multimedia (1): Speech & Audio Technology
Lecture 9: Speech Coding
Martin Russell, Electronic, Electrical & Computer Engineering, School of Engineering, The University of Birmingham
What is speech coding?
• Digitisation of speech for transmission or storage
• Aim: minimise the bit rate in bits per second (bps) while preserving speech quality:
  • intelligibility
  • naturalness
• Main kinds of speech coding scheme:
  • waveform coder
  • vocoder
Approaches
• Waveform coding
  • Work for all audio signals
  • Generic methods for bit reduction
  • Exploit properties of human hearing
• Vocoders
  • Optimised for speech coding
  • Assume that the signal to be encoded is speech
Waveform coders
• PCM (Pulse Code Modulation)
• DPCM (Differential PCM)
• ADPCM (Adaptive Differential PCM)
• Delta modulation (1-bit DPCM; 1-bit ADPCM when the step size adapts)
Pulse Code Modulation (PCM)
• How many quantization points?
• How many samples per second (sample rate)?
[Figure: sampled waveform marking the sample points, the quantization points and the quantization error between them]
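A minimal sketch of the two PCM design choices on this slide. It assumes a uniform mid-rise quantiser over [-1, 1]; the 64 kbps telephone PCM mentioned in the summary uses companded (μ-law or A-law) quantisation instead, which this sketch leaves out. The function names `pcm_quantise` and `pcm_bit_rate` are illustrative only.

```python
import math

def pcm_quantise(samples, bits, full_scale=1.0):
    """Uniform (mid-rise) quantiser: map each sample in [-full_scale, full_scale]
    to one of 2**bits integer codes and return (codes, reconstructed values)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    codes, recon = [], []
    for x in samples:
        x = max(-full_scale, min(full_scale, x))            # clip to the coder's range
        k = min(int((x + full_scale) // step), levels - 1)  # top edge joins the last cell
        codes.append(k)
        recon.append(-full_scale + (k + 0.5) * step)        # decode to the cell centre
    return codes, recon

def pcm_bit_rate(bits_per_sample, sample_rate_hz):
    """Bits per second = bits per sample x samples per second."""
    return bits_per_sample * sample_rate_hz

FS = 8000  # 8 kHz sampling rate, as in the slides
tone = [0.8 * math.sin(2 * math.pi * 440 * n / FS) for n in range(160)]
codes, recon = pcm_quantise(tone, bits=8)
max_error = max(abs(x - y) for x, y in zip(tone, recon))
print(pcm_bit_rate(8, FS))  # 64000
print(max_error)            # at most half a quantisation step (1/256 here)
```

Each extra bit doubles the number of quantization points and halves the maximum error, while the bit rate grows linearly with bits per sample and with the sample rate.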
Differential PCM
• Encode the differences between successive quantised sample values, rather than the values themselves
[Figure: sampled waveform marking the sample points, the quantization points and the quantization error]
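A minimal closed-loop DPCM sketch, assuming the prediction for each sample is simply the previous reconstructed sample and that differences are quantised with a fixed step and a fixed number of bits (the step of 0.01 and the 4-bit code size are illustrative values). Quantising against the reconstructed value rather than the original keeps the encoder and decoder in step, so quantisation errors do not accumulate.

```python
import math

def dpcm_encode(samples, step, bits):
    """Closed-loop DPCM: quantise the difference between each sample and the
    previous reconstructed sample, using signed codes limited to `bits` bits."""
    lo, hi = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    codes, prediction = [], 0.0
    for x in samples:
        q = max(lo, min(hi, round((x - prediction) / step)))
        codes.append(q)
        prediction += q * step  # track exactly what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Accumulate the decoded differences to rebuild the signal."""
    out, value = [], 0.0
    for q in codes:
        value += q * step
        out.append(value)
    return out

FS = 8000
tone = [0.5 * math.sin(2 * math.pi * 100 * n / FS) for n in range(400)]
codes = dpcm_encode(tone, step=0.01, bits=4)
decoded = dpcm_decode(codes, step=0.01)
max_error = max(abs(x - y) for x, y in zip(tone, decoded))
print(max_error)  # within half a step, since no difference exceeds the code range
```

With only 4-bit codes, this slowly varying tone is reproduced to within half a step, because successive samples never differ by more than the codes can represent.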
Adaptive DPCM
• Use a small number of bits to encode the differences in DPCM
• Adjust the quantisation step size to accommodate large changes in the signal
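A sketch of the step-size adaptation, assuming a Jayant-style rule: multiply the step by 1.5 after an outermost code and by 0.9 after a code of magnitude 0 or 1. The multipliers, the 3-bit code size and the function names are illustrative assumptions, not the rule of any standard ADPCM coder such as ITU-T G.726. Because the rule depends only on the transmitted codes, the decoder can track the step size without any side information.

```python
BITS = 3                                   # bits per difference code (assumed)
LO, HI = -2 ** (BITS - 1), 2 ** (BITS - 1) - 1
GROW, SHRINK = 1.5, 0.9                    # assumed step multipliers
MIN_STEP, MAX_STEP = 1e-4, 1.0

def adapt_step(step, q, adaptive=True):
    """Widen the step after an outermost code, narrow it after a small one.
    Uses only the transmitted code q, so the decoder can repeat it exactly."""
    if not adaptive:
        return step
    if q in (LO, HI):
        return min(step * GROW, MAX_STEP)
    if abs(q) <= 1:
        return max(step * SHRINK, MIN_STEP)
    return step

def adpcm_encode(samples, step=0.01, adaptive=True):
    codes, recon, prediction = [], [], 0.0
    for x in samples:
        q = max(LO, min(HI, round((x - prediction) / step)))
        codes.append(q)
        prediction += q * step             # what the decoder will reconstruct
        recon.append(prediction)
        step = adapt_step(step, q, adaptive)
    return codes, recon

def adpcm_decode(codes, step=0.01, adaptive=True):
    out, value = [], 0.0
    for q in codes:
        value += q * step
        out.append(value)
        step = adapt_step(step, q, adaptive)
    return out

jump = [0.0] * 5 + [0.8] * 45              # a sudden large change in the signal
codes, recon_adaptive = adpcm_encode(jump)
_, recon_fixed = adpcm_encode(jump, adaptive=False)
print(recon_adaptive[14], recon_fixed[14])
```

On the jump from 0 to 0.8, the adaptive coder catches up within about eight samples, whereas a fixed 0.01 step with 3-bit codes can climb by at most 0.03 per sample.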
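Delta Modulation
• 1-bit DPCM; with an adaptive step size it is 1-bit ADPCM
• A sequence of ‘all 1s’ or ‘all 0s’ indicates the need to change the step size
[Figure: delta-modulated waveform with bit sequence 1 1 1 0 0 0 0 0; ‘slope overload’ is indicated by excessive use of 1s or 0s]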
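A sketch of adaptive delta modulation that follows the rule on this slide: a run of identical bits signals slope overload and enlarges the step. The run length of 3, the multipliers 1.5 and 0.7, and shrinking the step when bits alternate are assumptions chosen for illustration.

```python
STEP0, RUN, GROW, SHRINK, MIN_STEP = 0.01, 3, 1.5, 0.7, 1e-3  # assumed values

def next_step(step, bits, adaptive=True):
    """Step-size rule that uses only transmitted bits, so the decoder can copy it."""
    if not adaptive:
        return step
    if len(bits) >= RUN and len(set(bits[-RUN:])) == 1:
        return step * GROW                    # run of 1s or 0s: slope overload
    if len(bits) >= 2 and bits[-1] != bits[-2]:
        return max(step * SHRINK, MIN_STEP)   # alternating bits: signal is being tracked
    return step

def dm_encode(samples, adaptive=True):
    bits, recon = [], []
    value, step = 0.0, STEP0
    for x in samples:
        bits.append(1 if x >= value else 0)   # one bit per sample: step up or down
        step = next_step(step, bits, adaptive)
        value += step if bits[-1] else -step
        recon.append(value)
    return bits, recon

def dm_decode(bits, adaptive=True):
    out, value, step = [], 0.0, STEP0
    for i, b in enumerate(bits):
        step = next_step(step, bits[: i + 1], adaptive)
        value += step if b else -step
        out.append(value)
    return out

ramp = [0.05 * n for n in range(20)]          # rises faster than STEP0 per sample
bits, adm = dm_encode(ramp)
fixed_bits, fixed = dm_encode(ramp, adaptive=False)
print(adm[-1], fixed[-1], ramp[-1])
```

On this steep ramp the fixed-step coder emits nothing but 1s and falls ever further behind (slope overload), while the adaptive coder enlarges its step and stays close to the signal.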
Waveform coding summarised
• PCM, with 8 bits per sample (amplitude compression) and an 8 kHz sampling rate, gives a bit rate of 8 × 8,000 = 64 kbps
• DPCM (differential PCM): the difference between successive samples needs fewer bits for the same accuracy
• Adaptive DPCM: the quantiser step size is varied depending on the dynamic range of the signal
• Delta modulation = 1-bit DPCM
  • can adapt its step size to avoid slope overload
  • gives reasonable intelligibility at just 16 kbps
Vocoders
• Coders designed specifically for speech
• Sometimes called analysis-synthesis coders
• Exploit the source-filter model of speech
Vocoders
• Encoding
  • Estimate and encode the source
  • Estimate and encode the vocal tract filter
  • Store as a feature vector
• Transmission
  • Transmit at a low data rate (~50-100 vectors per second)
  • Possible because the vocal tract moves relatively slowly
• Decoding
  • Recover the source information
  • Recover the vocal tract filter information
  • Convert into synthesiser control parameters
  • Synthesise speech
Example: Channel Vocoder
• 19 band-pass filters, spanning 0-4 kHz
  • centre frequencies arranged non-linearly on the frequency axis
  • bandwidths increase with frequency, like the ear’s critical bands
• Energies from the filter outputs averaged over 20 ms
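The slide does not say exactly how the 19 centre frequencies are spaced. A minimal sketch, assuming equal spacing on the mel scale (one common approximation to the ear's frequency resolution; a critical-band or Bark scale would be an equally reasonable choice), with each band reaching from its lower neighbour's centre to its upper neighbour's centre so that bandwidth grows with frequency. It also assumes an 8 kHz sampling rate, so the 20 ms averaging window is 160 samples. The helper names are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Mel scale (assumed here as the non-linear frequency axis)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def channel_vocoder_bands(n_bands=19, f_max_hz=4000.0):
    """Return (lower edge, centre, upper edge) in Hz for n_bands overlapping
    bands whose centres are equally spaced on the mel scale over 0..f_max_hz."""
    mel_max = hz_to_mel(f_max_hz)
    # n_bands + 2 equally spaced mel points; the two outer points are edges only
    edges = [mel_to_hz(mel_max * i / (n_bands + 1)) for i in range(n_bands + 2)]
    return [(edges[i - 1], edges[i], edges[i + 1]) for i in range(1, n_bands + 1)]

FS = 8000
FRAME = round(0.020 * FS)  # 20 ms averaging window in samples

bands = channel_vocoder_bands()
for lower, centre, upper in bands:
    print(f"centre {centre:7.1f} Hz   bandwidth {upper - lower:6.1f} Hz")
```

Printing the table shows the centres crowding together at low frequencies and the bandwidths widening towards 4 kHz, which is the behaviour the slide describes.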
Example: Channel Vocoder
• Spectrum shape (filter-bank energies) coded by DPCM
• Combined with a binary ‘voiced/unvoiced’ flag, plus an estimate of the fundamental frequency f0 if ‘voiced’
• 1 ‘frame’ of data (48 bits) transmitted 50 times per second: 48 × 50 = 2,400 bps
Example: Channel Vocoder
• Voiced/unvoiced flag plus f0 used to select the source
• Spectrum shape decoded and used to configure the filterbank
Example: Channel Vocoder
[Figure: block diagrams of the channel vocoder analyser and synthesiser]
Linear Predictive Coding (LPC)
• Basic idea
  • Assume that the value of the speech signal at time t can be written as a weighted sum of its values at times t-1, t-2, …, t-N:
    s(t) ≈ a1·s(t-1) + a2·s(t-2) + … + aN·s(t-N)
  • This is Nth order Linear Predictive Coding (LPC)
• The coefficients a1, …, aN can be thought of as the parameters of a digital filter (lecture 3)
  • They define the vocal tract filter at time t
• Used in the LPC vocoder
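[Figure: Finite Impulse Response (FIR) digital filter. The input x(n) passes through a chain of unit delays (z^-1); the delayed samples are weighted by a1, a2, …, aN and summed to give the output y(n)]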
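A minimal sketch of estimating the LPC coefficients a1, …, aN from a block of samples, assuming the autocorrelation method solved with the Levinson-Durbin recursion (a standard choice; the slides do not specify how the coefficients are found). It also computes the prediction error e(t) = s(t) - [a1·s(t-1) + … + aN·s(t-N)], which is the output of an FIR filter with taps 1, -a1, …, -aN; in RELP coding this residual supplies the excitation. The test signal is a synthetic second-order all-pole process with known coefficients 1.3 and -0.6, so the estimates can be checked.

```python
import random

def autocorrelation(x, max_lag):
    """r[lag] = sum_t x[t] * x[t - lag] for lag = 0..max_lag."""
    return [sum(x[t] * x[t - lag] for t in range(lag, len(x))) for lag in range(max_lag + 1)]

def lpc(x, order):
    """Levinson-Durbin recursion (autocorrelation method).
    Returns (a, err) with a[k - 1] = a_k in x[t] ~ sum_k a_k * x[t - k]."""
    r = autocorrelation(x, order)
    a, err = [], r[0]
    for i in range(order):
        # reflection coefficient for the order-(i + 1) predictor
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        a = [a[j] - k * a[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k  # remaining prediction-error energy
    return a, err

def residual(x, a):
    """Prediction error e[t] = x[t] - sum_k a_k * x[t - k] (samples before t = 0 taken as 0)."""
    return [x[t] - sum(ak * x[t - k] for k, ak in enumerate(a, start=1) if t - k >= 0)
            for t in range(len(x))]

# Synthetic test signal: x[t] = 1.3 x[t-1] - 0.6 x[t-2] + white noise
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(4000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
x = x[2:]
a, err = lpc(x, order=2)
e = residual(x, a)
print(a)
```

The estimated coefficients come out close to 1.3 and -0.6, and the residual has much less energy than the signal, which is why it can be coded more cheaply than the waveform itself.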
LPC Vocoders
• The quality of LPC vocoded speech depends critically on the quality of the excitation signal
• Two particular forms of LPC are used for speech coding in GSM mobile phones:
  • RELP: Residual Excited LPC
  • CELP: Codebook Excited LPC
Example: CELP Vocoders
• Vocal tract filter:
  • LPC analysis conducted over a short (~20 ms) section of speech to give the LPC coefficients
• Source:
  • Excitation source estimated over the window
  • Compared with a finite set of ‘reference’ excitation signals e1, …, eC
  • Code for the most similar reference transmitted
  • The set of references is called a codebook
  • Hence Codebook Excited LPC
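A sketch of the codebook search, assuming analysis-by-synthesis: each candidate excitation is passed through the all-pole LPC synthesis filter, and the one whose optimally scaled output best matches the target speech (least squared error) is chosen. Practical CELP coders also use perceptual weighting, carry filter memory across sub-frames and use structured codebooks; all of that is omitted here. The 64-entry random Gaussian codebook, the 40-sample sub-frame and the filter coefficients are hypothetical values.

```python
import random

def synthesise(excitation, a):
    """All-pole LPC synthesis filter: y[n] = e[n] + sum_k a_k * y[n - k]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y

def search_codebook(target, codebook, a):
    """Return (index, gain, squared error) of the best-matching codebook entry."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesise(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # least-squares gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best

rng = random.Random(1)
SUBFRAME = 40                      # hypothetical sub-frame length in samples
codebook = [[rng.gauss(0.0, 1.0) for _ in range(SUBFRAME)] for _ in range(64)]
a = [1.3, -0.6]                    # hypothetical LPC coefficients
target = synthesise([2.0 * v for v in codebook[5]], a)
index, gain, err = search_codebook(target, codebook, a)
print(index, gain)  # 5 2.0
```

For the excitation, only the index (6 bits for a 64-entry codebook) and a quantised gain need to be sent, alongside the LPC coefficients.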
Formant Vocoder
• A formant vocoder exploits the importance of F1, F2 and F3 for speech perception
• Formant frequencies, amplitudes and bandwidths are estimated and used to model the vocal tract filter
• Transmitted, with V/UV and f0 information, at 50-100 frames per second
• Speech decoded using a formant synthesiser
• Using 5-6 bits for each of the 10 control parameters gives a bit rate of 2.5-6 kbps (10 × 5 × 50 = 2,500 bps up to 10 × 6 × 100 = 6,000 bps)
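The vocoder bit rates in these slides are all bits per frame times frames per second. A small check of the arithmetic using only numbers from the slides (`bit_rate_bps` is an illustrative helper):

```python
def bit_rate_bps(bits_per_frame, frames_per_second):
    """Bit rate of a frame-based vocoder."""
    return bits_per_frame * frames_per_second

# Formant vocoder: 10 control parameters at 5-6 bits each, 50-100 frames per second
low = bit_rate_bps(10 * 5, 50)
high = bit_rate_bps(10 * 6, 100)
# Channel vocoder: one 48-bit frame, 50 times per second
channel = bit_rate_bps(48, 50)
print(low, high, channel)  # 2500 6000 2400
```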
Recognition-Synthesis Coder
[Figure: Transmitter: input speech (“recce report…”) → speech recognizer → phone-level transcription (r E k i r @ p O t …), sent at 50 bps. Receiver: phone-level transcription → speech synthesiser → output speech (“recce report…”)]
Recognition-synthesis coders
• New technology: still in research labs
• Very low data rates:
  • The sounds of English (~46 phonemes) can be specified using 6 bits (2^6 = 64 ≥ 46)
  • Talking at 8 phonemes per second, the linguistic content can be encoded in just 6 × 8 = 48 ≈ 50 bps!
• Computationally complex
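The 6-bit and 50 bps figures follow from a fixed-length code: ⌈log2 46⌉ = 6 bits per phoneme, sent at 8 phonemes per second. A quick check, assuming every phoneme gets the same code length (`bits_per_symbol` is an illustrative helper):

```python
import math

def bits_per_symbol(inventory_size):
    """Smallest fixed-length binary code that can label every symbol."""
    return math.ceil(math.log2(inventory_size))

bits = bits_per_symbol(46)     # ~46 phonemes of English
rate = bits * 8                # 8 phonemes per second
print(bits, rate)              # 6 48
print(round(64000 / rate))     # 1333: how many times lower than 64 kbps PCM
```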
Use of ‘knowledge’
• Bit rates reduced by exploiting properties of the speech signal:
  • waveform coders: limited bandwidth
  • vocoders: signal contains resonances
  • recognition-synthesis: signal is speech
• Highest-level models give lowest bit rates
• Paralinguistic properties of the speech are sacrificed:
  • speaker’s identity
  • state of health
  • emotional/psychological state
Summary of coding
• Waveform coders
  • PCM, DPCM, ADPCM
  • Delta modulation
• Vocoders
  • Channel vocoder, RELP, CELP
  • Segment vocoder
• Recognition-synthesis coders
• Trade-offs