Speech Coding EE 516 Spring 2009

Speech Coding EE 516 Spring 2009 Alex Acero

Acknowledgments • Thanks to Allen Gersho for some slides…

Outline • Quality vs Bit rate • Types of speech coders • Waveform Coding • Speech production and vocoders • Analysis by Synthesis • VoIP

Voice Quality • Bandwidth is easily quantified • Voice quality is subjective • MOS, Mean Opinion Score • ITU-T Recommendation P.800 • Excellent – 5 • Good – 4 • Fair – 3 • Poor – 2 • Bad – 1 • A minimum of 30 people • Listeners rate voice samples or take part in conversations

Voice Quality • P.800 recommendation • The selection of participants • The test environment • Explanations to listeners • Analysis of results • Toll quality • A MOS of 4.0 or higher
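As a small illustration of how a MOS is formed and compared with the toll-quality threshold, the sketch below averages one panel's ratings. The function name and the 30 ratings are invented for illustration only; P.800 itself covers much more (listener selection, test environment, analysis of results).

```python
def mean_opinion_score(ratings):
    """Average listener ratings on the 5 (Excellent) to 1 (Bad) scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


# Hypothetical panel of 30 listeners (illustrative numbers, not measured data).
panel = [5] * 8 + [4] * 15 + [3] * 6 + [2] * 1
mos = mean_opinion_score(panel)

# Toll quality is taken here as a MOS of 4.0 or higher, as stated on the slide.
is_toll_quality = mos >= 4.0
```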

Quality Measurements • Subjective and objective quality-testing techniques • PSQM – Perceptual Speech Quality Measurement • ITU-T P.861 • Aims to faithfully represent human judgement and perception • Algorithmic comparison between the output signal and a known input • Takes into account type of speaker, loudness, delay, active/silence frames, clipping, environmental noise

Evolution of Speech Coder Performance (Eurospeech 2003) • [Figure: speech quality (Bad, Poor, Fair, Good, Excellent) versus bit rate (kb/s), with 1980, 1990, and 2000 profiles; annotations: ITU Recommendations, Cellular Standards, Secure Telephony, North American TDMA]

Speech Coding (Telephony) • More complicated than Moore’s Law • Many Dimensions: Bit Rate, Quality, Complexity and Delay • Quality ceiling (imposed by telephone standards) • Easy to reach the ceiling at high bit rates (≥ 8 kb/s) • More room for progress at low bit rates (≤ 8 kb/s) • Moore’s Law Time Constant • Bit rates halve every decade (≤ 8 kb/s) • Relatively slow by Moore’s Law standards (not hyper-inflation) • Performance doubles every decade • Like disk seek or money in the bank (normal inflation) • Limited more by physics than investment • Potential compression opportunity • At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?) → 50 bits per sec (??) • Speech (2 kb/s) >> text (2 bits/char): 10-1000 times more bits • Speech coding will not close this gap for the foreseeable future

Speech Coding (Telephony) • More complicated than Moore’s Law • Many Dimensions: Bit Rate, Quality, Complexity and Delay • Quality ceiling (imposed by telephone standards) • Easy to reach the ceiling at high bit rates (≥ 8 kb/s) • More room for progress at low bit rates (≤ 8 kb/s) • Moore’s Law Time Constant • Bit rates halve every decade (≤ 8 kb/s) • Relatively slow by Moore’s Law standards (not hyper-inflation) • Performance doubles every decade • Like disk seek or money in the bank (normal inflation) • Limited more by physics than investment • Potential compression opportunity • At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?) • Speech (2 kb/s) >> text (2 bits/char): 100-1000 times more bits • Speech coding will not close this gap for the foreseeable future

Types of Speech Coders • Waveform codecs • Sample and code • High-quality and not complex • Large amount of bandwidth • Source codecs (vocoders) • Match the incoming signal to a math model • Linear-predictive filter model of the vocal tract • A voiced/unvoiced flag for the excitation • The model parameters are sent rather than the signal • Low bit rates, but sounds synthetic • Higher bit rates do not improve much

Types of Speech Coders • Hybrid codecs • Attempt to provide the best of both • Perform a degree of waveform matching • Utilize the sound production model • Quite good quality at low bit rate

Waveform coders • High quality, high bitrate • Pulse Code Modulation (PCM) • Sample input waveform • Quantization • Differential PCM • Sample input waveform • Encode difference between adjacent samples • Adaptive DPCM • Adapt step size for quantization based on speech statistics

Voice Sampling • A-to-D • Take discrete samples of the waveform and represent each sample by some number of bits • A signal can be reconstructed if it is sampled at a minimum of twice its maximum frequency (Nyquist Theorem) • Human speech • 300-3800 Hz • 8000 samples per second • Each sample is encoded into an 8-bit PCM code word (e.g. 01100101) => 8000 x 8 bit/s = 64 kbit/s

Quantization • How many bits are used to represent each sample • Quantization noise • The difference between the actual level of the input analog signal and its quantized value • More bits reduce it • Diminishing returns • Uniform quantization levels • Louder talkers sound better
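The two claims on this slide (more bits reduce the noise, and louder talkers fare better under uniform quantization) can be checked numerically. The sketch below is only an illustration: the mid-rise quantizer, the 440 Hz test tone, and the amplitudes 0.9 and 0.05 are assumptions chosen for the demo, not part of any standard.

```python
import math


def quantize_uniform(x, bits, full_scale=1.0):
    """Mid-rise uniform quantizer covering [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2 * full_scale / levels
    index = math.floor(x / step)
    # Clamp to the available levels (overload region).
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def snr_db(signal, bits):
    """Signal-to-quantization-noise ratio in dB for a list of samples."""
    noise = [s - quantize_uniform(s, bits) for s in signal]
    signal_power = sum(s * s for s in signal)
    noise_power = sum(n * n for n in noise)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone sampled at 8 kHz (assumed test signal).
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
loud = [0.9 * s for s in tone]    # "loud talker"
quiet = [0.05 * s for s in tone]  # "quiet talker"

for name, sig in (("loud", loud), ("quiet", quiet)):
    print(name, [round(snr_db(sig, b), 1) for b in (8, 12)])
```

Each extra bit buys roughly 6 dB, and the quiet signal sits well below the loud one at the same word length, which is the motivation for non-uniform quantization.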

Non-uniform quantization • With uniform steps, the % quantization error is larger for smaller values of x(t) • Goal: choose quantization levels that give a smaller % error at small signal values, similar to the % error at large ones • This process is called “companding” at the source encoding end and “decompanding” at the decoding (D/A) end • The net effect is to make the sum of the quantization errors smaller and more uniform percentage-wise • Logarithmic scaling (A-law in Europe and µ-law in the US)

Non-uniform quantization • Smaller quantization steps at smaller signal levels • Spread signal-to-noise ratio more evenly
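A minimal sketch of logarithmic companding, assuming the continuous µ-law curve with µ = 255. This is the textbook formula, not the segmented 8-bit encoding that G.711 actually specifies, and the quantizer and function names are illustrative.

```python
import math

MU = 255.0  # assumed compression parameter


def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def quantize_uniform(x, bits):
    """Mid-rise uniform quantizer on [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = max(-levels // 2, min(levels // 2 - 1, math.floor(x / step)))
    return (index + 0.5) * step


def quantize_companded(x, bits):
    """Compress, quantize uniformly, then expand."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))


def relative_error(x, xq):
    return abs(x - xq) / abs(x)


small = 0.01  # a low-level sample
err_uniform = relative_error(small, quantize_uniform(small, 8))
err_companded = relative_error(small, quantize_companded(small, 8))
```

With 8 bits, the companded quantizer represents the small sample with a much lower percentage error than the uniform one, at the cost of coarser steps near full scale.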

G.711 • The most commonplace codec • Used in circuit-switched telephone network • PCM, Pulse-Code Modulation • If uniform quantization • 12 bits * 8 k/sec = 96 kbps • Non-uniform quantization • 64 kbps DS0 rate • mu-law • North America • A-law • Other countries, a little friendlier to lower signal levels • An MOS of about 4.3

DPCM • DPCM, Differential PCM • Only transmit the difference between the predicted value and the actual value • Voice changes relatively slowly • It is possible to predict the value of a sample based on the values of previous samples • The receiver performs the same prediction • The simplest form • No prediction • No algorithmic delay
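A minimal DPCM sketch, assuming the simplest predictor (the previous reconstructed sample) and a fixed quantizer step. The function names and the 200 Hz test tone are illustrative. The encoder tracks the decoder's reconstruction, so both sides run the same prediction and the error does not accumulate.

```python
import math


def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the prediction."""
    codes = []
    prediction = 0.0
    for s in samples:
        code = round((s - prediction) / step)
        codes.append(code)
        # Update with the quantized difference, exactly as the decoder will.
        prediction += code * step
    return codes


def dpcm_decode(codes, step):
    """Rebuild the waveform by accumulating the quantized differences."""
    output = []
    value = 0.0
    for code in codes:
        value += code * step
        output.append(value)
    return output


# Slowly varying test signal: 200 Hz tone at 8 kHz sampling (assumed).
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
step = 0.01
codes = dpcm_encode(tone, step)
decoded = dpcm_decode(codes, step)
```

Here the samples themselves span about ±100 steps, while the transmitted differences stay within a few tens of steps, which is where the bit saving comes from. A real coder would also limit the code range to a fixed number of bits.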

ADPCM (Adaptive DPCM) • Predicts sample values based on • Past samples • Factoring in some knowledge of how speech varies over time • The error is quantized and transmitted • Fewer bits required • G.721 • 32 kbps • G.726 • A-law/mu-law PCM -> 16, 24, 32, 40 kbps • An MOS of about 4.0 at 32 kbps
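To illustrate the "adaptive" part, the sketch below adds a simple step-size rule to a first-order DPCM loop: the step grows when codes sit near the limit and shrinks when they stay near zero. This is a toy scheme under assumed constants (1.5, 0.9, 4-bit codes); it is not the G.721 or G.726 algorithm.

```python
import math


def _adapt(step, code, qmax):
    """Toy adaptation rule: expand on large codes, shrink on small ones."""
    if abs(code) >= qmax - 1:
        step *= 1.5
    elif abs(code) <= 1:
        step *= 0.9
    return min(max(step, 1e-4), 1.0)


def adpcm_encode(samples, bits=4, step=0.02):
    qmax = 2 ** (bits - 1) - 1
    codes = []
    prediction = 0.0
    for s in samples:
        code = round((s - prediction) / step)
        code = max(-qmax, min(qmax, code))  # keep within the code range
        codes.append(code)
        prediction += code * step
        step = _adapt(step, code, qmax)     # decoder repeats this update
    return codes


def adpcm_decode(codes, bits=4, step=0.02):
    qmax = 2 ** (bits - 1) - 1
    output = []
    value = 0.0
    for code in codes:
        value += code * step
        output.append(value)
        step = _adapt(step, code, qmax)
    return output


# Assumed test signal: 200 Hz tone, amplitude 0.5, sampled at 8 kHz.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = adpcm_encode(tone)
decoded = adpcm_decode(codes)
```

Only the codes are transmitted; the step size is never sent because the decoder derives it from the same code history.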

Subjective quality metrics for speech

Common Waveform Coders

Information rate of speech • Phonetic content at a rate of about 72 bits/second: • 6 bits sufficient for 40-50 different phonemes • Average speaking rate is about 12 phonemes/second • This neglects: • Intonation (no pitch transmitted) • Emotion • Individual characterization of speech (the ability to recognize the speaker) • Phone durations are different

Redundancies in speech • Our sampling frequency Fs is much higher than the rate of change of the vocal tract (with the exception of closures) • F0 (or perceived pitch) changes slowly compared to the windowing rate • Adjacent windows correlate rather well • Spectral waveform changes slowly and most of the energy is at the low end of frequencies so it changes even more slowly there (important part of speech) • It is possible to model phones as periodic/noisy filtered excitation and still obtain reasonable quality • Speech parameters may be weighted since they occur nonuniformly (different probabilities) • The ear is insensitive to phase so it can be discarded

Average power spectrum of speech Notice that the frequency scale is logarithmic in this figure. Speech has in general higher power at the lower frequencies for sonorants and less power above 3.3kHz, as shown here.

Human Speech Production System • Air flow forced from lungs to vocal tract • short-term correlations • Filter with resonances (called formants) • Speech sound classes • Voiced sounds • Vocal cord vibration • Long-term periodicity • Unvoiced sounds • Constriction in the vocal tract • No long-term periodicity • Plosive sounds • Release of air pressure behind mouth

A Little About Speech • Speech • Air pushed from the lungs past the vocal cords and along the vocal tract • The basic vibrations – vocal cords • The sound is altered by the disposition of the vocal tract (tongue and mouth) • Model the vocal tract as a filter • The shape changes relatively slowly • The vibrations at the vocal cords • The excitation signal

Voiced Speech • The vocal cords vibrate, opening and closing • Interrupt the air flow • Quasi-periodic pulses of air • The rate of the opening and closing – the pitch • A high degree of periodicity at the pitch period • 2-20 ms
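The periodicity at the pitch period can be detected with a plain autocorrelation search over lags corresponding to 2-20 ms. The sketch below is one simple estimator under stated assumptions (8 kHz sampling, a synthetic three-harmonic frame with a 100 Hz fundamental, no voicing decision); practical pitch trackers add normalization and smoothing.

```python
import math


def estimate_pitch_lag(frame, fs=8000, min_ms=2.0, max_ms=20.0):
    """Return the lag (in samples) with the largest autocorrelation."""
    min_lag = int(fs * min_ms / 1000)
    max_lag = min(int(fs * max_ms / 1000), len(frame) - 1)
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


fs = 8000
# Synthetic "voiced" frame: 100 Hz fundamental plus two harmonics, 50 ms long.
frame = [
    sum(math.sin(2 * math.pi * 100 * h * n / fs) / h for h in (1, 2, 3))
    for n in range(400)
]
lag = estimate_pitch_lag(frame, fs)
pitch_period_ms = 1000 * lag / fs
pitch_hz = fs / lag
```

For an unvoiced (noise-like) frame the autocorrelation has no dominant peak in this lag range, which is one way to separate the two classes.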

Voiced Speech • [Figure panels: voiced speech; power spectrum density]

Unvoiced Speech • Forcing air at high velocities through a constriction • The glottis is held open • Noise-like turbulence • Show little long-term periodicity • Short-term correlations still present

Unvoiced Speech • [Figure panels: unvoiced speech; power spectrum density]

Stops • Plosive sounds • A complete closure in the vocal tract • Air pressure is built up and released suddenly • A vast array of sounds • The speech signal is relatively predictable over time • The reduction of transmission bandwidth can be significant

Linear Predictive Coding (LPC) • Predict current sample as linear combination of past samples • An all-pole model: • Minimize squared error • Orthogonality principle • Solution
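The slide's equations did not survive in this transcript, so the following is a sketch of one standard route rather than necessarily the derivation shown in class. It assumes the predictor x̂[n] = a1·x[n-1] + ... + ap·x[n-p], the autocorrelation method for the squared-error minimization, and the Levinson-Durbin recursion to solve the resulting normal equations. Function names are illustrative.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n-k], for k = 0..max_lag."""
    return [
        sum(x[n] * x[n - k] for n in range(k, len(x)))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve sum_k a_k r[|i-k|] = r[i], i = 1..order.

    Returns (predictor coefficients a_1..a_order, residual energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error               # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k          # prediction error shrinks each order
    return a[1:], error


def lpc(frame, order):
    return levinson_durbin(autocorrelation(frame, order), order)


# Impulse response of a one-pole filter with pole at 0.9 (assumed test signal).
signal = [0.9 ** n for n in range(200)]
coeffs, residual_energy = lpc(signal, 2)
```

For this test signal the analysis recovers a first coefficient of about 0.9 and a second of about 0, matching the single pole used to generate it.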

Vocoders (source coders) • Linear prediction model for human voice system • Medium quality, low bitrate

Vector Quantization • Example • Key challenge • Given a source distribution, how to select codebook (*) and partitions (---) to result in smallest average distortion • Solution: • Divide and conquer • Two codes → four → eight …
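The "two codes → four → eight" idea can be sketched as codebook training by splitting: start from one centroid, split every codeword into a perturbed pair, and refine with nearest-neighbour/centroid iterations. The helper names, the additive perturbation, the fixed iteration count, and the 16 toy training vectors below are all assumptions for illustration.

```python
def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def nearest(codebook, v):
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i], v))


def distortion(data, codebook):
    """Average squared error when each vector maps to its nearest codeword."""
    return sum(dist2(v, codebook[nearest(codebook, v)]) for v in data) / len(data)


def lloyd(data, codebook, iterations=10):
    """Alternate partitioning the data and re-centering each codeword."""
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in data:
            cells[nearest(codebook, v)].append(v)
        codebook = [centroid(c) if c else old for c, old in zip(cells, codebook)]
    return codebook


def train_by_splitting(data, size, eps=0.01):
    """Grow the codebook 1 -> 2 -> 4 -> ... until it has `size` entries."""
    codebook = [centroid(data)]
    while len(codebook) < size:
        codebook = [
            tuple(x + s for x in c) for c in codebook for s in (eps, -eps)
        ]
        codebook = lloyd(data, codebook)
    return codebook


# Toy 2-D training set: four tight clusters (illustrative data).
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
offsets = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
data = [(cx + ox, cy + oy) for cx, cy in centers for ox, oy in offsets]

d1 = distortion(data, train_by_splitting(data, 1))
d2 = distortion(data, train_by_splitting(data, 2))
d4 = distortion(data, train_by_splitting(data, 4))
```

The average distortion drops at each doubling, but the procedure only reaches a local optimum: on this toy set the four codewords do not end up one per cluster.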

Analysis-by-Synthesis (AbS) Codecs • Hybrid method • Vocoder’s linear prediction model • Careful selection of excitation signal to reconstruct original waveform • High quality, low bitrate! • The most successful and commonly used • Time-domain AbS codecs • Not a simple two-state, voiced/unvoiced • Different excitation signals are attempted • Closest to the original waveform is selected • Types: • MPE, Multi-Pulse Excited • RPE, Regular-Pulse Excited • CELP, Code-Excited Linear Predictive

Linear-Prediction-based Analysis-by-Synthesis • How it works • Segment speech into frames (typically 20 ms long) • Find the filter parameters for each frame • Find the excitation that minimizes the prediction error • Perceptual weighting • More accuracy where speech energy is low • Transmit the filter parameters and excitation signal • Vector quantization

LPAS Classification • Three classes • Multi-Pulse Excited (MPE) • Regular-Pulse Excited (RPE) • Code-Excited Linear Predictive (CELP) • Difference lies in representation of excitation signal

Multi-Pulse Excited (MPE) • Excitation is given by a fixed number of pulses • Position and amplitude of the pulses are computed to minimize error and transmitted to decoder • Finding the best match is theoretically possible but not practical • Suboptimal estimations are given • Typically about 4 pulses per 5 ms are used
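One common suboptimal strategy is a greedy search: place one pulse at a time where it removes the most error, subtract its contribution, and repeat. The sketch below assumes that strategy, a one-pole synthesis filter, a 40-sample (5 ms at 8 kHz) frame, and no perceptual weighting; names and test values are illustrative.

```python
def impulse_response(a, length):
    """Impulse response of the all-pole synthesis filter 1/A(z)."""
    h = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * h[n - k]
        h.append(acc)
    return h


def energy(x):
    return sum(v * v for v in x)


def multipulse_search(target, a, num_pulses):
    """Greedy choice of pulse positions and amplitudes, one pulse at a time."""
    n = len(target)
    h = impulse_response(a, n)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for pos in range(n):
            # Filter response to a unit pulse at `pos`, truncated to the frame.
            corr = sum(residual[m] * h[m - pos] for m in range(pos, n))
            resp_energy = sum(h[m - pos] ** 2 for m in range(pos, n))
            reduction = corr * corr / resp_energy
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / resp_energy)
        _, pos, amp = best
        pulses.append((pos, amp))
        for m in range(pos, n):
            residual[m] -= amp * h[m - pos]
    return pulses, residual


frame_len = 40          # 5 ms at 8 kHz
lpc_coeffs = [0.8]      # assumed one-pole filter
h = impulse_response(lpc_coeffs, frame_len)
target = [0.0] * frame_len
for pos, amp in ((3, 1.0), (20, -0.5)):
    for m in range(pos, frame_len):
        target[m] += amp * h[m - pos]

pulses1, residual1 = multipulse_search(target, lpc_coeffs, 1)
pulses4, residual4 = multipulse_search(target, lpc_coeffs, 4)
```

Each added pulse can only lower the remaining error energy, but the greedy choice is not guaranteed to match the jointly optimal set of pulses.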

Regular-Pulse Excited (RPE) • Multiple pulses used like in MPE • Regularly spaced at a fixed period • Only needs to transmit the first pulse’s position and the amplitudes of all pulses • More pulses are allowed for better quality at the same bitrate • Around 10 pulses per 5 ms

Code-Excited Linear Predictive (CELP) • Excitation is given by • an entry from a large vector quantizer codebook • A gain term for its power (amplitude) • Key challenge • Searching for the right excitation entries in realtime • Solution: restructure the codebook optimized for searching (such as a tree) • Performance • 4.8kbps or lower bitrate with good quality
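The codebook search can be sketched as an exhaustive analysis-by-synthesis loop: pass every codebook entry through the synthesis filter, compute its best gain, and keep the entry with the smallest error. The version below is a simplified illustration with an assumed random Gaussian codebook, a one-pole filter, no perceptual weighting, and no pitch predictor; real coders structure the codebook to avoid this brute-force cost.

```python
import random


def synthesize(excitation, a):
    """Filter the excitation through 1/A(z), starting from zero state."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def celp_search(target, codebook, a):
    """Return (index, gain) minimizing ||target - gain * synth(code)||^2."""
    best_index, best_gain, best_score = -1, 0.0, -1.0
    for i, code in enumerate(codebook):
        y = synthesize(code, a)
        y_energy = dot(y, y)
        if y_energy == 0.0:
            continue
        corr = dot(target, y)
        # With the optimal gain, the error is ||target||^2 - corr^2 / energy,
        # so maximizing this score minimizes the error.
        score = corr * corr / y_energy
        if score > best_score:
            best_index, best_gain, best_score = i, corr / y_energy, score
    return best_index, best_gain


rng = random.Random(0)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]
lpc_coeffs = [0.9]  # assumed one-pole synthesis filter

# Build a target from a known entry and gain, then check the search finds it.
target = [0.7 * v for v in synthesize(codebook[17], lpc_coeffs)]
index, gain = celp_search(target, codebook, lpc_coeffs)
```

Only the index and the quantized gain (plus the filter parameters) need to be transmitted, since the decoder holds the same codebook.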

Further Improvements on CELP • Representation of pitch period • Adaptive Long-term prediction + short-term adjustment • Coding of LP filter • Vector quantization of filter representation • Multimode coding • Dynamic bit allocation between excitation, LP filter and pitch

G.728 LD-CELP • CELP codecs • A filter; its characteristics change over time • A codebook of acoustic vectors • A vector = a set of elements representing various characteristics of the excitation • Transmit • Filter coefficients, gain, a pointer to the vector chosen • Low Delay CELP • Backward-adaptive coder • Use previous samples to determine filter coefficients • Operates on five samples at a time • Delay < 1 ms • Only the pointer is transmitted

G.728 LD-CELP • 1024 vectors in the code book • 10-bit pointer (index) • 16 kbps • LD-CELP encoder • Minimize a frequency-weighted mean-square error

G.728 LD-CELP • MOS score of about 3.9 • One-quarter of G.711 bandwidth (16kbps) • 30 MIPS • 2 kilobytes of RAM is needed for codebooks • 50th order LPC filter. • Lower delays are obtained by making the excitation vectors very short (~5 samples or 0.625 ms)
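The G.728 figures quoted on these slides are consistent with each other, as the short calculation below shows. It assumes 8000 samples per second; the helper names are illustrative.

```python
import math


def codebook_index_bits(num_vectors):
    """Bits needed to point at one entry of the codebook."""
    return math.ceil(math.log2(num_vectors))


def bit_rate_bps(index_bits, samples_per_vector, fs=8000):
    """Bit rate when one index is sent per excitation vector."""
    return index_bits * fs / samples_per_vector


def vector_duration_ms(samples_per_vector, fs=8000):
    return 1000 * samples_per_vector / fs


bits = codebook_index_bits(1024)      # 10-bit pointer
rate = bit_rate_bps(bits, 5)          # 10 bits every 5 samples
duration = vector_duration_ms(5)      # length of one excitation vector
ratio_to_g711 = 64000 / rate          # G.711 runs at 64 kbps
```

Because the coder is backward-adaptive, the filter coefficients cost no bits, so the 10-bit index alone accounts for the whole 16 kbps.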
