Advances in Speech and Audio Compression Allen Gersho (1994) Presented by Bin Yu
Agenda
• Introduction
• Human vocal tract model
• Speech coder classification
• LPAS Speech Coding
• Standardization
Human Speech Production System
• Air flow forced from the lungs into the vocal tract
• The vocal tract acts as a filter with resonances (called formants)
  • Gives rise to short-term correlations in the speech signal
• Speech sound classes
  • Voiced sounds
    • Vocal cord vibration
    • Long-term periodicity
  • Unvoiced sounds
    • Constriction in the vocal tract
    • No long-term periodicity
  • Plosive sounds
    • Sudden release of air pressure built up behind a closure in the mouth
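The source-filter view on this slide can be sketched in a few lines of Python (the slides contain no code, so everything here is an illustrative assumption): an impulse train stands in for vocal cord vibration, white noise for an unvoiced constriction, and an all-pole filter with one resonance stands in for the vocal tract. The function names, the single 500 Hz formant, the 100 Hz pitch and the 8 kHz sampling rate are hypothetical choices; plosives are not modeled.

```python
import math
import random


def excitation(n, voiced, pitch_period=80, seed=0):
    """Source signal: impulse train for voiced sounds, white noise for unvoiced ones."""
    if voiced:
        # one pulse every pitch_period samples -> long-term periodicity
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    # no long-term periodicity: independent Gaussian samples
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def formant_filter(freq_hz, bandwidth_hz, fs):
    """Coefficients [a1, a2] of a 2-pole resonator, for A(z) = 1 - a1 z^-1 - a2 z^-2."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius controls the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle sets the resonance frequency
    return [2.0 * r * math.cos(theta), -r * r]


def vocal_tract(a, x):
    """All-pole filter 1/A(z): y[n] = x[n] + sum_k a[k-1] * y[n-k], zero initial state."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y


fs = 8000
a = formant_filter(500.0, 100.0, fs)  # one formant near 500 Hz (illustrative)
voiced = vocal_tract(a, excitation(800, voiced=True, pitch_period=80))  # 100 Hz pitch
unvoiced = vocal_tract(a, excitation(800, voiced=False))
```

The same filter is driven by two different sources, which is exactly the separation that linear-prediction coders exploit later in the deck.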
Speech Coder Classification
• Waveform coders
  • Pulse Code Modulation (PCM)
    • Sample the input waveform
    • Quantize each sample
  • Differential PCM (DPCM)
    • Sample the input waveform
    • Encode the difference between each sample and its prediction from earlier samples (in the simplest case, the previous sample)
  • Adaptive DPCM (ADPCM)
    • Adapt the quantizer step size to the local speech statistics
  • High quality, high bitrate
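A minimal Python sketch of DPCM and ADPCM, assuming the simplest predictor (the previous reconstructed sample) and a 3-bit uniform quantizer on the prediction error. The step-adaptation rule (grow by 1.5 on large codes, shrink by 0.9 otherwise) and all names are hypothetical illustrations, not the rule of any ITU standard. Setting `adaptive=False` gives plain DPCM with a fixed step.

```python
import math


def _next_step(step, code, half, adaptive):
    """Hypothetical ADPCM-style step adaptation; plain DPCM keeps the step fixed."""
    if not adaptive:
        return step
    # large codes suggest the step is too small, small codes that it is too large
    factor = 1.5 if abs(code) >= half - 1 else 0.9
    return min(1024.0, max(1.0, step * factor))


def dpcm_encode(samples, levels=8, step=16.0, adaptive=True):
    """Quantize the difference between each sample and the previous reconstruction."""
    half = levels // 2
    recon, codes, recons = 0.0, [], []
    for s in samples:
        diff = s - recon                                   # prediction error
        code = max(-half, min(half - 1, round(diff / step)))
        codes.append(code)
        recon += code * step                               # same update the decoder performs
        recons.append(recon)
        step = _next_step(step, code, half, adaptive)
    return codes, recons


def dpcm_decode(codes, levels=8, step=16.0, adaptive=True):
    """Mirror of the encoder: only the integer codes are needed."""
    half = levels // 2
    recon, out = 0.0, []
    for code in codes:
        recon += code * step
        out.append(recon)
        step = _next_step(step, code, half, adaptive)
    return out


signal = [100.0 * math.sin(2.0 * math.pi * n / 64) for n in range(256)]
codes, enc_recon = dpcm_encode(signal)
decoded = dpcm_decode(codes)
```

Because the encoder tracks the decoder's reconstruction rather than the original samples, the step-size adaptation needs no side information: both ends derive it from the transmitted codes.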
Speech Coder Classification (contd.)
• Vocoders (source coders)
  • Use a linear-prediction model of the human vocal system and transmit only the model parameters, not the waveform
  • Medium quality, low bitrate
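The linear-prediction model is usually fitted per frame with the autocorrelation method and the Levinson-Durbin recursion. The Python sketch below assumes the predictor convention ŝ[n] = Σ a_k s[n-k], so A(z) = 1 − Σ a_k z^-k; windowing and lag windowing, which practical coders apply, are left out.

```python
def autocorrelation(x, order):
    """r[k] = sum_n x[n] * x[n+k] for k = 0..order (no analysis window applied)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..p].

    Predictor: s_hat[n] = sum_k a[k] * s[n-k], i.e. A(z) = 1 - sum_k a[k] z^-k.
    Returns (a, final prediction-error energy).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:           # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k       # error energy shrinks at every stage
    return a[1:], err
```

For a frame `frame`, `levinson_durbin(autocorrelation(frame, 10), 10)` would return ten predictor coefficients; order 10 is a common choice at 8 kHz, but it is an assumption here, not a value from the slides.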
LPAS – Linear-Prediction-based Analysis-by-Synthesis
• Hybrid method
  • Vocoder's linear-prediction model
  • Excitation signal chosen carefully, by analysis-by-synthesis, so that the reconstructed waveform matches the original
• High quality, low bitrate!
LPAS – Linear-Prediction-based Analysis-by-Synthesis
• How it works
  • Segment speech into frames (typically 20 ms long)
  • Find the LP filter parameters for each frame
  • Find the excitation that minimizes the error between the original and the synthesized speech
  • Perceptual weighting
    • Weight the error more heavily where speech energy is low (spectral valleys between formants), where coding noise is most audible
  • Transmit the filter parameters and the excitation signal
    • Vector quantization
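The framing and perceptual-weighting steps can be sketched in Python as follows. The weighting filter W(z) = A(z)/A(z/γ) is a widely used form, assumed here rather than taken from the slides, and γ = 0.8 is an illustrative value. Filters start from zero state in every frame, which real coders avoid by carrying filter memory across frames.

```python
def split_frames(x, fs=8000, frame_ms=20):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    n = fs * frame_ms // 1000
    return [x[i:i + n] for i in range(0, len(x) - n + 1, n)]


def analysis_filter(a, x):
    """A(z) = 1 - sum_k a[k-1] z^-k applied as an FIR filter (zero initial state)."""
    return [x[n] - sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
            for n in range(len(x))]


def synthesis_filter(a, x):
    """1/A(z) as an all-pole recursion (zero initial state)."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0))
    return y


def perceptual_weight(x, a, gamma=0.8):
    """W(z) = A(z) / A(z/gamma): de-emphasizes error near formants, emphasizes it in valleys."""
    a_gamma = [ak * gamma ** k for k, ak in enumerate(a, start=1)]  # poles pulled inward
    return synthesis_filter(a_gamma, analysis_filter(a, x))


def weighted_error(original, synthesized, a, gamma=0.8):
    """Energy of the perceptually weighted difference, one possible LPAS search criterion."""
    diff = [s - y for s, y in zip(original, synthesized)]
    return sum(v * v for v in perceptual_weight(diff, a, gamma))
```

With γ = 1 the weighting reduces to W(z) = 1 and the criterion becomes plain squared error; smaller γ widens the bandwidth of the denominator poles and so shifts more of the permitted noise under the formants.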
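Vector Quantization
• Example [figure not reproduced: codevectors marked *, partition boundaries marked ---]
• Key challenge
  • Given a source distribution, how to select the codebook (*) and partitions (---) that give the smallest average distortion
• Solution: divide and conquer
  • Design a codebook with two codevectors, then grow it to four, eight, …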
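The "two, four, eight" growth can be illustrated with the codebook-splitting design commonly attributed to Linde, Buzo and Gray, paired with Lloyd iterations. This Python sketch assumes squared-error distortion, an additive perturbation `eps` for splitting, a fixed iteration count, and a target size that is a power of two; all names are hypothetical.

```python
def sq_dist(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))


def centroid(vectors):
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def lloyd(data, codebook, iters=20):
    """Alternate nearest-neighbour partitioning and centroid updates."""
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in data:
            j = min(range(len(codebook)), key=lambda i: sq_dist(v, codebook[i]))
            cells[j].append(v)
        # an empty cell keeps its previous codevector
        codebook = [centroid(cell) if cell else cw for cell, cw in zip(cells, codebook)]
    return codebook


def lbg(data, size, eps=0.01, iters=20):
    """Grow the codebook 1 -> 2 -> 4 -> 8 ... by splitting (size assumed a power of two)."""
    codebook = [centroid(data)]
    while len(codebook) < size:
        # split every codevector into two slightly perturbed copies, then refine
        codebook = [[x + s * eps for x in cw] for cw in codebook for s in (1.0, -1.0)]
        codebook = lloyd(data, codebook, iters)
    return codebook


# four tight 2-D clusters centred at x = 0, 10, 20, 30 (y = 0)
data = [[c + dx, dy] for c in (0.0, 10.0, 20.0, 30.0)
        for dx in (-0.5, 0.5) for dy in (-0.5, 0.5)]
codebook = lbg(data, 4)
```

Each split only has to solve a small local problem (refine two nearby codevectors), which is the divide-and-conquer idea on the slide; the result is a locally, not globally, optimal codebook.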
LPAS Classification
• Three classes
  • Multi-Pulse Excited (MPE)
  • Regular-Pulse Excited (RPE)
  • Code-Excited Linear Predictive (CELP)
• The classes differ in how the excitation signal is represented
Multi-Pulse Excited (MPE) Codec
• Excitation is a fixed number of pulses per frame
• Pulse positions and amplitudes are computed to minimize the error and transmitted to the decoder
• Jointly optimizing all positions and amplitudes is possible in theory but computationally impractical
  • Suboptimal searches (e.g., placing one pulse at a time) are used instead
• Typically about 4 pulses per 5 ms
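A suboptimal search of the one-pulse-at-a-time kind can be sketched in Python. Assumptions: the synthesis filter is represented by a truncated impulse response `h` with `h[0] != 0`, filter memory is ignored, and earlier pulses are never re-optimized; the function names are hypothetical.

```python
def pulse_response(h, m, n):
    """Zero-state synthesis-filter output for a unit pulse at position m, truncated to n samples."""
    return [h[i - m] if 0 <= i - m < len(h) else 0.0 for i in range(n)]


def mpe_search(target, h, num_pulses):
    """Greedy multi-pulse search: add one pulse at a time, never revisiting earlier pulses."""
    n = len(target)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for m in range(n):
            c = pulse_response(h, m, n)
            energy = sum(v * v for v in c)
            corr = sum(r * v for r, v in zip(residual, c))
            score = corr * corr / energy  # error reduction with the optimal amplitude
            if best is None or score > best[0]:
                best = (score, m, corr / energy, c)
        _, m, amp, c = best
        pulses.append((m, amp))
        residual = [r - amp * v for r, v in zip(residual, c)]  # remove this pulse's contribution
    return pulses, residual
```

At 8 kHz, 5 ms is 40 samples, so the slide's "about 4 pulses per 5 ms" would correspond to `num_pulses=4` on a 40-sample target.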
Regular-Pulse Excited (RPE)
• Multiple pulses, as in MPE, but regularly spaced at a fixed period
• Only the first pulse's position and the amplitudes of all pulses need to be transmitted
• Because positions cost very few bits, more pulses fit in the same bitrate, giving better quality
  • Around 10 pulses per 5 ms
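The saving in position information can be seen in a Python sketch of grid selection. As a simplification, assumed here, the grid offset is chosen as the one whose regularly spaced samples of a target excitation (for example the LP residual) carry the most energy, instead of a full analysis-by-synthesis optimization; amplitude quantization is omitted and the names are hypothetical. With 40 samples (5 ms at 8 kHz) and spacing 4, each grid holds 10 pulses.

```python
def rpe_encode(residual, spacing):
    """Choose the grid offset whose regularly spaced samples carry the most energy.

    Only the offset and the pulse amplitudes would be transmitted
    (amplitude quantization is omitted in this sketch).
    """
    best_offset, best_energy = 0, -1.0
    for offset in range(spacing):
        energy = sum(v * v for v in residual[offset::spacing])
        if energy > best_energy:
            best_offset, best_energy = offset, energy
    return best_offset, residual[best_offset::spacing]


def rpe_decode(offset, amplitudes, spacing, n):
    """Rebuild the excitation: pulses at offset, offset + spacing, ..., zeros elsewhere."""
    excitation = [0.0] * n
    for k, amp in enumerate(amplitudes):
        excitation[offset + k * spacing] = amp
    return excitation
```

With spacing 4 the offset needs only 2 bits per frame, whereas MPE must send every pulse position separately.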
Code-Excited Linear Predictive (CELP)
• Excitation is given by
  • an entry from a large vector-quantizer codebook
  • a gain term scaling its power (amplitude)
• Key challenge
  • Searching the codebook for the best excitation entry in real time
  • Solution: structure the codebook for fast search (e.g., as a tree)
• Performance
  • Good quality at 4.8 kbps or lower
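The search the slide calls the key challenge can be sketched in Python as an exhaustive analysis-by-synthesis loop: every codebook entry is filtered through 1/A(z), the optimal gain is computed in closed form, and the entry with the largest error reduction wins. The Gaussian codebook, its size, and the zero-state filtering are assumptions for illustration; this brute-force loop is precisely what structured codebooks are designed to speed up.

```python
import random


def synthesize(a, x):
    """Zero-state all-pole synthesis filter 1/A(z), A(z) = 1 - sum_k a[k-1] z^-k."""
    y = []
    for n, xn in enumerate(x):
        y.append(xn + sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0))
    return y


def make_codebook(size, n, seed=1):
    """Hypothetical stochastic codebook: Gaussian vectors shared by encoder and decoder."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(size)]


def celp_search(target, a, codebook):
    """Exhaustive analysis-by-synthesis search; returns (index, gain)."""
    best_score, best_index, best_gain = -1.0, 0, 0.0
    for index, code in enumerate(codebook):
        y = synthesize(a, code)                 # filter every candidate
        energy = sum(v * v for v in y)
        corr = sum(t * v for t, v in zip(target, y))
        score = corr * corr / energy            # error reduction with the optimal gain
        if score > best_score:
            best_score, best_index, best_gain = score, index, corr / energy
    return best_index, best_gain


a = [0.9, -0.4]                 # illustrative stable 2nd-order LP filter
codebook = make_codebook(64, 40)
target = [0.7 * v for v in synthesize(a, codebook[5])]
index, gain = celp_search(target, a, codebook)
```

Only the index (6 bits for 64 entries) and a quantized gain would be sent per subframe; the decoder regenerates the same excitation from its copy of the codebook.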
Further Improvements on CELP
• Representation of the pitch period
  • Adaptive long-term prediction plus short-term adjustment
• Coding of the LP filter
  • Vector quantization of the filter representation
• Multimode coding
  • Dynamic bit allocation among the excitation, LP filter, and pitch
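Long-term (pitch) prediction can be sketched in Python as an open-loop search for the lag L and gain b that minimize Σ (x[n] − b·x[n−L])² over the current frame. The lag range 20 to 147 samples (roughly 54 to 400 Hz at 8 kHz) is an assumed, illustrative choice, and the search runs on the encoder's signal rather than on the decoder's past excitation as an adaptive codebook would.

```python
import random


def pitch_search(signal, start, length, min_lag=20, max_lag=147):
    """Open-loop long-term predictor search for the frame signal[start:start+length].

    Finds lag L and gain b minimizing sum (x[n] - b * x[n - L])^2.
    """
    current = signal[start:start + length]
    best_score, best_lag, best_gain = 0.0, min_lag, 0.0
    for lag in range(min_lag, min(max_lag, start) + 1):
        past = signal[start - lag:start - lag + length]
        energy = sum(v * v for v in past)
        if energy == 0.0:
            continue
        corr = sum(c * p for c, p in zip(current, past))
        score = corr * corr / energy      # error reduction for this lag with the optimal gain
        if score > best_score:            # exact ties keep the smaller lag
            best_score, best_lag, best_gain = score, lag, corr / energy
    return best_lag, best_gain


rng = random.Random(3)
base = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # one 50-sample "pitch cycle"
signal = [base[i % 50] for i in range(300)]         # perfectly periodic test signal
lag, gain = pitch_search(signal, start=200, length=40)
```

On real speech the best lag is rarely exact, which is why the slide pairs long-term prediction with a further short-term adjustment.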
Standardization
• ITU G.711
  • PCM (A-law and μ-law)
• ITU G.722
  • Sub-band ADPCM
• ITU G.728
  • Low-Delay CELP
• GSM
  • RPE-LTP (full rate); Algebraic CELP (enhanced full rate)
• MPEG-4
  • MPEG-4 CELP
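G.711's μ-law companding gives small amplitudes finer quantization steps than large ones. The Python sketch below uses the continuous μ-law curve with μ = 255 followed by a uniform integer code in [−127, 127]; this is an assumed simplification, since G.711 itself specifies a segmented, piecewise-linear approximation and its own 8-bit code format, so the sketch is not bit-compatible with the standard.

```python
import math

MU = 255.0  # mu-law parameter


def mu_compress(x):
    """Continuous mu-law characteristic: x in [-1, 1] -> y in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(x, levels=127):
    """Compand, then quantize uniformly to an integer code in [-levels, levels]."""
    return round(mu_compress(max(-1.0, min(1.0, x))) * levels)


def dequantize(code, levels=127):
    """Map an integer code back to the linear amplitude domain."""
    return mu_expand(code / levels)
```

Uniform steps in the companded domain become small steps near zero and large steps near full scale after expansion, so the relative error stays roughly constant across loudness levels.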