Speech codecs and DCCP with TFRC VoIP mode

Magnus Westerlund (magnus.westerlund@ericsson.com)

Important Features of TFRC VoIP mode
• Minimum packet interval of 10 ms
• Packet rate is penalized:
  • X = X * S_true / (S_true + H)
  • H = 40 is the header size
  • S_true is the complete RTP packet size, i.e. RTP header + payload
• It is still TFRC: sending is delayed if sufficient bit-rate is not available.
• Slow start of 4 packets; the size limitation is not an issue for the discussed codecs.
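
The rate penalty above can be illustrated with a minimal sketch. It assumes that H and S_true are both counted in bytes and that X is a rate in bytes per second; the function name and the example numbers are hypothetical and not part of the slides.

```python
HEADER_BYTES = 40  # H from the slide; assumed here to be in bytes


def voip_mode_rate(x: float, s_true: int, header: int = HEADER_BYTES) -> float:
    """Scale the TFRC allowed rate X by S_true / (S_true + H)."""
    if s_true <= 0:
        raise ValueError("S_true must be positive")
    return x * s_true / (s_true + header)


# Hypothetical example: the same allowed rate X with two RTP packet sizes.
# The smaller packet keeps a smaller fraction of X.
small = voip_mode_rate(10000, 40)   # S_true equal to H: half of X remains
large = voip_mode_rate(10000, 160)  # S_true four times H: 80% of X remains
```

The smaller the RTP packet is relative to the 40 header bytes, the larger the share of the allowed rate that is taken away, which is what pushes small-packet senders toward aggregation.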

System overview
• Contributors to system delay are:
  • Sampling buffering
  • Encoding delay
  • Packetization delay
  • Transmission delay
  • Transport delay (Internet)
  • Receiver buffering delay
  • Decoding delay
  • Playout delay
• The sum of delays should be less than 200 ms for high-quality conversational VoIP, and less than 400 ms to be usable for conversational VoIP.
[Figure: Sender (MIC, Codec, Payload Packetization, DCCP), Internet, Receiver (DCCP, Jitter Buffer, Codec, Speaker)]
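
As a small sketch of the delay budget, the following sums per-stage delays and checks the total against the 200 ms and 400 ms limits from the slide. The per-stage values in the example are hypothetical placeholders, not measurements, and the function names are assumptions made for illustration.

```python
def system_delay_ms(contributions: dict) -> float:
    """Sum all per-stage delay contributions (milliseconds)."""
    return sum(contributions.values())


def classify_delay(total_ms: float) -> str:
    """Apply the 200 ms / 400 ms limits from the slide."""
    if total_ms < 200:
        return "high quality"
    if total_ms < 400:
        return "usable"
    return "not usable"


# Hypothetical per-stage values, for illustration only.
example = {
    "sampling buffering": 20,
    "encoding": 5,
    "packetization": 20,
    "transmission": 5,
    "transport": 80,
    "receiver buffering": 40,
    "decoding": 5,
    "playout": 5,
}
total = system_delay_ms(example)
verdict = classify_delay(total)
```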

Problems with TFRC style packet rate penalties
• Varying the packetization directly affects the system delay seen at the receiver.
• It requires a jitter buffer that is capable of handling the increased or decreased system delay.
• Frequent changes make it harder for adaptive buffers to estimate the jitter correctly.
• Buffer under-runs need to be handled with little impact on voice quality. Thus insertion of audio data or invocation of error concealment becomes necessary.

Speech and Audio Codecs with RTP Payload formats
• Narrowband codecs: G.711 (PCMA or PCMU), G.723, G.726, G.728, G.729, GSM, GSM-EFR, AMR, EVRC, SMV, QCELP, BroadVoice 16, iLBC
• Wideband codecs: AMR-WB, VMR-WB, BroadVoice 32, G.722
• Variable sampling rate: DVI4, VDVI, L8, L16, PCMA, PCMU

Codec and RTP payload properties
• Bit-rate of encoded content
• Sample or frame based
• Frame lengths in milliseconds: 2.5, 5, 10, 20, 30, etc.
• Basically all payload formats support aggregation; however, some have modes where it is restricted.

DTX and Comfort Noise
• DTX is Discontinuous Transmission
• A voice activity detector (VAD) detects whether there is active speech or not.
• When there is no active speech, different DTX procedures can be used:
  • No transmission at all
  • Comfort Noise (CN) using RFC 3389
  • Codec built-in CN, like the AMR SID (Silence Descriptor)
• The frequency of Comfort Noise packets varies but is usually some fraction of the normal packet rate.

Sample based codecs
• Speech bandwidth depends on the sampling rate.
• Sample based, and can usually handle any number of samples per packet.
• Usually no adaptivity other than packetization. Some can vary quantization, like G.726.
• Bit-rate depends on the sampling rate and sample quantization.
• Example: G.711 uses 8 bits per sample and 8 kHz sampling, resulting in a 64 kbps audio data rate.
• Comfort noise may be supported using RFC 3389.
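
The G.711 arithmetic above can be written out as a short sketch. The function names are hypothetical, and the packet durations used in the example are assumptions chosen for illustration.

```python
def sample_codec_bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Audio data rate of a sample-based codec, in bits per second."""
    return sample_rate_hz * bits_per_sample


def payload_bytes(sample_rate_hz: int, bits_per_sample: int, packet_ms: int) -> int:
    """Payload size in bytes for a packet carrying packet_ms of audio."""
    bits = sample_rate_hz * bits_per_sample * packet_ms // 1000
    return bits // 8


# G.711: 8 bits per sample at 8 kHz sampling.
g711_rate = sample_codec_bitrate_bps(8000, 8)  # 64000 bit/s, i.e. 64 kbps
g711_10ms = payload_bytes(8000, 8, 10)         # 80 bytes per 10 ms packet
g711_20ms = payload_bytes(8000, 8, 20)         # 160 bytes per 20 ms packet
```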

AMR
• 3GPP defined, mandatory speech codec in UMTS 3G networks
• Narrowband codec (8 kHz audio sampling rate)
• Frame-based with 20 ms frames
• Multi-rate: has 8 encoding modes with bit-rates between 12.2 and 4.75 kbps.
• Has comfort noise generation (SID) and DTX.
• The SID (Silence Descriptor) is sent in every 8th frame and is 5 bytes in size.
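
A rough sketch of what the SID figures above imply during silence, assuming one 5-byte SID per 8 frames of 20 ms and ignoring RTP payload and transport headers. The constant and function names are hypothetical.

```python
FRAME_MS = 20            # AMR frame length from the slide
SID_EVERY_N_FRAMES = 8   # one SID in every 8th frame
SID_BYTES = 5            # SID size from the slide


def sid_interval_ms() -> int:
    """Time between SID frames during silence."""
    return FRAME_MS * SID_EVERY_N_FRAMES


def sid_payload_bitrate_bps() -> float:
    """Codec payload bit-rate during silence, headers not included."""
    return SID_BYTES * 8 * 1000 / sid_interval_ms()


interval = sid_interval_ms()            # 160 ms between SID frames
silence_rate = sid_payload_bitrate_bps()  # 250.0 bit/s of codec payload
```

During silence the packet rate therefore drops to one eighth of the active-speech rate when one frame is sent per packet.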

EVRC and SMV
• 3GPP2 defined, required in CDMA networks
• Narrowband codecs (8 kHz audio sampling rate)
• Frame-based with 20 ms frames
• Encode at 3 (EVRC) or 4 (SMV) different rates, varying from 8.55 to 0.8 kbps depending on audio input. Packet sizes are thus highly variable.
• The average bit-rate depends on the codec mode; each mode selects among the encoding rates differently to provide a different average rate.
• Lack DTX and need to transmit all frames.
• One mode in the payload format requires a single frame per packet.

BroadVoice 16
• Broadcom defined codec, used in voice over cable
• Narrowband codec (8 kHz audio sampling rate)
• Frame-based with 5 ms frames, thus needing aggregation of at least 2 frames per packet for TFRC VoIP mode.
• No rate adaptation, fixed encoding at 16 kbps.
• No built-in comfort noise or DTX.

BroadVoice 32
• Broadcom defined codec, used in voice over cable
• Wideband codec (16 kHz audio sampling rate)
• Frame-based with 5 ms frames, thus needing aggregation of at least 2 frames per packet for TFRC VoIP mode.
• No rate adaptation, fixed encoding at 32 kbps.
• No built-in comfort noise or DTX.
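
The "at least 2 frames per packet" figure for the BroadVoice codecs follows from the 10 ms minimum packet interval of TFRC VoIP mode. A minimal sketch of that calculation, with a hypothetical function name:

```python
import math

MIN_PACKET_INTERVAL_MS = 10  # minimum packet interval in TFRC VoIP mode


def min_frames_per_packet(frame_ms: float,
                          min_interval_ms: float = MIN_PACKET_INTERVAL_MS) -> int:
    """Smallest number of frames per packet that respects the minimum interval."""
    return max(1, math.ceil(min_interval_ms / frame_ms))


broadvoice = min_frames_per_packet(5)    # 5 ms frames need 2 frames per packet
short_frame = min_frames_per_packet(2.5)  # 2.5 ms frames need 4 frames per packet
amr = min_frames_per_packet(20)          # 20 ms frames can be sent one per packet
```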

AMR-WB
• 3GPP specified codec, mandatory in UMTS 3G if wideband is supported
• Wideband codec (16 kHz audio sampling rate)
• Frame-based with 20 ms frames
• Multi-rate encoding at 9 different rates between 23.85 and 6.6 kbps
• Has built-in support for DTX and comfort noise (SID)
• The SID (Silence Descriptor) is sent every 8th frame and is 5 bytes in size

VMR-WB
• 3GPP2 defined
• Wideband codec (16 kHz audio sampling rate)
• Frame-based with 20 ms frames
• Encodes using 4 different rates (13.3-1.0 kbps)
• Has a compatibility mode with AMR-WB (12.6, 8.85, 6.60)
• Has a DTX mode

Summary of codecs

The effects of codec bit-rate adaptation
• Reduction of codec bit-rate always means lower quality
• The actual switching does affect user perceived quality:
  • Codec transition effects (varying)
  • The change in quality can be noticeable
  • Switching to a higher codec rate may not improve user experience.
• Flapping between modes can be more annoying than constant lower quality

Other codec developments
• Audio encoding, rather than speech:
  • Greater bit-rate span, 10-300 kbps
  • Variable frame-rate depending on codec mode (AMR-WB+), which is problematic in RTP
• Currently scalability is hot:
  • For audio, usually not speech
  • MPEG is doing something
  • European Union research project assuming arbitrary truncation of packets

Effects of packetization
• The AMR codec bit-rate adaptation has less impact on total bandwidth than the choice of packetization.
• Bandwidth is calculated using IP (20) + DCCP (12) + RTP (12) header bytes for each packet.
• This is not unexpected, considering that a speech frame including payload overhead is 13, 18 or 32 bytes.
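
The comparison above can be reproduced with a short calculation. This is a sketch that assumes 20 ms AMR frames (as on the AMR slide), takes the 13, 18 and 32 byte frame sizes and the 44 header bytes from this slide, and uses a hypothetical function name.

```python
HEADER_BYTES = 20 + 12 + 12  # IP + DCCP + RTP headers per packet
FRAME_MS = 20                # AMR frame length


def total_bitrate_bps(frame_bytes: int, frames_per_packet: int) -> float:
    """Total bit-rate on the wire, headers included."""
    packet_bytes = HEADER_BYTES + frame_bytes * frames_per_packet
    packet_interval_ms = frames_per_packet * FRAME_MS
    return packet_bytes * 8 * 1000 / packet_interval_ms


# Print the total rate for each frame size and packetization.
for frame_bytes in (13, 18, 32):
    for frames in (1, 2, 3):
        rate_kbps = total_bitrate_bps(frame_bytes, frames) / 1000
        print(f"{frame_bytes:2d} B/frame, {frames} frame(s)/packet: {rate_kbps:.1f} kbps")
```

Under these assumptions the largest frame size with 2 frames per packet (21.6 kbps) needs less total bandwidth than the smallest frame size with 1 frame per packet (22.8 kbps), which matches the point that packetization outweighs mode selection.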


Delay and Robustness Effects
• Although it seems tempting to use 3 frames per packet to save bandwidth, it costs considerable delay.
• For optimal quality, the quality reduction from lower bit-rate modes must be traded off against the expected system delay.
• For a system that already has a large delay: reduce the codec mode.
• For a system with small delays, changing packetization to use more frames per packet can be done without much quality cost.
• More frames per packet also reduces robustness.
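
The delay and robustness cost of aggregation can be sketched as follows, assuming 20 ms frames and that the sender must wait for the last frame of a packet before sending. The function names are hypothetical.

```python
FRAME_MS = 20  # assumed frame length, as for AMR


def extra_packetization_delay_ms(frames_per_packet: int,
                                 frame_ms: int = FRAME_MS) -> int:
    """Added delay for the first frame in a packet, relative to one frame per packet."""
    return (frames_per_packet - 1) * frame_ms


def speech_lost_per_packet_ms(frames_per_packet: int,
                              frame_ms: int = FRAME_MS) -> int:
    """Amount of speech lost when a single packet is dropped."""
    return frames_per_packet * frame_ms


delay_3 = extra_packetization_delay_ms(3)  # 40 ms of extra delay
loss_3 = speech_lost_per_packet_ms(3)      # 60 ms of speech lost per dropped packet
```

With 3 frames per packet, 40 ms of a 200 ms budget is spent on aggregation alone, and each lost packet removes 60 ms of speech instead of 20 ms.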

Questions for future studies
• How hard is it to maintain a periodic transmission with TFRC VoIP mode? Failing to do so introduces extra jitter, which requires more receiver buffering.
• What are the effects of DTX, as in the AMR case, where the packet rate drops to 1/8th of that during active speech?
