Philippe Gournay, Kyle D. Anderson, VoiceAge Corporation, 750, Chemin Lucerne

Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks
Philippe Gournay, Kyle D. Anderson
VoiceAge Corporation, 750, Chemin Lucerne, Montréal (Québec), Canada H3R 2H6
Philippe.Gournay@USherbrooke.ca

Voice over Packet Networks
• Voice communication over packet networks (VoIP) is characterized by a variable transmission time (jitter)
• VoIP receivers generally use a jitter buffer to control the effect of the jitter
• The jitter buffer works by introducing an additional playout delay
• The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits
• Packets that arrive before their playout time are temporarily stored in a reception buffer

Voice over Packet Networks (2)
[Figure: timing diagram of packets 0, 1, 2, ..., n, n+1 at the sender, the receiver, and playout, marking the transmission delay and the playout delay]
Fixed transmission time (no jitter): a fixed playout delay is enough to produce a sustained flow of speech to the listener.

Voice over Packet Networks (3)
[Figure: timing diagram of packets 0, 1, 2, ..., n, n+1 at the sender, the receiver, and playout, marking the transmission delay, the playout delay, and the labels Dn-1, Dn, Dn+1]
Variable transmission delay (some jitter), fixed playout delay: some packets (n, n+1) arrive too late to be decoded.
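The late-packet situation can be illustrated with a small sketch. It assumes a 20 ms frame interval (the standard frame duration implied elsewhere in these slides); the function name `late_packets` and the delay values are invented for illustration and do not come from the presentation.

```python
FRAME_MS = 20  # assumed standard frame duration, in milliseconds


def late_packets(transmission_delays_ms, playout_delay_ms):
    """Return the indices of packets that miss their playout time.

    Packet n is sent at n * FRAME_MS and is scheduled for playout at
    n * FRAME_MS + playout_delay_ms, so it is late whenever its
    transmission delay exceeds the fixed playout delay.
    """
    late = []
    for n, delay in enumerate(transmission_delays_ms):
        arrival = n * FRAME_MS + delay
        playout = n * FRAME_MS + playout_delay_ms
        if arrival > playout:
            late.append(n)
    return late


# Hypothetical per-packet transmission delays (ms) with some jitter.
delays = [40, 45, 42, 80, 95, 50]
print(late_packets(delays, 60))   # packets 3 and 4 are late
print(late_packets(delays, 100))  # no late packets, but 40 ms more delay
```

Raising the fixed playout delay removes the late packets at the cost of a longer end-to-end delay, which is the trade-off the jitter buffer has to manage.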

Jitter Buffering Strategies
• Fixed jitter buffer
  • Playout delay chosen at the beginning of the conversation
• Variable jitter buffer
  • Talk-spurt based: playout delay changed at the beginning of each silence period
  • For quickly varying networks: better results are obtained when the playout delay is also adapted during active speech

Playout Delay Adaptation
1. Using past jitter values, estimate the “ideal” playout time $\hat{P}_{i+1}$ of frame number $i+1$.
2. Send frame number $i$ to the decoder, requesting it to generate an output frame of length $\hat{T}_i = \hat{P}_{i+1} - P_i$.
3. The actual playout time of packet $i+1$ is $P_{i+1} = P_i + T_i$, where $T_i$ is the actual length of frame $i$. Iterate from step 1.
The playout delay for packet $i$ is the difference between $P_i$ and the reference clock at the normal frame rate.
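The adaptation loop can be sketched as follows. This is only an illustration under stated assumptions: the estimator (largest recent delay plus a margin) and the decoder stand-in (which simply clamps the requested length to 0-40 ms) are hypothetical placeholders, not the estimator or decoder used by the authors.

```python
FRAME_MS = 20.0  # assumed standard frame duration


def estimate_ideal_playout(next_index, past_delays_ms, margin_ms=10.0):
    """Hypothetical estimator of the ideal playout time of frame next_index:
    its nominal send time plus the largest recent delay plus a margin."""
    recent = past_delays_ms[-8:]
    return next_index * FRAME_MS + max(recent) + margin_ms


def decode_frame(requested_ms, min_ms=0.0, max_ms=40.0):
    """Stand-in for the decoder: returns the actual output frame length,
    here the requested length clamped to the allowed range."""
    return min(max(requested_ms, min_ms), max_ms)


def run_playout(delays_ms, initial_delay_ms):
    """Return the actual playout times P_0, P_1, ... of each frame."""
    playout_times = [initial_delay_ms]  # P_0
    for i in range(len(delays_ms) - 1):
        # Step 1: estimate the ideal playout time of frame i + 1.
        ideal_next = estimate_ideal_playout(i + 1, delays_ms[: i + 1])
        # Step 2: request an output frame of that length from the decoder.
        requested = ideal_next - playout_times[i]
        actual = decode_frame(requested)
        # Step 3: the actual playout time follows from the actual length.
        playout_times.append(playout_times[i] + actual)
    return playout_times


times = run_playout([30, 30, 30, 60, 60, 30], initial_delay_ms=40.0)
playout_delays = [p - i * FRAME_MS for i, p in enumerate(times)]
print(times)
print(playout_delays)
```

Because the decoder may not deliver exactly the requested length, the loop tracks the actual playout time rather than the ideal one, and the error is absorbed by the next request.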

Time Scaling Inside the Decoder
Time scaling inside the decoder presents several advantages over “standalone” time scaling (SOLA, TDHS, …):
• Uses the decoder’s internal parameters
  • Pitch period
  • In VMR-WB: voicing classification
• Regulates the processor load (especially for shorter frames)
  • The number of operations per second tends to increase as the frame length decreases
  • Some complexity is saved during the synthesis operation
• Improves quality
  • Smoothing is performed by the synthesis filters

General Principle
• Time scaling is performed in the excitation domain
• The adaptive codebook is updated before time scaling to keep the encoder and decoder synchronized
• Frames are modified depending on their voicing classification
  • In VMR-WB, the voicing classification is part of the bitstream
• Not all frames are modified
• Voiced frames can only be modified by a whole number of pitch periods
• Concealed frames can also be time-scaled

Inactive Frames
• The pseudo-random number generator used to build the excitation signal of CNG (comfort noise generation) frames is simply run for the requested number of samples
• Output frame duration is limited to between 0 and 40 ms (twice the standard duration)
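A minimal sketch of this idea, assuming 16 samples per millisecond (consistent with the "640 samples = 40 ms" figure quoted later in these slides) and a uniform noise source. The function name, the gain parameter, and the choice of generator are illustrative assumptions; the codec's actual generator is not described in the slides.

```python
import random

SAMPLES_PER_MS = 16  # assumption: 640 samples correspond to 40 ms
MAX_FRAME_MS = 40    # twice the standard 20 ms duration


def cng_excitation(requested_ms, gain, rng):
    """Build a comfort-noise excitation of the requested duration by
    running the pseudo-random generator for that many samples."""
    # Clamp the request to the allowed 0-40 ms range.
    requested_ms = min(max(requested_ms, 0), MAX_FRAME_MS)
    num_samples = int(requested_ms * SAMPLES_PER_MS)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


rng = random.Random(1234)  # hypothetical seed
frame = cng_excitation(30, gain=0.1, rng=rng)
print(len(frame))  # 480 samples for a 30 ms frame
```

No signal modification is involved: a longer or shorter inactive frame is obtained simply by drawing more or fewer noise samples.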

Unvoiced Frames
• Plosive frames and frames that are too voiced are not modified
• To lengthen frames:
  • Insert zeros between the original excitation samples
  • Adjust the gain to preserve the average energy per sample
• To shorten frames:
  • Remove samples from the excitation signal
• Frame duration is limited to between 10 and 40 ms
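The two unvoiced operations can be sketched as below. The even spacing of inserted zeros and of removed samples, and the function names, are assumptions made for illustration; the slides do not say how the positions are chosen. The gain follows from the energy argument: spreading N samples over M positions divides the average energy per sample by M/N, so scaling the amplitudes by sqrt(M/N) restores it.

```python
import math


def lengthen_unvoiced(excitation, target_len):
    """Spread the original samples over target_len positions, leaving
    zeros in between, and rescale to keep the average energy per sample."""
    n = len(excitation)
    if n == 0 or target_len < n:
        raise ValueError("target_len must be at least the original length")
    gain = math.sqrt(target_len / n)
    out = [0.0] * target_len
    for k, sample in enumerate(excitation):
        # Positions are distinct because target_len >= n.
        out[(k * target_len) // n] = gain * sample
    return out


def shorten_unvoiced(excitation, target_len):
    """Keep target_len evenly spaced samples and drop the others."""
    n = len(excitation)
    if not 0 < target_len <= n:
        raise ValueError("target_len must be in 1..len(excitation)")
    return [excitation[(j * n) // target_len] for j in range(target_len)]


def mean_energy(signal):
    return sum(s * s for s in signal) / len(signal)


original = [1.0, -1.0, 1.0, -1.0]
longer = lengthen_unvoiced(original, 8)
print(len(longer), mean_energy(original), mean_energy(longer))
print(shorten_unvoiced([0, 1, 2, 3, 4, 5], 3))
```

In this sketch the lengthened frame keeps an average energy per sample of about 1.0, the same as the original, while the shortened frame keeps every second sample.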

Voiced Frames
• Onset frames and frames that are not voiced enough are not modified
• To lengthen frames:
  • Use the long-term predictor to duplicate some pitch cycles
• To shorten frames:
  • Remove selected pitch cycles
• Frame duration is limited to between 0 and 40 ms

To Lengthen Voiced Frames
[Figure: past excitation followed by the current excitation of frame i, divided into subframes 1 to 4, with the pitch period T0 marked]
Voiced frames are lengthened by repeating selected pitch cycles.
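A rough sketch of time scaling by whole pitch cycles, under simplifying assumptions: a single constant pitch period for the frame, cycles added or removed at the end of the frame only, and no enforcement of the 0-40 ms limit. The function name and signature are hypothetical; how the decoder actually selects which cycles to repeat or remove is not specified in the slides.

```python
def scale_voiced(past, frame, pitch_period, requested_len):
    """Lengthen or shorten a voiced excitation frame by whole pitch cycles.

    past: excitation samples preceding the frame (long-term predictor memory)
    frame: current excitation samples
    pitch_period: pitch period T0 in samples (assumed constant here)
    requested_len: desired output length in samples
    """
    if pitch_period <= 0:
        raise ValueError("pitch_period must be positive")
    # Nearest whole number of cycles to add (positive) or remove (negative).
    cycles = round((requested_len - len(frame)) / pitch_period)
    if cycles >= 0:
        signal = list(past) + list(frame)
        if cycles > 0 and len(signal) < pitch_period:
            raise ValueError("not enough excitation history for this pitch")
        # Long-term prediction with unit gain: e[n] = e[n - T0].
        for _ in range(cycles * pitch_period):
            signal.append(signal[-pitch_period])
        return signal[len(past):]
    # Shortening: drop whole cycles, never more than the frame contains.
    removable = (len(frame) // pitch_period) * pitch_period
    remove = min(-cycles * pitch_period, removable)
    return list(frame[: len(frame) - remove])


frame = [1, 2, 3, 4, 1, 2, 3, 4]
print(scale_voiced([], frame, 4, 16))  # two cycles appended
print(scale_voiced([], frame, 4, 4))   # one cycle removed
print(len(scale_voiced([], frame, 4, 13)))  # request rounded to 12 samples
```

Because only whole cycles are added or removed, the output length generally differs from the requested one, which is why the playout logic has to work with the actual frame length returned by the decoder.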

Experimental Results
• Experiments conducted on clean speech using mode 2 of the VMR-WB codec (average data rate of 4.96 kbit/s with 60% active speech)
• Subjective quality:
  • Excellent for “fast playback” (up to twice the normal speed)
  • Small degradation for “slow playback” (down to half the normal speed)
• Very efficient at occasionally adding or removing a few ms (e.g. 20 ms) of playout delay
  • Much better than losing a few frames…

Distribution of Unmodified Frames
Required frame length (40 ms) is twice the standard frame length.
Total number of frames: 22803
Number of unmodified frames: 4085 (18%)
Distribution of unmodified frames:
1. Voiced frames
   • Onsets: 814 (20% of unmodified)
   • Not voiced enough: 935 (23%)
2. Unvoiced frames
   • Plosive: 1850 (45%)
   • Too voiced: 486 (12%)

Frame Length Distribution for Modified Voiced Frames
Desired frame length was 640 samples = 40 ms.

Time Required for a 50-ms Increase of the Playout Delay
Experiment done for 8000 different active speech frames.

Maximum Complexity and Corresponding Frame Length
*: Modified voiced frames not allowed to be less than 10 ms

Optimize the Complexity
• Lengthening does not increase complexity
• Shortening CNG frames does not increase complexity
• Shortening active speech frames increases complexity
• Good compromise:
  • Increase the playout delay as soon as it is necessary
  • Decrease the playout delay during inactive periods
  • Playout delay adaptation then requires no additional complexity
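The compromise can be written as a small decision rule. This is a sketch only: the function name, its parameters, and the 20 ms / 0-40 ms figures used as defaults are assumptions drawn from the frame limits quoted earlier in the slides, not a description of the authors' implementation.

```python
def requested_frame_ms(delay_error_ms, frame_is_active,
                       standard_ms=20.0, min_ms=0.0, max_ms=40.0):
    """Choose the frame length to request from the decoder.

    delay_error_ms: target playout delay minus current playout delay.
    frame_is_active: True for active speech, False for inactive (CNG) frames.
    """
    if delay_error_ms > 0:
        # Lengthening never increases complexity: react immediately.
        return min(standard_ms + delay_error_ms, max_ms)
    if delay_error_ms < 0 and not frame_is_active:
        # Shorten only inactive frames, which costs no extra complexity.
        return max(standard_ms + delay_error_ms, min_ms)
    # Active speech with too much delay: wait for an inactive period.
    return standard_ms


print(requested_frame_ms(15.0, True))    # 35.0: lengthen right away
print(requested_frame_ms(50.0, True))    # 40.0: capped at the maximum
print(requested_frame_ms(-15.0, True))   # 20.0: no shortening in speech
print(requested_frame_ms(-15.0, False))  # 5.0: shorten the inactive frame
```

The asymmetry reflects the cost structure listed above: an increase is urgent and free, while a decrease can usually wait for the next silence.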

Audio Demonstration

Summary
• Adaptive jitter buffering requires a means of time-scaling speech
• Time scaling can be done in the decoder’s “excitation domain”
• This approach is very efficient in terms of both quality and reactivity
• It adds almost no complexity, provided that some clever limitations are imposed on the amount of time scaling
