Speech over Packet Networks • Variable Jitter Buffering • Decoder-Based Time-Scaling • Performance Analysis. Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks. Philippe Gournay, Kyle D. Anderson, VoiceAge Corporation
Speech over Packet Networks • Variable Jitter Buffering • Decoder-Based Time-Scaling • Performance Analysis. Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks. Philippe Gournay, Kyle D. Anderson. VoiceAge Corporation, 750, Chemin Lucerne, Montréal (Québec), Canada H3R 2H6. Philippe.Gournay@USherbrooke.ca
Voice over Packet Networks • Voice communication over packet networks (VoIP) is characterized by a variable transmission time (jitter) • VoIP receivers generally use a jitter buffer to control the effect of the jitter • The jitter buffer works by introducing an additional playout delay • The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits • Packets that arrive before their playout time are temporarily stored in a reception buffer
Voice over Packet Networks (2) [Figure: timing diagram of the sender, receiver and playout timelines for packets 0, 1, 2, …, n, n+1, with the transmission delay and the playout delay marked] Fixed transmission time (no jitter): a fixed playout delay is enough to produce a sustained flow of speech to the listener
Voice over Packet Networks (3) [Figure: same timing diagram with a variable transmission delay; the delays Dn-1, Dn and Dn+1 are marked] Variable transmission delay (some jitter), fixed playout delay: some packets (n, n+1) arrive too late to be decoded
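The late-packet situation on the two slides above can be reproduced with a small sketch. It assumes 20 ms frames (half of the 40 ms quoted later as twice the standard duration) and a playout clock that starts when packet 0 arrives; the function name and the transit values are hypothetical.

```python
FRAME_MS = 20  # assumed standard frame duration

def late_packets(transit_ms, playout_delay_ms, frame_ms=FRAME_MS):
    """Return the indices of packets that arrive after their playout time.

    transit_ms[i] is the transmission delay of packet i (illustrative values).
    Packet i is sent at i * frame_ms and played at
    arrival(packet 0) + playout_delay_ms + i * frame_ms.
    """
    arrival_first = transit_ms[0]
    late = []
    for i, transit in enumerate(transit_ms):
        arrival = i * frame_ms + transit
        playout = arrival_first + playout_delay_ms + i * frame_ms
        if arrival > playout:  # arrived too late to be decoded
            late.append(i)
    return late

transit = [30, 30, 35, 70, 65]      # jitter appears on the last two packets
print(late_packets(transit, 20))    # [3, 4]
print(late_packets(transit, 40))    # []
```

A larger fixed playout delay removes the late packets here, at the cost of 20 ms more end-to-end delay for every packet.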
Jitter Buffering Strategies • Fixed jitter buffer: playout delay chosen at the beginning of the conversation • Variable jitter buffer: talk-spurt based (playout delay changed at the beginning of each silence period); for quickly varying networks, better results are obtained when the playout delay is also adapted during active speech
Playout Delay Adaptation
1. Using past jitter values, estimate the “ideal” playout time $\hat{P}_{i+1}$ of frame number i+1.
2. Send frame number i to the decoder, requesting it to generate an output frame of length $\hat{T}_i = \hat{P}_{i+1} - P_i$.
3. The actual playout time of packet i+1 is $P_{i+1} = P_i + T_i$, where $T_i$ is the actual length of frame i. Iterate from step 1.
The playout delay for packet i is the difference between $P_i$ and the reference clock at the normal frame rate.
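The iteration above can be sketched as follows. The ideal playout times are taken as given (the slides do not say how they are estimated from past jitter), and the decoder is replaced by a hypothetical stand-in that clamps the requested length to 0 to 40 ms; all names are illustrative.

```python
def clamp_decoder(requested_ms, min_ms=0, max_ms=40):
    """Hypothetical decoder stand-in: returns the frame length actually produced."""
    return min(max(requested_ms, min_ms), max_ms)

def adapt_playout(ideal_times_ms, actual_length=clamp_decoder):
    """Iterate P[i+1] = P[i] + T[i], with requested length P_hat[i+1] - P[i]."""
    playout = [ideal_times_ms[0]]                       # P_0
    for i in range(len(ideal_times_ms) - 1):
        requested = ideal_times_ms[i + 1] - playout[i]  # requested length of frame i
        produced = actual_length(requested)             # actual length T_i
        playout.append(playout[i] + produced)           # actual playout time P_{i+1}
    return playout

ideal = [0, 20, 40, 110, 120, 140]   # a 50 ms jump is wanted at frame 3
actual = adapt_playout(ideal)
print(actual)                                     # [0, 20, 40, 80, 120, 140]
print([p - 20 * i for i, p in enumerate(actual)])  # playout delay: [0, 0, 0, 20, 40, 40]
```

Because each frame can grow to at most 40 ms in this sketch, the requested jump is absorbed over two frames rather than one.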
Time Scaling Inside the Decoder … presents several advantages over “standalone” time scaling (SOLA, TDHS, …): • Uses the decoder’s internal parameters: pitch period and, in VMR-WB, voicing classification • Regulates the processor load (especially for shorter frames): the number of operations per second tends to increase as the frame length decreases; some complexity is saved during the synthesis operation • Improves quality: smoothing is performed by the synthesis filters
General Principle • Time scaling is performed in the excitation domain • The adaptive codebook is updated before time scaling to keep the encoder and decoder synchronized • Frames are modified depending on their voicing classification (in VMR-WB, the voicing classification is part of the bitstream) • Not all frames are modified • Voiced frames can only be modified by an integer number of pitch periods • Concealed frames can also be time-scaled
Inactive Frames • The pseudo-random number generator used to build the excitation signal of CNG frames is simply run for the requested number of samples • Output frame duration is limited to between 0 and 40 ms (twice the standard duration)
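A minimal sketch of the idea for inactive frames, assuming a 16 kHz sampling rate (consistent with 640 samples = 40 ms elsewhere in the deck) and using Python's `random` module as a stand-in for the codec's own generator. The function name is hypothetical, and gain and spectral shaping are left out.

```python
import random

SAMPLE_RATE_HZ = 16000  # assumption: 640 samples correspond to 40 ms
MAX_MS = 40             # upper limit quoted on the slide

def cng_excitation(requested_ms, seed=0):
    """Generate comfort-noise excitation for the requested duration.

    The duration is clamped to [0, 40] ms; the noise generator is then
    simply run for that many samples.
    """
    clamped_ms = min(max(requested_ms, 0), MAX_MS)
    num_samples = SAMPLE_RATE_HZ * clamped_ms // 1000
    rng = random.Random(seed)  # stand-in for the decoder's pseudo-random generator
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]

print(len(cng_excitation(30)))  # 480
print(len(cng_excitation(50)))  # 640 (clamped to 40 ms)
```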
Unvoiced Frames • Plosive frames and frames that are too voiced are not modified • To lengthen frames: insert zeros between the original excitation samples and adjust the gain to preserve the average energy per sample • To shorten frames: remove samples from the excitation signal • Frame duration is limited to between 10 and 40 ms
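The two unvoiced operations can be sketched as below. The slides do not say where the zeros are inserted or which samples are removed, so spreading them evenly is an assumption, as are the function names. Inserting zeros spreads the same energy over more samples, so the retained samples are scaled by sqrt(new_len / old_len) to keep the average energy per sample unchanged.

```python
import math

def lengthen_unvoiced(exc, new_len):
    """Spread the samples over new_len slots, zero elsewhere, with gain correction."""
    n = len(exc)
    if n == 0 or new_len < n:
        raise ValueError("new_len must be at least len(exc), and exc non-empty")
    gain = math.sqrt(new_len / n)       # restores the average energy per sample
    out = [0.0] * new_len
    for k, sample in enumerate(exc):
        out[k * new_len // n] = gain * sample  # positions are distinct since new_len >= n
    return out

def shorten_unvoiced(exc, new_len):
    """Keep new_len evenly spaced samples and drop the others."""
    n = len(exc)
    if not 0 <= new_len <= n:
        raise ValueError("new_len must be between 0 and len(exc)")
    return [exc[k * n // new_len] for k in range(new_len)]

def mean_energy(x):
    return sum(s * s for s in x) / len(x)

exc = [1.0, -1.0, 1.0, -1.0]
longer = lengthen_unvoiced(exc, 8)
print(abs(mean_energy(longer) - mean_energy(exc)) < 1e-9)  # True
print(shorten_unvoiced([0, 1, 2, 3, 4, 5], 3))             # [0, 2, 4]
```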
Voiced Frames • Onset frames and frames that are not voiced enough are not modified • To lengthen frames: use the long-term predictor to duplicate some pitch cycles • To shorten frames: remove selected pitch cycles • Frame duration is limited to between 0 and 40 ms
To Lengthen Voiced Frames [Figure: past excitation and current excitation (i), divided into subframes 1 to 4, with the pitch period T0 marked] Voiced frames are lengthened by repeating selected pitch cycles
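Since voiced frames change length only by whole pitch periods, the achievable lengths are frame_len + k·T0. The sketch below picks the k closest to a requested length within the stated limits, then lengthens by copying samples one pitch period back (a unity-gain long-term predictor). Appending the repeated cycles at the end of the frame is a simplification, as the slides say only that "selected" cycles are repeated; all names are hypothetical.

```python
def achievable_voiced_length(frame_len, pitch, requested_len, min_len=0, max_len=640):
    """Return (length, k) with length = frame_len + k * pitch closest to requested_len."""
    k = round((requested_len - frame_len) / pitch)
    while frame_len + k * pitch > max_len:  # stay under the upper limit
        k -= 1
    while frame_len + k * pitch < min_len:  # stay above the lower limit
        k += 1
    return frame_len + k * pitch, k

def repeat_pitch_cycles(exc, pitch, k):
    """Lengthen exc by k pitch cycles using e[n] = e[n - pitch]."""
    out = list(exc)
    for _ in range(k * pitch):
        out.append(out[-pitch])  # copy the sample one pitch period back
    return out

# 320-sample frame (20 ms at an assumed 16 kHz), pitch period of 100 samples
print(achievable_voiced_length(320, 100, 640))        # (620, 3)
print(repeat_pitch_cycles([1, 2, 3, 4, 5, 6], 3, 1))  # [1, 2, 3, 4, 5, 6, 4, 5, 6]
```

The first call shows why a requested 640-sample frame is generally not met exactly for voiced frames: the nearest achievable length under the limit is 620 samples.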
Experimental Results • Experiments conducted on clean speech using mode 2 of the VMR-WB codec (average data rate of 4.96 kbit/s with 60% active speech) • Subjective quality: excellent for “fast playback” (up to twice the normal speed); small degradation for “slow playback” (down to half the normal speed) • Very efficient at adding or removing a few ms (e.g. 20 ms) to the playout delay from time to time • Much better than losing a few frames…
Distribution of Unmodified Frames
Required frame length (40 ms) is twice the standard frame length
Total number of frames: 22803
Number of unmodified frames: 4085 (18%)
Distribution of unmodified frames:
1. Voiced frames
• Onsets: 814 (20% of unmodified)
• Not voiced enough: 935 (23%)
2. Unvoiced frames
• Plosive: 1850 (45%)
• Too voiced: 486 (12%)
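The percentages on this slide follow directly from the counts, as this short check shows. The counts are taken from the slide; the variable names are illustrative.

```python
total_frames = 22803
unmodified = {
    "voiced: onset": 814,
    "voiced: not voiced enough": 935,
    "unvoiced: plosive": 1850,
    "unvoiced: too voiced": 486,
}

n_unmodified = sum(unmodified.values())
share_of_total = round(100 * n_unmodified / total_frames)
shares = {name: round(100 * count / n_unmodified) for name, count in unmodified.items()}

print(n_unmodified, share_of_total)  # 4085 18
print(shares)
```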
Frame Length Distribution for Modified Voiced Frames [Figure not reproduced] Desired frame length: 640 samples = 40 ms
Time Required for a 50-ms Increase of the Playout Delay [Figure not reproduced] Experiment done for 8000 different active speech frames
Maximum Complexity and Corresponding Frame Length [Table not reproduced] *: Modified voiced frames not allowed to be less than 10 ms
Optimize the Complexity • Lengthening does not increase complexity • Shortening CNG frames does not increase complexity • Shortening active speech frames increases complexity • Good compromise: increase the playout delay as soon as it is necessary; decrease the playout delay during inactive periods • Playout delay adaptation then requires no additional complexity
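The compromise on this slide amounts to a simple decision rule, sketched below with hypothetical names. Per-frame limits on how far a single frame can be stretched or shrunk are ignored here.

```python
def next_playout_delay(current_ms, target_ms, frame_is_active):
    """Decide the playout delay to aim for on the next frame.

    Increases are applied immediately, since lengthening costs no extra
    complexity. Decreases are deferred to inactive (CNG) frames, because
    shortening active speech frames is what raises complexity.
    """
    if target_ms > current_ms:
        return target_ms   # increase as soon as it is necessary
    if not frame_is_active:
        return target_ms   # decrease only during inactive periods
    return current_ms      # hold the delay while speech is active

print(next_playout_delay(40, 60, frame_is_active=True))   # 60
print(next_playout_delay(60, 40, frame_is_active=True))   # 60
print(next_playout_delay(60, 40, frame_is_active=False))  # 40
```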
Summary • Adaptive jitter buffering requires a means of time scaling speech • Time scaling can be done in the decoder’s “excitation domain” • This approach is very efficient in terms of both quality and reactivity • It adds almost no complexity, provided that suitable limits are imposed on the amount of time scaling