ADPCM ON TENSILICA Xiaoling Xu and Fan Mo EECS, UC Berkeley
DESIGN GOAL • Basic Algorithm • Two Streams Approach • Make use of Tensilica’s Special Features • Results • Conclusion
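ADPCM ENCODER [Block diagram: the estimate of the last input sample, X(n-1), is subtracted from the input sample X(n) to give the difference d(n). The encoder quantizes d(n) with step size ss(n) into the 4-bit ADPCM output sample L(n). L(n) also feeds the step size calculation, whose adjusted step size ss(n+1) passes through a Z-1 delay to become ss(n), and a decoder that produces the X(n) estimate, which is delayed by Z-1 to give X(n-1).]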
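ADPCM DECODER [Block diagram: the 4-bit ADPCM input sample L(n) feeds the decoder, which uses step size ss(n) to produce the difference d(n). d(n) is added to X(n-1), the previous output delayed by Z-1, to give the output sample X(n). L(n) also drives the step size calculation; the adjusted step size ss(n+1) is delayed by Z-1 to become ss(n).]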
ENCODING ALGORITHM

StepsizeTable[89] = {
  7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
  50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230,
  253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876, 963,
  1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024, 3327,
  3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442,
  11487, 12635, 13899, 15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794,
  32767
};

IndexTable[16] = { -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8 };

Encoding(*input) {
  loop (number of samples) {
    X = *input++;
    D = X - X(n-1);                  // X(n-1): estimate of the last input sample
    S = StepsizeTable[Index];
    Xa = |D|; Dq = 0; Code = 0;      // Dq: quantized difference, as rebuilt by the decoder
    if (Xa > S) { Code[2] = 1; Xa -= S; Dq += S; }
    S /= 2;
    if (Xa > S) { Code[1] = 1; Xa -= S; Dq += S; }
    S /= 2;
    if (Xa > S) { Code[0] = 1; Xa -= S; Dq += S; }
    Code[3] = (D > 0) ? 0 : 1;       // sign bit
    X(n-1) = (D > 0) ? X(n-1) + Dq : X(n-1) - Dq;
    if (X(n-1) > 32767) X(n-1) = 32767;
    if (X(n-1) < -32768) X(n-1) = -32768;
    Index += IndexTable[Code];
    if (Index > 88) Index = 88;
    if (Index < 0) Index = 0;
    *output++ = Code;
  }
}
DECODING ALGORITHM

StepsizeTable[89] and IndexTable[16] are the same tables as in the encoding algorithm.

Decoding(*Code) {
  loop (number of samples) {
    C = *Code++;
    S = StepsizeTable[Index];
    D = 0;
    if (C[2] == 1) D += S;
    S /= 2;
    if (C[1] == 1) D += S;
    S /= 2;
    if (C[0] == 1) D += S;
    if (C[3] == 1) X = X(n-1) - D; else X = X(n-1) + D;
    if (X > 32767) X = 32767;
    if (X < -32768) X = -32768;
    Index += IndexTable[C];
    if (Index > 88) Index = 88;
    if (Index < 0) Index = 0;
    *output++ = X;
    X(n-1) = X;
  }
}
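The encoding and decoding slides can be sketched as a small C codec. This is a minimal sketch that follows the simplified pseudocode on these slides (strict `>` comparisons, no extra rounding term), not the authors' Tensilica code. The names `adpcm_state`, `adpcm_encode_sample` and `adpcm_decode_sample` are hypothetical, and the encoder is assumed to update its state by running the decoder step so that both sides stay in step.

```c
#include <stdint.h>

static const int16_t step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41,
    45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190,
    209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724,
    796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272,
    2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132,
    7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818, 18500,
    20350, 22385, 24623, 27086, 29794, 32767
};

static const int8_t index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical state: pred is X(n-1), index selects the step size. */
typedef struct { int32_t pred; int index; } adpcm_state;

/* Decode one 4-bit code: rebuild the difference, update X(n-1) and Index. */
static int16_t adpcm_decode_sample(adpcm_state *st, uint8_t code) {
    int32_t s = step_table[st->index];
    int32_t d = 0;
    if (code & 4) d += s;
    s /= 2;
    if (code & 2) d += s;
    s /= 2;
    if (code & 1) d += s;
    int32_t x = (code & 8) ? st->pred - d : st->pred + d;  /* needs 17 bits */
    if (x > 32767) x = 32767;                              /* clamp */
    if (x < -32768) x = -32768;
    st->pred = x;
    st->index += index_table[code];
    if (st->index > 88) st->index = 88;
    if (st->index < 0) st->index = 0;
    return (int16_t)x;
}

/* Encode one sample: successive comparison against S, S/2, S/4. */
static uint8_t adpcm_encode_sample(adpcm_state *st, int16_t x) {
    int32_t d = (int32_t)x - st->pred;
    int32_t s = step_table[st->index];
    uint8_t code = 0;
    if (d <= 0) { code = 8; d = -d; }   /* sign bit, as in the slide: (D>0)?0:1 */
    if (d > s) { code |= 4; d -= s; }
    s /= 2;
    if (d > s) { code |= 2; d -= s; }
    s /= 2;
    if (d > s) { code |= 1; }
    adpcm_decode_sample(st, code);      /* assumption: reuse the decoder step */
    return code;
}
```

Starting from a zeroed state, encoding the sample 100 yields code 7; both encoder and decoder then hold the estimate 11 and index 8, which shows how the step size grows after a large difference.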
ALTERNATIVE APPROACHES

USING MULTIPLICATION
The multiplier is there. Why not use it?
Replace the three compare-and-subtract steps of the encoder (if (Xa>S) { Code[2]=1; ... } and the two that follow) with:
  Code[3]|=(X>0)?0:1;
  Code[2:0]=Xa/S*4=Xa*(1/S)*4;
  X(n-1)=Code[2:0]*S/4;
(1/S) is stored in a table.

USING MORE TABLES
Build tables for all possible paths.
[Figure: decision tree. S branches to 0XX and 1XX; S' branches to 00X, 01X, 10X and 11X; S'' decides the last bit.]
Replace the same three steps with:
  Xa-=S; Code[2]=~MSB(Xa);
  Xa-=S'[Code]; Code[1]=~MSB(Xa);
  Xa-=S''[Code]; Code[0]=~MSB(Xa);
E.g. S'[1XX]=S/2; S'[0XX]=-S+S/2; (when Code[2] is 0, the S already subtracted has to be added back before S/2 is subtracted).
BUT... • Earlier experiments showed that neither approach gives a big improvement. WHY? • Multiplication takes many cycles. • Too many tables cause a high cache miss rate.
UNIQUE OPERATIONS • Two kinds of operation recur throughout both Decoding() and Encoding(): • IF (…) … ELSE ...: conditional updates, e.g. if (Code[3]==1) X=X(n-1)-D; else X=X(n-1)+D; and the if (Xa>S) { … } steps of the encoder. • CLAMP: saturation to a fixed range, e.g. if (X>32767) X=32767; if (X<-32768) X=-32768; and if (Index>88) Index=88; if (Index<0) Index=0;
UNIQUE DATA STRUCTURE [Register layout: StreamA data in bits 31-16 | StreamB data in bits 15-0] • Most data are 16 bits wide or shorter. • Since a register is 32 bits wide, why not put two data items in one register? • But in some places a 17th bit is required to hold intermediate results: if (Code[3]==1) X=X(n-1)-D; else X=X(n-1)+D; if (X>32767) X=32767; if (X<-32768) X=-32768; Here X has to be 17 bits wide until it is clamped.
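The packing idea and the 17-bit problem can be illustrated in plain C. This is a minimal sketch assuming the layout on the slide (stream A in bits 31-16, stream B in bits 15-0); the helper names `pack2`, `unpack_a`, `unpack_b` and `fits_in_16` are hypothetical.

```c
#include <stdint.h>

/* Pack stream A into bits 31..16 and stream B into bits 15..0. */
static uint32_t pack2(int16_t a, int16_t b) {
    return ((uint32_t)(uint16_t)a << 16) | (uint16_t)b;
}

/* Unpack each half; the cast back to int16_t assumes two's complement. */
static int16_t unpack_a(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
static int16_t unpack_b(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Returns 1 if X(n-1) + D still fits in a signed 16-bit field,
   0 if the intermediate result needs the 17th bit before clamping. */
static int fits_in_16(int16_t prev, int16_t d) {
    int32_t x = (int32_t)prev + d;
    return x >= -32768 && x <= 32767;
}
```

For example, a previous estimate of 30000 plus a difference of 5000 gives 35000, which does not fit in 16 bits, so a packed add would corrupt the field unless the overflow is caught before it is stored.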
WHY NOT TWO STREAMS? [Figure: a dual-stream encoder and a dual-stream decoder] Difficult?
FIRST APPROACH: • A control-oriented application is hard to parallelize. • Modify the algorithm into a more computation-oriented form by using multiplication. • Speedup: • 10% for a single stream • 0% for two streams, due to high cache misses. • Why? • A 16-bit multiplication produces a 32-bit result.
ANOTHER APPROACH [Figure: a 32-bit register holding XA(n-1) in bits 31-16 and XB(n-1) in bits 15-0, added to a register holding SA | SB] • Keep the control-oriented approach: • 1. How to block the carry/borrow between bit 16 and bit 15? • 2. How to carry out two "if (..) .." in one instruction? • 3. How to encapsulate two 17-bit data items in a 32-bit register?
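The first question, blocking the carry between the two halves, can be illustrated in software with a well-known masking trick. This is only a sketch of the problem in portable C; the slides do not say how the authors' hardware blocks the carry, and `add2x16` is a hypothetical name.

```c
#include <stdint.h>

/* Add two pairs of 16-bit fields packed in 32-bit words so that no carry
   crosses from bit 15 into bit 16. Each field wraps modulo 2^16. */
static uint32_t add2x16(uint32_t a, uint32_t b) {
    const uint32_t top = 0x80008000u;            /* top bit of each field */
    uint32_t low = (a & ~top) + (b & ~top);      /* carries stop at bits 15 and 31 */
    return low ^ ((a ^ b) & top);                /* add the top bits without carry-out */
}
```

With a plain 32-bit add, 0x0001FFFF + 0x00000001 gives 0x00020000, so the overflow of the lower field leaks into the upper one; `add2x16` returns 0x00010000 and leaves the upper field untouched.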
TIE Instruction 1. How to carry out two "if (..) .." in one instruction? Scalar code: if (data1>bound) data1=bound; if (data2>bound) data2=bound; Packed form: if (data2|data1 > bound|bound) data2|data1 = bound|bound; [Figure: bound|bound is subtracted from data2|data1 (bit positions 31, 30, 15 and 0 marked); for each half, a 2:1 mux selects between the data and bound.]
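The behavior of such a packed clamp can be modeled in C. This is a behavioral sketch only: it is not TIE source, it assumes signed 16-bit fields, and `clamp_upper2` is a hypothetical name. Each half is compared with the bound and a select (the 2:1 mux in the figure) keeps either the data or the bound.

```c
#include <stdint.h>

/* Model of a dual upper clamp: data2 in bits 31..16, data1 in bits 15..0.
   The casts back to int16_t assume two's complement. */
static uint32_t clamp_upper2(uint32_t packed, int16_t bound) {
    int16_t d2 = (int16_t)(uint16_t)(packed >> 16);
    int16_t d1 = (int16_t)(uint16_t)(packed & 0xFFFFu);
    int16_t r2 = (d2 > bound) ? bound : d2;   /* 2:1 mux for the upper half */
    int16_t r1 = (d1 > bound) ? bound : d1;   /* 2:1 mux for the lower half */
    return ((uint32_t)(uint16_t)r2 << 16) | (uint16_t)r1;
}
```

Applied to the Index clamp of the algorithm, `clamp_upper2(0x005A0014, 88)` limits the upper index from 90 to 88 and leaves the lower index 20 unchanged, in what would be a single instruction in hardware.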
TIE Instructions • 2. How to encapsulate two 17-bit data items in a 32-bit register? • Scalar code: data1 += diff1; data2 += diff2; if (data1 > 32767) data1 = 32767; if (data2 > 32767) data2 = 32767; • Packed form: data2|data1 += diff2|diff1; [Figure: diff2|diff1 is added to data2|data1 (bits 31-16 | 15-0), giving result2 and result1; for each half, a 2:1 mux selects between the result and 32767.]
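A behavioral model of this add-and-clamp step can be written in C. It is a sketch under assumptions, not the authors' TIE code: the wider intermediate is modeled with `int32_t`, the lower bound -32768 from the algorithm slides is clamped as well, and the names `sat16` and `add_sat2` are hypothetical. The point is that the 17-bit sums exist only inside the operation, so the register never has to store them.

```c
#include <stdint.h>

/* Saturate a wide intermediate value to the signed 16-bit range. */
static int32_t sat16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* data2|data1 += diff2|diff1 with per-half saturation.
   The casts to int16_t assume two's complement. */
static uint32_t add_sat2(uint32_t data, uint32_t diff) {
    int32_t r2 = (int32_t)(int16_t)(uint16_t)(data >> 16)
               + (int16_t)(uint16_t)(diff >> 16);       /* up to 17 bits */
    int32_t r1 = (int32_t)(int16_t)(uint16_t)(data & 0xFFFFu)
               + (int16_t)(uint16_t)(diff & 0xFFFFu);   /* up to 17 bits */
    return ((uint32_t)(uint16_t)sat16(r2) << 16) | (uint16_t)sat16(r1);
}
```

For instance, adding (5000, -5000) to (30000, -30000) gives (32767, -32768) instead of wrapping around.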
CONSTANT TABLES • The original algorithm contains many table-lookup instructions. • Accessing a constant table through the cache is slow: • it increases the cache miss rate • it increases the number of memory access instructions • Use constant tables instead! • Tensilica provides tables that are built into the processor. • Accessing these tables has almost no extra cost.
CONCLUSION • TIE extensions and improved code efficiency resulted in an order-of-magnitude improvement over our original implementation. • Constant tables help to decrease cache accesses and cache misses. • Tensilica is also able to handle control-oriented applications.