ADPCM ON TENSILICA

Xiaoling Xu and Fan Mo
EECS, UC Berkeley

DESIGN GOAL
• Basic Algorithm
• Two Streams Approach
• Make use of Tensilica’s Special Features
• Results
• Conclusion

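ADPCM ENCODER
[Block diagram] The estimate of the last input sample, X(n-1), is subtracted from the input sample X(n) to give the difference d(n). The encoder quantizes d(n) with the step size ss(n) into the 4-bit ADPCM output sample L(n). The step size calculation derives the adjusted step size ss(n+1) from L(n), and a Z-1 delay feeds it back as ss(n). A decoder inside the loop rebuilds the estimate of X(n) from L(n), and a second Z-1 delay turns it into X(n-1).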

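ADPCM DECODER
[Block diagram] The decoder converts the 4-bit ADPCM input sample L(n) into the difference d(n) using the step size ss(n). d(n) is added to the previous output X(n-1), taken from a Z-1 delay, to give the output sample X(n). The step size calculation derives the adjusted step size ss(n+1) from L(n), and a Z-1 delay feeds it back as ss(n).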

ENCODING ALGORITHM

StepsizeTable[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};

IndexTable[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8,
    -1, -1, -1, -1, 2, 4, 6, 8
};

Encoding(*input) {
    loop(number of samples) {
        X = *input++;
        D = X - Xprev;                 // Xprev: estimate of the previous sample, X(n-1)
        S = StepsizeTable[Index];
        Xa = |D|;                      // magnitude of the difference
        Dq = 0;                        // Dq: quantized difference, as the decoder rebuilds it
        Code = 0;
        if (Xa > S) { Code[2] = 1; Xa -= S; Dq += S; }
        S /= 2;
        if (Xa > S) { Code[1] = 1; Xa -= S; Dq += S; }
        S /= 2;
        if (Xa > S) { Code[0] = 1; Xa -= S; Dq += S; }
        Code[3] = (D > 0) ? 0 : 1;     // sign bit
        Xprev = (D > 0) ? Xprev + Dq : Xprev - Dq;
        if (Xprev > 32767) Xprev = 32767;
        if (Xprev < -32768) Xprev = -32768;
        Index += IndexTable[Code];
        if (Index > 88) Index = 88;
        if (Index < 0) Index = 0;
        *output++ = Code;
    }
}

DECODING ALGORITHM

StepsizeTable[89] and IndexTable[16] are the same tables as in the ENCODING ALGORITHM.

Decoding(*Code) {
    C = *Code++;
    S = StepsizeTable[Index];
    D = 0;                             // D: difference rebuilt from the 3 magnitude bits
    if (C[2] == 1) D += S;
    S /= 2;
    if (C[1] == 1) D += S;
    S /= 2;
    if (C[0] == 1) D += S;
    if (C[3] == 1) X = Xprev - D;      // Xprev: previous output sample, X(n-1)
    else X = Xprev + D;
    if (X > 32767) X = 32767;
    if (X < -32768) X = -32768;
    Index += IndexTable[C];
    if (Index > 88) Index = 88;
    if (Index < 0) Index = 0;
    *output++ = X;
    Xprev = X;
}
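The two listings can be tried out with a small C sketch. It follows the simplified algorithm on these slides (strict ">" comparisons, no extra rounding term), not any particular standard codec. The names adpcm_state, adpcm_encode_sample and adpcm_decode_sample are hypothetical, and the encoder reuses the decoder step to update its state, which is one way to keep both sides in lockstep.

```c
#include <stdint.h>

/* Step-size and index tables as listed on the slides. */
static const int16_t step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17,
    19, 21, 23, 25, 28, 31, 34, 37, 41, 45,
    50, 55, 60, 66, 73, 80, 88, 97, 107, 118,
    130, 143, 157, 173, 190, 209, 230, 253, 279, 307,
    337, 371, 408, 449, 494, 544, 598, 658, 724, 796,
    876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358,
    5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int8_t index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical per-stream state: previous estimate X(n-1) and table index. */
typedef struct { int32_t xprev; int32_t index; } adpcm_state;

static int32_t clamp32(int32_t v, int32_t lo, int32_t hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Decode one 4-bit code: rebuild the difference, update estimate and index. */
int16_t adpcm_decode_sample(adpcm_state *st, uint8_t code) {
    int32_t s = step_table[st->index];
    int32_t d = 0;
    code &= 15;
    if (code & 4) d += s;
    s /= 2;
    if (code & 2) d += s;
    s /= 2;
    if (code & 1) d += s;
    int32_t x = (code & 8) ? st->xprev - d : st->xprev + d;
    x = clamp32(x, -32768, 32767);
    st->index = clamp32(st->index + index_table[code], 0, 88);
    st->xprev = x;
    return (int16_t)x;
}

/* Encode one sample: quantize the difference to sign + 3 magnitude bits. */
uint8_t adpcm_encode_sample(adpcm_state *st, int16_t x) {
    int32_t d = (int32_t)x - st->xprev;
    uint8_t code = 0;
    if (d < 0) { code = 8; d = -d; }      /* sign bit, then work on |d| */
    int32_t s = step_table[st->index];
    if (d > s) { code |= 4; d -= s; }
    s /= 2;
    if (d > s) { code |= 2; d -= s; }
    s /= 2;
    if (d > s) { code |= 1; }
    adpcm_decode_sample(st, code);        /* same state update as the decoder */
    return code;
}
```

Starting from a zeroed state, encoding the sample 100 yields code 7 and moves the estimate to 11, because the first step size is only 7; the index then jumps by 8 so that later samples are tracked with larger steps.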

ALTERNATIVE APPROACHES

Both alternatives replace the compare-and-subtract chain of the encoder:
    if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }    // Dq: quantized difference added to X(n-1)
    S/=2;
    if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
    S/=2;
    if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
    Code[3]|=(D>0)?0:1;

USING MULTIPLICATION
Multiplier is there. Why not use it?
    Code[2:0] = Xa/S*4 = Xa*(1/S)*4;
    Dq = Code[2:0]*S/4;
(1/S) is stored in a table.

USING MORE TABLES
Build tables for all possible paths: S is followed by S’[0XX] or S’[1XX], and then by S’’[00X], S’’[01X], S’’[10X] or S’’[11X].
    Xa-=S;         Code[2]=~MSB(Xa);
    Xa-=S’[Code];  Code[1]=~MSB(Xa);
    Xa-=S’’[Code]; Code[0]=~MSB(Xa);
E.g. S’[1XX]=S/2; S’[0XX]=-S+S/2; (when Code[2] is 0, the next subtraction must also undo the subtraction of S).
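The path-table idea can be checked with a short C sketch. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the S’ and S’’ entries are computed on the fly instead of being stored per step size, and the reference chain uses ">=" because reading the decision from the sign bit corresponds to ">=" rather than the ">" of the listing.

```c
#include <stdint.h>

/* Reference: compare-and-subtract chain (with >= to match the sign-bit test). */
unsigned quantize_branchy(int32_t xa, int32_t s) {
    unsigned code = 0;
    if (xa >= s) { code |= 4; xa -= s; }
    s /= 2;
    if (xa >= s) { code |= 2; xa -= s; }
    s /= 2;
    if (xa >= s) { code |= 1; }
    return code;
}

/* Path-table variant: always subtract, read each decision from the sign,
 * and fold the "undo" of a failed subtraction into the next table entry. */
unsigned quantize_tables(int32_t xa, int32_t s) {
    int32_t h = s / 2;                           /* second step size */
    int32_t q = h / 2;                           /* third step size  */
    int32_t s1[2] = { -s + h, h };               /* S'[0XX], S'[1XX] */
    int32_t s2[4] = { -h + q, q, -h + q, q };    /* S''[00X], [01X], [10X], [11X] */
    unsigned c2, c1, c0;

    xa -= s;
    c2 = (xa >= 0);                              /* ~MSB(Xa) */
    xa -= s1[c2];
    c1 = (xa >= 0);
    xa -= s2[c2 * 2 + c1];
    c0 = (xa >= 0);
    return (c2 << 2) | (c1 << 1) | c0;
}
```

In a real implementation the S’ and S’’ values would be precomputed for each of the 89 step sizes, which is where the extra table footprint comes from.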

BUT...
• Earlier experiments showed that neither approach gives a big improvement.
WHY?
• Multiplication takes many cycles.
• Too many tables cause many cache misses.

UNIQUE OPERATIONS
The decoding and encoding loops are built from two recurring operations (Xprev denotes the previous-sample estimate X(n-1)):
• IF (…) … ELSE ...
    if (C[3]==1) X = Xprev - D; else X = Xprev + D;
    if (C[2]==1) D += S;                      (decoder, once per magnitude bit)
    if (Xa>S) { Code[2]=1; Xa-=S; ... }       (encoder, once per magnitude bit)
• CLAMP
    if (X>32767) X = 32767;    if (X<-32768) X = -32768;
    if (Index>88) Index = 88;  if (Index<0) Index = 0;

UNIQUE DATA STRUCTURE
Register layout: StreamA Data in bits 31..16 | StreamB Data in bits 15..0
• Most data are 16 bits or shorter.
• Since a register is 32 bits wide, why not put two data items in one register?
• But in some places a 17th bit is required to store intermediate results:
    if (C[3]==1) X = Xprev - D; else X = Xprev + D;    (Xprev: previous-sample estimate X(n-1))
    if (X>32767) X = 32767;
    if (X<-32768) X = -32768;
  X has to be 17-bit here, because it must hold the out-of-range sum before it is clamped.

WHY NOT TWO STREAMS?
Difficult?
[Figures: DUAL STREAM ENCODER, DUAL STREAM DECODER]

FIRST APPROACH
• Control-oriented applications are hard to parallelize.
• Modify the algorithm into a more computation-oriented form by using multiplication.
• Speedup:
    • 10% for a single stream
    • 0% for two streams, due to high cache misses
• Why? A 16-bit multiplication produces a 32-bit result, so two products cannot share one 32-bit register.

ANOTHER APPROACH
• Keep the control-oriented approach:
    1. How to block the carry/borrow between bit 16 and bit 15?
    2. How to carry out two “if (..) ..” in one instruction?
    3. How to encapsulate two 17-bit data in a 32-bit register?
Packed operands: XA(n-1) | XB(n-1) plus SA | SB, with stream A in bits 31..16 and stream B in bits 15..0.
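The first question can be illustrated in plain C with a standard software trick for packed addition. This is not the TIE solution from the slides; it only shows what "blocking the carry" means. The helper names pack2, lane_hi, lane_lo and add2x16 are hypothetical, and each lane wraps modulo 2^16, so the 17-bit question is not addressed here.

```c
#include <stdint.h>

/* Pack two 16-bit values: hi in bits 31..16, lo in bits 15..0. */
uint32_t pack2(int16_t hi, int16_t lo) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Extract the lanes again (assumes the usual two's-complement conversion). */
int16_t lane_hi(uint32_t r) { return (int16_t)(uint16_t)(r >> 16); }
int16_t lane_lo(uint32_t r) { return (int16_t)(uint16_t)(r & 0xFFFFu); }

/* Add both lanes without letting a carry cross from bit 15 into bit 16:
 * add the low 15 bits of each lane, then restore the top bit of each lane
 * with XOR, which adds without producing a carry-out. */
uint32_t add2x16(uint32_t a, uint32_t b) {
    uint32_t low15 = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return low15 ^ ((a ^ b) & 0x80008000u);
}
```

With an ordinary 32-bit add, pack2(1, -1) + pack2(0, 1) leaks the carry of the low lane into the high lane and yields 2 there; add2x16 keeps the high lane at 1 and wraps the low lane to 0.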

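TIE Instruction
1. How to carry out two “if (..) ..” in one instruction?
    if (data1>bound) data1=bound;
    if (data2>bound) data2=bound;
  becomes one instruction on the packed pair:
    if (data2|data1 > bound) data2|data1 = bound|bound;
[Datapath] bound|bound is subtracted from data2|data1 (data2 in the upper half, data1 in the lower half). The result for each half drives a 2:1 mux that selects either the data or bound for that half.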
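A plain C model of such a dual clamp might look as follows. It is only a behavioral sketch of the instruction described above, with a hypothetical name (clamp_max2x16); the actual TIE description is not shown on the slides. The per-lane comparison builds a select mask that plays the role of the two 2:1 muxes.

```c
#include <stdint.h>

/* Clamp both 16-bit lanes of `packed` to an upper bound in one call.
 * Lane layout: data2 in bits 31..16, data1 in bits 15..0. */
uint32_t clamp_max2x16(uint32_t packed, int16_t bound) {
    /* bound|bound: the same bound replicated in both lanes */
    uint32_t bounds = ((uint32_t)(uint16_t)bound << 16) | (uint16_t)bound;
    int16_t d1 = (int16_t)(uint16_t)(packed & 0xFFFFu);
    int16_t d2 = (int16_t)(uint16_t)(packed >> 16);

    /* One select signal per lane, widened to a 16-bit mask. */
    uint32_t sel = (d1 > bound ? 0x0000FFFFu : 0u)
                 | (d2 > bound ? 0xFFFF0000u : 0u);

    /* 2:1 mux per lane: keep the data, or take the bound. */
    return (packed & ~sel) | (bounds & sel);
}
```

For example, two packed Index values 90 and 5 clamped against 88 become 88 and 5 in a single call.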

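TIE Instructions
• 2. How to encapsulate two 17-bit data in a 32-bit register?
    data1 += diff1; data2 += diff2;
    if (data1 > 32767) data1 = 32767;
    if (data2 > 32767) data2 = 32767;
  becomes one instruction on the packed pair:
    data2|data1 += diff2|diff1;
[Datapath] diff2|diff1 is added to data2|data1 (upper half in bits 31..16, lower half in bits 15..0), giving result2 and result1. For each half, a 2:1 mux selects either the result or 32767. The 17-bit intermediate results exist only inside the instruction, so each register half still holds a 16-bit value.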
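The fused add-and-clamp can be modeled in C as below. This is a behavioral sketch with a hypothetical name (add_sat2x16), not the authors' TIE code. It also clamps at -32768, which the decoding listing requires although the slide above only shows the upper bound.

```c
#include <stdint.h>

/* Saturate a wide intermediate result to the signed 16-bit range. */
static int32_t sat16(int32_t v) {
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Add diff2|diff1 to data2|data1 lane by lane and saturate each lane.
 * The sums are held in 32-bit temporaries, standing in for the 17-bit
 * intermediate results that never leave the instruction. */
uint32_t add_sat2x16(uint32_t data, uint32_t diff) {
    int32_t r1 = (int32_t)(int16_t)(uint16_t)(data & 0xFFFFu)
               + (int32_t)(int16_t)(uint16_t)(diff & 0xFFFFu);
    int32_t r2 = (int32_t)(int16_t)(uint16_t)(data >> 16)
               + (int32_t)(int16_t)(uint16_t)(diff >> 16);
    r1 = sat16(r1);
    r2 = sat16(r2);
    return ((uint32_t)(uint16_t)r2 << 16) | (uint16_t)r1;
}
```

Adding (1000, -1000) to (32000, -32000) gives (32767, -32768): both lanes overflow 16 bits internally, but only the clamped 16-bit values are written back.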

CONSTANT TABLES
• The original algorithm contains many table lookup instructions.
• Accessing a constant table through the cache is slow:
    • it increases the cache miss rate
    • it increases the number of memory access instructions
• Solution: use constant tables!
    • Tensilica supports tables that come with the processor.
    • There is almost no extra cost to access these tables.

CONSTANT TABLES

TWO STREAM RESULTS

COMPARISON

CONCLUSION
• TIE extensions and improved code efficiency resulted in an order-of-magnitude improvement over our original implementation.
• Constant tables help to reduce cache accesses and cache misses.
• Tensilica is also able to handle control-oriented applications.
