Topic: Implementation of JPEG 2000 component algorithm (DWT) in TI TMS32060
Team Members: Peng Zhang and Xun Zhang
Advisor: Yu Hen Hu
ECE 734 VLSI Array Structures for Digital Signal Processing
Spring 2004
Agenda
• Abstract
• DWT C implementation
• DWT TMS320C62 assembly code
  • Without optimization
  • Speed optimization
  • Pipeline optimization (by us)
• Result comparison
• JPEG 2000 and DWT (if time permits)
Abstract
In this project, we implement and optimize the DWT algorithm, a key algorithm in JPEG2000, on the TI TMS320C62 platform.
• Step 1: we implemented the 2D DWT algorithm in C.
• Step 2: we implemented the 2D DWT algorithm on the TI TMS320C62 platform twice: once without any optimization and once with the fastest speed optimization.
• Step 3: we applied further optimization to the assembly code, mainly pipelining.
• Step 4: we compared the performance before and after our optimization.
C code implementation

...
#define S(i) a[x*(i)*2]
...
void dwt_deinterleave(int *a, int n, int x) {
    int dn, sn, i;
    int *b;
    dn=n/2;
    sn=(n+1)/2;
    b=(int*)malloc(n*sizeof(int));
    for (i=0; i<sn; i++)
        b[i]=a[2*i*x];
    ...
}

/// Forward wavelet transform in 1-D.
void dwt_encode_1(int *a, int n, int x) {
    ...
    dwt_deinterleave(a, n, x);
}

/// Forward wavelet transform in 2-D.
void dwt_encode(int *a, int w, int h, int l) {
    int i, j, rw, rh;
    for (i=0; i<l; i++) {
        rw=int_ceildivpow2(w, i);
        rh=int_ceildivpow2(h, i);
        for (j=0; j<rw; j++)
            dwt_encode_1(a+j, rh, w);
        ...
    }
}

void main() {
    ...
    dwt_encode(image[0], 200, 165, 8);
    ...
}
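The slide elides most of dwt_deinterleave. As a rough illustration only, the sketch below assumes the elided part moves the odd-indexed samples into the back half of the buffer and then copies the buffer back over the strided input. The name deinterleave_sketch, the return code, and the copy-back step are assumptions for this example, not taken from the original code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical completion of a stride-based deinterleave.
 * a: first sample, n: number of samples, x: stride between samples.
 * Even-indexed samples end up in positions 0..sn-1, odd-indexed
 * samples in positions sn..n-1. Returns 0 on success, -1 if the
 * temporary buffer cannot be allocated. */
int deinterleave_sketch(int *a, int n, int x)
{
    int sn = (n + 1) / 2;   /* number of even-indexed samples */
    int dn = n / 2;         /* number of odd-indexed samples  */
    int i;
    int *b;

    assert(n > 0 && x > 0);
    b = malloc((size_t)n * sizeof *b);
    if (b == NULL)
        return -1;

    for (i = 0; i < sn; i++)        /* the loop shown on the slide */
        b[i] = a[2 * i * x];
    for (i = 0; i < dn; i++)        /* assumed: odd samples to the back */
        b[sn + i] = a[(2 * i + 1) * x];
    for (i = 0; i < n; i++)         /* assumed: copy back in place */
        a[i * x] = b[i];

    free(b);
    return 0;
}
```

With n = 5 and x = 1, the samples 0 1 2 3 4 would be reordered to 0 2 4 1 3. The first loop is the one the later assembly slides analyze.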
Assembly Code without any optimization

;----------------------------------------------------------------------
; 24 | void dwt_deinterleave(int *a, int n, int x)
;----------------------------------------------------------------------
_dwt_deinterleave:
;** --------------------------------------------------------------------------*
...
;----------------------------------------------------------------------
; 31 | for (i=0; i<sn; i++)
;----------------------------------------------------------------------
           ZERO    .D2     B4                ; |31|
           STW     .D2T2   B4,*+SP(24)       ; |31|
           LDW     .D2T2   *+SP(24),B5       ; |31|
           LDW     .D2T2   *+SP(20),B4       ; |31|
           NOP             4
           CMPLT   .L2     B5,B4,B0          ; |31|
   [!B0]   B       .S1     L2                ; |31|
           NOP             5
           ; BRANCH OCCURS                   ; |31|
L1:
           .line   9
; 32 | b[i]=a[2*i*x];
;----------------------------------------------------------------------
           LDW     .D2T2   *+SP(24),B4       ; |32|
           LDW     .D2T2   *+SP(12),B5       ; |32|
           LDW     .D2T2   *+SP(4),B6        ; |32|
           NOP             2
           ADD     .D2     B4,B4,B4
           MPYLH   .M2     B5,B4,B8          ; |32|
           MPYLH   .M2     B4,B5,B7          ; |32|
           MPYU    .M2     B5,B4,B5          ; |32|
           ADD     .D2     B8,B7,B4          ; |32|
           SHL     .S2     B4,16,B4          ; |32|
           ADD     .S2     B5,B4,B4          ; |32|
||         LDW     .D2T2   *+SP(28),B7       ; |32|
           LDW     .D2T2   *+B6[B4],B4       ; |32|
           LDW     .D2T2   *+SP(24),B5       ; |32|
           NOP             4
           STW     .D2T2   B4,*+B7[B5]       ; |32|
           LDW     .D2T2   *+SP(24),B4       ; |32|
           NOP             4
           ADD     .D2     1,B4,B4           ; |32|
           STW     .D2T2   B4,*+SP(24)       ; |32|
           LDW     .D2T2   *+SP(24),B5       ; |32|
           LDW     .D2T2   *+SP(20),B4       ; |32|
           NOP             4
           CMPLT   .L2     B5,B4,B0          ; |32|
   [ B0]   B       .S1     L1                ; |32|
           NOP             5
           ; BRANCH OCCURS                   ; |32|
;----------------------------------------------------------------------
...
Assembly Code with speed optimization

_dwt_deinterleave:
…
;** ------------------------------------------------------------------------
||         MV      .D2     B4,B11
           .line   5
           MV      .D2     B11,B0            ; |28|
           SHRU    .S2     B0,31,B4          ; |28|
           ADD     .D2     B4,B0,B4          ; |28|
           SHR     .S2     B4,1,B0           ; |28|
           MV      .D2     B0,B12            ; |28|
           .line   6
           ADD     .D2     1,B11,B10         ; |29|
           SHRU    .S2     B10,31,B4         ; |29|
           ADD     .D2     B4,B10,B4         ; |29|
           SHR     .S2     B4,1,B4           ; |29|
           MV      .S1X    B4,A12            ; |29|
           .line   7
           B       .S1     _malloc           ; |30|
           MVKL    .S2     RL0,B3            ; |30|
           SHL     .S1X    B11,2,A4          ; |30|
           MVKH    .S2     RL0,B3            ; |30|
           NOP             2
RL0:       ; CALL OCCURS                     ; |30|
           .line   8
           CMPLT   .L2     B10,2,B0
   [ B0]   B       .S1     L2                ; |31|
           MV      .D2     B10,B4
   [!B0]   MV      .D1     A4,A3
   [!B0]   MV      .S1     A10,A0
           NOP             2
           ; BRANCH OCCURS                   ; |31|
;** --------------------------------------------------------------------------*
;** ----------------------- U$22 = a;
;** ----------------------- U$25 = b;
;** 32 ----------------------- L$1 = K$7>>1;
;** ----------------------- X$4 = x<<3;
;** ----------------------- #pragma MUST_ITERATE(1, 1073741823, 1)
           .line   9
           SHR     .S2     B4,1,B0           ; |32|
||         SHL     .S1     A11,3,A6
;** -----------------------g3:
;** 32 ----------------------- *U$25++ = *U$22;
;** 32 ----------------------- U$22 += X$4;
;** 32 ----------------------- if ( --L$1 ) goto g3;
           SUB     .D2     B0,1,B0           ; |32|
L1:
   [ B0]   B       .S1     L1                ; |32|
||         LDW     .D1T1   *A0,A5            ; |32|
           ADD     .S1     A6,A0,A0          ; |32|
   [ B0]   SUB     .D2     B0,1,B0           ; |32|
           NOP             2
           STW     .D1T1   A5,*A3++          ; |32|
           ; BRANCH OCCURS                   ; |32|
;** -----------------------------------------------------------------------*
...
Speed-optimized code analysis

   [ B0]   B       .S1     L1                ; |32|
||         LDW     .D1T1   *A0,A5            ; |32|
           ADD     .S1     A6,A0,A0          ; |32|
   [ B0]   SUB     .D2     B0,1,B0           ; |32|
           NOP             2
           STW     .D1T1   A5,*A3++          ; |32|

• Source loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
• Assume sn = n+1
• Each iteration takes 6 cycles (B || LDW, ADD, SUB, NOP 2, STW), so 6*(n+1) clock cycles are needed
Assembly Code with pipeline optimization

           SHR     .S2     B4,1,B0
           CMPGT   .L2     B0,6,B1
   [ B1]   B       .S1     L2
           SHL     .S1     A10,3,A3
   [!B1]   SUB     .D2     B0,1,B0
           NOP             3
;** --------------------------------------------------------------------------*
...
;** --------------------------------------------------------------------------*
L2:
           ADD     .S1     A3,A4,A4
||         SUB     .D2     B0,7,B0
||         LDW     .D1T1   *A4,A6
;** --------------------------------------------------------------------------*
L3:        ; PIPELINED LOOP PRE-PROCESS
           MV      .S2X    A0,B4
|| [ B0]   B       .S1     L4
||         ADD     .L1     A3,A4,A0
|| [ B0]   SUB     .D2     B0,1,B0
||         LDW     .D1T1   *A4,A0

           ADD     .L1     A3,A0,A0
|| [ B0]   SUB     .D2     B0,1,B0
||         LDW     .D1T1   *A0,A0
|| [ B0]   B       .S1     L4

   [ B0]   B       .S1     L4
||         ADD     .L1     A3,A0,A0
|| [ B0]   SUB     .D2     B0,1,B0
||         LDW     .D1T1   *A0,A0

           ADD     .L1     A3,A0,A0
|| [ B0]   SUB     .D2     B0,1,B0
||         LDW     .D1T1   *A0,A0
|| [ B0]   B       .S1     L4

           MV      .S2X    A6,B5
|| [ B0]   B       .S1     L4
||         ADD     .L1     A3,A0,A4
|| [ B0]   SUB     .D2     B0,1,B0
||         LDW     .D1T1   *A0,A0
;** --------------------------------------------------------------------------*
L4:        ; PIPELINED LOOP
           STW     .D2T2   B5,*B4++
||         MV      .S2X    A0,B5
|| [ B0]   B       .S1     L4
||         ADD     .L1     A3,A4,A4
|| [ B0]   SUB     .L2     B0,1,B0
||         LDW     .D1T1   *A4,A0
;** --------------------------------------------------------------------------*
L5:        ; PIPELINED LOOP PAST-PROCESS
           MV      .S2X    A0,B5
||         STW     .D2T2   B5,*B4++

           MV      .S2X    A0,B5
||         STW     .D2T2   B5,*B4++

           MV      .S2X    A0,B5
||         STW     .D2T2   B5,*B4++

           MVC     .S2     B6,CSR
||         MV      .L2X    A0,B5
||         STW     .D2T2   B5,*B4++
;** --------------------------------------------------------------------------*
           MV      .S2X    A0,B5
||         STW     .D2T2   B5,*B4++

           STW     .D2T2   B5,*B4++
;** --------------------------------------------------------------------------*
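Pipeline-optimized code design

L4:        ; PIPELINED LOOP
           STW     .D2T2   B5,*B4++
||         MV      .S2X    A0,B5
|| [ B0]   B       .S1     L4
||         ADD     .L1     A3,A4,A4
|| [ B0]   SUB     .L2     B0,1,B0
||         LDW     .D1T1   *A4,A0

• Source loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
• Assume sn = n+1
• The loop kernel is a single execute packet, so each iteration costs one cycle once the pipeline is full; including the pre-process and past-process stages, n+7 clock cycles are needed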
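To make the idea behind the pipelined kernel concrete, here is a C-level model of the same loop, written as a sketch. It overlaps only two stages (the load for iteration i with the store for iteration i-1), whereas the assembly above keeps several loads in flight at once. The function names and the stride parameter (which stands for 2*x) are introduced for illustration and do not appear in the original code.

```c
/* Plain form of the loop on the slide: b[i] = a[i * stride]. */
void copy_strided(int *b, const int *a, int count, int stride)
{
    int i;
    for (i = 0; i < count; i++)
        b[i] = a[i * stride];
}

/* Two-stage software-pipelined form of the same loop.
 * Prolog: issue the first load.
 * Kernel: load for iteration i while storing iteration i-1.
 * Epilog: drain the last value that is still in flight. */
void copy_strided_pipelined(int *b, const int *a, int count, int stride)
{
    int in_flight;
    int i;

    if (count <= 0)
        return;

    in_flight = a[0];                   /* prolog */
    for (i = 1; i < count; i++) {
        int next = a[i * stride];       /* load for iteration i    */
        b[i - 1] = in_flight;           /* store for iteration i-1 */
        in_flight = next;
    }
    b[count - 1] = in_flight;           /* epilog */
}
```

Both functions produce the same output. On the C62, the load, pointer update, counter update, branch, and store of different iterations can share one execute packet because they use different functional units, which is what the L4 kernel does.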
Comparison
Speed-optimized code (by C6) vs. pipeline-optimized code (by us)
Loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
Assume sn = n+1.
Result:
• Speed-optimized code uses 6(n+1) clock cycles
• Pipeline-optimized code uses n+7 clock cycles
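The two cycle counts can be compared numerically. The helper functions below are a small sketch that simply encodes the formulas from the slides; the function names are made up for this example.

```c
/* Cycle counts stated on the slides for a loop of n + 1 iterations. */
long cycles_speed_optimized(long n)
{
    return 6 * (n + 1);
}

long cycles_pipelined(long n)
{
    return n + 7;
}

/* Ratio of the two counts; it approaches 6 as n grows. */
double speedup(long n)
{
    return (double)cycles_speed_optimized(n) / (double)cycles_pipelined(n);
}
```

As a worked example, a column of 165 samples gives sn = (165+1)/2 = 83 iterations, so n = 82. The formulas then give 6*83 = 498 cycles against 82+7 = 89 cycles, a ratio of about 5.6.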
JPEG2000 Lossy Image Compression Encoder
[Block diagram: DWT → Quantizer → Entropy Coder]
1-Level Wavelet Decomposition (2D DWT)
[Figure: the input image passes through row-wise operations and then column-wise operations. At each stage a low-pass and a high-pass filter (H1, H2) are each followed by a decimator by 2, producing the LL, HL, LH, and HH components. Inset: filter followed by decimator, x[n] to y[n], keeping one out of two pixels.]
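The row-wise then column-wise structure of one decomposition level can be sketched in C. This example deliberately swaps in the simplest possible filter pair, an unnormalized Haar step (sum as low pass, difference as high pass), instead of the JPEG 2000 filters, and it assumes even width and height. All names here are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One filter-and-decimate step over n samples spaced `stride` apart.
 * Low-pass outputs (p + q) go to the first half, high-pass outputs
 * (p - q) to the second half. Stand-in filter, not the JPEG 2000 one.
 * Returns 0 on success, -1 if allocation fails. */
int sumdiff_1d(int *a, int n, int stride)
{
    int half = n / 2;
    int i;
    int *tmp;

    assert(n > 0 && n % 2 == 0 && stride > 0);
    tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (i = 0; i < half; i++) {
        int p = a[(2 * i) * stride];
        int q = a[(2 * i + 1) * stride];
        tmp[i] = p + q;            /* low pass, keeps one of two  */
        tmp[half + i] = p - q;     /* high pass, keeps one of two */
    }
    for (i = 0; i < n; i++)
        a[i * stride] = tmp[i];

    free(tmp);
    return 0;
}

/* One level of 2D decomposition on a w-by-h row-major image:
 * row-wise pass first, then column-wise pass. Afterwards the
 * top-left quadrant holds the LL component. */
int dwt2d_one_level_sketch(int *img, int w, int h)
{
    int r, c;

    for (r = 0; r < h; r++)
        if (sumdiff_1d(img + r * w, w, 1) != 0)
            return -1;
    for (c = 0; c < w; c++)
        if (sumdiff_1d(img + c, h, w) != 0)
            return -1;
    return 0;
}
```

For the 2x2 image {1, 2, 3, 4}, the row pass gives {3, -1, 7, -1} and the column pass gives {10, -2, -4, 0}, so the LL value is 10, the sum of all four pixels.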
Multi-Level Wavelet Decomposition
[Figure: a first 2D-DWT splits the image into LL, HL1, LH1, and HH1; a second 2D-DWT applied to the LL subband splits it into LL, HL2, LH2, and HH2.]
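Because each level is applied to the LL subband of the previous one, the region being transformed shrinks by roughly half in each dimension per level. The sketch below computes those sizes the way the dwt_encode loop appears to, assuming int_ceildivpow2(a, b) returns the ceiling of a / 2^b. That behavior and the names ceildivpow2_sketch and level_sizes are assumptions for this example.

```c
#include <assert.h>

/* Ceiling of a / 2^b for a >= 0; assumed behavior of int_ceildivpow2. */
int ceildivpow2_sketch(int a, int b)
{
    assert(a >= 0 && b >= 0 && b < 31);
    return (int)(((long long)a + (1LL << b) - 1) >> b);
}

/* Fill rw[i] and rh[i] with the width and height of the region
 * that decomposition level i operates on, for i = 0..levels-1. */
void level_sizes(int w, int h, int levels, int *rw, int *rh)
{
    int i;
    for (i = 0; i < levels; i++) {
        rw[i] = ceildivpow2_sketch(w, i);
        rh[i] = ceildivpow2_sketch(h, i);
    }
}
```

For the 200 x 165 image and 8 levels used in main, this gives 200x165, 100x83, 50x42, 25x21, 13x11, 7x6, 4x3, and 2x2.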
Thanks! Questions?