ECE 734 VLSI Array Structures for Digital Signal Processing

Topic: Implementation of JPEG 2000 component algorithm (DWT) on the TI TMS320C62
Team Members: Peng Zhang and Xun Zhang
Advisor: Yu Hen Hu
ECE 734 VLSI Array Structures for Digital Signal Processing, Spring 2004

Agenda
• Abstract
• DWT C implementation
• DWT TMS320C62 assembly code
    • Without optimization
    • Speed optimization
    • Pipeline optimization (by us)
• Result comparison
• JPEG 2000 and DWT (if we have free time)

Abstract

In this project we implement and optimize the DWT, a key algorithm in JPEG 2000, on the TI TMS320C62 platform.
• Step 1: we implemented the 2D DWT algorithm in C.
• Step 2: we built the 2D DWT for the TI TMS320C62 platform twice, once without any optimization and once with the fastest speed optimization.
• Step 3: we applied further optimization to the assembly code, mainly by pipelining the loop.
• Step 4: we compared the performance before and after our optimization.

C code implementation

    ...
    #define S(i) a[x*(i)*2]
    ...
    void dwt_deinterleave(int *a, int n, int x) {
        int dn, sn, i;
        int *b;
        dn=n/2;
        sn=(n+1)/2;
        b=(int*)malloc(n*sizeof(int));
        for (i=0; i<sn; i++)
            b[i]=a[2*i*x];
        ...
    }

    /// Forward wavelet transform in 1-D.
    void dwt_encode_1(int *a, int n, int x) {
        ...
        dwt_deinterleave(a, n, x);
    }

    /// Forward wavelet transform in 2-D.
    void dwt_encode(int *a, int w, int h, int l) {
        int i, j, rw, rh;
        for (i=0; i<l; i++) {
            rw=int_ceildivpow2(w, i);
            rh=int_ceildivpow2(h, i);
            for (j=0; j<rw; j++)
                dwt_encode_1(a+j, rh, w);
            ...
        }
    }

    void main() {
        ...
        dwt_encode(image[0], 200, 165, 8);
        ...
    }
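The listing above elides the rest of `dwt_deinterleave`. As a rough illustration of what a deinterleave step does, the sketch below copies the even-indexed samples into the first `sn` slots, the odd-indexed samples into the remaining `dn` slots, and writes the result back with stride `x`. The function name `deinterleave_sketch`, the odd-sample loop, the write-back loop and the error return are assumptions made for this example; they are not taken from the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not the elided code from the slide):
 * reorders n samples of a strided signal so that the even-indexed
 * samples come first (sn of them), followed by the odd-indexed
 * samples (dn of them). x is the stride between samples.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int deinterleave_sketch(int *a, int n, int x)
{
    int dn = n / 2;        /* number of odd-indexed samples  */
    int sn = (n + 1) / 2;  /* number of even-indexed samples */
    int i;
    int *b;

    assert(n > 0 && x > 0);

    b = (int *)malloc((size_t)n * sizeof(int));
    if (b == NULL)
        return -1;

    /* Same loop as on the slide: gather the even samples. */
    for (i = 0; i < sn; i++)
        b[i] = a[2 * i * x];

    /* Assumed continuation: gather the odd samples after them. */
    for (i = 0; i < dn; i++)
        b[sn + i] = a[(2 * i + 1) * x];

    /* Assumed continuation: write the reordered samples back in place. */
    for (i = 0; i < n; i++)
        a[i * x] = b[i];

    free(b);
    return 0;
}
```

With `x = 1` the function reorders a contiguous row; with `x` equal to the image width it reorders one column of a row-major image, which matches how `dwt_encode` passes `a+j` and `w` to `dwt_encode_1`.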

Assembly code without any optimization

    ;----------------------------------------------------------------------
    ; 24 | void dwt_deinterleave(int *a, int n, int x)
    ;----------------------------------------------------------------------
    _dwt_deinterleave:
    ;** --------------------------------------------------------------------------*
    ...
    ;----------------------------------------------------------------------
    ; 31 | for (i=0; i<sn; i++)
    ;----------------------------------------------------------------------
               ZERO    .D2     B4                ; |31|
               STW     .D2T2   B4,*+SP(24)       ; |31|
               LDW     .D2T2   *+SP(24),B5       ; |31|
               LDW     .D2T2   *+SP(20),B4       ; |31|
               NOP             4
               CMPLT   .L2     B5,B4,B0          ; |31|
       [!B0]   B       .S1     L2                ; |31|
               NOP             5
               ; BRANCH OCCURS                   ; |31|
    L1:
               .line   9
    ; 32 | b[i]=a[2*i*x];
    ;----------------------------------------------------------------------
               LDW     .D2T2   *+SP(24),B4       ; |32|
               LDW     .D2T2   *+SP(12),B5       ; |32|
               LDW     .D2T2   *+SP(4),B6        ; |32|
               NOP             2
               ADD     .D2     B4,B4,B4
               MPYLH   .M2     B5,B4,B8          ; |32|
               MPYLH   .M2     B4,B5,B7          ; |32|
               MPYU    .M2     B5,B4,B5          ; |32|
               ADD     .D2     B8,B7,B4          ; |32|
               SHL     .S2     B4,16,B4          ; |32|
               ADD     .S2     B5,B4,B4          ; |32|
    ||         LDW     .D2T2   *+SP(28),B7       ; |32|
               LDW     .D2T2   *+B6[B4],B4       ; |32|
               LDW     .D2T2   *+SP(24),B5       ; |32|
               NOP             4
               STW     .D2T2   B4,*+B7[B5]       ; |32|
               LDW     .D2T2   *+SP(24),B4       ; |32|
               NOP             4
               ADD     .D2     1,B4,B4           ; |32|
               STW     .D2T2   B4,*+SP(24)       ; |32|
               LDW     .D2T2   *+SP(24),B5       ; |32|
               LDW     .D2T2   *+SP(20),B4       ; |32|
               NOP             4
               CMPLT   .L2     B5,B4,B0          ; |32|
       [ B0]   B       .S1     L1                ; |32|
               NOP             5
               ; BRANCH OCCURS                   ; |32|
    ;----------------------------------------------------------------------
    ...

Assembly code with speed optimization

    _dwt_deinterleave:
    …
    ;** ------------------------------------------------------------------------
    ||         MV      .D2     B4,B11
               .line   5
               MV      .D2     B11,B0            ; |28|
               SHRU    .S2     B0,31,B4          ; |28|
               ADD     .D2     B4,B0,B4          ; |28|
               SHR     .S2     B4,1,B0           ; |28|
               MV      .D2     B0,B12            ; |28|
               .line   6
               ADD     .D2     1,B11,B10         ; |29|
               SHRU    .S2     B10,31,B4         ; |29|
               ADD     .D2     B4,B10,B4         ; |29|
               SHR     .S2     B4,1,B4           ; |29|
               MV      .S1X    B4,A12            ; |29|
               .line   7
               B       .S1     _malloc           ; |30|
               MVKL    .S2     RL0,B3            ; |30|
               SHL     .S1X    B11,2,A4          ; |30|
               MVKH    .S2     RL0,B3            ; |30|
               NOP             2
    RL0:       ; CALL OCCURS                     ; |30|
               .line   8
               CMPLT   .L2     B10,2,B0
       [ B0]   B       .S1     L2                ; |31|
               MV      .D2     B10,B4
       [!B0]   MV      .D1     A4,A3
       [!B0]   MV      .S1     A10,A0
               NOP             2
               ; BRANCH OCCURS                   ; |31|
    ;** --------------------------------------------------------------------------*
    ;**    -----------------------    U$22 = a;
    ;**    -----------------------    U$25 = b;
    ;** 32 -----------------------    L$1 = K$7>>1;
    ;**    -----------------------    X$4 = x<<3;
    ;**    -----------------------    #pragma MUST_ITERATE(1, 1073741823, 1)
               .line   9
               SHR     .S2     B4,1,B0           ; |32|
    ||         SHL     .S1     A11,3,A6
    ;**    -----------------------g3:
    ;** 32 -----------------------    *U$25++ = *U$22;
    ;** 32 -----------------------    U$22 += X$4;
    ;** 32 -----------------------    if ( --L$1 ) goto g3;
               SUB     .D2     B0,1,B0           ; |32|
    L1:
       [ B0]   B       .S1     L1                ; |32|
    ||         LDW     .D1T1   *A0,A5            ; |32|
               ADD     .S1     A6,A0,A0          ; |32|
       [ B0]   SUB     .D2     B0,1,B0           ; |32|
               NOP             2
               STW     .D1T1   A5,*A3++          ; |32|
               ; BRANCH OCCURS                   ; |32|
    ;** -----------------------------------------------------------------------*
    ...

Speed optimized code analysis

Loop under analysis: for (i=0; i<sn; i++) b[i]=a[2*i*x];

       [ B0]   B       .S1     L1                ; |32|
    ||         LDW     .D1T1   *A0,A5            ; |32|
               ADD     .S1     A6,A0,A0          ; |32|
       [ B0]   SUB     .D2     B0,1,B0           ; |32|
               NOP             2
               STW     .D1T1   A5,*A3++          ; |32|

• Each iteration takes 6 cycles: 1 for the parallel B and LDW, 1 for ADD, 1 for SUB, 2 for NOP 2, and 1 for STW
• Assume sn = n+1
• 6*(n+1) clock cycles are needed

Assembly code with pipeline optimization

               SHR     .S2     B4,1,B0
               CMPGT   .L2     B0,6,B1
       [ B1]   B       .S1     L2
               SHL     .S1     A10,3,A3
       [!B1]   SUB     .D2     B0,1,B0
               NOP             3
    ;** --------------------------------------------------------------------------*
    ...
    ;** --------------------------------------------------------------------------*
    L2:
               ADD     .S1     A3,A4,A4
    ||         SUB     .D2     B0,7,B0
    ||         LDW     .D1T1   *A4,A6
    ;** --------------------------------------------------------------------------*
    L3:        ; PIPELINED LOOP PRE-PROCESS
               MV      .S2X    A0,B4
    || [ B0]   B       .S1     L4
    ||         ADD     .L1     A3,A4,A0
    || [ B0]   SUB     .D2     B0,1,B0
    ||         LDW     .D1T1   *A4,A0

               ADD     .L1     A3,A0,A0
    || [ B0]   SUB     .D2     B0,1,B0
    ||         LDW     .D1T1   *A0,A0
    || [ B0]   B       .S1     L4

       [ B0]   B       .S1     L4
    ||         ADD     .L1     A3,A0,A0
    || [ B0]   SUB     .D2     B0,1,B0
    ||         LDW     .D1T1   *A0,A0

               ADD     .L1     A3,A0,A0
    || [ B0]   SUB     .D2     B0,1,B0
    ||         LDW     .D1T1   *A0,A0
    || [ B0]   B       .S1     L4

               MV      .S2X    A6,B5
    || [ B0]   B       .S1     L4
    ||         ADD     .L1     A3,A0,A4
    || [ B0]   SUB     .D2     B0,1,B0
    ||         LDW     .D1T1   *A0,A0
    ;** --------------------------------------------------------------------------*
    L4:        ; PIPELINED LOOP
               STW     .D2T2   B5,*B4++
    ||         MV      .S2X    A0,B5
    || [ B0]   B       .S1     L4
    ||         ADD     .L1     A3,A4,A4
    || [ B0]   SUB     .L2     B0,1,B0
    ||         LDW     .D1T1   *A4,A0
    ;** --------------------------------------------------------------------------*
    L5:        ; PIPELINED LOOP PAST-PROCESS
               MV      .S2X    A0,B5
    ||         STW     .D2T2   B5,*B4++

               MV      .S2X    A0,B5
    ||         STW     .D2T2   B5,*B4++

               MV      .S2X    A0,B5
    ||         STW     .D2T2   B5,*B4++

               MVC     .S2     B6,CSR
    ||         MV      .L2X    A0,B5
    ||         STW     .D2T2   B5,*B4++
    ;** --------------------------------------------------------------------------*
               MV      .S2X    A0,B5
    ||         STW     .D2T2   B5,*B4++

               STW     .D2T2   B5,*B4++
    ;** --------------------------------------------------------------------------*

Pipeline optimized code design

Loop under analysis: for (i=0; i<sn; i++) b[i]=a[2*i*x];

    L4:        ; PIPELINED LOOP
               STW     .D2T2   B5,*B4++
    ||         MV      .S2X    A0,B5
    || [ B0]   B       .S1     L4
    ||         ADD     .L1     A3,A4,A4
    || [ B0]   SUB     .L2     B0,1,B0
    ||         LDW     .D1T1   *A4,A0

• The loop kernel is a single execute packet, so it issues one iteration per cycle
• Assume sn = n+1
• n+7 clock cycles are needed
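The idea behind the pipelined kernel can be shown at the C level. The sketch below is only an illustration of the loop shape, not a translation of the assembly: the names `gather_plain` and `gather_pipelined_shape` are hypothetical, and the pipeline is only two stages deep here (load, then store), whereas the assembly keeps several loads in flight at once. The prologue issues the first load, the kernel overlaps the load for iteration i+1 with the store for iteration i, and the epilogue drains the last value.

```c
#include <assert.h>

/* Straightforward form of the loop from the slides: b[i] = a[2*i*x]. */
void gather_plain(int *b, const int *a, int sn, int x)
{
    int i;
    for (i = 0; i < sn; i++)
        b[i] = a[2 * i * x];
}

/* The same computation arranged as prologue / kernel / epilogue.
 * Hypothetical illustration of software pipelining, not the slide's code. */
void gather_pipelined_shape(int *b, const int *a, int sn, int x)
{
    int step = 2 * x;   /* distance between consecutive source samples */
    int idx = 0;        /* index of the next sample to load            */
    int in_flight;      /* value loaded but not yet stored             */
    int i;

    assert(x > 0);
    if (sn <= 0)
        return;

    /* Prologue: start the first load before any store can happen. */
    in_flight = a[idx];
    idx += step;

    /* Kernel: each pass loads sample i+1 and stores sample i. */
    for (i = 0; i < sn - 1; i++) {
        int next = a[idx];
        idx += step;
        b[i] = in_flight;
        in_flight = next;
    }

    /* Epilogue: store the value still in flight. */
    b[sn - 1] = in_flight;
}
```

One plausible reading of the n+7 figure is that the single-packet kernel accounts for roughly one cycle per iteration, and the prologue and epilogue add a small fixed cost that does not grow with the loop count.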

Comparison

Speed-optimized code (by the C6 compiler) vs. pipeline-optimized code (by us)

Loop: for (i=0; i<sn; i++) b[i]=a[2*i*x];
Assume sn = n+1.

Result:
• Speed-optimized code uses 6(n+1) clock cycles
• Pipeline-optimized code uses n+7 clock cycles
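To make the two cycle counts easier to compare, the sketch below simply evaluates the formulas quoted on the slides. The function names are hypothetical, and the formulas are taken as given rather than re-measured.

```c
#include <assert.h>

/* Cycle model quoted on the slides for the speed-optimized loop,
 * with sn = n + 1 iterations of 6 cycles each. */
long cycles_speed_optimized(long n)
{
    assert(n >= 0);
    return 6 * (n + 1);
}

/* Cycle model quoted on the slides for the hand-pipelined loop. */
long cycles_pipelined(long n)
{
    assert(n >= 0);
    return n + 7;
}

/* Ratio of the two models; values above 1 favor the pipelined loop. */
double speedup_estimate(long n)
{
    return (double)cycles_speed_optimized(n) / (double)cycles_pipelined(n);
}
```

As a worked example, the first column pass over the 200 x 165 test image has 165 samples, so sn = (165+1)/2 = 83 and n = 82. The models give 6 x 83 = 498 cycles against 82 + 7 = 89 cycles, a ratio of about 5.6. For n = 0 the ratio is below 1 (6 cycles against 7), and for large n it approaches 6.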

JPEG2000 Lossy Image Compression Encoder

[Figure: encoder block diagram with three stages in sequence: DWT, Quantizer, Entropy Coder]

1-Level Wavelet Decomposition (2D DWT)

[Figure: filter-bank diagram. The input image goes through row-wise operations, then column-wise operations. Each stage applies the filters H1 and H2 (low pass and high pass), and each filter is followed by a decimator by 2 that keeps one out of two pixels. The four outputs are the LL, HL, LH and HH components. Inset: x[n] passes through a filter and a decimator to give y[n].]
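The row-then-column structure of the figure can be sketched in a few lines of C. This is a minimal illustration under stated assumptions: it uses an unnormalized Haar-style sum and difference as a stand-in for the low-pass and high-pass filters (the filters used in the project are not shown in the slides), it requires even width and height, and it follows the common convention that HL means high pass along rows and low pass along columns. All function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Filter and decimate one strided line of even length n.
 * low gets the sum of each sample pair, high gets the difference;
 * both are stand-ins for real low-pass / high-pass filters. */
static void split_line(const int *in, int n, int in_stride,
                       int *low, int *high, int out_stride)
{
    int i;
    for (i = 0; i < n / 2; i++) {
        int even = in[(2 * i) * in_stride];
        int odd  = in[(2 * i + 1) * in_stride];
        low[i * out_stride]  = even + odd;
        high[i * out_stride] = even - odd;
    }
}

/* One decomposition level of a row-major w-by-h image.
 * ll, hl, lh, hh must each hold (w/2)*(h/2) ints, row-major.
 * Returns 0 on success, -1 on allocation failure. */
int dwt2d_one_level_sketch(const int *img, int w, int h,
                           int *ll, int *hl, int *lh, int *hh)
{
    int half_w = w / 2;
    int *lo, *hi;   /* row-filtered halves, each half_w wide and h tall */
    int r, c;

    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);

    lo = (int *)malloc((size_t)half_w * (size_t)h * sizeof(int));
    hi = (int *)malloc((size_t)half_w * (size_t)h * sizeof(int));
    if (lo == NULL || hi == NULL) {
        free(lo);
        free(hi);
        return -1;
    }

    /* Row-wise operations: every row is split into a low and a high half. */
    for (r = 0; r < h; r++)
        split_line(img + r * w, w, 1, lo + r * half_w, hi + r * half_w, 1);

    /* Column-wise operations on both halves give the four components. */
    for (c = 0; c < half_w; c++) {
        split_line(lo + c, h, half_w, ll + c, lh + c, half_w);
        split_line(hi + c, h, half_w, hl + c, hh + c, half_w);
    }

    free(lo);
    free(hi);
    return 0;
}
```

For the 2 x 2 image {1, 2, 3, 4} this produces LL = 10, HL = -2, LH = -4 and HH = 0.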

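Multi-Level Wavelet Decomposition

[Figure: the 2D-DWT applied twice. The first 2D-DWT splits the image into LL, HL1, LH1 and HH1. The second 2D-DWT splits the LL component again into LL, HL2, LH2 and HH2, while HL1, LH1 and HH1 are left unchanged.]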
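Because each level operates only on the previous LL component, the region being transformed shrinks by about half in each dimension per level. The sketch below computes that region size the way `dwt_encode` does with `int_ceildivpow2(w, i)`. The body of `int_ceildivpow2` is not shown in the slides, so `ceildivpow2_sketch` is an assumed definition (ceiling of a divided by 2 to the power b), and `level_size_sketch` is a hypothetical wrapper.

```c
#include <assert.h>

/* Assumed definition: ceil(a / 2^b) for non-negative a and 0 <= b < 31. */
int ceildivpow2_sketch(int a, int b)
{
    assert(a >= 0 && b >= 0 && b < 31);
    return (int)(((long long)a + (1LL << b) - 1) >> b);
}

/* Width and height of the region transformed at a given level,
 * where level 0 is the full image. */
void level_size_sketch(int w, int h, int level, int *rw, int *rh)
{
    *rw = ceildivpow2_sketch(w, level);
    *rh = ceildivpow2_sketch(h, level);
}
```

For the 200 x 165 image passed to `dwt_encode` in the C listing, levels 0 to 3 work on regions of 200 x 165, 100 x 83, 50 x 42 and 25 x 21 samples.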

Thanks! Questions?
