Optimization Of The Quant Component For Speed
as part of the seminar “Software Streaming Architecture”
Volker Martens
Why Performance Tuning?
• decrease waiting time while decoding
• a gain of 1 s per image
  • is unimportant for a single image
  • adds up to 37.5 h for a 90 min video (25 f/s): 90 × 60 × 25 = 135,000 images
• measure either time or clock cycles
  • tmsim: time is hard to measure, so clock cycles are used
How To Measure Clock Cycles?
• TriMedia custom operators
  • example:
        long start = CYCLES();
        ...
        long end = CYCLES();
        printf("This code used %ld clock cycles", end - start);
  • disadvantages:
    • increases the total number of cycles
    • the source code has to be changed
  • nested measurements are possible
• TriMedia compiler
  • tmsim: runs the program and saves execution statistics in <statfile>
        tmsim -statfile <statfile> <executable prog.>
  • tmprof: generates a report for each function
        tmprof -scale 1 -func <statfile> <executable prog.>
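The measurement pattern from this slide, including a nested measurement, can be tried on a host machine with the sketch below. CYCLES() is a TriMedia custom operator, so the sketch assumes a stand-in built on the standard C clock() function, which counts processor-time ticks rather than clock cycles. The names work_inner and measure_nested are hypothetical and only serve the illustration.

```c
#include <time.h>

/* Assumption: host-side stand-in for the TriMedia CYCLES() custom operator.
   clock() returns processor-time ticks, not real clock cycles. */
#define CYCLES() ((long)clock())

/* Hypothetical workload: sums the integers 0..n-1. */
static long work_inner(long n)
{
    long i, sum = 0;
    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Measures an outer region and a region nested inside it.
   Stores the elapsed ticks in *outer and *inner, returns the workload result. */
static long measure_nested(long n, long *outer, long *inner)
{
    long start_outer, start_inner, end_inner, end_outer, result;

    start_outer = CYCLES();
    result = work_inner(n);       /* part of the outer region only */
    start_inner = CYCLES();
    result += work_inner(n);      /* nested region */
    end_inner = CYCLES();
    end_outer = CYCLES();

    *inner = end_inner - start_inner;
    *outer = end_outer - start_outer;
    return result;
}
```

Each CYCLES() call adds a little work of its own, which is the first disadvantage listed on the slide: the outer figure includes the cost of the two inner calls.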
The Start Situation
- used functions in Quant.c and tmalQuant.c:

Function                   Executions  Total Cycles     (%)
-------------------------  ----------  ------------  ------
_QuantizeIntraDCTcoefMPEG         288       2969052   28.56
_CopyBlockFromFrame               288        684750    6.59
_checkrange                     18144        362909    3.49
_DC_Scaler                        576         51138    0.49
_QuantizeIntraDCCoef              288         39453    0.38
_QuantMacroblock                   48         27507    0.26
_tmalQuantProcessData               1         14355    0.14
_tmalQuantStart                     1          2332    0.02
-----------------------------------------------------------
total/average                   60784      10396474  100.00

- total: clock cycles over all functions
Forms Of Performance Tuning (1)
• Profile-driven compilation
  1. compile with profiling code:          tmcc -p <sourcefile> -o <outputfile>
  2. generate profile information:         tmsim <outputfile>
  3. recompile using profile information:  tmcc -r <sourcefile> -o <outputfile>
• the compiler then applies loop unrolling and restricted pointers
• changes to the source code require a new profile
• -G also performs grafting
Forms Of Performance Tuning (2)
• loop optimization
  • remove IF statements and function calls
  • loop fusion
• using cheaper operators
  • replace && and || by & and |, respectively
  • ...
• using custom operators
  • special operations for DSP applications
• manual loop unrolling
  • best for the most critical parts
• using restricted pointers
  • tell the compiler that pointers do not overlap
• ...
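Two of the techniques on this slide, cheaper operators and restricted pointers, can be sketched in standard C as follows. The function names are hypothetical, and the restrict qualifier shown is the C99 spelling, which is assumed here to stand in for the TriMedia compiler's equivalent.

```c
/* Range test with the short-circuit operator: may compile to two branches. */
static int in_range_logical(int a, int lo, int hi)
{
    return (a >= lo) && (a <= hi);
}

/* Same test with the bitwise operator: both comparisons are always
   evaluated. This is safe only because the operands have no side
   effects and each comparison yields exactly 0 or 1. */
static int in_range_bitwise(int a, int lo, int hi)
{
    return (a >= lo) & (a <= hi);
}

/* restrict promises the compiler that dst and src never overlap, so it
   may reorder or combine the loads and stores in the loop. */
static void add_arrays(int *restrict dst, const int *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] += src[i];
}
```

The bitwise form is not a drop-in replacement when the right operand guards against an invalid access, for example `(p != NULL) && (*p > 0)`; there the short-circuit operator has to stay.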
Performed Optimizations (1)
QuantizeIntraDCTcoefMPEG (1)

    int checkrange (int x, int cMin, int cMax)
    {
        if (x < cMin) return cMin;
        if (x > cMax) return cMax;
        return x;
    }
    ...
    iScaledCoef = checkrange (iScaledCoef, -iMaxVal, iMaxVal - 1);
    iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
    rgiDCTcoefQ [i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

• checkrange() is called 18144 times: replaced by inlining and custom ops
• the formula converts from int to float and back
• calls to min() and max() are replaced by custom ops
Performed Optimizations (2)
QuantizeIntraDCTcoefMPEG (2)

    // old code
    iScaledCoef = checkrange(iScaledCoef, -iMaxVal, iMaxVal-1);
    iScaledQP = (int) (3.0F * (Float) iQP / 4.0F + 0.5);
    rgiDCTcoefQ[i] = min(iMaxAC, max(-iMaxAC, iScaledCoef));

    // faster code
    iScaledCoef = IMIN(iScaledCoef, iMaxVal - 1);
    iScaledCoef = IMAX(iScaledCoef, -iMaxVal);
    iScaledQP = (3*iQP+2) >> 2;
    rgiDCTcoefQ[i] = IMIN(iMaxAC, IMAX(-iMaxAC, iScaledCoef));

    -   766,000 cycles
    -   400,000 cycles
    ==================
    - 1,166,000 cycles
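That the faster code computes the same values as the old code can be checked with a small host-side sketch. IMIN and IMAX are TriMedia custom operators; the macros below are assumed portable stand-ins, and the wrapper function names are hypothetical. The integer formula matches the float formula for non-negative iQP, which is assumed to be the only case that occurs for a quantizer parameter.

```c
/* Assumption: portable stand-ins for the TriMedia IMIN/IMAX custom operators. */
#define IMIN(a, b) ((a) < (b) ? (a) : (b))
#define IMAX(a, b) ((a) > (b) ? (a) : (b))

/* Old code: range check through a function call. */
static int checkrange(int x, int cMin, int cMax)
{
    if (x < cMin) return cMin;
    if (x > cMax) return cMax;
    return x;
}

/* Faster code: the same clamp to [-iMaxVal, iMaxVal - 1] without a call. */
static int clamp_custom(int x, int iMaxVal)
{
    x = IMIN(x, iMaxVal - 1);
    x = IMAX(x, -iMaxVal);
    return x;
}

/* Old code: round(3 * iQP / 4) through float conversions. */
static int scaled_qp_float(int iQP)
{
    return (int)(3.0F * (float)iQP / 4.0F + 0.5);
}

/* Faster code: same rounding in integer arithmetic.
   floor(3q/4 + 1/2) == floor((3q + 2) / 4) for q >= 0. */
static int scaled_qp_int(int iQP)
{
    return (3 * iQP + 2) >> 2;
}
```

For negative iQP the two formulas can differ, because the float version truncates toward zero while the shift rounds toward minus infinity on common compilers.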
Performed Optimizations (3)
CopyBlockFromFrame (1)

    for (j=0;j<blocksize;j++) {
        for (i=0;i<blocksize;i++) {
            x0 = bx*blocksize + i;
            y0 = by*blocksize + j;
            start = y0*xsize + x0;
            dest[j*blocksize+i] = frame[start];
        }
    }

• 1. Loop optimization
  • overhead reduced: computations moved out of the inner loop and done once before it
• 2. Loop unrolling
  • several copies per iteration, so the inner loop runs fewer times
Performed Optimizations (4)
CopyBlockFromFrame (2)

    int startdest;
    x0 = bx*blocksize;
    y0 = by*blocksize;
    startdest = 0;
    start = y0*xsize + x0;
    for (j=0;j<blocksize;j++) {
        for (i=0;i<blocksize-1;i+=4) {
            dest[startdest+i]   = frame[start+i];
            dest[startdest+i+1] = frame[start+i+1];
            dest[startdest+i+2] = frame[start+i+2];
            dest[startdest+i+3] = frame[start+i+3];
        }
        startdest += blocksize;
        start += xsize;
    }

    - 125,000 cycles
    - 275,000 cycles
    ================
    - 400,000 cycles

Parameter blocksize must be a multiple of 4!
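The original and the unrolled copy loop can be compared side by side with the sketch below. The function signatures and the int element type are assumptions made for the illustration; the slides only show the loop bodies.

```c
#include <assert.h>

/* Original version: recomputes the source index for every element. */
static void copy_block_ref(int *dest, const int *frame,
                           int bx, int by, int blocksize, int xsize)
{
    int i, j;
    for (j = 0; j < blocksize; j++) {
        for (i = 0; i < blocksize; i++) {
            int x0 = bx * blocksize + i;
            int y0 = by * blocksize + j;
            dest[j * blocksize + i] = frame[y0 * xsize + x0];
        }
    }
}

/* Optimized version: start offsets computed once, inner loop unrolled
   four times. Requires blocksize to be a multiple of 4. */
static void copy_block_fast(int *dest, const int *frame,
                            int bx, int by, int blocksize, int xsize)
{
    int i, j;
    int startdest = 0;
    int start = (by * blocksize) * xsize + bx * blocksize;

    assert(blocksize % 4 == 0);
    for (j = 0; j < blocksize; j++) {
        for (i = 0; i < blocksize; i += 4) {
            dest[startdest + i]     = frame[start + i];
            dest[startdest + i + 1] = frame[start + i + 1];
            dest[startdest + i + 2] = frame[start + i + 2];
            dest[startdest + i + 3] = frame[start + i + 3];
        }
        startdest += blocksize;   /* next row of the block */
        start += xsize;           /* next row of the frame */
    }
}
```

The sketch writes the inner loop bound as `i < blocksize`; for a blocksize that is a multiple of 4 this runs the same iterations as the `i < blocksize-1` bound on the slide.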
Performed Optimizations (5)

DC_Scaler
If-expression rebuilt                                 - 30,000 cycles

    // old code
    if ((a >= 1) && (a <= 4)) result = ...;
    else if ((a >= 5) && (a <= 8)) result = ...;
    else if ...

    // faster code
    if (a >= 1)
        if (a >= 5)
            if (a >= 9)
                ...
            else return ...;
        else return ...;
    return -1;

QuantizeIntraDCCoef
min() and max() replaced by IMIN and IMAX             - 22,000 cycles

tmalQuantProcessData & DC_Scaler
*2 and /2 replaced by << 1 and >> 1                   - 20,000 cycles
                                                      ===============
                                                      - 72,000 cycles
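The rebuilt if-expression and the shift replacements can be illustrated with the sketch below. The slide elides the actual DC_Scaler results, so the return values 10, 20 and 30 and the three ranges used here are hypothetical placeholders, as are all function names.

```c
/* Chain of range tests: up to two comparisons per range tried.
   The values 10, 20, 30 are placeholders, not the real scaler values. */
static int scaler_chain(int a)
{
    int result = -1;
    if ((a >= 1) && (a <= 4)) result = 10;
    else if ((a >= 5) && (a <= 8)) result = 20;
    else if (a >= 9) result = 30;
    return result;
}

/* Nested lower-bound tests: each upper bound is implied by the failed
   test for the next range, so one comparison per level is enough. */
static int scaler_nested(int a)
{
    if (a >= 1) {
        if (a >= 5) {
            if (a >= 9)
                return 30;
            return 20;      /* 5..8 */
        }
        return 10;          /* 1..4 */
    }
    return -1;              /* a < 1 */
}

/* Shift forms of *2 and /2; equivalent for non-negative a
   (assuming a << 1 does not overflow). */
static int double_shift(int a) { return a << 1; }
static int halve_shift(int a)  { return a >> 1; }
```

For negative operands `a / 2` truncates toward zero while `a >> 1` typically rounds down, so the shift replacement is only safe where the values are known to be non-negative.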
Optimization Results (1)

                                       original functions   |  optimized functions
Function                   Executions  Total Cycles     (%) | Total Cycles     (%)
-------------------------  ----------  ------------  ------ | ------------  ------
_QuantizeIntraDCTcoefMPEG         288       2969052   28.56 |      2170540   24.72
_CopyBlockFromFrame               288        684750    6.59 |       280202    3.19
_checkrange                     18144        362909    3.49 |            -       -
_DC_Scaler                        576         51138    0.49 |        16737    0.19
_QuantizeIntraDCCoef              288         39453    0.38 |        21457    0.24
_QuantMacroblock                   48         27507    0.26 |        27594    0.31
_tmalQuantProcessData               1         14355    0.14 |        14371    0.16
_tmalQuantStart                     1          2332    0.02 |         2424    0.03
----------------------------------------------------------------------------------
total/average                   60784      10396474  100.00 |      8780103  100.00

Only functions from Quant.c and tmalQuant.c
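Optimization Results (2)
• -39.0% cycles in the optimized functions (4,151,496 to 2,533,325, summed over the functions listed from Quant.c and tmalQuant.c)
• -15.5% cycles over all functions (10,396,474 to 8,780,103)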