SYSC 5704: Elements of Computer Systems. Optimization to take advantage of hardware. Fall 2011.
Objectives
Simple things to speed up your program.
• Optimize
• Watch procedure calls
• Code motion
• Strength reduction
• Common expression removal
Example: Matrix Multiplication
[Figure: performance of the best code vs. the triple loop; the gap is 160x.]
• The triple-loop code is not obviously stupid.
• Standard desktop computer and compiler, using optimization flags.
• Both implementations have exactly the same operation count (2n^3).
What is going on?
How did they do it?
• Multiple threads (4x)
• Vector instructions (4x)
• Memory hierarchy and other optimizations (20x)
• Blocking or tiling, loop unrolling, array scalarization, instruction scheduling.
• More instruction-level parallelism, better register usage, fewer L1/L2 cache misses, fewer TLB misses.
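As a rough illustration of the blocking (tiling) and array scalarization items above, here is a minimal sketch in C. It is not the "best code" from the example: the names `mmm_naive` and `mmm_blocked` and the tile size `BLOCK` are assumptions made for this sketch, and a tuned implementation would pick the tile size to fit the cache of the target machine.

```c
#include <stddef.h>

/* Hypothetical tile size, chosen only for illustration. */
#define BLOCK 16

/* Reference triple loop: c += a * b for n x n row-major matrices. */
void mmm_naive(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Blocked version: same 2n^3 operations, but the work is done one
   BLOCK x BLOCK tile at a time so the data touched stays cache-resident. */
void mmm_blocked(const double *a, const double *b, double *c, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                /* Clamp tile edges so n need not be a multiple of BLOCK. */
                size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
                size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
                size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        /* Array scalarization: keep a[i][k] in a local. */
                        double aik = a[i*n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
            }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. How much speedup the blocked version gives depends on the machine and on n; for small matrices that already fit in cache it may give none.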
The effect of naive coding
One can lose 10-100x in performance or more!
• Algorithm (O(n^2) vs. O(log n))
• Coding style (too many procedure calls, reordering, unrolling)
• Algorithm structure (locality, instruction-level parallelism)
• Data representation
This is why we need to understand computer architecture!
Hint 1: Use the optimizer!

    double a[4][4];
    double b[4][4];
    double c[4][4];  /* set to zero */

    /* Multiply 4 x 4 matrices a and b, accumulating into c.
       n is unused here: the size is fixed at 4. */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < 4; i++)
            for (j = 0; j < 4; j++)
                for (k = 0; k < 4; k++)
                    c[i*4 + j] += a[i*4 + k] * b[k*4 + j];
    }

Compiled without flags: ~1300 cycles
Compiled with -O3 -m64 -march=… -fno-tree-vectorize: ~150 cycles
Core 2 Duo, 2.66 GHz
Roadblocks
The compiler is conservative.
Aliasing (pointers) causes trouble.
Whole-program optimization is too expensive.
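To make the aliasing roadblock concrete, here is a minimal sketch in C. The function names `sum_rows_alias` and `sum_rows_local` are made up for this example. In the first version the compiler has to assume that `b` might point into `a`, so it generally cannot keep `b[i]` in a register across the inner loop. The second version accumulates in a local variable and uses the C99 `restrict` qualifier to promise that the two arrays do not overlap.

```c
#include <stddef.h>

/* b[i] = sum of row i of the n x n row-major matrix a.
   Because b could alias a, every b[i] += ... may have to go
   through memory on each inner iteration. */
void sum_rows_alias(const double *a, double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

/* Same result when a and b do not overlap. The local accumulator can
   live in a register, and restrict (C99) tells the compiler that a
   and b are distinct. Calling this with overlapping arrays is an error. */
void sum_rows_local(const double *restrict a, double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i*n + j];
        b[i] = sum;
    }
}
```

The two functions agree only when the arrays really are distinct, which is exactly the fact the compiler cannot prove on its own.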
Hint 2: Procedure calls
Small procedures are better for software engineering, but can be costly.
• Costs go way up if the procedure checks its arguments.
• Check bounds outside of the loop, and design by contract.
Inline!
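One way to read "check bounds outside of the loop" is sketched below in C. The `vec` type and the functions `vec_get`, `sum_checked`, and `sum_hoisted` are hypothetical names invented for this illustration. The first summation pays for a call and an argument check on every element; the second validates the whole range once and then runs a plain loop.

```c
#include <stddef.h>

/* Hypothetical bounds-carrying array type for this sketch. */
typedef struct {
    double *data;
    size_t  len;
} vec;

/* Checked accessor: validates its arguments on every call.
   Returns 0 on success, -1 on a bad pointer or index. */
int vec_get(const vec *v, size_t i, double *out)
{
    if (v == NULL || i >= v->len)
        return -1;
    *out = v->data[i];
    return 0;
}

/* Sums the first n elements through the checked accessor:
   one call and one check per element. */
int sum_checked(const vec *v, size_t n, double *sum)
{
    double s = 0.0, x;
    for (size_t i = 0; i < n; i++) {
        if (vec_get(v, i, &x) != 0)
            return -1;
        s += x;
    }
    *sum = s;
    return 0;
}

/* Same contract, but the range is checked once, before the loop,
   and the element access is written inline. */
int sum_hoisted(const vec *v, size_t n, double *sum)
{
    if (v == NULL || n > v->len)
        return -1;
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v->data[i];
    *sum = s;
    return 0;
}
```

An optimizing compiler may inline `vec_get` by itself when it can see the definition, so the measured difference depends on the compiler and flags.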
Hint 3: Code Motion
• Reduce the frequency with which a computation is performed, if it will always produce the same result.
• Especially: move code out of loops.
• Sometimes also called precomputation or hoisting.

Before:

    void set_row(double *a, double *b, long i, long n) {
        long j;
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];
    }

After:

    void set_row(double *a, double *b, long i, long n) {
        long j;
        long ni = n*i;  /* loop-invariant product, computed once */
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
    }
Even worse example!

    #include <string.h>  /* strlen */

    void lower(char *s) {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

strlen loops until it finds a null, so we call it over and over again! Move the call to strlen outside of the loop, since the result does not change from one iteration to another.

    void lower(char *s) {
        size_t i;
        size_t len = strlen(s);  /* computed once */
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

What about 'A' - 'a'?
Performance
[Figure: CPU seconds (log scale) vs. string length (log scale)]
• Original: time quadruples when the string length doubles (quadratic performance).
• Hoisting results in linear performance.
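The quadratic behavior can also be seen without a timer by counting calls. The sketch below is in C and uses a hypothetical `counted_strlen` (a hand-written stand-in for `strlen` that increments a global counter) together with two variants of `lower`, here named `lower_slow` and `lower_fast`. None of these names come from the slides.

```c
#include <stddef.h>

/* Number of times counted_strlen has been called (for illustration). */
unsigned long strlen_calls = 0;

/* Stand-in for strlen that counts its own invocations.
   Each call walks the whole string, so its cost is linear in the length. */
size_t counted_strlen(const char *s)
{
    size_t len = 0;
    strlen_calls++;
    while (s[len] != '\0')
        len++;
    return len;
}

/* Length recomputed in the loop test: len + 1 calls, each linear. */
void lower_slow(char *s)
{
    for (size_t i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* Length hoisted out of the loop: exactly one call. */
void lower_fast(char *s)
{
    size_t len = counted_strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

For a string of length n, `lower_slow` makes n + 1 calls that each scan n characters, which is where the quadratic growth comes from, while `lower_fast` makes a single call.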
Hint 4: Strength Reduction
• Replace a costly operation with a simpler one.
• Example: shift/add instead of multiply or divide: 16*x → x << 4
• Utility is machine dependent: it depends on the cost of the multiply or divide instruction.
• On Pentium IV, integer multiply requires 10 CPU cycles.
• Example: recognize a sequence of products.

Before:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

After:

    int ni = 0;  /* tracks n*i, updated by addition */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }
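The shift example can be written out as a small C sketch. The function names below are made up for illustration. Unsigned operands are used on purpose: in C, left-shifting a negative signed value is undefined, and dividing a negative value by 16 rounds toward zero while an arithmetic right shift typically rounds down, so the two forms are only interchangeable for non-negative values.

```c
/* Multiply by 16 with the multiply instruction (as written). */
unsigned long times16_mul(unsigned long x)
{
    return 16UL * x;
}

/* Strength-reduced form: 16 = 2^4, so shift left by 4. */
unsigned long times16_shift(unsigned long x)
{
    return x << 4;
}

/* Divide by 16 with the divide instruction (as written). */
unsigned long div16_div(unsigned long x)
{
    return x / 16UL;
}

/* Strength-reduced form: shift right by 4. */
unsigned long div16_shift(unsigned long x)
{
    return x >> 4;
}
```

Most optimizing compilers perform this rewrite on their own for constant powers of two, so writing the shift by hand mainly matters when the optimizer is off or cannot see the constant.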
Hint 5: Share Common Subexpressions
• Reuse portions of expressions (factoring!).
• Compilers are often not very sophisticated in exploiting arithmetic properties.

3 mults: i*n, (i-1)*n, (i+1)*n

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j];
    down  = val[(i+1)*n + j];
    left  = val[i*n + j-1];
    right = val[i*n + j+1];
    sum = up + down + left + right;

1 mult: i*n

    int inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;