
High-Performance Grid Computing and Research Networking



  1. High-Performance Grid Computing and Research Networking: High-Performance Sequential Programming. Presented by Juan Carlos Martinez. Instructor: S. Masoud Sadjadi, http://www.cs.fiu.edu/~sadjadi/Teaching/, sadjadi At cs Dot fiu Dot edu

  2. Acknowledgements • The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks! • Henri Casanova • Principles of High Performance Computing • http://navet.ics.hawaii.edu/~casanova • henric@hawaii.edu

  3. Sequential Programs • In this class we’re mostly focusing on concurrent programs • But it’s useful to recall some simple notions of high performance for sequential programs • Because some fundamental techniques are meaningful for concurrent programs • Because in your projects you’ll have to get code to go fast, and a concurrent program is just simultaneous sequential programs • We’ll look at • Standard code optimization techniques • Optimizations dealing with memory issues

  4. Loop Constants • Identifying loop constants:

    for (k=0; k<N; k++) {
      c[i][j] += a[i][k] * b[k][j];
    }

  becomes

    sum = 0;
    for (k=0; k<N; k++) {
      sum += a[i][k] * b[k][j];
    }
    c[i][j] = sum;
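A self-contained sketch of the same transformation in context may help; the function names, the use of double, and the matrix size N are assumptions for illustration, not part of the slide:

    #define N 512

    /* Before: c[i][j] is re-addressed on every k iteration. */
    void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* After: the loop constant c[i][j] is kept in a scalar (likely a register)
       and written back once per (i, j) pair; starting sum from c[i][j]
       keeps the accumulation semantics of the naive version. */
    void matmul_hoisted(double a[N][N], double b[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = c[i][j];
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }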

  5. Multi-dimensional Array Accesses • A static 2-D array is one declared as <type> <name>[<size>][<size>] int myarray[10][30]; • The elements of a 2-D array are stored in contiguous memory cells • The problem is that: • The array is 2-D, conceptually • Computer memory is 1-D • 1-D computer memory: a memory location is described by a single number, its address • Just like a single axis • Therefore, there must be a mapping from 2-D to 1-D • From a 2-D abstraction to a 1-D implementation

  6. Mapping from 2-D to 1-D? • For an n×n 2-D array mapped onto 1-D computer memory there are (n²)! possible mappings (Figure: two different 2-D to 1-D mappings of the same n×n array into 1-D computer memory)

  7. Row-Major, Column-Major • Luckily, only 2 of the (n²)! mappings are ever implemented in a language • Row-Major: rows are stored contiguously • Column-Major: columns are stored contiguously (Figure: the same 4×4 array laid out row by row, 1st through 4th row, and column by column, 1st through 4th column)

  8. Row-Major • C uses row-major order • Rows are laid out one after another in memory, so the elements of a row are stored in contiguous memory/cache lines (Figure: matrix rows mapped onto memory lines)

  9. Column-Major • FORTRAN uses column-major order • Columns are laid out one after another in memory, so the elements of a column are stored in contiguous memory/cache lines (Figure: matrix columns mapped onto memory lines)

  10. Address Computation • For an M×N row-major array (M rows of N elements): @(a[i][j]) = @(a[0][0]) + i*N + j • Detail: there should be a sizeof() factor as well • Example with N = 6: @(a[2][3]) = @(a[0][0]) + 2*6 + 3 = @(a[0][0]) + 15 • For column-major (as in FORTRAN), the formula is reversed: @(a[i][j]) = @(a[0][0]) + j*M + i, or, with 1-based indexing, @(a[i][j]) = @(a[1][1]) + (j-1)*M + (i-1)
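A small check of the row-major formula, with the sizeof() factor mentioned above made explicit; the 4×6 shape, the indices, and the use of a char pointer for byte arithmetic are assumptions chosen so that a[2][3] is a valid element:

    #include <stdio.h>

    int main(void) {
        int a[4][6];                     /* 4 rows of N = 6 ints (illustrative shape) */
        int i = 2, j = 3, N = 6;

        char *base     = (char *)&a[0][0];
        char *computed = base + (i * N + j) * sizeof(int);   /* @a[0][0] + (i*N + j)*sizeof() */

        printf("&a[2][3]      = %p\n", (void *)&a[i][j]);
        printf("formula gives   %p\n", (void *)computed);    /* same address */
        return 0;
    }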

  11. Array Accesses are Expensive • Given that the formula is @(a[i][j]) = @(a[0][0]) + i*N + j • Each array access entails 2 additions and 1 multiplication • This is even higher for higher-dimensional arrays • Therefore, when the compiler compiles the instruction sum += a[i][k] * b[k][j]; 4 integer additions and 2 integer multiplications are generated just to compute addresses! And then 1 fp multiplication and 1 fp addition • If the bottleneck is memory, then we don’t care • But if the processor is not starved for data (which we will see is possible for this application), then the overhead of computing addresses is large

  12. Removing Array Accesses • Replace array accesses by pointer dereferences:

    for (j=0; j<N; j++)
      a[i][j] = 2;              // 2*N adds, N multiplies

  becomes

    double *ptr = &(a[i][0]);   // 2 adds, 1 multiply
    for (j=0; j<N; j++) {
      *ptr = 2;
      ptr++;                    // N integer additions
    }

  13. Loop Unrolling • Loop unrolling:

    for (i=0; i<100; i++)   // 100 comparisons
      a[i] = i;

  becomes

    i = 0;
    do {
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
      a[i] = i; i++;
    } while (i < 100);      // 25 comparisons

  14. Loop Unrolling • One can unroll a loop by more (or less) than 4-fold • If the unrolling factor does not divide the number of iterations, then one must add a few peeled iterations before (or after) the loop, as in the sketch below • Trade-off: performance gain vs. code size
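A sketch of 4-fold unrolling when the trip count is not necessarily a multiple of 4; the function, the array name, and the choice to peel the leftover iterations before the main loop are assumptions for illustration:

    void fill(int *a, int n) {
        int i = 0;

        /* Peeled iterations: handle the n % 4 leftover elements first,
           so the unrolled loop below always runs a multiple of 4 times. */
        for (; i < n % 4; i++)
            a[i] = i;

        /* Main loop, unrolled 4-fold: one comparison per 4 assignments. */
        for (; i < n; i += 4) {
            a[i]     = i;
            a[i + 1] = i + 1;
            a[i + 2] = i + 2;
            a[i + 3] = i + 3;
        }
    }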

  15. Code Motion • Code motion:

    sum = 0;
    for (i = 0; i <= fact(n); i++)
      sum += i;

  becomes

    sum = 0;
    f = fact(n);
    for (i = 0; i <= f; i++)
      sum += i;

  16. Inlining • Inlining:

    for (i=0; i<N; i++)
      sum += cube(i);
    ...
    int cube(int i) { return (i*i*i); }

  becomes

    for (i=0; i<N; i++)
      sum += i*i*i;
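In C, one common way to invite inlining is to define small functions as static inline (a hint the compiler may ignore); the sketch below reuses the cube example, and the surrounding function is an assumption for illustration:

    /* "static inline" makes the body visible in this translation unit and
       invites the compiler to inline it at the call site. */
    static inline int cube(int i) {
        return i * i * i;
    }

    int sum_of_cubes(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += cube(i);    /* typically compiled as sum += i*i*i, with no call overhead */
        return sum;
    }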

  17. Common Sub-expression • Common sub-expression elimination:

    x = a + b - c;
    y = a + d + e + b;

  becomes

    tmp = a + b;
    x = tmp - c;
    y = tmp + d + e;

  18. Dead code • Dead code elimination:

    x = 12;
    ...
    x = a+c;

  becomes

    ...
    x = a+c;

  • Seems obvious, but dead code may be “hidden”:

    int x = 0;
    ...
    #ifdef FOO
    x = f(3);
    #else
    ...
    #endif

  19. Other Techniques • Strength reduction:

    a = i*3;        becomes        a = i+i+i;

  • Constant propagation:

    int speedup = 3;
    efficiency = 100 * speedup / numprocs;
    x = efficiency * 2;

  becomes

    x = 600 / numprocs;
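Strength reduction is most often applied by compilers inside loops, where a multiplication by the induction variable can be carried as a running sum; a minimal sketch, with the function and array names assumed for illustration:

    /* Before: one multiplication per iteration. */
    void scale_before(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = i * 3;
    }

    /* After strength reduction: the product i*3 is carried as a running sum,
       so each iteration needs only an addition. */
    void scale_after(int *a, int n) {
        int i3 = 0;                 /* invariant: i3 == i * 3 */
        for (int i = 0; i < n; i++) {
            a[i] = i3;
            i3 += 3;
        }
    }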

  20. So where are we? • We have seen a few optimization techniques, but there are many others! • We could apply them all to the code, but this would result in completely unreadable/undebuggable code • Fortunately, the compiler should come to the rescue • To some extent, at least • Some compilers can do a lot for you, some not so much • Typically, compilers provided by a vendor can do pretty tricky optimizations

  21. What do compilers do? • All modern compilers perform some automatic optimization when generating code • In fact, you implement some of those in a graduate-level compiler class, and sometimes at the undergraduate level • Most compilers provide several levels of optimization • -O0: no optimization • in fact some is always done • -O1, -O2, .... -OX • The higher the optimization level, the higher the probability that a debugger will have trouble dealing with the code • Always debug with -O0 • some compilers enforce that -g means -O0 • Some compilers will flat out tell you that higher levels of optimization may break some code!

  22. Compiler optimizations • In this class we use gcc, which is free and pretty good • -Os: optimize for size • Some optimizations increase code size tremendously • Do a “man gcc” and look at the many optimization options • one can pick and choose, • or just use the standard sets via -O1, -O2, etc. • The fanciest compilers are typically the ones provided by vendors • You can’t sell a good machine if it has a bad compiler • Compiler technology used to be really poor • also, languages used to be designed without thinking of compilers (FORTRAN, Ada) • no longer true: every language designer has an in-depth understanding of compiler technology today

  23. What can compilers do? • Most of the techniques we’ve seen!! • Inlining • Assignment of variables to registers • It’s a difficult problem • Dead code elimination • Algebraic simplification • Moving invariant code out of loops • Constant propagation • Control flow simplification • Instruction scheduling, reordering • Strength reduction • e.g., add to pointers, rather than doing array index computation • Loop unrolling and software pipelining • Dead store elimination • and many other......


  25. Instruction scheduling • Modern computers have multiple functional units that could be used in parallel • Or at least ones that are pipelined • if fed operands at each cycle they can produce a result at each cycle • although a computation may require 20 cycles • Instruction scheduling: • Reorder the instructions of a program • e.g., at the assembly code level • Preserve correctness • Make it possible to use functional units optimally

  26. Instruction Scheduling • One cannot just shuffle all instructions around • Preserving correctness means that data dependences are unchanged • Three types of data dependences:
  • True dependence:    a = ...   followed by   ... = a
  • Output dependence:  a = ...   followed by   a = ...
  • Anti dependence:    ... = a   followed by   a = ...
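A source-level sketch of the three dependence kinds (variable names are assumptions); any reordering of these statements must preserve all three:

    void deps(int b, int d) {
        int a, c;
        a = b + 1;      /* writes a, reads b */
        c = a * 2;      /* true dependence: reads a after the write above (read-after-write) */
        b = d + 3;      /* anti dependence: writes b, which the first statement read (write-after-read) */
        a = d - 1;      /* output dependence: writes a again (write-after-write); the order of the two writes matters */
        (void)c;        /* silence unused-variable warnings in this sketch */
    }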

  27. Instruction Scheduling Example

    Before:               After:
      ADD  R1,R2,R4         ADD  R1,R2,R4
      ADD  R2,R2,1          LOAD R4,@2
      ADD  R3,R6,R2         ADD  R2,R2,1
      LOAD R4,@2            ADD  R3,R6,R2

  • Since loading from memory can take many cycles, one may as well do it as early as possible • The LOAD cannot be moved any earlier because of the anti-dependence on R4 with the first ADD

  28. Software Pipelining • Fancy name for “instruction scheduling for loops” • Can be done by a good compiler • First unroll the loop • Then make sure that instructions can happen in parallel • i.e., “scheduling” them on functional units • Let’s see a simple example

  29. Example • Source code: for (i=0; i<n; i++) sum += a[i];
  • Loop body in assembly:

    r1 = L r0
    ---              ; stall
    r2 = Add r2,r1
    r0 = Add r0,4

  • Unroll the loop & allocate registers (may be very difficult):

    r1 = L r0
    ---              ; stall
    r2 = Add r2,r1
    r0 = Add r0,12
    r4 = L r3
    ---              ; stall
    r2 = Add r2,r4
    r3 = Add r3,12
    r7 = L r6
    ---              ; stall
    r2 = Add r2,r7
    r6 = Add r6,12
    r10 = L r9
    ---              ; stall
    r2 = Add r2,r10
    r9 = Add r9,12

  30. Example (cont.) • Schedule the unrolled instructions, exploiting instruction-level parallelism if possible (one line per cycle; instructions on the same line issue together):

    r1 = L r0
    r4 = L r3
    r2 = Add r2,r1    r7 = L r6     r0 = Add r0,12
    r2 = Add r2,r4    r10 = L r9    r3 = Add r3,12
    r2 = Add r2,r7    r1 = L r0     r6 = Add r6,12
    r2 = Add r2,r10   r4 = L r3     r9 = Add r9,12
    r2 = Add r2,r1    r7 = L r6     r0 = Add r0,12
    r2 = Add r2,r4    r10 = L r9    r3 = Add r3,12
    r2 = Add r2,r7    r1 = L r0     r6 = Add r6,12
    r2 = Add r2,r10   r4 = L r3     r9 = Add r9,12
    r2 = Add r2,r1    r7 = L r6     r0 = Add r0,12
    . . .
    r2 = Add r2,r4    r10 = L r9    r3 = Add r3,12
    r2 = Add r2,r7    r6 = Add r6,12
    r2 = Add r2,r10   r9 = Add r9,12

  • Identify the repeating pattern (the kernel)

  31. Example (cont.) • The loop becomes a prologue, a repeated kernel, and an epilogue:

    prologue:
      r1 = L r0
      r4 = L r3
      r2 = Add r2,r1    r7 = L r6     r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9    r3 = Add r3,12

    kernel (repeated):
      r2 = Add r2,r7    r1 = L r0     r6 = Add r6,12
      r2 = Add r2,r10   r4 = L r3     r9 = Add r9,12
      r2 = Add r2,r1    r7 = L r6     r0 = Add r0,12
      r2 = Add r2,r4    r10 = L r9    r3 = Add r3,12

    epilogue:
      r2 = Add r2,r7    r6 = Add r6,12
      r2 = Add r2,r10   r9 = Add r9,12

  32. Software Pipelining • The “kernel” may require many registers and it’s nice to know how to use as few as possible • otherwise, one may have to go to cache more often, which may negate the benefits of software pipelining • Dependency constraints must be respected • May be very difficult to analyze for complex nested loops • Software pipelining with registers is a very well-known NP-hard problem

  33. Limits to Compiler Optimization • Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles • e.g., data ranges may be more limited than variable types suggest • e.g., using an “int” in C for what could be an enumerated type • Most analysis is performed only within procedures • whole-program analysis is too expensive in most cases • Most analysis is based only on static information • compiler has difficulty anticipating run-time inputs • When in doubt, the compiler must be conservative • cannot perform optimization if it changes program behavior under any realizable circumstance • even if circumstances seem quite bizarre and unlikely

  34. Good practice • Writing code for high performance means working hand-in-hand with the compiler • #1: Optimize things that we know the compiler cannot deal with • For instance, the “blocking” optimization for matrix multiplication may need to be done by hand (a sketch follows below) • But some compilers may find the best i-j-k ordering!! • #2: Write code so that the compiler can do its optimizations • Remove optimization blockers
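A minimal sketch of the blocking (tiling) idea mentioned in #1, assuming square N×N matrices, a tile size B that divides N, and that a few B×B tiles fit in cache; none of these names or sizes come from the slides:

    #define N 512
    #define B 64          /* tile size: chosen so the working set of a tile fits in cache (assumption) */

    /* Blocked (tiled) matrix multiply: each tile of a, b, and c is reused
       while it is still resident in cache, improving temporal locality. */
    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++) {
                            double sum = c[i][j];
                            for (int k = kk; k < kk + B; k++)
                                sum += a[i][k] * b[k][j];
                            c[i][j] = sum;
                        }
    }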

  35. Optimization blocker: aliasing • Aliasing: two pointers point to the same location • If a compiler can’t tell what a pointer points at, it must assume it can point at almost anything • Example:

    void foo(int *q, int *p) {
      *q = 3;
      (*p)++;
      *q *= 4;
    }

  cannot be safely optimized to:

    (*p)++;
    *q = 12;

  because perhaps p == q • Some compilers have pretty fancy aliasing analysis capabilities
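In C99, one way to remove this blocker is the restrict qualifier, by which the programmer promises that the pointers do not alias; a sketch based on the slide's example (whether the compiler then actually folds the two writes is up to it):

    /* With restrict, the programmer promises that p and q never point to the
       same object, so the compiler is free to combine the two writes to *q. */
    void foo(int *restrict q, int *restrict p) {
        *q = 3;
        (*p)++;
        *q *= 4;      /* may now be compiled as a single  *q = 12  */
    }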

  36. Blocker: Function Call

    sum = 0;
    for (i = 0; i <= fact(n); i++)
      sum += i;

  • A compiler cannot optimize this because • the function fact may have side-effects • e.g., it may modify global variables • the function may not return the same value for given arguments • it may depend on other parts of global state, which may be modified in the loop • Why doesn’t the compiler look at the code for fact? • The linker may overload it with a different version • unless it is declared static • Interprocedural optimization is not used extensively due to cost • Inlining can achieve the same effect for small procedures • Again: • The compiler treats a procedure call as a black box • This weakens optimizations in and around it
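Besides hoisting the call by hand (the code-motion example in slide 15), GCC and Clang accept function attributes that declare a function free of side effects, which can let the compiler evaluate fact(n) once; the attribute below is GNU-specific and the factorial body is an assumption for illustration:

    /* GNU-specific: "const" promises the result depends only on the arguments
       and that the call has no side effects, so the compiler may hoist it. */
    __attribute__((const)) long fact(int n) {
        long f = 1;
        for (int i = 2; i <= n; i++)
            f *= i;
        return f;
    }

    long sum_below_fact(int n) {
        long sum = 0;
        for (long i = 0; i <= fact(n); i++)   /* fact(n) can now be evaluated once */
            sum += i;
        return sum;
    }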

  37. Other Techniques • Use more local variables (helps some compilers):

    while( … ) {
      *res++ = filter[0]*signal[0]
             + filter[1]*signal[1]
             + filter[2]*signal[2];
      signal++;
    }

  becomes

    register float f0 = filter[0];
    register float f1 = filter[1];
    register float f2 = filter[2];
    while( … ) {
      *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
      signal++;
    }

  38. Other Techniques • Replace pointer updates for strided memory addressing with constant array offsets:

    f0 = *r8; r8 += 4;
    f1 = *r8; r8 += 4;
    f2 = *r8; r8 += 4;

  becomes

    f0 = r8[0];
    f1 = r8[4];
    f2 = r8[8];
    r8 += 12;

  • Some compilers are better at figuring this out than others • Some systems may go faster with option #1, some others with option #2!

  39. Bottom line • Know your compilers • Some are great • Some are not so great • Some will not do things that you think they should do • often because you forget about things like aliasing • There is no golden rule because there are some system-dependent behaviors • Although the general principles typically hold • Doing all optimization by hand is a bad idea in general

  40. By-hand Optimization is bad?

    for (i = 0; i < SIZE; i++) {
      for (j = 0; j < SIZE; j++) {
        for (k = 0; k < SIZE; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }

  becomes

    for (i = 0; i < SIZE; i++) {
      int *orig_pa = &a[i][0];
      for (j = 0; j < SIZE; j++) {
        int *pa = orig_pa;
        int *pb = &b[0][j];
        int sum = 0;
        for (k = 0; k < SIZE; k++) {
          sum += *pa * *pb;
          pa++;
          pb += SIZE;
        }
        c[i][j] = sum;
      }
    }

  • Turned array accesses into pointer dereferences • Assign to each element of c just once

  41. Results (Courtesy of CMU)

  42. Why is Simple Sometimes Better? • Easier for humans and the compiler to understand • The more the compiler knows the more it can do • Pointers are hard to analyze, arrays are easier • You never know how fast code will run until you time it • The transformations we did by hand, good optimizers will often do for us • And they will often do a better job than we can do • Pointers may cause aliases and data dependences where the array code had none

  43. Bottom Line How should I write my programs, given that I have a good, optimizing compiler? • Don’t: Smash Code into Oblivion • Hard to read, maintain & ensure correctness • Do: • Select best algorithm • Write code that’s readable & maintainable • Procedures, recursion, without built-in constant limits • Even though these factors can slow down code • Eliminate optimization blockers • Allows compiler to do its job • Account for cache behavior • Focus on Inner Loops • Use a profiler to find important ones!

  44. Memory • One issue that compilers unfortunately still do not handle very well is memory and locality • Although some recent compilers have gotten pretty smart about it • Let’s look at this in detail because the ideas apply strongly to high performance for concurrent programs • No point in writing a concurrent program if its sequential components are egregiously suboptimal

  45. The Memory Hierarchy • Spatial locality: having accessed a location, a nearby location is likely to be accessed next • Therefore, if one can bring contiguous data items “close” to the processor at once, then perhaps a sequence of instructions will find them ready for use • Temporal locality: having accessed a location, this location is likely to be accessed again • Therefore, if one can keep recently accessed data items “close” to the processor, then perhaps the next instructions will find them ready for use • (Figure: the memory hierarchy from the CPU outward, each level larger, slower, and cheaper: registers (sub-ns), L1 cache (SRAM, 1-2 cycles), L2 cache (SRAM, ~10 cycles), L3 cache (DRAM, ~20 cycles), main memory (DRAM, hundreds of cycles), disk (tens of thousands of cycles); numbers roughly based on 2005 Intel P4 processors with multi-GHz clock rates)

  46. Caches • There are many issues regarding cache design • Direct-mapped, associative • Write-through, write-back • How many levels • etc. • But this belongs in a computer architecture class • Question: Why should the programmer care? • Answer: Because code can be re-arranged to improve locality • And thus to improve performance

  47. Example #1: 2-D Array Initialization

    // Option i,j:
    int a[200][200];
    for (i=0;i<200;i++)
      for (j=0;j<200;j++)
        a[i][j] = 2;

    // Option j,i:
    int a[200][200];
    for (j=0;j<200;j++)
      for (i=0;i<200;i++)
        a[i][j] = 2;

  • Which alternative is best? • i,j? • j,i? • To answer this, one must understand the memory layout of a 2-D array

  48. Row-Major • C uses Row-Major
  • First option:

    int a[200][200];
    for (i=0;i<200;i++)
      for (j=0;j<200;j++)
        a[i][j]=2;

  (consecutive accesses sweep along each row, i.e., along contiguous memory)

  • Second option:

    int a[200][200];
    for (j=0;j<200;j++)
      for (i=0;i<200;i++)
        a[i][j]=2;

  (consecutive accesses jump from row to row, 200 ints apart in memory)

  49. Counting cache misses • n×n 2-D array, element size = e bytes, cache line size = b bytes
  • Traversal along contiguous memory (one cache miss for every cache line):
    • Total number of misses: n² × e / b
    • Total number of memory accesses: n²
    • Miss rate: e / b
    • Example: miss rate = 4 bytes / 64 bytes = 6.25% (unless the array is very small)
  • Traversal that jumps across memory lines (one cache miss for every access):
    • Miss rate: 100% (unless the array is very small)

  50. Array Initialization in C
  • First option (good locality):

    int a[200][200];
    for (i=0;i<200;i++)
      for (j=0;j<200;j++)
        a[i][j]=2;

  • Second option (poor locality):

    int a[200][200];
    for (j=0;j<200;j++)
      for (i=0;i<200;i++)
        a[i][j]=2;
