Compiler Optimizations for Memory Hierarchy • Chapter 20 • http://research.microsoft.com/~trishulc/ • http://www.cs.umd.edu/~tseng/ • High Performance Compilers for Parallel Computing (Wolfe) • Mooly Sagiv
Outline • Motivation • Instruction Cache Optimizations • Scalar Replacement of Aggregates • Data Cache Optimizations • Where does it fit in a compiler • Complementary Techniques • Preliminary Conclusion
Motivation • Every year • CPUs are improving by 50%-60% • Main memory speed is improving by 10% • So what? • What can we do? • Programmers • Compiler writers • Operating system designers • Hardware architects
A Typical Machine • [Figure: block diagram with CPU, cache, memory bus, main memory, bus adaptor, I/O bus, three I/O controllers, disks, graphics output, and network]
Types of Locality in Programs • Temporal Locality • The same data is accessed many times in successive instructions • Example: while (…) { x = x + a; } • Spatial Locality • “Nearby” memory locations are accessed many times in successive instructions • Example: for (i = 1; i < n; i++) { x[i] = x[i] + a; }
Compiler Optimizations for Memory Hierarchy • Register allocation (Chapter 16) • Improve locality • Improve branch prediction • Software prefetching • Improve memory allocation
A Reasonable Assumption • The machine has two separate caches • Instruction cache • Data cache • Employ different compiler optimizations • Instruction cache optimizations • Data Cache optimizations
Instruction-Cache Optimizations • Instruction Prefetching • Procedure Sorting • Procedure and Block Placement • Intraprocedural Code Positioning (Pettis & Hansen 1990) • Procedure Splitting • Tailored for specific cache policy
Instruction Prefetching • Many machines prefetch the instructions of blocks predicted to be executed • Some RISC architectures support “software” prefetch • iprefetch address (Sparc-V9) • Criteria for inserting prefetches • Tprefetch - The latency of prefetching • t - The time at which the address is known
Procedure Sorting • Interprocedural Optimization • Place the caller and the callee close to each other • Applies to statically linked procedures • Create “undirected” call graph • Label arcs with execution frequencies • Use a greedy approach to select neighboring procedures
[Figure: example undirected call graph over procedures P1 to P8, with arcs labeled by execution frequencies]
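The greedy step of procedure sorting can be sketched as follows. This is a simplified illustration, not the exact published algorithm: it repeatedly merges the two nodes joined by the heaviest arc, appends one chain to the other, and sums the arcs of the merged node. The function name `sort_procedures`, the constant `NPROC`, and the matrix representation are assumptions made for this example, and the sketch does not try to pick the best orientation when two chains are joined.

```c
#include <assert.h>

#define NPROC 4 /* hypothetical number of procedures */

/* Greedy procedure-sorting sketch.
 * w[i][j] is the symmetric, non-negative call frequency between i and j.
 * On return, layout[] holds the chosen placement order.
 * The weight matrix is modified in place. */
void sort_procedures(long w[NPROC][NPROC], int layout[NPROC])
{
    int chain[NPROC][NPROC]; /* chain[i]: procedures merged into node i */
    int len[NPROC];          /* chain length; 0 means the node was absorbed */
    int live = NPROC;

    for (int i = 0; i < NPROC; i++) {
        chain[i][0] = i;
        len[i] = 1;
    }

    while (live > 1) {
        /* Select the heaviest arc between two live nodes. */
        int a = -1, b = -1;
        long best = -1;
        for (int i = 0; i < NPROC; i++) {
            if (len[i] == 0) continue;
            for (int j = i + 1; j < NPROC; j++) {
                if (len[j] == 0) continue;
                if (w[i][j] > best) { best = w[i][j]; a = i; b = j; }
            }
        }
        assert(a >= 0 && b >= 0);

        /* Append b's chain to a's chain so the two become neighbors. */
        for (int k = 0; k < len[b]; k++)
            chain[a][len[a]++] = chain[b][k];
        len[b] = 0;
        live--;

        /* Coalesce arcs: the merged node inherits b's arcs. */
        for (int k = 0; k < NPROC; k++) {
            if (k == a || len[k] == 0) continue;
            w[a][k] += w[b][k];
            w[k][a] = w[a][k];
        }
    }

    /* Exactly one live node remains; its chain is the layout. */
    for (int i = 0; i < NPROC; i++)
        if (len[i] > 0)
            for (int k = 0; k < NPROC; k++)
                layout[k] = chain[i][k];
}
```

In a real compiler the weights would come from profile data, and the resulting order would be handed to the linker.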
Intraprocedural Code Positioning • Move infrequently executed code out of main body • “Straighten” the code • A higher fraction of fetched instructions is actually executed • Operates on a control flow graph • Edges are annotated with execution frequencies • Cover the graph with traces
Intraprocedural Code Positioning • Input • Control flow graph • Edges are annotated with execution frequencies • Bottom-up trace selection • Initially each basic block is a trace • Combine traces with the maximal edge from tail to head • Place traces from entry • Traces with many outgoing edges appear earlier • Successive traces are close • Fix up the code by inserting and deleting branches
[Figure: example control flow graph with nodes entry, B1 to B9, and exit, with edges labeled by execution frequencies]
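Bottom-up trace selection can be sketched as below. The sketch assumes the control flow graph is given as a list of weighted edges; the names `struct edge`, `select_traces`, `next`, and `prev` are hypothetical. It covers only the trace-building step: placing the traces and fixing up branches are left out.

```c
#include <assert.h>
#include <stdlib.h>

struct edge { int from, to; long freq; };

/* Order edges by descending execution frequency. */
static int by_freq_desc(const void *x, const void *y)
{
    const struct edge *a = x, *b = y;
    return (a->freq < b->freq) - (a->freq > b->freq);
}

/* Bottom-up trace selection sketch.
 * Each block starts as its own trace. Edges are visited from hottest to
 * coldest; an edge u -> v joins two traces only if u is the tail of one
 * trace and v is the head of a different one.
 * next[b] / prev[b] give b's successor / predecessor inside its trace,
 * or -1. Returns the number of traces. Sorts edges[] in place. */
int select_traces(struct edge *edges, int nedges, int nblocks,
                  int *next, int *prev)
{
    for (int b = 0; b < nblocks; b++)
        next[b] = prev[b] = -1;

    qsort(edges, (size_t)nedges, sizeof *edges, by_freq_desc);

    int ntraces = nblocks;
    for (int e = 0; e < nedges; e++) {
        int u = edges[e].from, v = edges[e].to;
        assert(u >= 0 && u < nblocks && v >= 0 && v < nblocks);

        if (next[u] != -1 || prev[v] != -1)
            continue; /* u is not a tail, or v is not a head */

        int t = v;    /* find the tail of v's trace */
        while (next[t] != -1)
            t = next[t];
        if (t == u)
            continue; /* same trace: joining would close a loop */

        next[u] = v;
        prev[v] = u;
        ntraces--;
    }
    return ntraces;
}
```

Blocks whose `prev` entry is still -1 are trace heads; following `next` from a head lists the trace in layout order.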
Procedure Splitting • Enhances the effectiveness of • Procedure sorting • Code positioning • Divides procedures into “hot” and “cold” parts • Place hot code in a separate section
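A hand-written approximation of hot/cold splitting is sketched below. It assumes a GCC- or Clang-compatible compiler, where the `cold` function attribute and `__builtin_expect` are available; on other compilers the macros expand to nothing. The function names are hypothetical. Whether cold functions actually end up in a separate text section depends on the target and on compiler options, so treat this as an illustration of the idea rather than a guarantee.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__GNUC__)
#define COLD __attribute__((cold, noinline))
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define COLD
#define UNLIKELY(x) (x)
#endif

/* Cold part: rarely executed error handling, kept out of the hot loop. */
COLD static long report_bad_input(const int *v, int i)
{
    fprintf(stderr, "negative value %d at index %d\n", v[i], i);
    return -1;
}

/* Hot part: the common path contains only the loop itself. */
long sum_nonnegative(const int *v, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0))
            return report_bad_input(v, i);
        sum += v[i];
    }
    return sum;
}
```

A compiler that performs procedure splitting does the same separation automatically, driven by profile data instead of annotations.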
Scalar Replacement of Array Elements • Reduce the number of memory accesses • Improve the effectiveness of register allocation
    do i = 1..N
      do j = 1..N
        do k = 1..N
          C(i, j) = C(i, j) + A(i, k) * B(k, j)
        enddo
      enddo
    enddo
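Applied to the matrix multiplication loop, scalar replacement could look like the C sketch below. C(i, j) does not depend on k, so it is loaded into a scalar once before the inner loop and stored once after it, which lets the register allocator keep it in a register. The function name and the row-major `n x n` layout are assumptions for this example, and the code assumes `c` does not overlap `a` or `b`.

```c
#include <assert.h>
#include <stddef.h>

/* C = C + A * B for n x n row-major matrices, with C(i, j) replaced by a
 * scalar inside the k loop. Assumes c does not alias a or b. */
void matmul_scalar_replaced(size_t n, const double *a, const double *b,
                            double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double t = c[i * n + j];         /* one load of C(i, j) */
            for (size_t k = 0; k < n; k++)
                t += a[i * n + k] * b[k * n + j];
            c[i * n + j] = t;                /* one store of C(i, j) */
        }
    }
}
```

The original loop nest touches C(i, j) in memory on every k iteration; here the inner loop reads only A and B.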
Data-Cache Optimizations • Loop transformations • Re-arrange loops in scientific code • Allow parallel/pipelined/vector execution • Improve locality • Data placement of dynamic storage • Software prefetching
Loop Transformations • Unimodular transformations • Loop interchange • Loop permutation • Loop skewing • Loop fusion • Loop distribution • Loop tiling
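As a small illustration of loop interchange, the two C functions below compute the same sum over a row-major array. The first walks down columns, so consecutive accesses are `COLS` elements apart; the second is the interchanged version, which walks along rows and touches consecutive addresses. The names and array sizes are hypothetical.

```c
#include <assert.h>

#define ROWS 3
#define COLS 4

/* Before interchange: the inner loop strides through memory by COLS ints. */
long sum_column_order(int m[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}

/* After interchange: the inner loop visits consecutive elements of a row. */
long sum_row_order(int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}
```

The interchange is legal here because the iterations are independent; in general a compiler must check that no dependence is reversed.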
Tiling • Perform array operations in small blocks • Rearrange the loops so that the data touched by the innermost loops fits in the cache (due to fewer iterations) • Allow reuse in all tiled dimensions • Padding may be required to avoid cache conflicts
    do i = 1..N, T
      do j = 1..N, T
        do k = 1..N, T
          do ii = i, min(i+T-1, N)
            do jj = j, min(j+T-1, N)
              do kk = k, min(k+T-1, N)
                C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
              enddo
            enddo
          enddo
        enddo
      enddo
    enddo
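The tiled loop nest translates to C roughly as follows. This is a sketch with 0-based indices and row-major storage; the function name `matmul_tiled`, the helper `min_size`, and the tile-size parameter `t` are assumptions for this example. Choosing `t` so that the tiles fit in the cache is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled C = C + A * B for n x n row-major matrices.
 * The three outer loops step over t x t tiles; the three inner loops
 * work inside one tile, clamped at the matrix edge. */
void matmul_tiled(size_t n, size_t t, const double *a, const double *b,
                  double *c)
{
    assert(t > 0);
    for (size_t i = 0; i < n; i += t)
        for (size_t j = 0; j < n; j += t)
            for (size_t k = 0; k < n; k += t)
                for (size_t ii = i; ii < min_size(i + t, n); ii++)
                    for (size_t jj = j; jj < min_size(j + t, n); jj++)
                        for (size_t kk = k; kk < min_size(k + t, n); kk++)
                            c[ii * n + jj] += a[ii * n + kk] * b[kk * n + jj];
}
```

With `t` equal to `n` the code degenerates to the untiled loop nest, which is a convenient check that tiling does not change the result.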
Dynamic storage • Improve spatial locality at allocation time • Examples • Use the type of the data structure at malloc • Reorganize the heap • Allocate a tree node close to its parent • Useful information • Types • Traversal patterns • Research Frontier
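    void addList(struct List *list, struct Patient *patient)
    {
        struct List *b;

        /* Walk to the last element; b is only set if the list is non-empty. */
        while (list != NULL) {
            b = list;
            list = list->forward;
        }
        /* The second argument to ccmalloc is the object to allocate near. */
        list = (struct List *) ccmalloc(sizeof(struct List), b);
        list->patient = patient;
        list->back = b;
        list->forward = NULL;
        b->forward = list;
    }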
Software Prefetching • Requires special hardware (Alpha, PowerPC, Sparc-V9) • Reduces the cost of subsequent accesses in loops • Not limited to scientific code • More effective when memory bandwidth is large
With a jump pointer and prefetching:
    struct node { int val; struct node *next; struct node *jump; };
    …
    ptr = the_list->head;
    while (ptr->next) {
        prefetch(ptr->jump);
        …
        ptr = ptr->next;
    }
Without prefetching:
    struct node { int val; struct node *next; };
    …
    ptr = the_list->head;
    while (ptr->next) {
        …
        ptr = ptr->next;
    }
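A self-contained version of the jump-pointer idea might look like the sketch below. It assumes a GCC- or Clang-compatible compiler for `__builtin_prefetch` and falls back to a no-op elsewhere. The helper `set_jump_pointers`, the function `sum_with_prefetch`, and the prefetch distance parameter `dist` are hypothetical names chosen for this example; the slide does not say how the jump pointers are initialized.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define PREFETCH(p) __builtin_prefetch(p)
#else
#define PREFETCH(p) ((void)(p))
#endif

struct node { int val; struct node *next; struct node *jump; };

/* Point each node's jump field at the node dist links ahead,
 * or at NULL when fewer than dist nodes follow it. */
void set_jump_pointers(struct node *head, int dist)
{
    struct node *ahead = head;
    for (int i = 0; i < dist && ahead; i++)
        ahead = ahead->next;
    for (struct node *p = head; p; p = p->next) {
        p->jump = ahead;
        if (ahead)
            ahead = ahead->next;
    }
}

/* Traverse the list, prefetching the node dist links ahead while the
 * current node is being processed. */
long sum_with_prefetch(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p; p = p->next) {
        if (p->jump)
            PREFETCH(p->jump);
        sum += p->val;
    }
    return sum;
}
```

The prefetch distance should roughly cover the memory latency with useful work; a suitable value depends on the machine and on how much is done per node.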
Textbook Order • A (HIR): Scalar replacement of array references, Data-cache optimizations • B (HIR|MIR): Procedure integration, … • C (MIR|LIR): Global value numbering, … • D (LIR): In-line expansion, … • E (link-time): Interprocedural register allocation, … • Constant folding, simplifications
LIR (D) • Inline expansion • Leaf-routine optimizations • Shrink wrapping • Machine idioms • Tail merging • Branch optimization and conditional moves • Dead code elimination • Software pipelining, … • Instruction scheduling 1 • Register allocation • Instruction scheduling 2 • Intraprocedural I-cache optimizations • Instruction prefetching • Data prefetching • Branch prediction • Constant folding, simplifications
Link-time optimizations (E) • Interprocedural register allocation • Aggregation of global references • Interprocedural I-cache optimizations
Complementary Techniques • Cache aware data structures • Smart hardware • Cache aware garbage collection
Preliminary Conclusion • For imperative programs, current I-cache optimizations suffice to get good speed-ups (10%) • For D-cache optimizations: • Locality optimizations are effective for regular scientific code (46%) • Software prefetching is effective with large memory bandwidth • For pointer-chasing programs, more research is needed • Memory optimization is a profitable area