Compiler Optimizations for Memory Hierarchy
Chapter 20
http://research.microsoft.com/~trishulc/
http://www.cs.umd.edu/~tseng/
High Performance Compilers for Parallel Computing (Wolfe)
Mooly Sagiv
Outline • Motivation • Instruction Cache Optimizations • Scalar Replacement of Aggregates • Data Cache Optimizations • Where does it fit in a compiler? • Complementary Techniques • Preliminary Conclusion
Motivation • Every year • CPU performance improves by 50%-60% • Main memory speed improves by only 10% • So what? • What can we do? • Programmers • Compiler writers • Operating system designers • Hardware architects
A Typical Machine • [diagram: a CPU with its cache on the memory bus to main memory; a bus adaptor bridges to an I/O bus whose I/O controllers drive disks, graphics output, and the network]
Types of Locality in Programs • Temporal Locality • The same data is accessed many times by successive instructions • Example: while (…) { x = x + a; } • Spatial Locality • “Nearby” memory locations are accessed many times by successive instructions • Example: for (i = 1; i < n; i++) { x[i] = x[i] + a; }
Compiler Optimizations for Memory Hierarchy • Register allocation (Chapter 16) • Improve locality • Improve branch prediction • Software prefetching • Improve memory allocation
A Reasonable Assumption • The machine has two separate caches • Instruction cache • Data cache • Employ different compiler optimizations • Instruction cache optimizations • Data Cache optimizations
Instruction-Cache Optimizations • Instruction Prefetching • Procedure Sorting • Procedure and Block Placement • Intraprocedural Code Positioning (Pettis & Hansen 1990) • Procedure Splitting • Tailored for a specific cache policy
Instruction Prefetching • Many machines prefetch the instructions of blocks predicted to be executed • Some RISC architectures also support “software” prefetch • iprefetch address (Sparc-V9) • Criterion for inserting a prefetch • Tprefetch: the latency of a prefetch • t: how early the address is known • A prefetch pays off only when t is at least Tprefetch
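The criterion above applies to data prefetching as well, and is easier to show there: issue the prefetch far enough ahead of the use to hide the latency. A minimal sketch, assuming GCC/Clang's `__builtin_prefetch` as the software-prefetch primitive (the slides' `iprefetch` is SPARC-specific) and a prefetch distance of 16 elements chosen arbitrarily to stand in for Tprefetch:

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while prefetching a fixed distance ahead, so each
   cache line is requested well before it is used.  The distance
   (16 elements) is a tuning assumption, not a value from the slides. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t dist = 16;
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)                              /* stay in bounds */
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}
```

The right distance depends on the machine: too small and the line arrives late; too large and it may be evicted before use.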
Procedure Sorting • An interprocedural optimization • Place the caller and the callee close to each other • Applies to statically linked procedures • Create an “undirected” call graph • Label arcs with execution frequencies • Use a greedy approach to merge neighboring procedures
[figure: example weighted call graph over procedures P1–P8, with arc labels giving call frequencies (50, 50, 40, 100, 20, 5, 3, 32, 90, 40)]
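The greedy approach can be sketched as chain merging: visit call-graph arcs in decreasing frequency and, when the endpoints lie in different chains, concatenate the chains so the two procedures become neighbors in the final layout. This is a simplification of Pettis & Hansen 1990, which additionally only merges at chain ends and chooses the better of the two concatenation orders; the function name and data layout here are illustrative:

```c
#include <assert.h>

typedef struct { int u, v, w; } Edge;   /* call-graph arc with frequency w */

/* Greedy procedure sorting (simplified Pettis-Hansen): process arcs
   heaviest-first; if the endpoints are in different chains, append one
   chain to the other.  order[] receives the final procedure layout. */
void sort_procedures(Edge *edges, int nedge, int nproc, int *order) {
    int chain[nproc];                   /* chain id of each procedure      */
    int members[nproc][nproc];          /* procedures of each chain, in order */
    int len[nproc];
    for (int p = 0; p < nproc; p++) {
        chain[p] = p; members[p][0] = p; len[p] = 1;
    }
    for (int i = 0; i < nedge; i++)     /* sort arcs, heaviest first */
        for (int j = i + 1; j < nedge; j++)
            if (edges[j].w > edges[i].w) {
                Edge t = edges[i]; edges[i] = edges[j]; edges[j] = t;
            }
    for (int e = 0; e < nedge; e++) {
        int cu = chain[edges[e].u], cv = chain[edges[e].v];
        if (cu == cv) continue;         /* already in the same chain */
        for (int k = 0; k < len[cv]; k++) {    /* append chain cv to cu */
            int p = members[cv][k];
            members[cu][len[cu]++] = p;
            chain[p] = cu;
        }
        len[cv] = 0;
    }
    int pos = 0;                        /* emit surviving chains in order */
    for (int c = 0; c < nproc; c++)
        for (int k = 0; k < len[c]; k++)
            order[pos++] = members[c][k];
}
```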
Intraprocedural Code Positioning • Move infrequently executed code out of the main body • “Straighten” the code • A higher fraction of fetched instructions is actually executed • Operates on a control-flow graph • Edges are annotated with execution frequencies • Cover the graph with traces
Intraprocedural Code Positioning • Input • Control-flow graph • Edges annotated with execution frequencies • Bottom-up trace selection • Initially each basic block is its own trace • Combine the two traces joined by the maximal-frequency edge running from the tail of one to the head of the other • Place traces starting from the entry • Traces with many outgoing edges appear earlier • Successive traces are placed close together • Fix up the code by inserting and deleting branches
[figure: example control-flow graph from entry through B1–B9 to exit, edges labeled with execution frequencies (20, 30, 45, 10, 14, 40, 5, 10, 15, …)]
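The bottom-up trace selection step can be sketched as follows; the representation (a `next` successor link per block and a `head` flag marking trace starts) and the test graph's frequencies are illustrative, not taken from the figure:

```c
#include <assert.h>

#define MAXBLOCKS 16

typedef struct { int src, dst, freq; } CfgEdge;

/* Bottom-up trace selection: every block starts as its own trace;
   edges are visited in decreasing frequency, and an edge combines two
   traces only when it runs from the tail of one trace to the head of
   another (and would not form a cycle).  next[b] is b's successor
   within its trace (-1 at a tail); head[b] is 1 iff b starts a trace. */
void select_traces(CfgEdge *edges, int nedge, int nblock,
                   int next[], int head[]) {
    int trace[MAXBLOCKS];
    for (int b = 0; b < nblock; b++) {
        next[b] = -1; head[b] = 1; trace[b] = b;
    }
    for (int i = 0; i < nedge; i++)      /* sort edges, hottest first */
        for (int j = i + 1; j < nedge; j++)
            if (edges[j].freq > edges[i].freq) {
                CfgEdge t = edges[i]; edges[i] = edges[j]; edges[j] = t;
            }
    for (int e = 0; e < nedge; e++) {
        int s = edges[e].src, d = edges[e].dst;
        if (next[s] != -1 || !head[d] || trace[s] == trace[d])
            continue;                    /* s not a tail / d not a head / cycle */
        next[s] = d;                     /* append d's trace after s */
        head[d] = 0;
        for (int b = d; b != -1; b = next[b])
            trace[b] = trace[s];         /* relabel the appended trace */
    }
}
```

After selection, the placement phase lays out the resulting traces starting from the entry block.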
Procedure Splitting • Enhances the effectiveness of • Procedure sorting • Code positioning • Divides procedures into “hot” and “cold” parts • Place hot code in a separate section
Scalar Replacement of Array Elements • Reduce the number of memory accesses • Improve the effectiveness of register allocation
do i = 1, N
 do j = 1, N
  do k = 1, N
   C(i, j) = C(i, j) + A(i, k) * B(k, j)
  enddo
 enddo
enddo
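In the loop nest above, C(i, j) is invariant in the k loop, so the transformed code can hold it in a scalar that the register allocator can keep in a register. A C rendering of the transformed nest (N fixed at 3 only to keep the example small):

```c
#include <assert.h>

#define N 3

/* Scalar replacement applied to the matrix-multiply nest: C[i][j] is
   loaded into the scalar c once before the k loop and stored back once
   after it, replacing 2*N memory accesses per (i,j) pair with 2. */
void matmul_scalar_replaced(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double c = C[i][j];          /* one load  */
            for (int k = 0; k < N; k++)
                c += A[i][k] * B[k][j];
            C[i][j] = c;                 /* one store */
        }
}
```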
Data-Cache Optimizations • Loop transformations • Re-arrange loops in scientific code • Allow parallel/pipelined/vector execution • Improve locality • Data placement of dynamic storage • Software prefetching
Loop Transformations • Unimodular transformations • Loop interchange • Loop permutation • Loop skewing • Loop fusion • Loop distribution • Loop tiling
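Loop interchange is the simplest of these to see in code. C stores arrays row-major, so the two loop orders below compute the same sum but differ sharply in spatial locality; interchange rewrites the second form into the first:

```c
#include <assert.h>

#define ROWS 4
#define COLS 4

/* Good order after interchange: the inner j loop walks consecutive
   addresses, so each cache line is fully used before being evicted. */
double sum_row_major(double a[ROWS][COLS]) {
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)   /* unit stride */
            s += a[i][j];
    return s;
}

/* Original order: the inner i loop strides by a whole row (COLS
   doubles), touching a different cache line on every access. */
double sum_col_major(double a[ROWS][COLS]) {
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)   /* stride of COLS doubles */
            s += a[i][j];
    return s;
}
```

Interchange is legal only when it preserves all data dependences, which is what the unimodular framework checks.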
Tiling • Perform array operations in small blocks • Rearrange the loops so that the data touched by the innermost loops fits in cache (due to fewer iterations) • Allows reuse in all tiled dimensions • Padding may be required to avoid cache conflicts
do i = 1, N, T
 do j = 1, N, T
  do k = 1, N, T
   do ii = i, min(i+T-1, N)
    do jj = j, min(j+T-1, N)
     do kk = k, min(k+T-1, N)
      C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
     enddo
    enddo
   enddo
  enddo
 enddo
enddo
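A runnable C rendering of the same tiled nest; the sizes (a 5x5 matrix with tile size 2) are illustrative and deliberately non-dividing, so the min() clipping at the array boundary is exercised:

```c
#include <assert.h>

#define SZ   5
#define TILE 2

static int min2(int a, int b) { return a < b ? a : b; }

/* Tiled matrix multiply: the three outer loops step by TILE, and
   min2() clips the inner loops when SZ is not a multiple of TILE.
   Each (i,j,k) tile works on TILE*TILE sub-blocks that fit in cache. */
void matmul_tiled(double C[SZ][SZ], double A[SZ][SZ], double B[SZ][SZ]) {
    for (int i = 0; i < SZ; i += TILE)
        for (int j = 0; j < SZ; j += TILE)
            for (int k = 0; k < SZ; k += TILE)
                for (int ii = i; ii < min2(i + TILE, SZ); ii++)
                    for (int jj = j; jj < min2(j + TILE, SZ); jj++)
                        for (int kk = k; kk < min2(k + TILE, SZ); kk++)
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
}
```

The tile size T is chosen so that three TILE x TILE blocks fit in the data cache at once; the value 2 here is only for the test.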
Dynamic Storage • Improve spatial locality at allocation time • Examples • Use the type of the data structure at malloc time • Reorganize the heap • Allocate a tree node close to its parent • Useful information • Types • Traversal patterns • A research frontier
void addList(struct List *list, struct Patient *patient) {
  struct List *b;
  while (list != NULL) {
    b = list;
    list = list->forward;
  }
  list = (struct List *) ccmalloc(sizeof(struct List), b);
  list->patient = patient;
  list->back = b;
  list->forward = NULL;
  b->forward = list;
}
(ccmalloc(size, b) is a cache-conscious allocator that places the new object near b)
Software Prefetching • Requires special hardware (Alpha, PowerPC, Sparc-V9) • Reduces the cost of subsequent accesses in loops • Not limited to scientific code • More effective for large memory bandwidth
With a jump pointer added for software prefetching:
struct node { int val; struct node *next; struct node *jump; };
…
ptr = the_list->head;
while (ptr->next) {
  prefetch(ptr->jump);
  …
  ptr = ptr->next;
}
Original list traversal:
struct node { int val; struct node *next; };
…
ptr = the_list->head;
while (ptr->next) {
  …
  ptr = ptr->next;
}
Textbook Order • A (HIR): scalar replacement of array references, data-cache optimizations • B (HIR|MIR): procedure integration, … • C (MIR|LIR): global value numbering, … • D (LIR): in-line expansion, … • E (link time): interprocedural register allocation, … • Constant folding and simplifications apply throughout
LIR (D) • In-line expansion • Leaf-routine optimizations • Shrink wrapping • Machine idioms • Tail merging • Branch optimizations and conditional moves • Dead-code elimination • Software pipelining, … • Instruction scheduling 1 • Register allocation • Instruction scheduling 2 • Intraprocedural I-cache optimizations • Instruction prefetching • Data prefetching • Branch prediction • Constant folding, simplifications
Link-Time Optimizations (E) • Interprocedural register allocation • Aggregation of global references • Interprocedural I-cache optimizations
Complementary Techniques • Cache aware data structures • Smart hardware • Cache aware garbage collection
Preliminary Conclusion • For imperative programs, current I-cache optimizations suffice to get good speed-ups (10%) • For D-cache optimizations: • Locality optimizations are effective for regular scientific code (46%) • Software prefetching is effective given large memory bandwidth • For pointer-chasing programs, more research is needed • Memory optimizations are a profitable area