Compiling for Parallel Machines Kathy Yelick CS264
Two General Research Goals • Correctness: help programmers eliminate bugs • Analysis to detect bugs statically (and conservatively) • Tools such as debuggers to help detect bugs dynamically • Performance: help make programs run faster • Static compiler optimizations • May use analyses similar to above to ensure compiler is correctly transforming code • In many areas, the open problem is determining which transformations should be applied when • Link or load-time optimizations, including object code translation • Feedback-directed optimization • Runtime optimization • For parallel machines, if you can’t get good performance, what’s the point?
A Little History • Most research on compiling for parallel machines is • automatic parallelization of serial code • loop-level parallelization (usually Fortran) • Most parallel programs are written using explicit parallelism, either • (a) message passing with a single program, multiple data (SPMD) model, usually MPI with either Fortran or mixed C++ and Fortran, for scientific applications • (b) shared memory with a thread and synchronization library in C or Java, for non-scientific applications • Option (b) is easier to program, but requires hardware support that is still unproven for more than 200 processors
Titanium Overview • Give programmers a global address space • Useful for building large complex data structures that are spread over the machine • But, don’t pretend it will have uniform access time (i.e., not quite shared memory) • Use an explicit parallelism model • SPMD for simplicity • Extend a “standard” language with data structures for a specific problem domain: grid-based scientific applications • Small amount of syntax added for ease of programming • General idea: build domain-specific features into the language and optimization framework
Titanium Goals • Performance • close to C/FORTRAN + MPI or better • Portability • develop on uniprocessor, then SMP, then MPP/Cluster • Safety • as safe as Java, extended to parallel framework • Expressiveness • close to usability of threads • add minimal set of features • Compatibility, interoperability, etc. • no gratuitous departures from Java standard
Titanium • Take the best features of threads and MPI • global address space like threads (ease programming) • SPMD parallelism like MPI (for performance) • local/global distinction, i.e., layout matters (for performance) • Based on Java, a cleaner C++ • classes, memory management • Language is extensible through classes • domain-specific language extensions • current support for grid-based computations, including AMR • Optimizing compiler • communication and memory optimizations • synchronization analysis • cache and other uniprocessor optimizations
New Language Features • Scalable parallelism • SPMD model of execution with global address space • Multidimensional arrays • points and index sets as first-class values to simplify programs • iterators for performance • Checked Synchronization • single-valued variables and globally executed methods • Global Communication Library • Immutable classes • user-definable non-reference types for performance • Operator overloading • by demand from our user community • Semi-automated zone-based memory management • as safe as a garbage-collected language • better parallel performance and scalability
Lecture Outline • Language and compiler support for uniprocessor performance • Immutable classes • Multidimensional Arrays • foreach • Language support for parallel computation • Analysis of parallel code • Summary and future directions
Java: A Cleaner C++ • Java is an object-oriented language • classes (no standalone functions) with methods • single inheritance between classes; multiple inheritance through interfaces only • Documentation on web at java.sun.com • Syntax similar to C++
class Hello {
  public static void main (String [] argv) {
    System.out.println("Hello, world!");
  }
}
• Safe • Strongly typed: checked at compile time, no unsafe casts • Automatic memory management • Titanium is (almost) a strict superset
Java Objects • Primitive scalar types: boolean, double, int, etc. • implementations will store these on the program stack • access is fast -- comparable to other languages • Objects: user-defined and from the standard library • passed by pointer value (object sharing) into functions • have an implicit level of indirection (a pointer to the object) • simple model, but inefficient for small objects • [Figure: primitives such as 2.6, 3, true stored directly; a small object with fields r: 7.1, i: 4.3 reached through a pointer]
Java Object Example
class Complex {
  private double real;
  private double imag;
  public Complex(double r, double i) { real = r; imag = i; }
  public Complex add(Complex c) {
    return new Complex(c.real + real, c.imag + imag);
  }
  public double getReal() { return real; }
  public double getImag() { return imag; }
}
Complex c = new Complex(7.1, 4.3);
c = c.add(c);
class VisComplex extends Complex { ... }
Immutable Classes in Titanium • For small objects, would sometimes prefer • to avoid level of indirection • pass by value (copying of entire object) • especially when objects are immutable -- fields are unchangeable • extends the idea of primitive values (1, 4.2, etc.) to user-defined values • Titanium introduces immutable classes • all fields are final (implicitly) • cannot inherit from (extend) or be inherited by other classes • must have a 0-argument constructor, e.g., Complex()
immutable class Complex { ... }
Complex c = new Complex(7.1, 4.3);
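As a concrete illustration, here is a minimal sketch (not from the original slides) of the Complex class rewritten as a Titanium immutable class, following the rules above -- implicitly final fields, no inheritance, and a 0-argument constructor:
immutable class Complex {
  public double real;       // implicitly final
  public double imag;       // implicitly final
  public Complex() { real = 0.0; imag = 0.0; }           // required 0-argument constructor
  public Complex(double r, double i) { real = r; imag = i; }
  public Complex add(Complex c) {
    return new Complex(c.real + real, c.imag + imag);    // result is a new value, not a shared object
  }
}
Complex c = new Complex(7.1, 4.3);   // stored and passed by value, with no pointer indirection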
Arrays in Java • Arrays in Java are objects • Only 1D arrays are directly supported • Array bounds are checked • Multidimensional arrays as arrays-of-arrays are slow
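A small Java sketch (not from the original slides) of why the arrays-of-arrays representation is slow: a 2D "array" is really an array of separately allocated row objects, so each element access follows an extra pointer and performs a bounds check per dimension; the size n is a placeholder:
double[][] grid = new double[n][n];   // n rows, each its own heap object
for (int i = 0; i < n; i++) {
  double[] row = grid[i];             // extra load and bounds check for the row
  for (int j = 0; j < n; j++) {
    row[j] = 0.0;                     // second bounds check for the column
  }
}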
Multidimensional Arrays in Titanium • New kind of multidimensional array added • Two arrays may overlap (unlike Java arrays) • Indexed by Points (tuples of ints) • Constructed over a set of Points, called Domains • RectDomains are a special case of Domains • Points, Domains and RectDomains are built-in immutable classes • Support for adaptive meshes and other mesh/grid operations
RectDomain<2> d = [0:n, 0:n];
Point<2> p = [1, 2];
double [2d] a = new double [d];
a[0,0] = a[9,9];
Naïve MatMul with Titanium Arrays
public static void matMul(double [2d] a, double [2d] b, double [2d] c) {
  int n = c.domain().max()[1];  // assumes square
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < n; k++) {
        c[i,j] += a[i,k] * b[k,j];
      }
    }
  }
}
Two Performance Issues • In any language, uniprocessor performance is often dominated by memory hierarchy costs • algorithms that are “blocked” for the memory hierarchy (caches and registers) can be much faster • In Titanium, the representation of arrays is fast, but the access methods are expensive • need optimizations on Titanium arrays • common subexpression elimination • eliminate (or hoist) bounds checking • strength reduce: e.g., naïve code has 1 divide per dimension for each array access • See Geoff Pike’s work • goal: competitive with C/Fortran performance or better
Matrix Multiply (blocked, or tiled) • Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the blocksize
for i = 1 to N
  for j = 1 to N
    {read block C(i,j) into fast memory}
    for k = 1 to N
      {read block A(i,k) into fast memory}
      {read block B(k,j) into fast memory}
      C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
    {write block C(i,j) back to slow memory}
[Figure: blocked update C(i,j) = C(i,j) + A(i,k) * B(k,j)]
Memory Hierarchy Optimizations: MatMul • [Figure: speed of n-by-n matrix multiply on Sun Ultra-1/170, peak = 330 MFlops]
Unordered iteration • Often useful to reorder iterations for caches • Compilers can do this for simple operations, e.g., matrix multiply, but it is hard in general • Titanium adds unordered iteration on rectangular domains
foreach (p within r) { ... }
• p is a new Point, scoped only within the foreach body • r is a previously declared RectDomain • foreach simplifies bounds checking as well • Additional operations on domains and arrays to subset and transform
Better MatMul with Titanium Arrays
public static void matMul(double [2d] a, double [2d] b, double [2d] c) {
  foreach (ij within c.domain()) {
    double [1d] aRowi = a.slice(1, ij[1]);
    double [1d] bColj = b.slice(2, ij[2]);
    foreach (k within aRowi.domain()) {
      c[ij] += aRowi[k] * bColj[k];
    }
  }
}
• Current compiler eliminates array overhead, making it comparable to C performance for 3 nested loops • Automatic tiling still TBD
Lecture Outline • Language and compiler support for uniprocessor performance • Language support for parallel computation • SPMD execution • Global and local references • Communication • Barriers and single • Synchronized methods and blocks (as in Java) • Analysis of parallel code • Summary and future directions
SPMD Execution Model • Java programs can be run as Titanium, but the result will be that all processors do all the work • E.g., parallel hello world
class HelloWorld {
  public static void main (String [] argv) {
    System.out.println("Hello from proc " + Ti.thisProc());
  }
}
• Any non-trivial program will have communication and synchronization between processors
SPMD Execution Model • A common style is compute/communicate • E.g., in each timestep within a fish simulation with gravitational attraction:
read all fish and compute forces on mine
Ti.barrier();
write to my fish using new forces
Ti.barrier();
SPMD Model • All processors start together and execute the same code, but not in lock-step • Sometimes they take different branches
if (Ti.thisProc() == 0) { … do setup … }
for (all data I own) { … compute on data … }
• A common source of bugs is barriers or other global operations (barrier, broadcast, reduction, exchange) inside branches or loops; see the sketch below • A “single” method is one called by all procs
public single static void allStep(…)
• A “single” variable has the same value on all procs
int single timestep = 0;
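To make the barrier pitfall concrete, here is a minimal sketch (not from the original slides) using the Ti.barrier() and single features above; doSetup and computeLocalStep are hypothetical placeholder names:
// WRONG: only processor 0 reaches the barrier, so the others wait forever
if (Ti.thisProc() == 0) {
  doSetup();        // hypothetical setup routine
  Ti.barrier();     // deadlock: not executed by all processors
}

// RIGHT: a single-valued bound guarantees every processor executes
// the same iterations, and therefore the same sequence of barriers
int single nSteps = 100;
for (int single s = 0; s < nSteps; s++) {
  computeLocalStep();   // hypothetical per-processor work
  Ti.barrier();         // reached by all processors in every iteration
}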
SPMD Execution Model • Barriers and single in FishSimulation (n-body)
class FishSim {
  public static void main (String [] argv) {
    int allTimestep = 0;    // single
    int allEndTime = 100;   // single
    for (; allTimestep < allEndTime; allTimestep++) {
      read all fish and compute forces on mine
      Ti.barrier();
      write to my fish using new forces
      Ti.barrier();
    }
  }
}
• Single variables and methods are inferred; see David Gay’s work
Global Address Space • Processes allocate locally • References can be passed to other processes • [Figure: each process has a local heap; lv is a local pointer into the process's own heap, gv is a global pointer that may refer to an object in another process's heap]
class C { int val; … }
C gv;           // global pointer
C local lv;     // local pointer
if (thisProc() == 0) { lv = new C(); }
gv = broadcast lv from 0;
gv.val = …;     // global pointers have
… = gv.val;     // full functionality
Use of Global / Local • Default is global • easier to port shared-memory programs • performance bugs common: global pointers are more expensive • harder to use sequential kernels • Use local declarations in critical sections • Compiler can infer many instances of “local” • See Liblit’s work on LQI (Local Qualification Inference)
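A small sketch (not from the original slides) of the local idiom: arrays a processor allocates itself can be declared with the local qualifier from the Global Address Space slide, so the compiler can emit direct loads and stores instead of wide-pointer accesses; the names relaxLocal, u, f, and n are hypothetical:
// a kernel written over local arrays: every access inside is a cheap local access
static void relaxLocal(double [1d] local u, double [1d] local f) {
  foreach (p within u.domain()) {
    u[p] = 0.5 * (u[p] + f[p]);
  }
}

RectDomain<1> mine = [0 : n-1];            // this processor's portion
double [1d] local u = new double [mine];   // allocated here, declared local
double [1d] local f = new double [mine];
relaxLocal(u, f);
The Local Qualification Inference described on the next slide can often derive such qualifiers automatically.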
Local Pointer Analysis [Liblit, Aiken] • Global references simplify programming, but incur overhead, even when data is local • Split-C therefore requires global pointers be declared explicitly • Titanium pointers global by default: easier, better portability • Automatic “local qualification” inference
Parallel performance • Speedup on Ultrasparc SMP • AMR largely limited by • current algorithm • problem size • 2 levels, with top one serial • Not yet optimized with “local” for distributed memory
Lecture Outline • Language and compiler support for uniprocessor performance • Language support for parallel computation • Analysis and Optimization of parallel code • Tolerate network latency: Split-C experience • Hardware trends and reordering • Semantics: sequential consistency • Cycle detection: parallel dependence analysis • Synchronization analysis: parallel flow analysis • Summary and future directions
Split-C Experience: Latency Overlap • Titanium borrowed ideas from Split-C • global address space • SPMD parallelism • But, Split-C had non-blocking accesses built in to tolerate network latency on remote read/write • Also one-way communication • Conclusion: useful, but complicated
int *global p;
x := *p;          /* get */
*p := 3;          /* put */
sync;             /* wait for my puts/gets */
*p :- x;          /* store */
all_store_sync;   /* wait globally */
Other Sources of Overlap • Would like the compiler to introduce put/get/store • Hardware also reorders • out-of-order execution • write buffered with read by-pass • non-FIFO write buffers • weak memory models in general • Software already reorders too • register allocation • any code motion • System provides enforcement primitives • e.g., memory fence, volatile, etc. • tend to be heavyweight and have unpredictable performance • Can the compiler hide all this?
Semantics: Sequential Consistency • When compiling sequential programs, reordering
x = expr1; y = expr2;   into   y = expr2; x = expr1;
is valid if y is not in expr1 and x is not in expr2 (roughly) • When compiling parallel code, that test is not sufficient:
Initially flag = data = 0
Proc A             Proc B
data = 1;          while (flag != 1);
flag = 1;          … = … data …;
Cycle Detection: Dependence Analog • Processors define a “program order” on accesses from the same thread; P is the union of these total orders • The memory system defines an “access order” on accesses to the same variable; A is the access order (read/write and write/write pairs) • A violation of sequential consistency is a cycle in P ∪ A • Intuition: time cannot flow backwards • [Figure: in the flag/data example, the cycle write data → write flag → read flag → read data]
Cycle Detection • Generalizes to arbitrary numbers of variables and processors • Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Shasha & Snir] • [Figure: example cycle over the accesses write x, write y, read y, read y, write x]
Static Analysis for Cycle Detection • Approximate P by the control flow graph • Approximate A by undirected “dependence” edges • Let the “delay set” D be all edges from P that are part of a minimal cycle • The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code) • Synchronization analysis is also critical [Krishnamurthy] • [Figure: example with accesses write z, read x, write y, read x, read y, write z]
Automatic Communication Optimization • Implemented in a subset of C with limited pointers [Krishnamurthy, Yelick] • Experiments on the NOW; 3 synchronization styles • [Figure: normalized execution time for the three styles] • Future: pointer analysis and optimizations for AMR [Jeh, Yelick]
Other Language Extensions Java extensions for expressiveness & performance • Operator overloading • Zone-based memory management • Foreign function interface The following is not yet implemented in the compiler • Parameterized types (aka templates)
Implementation • Strategy • compile Titanium into C • Solaris or Posix threads for SMPs • Active Messages (Split-C library) for communication • Status • runs on SUN Enterprise 8-way SMP • runs on Berkeley NOW • runs on the Tera (not fully tested) • T3E port partially working • SP2 port under way
Titanium Status • Titanium language definition complete. • Titanium compiler running. • Compiles for uniprocessors, NOW, Tera, T3E, SMPs, SP2 (under way). • Application developments ongoing. • Lots of research opportunities.
Future Directions • Super optimizers for targeted kernels • e.g., PHiPAC, Sparsity, FFTW, and ATLAS • include feedback and some runtime information • New application domains • unstructured grids (aka graphs and sparse matrices) • I/O-intensive applications such as information retrieval • Optimizing I/O as well as communication • uniform treatment of memory hierarchy optimizations • Performance heterogeneity from the hardware • related to dynamic load balancing in software • Reasoning about parallel code • correctness analysis: race condition and synchronization analysis • better analysis: aliases and threads • Java memory model and hiding the hardware model
Point, RectDomain, Arrays in General • Points specified by a tuple of ints • RectDomains given by: • lower bound point • upper bound point • stride point • Array given by RectDomain and element type
Point<2> lb = [1, 1];
Point<2> ub = [10, 20];
RectDomain<2> r = [lb : ub : [2, 2]];
double [2d] A = new double [r];
...
foreach (p in A.domain()) {
  A[p] = B[2 * p + [1, 1]];
}
AMR Poisson • Poisson Solver [Semenzato, Pike, Colella] • 3D AMR • finite domain • variable coefficients • multigrid across levels • Performance of Titanium implementation • Sequential multigrid performance +/- 20% of Fortran • On a fixed, well-balanced problem of 8 patches, each 72^3, parallel speedups of 5.5 on 8 processors
Distributed Data Structures • Build distributed data structures: • broadcast or exchange
RectDomain<1> single allProcs = [0 : Ti.numProcs()-1];
RectDomain<1> myFishDomain = [0 : myFishCount-1];
Fish [1d] single [1d] allFish = new Fish [allProcs][1d];
Fish [1d] myFish = new Fish [myFishDomain];
allFish.exchange(myFish);
• Now each processor has an array of global pointers, one to each processor’s chunk of fish
Consistency Model • Titanium adopts the Java memory consistency model • Roughly: accesses to shared variables that are not synchronized have undefined behavior • Use synchronization to control access to shared variables • barriers • synchronized methods and blocks
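A minimal Java-style sketch (not from the original slides) of the synchronized methods and blocks mentioned above; the Counter class is a hypothetical example:
class Counter {
  private int count = 0;
  public synchronized void inc() {   // synchronized method: one thread at a time
    count++;
  }
  public int read() {
    synchronized (this) {            // synchronized block on the same lock
      return count;
    }
  }
}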
Example: Domain • Domains in general are not rectangular • Built using set operations • union, + • intersection, * • difference, - • Example is red-black algorithm
Point<2> lb = [0, 0];
Point<2> ub = [6, 4];
RectDomain<2> r = [lb : ub : [2, 2]];
…
Domain<2> red = r + (r + [1, 1]);
foreach (p in red) { ... }
[Figure: r covers (0,0) to (6,4) with stride 2; r + [1,1] covers (1,1) to (7,5); red is their union]
Example using Domains and foreach • Gauss-Seidel red-black computation in multigrid
void gsrb() {
  boundary(phi);
  for (Domain<2> d = red; d != null; d = (d == red ? black : null)) {
    foreach (q in d)    // unordered iteration
      res[q] = ((phi[n(q)] + phi[s(q)] + phi[e(q)] + phi[w(q)]) * 4
               + (phi[ne(q)] + phi[nw(q)] + phi[se(q)] + phi[sw(q)])
               - 20.0 * phi[q] - k * rhs[q]) * 0.05;
    foreach (q in d)
      phi[q] += res[q];
  }
}
Applications • Three-D AMR Poisson Solver (AMR3D) • block-structured grids • 2000 line program • algorithm not yet fully implemented in other languages • tests performance and effectiveness of language features • Other 2D Poisson Solvers (under development) • infinite domains • based on method of local corrections • Three-D Electromagnetic Waves (EM3D) • unstructured grids • Several smaller benchmarks