Code Optimization of Parallel Programs
Vivek Sarkar, Rice University, vsarkar@rice.edu
Parallel Software Challenges & Focus Area for this Talk
- Domain-specific Programming Models: domain-specific implicitly parallel programming models, e.g., Matlab, stream processing, map-reduce (Sawzall)
- Middleware: parallelism in middleware, e.g., transactions, relational databases, web services, J2EE containers
- Application Libraries: parallel application libraries, e.g., linear algebra, graphics imaging, signal processing, security
- Programming Tools: parallel debugging and performance tools, e.g., Eclipse Parallel Tools Platform, TotalView, Thread Checker
- Languages: explicitly parallel languages, e.g., OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, CUDA, Cilk, MPI, Unified Parallel C, Co-Array Fortran, X10, Chapel, Fortress
- Static & Dynamic Optimizing Compilers (focus of this talk): parallel intermediate representation, optimization of synchronization & data transfer, automatic parallelization
- Multicore Back-ends: code partitioning for accelerators, data transfer optimizations, SIMDization, space-time scheduling, power management
- Parallel Runtime & System Libraries: parallel runtime and system libraries for task scheduling, synchronization, parallel data structures
- OS and Hypervisors: virtualization, scalable management of heterogeneous resources per core (frequency, power)
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
Our Current Paradigm for Code Optimization has served us well for Fifty Years ...
[Diagram: Stretch-Harvest compiler organization (1958-1962). Source languages (Fortran, Autocoder II, ALPHA) are each translated into a common intermediate language (IL), which flows through an optimizer and a register allocator to an assembler producing object code for the STRETCH and STRETCH-HARVEST machines.]
Source: "Compiling for Parallelism", Fran Allen, Turing Lecture, June 2007
... and has been adapted to meet challenges along the way ...
- Interprocedural analysis
- Array dependence analysis
- Pointer alias analysis
- Instruction scheduling & software pipelining
- SSA form
- Profile-directed optimization
- Dynamic compilation
- Adaptive optimization
- Auto-tuning
- . . .
... but is now under siege because of parallelism
- Proliferation of parallel hardware: multicore, manycore, accelerators, clusters, ...
- Proliferation of parallel libraries and languages: OpenMP, Java Concurrency, .NET Parallel Extensions, Intel TBB, Cilk, MPI, UPC, CAF, X10, Chapel, Fortress, ...
"The Structure of Scientific Revolutions”, Thomas S. Kuhn (1970) A paradigm is a scientific structure or framework consisting of Assumptions, Laws, Techniques Normal science is a puzzle solving activity governed by the rules of the paradigm. It is uncritical of the current paradigm, Crisis sets in when a series of serious anomalies appear “The emergence of new theories is generally preceded by a period of pronounced professional insecurity” Scientists engage in philosophical and metaphysical disputes. A revolution or paradigm shift occurs when an an entire paradigm is replaced by another Paradigm Shifts
Kuhn's History of Science
[Diagram: Immature Science -> Revolution -> Normal Science -> Crisis (triggered by Anomalies) -> Revolution: a new paradigm emerges.]
- Old theory: well established, many followers, many anomalies
- New theory: few followers, untested, new concepts/techniques, accounts for anomalies, asks new questions
Source: www.philosophy.ed.ac.uk/ug_study/ug_phil_sci1h/phil_sci_files/L10_Kuhn1.ppt
Some Well Known Paradigm Shifts
- Newton's Laws to Einstein's Theory of Relativity
- Ptolemy's geocentric view to Copernicus and Galileo's heliocentric view
- Creationism to Darwin's Theory of Evolution
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches
- Rice Habanero Multicore Software project
What anomalies do we see when optimizing parallel code?
Examples:
- Control flow rules
- Data flow rules
- Load elimination rules
1. Control Flow Rules from Sequential Code Optimization
- Control Flow Graph
  - Node = basic block; Edge = transfer of control flow
  - Succ(b) = successors of block b; Pred(b) = predecessors of block b
- Dominators
  - Block d dominates block b if every (sequential) path from START to b includes d
  - Dom(b) = set of dominators of block b
  - Every block has a unique immediate dominator (its parent in the dominator tree)
Dominator Example
[Figure: a control flow graph (START -> BB1; BB1 -> BB2 on T, BB3 on F; BB2, BB3 -> BB4; BB4 -> STOP) and its dominator tree, in which START is the parent of BB1, and BB1 is the immediate dominator of BB2, BB3, and BB4.]
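As a concrete companion to these definitions, the following small Java sketch (my own illustration, not part of the original talk) computes dominator sets for the example CFG above using the standard iterative data-flow formulation Dom(b) = {b} united with the intersection of Dom(p) over all predecessors p of b:

```java
import java.util.*;

// Illustrative sketch: iterative dominator computation for the example CFG.
public class DominatorSketch {
    public static void main(String[] args) {
        // CFG: START -> BB1; BB1 -> BB2 (T), BB3 (F); BB2, BB3 -> BB4; BB4 -> STOP
        Map<String, List<String>> pred = Map.of(
                "START", List.of(),
                "BB1",   List.of("START"),
                "BB2",   List.of("BB1"),
                "BB3",   List.of("BB1"),
                "BB4",   List.of("BB2", "BB3"),
                "STOP",  List.of("BB4"));

        Set<String> all = pred.keySet();
        Map<String, Set<String>> dom = new HashMap<>();
        for (String b : all) dom.put(b, new HashSet<>(all));   // initialize to "all nodes"
        dom.put("START", new HashSet<>(Set.of("START")));      // START dominates only itself

        boolean changed = true;
        while (changed) {
            changed = false;
            for (String b : all) {
                if (b.equals("START")) continue;
                Set<String> newDom = new HashSet<>(all);
                for (String p : pred.get(b)) newDom.retainAll(dom.get(p));  // intersection
                newDom.add(b);
                if (!newDom.equals(dom.get(b))) { dom.put(b, newDom); changed = true; }
            }
        }
        // For this CFG, Dom(BB4) = {START, BB1, BB4}: neither BB2 nor BB3 dominates BB4,
        // so BB1 is BB4's unique immediate dominator.
        all.forEach(b -> System.out.println("Dom(" + b + ") = " + dom.get(b)));
    }
}
```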
Anomalies in Control Flow Rules for Parallel Code
Example: BB1 ; parbegin BB2 || BB3 parend ; BB4
[Figure: parallel control flow graph BB1 -> FORK -> BB2, BB3 -> JOIN -> BB4.]
- Does BB4 have a unique immediate dominator?
- Can the dominator relation be represented as a tree?
(If every execution of BB4 must be preceded by executions of both BB2 and BB3, then both dominate BB4, so BB4 has no unique immediate dominator and the relation no longer forms a tree.)
2. Data Flow Rules from Sequential Code Optimization
Example: Reaching Definitions
- REACHin(n) = set of definitions d such that there is a (sequential) path from d to n in the CFG, and d is not killed along that path
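For comparison with the parallel case on the next slide, the sequential analysis is usually expressed with the textbook data-flow equations (a standard formulation, not taken from the original slides):

```latex
\mathrm{REACH}_{\mathrm{in}}(n) \;=\; \bigcup_{p \,\in\, \mathrm{Pred}(n)} \mathrm{REACH}_{\mathrm{out}}(p),
\qquad
\mathrm{REACH}_{\mathrm{out}}(n) \;=\; \mathrm{GEN}(n) \,\cup\, \bigl(\mathrm{REACH}_{\mathrm{in}}(n) \setminus \mathrm{KILL}(n)\bigr)
```

Here GEN(n) is the set of definitions generated in n and KILL(n) is the set of definitions of the same variables that n overwrites; both Pred(n) and the path-based notion of killing assume a sequential CFG.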
Anomalies in Data Flow Rules for Parallel Code

S1: X1 := ...
parbegin
   // Task 1
   S2: X2 := ... ; post(ev2);
   S3: ... ; post(ev3);
   S4: wait(ev8); X4 := ...
||
   // Task 2
   S5: ...
   S6: wait(ev2);
   S7: X7 := ...
   S8: wait(ev3); post(ev8);
parend
. . .

(In the corresponding graph, control edges order the statements within each task and synchronization edges connect the post/wait pairs across the tasks.)
- What definitions reach COEND (the parend)?
- What if there were no synchronization edges?
- How should the data flow equations be defined for parallel code?
3. Load Elimination Rules from Sequential Code Optimization
A load instruction at point P, T3 := *q, is redundant if the value of *q is available at point P:

Before:              After:
T1 := *q             T1 := *q
T2 := *p             T2 := *p
T3 := *q             T3 := T1
Anomalies in Load Elimination Rules for Parallel Code (Original Version)
Assume that p = q, and that *p = *q = 0 initially.

Task 1:                     Task 2:
. . .                       . . .
T1 := *q                    *p = 1
T2 := *p                    . . .
T3 := *q
print T1, T2, T3

Question: Is [0, 1, 0] permitted as a possible output?
Answer: It depends on the programming model.
- It is not permitted by Sequential Consistency [Lamport 1979]
- But it is permitted by Location Consistency [Gao & Sarkar 1993, 2000]
Anomalies in Load Elimination Rules for Parallel Code (After Load Elimination)
Assume that p = q, and that *p = *q = 0 initially.

Task 1:                     Task 2:
. . .                       . . .
T1 := *q                    *p = 1
T2 := *p                    . . .
T3 := T1
print T1, T2, T3

Question: Is [0, 1, 0] permitted as a possible output?
Answer: Yes, it will be permitted by Sequential Consistency, if load elimination is performed!
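To make the anomaly concrete, here is a hedged Java sketch (illustrative only; class, field, and variable names are hypothetical, and Java's own memory model already allows such results for racy non-volatile accesses). It mirrors the two tasks above and marks where load elimination changes the set of possible outputs:

```java
// Two aliased references p and q (p == q) and a racy writer, as in the slide.
class Cell { int val; }

public class LoadEliminationAnomaly {
    static Cell p = new Cell();
    static Cell q = p;                    // p and q alias the same location, initially 0

    public static void main(String[] args) throws InterruptedException {
        Thread task1 = new Thread(() -> {
            int t1 = q.val;               // T1 := *q
            int t2 = p.val;               // T2 := *p
            int t3 = q.val;               // original program: reload *q
            // After load elimination a compiler may instead emit: int t3 = t1;
            // which makes the output [0, 1, 0] possible even though no sequentially
            // consistent interleaving of the original code allows it.
            System.out.println("[" + t1 + ", " + t2 + ", " + t3 + "]");
        });
        Thread task2 = new Thread(() -> p.val = 1);   // racy write to *p (= *q)
        task1.start();
        task2.start();
        task1.join();
        task2.join();
    }
}
```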
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
Incremental Approaches to Coping with Parallel Code Optimization
- Large investment in infrastructures for sequential code optimization
- Introduce ad hoc rules to incrementally extend them for parallel code optimization:
  - Code motion fences at synchronization operations
  - Task creation and termination via function call interfaces
  - Use of volatile storage modifiers
  - . . .
More Comprehensive Changes will be needed for Code Optimization of Parallel Programs in the Future
- Need for a new Parallel Intermediate Representation (PIR) with robust support for code optimization of parallel programs:
  - Abstract execution model for PIR
  - Storage classes (types) for locality and memory hierarchies
  - General framework for task partitioning and code motion in parallel code
  - Compiler-friendly memory model
  - Combining automatic parallelization and explicit parallelism
  - . . .
Program Dependence Graphs [Ferrante, Ottenstein, Warren 1987]
A Program Dependence Graph, PDG = (N', Ecd, Edd), is derived from a CFG and consists of:
- N', a set of statement, predicate, and region nodes
- Ecd, a set of control dependence edges
- Edd, a set of data dependence edges
PDG Example

/* S1 */ max = a[i];
/* S2 */ div = a[i] / b[i];
/* S3 */ if ( max < b[i] )
/* S4 */    max = b[i];

[Figure: the PDG for this fragment, with data dependence edges on max: a true dependence from S1 to S3, an anti dependence from S3 to S4, and an output dependence from S1 to S4; S4 is control dependent on S3.]
PDG restrictions
Control dependence:
- Predicate-ancestor condition: if there are two disjoint c.d. paths from (ancestor) node A to node N, then A cannot be a region node, i.e., A must be a predicate node
- No-postdominating-descendant condition: if node P postdominates node N in the CFG, then there cannot be a c.d. path from node N to node P
Violation of the Predicate-Ancestor Condition can lead to "non-serializable" PDGs [LCPC 1993]
[Figure: an acyclic PDG in which node 4 is executed twice.]
"Parallel Program Graphs and their Classification", V. Sarkar & B. Simons, LCPC 1993
PDG restrictions (contd.)
Data dependence:
- There cannot be a data dependence edge in the PDG from node A to node B if there is no path from A to B in the CFG
- The context C of a data dependence edge (A, B, C) must be plausible, i.e., it cannot identify a dependence from an execution instance IA of node A to an execution instance IB of node B if IB precedes IA in the CFG's execution (e.g., a data dependence from iteration i+1 to iteration i is not plausible in a sequential program)
Limitations of Program Dependence Graphs
- PDGs and CFGs are tightly coupled: a transformation in one must be reflected in the other
- PDGs reveal maximum parallelism in the program; CFGs reveal sequential execution
- Neither is well suited for code optimization of parallel programs, e.g., how do we represent a partitioning of nodes { 1, 3, 4 } and { 2 } into two tasks?
Another Limitation: no Parallel Execution Semantics defined for PDGs

A[f(i,j)] = ...
... = A[g(i)]

- What is the semantics of control dependence edges with cycles?
- What is the semantics of data dependences when a source or destination node may have zero, one, or more instances?
Parallel Program Graphs: A Comprehensive Representation that Subsumes CFGs and PDGs [LCPC 1992]
A Parallel Program Graph, PPG = (N, Econtrol, Esync), consists of:
- N, a set of compute, predicate, and parallel nodes; a parallel node creates parallel threads of computation for each of its successors
- Econtrol, a set of labeled control edges; edge (A, B, L) in Econtrol identifies a control edge from node A to node B with label L
- Esync, a set of synchronization edges; edge (A, B, F) in Esync defines a synchronization from node A to node B with synchronization condition F, which identifies the execution instances of A and B that need to be synchronized

"A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs", V. Sarkar, LCPC 1992
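A minimal data-structure sketch of this definition, with illustrative names that are not taken from any Habanero or PPG implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Sketch of a Parallel Program Graph, PPG = (N, Econtrol, Esync). Names are illustrative.
class PpgNode {
    enum Kind { COMPUTE, PREDICATE, PARALLEL }   // a parallel node forks all of its successors
    final String name;
    final Kind kind;
    PpgNode(String name, Kind kind) { this.name = name; this.kind = kind; }
}

class ControlEdge {                  // (A, B, L): control transfer from A to B under label L
    final PpgNode from, to;
    final String label;
    ControlEdge(PpgNode from, PpgNode to, String label) {
        this.from = from; this.to = to; this.label = label;
    }
}

class SyncEdge {                     // (A, B, F): instances of A and B related by condition F
    final PpgNode from, to;
    final BiPredicate<Object, Object> condition;   // F over execution instances of A and B
    SyncEdge(PpgNode from, PpgNode to, BiPredicate<Object, Object> condition) {
        this.from = from; this.to = to; this.condition = condition;
    }
}

class ParallelProgramGraph {
    final List<PpgNode> nodes = new ArrayList<>();
    final List<ControlEdge> controlEdges = new ArrayList<>();
    final List<SyncEdge> syncEdges = new ArrayList<>();
}

public class PpgSketch {
    public static void main(String[] args) {
        ParallelProgramGraph g = new ParallelProgramGraph();
        PpgNode par = new PpgNode("PAR", PpgNode.Kind.PARALLEL);
        PpgNode s1 = new PpgNode("S1", PpgNode.Kind.COMPUTE);
        PpgNode s2 = new PpgNode("S2", PpgNode.Kind.COMPUTE);
        g.nodes.addAll(List.of(par, s1, s2));
        g.controlEdges.add(new ControlEdge(par, s1, "true"));
        g.controlEdges.add(new ControlEdge(par, s2, "true"));
        g.syncEdges.add(new SyncEdge(s1, s2, (i1, i2) -> true));  // all instances synchronized
        System.out.println(g.nodes.size() + " nodes, " + g.controlEdges.size()
                + " control edges, " + g.syncEdges.size() + " sync edges");
    }
}
```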
Relating CFGs to PPGs
Construction of the PPG for a sequential program:
- PPG nodes = CFG nodes
- PPG control edges = CFG edges
- PPG synchronization edges = empty set
Relating PDGs to PPGs
Construction of the PPG for a PDG:
- PPG nodes = PDG nodes
- PPG parallel nodes = PDG region nodes
- PPG control edges = PDG control dependence edges
- PPG synchronization edges = PDG data dependence edges
  - The synchronization condition F on a PPG synchronization edge mirrors the context of the corresponding PDG data dependence edge
Abstract Interpreter for PPGs
- Build a partial order of dynamic execution instances of PPG nodes as PPG execution unravels
- Each execution instance IA is labeled with its history (calling context), H(IA)
- Initialize the set of instances to a singleton containing an instance of the start node, ISTART, with H(ISTART) initialized to the empty sequence
Abstract Interpreter for PPGs (contd.)
Each iteration of the scheduling algorithm:
- Selects an execution instance IA such that all of IA's predecessors in the partial order have been scheduled
- Simulates execution of IA and evaluates branch label L
- Creates an instance IB of each c.d. successor B of A for label L
- Adds (IB, IC) to the partial order, if instance IC has been created and there exists a PPG synchronization edge from B to C (or from a PPG descendant of B to C)
- Adds (IC, IB) to the partial order, if instance IC has been created and there exists a PPG synchronization edge from C to B (or from a PPG descendant of C to B)
Abstract Interpreter for PPGs: Example
- Create ISTART
- Schedule ISTART
- Create IPAR
- Schedule IPAR
- Create I1, I2, I3
- Add (I1, I3) to the partial order
- Schedule I2
- Schedule I1
- Schedule I3
- . . .
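The following is a deliberately simplified Java sketch of the scheduling loop for this example (illustrative only: it elides histories H(I), branch labels, and instance-specific synchronization conditions, and assumes at most one instance per node):

```java
import java.util.*;

// Simplified scheduling loop for the example PPG: START -> PAR -> {1, 2, 3},
// with one synchronization edge from node 1 to node 3 (3 must follow 1).
public class PpgSchedulerSketch {
    public static void main(String[] args) {
        Map<String, List<String>> controlSucc = Map.of(
                "START", List.of("PAR"),
                "PAR",   List.of("1", "2", "3"));                  // PAR is a parallel node
        List<List<String>> syncEdges = List.of(List.of("1", "3"));

        Map<String, Set<String>> mustFollow = new HashMap<>();     // instance -> instances it waits for
        Set<String> created = new LinkedHashSet<>(List.of("START"));
        Set<String> scheduled = new LinkedHashSet<>();

        boolean progress = true;
        while (progress && scheduled.size() < created.size()) {
            progress = false;
            for (String inst : new ArrayList<>(created)) {
                if (scheduled.contains(inst)) continue;
                if (!scheduled.containsAll(mustFollow.getOrDefault(inst, Set.of()))) continue;
                // "Simulate" inst: create an instance of each control successor
                for (String succ : controlSucc.getOrDefault(inst, List.of())) {
                    created.add(succ);
                    for (List<String> e : syncEdges) {              // add induced ordering pairs
                        if (created.contains(e.get(0)) && created.contains(e.get(1)))
                            mustFollow.computeIfAbsent(e.get(1), k -> new HashSet<>())
                                      .add(e.get(0));
                    }
                }
                scheduled.add(inst);
                System.out.println("Scheduled " + inst);
                progress = true;
            }
        }
    }
}
```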
Weak (Deterministic) Memory Model for PPGs
- All memory accesses are assumed to be non-atomic
- Read-write hazard: if IA reads a location for which there is a parallel write of a different value, then the execution result is an error
  - Analogous to an exception thrown if a data race occurs; may be thrown when the read or write operation is performed
- Write-write hazard: if IA writes into a location for which there is a parallel write of a different value, then the resulting value in the location is undefined
  - Execution results in an error if that location is subsequently read
- Separation of data communication and synchronization:
  - Data communication is specified by read/write operations
  - Sequencing is specified by synchronization and control edges
Soundness Properties
- Reordering Theorem: for a given Parallel Program Graph G and input store i, the final store f = G(i) obtained is the same for all possible scheduled sequences in the abstract interpreter
- Equivalence Theorem: a sequential program and its PDG have identical semantics, i.e., they yield the same output store when executed with the same input store
Reaching Definitions Analysis on PPGs [LCPC 1997]
- A definition D is redefined at program point P if there is a control path from D to P, and D is killed along all paths from D to P

"Analysis and Optimization of Explicitly Parallel Programs using the Parallel Program Graph Representation", V. Sarkar, LCPC 1997
Reaching Definitions Analysis on PPGs (example)

S1: X1 := ...
// Task 1
S2: X2 := ... ; post(ev2);
S3: ... ; post(ev3);
S4: wait(ev8); X4 := ...
// Task 2
S5: ...
S6: wait(ev2);
S7: X7 := ...
S8: wait(ev3); post(ev8);

(In the PPG, control edges order the statements within each task, and synchronization edges connect the post/wait pairs across the two tasks.)
PPG Limitations
- Past work has focused on a comprehensive representation and semantics for deterministic programs
- Extensions are needed for:
  - Atomicity and mutual exclusion
  - Stronger memory models
  - Storage classes with explicit locality
Issues in Modeling Synchronized/Atomic Blocks [LCPC 1999]

a = ...
synchronized (L) {
   ... = p.x
   q.y = ...
   b = ...
}
... = r.z

Questions:
- Can the load of p.x be moved below the store of q.y?
- Can the load of p.x be moved outside the synchronized block?
- Can the load of r.z be moved inside the synchronized block?
- Can the load of r.z be moved back outside the synchronized block?
- How should the data dependences be modeled?

"Dependence Analysis for Java", C. Chambers et al, LCPC 1999
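For reference, here is one way to render the fragment in Java, with comments summarizing how the Java Memory Model's "roach motel" reordering rule bears on the questions above (my reading of the JMM, not the analysis of the LCPC 1999 paper; class and field names are hypothetical):

```java
// Hypothetical classes mirroring the slide's fragment.
class P { int x; }
class Q { int y; }
class R { int z; }

public class SyncBlockReordering {
    static final Object L = new Object();
    static P p = new P();
    static Q q = new Q();
    static R r = new R();

    static void example() {
        int a = 42;                 // a = ...
        int tmp1, b, tmp2;
        synchronized (L) {
            tmp1 = p.x;             // load of p.x
            q.y = 7;                // store to q.y
            b = 1;                  // b = ...
        }
        tmp2 = r.z;                 // load of r.z
        // Under the JMM's "roach motel" rule (my summary, not from the slides):
        // accesses may generally be moved INTO a synchronized region but not OUT
        // of it. So the load of r.z could be moved inside the block, but once
        // inside it could not be moved back out, and the load of p.x could not be
        // hoisted above the monitor enter. Whether p.x and q.y may be reordered
        // with each other depends on whether p and q can alias, which is exactly
        // the dependence-modeling question the slide raises.
        System.out.println(a + tmp1 + b + tmp2);
    }

    public static void main(String[] args) { example(); }
}
```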
Outline
- Paradigm Shifts
- Anomalies in Optimizing Parallel Code
- Incremental vs. Comprehensive Approaches to Code Optimization of Parallel Code
- Rice Habanero Multicore Software project
Habanero Project (habanero.rice.edu)
[Stack diagram: Parallel applications, with a foreign function interface to sequential C, Fortran, Java, ..., and building on X10, sit on top of:
1) Habanero Programming Language
2) Habanero Static Compiler
3) Habanero Virtual Machine
4) Habanero Concurrency Library
5) Habanero Toolkit
layered over vendor compilers & libraries and multicore hardware.]
2) Habanero Static Parallelizing & Optimizing Compiler
[Pipeline diagram: the X10/Habanero language front end (with a foreign function interface to sequential C, Fortran, Java, ...) produces an AST; interprocedural analysis and IRGen build the Parallel IR (PIR); PIR analysis & optimization feed classfile transformations that emit annotated classfiles for a portable managed runtime, and emit C / Fortran for restricted code regions targeting accelerators & high-end computing, which a platform-specific static compiler turns into partitioned code.]
Habanero Target Applications and Platforms

Applications:
- Parallel Benchmarks: SSCAs #1, #2, #3 from the DARPA HPCS program; NAS Parallel Benchmarks; JGF, JUC, SciMark benchmarks
- Medical Imaging: back-end processing for Compressive Sensing (www.dsp.ece.rice.edu/cs). Contacts: Rich Baraniuk (Rice), Jason Cong (UCLA)
- Seismic Data Processing: Rice Inversion project (www.trip.caam.rice.edu). Contacts: Bill Symes (Rice), James Gunning (CSIRO)
- Computer Graphics and Visualization: mathematical modeling and smoothing of meshes. Contact: Joe Warren (Rice)
- Computational Chemistry: Fock Matrix Construction. Contacts: David Bernholdt, Wael Elwasif, Robert Harrison, Annirudha Shet (ORNL)
- Habanero Compiler: implement the Habanero compiler in Habanero so as to exploit multicore parallelism within the compiler

Platforms:
- AMD Barcelona Quad-Core
- Clearspeed Advance X620
- DRC Coprocessor Module w/ Xilinx Virtex FPGA
- IBM Cell
- IBM Cyclops-64 (C-64)
- IBM Power5+, Power6
- Intel Xeon Quad-Core
- NVIDIA Tesla S870
- Sun UltraSparc T1, T2
- . . .

Additional suggestions welcome!
Habanero Research Topics
1) Language Research
- Explicit parallelism: portable constructs for homogeneous & heterogeneous multicore
- Implicit deterministic parallelism: array views, single-assignment constructs
- Implicit non-deterministic parallelism: unordered iterators, partially ordered statement blocks
- Builds on our experiences with the X10, CAF, HPF, Matlab D, Fortran 90 and Sisal languages

2) Compiler Research
- New Parallel Intermediate Representation (PIR)
- Automatic analysis, transformation, and parallelization of PIR
- Optimization of high-level arrays and iterators
- Optimization of synchronization, data transfer, and transactional memory operations
- Code partitioning for accelerators
- Builds on our experiences with the D System, Massively Scalar, Telescoping Languages Framework, ASTI and PTRAN research compilers
Habanero Research Topics (contd.)
3) Virtual Machine Research
- VM support for work-stealing scheduling algorithms, with extensions for places, transactions, task groups
- Runtime support for other Habanero language constructs (phasers, regions, distributions)
- Integration and exploitation of lightweight profiling in the VM scheduler and memory management system
- Builds on our experiences with the Jikes Research Virtual Machine

4) Concurrency Library Research
- New nonblocking data structures to support the Habanero runtime
- Efficient software transactional memory libraries
- Builds on our experiences with the java.util.concurrent and DSTM2 libraries

5) Toolkit Research
- Program analysis for common parallel software errors
- Performance attribution of shared code regions (loops, procedure calls) using static and dynamic calling context
- Builds on our experiences with the HPCToolkit, Eclipse PTP and DrJava projects
Opportunities for Broader Impact
- Education: influence how parallelism is taught in future Computer Science curricula
- Open source: build an open source testbed to grow an ecosystem for researchers in the parallel software area
- Industry standards: use research results as proofs of concept for new features that can be standardized; the infrastructure can provide a foundation for reference implementations

Collaborations welcome!