An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications Daniel Chavarría-Miranda John Mellor-Crummey Dept. of Computer Science Rice University
High-Performance Fortran (HPF) • Industry-standard data-parallel language • Partitioning of data drives partitioning of computation, … • Compilation flow (diagram): HPF program (sequential Fortran program + data partitioning) → HPF compilation (partition computation, insert comm / sync, manage storage) → parallel machine, giving the same answers as the Fortran program
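As a minimal illustration of the HPF model (not taken from the benchmarks; the array, its size, and the processor arrangement are made up), standard directives declare the data partitioning and the compiler derives the computation partitioning from it:

      real a(100, 100)
      integer i, j
! Logical 4x4 processor arrangement and a block distribution of a;
! the compiler partitions the loop below according to which processor
! owns each a(i, j) (owner computes), inserting communication as needed.
!HPF$ PROCESSORS p(4, 4)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p
      do j = 1, 100
         do i = 1, 100
            a(i, j) = real(i + j)
         enddo
      enddo
      end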
Motivation Obtaining high performance from applications written in high-level parallel languages has been elusive • Tightly-coupled applications are particularly hard • Data dependences serialize computation • which induces tradeoffs between parallelism, communication granularity and frequency • Traditional HPF partitionings limit scalability and performance • Communication might be needed inside loops
Contributions • A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications • An analysis of their performance impact
dHPF Compiler • Based on an abstract equational framework • manipulates sets of processors, array elements, iterations and pairwise mappings between these sets • optimizations and code generation are implemented as operations on these sets and mappings • Sophisticated computation partitioning model • enables partial replication of computation to reduce communication • Support for the multipartitioning distribution • MULTI distribution specifier • suited for line-sweep computations • Innovative optimizations • reduce communication • improve locality
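The MULTI specifier is a dHPF extension rather than standard HPF; the fragment below is only a sketch of how a multipartitioned array might be declared (the array, its extents, and the exact accepted directive syntax are assumptions):

      double precision u(102, 102, 102)
! dHPF extension (sketch): multipartition u so that each processor owns
! one tile in every hyperplane of each partitioned dimension, giving
! full parallelism for line sweeps along any of the three dimensions.
!HPF$ DISTRIBUTE u(MULTI, MULTI, MULTI)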
Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions
Line-Sweep Computations • 1D recurrences on a multidimensional domain • Recurrences order the computation along each dimension • Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
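For concreteness, a minimal sketch of a forward line sweep along one dimension of a 2D domain (the array names and the coefficient are illustrative):

! Forward sweep along j: a(i, j) depends on a(i, j - 1), so the j loop
! is serialized by a loop-carried dependence, while the i loop is fully
! parallel. A backward sweep then runs j = n - 1, 1, -1.
      do j = 2, n
         do i = 1, n
            a(i, j) = a(i, j) - c(i, j) * a(i, j - 1)
         enddo
      enddo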
Partitioning Choices (Transpose) • Local sweeps along x and z • Transpose • Local sweep along y • Transpose back
Partitioning Choices (block + CGP) • Partial wavefront-type parallelism (figure: block-partitioned domain across Processors 0 to 3)
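A hedged sketch of what coarse-grain pipelining means for a sweep along a block-partitioned dimension: each processor receives a boundary section from its predecessor, sweeps a coarse block of the orthogonal dimension, and forwards its own boundary to the successor. The subroutine, the array bounds, the block size NB, and the recurrence are illustrative, not dHPF's generated code:

      subroutine sweep_cgp(a, c, n, jlo, jhi, myrank, nprocs)
      implicit none
      include 'mpif.h'
      integer n, jlo, jhi, myrank, nprocs
! Local block of columns jlo..jhi plus one halo column at jlo - 1;
! rank 0 is assumed to pass jlo = 2 so the recurrence has a start.
      double precision a(n, jlo-1:jhi), c(n, jlo:jhi)
      integer NB
      parameter (NB = 16)
      integer ib, i, j, nbw, ierr, status(MPI_STATUS_SIZE)

! Pipeline the j sweep in coarse blocks of the i dimension so that
! successive processors overlap their work on different i blocks.
      do ib = 1, n, NB
         nbw = min(NB, n - ib + 1)
         if (myrank .gt. 0) then
! Receive the halo column section computed by the predecessor.
            call MPI_RECV(a(ib, jlo-1), nbw, MPI_DOUBLE_PRECISION,
     &                    myrank-1, 0, MPI_COMM_WORLD, status, ierr)
         endif
! Sweep this block of columns: loop-carried dependence in j.
         do j = jlo, jhi
            do i = ib, ib + nbw - 1
               a(i, j) = a(i, j) - c(i, j) * a(i, j-1)
            enddo
         enddo
         if (myrank .lt. nprocs-1) then
! Forward the last local column section to the successor.
            call MPI_SEND(a(ib, jhi), nbw, MPI_DOUBLE_PRECISION,
     &                    myrank+1, 0, MPI_COMM_WORLD, ierr)
         endif
      enddo
      end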
Partitioning Choices (multipartitioning) • Full parallelism for sweeping along any partitioned dimension (figure: multipartitioned tiles for Processors 0 to 3)
NAS SP & BT Benchmarks • NAS SP & BT benchmarks from NASA Ames • use ADI to solve the Navier-Stokes equations in 3D • forward & backward line sweeps along each dimension, for each time step • SP solves scalar penta-diagonal systems • BT solves block-tridiagonal systems • SP has roughly double the communication volume and frequency of BT
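The per-time-step structure is roughly the following sketch; the routine names mirror the benchmarks' conventions (compute_rhs appears later in this talk), but the exact call sequence here is illustrative:

! One ADI time step (sketch): build the right-hand side, then perform
! forward and backward line sweeps along each of the three dimensions,
! and finally add the computed increment to the solution.
      do step = 1, niter
         call compute_rhs
         call x_solve
         call y_solve
         call z_solve
         call add
      enddo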
Experimental Setup • 2 versions from NASA, each written in Fortran 77 • parallel MPI hand-coded version • sequential version (3500 lines) • dHPF input: sequential version + HPF directives (including MULTI, 2% line count increase) • Inlined several procedures manually: • enables dHPF to overlap local computation with communication without interprocedural tiling • Platform: SGI Origin 2000 (128 250 MHz procs.), SGI’s MPI implementation, SGI’s compilers
Performance Comparison Compare four versions of NAS SP & BT • Multipartitioned MPI hand-coded version from NASA • different executables for each number of processors • Multipartitioned dHPF-generated version • single executable for all numbers of processors • Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition) • single executable for all numbers of processors • Block-partitioned pghpf-compiled version from PGI’s source code (using a full transpose with a 1D partition) • single executable for all numbers of processors
Efficiency for NAS SP (102³, class 'B' size) • [Figure: parallel efficiency; annotations: similar comm. volume, more serialization; > 2x multipartitioning comm. volume]
Efficiency for NAS BT (102³, class 'B' size) • [Figure: parallel efficiency; annotation: > 2x multipartitioning comm. volume]
Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions
Evaluation Methodology • All versions are dHPF-generated using multipartitioning • Turn off a particular optimization (“n - 1” approach) • determine overhead without it (% over fully optimized) • Measure its contribution to overall performance • total execution time • total communication volume • L2 data cache misses (where appropriate) • Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)
Partially Replicated Computation • Partial computation replication is used to reduce communication • Figure annotations: SHADOW a(2, 2); ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j); ON_EXT_HOME a(i, j)
Impact of Partial Replication • BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz> • Both: eliminate comm. for six 3D arrays in compute_rhs
Interprocedural Communication Reduction Extensions to HPF/JA Directives • REFLECT: placement of near-neighbor communication • LOCAL: communication not needed within a scope • extended ON_HOME: partial computation replication • The compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh
Interprocedural Communication Reduction (cont.) • Figure: with SHADOW a(2, 1), REFLECT (a(0:0, 1:0), a(1:0, 0:0)) names only the shadow sections that must be refreshed (from the top and left neighbors), versus REFLECT (a), which refreshes them all • The combination of REFLECT, extended ON_HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time
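A hedged sketch of how REFLECT might be placed ahead of the sweeps so that callees can rely on fresh shadow regions; the !HPF$ sentinel, the shadow width, the array size, and the routine names sweep_x / sweep_y are assumptions, not the benchmarks' actual code:

      integer nx, ny
      parameter (nx = 100, ny = 100)
      real a(nx, ny)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ SHADOW a(1, 1)
! Refresh the shadow regions of a once, up front; combined with LOCAL
! on the callees, the compiler need not re-derive this communication
! interprocedurally inside sweep_x and sweep_y.
!HPF$ REFLECT (a)
      call sweep_x(a, nx, ny)
      call sweep_y(a, nx, ny)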
Normalizing Communication

do i = 1, n
   do j = 2, n - 2
      a(i, j) = a(i, j - 2)       ! ON_HOME a(i, j)
      a(i, j + 2) = a(i, j)       ! ON_HOME a(i, j + 2)
   enddo
enddo

• Both statements need the same non-local data (figure: on processors P0 and P1, each statement reads data two elements away in j from its ON_HOME reference, so both require the same columns from the neighboring processor)
Coalescing Communication • Figure: the non-local sections of A required by both normalized references are combined and sent as a single coalesced message
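A hedged manual sketch of the effect (not the compiler's generated code): the two boundary columns needed by the neighbor are packed into one buffer and sent as a single message; buf, jhi, right, and the column choice are illustrative names.

! Instead of one message per referenced halo column, pack both boundary
! columns of a into a single buffer and send one coalesced message.
      do i = 1, n
         buf(i)     = a(i, jhi - 1)
         buf(n + i) = a(i, jhi)
      enddo
      call MPI_SEND(buf, 2 * n, MPI_DOUBLE_PRECISION,
     &              right, 0, MPI_COMM_WORLD, ierr)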
Impact of Normalized Coalescing • [Figure: overhead without normalized coalescing] • Key optimization for scalability
Direct Access Buffers Choices for receiving complex coalesced messages • Unpack them into the shadow regions • two simultaneous live copies in cache • unpacking can be costly • uniform access to non-local & local data • Reference them directly out of the receive buffer • introduces two modes of access for data (non-local & interior) • overhead of having a single loop with these two modes is high • loops should be split into non-local & interior portions, according to the data they reference
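A hedged sketch of the loop split (the buffer name recv_buf, the loop bounds, and the recurrence are illustrative): the boundary loop reads the neighbor's column directly out of the receive buffer, and the interior loop touches only local data.

! Boundary portion: the first local column jlo of the sweep needs the
! neighbor's column, read directly from the message buffer instead of
! being unpacked into a shadow region.
      do i = 1, n
         a(i, jlo) = a(i, jlo) - c(i, jlo) * recv_buf(i)
      enddo
! Interior portion: references only locally owned columns, so it keeps
! a single, uniform access mode.
      do j = jlo + 1, jhi
         do i = 1, n
            a(i, j) = a(i, j) - c(i, j) * a(i, j - 1)
         enddo
      enddo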
Impact of Direct Access Buffers • Use direct access buffers for the main swept arrays • Direct access buffers + loop splitting reduces L2 data cache misses by ~11%, resulting in a reduction of ~11% in execution time
Conclusions • Compiler-generated code can match the performance of sophisticated hand-coded parallelizations • High performance comes from the aggregate benefit of multiple optimizations • Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed • Data-parallel compilers must target every potential source of inefficiency in the generated code to deliver the performance scientific users demand
Partially Replicated Computation (figure: processors p and p + 1, each holding local portions of A, U and B plus shadow regions; the boundary computation of a is replicated rather than communicated)

do i = 1, n
   do j = 2, n
      a(i,j) = u(i,j-1) + 1.0           ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
      b(i,j) = u(i,j-1) + a(i,j-1)      ! ON_HOME a(i,j)
   enddo
enddo
Normalized Comm. Coalescing (cont.)

do timestep = 1, T
   do j = 1, n
      do i = 3, n
         a(i, j) = a(i + 1, j) + b(i - 1, j)         ! ON_HOME a(i,j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 2
         a(i + 2, j) = a(i + 3, j) + b(i + 1, j)     ! ON_HOME a(i + 2, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 1
         a(i + 1, j) = a(i + 2, j) + b(i + 1, j)     ! ON_HOME b(i + 1, j)
      enddo
   enddo
enddo

(Figure annotation: coalesce communication at this point)
Direct Access Buffers • Figure: conventional scheme between Processor 0 and Processor 1: pack, send, receive & unpack
Direct Access Buffers • Figure: direct-access scheme between Processor 0 and Processor 1: pack, send & receive, then use the buffer directly