An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications Daniel Chavarría-Miranda John Mellor-Crummey Dept. of Computer Science Rice University
High-Performance Fortran (HPF) • Industry-standard data-parallel language • Partitioning of data drives partitioning of computation, … • Compilation flow (diagram): HPF program (sequential Fortran program + data partitioning) → HPF compilation (partition computation, insert comm / sync, manage storage) → parallel machine, giving the same answers as the Fortran program
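As a minimal illustration of the HPF model (not taken from the benchmarks; the array, its size, and the processor arrangement are made up), standard directives declare the data partitioning and the compiler derives the computation partitioning from it:

      real a(100, 100)
      integer i, j
! Logical 4x4 processor arrangement and a block distribution of a;
! the compiler partitions the loop below according to which processor
! owns each a(i, j) (owner computes), inserting communication as needed.
!HPF$ PROCESSORS p(4, 4)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p
      do j = 1, 100
         do i = 1, 100
            a(i, j) = real(i + j)
         enddo
      enddo
      end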
Motivation Obtaining high performance from applications written in high-level parallel languages has been elusive • Tightly-coupled applications are particularly hard • Data dependences serialize computation • which induces tradeoffs between parallelism, communication granularity and frequency • Traditional HPF partitionings limit scalability and performance • Communication might be needed inside loops
Contributions • A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications • An analysis of their performance impact
dHPF Compiler • Based on an abstract equational framework • manipulates sets of processors, array elements, iterations and pairwise mappings between these sets • optimizations and code generation are implemented as operations on these sets and mappings • Sophisticated computation partitioning model • enables partial replication of computation to reduce communication • Support for the multipartitioning distribution • MULTI distribution specifier • suited for line-sweep computations • Innovative optimizations • reduce communication • improve locality
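The MULTI specifier is a dHPF extension rather than standard HPF; the fragment below is only a sketch of how a multipartitioned array might be declared (the array, its extents, and the exact accepted directive syntax are assumptions):

      double precision u(102, 102, 102)
! dHPF extension (sketch): multipartition u so that each processor owns
! one tile in every hyperplane of each partitioned dimension, giving
! full parallelism for line sweeps along any of the three dimensions.
!HPF$ DISTRIBUTE u(MULTI, MULTI, MULTI)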
Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions
Line-Sweep Computations • 1D recurrences on a multidimensional domain • Recurrences order the computation along each dimension • Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
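For concreteness, a minimal sketch of a forward line sweep along one dimension of a 2D domain (the array names and the coefficient are illustrative):

! Forward sweep along j: a(i, j) depends on a(i, j - 1), so the j loop
! is serialized by a loop-carried dependence, while the i loop is fully
! parallel. A backward sweep then runs j = n - 1, 1, -1.
      do j = 2, n
         do i = 1, n
            a(i, j) = a(i, j) - c(i, j) * a(i, j - 1)
         enddo
      enddo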
Partitioning Choices (Transpose) • Local sweeps along x and z • Transpose • Local sweep along y • Transpose back
Partitioning Choices (block + CGP) • Partial wavefront-type parallelism (figure: block-partitioned domain across Processors 0 to 3)
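A hedged sketch of what coarse-grain pipelining means for a sweep along a block-partitioned dimension: each processor receives a boundary section from its predecessor, sweeps a coarse block of the orthogonal dimension, and forwards its own boundary to the successor. The subroutine, the array bounds, the block size NB, and the recurrence are illustrative, not dHPF's generated code:

      subroutine sweep_cgp(a, c, n, jlo, jhi, myrank, nprocs)
      implicit none
      include 'mpif.h'
      integer n, jlo, jhi, myrank, nprocs
! Local block of columns jlo..jhi plus one halo column at jlo - 1;
! rank 0 is assumed to pass jlo = 2 so the recurrence has a start.
      double precision a(n, jlo-1:jhi), c(n, jlo:jhi)
      integer NB
      parameter (NB = 16)
      integer ib, i, j, nbw, ierr, status(MPI_STATUS_SIZE)

! Pipeline the j sweep in coarse blocks of the i dimension so that
! successive processors overlap their work on different i blocks.
      do ib = 1, n, NB
         nbw = min(NB, n - ib + 1)
         if (myrank .gt. 0) then
! Receive the halo column section computed by the predecessor.
            call MPI_RECV(a(ib, jlo-1), nbw, MPI_DOUBLE_PRECISION,
     &                    myrank-1, 0, MPI_COMM_WORLD, status, ierr)
         endif
! Sweep this block of columns: loop-carried dependence in j.
         do j = jlo, jhi
            do i = ib, ib + nbw - 1
               a(i, j) = a(i, j) - c(i, j) * a(i, j-1)
            enddo
         enddo
         if (myrank .lt. nprocs-1) then
! Forward the last local column section to the successor.
            call MPI_SEND(a(ib, jhi), nbw, MPI_DOUBLE_PRECISION,
     &                    myrank+1, 0, MPI_COMM_WORLD, ierr)
         endif
      enddo
      end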
Partitioning Choices (multipartitioning) • Full parallelism for sweeping along any partitioned dimension (figure: multipartitioned tiles for Processors 0 to 3)
NAS SP & BT Benchmarks • NAS SP & BT benchmarks from NASA Ames • use ADI to solve the Navier-Stokes equations in 3D • forward & backward line sweeps along each dimension, for each time step • SP solves scalar penta-diagonal systems • BT solves block-tridiagonal systems • SP has roughly double the communication volume and frequency of BT
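The per-time-step structure is roughly the following sketch; the routine names mirror the benchmarks' conventions (compute_rhs appears later in this talk), but the exact call sequence here is illustrative:

! One ADI time step (sketch): build the right-hand side, then perform
! forward and backward line sweeps along each of the three dimensions,
! and finally add the computed increment to the solution.
      do step = 1, niter
         call compute_rhs
         call x_solve
         call y_solve
         call z_solve
         call add
      enddo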
Experimental Setup • 2 versions from NASA, each written in Fortran 77 • parallel MPI hand-coded version • sequential version (3500 lines) • dHPF input: sequential version + HPF directives (including MULTI, 2% line count increase) • Inlined several procedures manually: • enables dHPF to overlap local computation with communication without interprocedural tiling • Platform: SGI Origin 2000 (128 250 MHz procs.), SGI’s MPI implementation, SGI’s compilers
Performance Comparison Compare four versions of NAS SP & BT • Multipartitioned MPI hand-coded version from NASA • different executables for each number of processors • Multipartitioned dHPF-generated version • single executable for all numbers of processors • Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition) • single executable for all numbers of processors • Block-partitioned pghpf-compiled version from PGI’s source code (using a full transpose with a 1D partition) • single executable for all numbers of processors
Efficiency for NAS SP (102³, class 'B' size) • [Figure: parallel efficiency; annotations: similar comm. volume, more serialization; > 2x multipartitioning comm. volume]
Efficiency for NAS BT (102³, class 'B' size) • [Figure: parallel efficiency; annotation: > 2x multipartitioning comm. volume]
Overview • Introduction • Line Sweep Computations • Performance Comparison • Optimization Evaluation • Partially Replicated Computation • Interprocedural Communication Elimination • Communication Coalescing • Direct Access Buffers • Conclusions
Evaluation Methodology • All versions are dHPF-generated using multipartitioning • Turn off a particular optimization (“n - 1” approach) • determine overhead without it (% over fully optimized) • Measure its contribution to overall performance • total execution time • total communication volume • L2 data cache misses (where appropriate) • Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)
Partially Replicated Computation • Partial computation replication is used to reduce communication • Figure annotations: SHADOW a(2, 2); ON_HOME a(i-2, j) ∪ ON_HOME a(i+2, j) ∪ ON_HOME a(i, j-2) ∪ ON_HOME a(i-1, j+1) ∪ ON_HOME a(i, j); ON_EXT_HOME a(i, j)
Impact of Partial Replication • BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz> • Both: eliminate comm. for six 3D arrays in compute_rhs
Interprocedural Communication Reduction Extensions to HPF/JA Directives • REFLECT: placement of near-neighbor communication • LOCAL: communication not needed within a scope • extended ON_HOME: partial computation replication • The compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh
Interprocedural Communication Reduction (cont.) • Figure: with SHADOW a(2, 1), REFLECT (a(0:0, 1:0), a(1:0, 0:0)) names only the shadow sections that must be refreshed (from the top and left neighbors), versus REFLECT (a), which refreshes them all • The combination of REFLECT, extended ON_HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time
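A hedged sketch of how REFLECT might be placed ahead of the sweeps so that callees can rely on fresh shadow regions; the !HPF$ sentinel, the shadow width, the array size, and the routine names sweep_x / sweep_y are assumptions, not the benchmarks' actual code:

      integer nx, ny
      parameter (nx = 100, ny = 100)
      real a(nx, ny)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ SHADOW a(1, 1)
! Refresh the shadow regions of a once, up front; combined with LOCAL
! on the callees, the compiler need not re-derive this communication
! interprocedurally inside sweep_x and sweep_y.
!HPF$ REFLECT (a)
      call sweep_x(a, nx, ny)
      call sweep_y(a, nx, ny)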
Normalizing Communication

do i = 1, n
   do j = 2, n - 2
      a(i, j) = a(i, j - 2)       ! ON_HOME a(i, j)
      a(i, j + 2) = a(i, j)       ! ON_HOME a(i, j + 2)
   enddo
enddo

• Both statements need the same non-local data (figure: on processors P0 and P1, each statement reads data two elements away in j from its ON_HOME reference, so both require the same columns from the neighboring processor)
Coalescing Communication • Figure: the non-local sections of A required by both normalized references are combined and sent as a single coalesced message
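A hedged manual sketch of the effect (not the compiler's generated code): the two boundary columns needed by the neighbor are packed into one buffer and sent as a single message; buf, jhi, right, and the column choice are illustrative names.

! Instead of one message per referenced halo column, pack both boundary
! columns of a into a single buffer and send one coalesced message.
      do i = 1, n
         buf(i)     = a(i, jhi - 1)
         buf(n + i) = a(i, jhi)
      enddo
      call MPI_SEND(buf, 2 * n, MPI_DOUBLE_PRECISION,
     &              right, 0, MPI_COMM_WORLD, ierr)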
Impact of Normalized Coalescing • [Figure: overhead without normalized coalescing] • Key optimization for scalability
Direct Access Buffers Choices for receiving complex coalesced messages • Unpack them into the shadow regions • two simultaneous live copies in cache • unpacking can be costly • uniform access to non-local & local data • Reference them directly out of the receive buffer • introduces two modes of access for data (non-local & interior) • overhead of having a single loop with these two modes is high • loops should be split into non-local & interior portions, according to the data they reference
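A hedged sketch of the loop split (the buffer name recv_buf, the loop bounds, and the recurrence are illustrative): the boundary loop reads the neighbor's column directly out of the receive buffer, and the interior loop touches only local data.

! Boundary portion: the first local column jlo of the sweep needs the
! neighbor's column, read directly from the message buffer instead of
! being unpacked into a shadow region.
      do i = 1, n
         a(i, jlo) = a(i, jlo) - c(i, jlo) * recv_buf(i)
      enddo
! Interior portion: references only locally owned columns, so it keeps
! a single, uniform access mode.
      do j = jlo + 1, jhi
         do i = 1, n
            a(i, j) = a(i, j) - c(i, j) * a(i, j - 1)
         enddo
      enddo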
Impact of Direct Access Buffers • Use direct access buffers for the main swept arrays • Direct access buffers + loop splitting reduces L2 data cache misses by ~11%, resulting in a reduction of ~11% in execution time
Conclusions • Compiler-generated code can match the performance of sophisticated hand-coded parallelizations • High performance comes from the aggregate benefit of multiple optimizations • Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed • Data-parallel compilers must target every potential source of inefficiency in the generated code to deliver the performance scientific users demand
Partially Replicated Computation (figure: processors p and p + 1, each holding local portions of A, U and B plus shadow regions; the boundary computation of a is replicated rather than communicated)

do i = 1, n
   do j = 2, n
      a(i,j) = u(i,j-1) + 1.0           ! ON_HOME a(i,j) ∪ ON_HOME a(i,j+1)
      b(i,j) = u(i,j-1) + a(i,j-1)      ! ON_HOME a(i,j)
   enddo
enddo
Normalized Comm. Coalescing (cont.)

do timestep = 1, T
   do j = 1, n
      do i = 3, n
         a(i, j) = a(i + 1, j) + b(i - 1, j)         ! ON_HOME a(i,j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 2
         a(i + 2, j) = a(i + 3, j) + b(i + 1, j)     ! ON_HOME a(i + 2, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 1
         a(i + 1, j) = a(i + 2, j) + b(i + 1, j)     ! ON_HOME b(i + 1, j)
      enddo
   enddo
enddo

(Figure annotation: coalesce communication at this point)
Direct Access Buffers • Figure: conventional scheme between Processor 0 and Processor 1: pack, send, receive & unpack
Direct Access Buffers • Figure: direct-access scheme between Processor 0 and Processor 1: pack, send & receive, then use the buffer directly