1. Co-array Fortran: Compilation, Performance, Language Issues
John Mellor-Crummey
Cristian Coarfa
Yuri Dotsenko
Department of Computer Science
Rice University
2. Outline Co-array Fortran language recap
Compilation approach
Co-array storage management
Communication
A preliminary performance study
Platforms
Benchmarks, results, and lessons
Language refinement issues
Conclusions
3. CAF Language Assessment Strengths
offloads communication management to the compiler
choreographing data transfer
managing mechanics of synchronization
gives user full control of parallelization
data movement and synchronization as language primitives
amenable to compiler optimization
array syntax supports natural user-level vectorization
modest compiler technology can yield good performance
more abstract than MPI → better performance portability
Weaknesses
user manages partitioning of work
user specifies data movement
user codes necessary synchronization
4. Compiler Goals Portable compiler
Multi-platform code generation
High performance generated code
5. Compilation Approach Source-to-source Translation
Translate CAF into Fortran 90 + communication calls
One-sided communication layer
strided communication
gather/scatter
synchronization: barriers, notify/wait
split phase non-blocking primitives
Today: ARMCI: remote memory copy interface (Nieplocha @ PNL)
Benefits
wide portability
leverage vendor F90 compilers for good node performance
6. Co-array Data Co-array representation
F90 pointer to data + opaque handle for communication layer
Co-array access
read/write local co-array data using F90 pointer dereference
remote accesses translate into ARMCI GET/PUT calls
Co-array allocation
storage allocation by communication layer, as appropriate
on shared memory hardware: in shared memory segment
on Myrinet: in pinned memory for direct DMA access
dope vector initialization using CHASM (Rasmussen @ LANL)
set F90 pointer to point to externally managed memory
Develop compiler analysis and code generation technology to choreograph communication and computation in parallel programs to deliver maximum performance
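As a rough illustration of this representation (the type and field names below are hypothetical, not the actual cafc data structures), each co-array can be thought of as a pair of an F90 pointer and an opaque handle:

type co_array_desc                  ! hypothetical sketch, not compiler internals
  real(8), pointer :: local(:,:)    ! F90 pointer view of the local co-array data
  integer(8)       :: handle        ! opaque handle for the communication layer (ARMCI)
end type co_array_desc
! local reads/writes simply dereference the pointer, e.g. a%local(i,j) = 1.0d0;
! remote references such as a(i,j)[p] are translated into ARMCI GET/PUT calls
! on a%handle targeting image p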
7. Allocating Static Co-arrays (COMMON/SAVE) Compiler:
generate static initializer for each common/save variable
Linker
collect calls to all initializers
generate global initializer that calls all others
compile global initializer and link into program
Launch
call global initializer before main program begins
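A minimal sketch of that chain, with made-up names (caf_alloc, the _init_ routines, and the runtime module are illustrative stand-ins for the generated code and runtime interface, not the actual cafc symbols):

! generated per-COMMON initializer: allocate storage through the
! communication layer and associate the F90 pointer view with it
subroutine caf_init_common_field()
  use caf_runtime_sketch, only: caf_alloc   ! hypothetical runtime entry point
  real(8), pointer :: u(:,:)
  integer(8)       :: u_handle
  common /field_view/ u, u_handle
  call caf_alloc(100, 100, u, u_handle)     ! storage + handle from the ARMCI-based layer
end subroutine caf_init_common_field

! linker-synthesized global initializer, called before the main program begins
subroutine caf_global_init()
  call caf_init_common_field()              ! ... one call per COMMON/SAVE co-array
end subroutine caf_global_init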
8. COMMON Block Sequence Association Problem
each procedure may have a different view of a common
Solution
allocate a contiguous pool of co-array storage per common
each procedure has a private set of view variables (F90 pointers)
initialize all per procedure view variables
only once at launch after common allocation
9. Porting to a new Compiler / Architecture Synthesize dope vectors for co-array storage
compiler/architecture specific details: CHASM library
Tailor communication to architecture
design supports alternate communication libraries
status
today: ARMCI (PNL)
ongoing work: compiler tailored communication
direct load/store on shared-memory architectures
future
other portable libraries (e.g., GASNet)
custom communication library for an architecture
10. Supporting Multiple Co-dimensions A(:,:)[N,M,*]
Add precomputed coefficients to co-array meta-data
Lower, upper bounds for each co-dimension
this_image_cache for each co-dimension
e.g., this_image(a,1) yields my co-row index
cum_hyperplane_size for each co-dimension
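A minimal sketch of the arithmetic those cached values encode for A(:,:)[N,M,*] (the routine names are illustrative, and the mapping assumed here linearizes co-subscripts with the first co-dimension varying fastest):

! co-subscripts [i,j,k] -> linear image number, using
! cum_hyperplane_size = (/ 1, N, N*M /)
integer function image_of(i, j, k, n, m)
  implicit none
  integer, intent(in) :: i, j, k, n, m
  image_of = i + (j-1)*n + (k-1)*n*m
end function image_of

! inverse mapping: cache my co-indices once, so this_image(a,d) becomes a lookup
subroutine cache_my_coindices(me, n, m, cache)
  implicit none
  integer, intent(in)  :: me, n, m    ! me = this_image()
  integer, intent(out) :: cache(3)    ! this_image_cache(1:3)
  cache(1) = mod(me-1, n) + 1         ! my co-row index
  cache(2) = mod((me-1)/n, m) + 1
  cache(3) = (me-1)/(n*m) + 1
end subroutine cache_my_coindices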
11. Implementing Communication Given a statement
X(1:n) = A(1:n)[p] + …
A temporary buffer is used for off processor data
invoke communication library to allocate tmp in suitable temporary storage
dope vector filled in so tmp can be accessed as F90 pointer
call communication library to fill in tmp (ARMCI GET)
X(1:n) = tmp(1:n) + …
deallocate tmp
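A sketch of the translated code for that statement, with hypothetical runtime names (caf_buf_alloc, caf_get, caf_buf_free, a_handle) standing in for the actual ARMCI-based interface, and y assumed to be some local array completing the right-hand side:

! original CAF:  X(1:n) = A(1:n)[p] + y(1:n)
real(8), pointer :: tmp(:)
call caf_buf_alloc(n, tmp)            ! tmp allocated in suitable (e.g., pinned) storage
call caf_get(a_handle, p, 1, n, tmp)  ! ARMCI GET: fetch A(1:n) from image p into tmp
x(1:n) = tmp(1:n) + y(1:n)            ! purely local computation on the copy
call caf_buf_free(tmp)                ! release the temporary buffer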
13. Supported Features Declarations
co-objects: scalars and arrays
COMMON and SAVE co-objects of primitive types
INTEGER(4), REAL(4) and REAL(8)
COMMON blocks: variables and co-objects intermixed
co-objects with multiple co-dimensions
procedure interface blocks with co-array arguments
Executable code
array section notation for co-array data indices
local and remote co-arrays
co-array argument passing
co-array dummy arguments require explicit interface
co-array pointer + communication handle
co-array reshaping supported
CAF intrinsics
Image inquiry: this_image(…), num_images()
Synchronization: sync_all, sync_team, sync_notify, sync_wait
14. Coming Attractions Allocatable co-arrays
REAL(8), ALLOCATABLE :: X(:)[:]
ALLOCATE(X(MYX_NUM)[*])
Co-arrays of user-defined types
Allocatable co-array components
user defined type with pointer components
Triplets in co-dimensions
A(j,k)[p+1:p+4]
16. A Preliminary Performance Study Platforms
Alpha+Quadrics QSNet (Elan3)
Itanium2+Quadrics QSNet II (Elan4)
Itanium2+Myrinet 2000
Codes
NAS Parallel Benchmarks (NPB) from NASA Ames
17. Alpha+Quadrics Platform (Lemieux) Nodes: 750 Compaq AlphaServer ES45, 4-way SMP
1-GHz Alpha EV6.8 (21264C), 64KB/8MB L1/L2 cache
4 GB RAM/node
Interconnect: Quadrics QSNet (Elan3)
340 MB/s peak and 210 MB/s sustained x 2 rails
Operating System: Tru64 UNIX 5.1A, SC 2.5
Compiler: HP Fortran Compiler V5.5A
Communication Middleware: ARMCI 1.1-beta
18. Itanium2+Quadrics Platform (PNNL) Nodes: 944 HP Long’s Peak dual-CPU workstations
1.5 GHz Itanium2, 32KB/256KB/6MB L1/L2/L3 cache
6GB RAM/node
Interconnect: Quadrics QSNet II
905 MB/s
Operating System: Red Hat Linux, 2.4.20
Compiler: Intel Fortran Compiler v7.1
Communication Middleware: ARMCI 1.1-beta
19. Itanium2+Myrinet Platform (Rice) Nodes: 96 HP zx6000 dual-CPU workstations
900 MHz Itanium2, 32KB/256KB/1.5MB L1/L2/L3 cache
4GB RAM/node
Interconnect: Myrinet 2000
240 MB/s
GM version 1.6.5
MPICH-GM version 1.2.5
Operating System: Red Hat Linux, 2.4.18 + patches
Compiler: Intel Fortran Compiler v7.1
Communication Middleware: ARMCI 1.1-beta
20. NAS Parallel Benchmarks (NPB) 2.3 Benchmarks by NASA Ames
2-3K lines each (Fortran 77)
Widely used to test parallel compiler performance
NAS versions:
NPB2.3b2 : Hand-coded MPI
NPB2.3-serial : Serial code extracted from MPI version
Our version
NPB2.3-CAF: CAF implementation, based on the MPI version
preserves the MPI parallelization
21. NAS BT Block tridiagonal solve of 3D Navier-Stokes
Dense matrix
Parallelization:
alternating line sweeps along 3 dimensions
multipartitioning data distribution for full parallelism
MPI implementation
asynchronous send/receive
communication/computation overlap
CAF communication
strided blocks transferred using vector PUTs (triplet notation)
no user-declared communication buffers
Large messages, relatively infrequent communication
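For illustration only (the array name, bounds, and the successor variable below are made up, not the actual BT source), a face transfer in this style is a single co-array assignment that the translator turns into one strided vector PUT:

integer, parameter :: nx = 32, ny = 32, nz = 32   ! illustrative sizes
real(8) :: rhs(5, nx, ny, nz)[*]                  ! solver co-array, one slab per image
integer :: successor                              ! next image along the current sweep
! triplet notation selects a strided block; the translator issues one vector PUT,
! writing directly into the neighbor's co-array (no user-declared buffers)
rhs(1:5, 1:nx, 1:ny, 1)[successor] = rhs(1:5, 1:nx, 1:ny, nz)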
22. NAS BT Efficiency (Class C)
23. NAS SP Scalar pentadiagonal solve of 3D Navier-Stokes
Dense matrix
Parallelization:
alternating line sweeps along 3 dimensions
multipartitioning data distribution for full parallelism
MPI implementation
asynchronous send/receive
communication/computation overlap
CAF communication
pack into buffer; separate buffer for each plane of sweep
transfer using PUTs
smaller, more frequent messages; 1.5x the communication of BT
24. NAS SP Efficiency (Class C)
25. NAS MG 3D Multigrid solver with periodic boundary conditions
Dense matrix
Grid size and levels are compile time constants
Communication
nearest neighbor with possibly 6 neighbors
MPI asynchronous send/receive
CAF
pairwise sync_notify/sync_wait to coordinate with neighbors
four communication buffers (co-arrays) used: 2 sender, 2 receiver
pack and transfer contiguous data using PUTs
for each dimension (sketched in code at the end of this slide):
notify my neighbors that my buffers are free
wait for my neighbors to notify me their buffers are free
PUT data into right buffer, notify neighbor
PUT data into left buffer, notify neighbor
wait for both to complete
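A minimal sketch of that protocol for one dimension, assuming hypothetical names (buf_out, buf_in, nbr_lo, nbr_hi, nbuf) rather than the benchmark's actual identifiers, and writing the deck's notify/wait primitives as sync_notify/sync_wait:

integer, parameter :: nbuf = 1024                     ! illustrative buffer size
integer :: nbr_lo, nbr_hi                             ! neighbor images in this dimension
real(8) :: buf_out(nbuf, 2)[*], buf_in(nbuf, 2)[*]    ! 2 sender + 2 receiver co-array buffers
! tell both neighbors that my receive buffers are free
call sync_notify(nbr_lo)
call sync_notify(nbr_hi)
! wait until both neighbors say their receive buffers are free
call sync_wait(nbr_lo)
call sync_wait(nbr_hi)
! pack faces into buf_out (not shown), then PUT into each neighbor's receive buffer
buf_in(1:nbuf, 1)[nbr_hi] = buf_out(1:nbuf, 1)        ! data moving "up"
call sync_notify(nbr_hi)
buf_in(1:nbuf, 2)[nbr_lo] = buf_out(1:nbuf, 2)        ! data moving "down"
call sync_notify(nbr_lo)
! wait for the matching notifies before unpacking from buf_in
call sync_wait(nbr_lo)
call sync_wait(nbr_hi)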
26. NAS MG Efficiency (Class C)
27. NAS LU Solve 3D Navier-Stokes using SSOR
Dense matrix
Parallelization on a power-of-two number of processors
repeated decompositions in x and y until all processors are assigned
wavefront parallelism; small messages, 5 words each
MPI implementation
asynchronous send/receive
communication/computation overlap
CAF
two dimensional co-arrays
morphed code to pack data for higher communication efficiency
uses PUTs
28. NAS LU Efficiency (Class C)
29. NAS CG Conjugate gradient solve to compute an eigenvector of a large, sparse, symmetric, positive definite matrix
MPI
Irregular point-to-point messaging
CAF: structure follows MPI
Irregular notify/wait
vector assignments for data transfer
No communication/computation overlap for either
30. NAS CG Efficiency (Class C)
Conjugate gradient solver based on sparse-matrix vector multiply
CAF implementation
Converted two-sided MPI communication into vectorized one-sided communication and calls to notify/wait
31. CAF GET vs. PUT Communication Definitions
GET: q_caf(n1:n2) = w(m1:m2)[reduce_exch_proc_noncaf(i)]
PUT: q_caf(n1:n2)[reduce_exch_proc_noncaf(i)] = w(m1:m2)
Study
64 procs, NAS CG class C
Alpha+Quadrics Elan3 (Lemieux)
Performance
GET: 12.9% slower than MPI
PUT: 4.0% slower than MPI
32. Experiments Summary On cluster-based architectures, to achieve best performance with CAF, a user or compiler must
vectorize (and perhaps aggregate) communication (see the sketch at the end of this slide)
reduce synchronization strength
replace all-to-all with point-to-point where sensible
overlap communication with computation
convert GETs into PUTs where GET is not a hardware primitive
consider memory layout conflicts: co-array vs. regular data
generate code amenable for back-end compiler optimizations
CAF language: many optimizations possible at the source level
Compiler optimizations are NECESSARY for a portable coding style
might need user hints where synchronization analysis falls short
Runtime issues
on Myrinet pin co-array memory for direct transfers
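To make the first point (communication vectorization) concrete, a minimal sketch with made-up array names:

! fine-grain: one small remote access per iteration
do i = 1, n
  x(i) = a(i)[p]
end do
! vectorized: a single strided GET for the whole section
x(1:n) = a(1:n)[p]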
33. CAF Language Refinement Issues Initial implementations on Cray T3E and X1 led to features not suited for distributed memory platforms
Key problems and solution suggestions
Restrictive memory fence semantics for procedure calls
pragmas to enable programmer to overlap one-sided communication with procedure calls
Overly restrictive synchronization primitives
add unidirectional, point-to-point synchronization
rework team model (next slide)
No collective operations
Leads to home-brew non-portable implementations
add CAF intrinsics for reductions, broadcast, etc.
34. CAF Language Refinement Issues CAF dynamic teams don't scale
pre-arranged “communicator-like” teams
would help collectives: O(log P) rather than O(P²)
reordering logical numbering of images for topology
add shape information to image teams?
Blocking communication reduces scalability
user mechanisms to delay completion to enable overlap?
Synchronization is not paired with data movement
synchronization hint tags to help analysis
synchronization tags at run-time to track completion?
How relaxed should the memory model be for performance?
35. Conclusions Tuned CAF performance is comparable to tuned MPI
even without compiler-based communication optimizations!
CAF programming model enables source-level optimization
communication vectorization
synchronization strength reduction
achieve performance today rather than waiting for tomorrow’s compilers
CAF is amenable to compiler analysis and optimization
significant communication optimization is feasible, unlike for MPI
optimizing compilers will help a wider range of programs achieve high performance
applications can be tailored to fully exploit architectural characteristics
e.g., shared memory vs. distributed memory vs. hybrid
However, more abstract programming models would simplify code development (e.g. HPF)
36. Project URL
http://www.hipersoft.rice.edu/caf
41. Parallel Programming Models Goals:
Expressiveness
Ease of use
Performance
Portability
Current models:
OpenMP: difficult to map onto distributed-memory platforms
HPF: difficult to obtain high performance on a broad range of programs
MPI: de facto standard; hard to program, and assumptions about communication granularity are hard-coded
UPC: global address space language; similar to CAF but with location transparency
42. Finite Element Example (Numrich) subroutine assemble(start, prin, ghost, neib, x)
integer :: start(:), prin(:), ghost(:), neib(:), k1, k2, p
real :: x(:) [*]
call sync_all(neib)
do p = 1, size(neib) ! Add contributions from neighbors
k1 = start(p); k2 = start(p+1)-1
x(prin(k1:k2)) = x(prin(k1:k2)) + x(ghost(k1:k2)) [neib(p)]
enddo
call sync_all(neib)
do p = 1, size(neib) ! Update the neighbors
k1 = start(p); k2 = start(p+1)-1
x(ghost(k1:k2)) [neib(p)] = x(prin(k1:k2))
enddo
call sync_all
end subroutine assemble
This slide presents a gather-scatter operation; this code is part of an irregular application. What I want to show is that it can be expressed compactly in Co-array Fortran (as you can see, it all fits on one slide); that is an example of the language's expressiveness.
43. Communicating Private Data Example
REAL :: A(100,100)[*], B(100)
A(:,j)[p] = B(:)
Issue
B is a local array
B is sent to a partner
Will require a copy into shared space before transfer
For higher efficiency want B in shared storage
Alternatives
Declare communicated arrays as co-arrays
Add a communicated attribute to B’s declaration
mark it for allocation in shared storage
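A sketch of the first alternative, reusing the shapes from the example above:

REAL :: A(100,100)[*]
REAL :: B(100)[*]        ! B now lives in shared/registered storage on every image
A(:,j)[p] = B(:)         ! the PUT can transfer B directly, with no staging copy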
44. Passing Co-arrays as Arguments Language restriction: pass co-arrays by whole array
REAL :: A(100,100)[*]
CALL FOO(A)
Callee must declare an explicit subroutine interface (see the sketch at the end of this slide)
Proposed option: F90 assumed shape co-array arguments
Allow passing of Fortran 90 style array sections of local co-array
REAL :: A(100,100)[*]
CALL FOO(A(1:10:2,3:25))
Callee must declare an explicit subroutine interface
If matching dummy argument is declared as a co-array, then
Must declare assumed size data dimensions
Must declare assumed size co-dimensions
Avoids copy-in, copy-out for co-array data
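For reference, a sketch of the explicit interface required in the whole-array case above (FOO and its dummy shape are made up for illustration):

INTERFACE
  SUBROUTINE FOO(X)
    REAL :: X(100,100)[*]   ! co-array dummy argument
  END SUBROUTINE FOO
END INTERFACE
REAL :: A(100,100)[*]
CALL FOO(A)                 ! whole co-array passed; no copy-in/copy-out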
45. Co-array Fortran (CAF) Explicitly-parallel extension of Fortran 90/95
defined by Numrich & Reid
Global address space SPMD parallel programming model
one-sided communication
Simple, two-level memory model for locality management
local vs. remote memory
Programmer control over performance critical decisions
data partitioning
communication
Suitable for mapping to a range of parallel architectures
shared memory, message passing, hybrid, PIM
The Co-array Fortran (CAF) language is an explicitly parallel extension of Fortran 90/95 developed by Numrich & Reid; it proposes a global address space SPMD parallel programming model with one-sided communication. CAF uses a simple two-level memory model that supports locality management; namely, it distinguishes between local and remote memory. In CAF the programmer has control over decisions such as data partitioning and communication. One of the goals of Co-array Fortran is portable performance; the language is suitable for a wide range of parallel architectures, such as shared memory, message passing, clusters of SMPs, and PIM. It belongs to the same language family as Unified Parallel C (UPC) and Titanium.
46. CAF Programming Model Features SPMD process images
fixed number of images during execution
images operate asynchronously
Both private and shared data
real :: x(20,20) declares a private 20x20 array in each image
real :: y(20,20)[*] declares a shared 20x20 array in each image
Simple one-sided shared-memory communication
x(:,j:j+2) = y(:,p:p+2)[r] copies columns p:p+2 of y on image r into local columns j:j+2 of x
Synchronization intrinsic functions
sync_all – a barrier and a memory fence
sync_mem – a memory fence
sync_team([notify], [wait])
notify = a vector of process ids to signal
wait = a vector of process ids to wait for, a subset of notify
Pointers and (perhaps asymmetric) dynamic allocation
Parallel I/O
A running CAF program consists of a fixed number of process images that operate asynchronously. The locality of data is made explicit by the language: x is declared as a private 20x20 array in each image, while y, by the use of the brackets, is defined as a shared 20x20 array in each image. Communication is realized by means of the co-arrays, using the bracket notation to specify remote image numbers.