Loop descriptors and cross-loop techniques: portability, locality and more István Z. Reguly, PPCU ITK reguly.istvan@itk.ppke.hu Joint work with Gihan Mudalige (Warwick), Mike Giles (Oxford), Paul Kelly (Imperial), Rolls-Royce, Univs of Southampton and Bristol, UCL, and many more
Outline • Loop descriptors • Per-loop techniques • Data layout • Auto-parallelisation: shared + distributed memory • Race conditions • Cross-loop techniques • Analysing loop chains • Locality and load balancing • Resiliency and more... Dagstuhl Seminar: Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
Parallel loop formulation • We all know the parallel for-each loops – STL, Kokkos, etc. • Built around the idea of separation of concerns • Semantics: the order of execution doesn’t matter, plus extras: reduction, scan • Some form of iteration space • Various abstractions extend this with memory spaces (GPUs), execution policies (scheduling), layout policies (AoS-SoA) • Plus more complex loop constructs, such as nested parallelism, task DAGs • The user can take on more responsibility and specify lower-level implementation details • Almost all of these target shared-memory parallelism only • MPI+X, but MPI is used through an abstraction too! Why not combine them? • What more can we do?
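The core contract of these abstractions can be sketched in a few lines: the user supplies only an iteration space and a per-element body, and the framework is free to run the body in any order. This is a minimal illustrative sketch (the names `parallel_for_each` and `parallel_reduce` are made up for this example, not any real library's API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for_each(iteration_space, body):
    # The semantic contract: the order of execution is unspecified,
    # so the runtime may schedule iterations however it likes.
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, iteration_space))

def parallel_reduce(iteration_space, body, init=0):
    # A reduction "extra": per-element results are combined with an
    # associative operator, here addition.
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(body, iteration_space), init)

out = [0] * 8
parallel_for_each(range(8), lambda i: out.__setitem__(i, i * i))
total = parallel_reduce(range(8), lambda i: i * i)
```

Kokkos's `parallel_for`/`parallel_reduce` and the STL's parallel algorithms follow the same shape, layering memory spaces and execution policies on top.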
Loop descriptors • Add further information (access-execute abstraction) • How data is accessed: Read/Write/Increment • Access pattern: • Structured meshes: n-point stencil (OPS) • Unstructured meshes: indirection array: A[map[idx]] (OP2)
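An access-execute loop descriptor in the spirit of OP2 can be mocked up as follows. The names `Arg`, `par_loop`, and the access constants are hypothetical for this sketch (the real OP2 API differs); what matters is that each argument declares how it is accessed and through which indirection map:

```python
READ, WRITE, INC = "read", "write", "inc"

class Arg:
    """Descriptor: which array, how it is accessed, and via which map."""
    def __init__(self, dat, access, mapping=None):
        self.dat, self.access, self.map = dat, access, mapping

def par_loop(kernel, iter_set_size, *args):
    # The runtime sees the full access pattern -- direct (map is None)
    # or indirect (A[map[idx]]) -- before the kernel ever runs.
    for idx in range(iter_set_size):
        views = []
        for a in args:
            j = a.map[idx] if a.map is not None else idx
            views.append((a.dat, j, a.access))
        kernel(idx, views)

# Example: loop over "edges" scattering increments into "vertices".
edge_val = [1.0, 2.0, 3.0, 4.0]
vert_acc = [0.0, 0.0, 0.0]
edge_to_vert = [0, 1, 1, 2]

def kernel(idx, views):
    (ev, i, _), (va, j, _) = views
    va[j] += ev[i]  # the INC descriptor marks this as a scatter increment

par_loop(kernel, 4, Arg(edge_val, READ), Arg(vert_acc, INC, edge_to_vert))
```

Because the descriptor, not the kernel body, carries the access information, the runtime can later reason about conflicts, halos, and data movement.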
User Contract • OPS and OP2 Domain Specific Languages • Contract with the user • For a certain region of code, data is only accessed through API calls • No “side-effects” • Implications • Can manipulate data structures (e.g. AoS-SoA) • Can take full responsibility for data movement • Including MPI partitioning and halo communication • Including separate memory spaces (GPUs) • Can delay execution – more on this later
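A sketch of why the contract matters: because data is only ever touched through the API, the library can store it as AoS or SoA and the user code cannot tell the difference. This is an illustrative mock, not the OPS/OP2 implementation:

```python
class Dataset:
    """Stores 2D points either as AoS (x0 y0 x1 y1 ...) or SoA (x0 x1 ... y0 y1 ...)."""
    def __init__(self, values, dim, layout="aos"):
        self.dim, self.layout = dim, layout
        if layout == "aos":
            self.buf = [v for elem in values for v in elem]
        else:
            n = len(values)
            self.buf = [values[i][d] for d in range(dim) for i in range(n)]
        self.n = len(values)

    def get(self, i, d):
        # The ONLY access path -- the physical layout is hidden behind it,
        # so the library is free to pick whichever suits the architecture.
        if self.layout == "aos":
            return self.buf[i * self.dim + d]
        return self.buf[d * self.n + i]

pts = [(1.0, 2.0), (3.0, 4.0)]
aos = Dataset(pts, 2, "aos")
soa = Dataset(pts, 2, "soa")
```

The same indirection is what lets the library transparently move data between host and GPU memory, or partition it across MPI ranks.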
Auto-parallelisation • The usual auto-parallelisation • For-each, reduction • Fewer implementation details left to the user • Sophisticated orchestration of parallelism • Common operation in unstructured meshes: indirect scatter increment • Atomics, colouring → two-level colouring • Improved data re-use • 2-3x faster than “global” colouring • Execution “plans” • Computed once, reused
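The colouring idea behind race-free scatter increments can be shown in a minimal greedy sketch: edges that touch a common vertex get different colours, so all edges of one colour can be incremented in parallel without atomics. (Illustrative only; OP2 actually uses a two-level block/thread colouring for data re-use.)

```python
def colour_edges(edge_to_verts):
    """Greedy colouring: no two edges of the same colour share a vertex."""
    colours = []
    for verts in edge_to_verts:
        # Colours already used by earlier edges that conflict with this one.
        used = {colours[e] for e, vs in enumerate(edge_to_verts[:len(colours)])
                if set(vs) & set(verts)}
        c = 0
        while c in used:
            c += 1
        colours.append(c)
    return colours

# A 4-cycle of edges over vertices 0..3: alternating colours suffice.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cols = colour_edges(edges)
```

The resulting colour classes are exactly the execution "plan" that can be computed once and reused across iterations of the same loop.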
Auto-parallelisation • Matching performance with hand-coded implementations • [Charts: RR Hydra – Sandy Bridge + K20; CloverLeaf – Sandy Bridge + K20]
Pennycook’s Performance Portability Metric • H is a set of platforms • a is the application • p is the parameters for a • e is the performance efficiency measure • [Chart: TeaLeaf 2D] S.J. Pennycook, J.D. Sewall, V.W. Lee, Implications of a metric for performance portability, Future Generation Computer Systems, 2017, doi: 10.1016/j.future.2017.08.007
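The metric itself is the harmonic mean of the per-platform efficiencies, and zero if the application fails to run anywhere in H: PP(a, p, H) = |H| / Σᵢ 1/eᵢ(a, p) if a runs on every i ∈ H, else 0. A direct transcription:

```python
def performance_portability(efficiencies):
    """Pennycook et al.'s PP metric over a set of platforms H.

    efficiencies: one e_i(a, p) per platform in H, with None (or 0)
    marking a platform the application does not run on.
    """
    if any(e is None or e == 0 for e in efficiencies):
        return 0.0  # not portable: fails on at least one platform in H
    # Harmonic mean: a single poor platform drags the score down hard.
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

pp = performance_portability([0.8, 0.5, 1.0])   # 3 / (1.25 + 2 + 1)
```

Using the harmonic mean (rather than the arithmetic mean) is deliberate: one platform at 10% efficiency caps the score near 10·|H|% no matter how good the others are.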
Cross-loop techniques • Current cross-loop and inter-procedural optimisation techniques are limited • Side-effects, branching, language limitations, different compilation units, etc. • Not enough information known at compile time • Run-time analysis and optimisations instead • Already established in several fields – e.g. task DAGs • With loop descriptors & the user contract we can delay the execution of these loops • Up until an API call that returns data to the user • Now we have a sequence of loops at run-time that we can analyse together • Bulk Synchronous Parallel programming → task graph of arbitrary granularity
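The delayed-execution mechanism can be sketched as a lazy runtime: `par_loop` calls only enqueue descriptors, and the whole chain is analysed and flushed when an API call hands data back to the user. (Illustrative sketch, not the OPS/OP2 code; `LazyRuntime` and `fetch` are names invented for this example.)

```python
class LazyRuntime:
    def __init__(self):
        self.queue = []

    def par_loop(self, name, fn):
        # Record the loop; do NOT execute it yet.
        self.queue.append((name, fn))

    def fetch(self, data):
        # The user asks for data -> now we hold a whole loop chain and can
        # analyse it together (tiling, comms avoidance, ...) before running.
        chain = [name for name, _ in self.queue]
        for _, fn in self.queue:
            fn()
        self.queue.clear()
        return data, chain

rt = LazyRuntime()
x = [1, 2, 3]
rt.par_loop("scale", lambda: [x.__setitem__(i, 2 * x[i]) for i in range(3)])
rt.par_loop("shift", lambda: [x.__setitem__(i, x[i] + 1) for i in range(3)])
result, chain = rt.fetch(x)
```

The user contract is what makes this safe: since no data escapes between the `par_loop` calls, deferring them is unobservable.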
Cross-loop techniques • This is a hugely powerful tool • Determine application behaviour ahead of time • Schedule computations in a very flexible way: inspector-executor scheme • Some examples: • CloverLeaf CFD mini-app: 150-600 loops • OpenSBLI large scale CFD research code: 30-200 loops • Rolls-Royce Hydra CFD code: 20-50 loops • What can we do with it? • Some of the things we have done, without any changes to user code: • Cache-blocking tiling • Communication avoidance • Automated checkpointing
Cache-blocking tiling • OPS – Structured meshes • Cannot be meaningfully tiled by polyhedral compilers • Skewed tiles, intra-tile parallelisation – 1.7-3.5x speedup • Redundant compute over MPI – speedups hold up to 128 nodes • OP2 – Unstructured meshes – Fabio Luporini, Paul Kelly, and others • 1.3x on a large seismic application • [Charts: CloverLeaf 3D problem scaling vs. number of nodes – Intel Xeon E5-2697 v4 (CINECA Marconi), Intel Xeon Phi x200 7210 (KNL), and NVIDIA P100 (x86 + P100 and Power8 + P100)]
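A 1D toy model of skewed tiling over a chain of two stencil sweeps: each tile runs both sweeps before moving on, with the second sweep's index range shifted left by one so that every value it reads has already been produced. This is a simplified sketch of the idea behind the OPS scheme (which skews in time over many loops and parallelises within each tile), not its actual implementation:

```python
def sweep(src, dst, lo, hi):
    # 3-point average over [lo, hi), clipped to the interior.
    for i in range(max(lo, 1), min(hi, len(src) - 1)):
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0

def two_sweeps_tiled(a, tile):
    n = len(a)
    mid = a[:]  # boundary values carried through unchanged
    out = a[:]
    for lo in range(0, n, tile):
        sweep(a, mid, lo, lo + tile)            # sweep 1 on [lo, lo+tile)
        sweep(mid, out, lo - 1, lo + tile - 1)  # sweep 2, skewed left by 1:
        # every mid[i-1..i+1] it needs was produced by this or an earlier tile,
        # and both sweeps run while the tile is still hot in cache.
    return out

def two_sweeps_untiled(a):
    n = len(a)
    mid, out = a[:], a[:]
    sweep(a, mid, 0, n)
    sweep(mid, out, 0, n)
    return out

data = [float(i * i % 7) for i in range(16)]
match = two_sweeps_tiled(data, 4) == two_sweeps_untiled(data)
```

The payoff is locality: `mid` is consumed tile by tile while still in cache, instead of streaming the whole array through memory twice.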
Checkpointing • Execution is a series of transactions • To create a checkpoint: • Anything read → saved • Anything (over)written → not saved • But at any given loop, only a few datasets are touched → keep going and decide save/skip for each remaining dataset when it is first seen at a later loop • Automatic fast-forward after re-start • Many options for how to save data • Parallel I/O • Each process writes its own file • In-memory checkpoints with redundancy • Local file system with redundancy, or parallel file system
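The save/skip rule can be stated compactly: walking the loop chain forward from the checkpoint trigger, a dataset must be saved the first time it is read (including increments, which read), and can be skipped entirely if its first touch is a full overwrite. A sketch of that logic (illustrative of the idea, not the OPS/OP2 implementation):

```python
def datasets_to_save(loop_chain):
    """loop_chain: per loop, a list of (dataset_name, access) pairs,
    access in {"read", "write", "inc"}. Returns the datasets that must
    be written into the checkpoint."""
    saved, dead = set(), set()
    for loop in loop_chain:
        for name, access in loop:
            if name in saved or name in dead:
                continue  # decision already made at an earlier loop
            if access == "write":
                dead.add(name)   # fully overwritten before any read -> skip
            else:
                saved.add(name)  # read (or inc, which reads) -> must save
    return saved

chain = [[("u", "read"), ("r", "write")],   # r is overwritten first -> skip
         [("r", "read"), ("u", "inc")],     # both already decided
         [("tmp", "write")]]                # tmp never read -> skip
need = datasets_to_save(chain)
```

After restart, the same chain is replayed ("fast-forward") from the saved datasets, since everything skipped is recomputed along the way.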
Challenges • Cost of conversion • To get the most benefit, the bulk of the code has to be converted • It can be done incrementally • Simpler than CUDA but more difficult than OpenMP/OpenACC • Could it be better automated? • Boilerplate code • How to do it? Currently source-to-source with Python scripts • Use compilers and do even more optimisations, or go with C++ templates? • Long-term maintenance • Academic funding... but there are some good models out there
Summary • Loop descriptors + user contract • More complex parallelisation and execution schemes • Complete management of data structures, layouts, movement • Easily fine-tuned for different target architectures • Integration with distributed-memory comms/scheduling • A lot of potential in cross-loop techniques • Suitable for large and complex codes that don’t work with compiler approaches • The tiling work itself could go a lot further • What more can we do with it?
Backup slides
Domain Specific Active Libraries • User application → Domain Specific API (is the abstraction general enough?) → source-to-source translation + back-end library → target-specific high-performance app (does it deliver performance?) • Targets: CPUs (AVX, SSE, OpenMP), GPUs (CUDA, OpenCL, OpenACC), supercomputers (MPI)
OPS and OP2 abstractions • OPS (structured): • Blocks • Stencils • Datasets on blocks (coordinates, flow variables) • Computations: parallel loop over a block accessing data through stencils, describing the type of access • OP2 (unstructured): • Sets (vertices, edges) • Mappings (edges to vertices) • Data on sets (coordinates, flow variables) • Computations: parallel loop over a set accessing data through at most one level of indirection, describing the type of access
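The structured-mesh (OPS-style) side can be sketched the same way as the unstructured one: the stencil declares the relative offsets a kernel may touch, so the runtime knows each argument's access footprint. The names `ops_par_loop`, `S2D_3PT`, and `heat_kernel` are hypothetical for this sketch, not the real OPS API:

```python
def ops_par_loop(kernel, ranges, *args):
    # Iterate over the declared index range of a block; the runtime could
    # equally tile, reorder, or distribute this, since the stencil bounds
    # every kernel's data footprint.
    (x0, x1) = ranges
    for i in range(x0, x1):
        kernel(i, *args)

S2D_3PT = (-1, 0, 1)  # a declared 3-point stencil: offsets the kernel may read

def heat_kernel(i, u, u_new):
    # Reads u only through the declared stencil offsets; writes u_new at 0.
    u_new[i] = sum(w * u[i + o] for w, o in zip((0.25, 0.5, 0.25), S2D_3PT))

u = [0.0, 0.0, 4.0, 0.0, 0.0]
u_new = u[:]
ops_par_loop(heat_kernel, (1, 4), u, u_new)
```

Knowing the stencil (here ±1) is what lets the library compute exact halo depths for MPI and exact skew factors for tiling.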
Platform-specific optimisations – Unstructured • [Charts: Airfoil – Sandy Bridge, KNC; Hydra – auto-vectorisation]