
Loop descriptors and cross-loop techniques: portability, locality and more



  1. Loop descriptors and cross-loop techniques: portability, locality and more
  István Z. Reguly, PPCU ITK, reguly.istvan@itk.ppke.hu
  Joint work with Gihan Mudalige (Warwick), Mike Giles (Oxford), Paul Kelly (Imperial), Rolls-Royce, Univs of Southampton and Bristol, UCL, and many more

  2. Outline
  • Loop descriptors
  • Per-loop techniques
    • Data layout
    • Auto-parallelisation: shared + distributed memory
    • Race conditions
  • Cross-loop techniques
    • Analysing loop chains
    • Locality and load balancing
    • Resiliency and more...
  Dagstuhl Seminar: Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions

  3. Parallel loop formulation
  • We all know the parallel for-each loops – STL, Kokkos, etc.
  • Built around the idea of separation of concerns
  • Semantics: the order of execution doesn’t matter, plus extras: reduction, scan
  • Some form of iteration space
  • Various abstractions extend this with memory spaces (GPUs), execution policies (scheduling), layout policies (AoS-SoA)
  • Plus more complex loop constructs, such as nested parallelism and task DAGs
  • The user can take more responsibility to specify lower-level implementation matters
  • Almost all of these target shared-memory parallelism only
    • MPI+X, but MPI is used through an abstraction too! Why not combine them?
  • What more can we do?
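The for-each-with-reduction semantics above can be sketched in a few lines; `par_foreach` and `par_reduce` are hypothetical names, not the actual STL/Kokkos API:

```python
# Minimal sketch of for-each semantics: the backend may execute iterations
# in any order, so only order-free kernels and associative reductions are legal.
def par_foreach(iter_space, kernel):
    for i in iter_space:        # a real backend may permute or parallelise this
        kernel(i)

def par_reduce(iter_space, kernel, init=0.0):
    acc = init                  # associativity lets a backend combine partial sums
    for i in iter_space:
        acc += kernel(i)
    return acc

data = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(data)
par_foreach(range(len(data)), lambda i: out.__setitem__(i, 2.0 * data[i]))
total = par_reduce(range(len(data)), lambda i: data[i])
```

Because the order of iterations is unspecified, the same user code is legal under a serial, threaded, or offloaded backend, which is exactly the separation of concerns the slide describes.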

  4. Loop descriptors
  • Add further information (access-execute abstraction)
  • How data is accessed: Read / Write / Increment
  • Access pattern:
    • Structured meshes: n-point stencil
    • Unstructured meshes: indirection array: A[map[idx]]
  [Slide shows OPS and OP2 API examples]
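A minimal sketch of a descriptor-driven loop, with hypothetical `op_arg`/`par_loop` helpers loosely modelled on this style (not the real OPS/OP2 API): each argument declares its dataset, access mode, and optional indirection map, so the runtime resolves A[map[idx]] itself:

```python
READ, INC = "read", "inc"

def op_arg(data, access, mapping=None):
    # Descriptor for one argument: dataset + access mode + optional
    # indirection map (unstructured access is A[map[idx]]).
    return {"data": data, "access": access, "map": mapping}

def par_loop(kernel, n_iter, *args):
    # Because every access is declared up front, the runtime resolves the
    # indirections itself (and, in a real system, could plan halo exchange
    # or colouring here before executing anything).
    for idx in range(n_iter):
        reads = [a["data"][a["map"][idx]] if a["map"] is not None
                 else a["data"][idx]
                 for a in args if a["access"] == READ]
        incs = kernel(idx, *reads)
        for a, d in zip([a for a in args if a["access"] == INC], incs):
            j = a["map"][idx] if a["map"] is not None else idx
            a["data"][j] += d

# Scatter edge weights onto vertices through an indirection map.
edge_to_vert = [0, 1, 1]
edge_w = [1.0, 2.0, 3.0]
vert_acc = [0.0, 0.0]
par_loop(lambda idx, w: [w], 3,
         op_arg(edge_w, READ),
         op_arg(vert_acc, INC, mapping=edge_to_vert))
```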

  5. User contract
  • OPS and OP2 Domain Specific Languages
  • Contract with the user
    • For a certain region of code, data is only accessed through API calls
    • No “side-effects”
  • Implications
    • Can manipulate data structures (e.g. AoS-SoA)
    • Can take full responsibility for data movement
      • Including MPI partitioning and halo communication
      • Including separate memory spaces (GPUs)
    • Can delay execution – later
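One consequence of the contract can be sketched directly: because elements are only reached through an access call, the library may store a logically AoS dataset in SoA form without the user noticing. `Dataset` and `get` are hypothetical names, not the real API:

```python
class Dataset:
    # Logically the user hands over AoS data [x0, y0, x1, y1, ...]; since all
    # later access goes through get(), the library is free to transpose it
    # to SoA [x0, x1, ..., y0, y1, ...] for better vectorisation on some targets.
    def __init__(self, values_aos, dim):
        n = len(values_aos) // dim
        self.dim, self.n = dim, n
        self.soa = [values_aos[e * dim + c]
                    for c in range(dim) for e in range(n)]

    def get(self, elem, comp):
        # (elem, comp) lookup hides the internal layout from the user
        return self.soa[comp * self.n + elem]

d = Dataset([1.0, 10.0, 2.0, 20.0, 3.0, 30.0], dim=2)
```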

  6. Auto-parallelisation
  • The usual auto-parallelisation
    • For-each, reduction
    • Fewer implementation details on the user
  • Sophisticated orchestration of parallelism
  • Common operation in unstructured meshes: indirect scatter increment
    • Atomics, colouring → two-level colouring
      • Improved data re-use
      • 2-3x faster than “global” colouring
  • Execution “plans”
    • Computed once, reused
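The race on indirect scatter increments can be removed by colouring; a minimal greedy sketch (hypothetical, single-level rather than the two-level scheme above): no two edges of the same colour share a vertex, so each colour's iterations can run in parallel and colours execute one after another.

```python
def colour_edges(edge_to_verts):
    # Greedy colouring: give each edge the smallest colour not used by any
    # earlier edge it shares a vertex with. All edges of one colour can then
    # apply their increments in parallel without a race.
    colours = []
    for verts in edge_to_verts:
        used = {colours[j]
                for j, other in enumerate(edge_to_verts[:len(colours)])
                if set(other) & set(verts)}
        c = 0
        while c in used:
            c += 1
        colours.append(c)
    return colours

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle: 2 colours suffice
colours = colour_edges(edges)
```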

  7. Auto-parallelisation
  • Matching performance with hand-coded implementations
  [Charts: RR Hydra – Sandy Bridge + K20; CloverLeaf – Sandy Bridge + K20]

  8. Pennycook’s performance portability metric
  • PP(a, p, H) = |H| / Σ_{i∈H} 1/e_i(a, p) if a is supported on every platform in H, and 0 otherwise
  • H is a set of platforms
  • a is the application
  • p is the parameters for a
  • e_i is the performance efficiency measure on platform i
  S.J. Pennycook, J.D. Sewall, V.W. Lee, Implications of a metric for performance portability, Future Generation Computer Systems, 2017, doi: 10.1016/j.future.2017.08.007
  [Chart: TeaLeaf 2D]
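The metric is the harmonic mean of the per-platform performance efficiencies, or zero if the application is unsupported on any platform in H; a direct transcription:

```python
def performance_portability(efficiencies):
    # Pennycook et al.: harmonic mean of e_i(a, p) over the platform set H.
    # An unsupported platform (efficiency 0) makes the whole score 0.
    if any(e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

pp = performance_portability([0.5, 1.0])   # harmonic mean of 50% and 100%
```

The harmonic mean is deliberately pessimistic: one poor platform drags the score down far more than an arithmetic mean would.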

  9. Cross-loop techniques
  • Current cross-loop and inter-procedural optimisation techniques are limited
    • Side-effects, branching, language limitations, different compilation units, etc.
    • Not enough information known at compile time
  • Run-time analysis and optimisations instead
    • Already established in several fields – e.g. task DAGs
  • With loop descriptors & the user contract we can delay the execution of these loops
    • Up until an API call that returns data to the user
  • Now we have a sequence of loops at run-time that we can analyse together
  • Bulk Synchronous Programming → task graph of arbitrary granularity
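Delayed execution can be sketched as a queue that only flushes when data is handed back to the user; `LazyRuntime` is a hypothetical stand-in for the OPS/OP2 machinery:

```python
class LazyRuntime:
    # Loops are queued, not executed. Only an API call that returns data to
    # the user forces a flush, so whole chains of loops can be analysed
    # together (tiled, fused, checkpointed...) before anything runs.
    def __init__(self):
        self.queue = []

    def par_loop(self, fn):
        self.queue.append(fn)       # record the loop, do not run it

    def fetch(self, data):
        for fn in self.queue:       # a real runtime would analyse the chain here
            fn()
        self.queue.clear()
        return list(data)           # data is now guaranteed up to date

rt = LazyRuntime()
x = [1, 2, 3]
rt.par_loop(lambda: [x.__setitem__(i, x[i] + 1) for i in range(3)])
rt.par_loop(lambda: [x.__setitem__(i, x[i] * 2) for i in range(3)])
queued = len(rt.queue)              # nothing has executed yet
result = rt.fetch(x)                # returning data forces the flush
```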

  10. Cross-loop techniques
  • This is a hugely powerful tool
    • Determine application behaviour ahead of time
    • Schedule computations in a very flexible way: inspector-executor scheme
  • Some examples:
    • CloverLeaf CFD mini-app: 150-600 loops
    • OpenSBLI large-scale CFD research code: 30-200 loops
    • Rolls-Royce Hydra CFD code: 20-50 loops
  • What can we do with it? Some of the things we have done, without any changes to user code:
    • Cache-blocking tiling
    • Communication avoidance
    • Automated checkpointing

  11. Cache-blocking tiling
  • OPS – structured meshes
    • Cannot be meaningfully tiled with polyhedral compilers
    • Skewed tiles, intra-tile parallelisation – 1.7-3.5x speedup
    • Redundant compute over MPI – speedups hold up to 128 nodes
  • OP2 – unstructured meshes – Fabio Luporini, Paul Kelly, and others
    • 1.3x on a large seismic application
  [Charts: CloverLeaf 3D problem scaling vs. number of nodes – Intel Xeon E5-2697 v4 (CINECA Marconi), Intel Xeon Phi x200 7210 (KNL), x86 + NVIDIA P100 and POWER8 + P100]
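The redundant-compute tiling idea can be illustrated on a 1D three-point stencil run for several time steps: each tile loads a base interval plus `steps` halo points per side and advances all time levels locally, reproducing the untiled result while keeping the working set cache-sized. This is an illustrative sketch under those assumptions, not the actual OPS scheme:

```python
def naive(a, steps):
    # Reference: sweep the whole array once per time step (endpoints fixed).
    n = len(a)
    for _ in range(steps):
        a = [a[i] if i in (0, n - 1) else (a[i-1] + a[i] + a[i+1]) / 3.0
             for i in range(n)]
    return a

def tiled(a, steps, tile):
    # Each tile copies its interval plus `steps` halo points per side, then
    # advances all `steps` levels on that small, cache-resident working set.
    # Halo points go stale by one position per step, so a halo of width
    # `steps` keeps the tile's own interval exact: redundant compute at the
    # edges buys locality across the loop chain.
    n = len(a)
    out = list(a)
    for start in range(0, n, tile):
        lo, hi = max(0, start - steps), min(n, start + tile + steps)
        local = list(a[lo:hi])
        m = len(local)
        for _ in range(steps):
            local = [local[j] if lo + j in (0, n - 1) or j in (0, m - 1)
                     else (local[j-1] + local[j] + local[j+1]) / 3.0
                     for j in range(m)]
        out[start:start + tile] = local[start - lo:start - lo + tile]
    return out

a = [0.0] * 8
a[3] = 9.0                      # a single spike to diffuse
ref = naive(a, 2)
til = tiled(a, 2, 3)            # tiles of 3 points, 2 time steps deep
```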

  12. Checkpointing
  • Execution is a series of transactions
  • To create a checkpoint:
    • Anything read → saved
    • Anything (over)written → not saved
    • But at any given loop, only a few datasets are touched → keep going and save / not save unseen datasets at later loops
  • Automatic fast-forward after re-start
  • Many options in how to save data
    • Parallel I/O
    • Each process writes its own file
    • In-memory checkpoints with redundancy
    • Local file system with redundancy, or parallel file system
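The save/not-save decision can be sketched as a scan over the loops following the checkpoint: a dataset is saved only if its first touch after the checkpoint is a read; a dataset that is written first will be regenerated anyway. `plan_checkpoint` and the access-list encoding are hypothetical:

```python
def plan_checkpoint(loop_chain):
    # loop_chain: per loop after the checkpoint, a list of (dataset, mode)
    # pairs in access order. A dataset is decided at its first touch:
    # first read -> must be saved; first (over)write -> need not be saved.
    # Undecided datasets are simply resolved by later loops as we keep going.
    to_save, decided = [], set()
    for accesses in loop_chain:
        for name, mode in accesses:
            if name in decided:
                continue
            decided.add(name)
            if mode == "read":
                to_save.append(name)
    return to_save

chain = [[("u", "read"), ("tmp", "write")],
         [("tmp", "read"), ("v", "write")],
         [("v", "read"), ("u", "write")]]
saved = plan_checkpoint(chain)   # only "u" is read before being overwritten
```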

  13. Challenges
  • Cost of conversion
    • To get the most benefit, the bulk of the code has to be converted
    • It can be done incrementally
    • Simpler than CUDA but more difficult than OpenMP/OpenACC
    • Could it be better automated?
  • Boilerplate code
    • How to do it? Currently source-to-source with Python scripts
    • Use compilers and do even more optimisations, or go with C++ templates?
  • Long-term maintenance
    • Academic funding... but there are some good models out there

  14. Summary
  • Loop descriptors + user contract
    • More complex parallelisation and execution schemes
    • Complete management of data structures, layouts, movement
    • Easily fine-tuned for different target architectures
    • Integration with distributed-memory comms/scheduling
  • A lot of potential in cross-loop techniques
    • Suitable for large and complex codes that don’t work with compiler approaches
    • Tiling itself could go a lot further
    • What more can we do with it?

  15. Backup slides

  16. Domain Specific Active Libraries
  • User application → Domain Specific API (is the abstraction general enough?) → source-to-source translation → back-end library (does it deliver performance?) → target-specific high-performance app
  • Targets: CPUs (AVX, SSE, OpenMP), GPUs (CUDA, OpenCL, OpenACC), supercomputers (MPI)

  17. OPS and OP2 abstractions
  • OPS (structured):
    • Blocks
    • Stencils
    • Datasets on blocks (coordinates, flow variables)
    • Computations: parallel loop over a block accessing data through stencils, describing type of access
  • OP2 (unstructured):
    • Sets (vertices, edges)
    • Mappings (edges to vertices)
    • Data on sets (coordinates, flow variables)
    • Computations: parallel loop over a set accessing data through at most one level of indirection, describing type of access

  18. Platform-specific optimisations – unstructured
  [Charts: Airfoil – Sandy Bridge, KNC; Hydra – auto-vectorisation]
