
Parallelization Strategies



  1. Parallelization Strategies Laxmikant Kale

  2. Overview • OpenMP Strategies • Need for adaptive strategies • Object migration based dynamic load balancing • Minimal modification strategies • Thread-based techniques: ROCFLO, .. • Some future plans

  3. OpenMP • Motivation: • Shared memory model often easy to program • Incremental optimization possible
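For instance, incremental optimization might look like the following minimal sketch: a serial loop gains parallelism through a single directive, with the rest of the code untouched. The array and stencil here are placeholders, not ROCFLO code.

```cpp
#include <vector>

void relax(std::vector<double>& out, const std::vector<double>& in) {
    const int n = (int)in.size();
    // The only change from the serial version is this one directive;
    // iterations are independent, so they can be divided among threads.
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
}
```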

  4. ROCFLO via OpenMP • Parallelization of ROCFLO using a loop-parallel paradigm via OpenMP • Poor speedup compared with MPI version • Was locality the culprit? • Study conducted by Jay Hoeflinger • In collaboration with Fady Najjar

  5. ROCFLO with MPI

  6. The Methodology • Do OpenMP/MPI comparison experiments. • Write an OpenMP version of ROCFLO: • Start with the MPI version of ROCFLO • Duplicate the structure of the MPI code exactly (including message passing calls). • This removes locality as a problem. • Measure performance. • If any parts do not scale well, determine why.
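A sketch of what "duplicating the MPI structure" can mean under OpenMP: threads exchange boundary data through shared mailboxes, with a barrier standing in for message delivery. The mailbox layout and neighbor pattern are illustrative assumptions, not the study's actual code.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int nthreads = omp_get_max_threads();
    std::vector<double> mailbox(nthreads, 0.0);   // one slot per thread

    #pragma omp parallel num_threads(nthreads)
    {
        int me    = omp_get_thread_num();
        int right = (me + 1) % nthreads;

        double boundary = 1.0 * me;     // stand-in for a halo value
        mailbox[right] = boundary;      // "MPI_Send" to the right neighbor
        #pragma omp barrier             // stands in for message delivery
        double ghost = mailbox[me];     // "MPI_Recv" from the left neighbor

        #pragma omp critical
        std::printf("thread %d received %g from its left neighbor\n",
                    me, ghost);
    }
    return 0;
}
```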

  7. Can't blame locality

  8. Barrier Cost: MPI vs OpenMP (Origin 2000)
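A minimal microbenchmark in the spirit of this comparison, timing repeated MPI_Barrier calls against repeated OpenMP barriers in one hybrid program; the iteration count is arbitrary and results vary by platform.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    const int iters = 10000;
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Average cost of an MPI barrier across ranks.
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) MPI_Barrier(MPI_COMM_WORLD);
    double mpi_us = (MPI_Wtime() - t0) / iters * 1e6;

    // Average cost of an OpenMP barrier across threads on this rank.
    double omp_us = 0.0;
    #pragma omp parallel
    {
        double t1 = omp_get_wtime();
        for (int i = 0; i < iters; ++i) {
            #pragma omp barrier
        }
        #pragma omp single
        omp_us = (omp_get_wtime() - t1) / iters * 1e6;
    }

    if (rank == 0)
        std::printf("MPI barrier: %.2f us, OpenMP barrier: %.2f us\n",
                    mpi_us, omp_us);
    MPI_Finalize();
    return 0;
}
```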

  9. So locality was not the whole problem! • The other problems turned out to be: • I/O, which doesn't scale • ALLOCATE, which doesn't scale • our non-scaling reduction implementation • our first-cut messaging infrastructure, which could be improved • Conclusion • An efficient loop-parallel version may be feasible, avoiding ALLOCATEs and using scalable I/O
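To illustrate the reduction issue, a sketch contrasting a serializing critical-section reduction with OpenMP's reduction clause, which implementations typically combine in a scalable way; the data is a placeholder.

```cpp
#include <vector>

// Naive version: a critical section serializes every single update.
double sum_naive(const std::vector<double>& a) {
    double s = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < (int)a.size(); ++i) {
        #pragma omp critical
        s += a[i];
    }
    return s;
}

// Scalable version: each thread accumulates a private partial sum and
// the runtime combines the partials at the end of the loop.
double sum_scalable(const std::vector<double>& a) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < (int)a.size(); ++i)
        s += a[i];
    return s;
}
```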

  10. Need for adaptive strategies • Computation structure changes over time: • Combustion • Adaptive techniques in application codes: • Adaptive refinement in structures or even fluids • Other codes such as crack propagation • Can affect the load balance dramatically • One can go from 90% efficiency to less than 25%

  11. Multi-partition decompositions • Idea: decompose the problem into a number of partitions, • independent of the number of processors • # Partitions > # Processors • The system maps partitions to processors • The system should be able to map and re-map objects as needed
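A sketch of the idea in code: the partition count is chosen independently of the processor count, and the partition-to-processor map is runtime data that can be rewritten. All names here are illustrative, not from any particular framework.

```cpp
#include <vector>

struct Decomposition {
    int num_partitions;        // chosen independent of processor count
    std::vector<int> owner;    // owner[p] = processor holding partition p

    Decomposition(int parts, int procs)
        : num_partitions(parts), owner(parts) {
        for (int p = 0; p < parts; ++p)
            owner[p] = p % procs;   // initial round-robin placement
    }

    // The system can later re-map a partition without touching user code.
    void migrate(int partition, int new_proc) { owner[partition] = new_proc; }
};
```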

  12. Load Balancing Framework • Aimed at handling ... • Continuous (slow) load variation • Abrupt load variation (refinement) • Workstation clusters in multi-user mode • Measurement-based • Exploits temporal persistence of computation and communication structures • Very accurate (compared with estimation) • Instrumentation possible via Charm++/Converse
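A sketch of a measurement-based strategy under these assumptions: measured per-object costs from the previous phase persist, so a greedy pass can place the heaviest objects on the least-loaded processors. This is illustrative, not the framework's actual algorithm.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> greedy_remap(const std::vector<double>& measured_load,
                              int procs) {
    // Sort object ids by measured cost, heaviest first.
    std::vector<int> order(measured_load.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return measured_load[a] > measured_load[b];
    });

    // Min-heap of (current load, processor id).
    using Proc = std::pair<double, int>;
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> heap;
    for (int p = 0; p < procs; ++p) heap.push({0.0, p});

    std::vector<int> assignment(measured_load.size());
    for (int obj : order) {
        Proc least = heap.top();          // least-loaded processor so far
        heap.pop();
        assignment[obj] = least.second;
        least.first += measured_load[obj];
        heap.push(least);
    }
    return assignment;
}
```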

  13. Charm++ • A parallel C++ library • supports data-driven objects • many objects per processor, with method execution scheduled based on availability of data • system supports automatic instrumentation and object migration • Works with other paradigms: MPI, OpenMP, ..
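A minimal sketch in the style of the standard Charm++ hello-world pattern: many array elements (chares) are created independent of the processor count, and entry-method invocations are scheduled by the runtime as messages arrive. The module name `demo` and the counts are hypothetical; the interface (.ci) file that charmc translates is shown as a comment.

```cpp
// demo.ci, the interface file charmc translates into the generated
// demo.decl.h / demo.def.h headers (module name "demo" is hypothetical):
//
//   mainmodule demo {
//     mainchare Main    { entry Main(CkArgMsg*); };
//     array [1D] Worker { entry Worker(); entry void ping(); };
//   };

#include "demo.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    // Many objects per processor: 64 chares regardless of processor count.
    CProxy_Worker workers = CProxy_Worker::ckNew(64);
    workers.ping();   // broadcast; the runtime schedules each delivery
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}   // constructor used when the object migrates
  void ping() {
    CkPrintf("object %d ran on processor %d\n", thisIndex, CkMyPe());
    if (thisIndex == 0) CkExit();   // crude termination, fine for a sketch
  }
};

#include "demo.def.h"
```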

  14. Load balancing framework

  15. Load balancing demonstration • To test the abilities of the framework • A simple problem: Gauss-Jacobi iterations • Refine selected sub-domains • AppSpector: web-based tool • Submit parallel jobs • Monitor performance and application behavior • Interact with running jobs via GUI interfaces
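For reference, the core of a Gauss-Jacobi computation, sketched sequentially: each sweep replaces every interior point with the average of its neighbors until updates fall below a tolerance. Grid size, boundary values, and tolerance are arbitrary choices; in the demonstration, refining a sub-domain would increase its share of this work and perturb the load balance.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256;
    const int max_sweeps = 100000;
    std::vector<double> u(n, 0.0), v(n, 0.0);
    u[0] = v[0] = 1.0;   // fixed boundary: 1.0 on the left, 0.0 on the right

    double diff = 1.0;
    int sweeps = 0;
    while (diff > 1e-6 && sweeps < max_sweeps) {
        diff = 0.0;
        for (int i = 1; i < n - 1; ++i) {
            v[i] = 0.5 * (u[i - 1] + u[i + 1]);            // Jacobi update
            diff = std::max(diff, std::fabs(v[i] - u[i]));  // convergence test
        }
        u.swap(v);
        ++sweeps;
    }
    std::printf("stopped after %d sweeps (last update %g)\n", sweeps, diff);
    return 0;
}
```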

  16. Adaptivity with minimal modification • Current code base is parallel (MPI) • But doesn't support adaptivity directly • Rewrite the code with objects?... • Idea: support adaptivity with minimal changes to F90/MPI codes • Work by: • Milind Bhandarkar, Jay Hoeflinger, Eric de Sturler

  17. Migratable threads approach • Changes required: • Encapsulate global variables in modules • Dynamically allocatable • Intercept MPI calls • Implement them in a multithreaded layer • Run each original MPI process as a user-level thread • Migrate threads as needed by load balancing • Trickier problem than object migration
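A sketch of one interception point, using the standard PMPI profiling interface: a wrapper gains control before the real send executes, which is where a multithreaded layer could suspend and switch user-level threads. The printf stands in for that logic; thread-based systems may instead link their own MPI_* symbols in a similar spirit.

```cpp
#include <mpi.h>
#include <cstdio>

// Every MPI_Send in the unmodified application now enters this wrapper;
// PMPI_Send is the MPI library's real implementation underneath.
extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm) {
    std::printf("intercepted MPI_Send to rank %d\n", dest);
    // A thread-based layer would hand off to its scheduler here,
    // possibly suspending this user-level thread until delivery.
    return PMPI_Send(buf, count, type, dest, tag, comm);
}
```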

  18. Progress: • Test Fortran 90/C++ interface • Encapsulation feasibility: • Thread migration mechanics • ROCFLO study: • Test code implementation • ROCFLO implementation

  19. Another approach to adaptivity • Cleanly separate parallel and sequential code: • All parallel code in C++ • All application code in Fortran 90 sequential subroutines • Needs more restructuring of application codes • But is feasible, especially for new codes • Much easier to migrate • Improves modularity
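A sketch of this split, assuming a hypothetical Fortran routine `update_block`: the C++ layer owns all parallel structure, while Fortran only ever sees one sequential block at a time. The trailing-underscore linkage is a common but compiler-dependent Fortran naming convention, and Fortran receives every argument by reference.

```cpp
#include <vector>

// The sequential Fortran 90 side (compiled separately), schematically:
//   subroutine update_block(u, n)
//     integer :: n
//     real(8) :: u(n)
//     ...                    ! purely sequential numerics
//   end subroutine
extern "C" void update_block_(double* u, int* n);

// All partitioning, migration, and load balancing live in the C++ layer;
// the Fortran routine is called once per block with no parallel constructs.
void advance_all_blocks(std::vector<std::vector<double>>& blocks) {
    for (auto& b : blocks) {
        int n = (int)b.size();
        update_block_(b.data(), &n);
    }
}
```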
