Adaptive MPI: Intelligent runtime strategies and performance prediction via simulation

This presentation discusses the use of Adaptive MPI (AMPI) and intelligent runtime strategies to enhance the performance and productivity of parallel applications. The approach is to develop novel parallel programming techniques and embody them in easy-to-use abstractions. The enabling technology of parallel objects and intelligent runtime systems has been applied to a range of collaborative applications in computational science and engineering.

Presentation Transcript


  1. Adaptive MPI: Intelligent runtime strategies and performance prediction via simulation. Laxmikant Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. ICL@UTK

  2. PPL Mission and Approach • To enhance performance and productivity in programming complex parallel applications • Performance: scalable to thousands of processors • Productivity: of human programmers • Complex: irregular structure, dynamic variations • Approach: application-oriented yet CS-centered research • Develop enabling technology for a wide collection of apps • Develop, use, and test it in the context of real applications • How? • Develop novel parallel programming techniques • Embody them into easy-to-use abstractions • So application scientists can use advanced techniques with ease • Enabling technology: reused across many apps

  3. Develop abstractions in the context of full-scale applications • Applications: Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time Meshes, Rocket Simulation • Shared layer: Parallel Objects, Adaptive Runtime System, Libraries and Tools • The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE

  4. Migratable Objects (aka Processor Virtualization) • Programmer: [over]decomposition into virtual processors • Runtime: assigns VPs to processors, enabling adaptive runtime strategies • Implementations: Charm++, AMPI • Benefits • Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules • Message-driven execution: adaptive overlap of communication; predictability (automatic out-of-core); asynchronous reductions • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; changing the set of processors used; automatic dynamic load balancing; communication optimization • (Figure: user view of many VPs vs. the system implementation on real processors)

  5. Outline • Adaptive MPI • Load balancing • Fault tolerance • Projections: performance analysis • Performance prediction: BigSim

  6. AMPI: MPI with Virtualization • MPI “processes” are implemented as virtual processes (user-level migratable threads) mapped onto the real processors • Each virtual process is implemented as a user-level thread embedded in a Charm object
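
For concreteness, a minimal sketch of what “MPI with virtualization” means from the user's side: an ordinary MPI program, unchanged, which under AMPI can be launched with more virtual processors than physical processors (e.g. ./charmrun pgm +p4 +vp16, the option combination shown on the checkpoint slide). The program itself is illustrative, not taken from the talk.

    /* Ordinary MPI code; under AMPI each rank runs as a user-level migratable thread. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank = virtual processor id under AMPI */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* size = number of VPs, not physical PEs */
        printf("VP %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }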

  7. Making AMPI work • Multiple user-level threads per processor: problems with global variables • Solution 1: automatic: switch the GOT pointer at context switch (available on most machines) • Solution 2: manual: replace global variables • Solution 3: automatic: via compiler support (AMPIzer) • Migrating stacks • Use the isomalloc technique (Mehaut et al.) • Use memory files and mmap() • Heap data: isomalloc heaps, OR user-supplied pack/unpack functions for the heap data
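
A rough illustration of Solution 2 (manually replacing global variables); the global name and the structure below are hypothetical, not from the original applications.

    /* Before: a file-scope global would be shared by all AMPI threads on a
     * processor, so each virtual processor could corrupt the others' state. */
    /* int iteration_count; */

    /* After: former globals are packed into a per-rank structure that each
     * virtual processor owns and passes around explicitly.                  */
    typedef struct {
        int    iteration_count;
        double residual;
    } RankState;

    void do_step(RankState *st) {
        st->iteration_count++;   /* safe: every VP has its own RankState */
    }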

  8. ELF and global variables • The Executable and Linking Format (ELF) • The executable has a Global Offset Table (GOT) containing global data • The GOT pointer is stored in the %ebx register • Switch this pointer when switching between threads • Supported on Linux, Solaris 2.x, and more • Integrated into Charm++/AMPI • Invoked by the compile-time option -swapglobal

  9. Adaptive overlap and modules: SPMD and Message-Driven Modules (From A. Gursoy, Simplified expression of message-driven programs and quantification of their impact on performance, Ph.D. Thesis, Apr 1994.) Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995

  10. Benefit of Adaptive Overlap • Problem setup: 3D stencil calculation of size 240³ run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.

  11. Comparison with Native MPI • Performance: slightly worse without optimization; being improved • Flexibility: useful when only a small number of PEs is available, or when the algorithm has special requirements • Problem setup: 3D stencil calculation of size 240³ run on Lemieux. AMPI runs on any number of PEs (e.g. 19, 33, 105); native MPI needs a cube number.

  12. AMPI Extensions • Automatic load balancing • Non-blocking collectives • Checkpoint/restart mechanism • Multi-module programming

  13. Load Balancing in AMPI • Automatic load balancing: MPI_Migrate() • Collective call informing the load balancer that the thread is ready to be migrated, if needed.
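
A minimal sketch of how MPI_Migrate() typically sits in a timestep loop; the loop structure, helper names, and load-balancing period are illustrative assumptions, and the MPI_Migrate() prototype is taken to come from AMPI's mpi.h.

    #include <mpi.h>

    #define LB_PERIOD 10          /* assumed: offer to migrate every 10 steps */

    void timestep(void);          /* application work for one step (assumed) */

    void run(int nsteps) {
        for (int step = 1; step <= nsteps; step++) {
            timestep();
            if (step % LB_PERIOD == 0) {
                /* Collective call: tells the AMPI load balancer this thread is
                 * at a safe point and may be migrated to another processor.   */
                MPI_Migrate();
            }
        }
    }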

  14. Load Balancing Steps • Regular timesteps • Instrumented timesteps • Detailed, aggressive load balancing: object migration • Refinement load balancing

  15. Load Balancing • (Figure: processor utilization against time on 128 and 1024 processors, marking the aggressive and refinement load-balancing steps) • On 128 processors, a single load-balancing step suffices, but on 1024 processors we need a “refinement” step.

  16. Shrink/Expand • Problem: availability of the computing platform may change • Fit applications onto the platform by object migration • (Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600)

  17. Optimized All-to-all: “Surprise” • Completion time vs. computation overhead for a 76-byte all-to-all on Lemieux • The CPU is free during most of the time taken by a collective operation • This led to the development of asynchronous collectives, now supported in AMPI • (Figure: Radix Sort; AAPC completion time (ms) vs. message size from 100B to 8KB, for Mesh and Direct strategies)

  18. Asynchronous Collectives • Our implementation is asynchronous: the collective operation is posted, then tested/waited on for completion • Meanwhile, useful computation can utilize the CPU: MPI_Ialltoall( … , &req); /* other computation */ MPI_Wait(&req, MPI_STATUS_IGNORE);
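
A slightly fuller sketch of the overlap pattern above; the buffer sizes and the do_other_work() helper are illustrative assumptions (the MPI_Ialltoall signature shown is the standard non-blocking form, which AMPI also provides).

    #include <mpi.h>
    #include <stdlib.h>

    void do_other_work(void);     /* independent computation to overlap (assumed) */

    void exchange(int n, MPI_Comm comm) {
        int size;
        MPI_Comm_size(comm, &size);

        double *sendbuf = malloc((size_t)n * size * sizeof(double));
        double *recvbuf = malloc((size_t)n * size * sizeof(double));
        MPI_Request req;

        /* Post the collective; it progresses while we compute. */
        MPI_Ialltoall(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, comm, &req);

        do_other_work();                    /* CPU stays busy during the AAPC */

        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* block only when the data is needed */
        free(sendbuf);
        free(recvbuf);
    }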

  19. Fault Tolerance

  20. Motivation • Applications need fast, low-cost, and scalable fault-tolerance support • As machines grow in size, MTBF decreases, and applications have to tolerate faults • Our research: disk-based checkpoint/restart; in-memory double checkpointing/restart; sender-based message logging; proactive response to fault prediction (impending-fault response)

  21. Checkpoint/Restart Mechanism • Automatic checkpointing for AMPI and Charm++: migrate objects to disk! • Automatic fault detection and restart • Now available in the distribution version of AMPI and Charm++ • Blocking coordinated checkpoint: the states of chares are checkpointed to disk via the collective call MPI_Checkpoint(DIRNAME) • The entire job is restarted; virtualization allows restarting on a different # of processors • Runtime option: > ./charmrun pgm +p4 +vp16 +restart DIRNAME • Simple but effective for common cases
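
A small sketch of where the MPI_Checkpoint() collective might be placed; the directory name, checkpoint period, and helper function are assumptions, and the prototype is taken to come from AMPI's mpi.h.

    #include <mpi.h>

    void checkpoint_if_due(int step, int period) {
        if (step > 0 && step % period == 0) {
            /* Blocking, coordinated collective: the state of all chares/VPs is
             * written under the given directory. After a crash, restart with
             *     ./charmrun pgm +p4 +vp16 +restart ckpt_dir                  */
            MPI_Checkpoint("ckpt_dir");
        }
    }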

  22. In-memory Double Checkpoint • In-memory checkpoint: faster than disk • Coordinated checkpoint: a simple MPI_MemCheckpoint(void) call; the user decides what makes up useful state • Double checkpointing: each object maintains 2 checkpoints, one on the local physical processor and one on a remote “buddy” processor • For jobs with large memory: use local disks! • (Figure: 32 processors with 1.5 GB of memory each)
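
The corresponding sketch for the in-memory variant; again the period and helper name are assumptions, with the MPI_MemCheckpoint() prototype assumed to come from AMPI's mpi.h.

    #include <mpi.h>

    void mem_checkpoint_if_due(int step, int period) {
        if (step > 0 && step % period == 0) {
            /* Coordinated collective; faster than the disk-based version because
             * each object keeps one checkpoint locally and one on its "buddy". */
            MPI_MemCheckpoint();
        }
    }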

  23. Restart • A “dummy” process is created: it need not have application data or a checkpoint, but is necessary for the runtime; it starts recovery on all other processors • Other processors: remove all chares, restore checkpoints lost on the crashed PE, restore chares from local checkpoints • Load balance after restart

  24. Restart Performance • 10 crashes • 128 processors • Checkpoint every 10 time steps

  25. Scalable Fault Tolerance • Motivation: when one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints! Only the failed processor's objects recover from checkpoints, playing back their messages, while the others “continue” • How? Sender-side message logging; asynchronous checkpoints on buddy processors; latency tolerance mitigates costs; restart can be sped up by spreading out objects from the failed processor • Current progress: basic scheme implemented and tested in simple programs; general-purpose implementation in progress

  26. Recovery Performance • Execution time with an increasing number of faults on 8 processors (checkpoint period 30 s)

  27. Projections: Performance visualization and analysis tool

  28. An Introduction to Projections • Performance analysis tool for Charm++-based applications • Automatic trace instrumentation • Post-mortem visualization • Multiple advanced features that support data volume control and generation of additional user data

  29. Trace Generation • Automatic instrumentation by the runtime system • Detailed: in log mode, each event is recorded in full detail (including timestamp) in an internal buffer • Summary: reduces the size of output files and memory overhead; produces a few lines of output data per processor, recorded in bins corresponding to intervals of 1 ms by default • Flexible: APIs and runtime options for instrumenting user events and controlling data generation

  30. The Summary View • Provides a view of the overall utilization of the application. • Very quick to load.

  31. Graph View • Features: selectively view entry points; convenient means to switch between axis data types

  32. Timeline • The most detailed view in Projections. • Useful for understanding critical path issues or unusual entry point behaviors at specific times.

  33. Animations

  37. Time Profile • The Time Profile tool identified a portion of CPAIMD (a quantum chemistry code) that ran too early; solved by prioritizing entry methods

  40. Overview: one line for each processor, time on the X-axis. White: busy, black: idle, red: intermediate

  41. A boring but good-performance overview

  42. An interesting but pathetic overview

  43. Stretch Removal • Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis) • Before optimizations: over 16 large stretched calls • After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms

  44. Miscellaneous Features - Color Selection • Colors are automatically supplied by default • We allow users to select their own colors and save them • These colors can then be restored the next time Projections loads

  45. User APIs • Controlling trace generation: void traceBegin(), void traceEnd() • Tracing user events: int traceRegisterUserEvent(char *, int), void traceUserEvent(char *), void traceUserBracketEvent(int, double, double), double CmiWallTimer() • Runtime options: +traceoff, +traceroot <directory> • Projections mode only: +logsize <# entries>, +gz-trace • Summary mode only: +bincount <# of intervals>, +binsize <interval time quanta (us)>
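
A sketch of bracketing a user-level phase with these tracing APIs, using the signatures as listed on this slide; the event name, event number, and compute_phase() helper are illustrative assumptions.

    /* Declarations as listed on the slide (normally supplied by Charm++ headers). */
    int    traceRegisterUserEvent(char *name, int eventNum);
    void   traceUserBracketEvent(int eventNum, double start, double end);
    double CmiWallTimer(void);

    void compute_phase(void);        /* application work to be traced (assumed) */

    void traced_phase(void) {
        static int ev = -1;
        if (ev < 0)
            ev = traceRegisterUserEvent("compute phase", 100);  /* 100: arbitrary id */

        double t0 = CmiWallTimer();
        compute_phase();
        /* Appears as a bracketed user event in the Projections timeline. */
        traceUserBracketEvent(ev, t0, CmiWallTimer());
    }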

  46. Performance Prediction Via Parallel Simulation

  47. BigSim: Performance Prediction • Extremely large parallel machines are already here or about to be available: ASCI Purple (12k processors, 100 TF), Bluegene/L (64k, 360 TF), Bluegene/C (1M, 1 PF) • How do we write a petascale application? What will the performance be like? Would existing parallel applications scale? • The machines are not there yet, and parallel performance is hard to model without actually running the program

  48. Objectives and Simulation Model • Objectives: develop techniques to facilitate the development of efficient petascale applications, based on performance prediction of applications on large simulated parallel machines • Simulation-based performance prediction: focus on the Charm++ and AMPI programming models • Performance prediction based on PDES (parallel discrete event simulation) • Supports varying levels of fidelity: processor prediction, network prediction • Modes of execution: online and post-mortem

  49. Blue Gene Emulator/Simulator • Actually: BigSim, for simulation of any large machine using smaller parallel machines • Emulator: allows development of the programming environment and algorithms before the machine is built; allowed us to port Charm++ to the real BG/L in 1-2 days • Simulator: allows performance analysis of programs running on large machines without access to those machines; uses parallel discrete event simulation

  50. Architecture of BigNetSim
