
Programming to PetaScale with Multicore Chips and Early Experience on Abe with Charm++

This workshop presentation covers lessons learned in parallelizing applications for petascale computing, new programming models, and ways to simplify parallel programming for better performance and productivity. It surveys enabling technologies, the benefits of virtual processors, software engineering techniques, and the collaboration between computer science research and applications, along with early experience porting Charm++ to NCSA's Abe cluster and Charm++ support for multicore systems.


Presentation Transcript


  1. Programming to PetaScale with Multicore Chips and Early Experience on Abe with Charm++. Laxmikant Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. NCSA Abe Multicore Workshop

  2. Outline • A series of lessons learned • How one should parallelize applications for the petascale • Early Experience on Abe • New Programming Models • Simplifying Parallel Programming NCSA Abe Multicore Workshop

  3. PPL Mission and Approach
  • To enhance performance and productivity in programming complex parallel applications
  • Performance: scalable to thousands of processors
  • Productivity: of human programmers
  • Complex: irregular structure, dynamic variations
  • Approach: application-oriented yet CS-centered research
  • Develop enabling technology for a wide collection of apps
  • Develop, use, and test it in the context of real applications
  • How?
  • Develop novel parallel programming techniques
  • Embody them in easy-to-use abstractions, so application scientists can use advanced techniques with ease
  • Enabling technology: reused across many apps
  NCSA Abe Multicore Workshop

  4. Migratable Objects (aka Processor Virtualization)
  • Programmer: [over]decomposition into virtual processors (VPs)
  • Runtime: assigns VPs to processors; enables adaptive runtime strategies
  • Implementations: Charm++, AMPI
  • Benefits:
  • Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules
  • Message-driven execution: adaptive overlap of communication; predictability: automatic out-of-core; asynchronous reductions
  • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change the set of processors used; automatic dynamic load balancing; communication optimization
  • (Figure: user view of migratable objects vs. system implementation.)
  NCSA Abe Multicore Workshop

  5. (Build of the previous slide.) The same list of benefits, now shown together with the system view: user-level virtual processors, for example MPI processes realized as migratable threads in AMPI, are mapped by the runtime onto the real processors (see the code sketch below). NCSA Abe Multicore Workshop
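
To make the over-decomposition idea concrete, here is a minimal Charm++ sketch. The module, class, and method names (overdecomp, Worker, doWork) are illustrative assumptions, not code from the talk: the programmer creates many more chare array elements than physical processors, and the runtime maps them to cores.

```cpp
// overdecomp.ci -- Charm++ interface file (sketch)
//   mainmodule overdecomp {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done();
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void doWork();
//     };
//   };

// overdecomp.C -- minimal sketch of over-decomposition into virtual processors
#include "overdecomp.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int numElems, count;
public:
  Main(CkArgMsg* m) : count(0) {
    delete m;
    mainProxy = thisProxy;
    // Over-decompose: create many more objects (virtual processors)
    // than physical processors; the runtime assigns them to cores
    // and may migrate them later for load balance.
    numElems = 8 * CkNumPes();
    CProxy_Worker workers = CProxy_Worker::ckNew(numElems);
    workers.doWork();
  }
  void done() { if (++count == numElems) CkExit(); }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage*) {}   // needed so elements can migrate
  void doWork() {
    // ... this object's share of the application work ...
    mainProxy.done();
  }
};

#include "overdecomp.def.h"
```

Because the number of Worker objects is chosen independently of CkNumPes(), the same decomposition runs unchanged on any processor count, which is exactly the decoupling of decomposition from physical processors that the slides describe.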

  6. Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE. (Diagram: quantum chemistry (QM/MM), protein folding, molecular dynamics, computational cosmology, crack propagation, space-time meshes, dendritic growth, and rocket simulation, all built on parallel objects, an adaptive runtime system, and libraries and tools.) NCSA Abe Multicore Workshop

  7. Application-Oriented Parallel Abstractions • Synergy between computer science research and applications has been beneficial to both. (Diagram: applications such as NAMD, LeanCP, ChaNGa, rocket simulation, space-time meshing, and others raise issues for, and reuse techniques and libraries from, Charm++.) NCSA Abe Multicore Workshop

  8. Charm++ for Multicores • Announcing “beta” release of multicore version • A specialized stand-alone version for single desktops • Also, extended support for Abe-like multicore/SMP systems • Official release in a month or so NCSA Abe Multicore Workshop

  9. Porting Charm++ to Abe • Charm++ has a machine-dependent layer, frequently called the machine layer • First port: using existing MPI layers • MPI-based layer: MPICH-VMI, [MVAPICH] • Using lower-level layers • Multiple machine layers are usable on Abe: an ibverbs layer that uses the verbs API directly, and VMI

  10. Ibverbs layer • Reliable connections among all processors • Small messages: eager protocol • Large messages: RDMA • Eager protocol: unexpected messages are the common case for a Charm++ program • We use an InfiniBand shared receive queue to post receive buffers for all processors

  11. Eager protocol (contd.) • Packet-based, since pre-posted receive buffers have to be of a fixed size • Flow control among processors: prevents one processor from flooding another, and increases resources for a processor that is sending it more messages • Has a memory pool for Charm++ messages • Short messages incur a single copy, on the receive side only; this copy is necessary to reassemble messages that span multiple packets
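
As an illustration of the eager path described above, the following sketch (not the actual Charm++ machine layer; the packet size, pool depth, and helper name are assumptions) posts fixed-size packet buffers to an InfiniBand shared receive queue so that one pool of pre-posted buffers can absorb unexpected messages from any peer.

```cpp
// Sketch: fill a shared receive queue (SRQ) with fixed-size packet buffers.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static const int PACKET_SIZE = 4096;   // assumed fixed packet size
static const int NUM_BUFFERS = 1024;   // assumed pool depth

// 'pd' is an already-created protection domain; error checks omitted.
ibv_srq* create_and_fill_srq(ibv_pd* pd) {
  ibv_srq_init_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.attr.max_wr  = NUM_BUFFERS;
  attr.attr.max_sge = 1;
  ibv_srq* srq = ibv_create_srq(pd, &attr);

  for (int i = 0; i < NUM_BUFFERS; ++i) {
    void* buf  = std::malloc(PACKET_SIZE);
    ibv_mr* mr = ibv_reg_mr(pd, buf, PACKET_SIZE, IBV_ACCESS_LOCAL_WRITE);

    ibv_sge sge;
    sge.addr   = (uintptr_t)buf;
    sge.length = PACKET_SIZE;
    sge.lkey   = mr->lkey;

    ibv_recv_wr wr, *bad = nullptr;
    std::memset(&wr, 0, sizeof(wr));
    wr.wr_id   = (uintptr_t)buf;      // identifies the buffer on completion
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_post_srq_recv(srq, &wr, &bad);  // any peer's eager packet can land here
  }
  return srq;
}
```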

  12. RDMA layer • Zero-copy messaging via RDMA • RDMA is also part of the flow control between processors • RDMA is also used for a persistent communication API (apart from regular messaging)
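
A corresponding sketch of the zero-copy large-message path (again an illustration under assumed names, not the actual layer): once the sender knows the receiver's registered buffer address and rkey, it issues an RDMA write directly from the user's buffer, with no intermediate copies.

```cpp
// Sketch: zero-copy large-message transfer via RDMA write.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

void rdma_write(ibv_qp* qp, ibv_mr* local_mr, void* local_buf, size_t len,
                uint64_t remote_addr, uint32_t remote_rkey) {
  ibv_sge sge;
  sge.addr   = (uintptr_t)local_buf;
  sge.length = (uint32_t)len;
  sge.lkey   = local_mr->lkey;

  ibv_send_wr wr, *bad = nullptr;
  std::memset(&wr, 0, sizeof(wr));
  wr.wr_id               = (uintptr_t)local_buf;
  wr.opcode              = IBV_WR_RDMA_WRITE;
  wr.send_flags          = IBV_SEND_SIGNALED;   // request a local completion
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.wr.rdma.remote_addr = remote_addr;         // exchanged earlier (e.g. over the eager path)
  wr.wr.rdma.rkey        = remote_rkey;
  ibv_post_send(qp, &wr, &bad);  // data moves straight from user buffer to remote buffer
}
```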

  13. Planned steps • Develop an SMP version of the ibverbs layer • Improve communication performance among processors within a node • A separate thread for communication (requires locking) • Reduce the memory cost of scaling to large numbers of processors

  14. Lesson 1: Choose Your Algorithms NCSA Abe Multicore Workshop

  15. Choose your algorithms carefully • Create Parallelism where there was none • Parallel Prefix (scan) operation • Degree of parallelism • More is better, usually • Overlap of phases • Modern machines make one rethink algorithms: • Operation count may be less important than memory accesses • Degree of reuse NCSA Abe Multicore Workshop
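
To illustrate the "create parallelism where there was none" point, here is a small sketch of the recursive-doubling idea behind a parallel prefix (scan). It is plain sequential C++ used only to show the data dependences: each array index stands in for a processor, and within each of the log(P) steps all updates are independent and could run in parallel.

```cpp
#include <cstdio>
#include <vector>

// Inclusive prefix sum via recursive doubling (Hillis-Steele style).
std::vector<int> inclusive_scan(std::vector<int> x) {
  const int n = (int)x.size();
  for (int stride = 1; stride < n; stride *= 2) {
    std::vector<int> prev = x;            // values from the previous step
    for (int i = stride; i < n; ++i)      // these updates are mutually independent
      x[i] = prev[i] + prev[i - stride];
  }
  return x;                               // x[i] = sum of inputs 0..i
}

int main() {
  std::vector<int> r = inclusive_scan({3, 1, 4, 1, 5, 9, 2, 6});
  for (int v : r) std::printf("%d ", v);  // prints: 3 4 8 9 14 23 25 31
  std::printf("\n");
}
```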

  16. Analyze Scalability of the Algorithm (say via the iso-efficiency metric) NCSA Abe Multicore Workshop

  17. Isoefficiency Analysis
  • An algorithm is scalable if, when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount
  • Not all algorithms are scalable in this sense
  • Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency (a worked example follows below)
  • Parallel efficiency = T1 / (P * Tp), where T1 is the time on one processor and Tp is the time on P processors
  • (Figure: equal-efficiency curves in the problem-size vs. processors plane.)
  NCSA Abe Multicore Workshop
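
A small worked example of the metric (not from the slides): adding n numbers on P processors with a local sum followed by a tree reduction, assuming a cost of c per reduction step.

```latex
% Serial time, parallel time, and efficiency for the parallel sum:
\[
  T_1 = n, \qquad
  T_P = \frac{n}{P} + c\log P, \qquad
  E = \frac{T_1}{P\,T_P} = \frac{n}{n + cP\log P}.
\]
% To hold E constant as P grows, the problem size must grow as
\[
  n = \Theta(P\log P),
\]
% so the isoefficiency function is Theta(P log P): a modest growth
% rate, hence this algorithm scales well in the sense defined above.
```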

  18. Molecular Dynamics in NAMD
  • Collection of [charged] atoms, with bonds
  • Newtonian mechanics
  • Thousands of atoms (10,000 – 5,000,000)
  • At each time-step (sketched in the code below):
  • Calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals)
  • Short-distance forces: every timestep; long-distance forces: using PME (3D FFT)
  • Multiple time stepping: PME every 4 timesteps
  • Calculate velocities and advance positions
  • Challenge: femtosecond time-step, millions of steps needed!
  • Collaboration with K. Schulten, R. Skeel, and coworkers
  NCSA Abe Multicore Workshop
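
The per-timestep structure can be summarized in a schematic sketch; the types and function names below are placeholders, not NAMD's actual code.

```cpp
#include <vector>

struct Atom { double pos[3], vel[3], force[3]; };
using System = std::vector<Atom>;

// Placeholder force/integration kernels (bodies omitted).
void computeBondedForces(System&)        { /* bonds, angles, dihedrals */ }
void computeShortRangeNonbonded(System&) { /* electrostatics + van der Waals within cutoff */ }
void computeLongRangePME(System&)        { /* long-range electrostatics via 3D FFT (PME) */ }
void integrate(System&, double)          { /* update velocities, advance positions */ }

void mdStep(System& sys, int step, double dt) {
  computeBondedForces(sys);
  computeShortRangeNonbonded(sys);  // every (femtosecond) timestep
  if (step % 4 == 0)                // multiple time stepping:
    computeLongRangePME(sys);       // PME only every 4th timestep
  integrate(sys, dt);
}

int main() {
  System sys(10000);                          // e.g. 10,000 atoms
  for (int step = 0; step < 1000; ++step)     // real runs need millions of steps
    mdStep(sys, step, 1.0e-15 /* seconds */);
}
```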

  19. Traditional Approaches (1996-2002): non-isoefficient
  • Replicated data: all atom coordinates stored on each processor; communication/computation ratio: P log P; not scalable
  • Partition the atoms array across processors: nearby atoms may not be on the same processor; C/C ratio: O(P); not scalable
  • Distribute the force matrix to processors: the matrix is sparse and non-uniform; C/C ratio: sqrt(P); not scalable
  NCSA Abe Multicore Workshop

  20. Spatial Decomposition via Charm++
  • Atoms are distributed to cubes (cells, cubes, or "patches") based on their location
  • Size of each cube: just a bit larger than the cut-off radius (see the sketch below)
  • Communicate only with neighbors; work is done for each pair of neighboring objects
  • C/C ratio: O(1)
  • However: load imbalance and limited parallelism; Charm++ is useful for handling this
  NCSA Abe Multicore Workshop
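
A minimal sketch of the binning step implied by this decomposition (illustrative, not NAMD's code): each atom's patch index comes from dividing its coordinates by a cell side chosen a bit larger than the cutoff, so all interaction partners lie in the same or a neighboring patch.

```cpp
#include <cmath>
#include <cstdio>

struct PatchIndex { int ix, iy, iz; };

// boxLo: lower corner of the simulation box; cutoff: interaction cutoff.
PatchIndex patchOf(const double pos[3], const double boxLo[3], double cutoff) {
  const double side = 1.1 * cutoff;   // "just a bit larger than the cut-off radius"
  return { (int)std::floor((pos[0] - boxLo[0]) / side),
           (int)std::floor((pos[1] - boxLo[1]) / side),
           (int)std::floor((pos[2] - boxLo[2]) / side) };
}

int main() {
  double boxLo[3] = {0, 0, 0}, pos[3] = {25.0, 3.2, 14.7};
  PatchIndex p = patchOf(pos, boxLo, 12.0 /* assumed cutoff in Angstroms */);
  std::printf("atom belongs to patch (%d, %d, %d)\n", p.ix, p.iy, p.iz);
}
```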

  21. Object Based Parallelization for MD: Force Decomposition + Spatial Decomposition • Now, we have many objects to load balance: • Each diamond can be assigned to any proc. • Number of diamonds (3D): • 14·Number of Patches • 2-away variation: • Half-size cubes • 5x5x5 interactions • 3-away interactions: 7x7x7 NCSA Abe Multicore Workshop

  22. Listen to Amdahl’s Law and Variants NCSA Abe Multicore Workshop

  23. Amdahl and variants • The original Amdahl’s law, interpreted as: • If there is an x% sequential component, the speedup can’t be more than 100/x • Variations: • If you decompose a problem into many parts, the parallel time cannot be less than the largest of the parts • If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use • … NCSA Abe Multicore Workshop
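
Stated as a formula (standard form, added here for reference): if a fraction s of the work is inherently sequential, the speedup on P processors is bounded as follows.

```latex
\[
  S(P) \;=\; \frac{1}{\,s + \dfrac{1-s}{P}\,} \;\le\; \frac{1}{s}.
\]
% With s = x/100 (an x% sequential component), the speedup can never
% exceed 100/x, which is the bound quoted on the slide.
```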

  24. Fine Grained Decomposition on BlueGene NCSA Abe Multicore Workshop

  25. Decouple decomposition from Physical Processors NCSA Abe Multicore Workshop

  26. Parallel Decomposition and Processors • MPI-style encourages • Decomposition into P pieces, where P is the number of physical processors available • If your natural decomposition is a cube, then the number of processors must be a cube • … • Charm++/AMPI style “virtual processors” • Decompose into natural objects of the application • Let the runtime map them to processors • Decouple decomposition from load balancing NCSA Abe Multicore Workshop

  27. LeanCP: Car-Parrinello ab initio MD • Collaborative project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, .. • Specific software project (LeanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali) • Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM NCSA Abe Multicore Workshop

  28. Parallelization under Charm++: NCSA Abe Multicore Workshop

  29. NCSA Abe Multicore Workshop

  30. Parallel scaling of liquid water* as a function of system size on the Blue Gene/L installation at YKT: *Liquid water has 4 states per molecule. • Weak scaling is observed! • Strong scaling on processor numbers up to ~60x the number of states! NCSA Abe Multicore Workshop

  31. Use Dynamic Load Balancing NCSA Abe Multicore Workshop

  32. Load Balancing Steps (timeline figure): regular timesteps; instrumented timesteps; detailed, aggressive load balancing; refinement load balancing NCSA Abe Multicore Workshop

  33. Load Balancing (figure: processor utilization against time on 128 and 1024 processors, marking an aggressive load balancing step and a refinement load balancing step). On 128 processors a single load balancing step suffices, but on 1024 processors we need a “refinement” step. NCSA Abe Multicore Workshop
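
For reference, here is a minimal sketch of how an application object opts in to Charm++'s measurement-based load balancing; the class and method names (Cell, step) are assumptions, the .ci declarations appear only as comments, and this is not the NAMD or ChaNGa source.

```cpp
// In the .ci file (sketch):
//   array [1D] Cell {
//     entry Cell();
//     entry void step(int iter);
//   };

class Cell : public CBase_Cell {
public:
  Cell() { usesAtSync = true; }          // enable AtSync()-based balancing
  Cell(CkMigrateMessage*) {}             // migration constructor

  void pup(PUP::er& p) {                 // pack/unpack state so the object can move
    CBase_Cell::pup(p);
    // p | myData; ...
  }

  void step(int iter) {
    // ... compute, exchange boundary data with neighboring elements ...
    if (iter % 100 == 0) AtSync();       // periodically let the runtime rebalance
    else thisProxy[thisIndex].step(iter + 1);
  }
  void ResumeFromSync() {
    // continue stepping after (possible) migration
  }
};
```

The strategy itself is chosen at job launch, matching the two-phase approach on this slide: an aggressive strategy (such as GreedyLB) early on, then a cheaper refinement strategy (such as RefineLB) for later steps.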

  34. ChaNGa: Parallel Gravity • Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington • Components: gravity, gas dynamics • Barnes-Hut tree codes • An oct-tree is the natural decomposition: the geometry has better aspect ratios, so you “open” fewer nodes • But it is typically not used because it leads to bad load balance, under the assumption of a one-to-one map between sub-trees and processors • Binary trees are considered better load balanced • With Charm++: use the oct-tree, and let Charm++ map subtrees to processors NCSA Abe Multicore Workshop

  35. NCSA Abe Multicore Workshop

  36. Load balancing with GreedyLB: dwarf 5M dataset on 1,024 BlueGene/L processors (timestep times of 5.6 s and 6.1 s shown in the figure) NCSA Abe Multicore Workshop

  37. Load balancing with OrbRefineLB: dwarf 5M dataset on 1,024 BlueGene/L processors (timestep times of 5.6 s and 5.0 s shown in the figure) NCSA Abe Multicore Workshop

  38. ChaNGa Preliminary Performance ChaNGa: Parallel Gravity Code Developed in Collaboration with Tom Quinn (Univ. Washington) using Charm++ NCSA Abe Multicore Workshop

  39. ChaNGa Preliminary Performance on Abe ChaNGa: Parallel Gravity Code Developed in Collaboration with Tom Quinn (Univ. Washington) using Charm++ NCSA Abe Multicore Workshop

  40. ChaNGa Preliminary Performance on Abe ChaNGa: Parallel Gravity Code Developed in Collaboration with Tom Quinn (Univ. Washington) using Charm++ NCSA Abe Multicore Workshop

  41. ChaNGa on Abe: Larger dataset NCSA Abe Multicore Workshop

  42. Load Balancing • Adaptive load balancing examples • 1-D elastic-plastic wave propagation: a bar is dynamically loaded, producing an elastic wave that propagates down the bar; upon reflection from the fixed end, the material becomes plastic • 3-D dynamic elastic-plastic fracture: load imbalance occurs at the onset of an element turning from elastic to plastic; the zone of plasticity forms over a limited number of processors as the crack propagates • Collaboration with Philippe Geubelle NCSA Abe Multicore Workshop

  43. Fractography on Abe Fractography: Structural dynamics, with cohesive elements Developed in Collaboration with Philippe Geubelle NCSA Abe Multicore Workshop

  44. Use Asynchronous Collectives • Barrier/reduction performance is not a problem: when you find processors waiting at a barrier, it's usually because of load imbalances • But avoiding barriers in order to overlap phases is good! NCSA Abe Multicore Workshop
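
As a concrete example of an asynchronous collective in Charm++ (illustrative names: Cell, Main, reportEnergy, and the helpers computeLocalEnergy and startNextPhase are assumptions): each array element contributes its local value and immediately proceeds with the next phase; the reduced result arrives later via a callback, so no processor sits at a barrier.

```cpp
// Chare array element: contribute to the reduction and keep going.
void Cell::endOfStep() {
  double localEnergy = computeLocalEnergy();                  // assumed helper
  CkCallback cb(CkReductionTarget(Main, reportEnergy), mainProxy);
  contribute(sizeof(double), &localEnergy, CkReduction::sum_double, cb);
  startNextPhase();                                           // overlaps with the reduction
}

// Main chare: declared in the .ci file as
//   entry [reductiontarget] void reportEnergy(double e);
void Main::reportEnergy(double totalEnergy) {
  CkPrintf("total energy = %f\n", totalEnergy);               // delivered asynchronously
}
```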

  45. NAMD Parallelization using Charm++: PME (figure showing groups of 192 + 144 VPs, 700 VPs, and 30,000 VPs). These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system. NCSA Abe Multicore Workshop

  46. Apo-A1 on BlueGene/L, 1024 processors: 94% efficiency. Shallow valleys, high peaks, nicely overlapped PME. Shown with Charm++’s “Projections” analysis tool: time intervals on the x axis, activity summed across processors on the y axis; green: communication, red: integration, blue/purple: electrostatics, orange: PME, turquoise: angle/dihedral. NCSA Abe Multicore Workshop

  47. Cray XT3, 512 processors, initial runs: 76% efficiency. Clearly needed further tuning, especially of PME, but had more potential (much faster processors). NCSA Abe Multicore Workshop

  48. On Cray XT3, 512 processors, after optimizations: 96% efficiency. NCSA Abe Multicore Workshop

  49. Abe: NAMD, Apo-A1, on 512 cores NCSA Abe Multicore Workshop

  50. Analyze Performance with Sophisticated Tools NCSA Abe Multicore Workshop
