
Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks



  1. Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks. Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu

  2. Outline • Challenges and opportunities: the character of the new machines • Charm++ and AMPI: basics, capabilities, programming techniques • Dice them fine: VPs to the rescue • Juggling for overlap • Load balancing: scenarios and strategies • Case studies: classical molecular dynamics, Car-Parrinello ab initio MD, quantum chemistry, rocket simulation • Raising the level of abstraction: higher-level compiler-supported notations, domain-specific “frameworks” • Example: unstructured mesh (FEM) framework

  3. Machines: Current, Planned, and Future • Current: Lemieux: 3000 processors, 750 nodes, full-bandwidth fat-tree network; ASCI Q: similar architecture; System X: InfiniBand; Tungsten: Myrinet; Thunder; Earth Simulator • Planned: IBM’s Blue Gene/L: 65k nodes, 3D-torus topology; Red Storm (10k procs) • Future? BG/L is an example: 1M processors, 0.5 MB per processor; the three HPCS architectural plans

  4. Some Trends: Communication • Bisection bandwidth cannot scale with the number of processors without becoming expensive • Wire-length delays: even on Lemieux, messages going through the highest-level switches take longer • Two possibilities: grid topologies with near-neighbor connections (high link speed, low bisection bandwidth), or expensive full-bandwidth networks

  5. Trends: Memory • Memory latency is roughly 100 times the processor cycle time, and this will get worse • A solution: put more processors in, to increase aggregate bandwidth between processors and memory (on-chip DRAM); in other words, a low memory-to-processor ratio • This can be handled with programming style • Application viewpoint, for physical modeling: given a fixed amount of run-time (4 hours or 10 days), doubling spatial resolution increases CPU needs more than 2-fold because of smaller time-steps (e.g., for an explicit 3D method, 2x resolution means 8x the mesh points and roughly 2x the time steps)

  6. Application Complexity Is Increasing • Why? With more FLOPS, we need better algorithms; it is not enough to just do more of the same • Example: dendritic growth in materials; better algorithms lead to complex structure • Example: gravitational force calculation; direct all-pairs is O(N²) but easy to parallelize, Barnes-Hut is O(N log N) but more complex • Multiple modules, dual time-stepping • Adaptive and dynamic refinements • Ambitious projects: projects with new objectives lead to dynamic behavior and multiple components

  7. Specific Programming Challenges • Explicit management of resources: this data on that processor, this work on that processor • Analogy: memory management; we declare arrays and malloc dynamic memory chunks as needed, but do not specify memory addresses • As usual, indirection is the key • Programmer: this data, partitioned into these pieces; this work, divided that way • System: map data and work to processors

  8. Virtualization: Object-Based Parallelization • Idea: divide the computation into a large number of objects • Let the system map objects to processors • The user is concerned only with the interaction between objects [Figure: user view vs. system implementation]

  9. Virtualization: Charm++ and AMPI • These systems seek an optimal division of labor between the “system” and the programmer: decomposition is done by the programmer, everything else is automated [Figure: which concerns (decomposition, mapping, scheduling, expression) are handled by the programmer vs. the system in MPI, Charm++, and HPF, spanning specialization to abstraction]

  10. Charm++ and Adaptive MPI • Charm++: parallel C++ with asynchronous methods and object arrays; in development for over a decade; the basis of several parallel applications; runs on all popular parallel machines and clusters • AMPI: a migration path for legacy MPI codes; gives them the dynamic load-balancing capabilities of Charm++; uses Charm++ object arrays; minimal modifications are needed to convert existing MPI programs (automated via AMPizer, a collaboration with David Padua); bindings for C, C++, and Fortran90 • Both available from http://charm.cs.uiuc.edu

  11. The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE [Figure: parallel objects, adaptive runtime system, libraries and tools at the center, surrounded by protein folding, quantum chemistry (QM/MM), molecular dynamics, computational cosmology, crack propagation, dendritic growth, space-time meshes, and rocket simulation]

  12. Message From This Talk • Virtualization, and the associated techniques we have been exploring for the past decade, is ready and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications on tomorrow’s machines • These techniques are embodied in: Charm++; AMPI; frameworks (structured grids, unstructured grids, particles); virtualization of other coordination languages (UPC, GA, ...)

  13. Acknowledgements • Graduate students including: Gengbin Zheng, Orion Lawlor, Milind Bhandarkar, Terry Wilmarth, Sameer Kumar, Jay deSouza, Chao Huang, Chee Wai Lee • Recent funding: NSF (NGS: Frederica Darema), DOE (ASCI: Rocket Center), NIH (Molecular Dynamics)

  14. Charm++: Object Arrays • A collection of data-driven objects (aka chares) • With a single global name for the collection • Each member addressed by an index • Mapping of element objects to processors is handled by the system [Figure: user’s view of array elements A[0] A[1] A[2] A[3] A[..]]

  15. Charm++: Object Arrays • A collection of chares, with a single global name for the collection, and each member addressed by an index • Mapping of element objects to processors is handled by the system [Figure: user’s view of elements A[0] A[1] A[2] A[3] A[..]; system view showing, e.g., A[0] and A[3] placed on one processor]

  16. Chare Arrays • Elements are data-driven objects • Elements are indexed by a user-defined data type-- [sparse] 1D, 2D, 3D, tree, ... • Send messages to index, receive messages at element. Reductions and broadcasts across the array • Dynamic insertion, deletion, migration-- and everything still has to work!

  17. Charm++ Remote Method Calls • To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo (a generated class) described in the interface (.ci) file: array[1D] foo { entry void foo(int problemNo); entry void bar(int x); }; • In a .C file, invoking the method (with its parameters) on the i’th object: CProxy_foo someFoo=...; someFoo[i].bar(17); • This results in a network message, and eventually in a call to the real object’s method, in another .C file: void foo::bar(int x) { ... }

  18. Charm++ Startup Process: Main • Interface (.ci) file, including the special startup object (a mainchare): module myModule { array[1D] foo { entry foo(int problemNo); entry void bar(int x); } mainchare myMain { entry myMain(int argc,char **argv); } }; • In a .C file (CBase_myMain is the generated class; the mainchare’s constructor is called at startup): #include "myModule.decl.h" class myMain : public CBase_myMain { myMain(int argc,char **argv) { int nElements=7, i=nElements/2; CProxy_foo f=CProxy_foo::ckNew(2,nElements); f[i].bar(3); } }; #include "myModule.def.h"

  19. Other Features • Broadcasts and Reductions • Runtime creation and deletion • nD and sparse array indexing • Library support (“modules”) • Groups: per-processor objects • Node Groups: per-node objects • Priorities: control ordering
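As a rough illustration of how the proxy calls, broadcasts, and reductions above fit together, here is a minimal sketch. It is not from the talk: the array name Hello, its entry method greet, and the helper doLocalWork are hypothetical, and the .ci declaration is assumed as in the startup example above.

    // Assumes a .ci declaration like:  array [1D] Hello { entry Hello(); entry void greet(int step); };
    // and inclusion of the generated .decl.h/.def.h files, as on slide 18.
    CProxy_Hello arr = CProxy_Hello::ckNew(64);   // create 64 elements; the runtime maps them to processors
    arr[3].greet(0);                              // point-to-point: message to the element with index 3
    arr.greet(1);                                 // broadcast: greet(1) is invoked on every element

    // Inside an element, contribute to a sum reduction across the whole array;
    // when every element has contributed, the callback fires (here it simply exits).
    void Hello::greet(int step) {
      double localWork = doLocalWork(step);       // hypothetical local computation
      CkCallback cb(CkCallback::ckExit);
      contribute(sizeof(double), &localWork, CkReduction::sum_double, cb);
    }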

  20. AMPI: “Adaptive” MPI • MPI interface, for C and Fortran, implemented on Charm++ • Multiple “virtual processors” per physical processor • Implemented as user-level threads • Very fast context switching-- 1us • E.g., MPI_Recv only blocks virtual processor, not physical • Supports migration (and hence load balancing) via extensions to MPI

  21. AMPI: [Figure: 7 MPI processes]

  22. AMPI: MPI “processes” implemented as virtual processors (user-level migratable threads) [Figure: 7 MPI “processes” mapped onto real processors]

  23. How to Write an AMPI Program • Write your normal MPI program, and then… • Link and run with Charm++ • Compile and link with charmc • charmc -o hello hello.c -language ampi • charmc -o hello2 hello.f90 -language ampif • Run with charmrun • charmrun hello
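For reference, the “normal MPI program” meant above can be as plain as the following sketch (the file name matches the charmc command on the slide; the printed text is illustrative):

    // hello.c -- an unmodified MPI program; under AMPI each rank becomes a virtual processor
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which virtual processor am I?
      MPI_Comm_size(MPI_COMM_WORLD, &size);   // how many virtual processors in total?
      printf("Hello from VP %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

It is built with charmc -o hello hello.c -language ampi and launched with charmrun hello, exactly as above; the next slide shows how the physical and virtual processor counts are chosen.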

  24. How to Run an AMPI program • Charmrun • A portable parallel job execution script • Specify number of physical processors: +pN • Specify number of virtual MPI processes: +vpN • Special “nodelist” file for net-* versions

  25. AMPI MPI Extensions • Process Migration • Asynchronous Collectives • Checkpoint/Restart

  26. How to Migrate a Virtual Processor? • Move all application state to new processor • Stack Data • Subroutine variables and calls • Managed by compiler • Heap Data • Allocated with malloc/free • Managed by user • Global Variables

  27. Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data

  28. Migrate Stack Data • Without compiler support, cannot change stack’s address • Because we can’t change stack’s interior pointers (return frame pointer, function arguments, etc.) • Solution: “isomalloc” addresses • Reserve address space on every processor for every thread stack • Use mmap to scatter stacks in virtual memory efficiently • Idea comes from PM2
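A tiny sketch of the address-reservation idea follows. This is not the actual isomalloc code; it only illustrates the mechanism, assuming POSIX mmap on a Linux-like 64-bit system where the agreed-upon address range is known to be free.

    #include <stddef.h>
    #include <sys/mman.h>

    // Reserve a thread's stack slot at the same, pre-agreed virtual address on every
    // processor, so interior pointers stay valid when the stack bytes are copied over.
    void *reserve_stack_slot(void *agreedAddr, size_t len) {
      void *p = mmap(agreedAddr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
    }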

  29. Migrate Stack Data [Figure: Processor A’s and Processor B’s memory, each with code, globals, heap, and thread stacks laid out at reserved virtual addresses between 0x00000000 and 0xFFFFFFFF; Thread 3’s stack migrates from A to B into the same address range]


  31. Migrate Stack Data • Isomalloc is a completely automatic solution • No changes needed in application or compilers • Just like a software shared-memory system, but with proactive paging • But has a few limitations • Depends on having large quantities of virtual address space (best on 64-bit) • 32-bit machines can only have a few gigs of isomalloc stacks across the whole machine • Depends on unportable mmap • Which addresses are safe? (We must guess!) • What about Windows? Blue Gene?

  32. Heap Data • Heap data is any dynamically allocated data • C “malloc” and “free” • C++ “new” and “delete” • F90 “ALLOCATE” and “DEALLOCATE” • Arrays and linked data structures are almost always heap data

  33. Migrate Heap Data • Automatic solution: isomalloc all heap data just like stacks! • “-memory isomalloc” link option • Overrides malloc/free • No new application code needed • Same limitations as isomalloc • Manual solution: application moves its heap data • Need to be able to size message buffer, pack data into message, and unpack on other side • “pup” abstraction does all three
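A minimal sketch of a pup routine for a class owning heap data is shown below. The class and its fields are hypothetical; PUP::er, operator|, isUnpacking, and PUParray are the Charm++ pup interface, and the same routine is used for sizing, packing, and unpacking.

    class Particles {
      int n;
      double *x;                 // heap-allocated coordinate array
    public:
      void pup(PUP::er &p) {     // one routine drives sizing, packing, and unpacking
        p | n;                   // basic types go through operator|
        if (p.isUnpacking()) x = new double[n];   // allocate heap space on the destination
        PUParray(p, x, n);       // copy the array contents into/out of the message
      }
    };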

  34. Comparison with Native MPI • Performance: slightly worse without optimization; being improved • Flexibility: AMPI runs when only a small number of PEs is available, or when the algorithm places special requirements on the process count • Problem setup: 3D stencil calculation of size 240³ run on Lemieux; AMPI runs on any number of PEs (e.g., 19, 33, 105), while native MPI needs a cube number

  35. Benefits of Virtualization • Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules • Message-driven execution: adaptive overlap of communication; modularity; predictability; automatic out-of-core execution; asynchronous reductions • Dynamic mapping: heterogeneous clusters; vacate, adjust to speed, share; automatic checkpointing; change the set of processors used • Principle of persistence: enables runtime optimizations; automatic dynamic load balancing; communication optimizations; other runtime optimizations • More info: http://charm.cs.uiuc.edu

  36. Data-Driven Execution [Figure: two processors, each with its own scheduler and message queue]

  37. Adaptive Overlap of Communication • With virtualization, you get data-driven execution • There are multiple entities (objects, threads) on each processor • No single object or thread holds up the processor • Each one is “continued” when its data arrives • No need to guess which is likely to arrive first • So: automatic and adaptive overlap of computation and communication is achieved • This kind of data-driven idea can be used in MPI as well, using wild-card receives • But as the program gets more complex, it gets harder to keep track of all pending communication in all the places that are doing a receive
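The wild-card-receive style mentioned above might look like the following sketch in plain MPI; the tags and handler functions are hypothetical application code, not part of the talk.

    #include <mpi.h>
    enum { TAG_FORCES = 1, TAG_ENERGY = 2 };            // hypothetical message tags
    void add_forces(const double *buf, int src);        // hypothetical application handlers
    void add_energy(const double *buf, int src);

    // Process whichever message arrives first, dispatching on its tag:
    // a hand-rolled, limited form of data-driven execution.
    void service_one_message(double *buf, int maxCount) {
      MPI_Status status;
      MPI_Recv(buf, maxCount, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &status);
      if (status.MPI_TAG == TAG_FORCES)      add_forces(buf, status.MPI_SOURCE);
      else if (status.MPI_TAG == TAG_ENERGY) add_energy(buf, status.MPI_SOURCE);
    }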

  38. Why Message-Driven Modules? [Figure: SPMD vs. message-driven modules. From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994.]

  39. Checkpoint/Restart • Any long running application must be able to save its state • When you checkpoint an application, it uses the pup routine to store the state of all objects • State information is saved in a directory of your choosing • Restore also uses pup, so no additional application code is needed (pup is all you need)

  40. Checkpointing Job • In AMPI, use MPI_Checkpoint(<dir>); • Collective call; returns when checkpoint is complete • In Charm++, use CkCheckpoint(<dir>,<resume>); • Called on one processor; calls resume when checkpoint is complete • Restarting: • The charmrun option ++restart <dir> is used to restart • Number of processors need not be the same
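A minimal usage sketch of the AMPI call above (the directory name, checkpoint interval, and the do_timestep helper are illustrative, not from the talk):

    #include <mpi.h>
    void do_timestep(int step);            // hypothetical per-step application work

    // Hypothetical driver loop for an AMPI application.
    void run(int nSteps) {
      for (int step = 0; step < nSteps; step++) {
        do_timestep(step);
        if (step > 0 && step % 500 == 0)
          MPI_Checkpoint("ckpt");          // AMPI extension (see slide): collective, returns when complete
      }
    }

To restart, relaunch through charmrun with ++restart ckpt; as the slide notes, the number of processors need not be the same.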

  41. AMPI’s Collective Communication Support • Collective operations are those in which all processors, or a large subset, participate (for example, broadcast) • They can be a performance impediment • All-to-all communication: all-to-all personalized communication (AAPC) and all-to-all multicast (AAM)

  42. Communication Optimization • Organize processors in a 2D (virtual) mesh • A message from (x1,y1) to (x2,y2) goes via (x1,y2) • Each processor sends about 2√P messages instead of P−1 • But each byte travels twice on the network
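A sketch of the routing rule above (illustrative code, not the library’s implementation), with processors 0..P−1 laid out row-major on a √P x √P virtual mesh:

    // For a message from src = (x1,y1) to dst = (x2,y2), the relay is (x1,y2):
    // phase 1 moves data along the source's column, phase 2 along the destination's row.
    int relay_processor(int src, int dst, int meshSide) {
      int x1 = src % meshSide;            // source's column
      int y2 = dst / meshSide;            // destination's row
      return y2 * meshSide + x1;          // the intermediate processor (x1, y2)
    }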

  43. Radix Sort [Figure: performance benchmark results. A mystery?]

  44. CPU Time vs. Elapsed Time [Figure: time breakdown of an all-to-all operation using the mesh library] • Computation is only a small proportion of the elapsed time • A number of optimization techniques have been developed to improve collective communication performance

  45. Asynchronous Collectives [Figure: time breakdown of the 2D FFT benchmark, in ms] • VPs are implemented as threads • Computation is overlapped with the waiting time of collective operations • Total completion time is reduced

  46. Shrink/Expand • Problem: the availability of the computing platform may change • The application is fitted to the platform by object migration [Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600]

  47. Projections Performance Analysis Tool

  48. Projections • Projections is designed for use with a virtualized model like Charm++ or AMPI • Instrumentation built into runtime system • Post-mortem tool with highly detailed traces as well as summary formats • Java-based visualization tool for presenting performance information

  49. Trace Generation (Detailed) • Link-time option “-tracemode projections” • In the log mode, each event is recorded in full detail (including a timestamp) in an internal buffer • The memory footprint is controlled by limiting the number of log entries; I/O perturbation can be reduced by increasing the number of log entries • Generates a <name>.<pe>.log file for each processor and a <name>.sts file for the entire application • Commonly used run-time options: +traceroot DIR, +logsize NUM
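For example, the build and run flags above might be combined as follows (the program name, processor count, and paths are illustrative):

    charmc -o jacobi jacobi.C -language charm++ -tracemode projections
    charmrun jacobi +p8 +traceroot /tmp/jacobi-traces +logsize 1000000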

  50. Visualization Main Window
