1 / 73

Advanced Charm++ Tutorial

Advanced Charm++ Tutorial. Charm Workshop Tutorial Sameer Kumar Orion Sky Lawlor charm.cs.uiuc.edu 2004/10/19. How to Become a Charm++ Hacker. Advanced Charm++ Advanced Messaging Writing system libraries Groups Delegation Communication framework Advanced load-balancing Checkpointing

adamdaniel
Download Presentation

Advanced Charm++ Tutorial

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Advanced Charm++ Tutorial Charm Workshop Tutorial Sameer Kumar Orion Sky Lawlor charm.cs.uiuc.edu 2004/10/19

  2. How to Become a Charm++ Hacker • Advanced Charm++ • Advanced Messaging • Writing system libraries • Groups • Delegation • Communication framework • Advanced load-balancing • Checkpointing • Threads • SDAG

  3. Advanced Messaging

  4. Prioritized Execution • If several messages available, Charm will process the message with highest priority • Otherwise, oldest message (FIFO) • Has no effect: • If only one message is available (common for network-bound applications!) • On outgoing messages • Very useful for speculative work, ordering timesteps, etc...

  5. Priority Classes • Charm++ scheduler has three queues: high, default, and low • As signed integer priorities: • -MAXINT Highest priority -- -1 • 0 Default priority • 1 -- +MAXINT Lowest priority • As unsigned bitvector priorities: • 0x0000 Highest priority -- 0x7FFF • 0x8000 Default priority • 0x8001 -- 0xFFFF Lowest priority

  6. Prioritized Marshalled Messages • Pass “CkEntryOptions” as last parameter • For signed integer priorities: CkEntryOptions opts; opts.setPriority(-1); fooProxy.bar(x,y,opts); • For bitvector priorities: CkEntryOptions opts; unsigned int prio[2]={0x7FFFFFFF,0xFFFFFFFF}; opts.setPriority(64,prio); fooProxy.bar(x,y,opts);

  7. Prioritized Messages • Number of priority bits passed during message allocation FooMsg * msg = new (size, nbits) FooMsg; • Priorities stored at the end of messages • Signed integer priorities: *CkPriorityPtr(msg)=-1; CkSetQueueing(m, CK_QUEUEING_IFIFO); • Unsigned bitvector priorities CkPriorityPtr(msg)[0]=0x7fffffff; CkSetQueueing(m, CK_QUEUEING_BFIFO);

  8. Advanced Message Features • Read-only messages • Entry method agrees not to modify or delete the message • Avoids message copy for broadcasts, saving time • Expedited messages • Message do not go through the charm++ scheduler (faster) • Immediate messages • Entries are executed in a interrupt or the communication thread • Very fast, but tough to get right

  9. Read-Only, Expedited, Immediate • All declared in the .ci file { .. ... entry [nokeep] void foo_readonly(Msg *); entry [expedited] void foo_exp(Msg *); entry [immediate] void foo_imm(Msg *); .. .. .. }; // Immediate messages only currently work //for NodeGroups

  10. Groups

  11. Object Groups • A collection of objects (chares) • Also called branch office chares • Exactly one representative on each processor • Ideally suited for system libraries • A single proxy for the group as a whole • Similar to arrays: • Broadcasts, reductions, indexing • But not completely like arrays: • Non-migratable; one per processor

  12. Declarations • .ci file group mygroup { entry mygroup(); //Constructor entry void foo(foomsg *); //Entry method }; • C++ file class mygroup : public Group { mygroup() {} void foo(foomsg *m) { CkPrintf(“Do Nothing”);} };

  13. Creating and Calling Groups • Creation p = CProxy_mygroup::ckNew(); • Remote invocation p.foo(msg); //broadcast p[1].foo(msg); //asynchronous invocation • Direct local access mygroup *g=p.ckLocalBranch(); g->foo(….); //local invocation • Danger: if you migrate, the group stays behind!

  14. Delegation

  15. Delegation • Enables Charm++ proxy messages to be forwarded to a delegation manager group • Delegation manager can trap calls to proxy sends and apply optimizations • Delegation manager must inherit from CkDelegateMgr • User program must to call • proxy.ckDelegate(mgrID);

  16. Delegation Interface • .ci file group MyDelegateMgr { entry MyDelegateMgr(); //Constructor }; • .h file class MyDelegateMgr : public CkDelegateMgr { MyDelegateMgr(); void ArraySend(...,int ep,void *m,const CkArrayIndexMax &idx,CkArrayID a); void ArrayBroadcast(..); void ArraySectionSend(.., CkSectionID &s); …………….. …………….. }

  17. Communication Optimization

  18. Automatic Communication Optimizations • The parallel-objects Runtime System can observe, instrument, and measure communication patterns • Communication libraries can optimize • By substituting most suitable algorithm for each operation • Learning at runtime • E.g. All to all communication • Performance depends on many runtime characteristics • Library switches between different algorithms • Communication is from/to objects, not processors • Streaming messages optimization

  19. Managing Collective Communication • Communication operation where all (or most) the processors participate • For example broadcast, barrier, all reduce, all to all communication etc • Applications: NAMD multicast, NAMD PME, CPAIMD • Issues • Performance impediment • Naïve implementations often do not scale • Synchronous implementations do not utilize the co-processor effectively

  20. All to All Communication • All processors send data to all other processors • All to all personalized communication (AAPC) • MPI_Alltoall • All to all multicast/broadcast (AAMC) • MPI_Allgather

  21. Strategies For AAPC • Short message optimizations • High software over head (α) • Message combining • Large messages • Network contention

  22. Short Message Optimizations • Direct all to all communication is α dominated • Message combining for small messages • Reduce the total number of messages • Multistage algorithm to send messages along a virtual topology • Group of messages combined and sent to an intermediate processor which then forwards them to their final destinations • AAPC strategy may send same message multiple times

  23. Virtual Topology: Mesh Phase 1: Processors send messages to row neighbors Phase 2: Processors send messages to column neighbors 2* messages instead of P-1 Organizeprocessors in a 2D (virtual) Mesh Message from (x1,y1) to (x2,y2) goes via (x1,y2)

  24. AAPC Performance

  25. Large Message Issues • Network contention • Contention free schedules • Topology specific optimizations

  26. Ring Strategy for Collective Multicast …… …….. 0 1 2 i i+1 P-1 • Performs all to all multicast by sending messages along a ring formed by the processors • Congestion free on most topologies

  27. Streaming Messages • Programs often have streams of short messages • Streaming library combines a bunch of messages and sends them off • Stripping large charm++ header • Short array message packing • Effective message performance of about 3us

  28. Using communication library • Communication optimizations embodied as strategies • EachToManyMulticastStrategy • RingMulticast • PipeBroadcast • Streaming • MeshStreaming

  29. Bracketed vs. Non-bracketed • Bracketed Strategies • Require user to give specific end points for each iteration of message sends • Endpoints declared by calling ComlibBegin() and ComlibEnd() • Examples: EachToManyMulticast • Non bracketed strategies • No such end points necessary • Examples: Streaming, PipeBroadcast

  30. Accessing the Communication Library • From mainchare::main • Creating a strategy Strategy *strat = new EachToManyMulticastStrategy(USE_MESH) Strat = new StreamingStrategy(); Strat->enableShortMessagePacking(); • Associating a proxy with a Strategy ComlibAssociateProxy(strat, myproxy); • myproxy should be passed to all array elements

  31. Sending Messages ComlibBegin(myproxy);//Bracketed Strategies for(.. ..) { .. .. myproxy.foo(msg); .. .. } ComlibEnd(); //Bracketed strategies

  32. Handling Migration • Migrating array element PUP’s the comlib associated proxy FooArray::pup(PUP::er &p) { .. .. .. p | myProxy; .. .. .. }

  33. Compiling • You must include compile time option –module commlib

  34. Advanced Load-balancersWriting a Load-balancing Strategy

  35. Advanced load balancing: Writing a new strategy • Inherit from CentralLB and implement the work(…) function class foolb : public CentralLB { public: .. .. .. void work (CentralLB::LDStats* stats, int count); .. .. .. };

  36. LB Database struct LDStats { ProcStats *procs; LDObjData* objData; LDCommData* commData; int *to_proc; //.. .. .. } //Dummy Work function which assigns all objects to //processor 0 //Don’t implement it! void fooLB::work(CentralLB::LDStats* stats,int count){ for(int count=0;count < nobjs; count++) stats.to_proc[count] = 0; }

  37. Compiling and Integration • Edit and run Makefile_lb.sh • Creates Make.lb which is included by the LDB Makefile • Run make depends to correct dependencies • Rebuild charm++

  38. Checkpoint Restart

  39. Checkpoint/Restart • Any long running application must be able to save its state • When you checkpoint an application, it uses the pup routine to store the state of all objects • State information is saved in a directory of your choosing • Restore also uses pup, so no additional application code is needed (pup is all you need)

  40. Checkpointing Job • In AMPI, use MPI_Checkpoint(<dir>); • Collective call; returns when checkpoint is complete • In Charm++, use CkCheckpoint(<dir>,<resume>); • Called on one processor; calls resume when checkpoint is complete

  41. Restart Job from Checkpoint • The charmrun option ++restart <dir> is used to restart • Number of processors need not be the same • You can also restart groups by marking them migratable and writing a PUP routine – they still will not load balance, though

  42. Threads

  43. Why use Threads? • They provide one key feature: blocking • Suspend execution (e.g., at message receive) • Do something else • Resume later (e.g., after message arrives) • Example: MPI_Recv, MPI_Wait semantics • Function call interface more convenient than message-passing • Regular call/return structure (no CkCallbacks) • Allows blocking in middle of deeply nested communication subroutine

  44. Why not use Threads? • Slower • Around 1us context-switching overhead unavoidable • Creation/deletion perhaps 10us • More complexity, more bugs • Breaks a lot of machines! (but we have workarounds) • Migration more difficult • State of thread is scattered through stack, which is maintained by compiler • By contrast, state of object is maintained by users • Thread disadvantages form the motivation to use SDAG (later)

  45. What are (Charm) Threads? • One flow of control (instruction stream) • Machine Registers & program counter • Execution stack • Like pthreads (kernel threads) • Only different: • Implemented at user level (in Converse) • Scheduled at user level; non-preemptive • Migratable between nodes

  46. How do I use Threads? • Many options: • AMPI • Always uses threads via TCharm library • Charm++ • [threaded] entry methods run in a thread • [sync] methods • Converse • C routines CthCreate/CthSuspend/CthAwaken • Everything else is built on these • Implemented using • SYSV makecontext/setcontext • POSIX setjmp/alloca/longjmp

  47. How do I use Threads (example) • Blocking API routine: find array element int requestFoo(int src) { myObject *obj=...; return obj->fooRequest(src) } • Send request and suspend int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); // -- blocks until awaken call -- return stashed_return; } • Awaken thread when data arrives void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

  48. How do I use Threads (example) • Send request, suspend, recv, awaken, return int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); return stashed_return; } void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

  49. The Horror of Thread Migration

  50. Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data • Users have no control over how stack is laid out

More Related