Basic Charm++ and Load Balancing

Gengbin Zheng, charm.cs.uiuc.edu, 10/11/2005

Charm++ Basics

Charm++
• Parallel library for object-oriented C++ applications
• Invoke functions remotely
• Messaging via remote method calls (like CORBA)
  • Communication “proxy” objects
• Methods called by scheduler
  • System determines who runs next
• Multiple objects per processor
• Object migration fully supported
  • Even with broadcasts, reductions

Virtualized Programming Model
• User writes code in terms of communicating objects
• System maps objects to processors
[Figure: the user view of communicating objects next to the system implementation that places them on processors]

Chares – Concurrent Objects
• Can be dynamically created on any available processor
• Can be accessed from remote processors
• Send messages to each other asynchronously
• Contain “entry methods”

Charm++ Features: Object Arrays
• Applications are written as a set of communicating objects
[Figure: user’s view of the array elements A[0], A[1], A[2], A[3], A[n]]

Charm++ Features: Object Arrays
• Charm++ maps those objects onto processors, routing messages as needed
[Figure: user’s view of A[0], A[1], A[2], A[3], A[n]; system view showing A[0] and A[3] placed on a processor]

Charm++ Features: Object Arrays
• Charm++ can re-map (migrate) objects for communication, load balance, fault tolerance, etc.
[Figure: user’s view of A[0], A[1], A[2], A[3], A[n]; system view showing A[0] and A[3] after migration]

Charm++ Array Definition
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    class foo : public CBase_foo {
    public:
      // Remote calls
      foo(int problemNo) { ... }
      void bar(int x) { ... }
      // Migration support:
      foo(CkMigrateMessage *m) {}
      void pup(PUP::er &p) {...}
    };

Charm++ Remote Method Calls
• To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file
Interface (.ci) file:
    array [1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    };
In a .C file:
    CProxy_foo someFoo=...;   // CProxy_foo is the generated class
    someFoo[i].bar(17);       // i'th object, then method and parameters
• This results in a network message, and eventually in a call to the real object’s method
In another .C file:
    void foo::bar(int x) { ... }
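The proxy idea on this slide can be mimicked in plain C++ without Charm++. The sketch below is only an illustration under stated assumptions: Foo, ToyScheduler, and ToyArrayProxy are hypothetical names, there is no network, and a single in-process queue stands in for the runtime. It shows the one property the slide emphasizes: calling a method through the proxy only enqueues a message, and the real method runs later when the scheduler delivers it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an array element; not a real chare.
struct Foo {
    int lastX = 0;
    void bar(int x) { lastX = x; }  // plays the role of an entry method
};

// Toy scheduler: stores pending invocations and delivers them in FIFO order.
class ToyScheduler {
    std::deque<std::function<void()>> queue_;
public:
    void enqueue(std::function<void()> msg) { queue_.push_back(std::move(msg)); }
    std::size_t pending() const { return queue_.size(); }
    void run() {
        while (!queue_.empty()) {
            std::function<void()> msg = std::move(queue_.front());
            queue_.pop_front();
            msg();  // the "entry method" executes only here
        }
    }
};

// Toy element proxy: bar() builds a message instead of calling Foo::bar directly.
class ToyElementProxy {
    Foo* target_;
    ToyScheduler* sched_;
public:
    ToyElementProxy(Foo* target, ToyScheduler* sched) : target_(target), sched_(sched) {}
    void bar(int x) {
        Foo* target = target_;
        sched_->enqueue([target, x] { target->bar(x); });
    }
};

// Toy array proxy: indexing selects the element, as in someFoo[i].bar(17).
class ToyArrayProxy {
    std::vector<Foo>& elems_;
    ToyScheduler& sched_;
public:
    ToyArrayProxy(std::vector<Foo>& elems, ToyScheduler& sched)
        : elems_(elems), sched_(sched) {}
    ToyElementProxy operator[](std::size_t i) {
        assert(i < elems_.size());
        return ToyElementProxy(&elems_[i], &sched_);
    }
};
```

With a ToyArrayProxy named someFoo, the call someFoo[2].bar(17) returns immediately and leaves element 2 unchanged until ToyScheduler::run() is called. The real runtime additionally locates the element, which may live on another processor or have migrated.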

Charm++ Startup Process: Main
Interface (.ci) file:
    module myModule {
      array [1D] foo {
        entry foo(int problemNo);
        entry void bar(int x);
      };
      mainchare myMain {   // special startup object
        entry myMain(int argc,char **argv);
      };
    };
In a .C file:
    #include "myModule.decl.h"
    class myMain : public CBase_myMain {   // CBase_myMain is the generated class
    public:
      // Called at startup on PE 0
      myMain(int argc,char **argv) {
        int nElements=7, i=nElements/2;
        CProxy_foo f=CProxy_foo::ckNew(2,nElements);
        f[i].bar(3);
      }
    };
    #include "myModule.def.h"

“Hello World!”
.ci file (generates hello.decl.h and hello.def.h):
    mainmodule hello {
      mainchare mymain {
        entry mymain(CkArgMsg *m);
      };
    };
.C file:
    #include "hello.decl.h"
    class mymain : public CBase_mymain {
    public:
      mymain(CkArgMsg *m) {
        ckout << "Hello World" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"

Compile and run the program
Compiling
• charmc <options> <source file>
• Options: -o, -g, -language, -module, -tracemode
Example makefile rule:
    pgm: pgm.ci pgm.h pgm.C
        charmc pgm.ci
        charmc pgm.C
        charmc -o pgm pgm.o -language charm++
Running
• To run a Charm++ program named “pgm” on four processors, type: charmrun pgm +p4 <params>
Nodelist file (for network architecture)
• List of machines to run the program
• host <hostname> <qualifiers>
Example nodelist file:
    group main ++shell ssh
    host Host1
    host Host2

Charm++: Portability
• Runs on:
  • Any machine with MPI, including
    • IBM SP, Blue Gene/L
    • Cray XT3
    • Origin2000
  • PSC’s Lemieux (Quadrics Elan)
  • Clusters with Ethernet (UDP/TCP)
  • Clusters with Myrinet (GM)
  • Clusters with Ammasso cards
  • Apple clusters
  • Even Windows!
• SMP-aware (pthreads)

Build Charm++
• Download from website
  • http://charm.cs.uiuc.edu/download.html
• Build Charm++
  • ./build <target> <version> <options> [compile flags]
  • ./build charm++ net-linux gm -g
  • Parallel make (-j2)
• Compile code using charmc
  • Portable compiler wrapper
  • Link with “-language charm++”
• Run code using charmrun

How Charmrun Works
• Command: charmrun +p4 ./pgm
[Figure: charmrun connects to each node over ssh, and each node sends an acknowledgement back to charmrun]

Charmrun (batch mode)
• Option: charmrun ++batch 8
[Figure: charmrun makes its ssh connections and collects acknowledgements in batches]

Debugging Charm++ Applications
• Printf
• Gdb
  • Sequentially (standalone mode)
    • gdb ./pgm +vp16
  • Run debugger in xterm
    • charmrun +p4 pgm ++debug
    • charmrun +p4 pgm ++debug-no-pause
• Memory paranoid
• Parallel debugger

Charm++ Features

Message Driven Execution
• Virtualization leads to message-driven execution
[Figure: two processors, each with a scheduler that picks the next message from its message queue]

Prioritized Messages
• Number of priority bits passed during message allocation
    FooMsg *msg = new (size, nbits) FooMsg;
• Priorities stored at the end of messages
• Signed integer priorities:
    *CkPriorityPtr(msg)=-1;
    CkSetQueueing(msg, CK_QUEUEING_IFIFO);
• Unsigned bitvector priorities:
    CkPriorityPtr(msg)[0]=0x7fffffff;
    CkSetQueueing(msg, CK_QUEUEING_BFIFO);
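To make the effect of integer priorities with FIFO tie-breaking concrete, here is a small standalone C++ sketch. It is not the Charm++ scheduler: ToyMsg and ToyPrioQueue are hypothetical names, and the sketch assumes that a numerically smaller priority value is served first and that messages of equal priority leave in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message record: priority, arrival sequence number, payload.
struct ToyMsg {
    int priority;
    std::uint64_t seq;
    std::string payload;
};

// Returns true when a should be served after b.
struct ServedLater {
    bool operator()(const ToyMsg& a, const ToyMsg& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;  // smaller value first (assumption)
        return a.seq > b.seq;                                          // FIFO among equal priorities
    }
};

class ToyPrioQueue {
    std::priority_queue<ToyMsg, std::vector<ToyMsg>, ServedLater> heap_;
    std::uint64_t nextSeq_ = 0;
public:
    void push(int priority, std::string payload) {
        heap_.push(ToyMsg{priority, nextSeq_++, std::move(payload)});
    }
    bool empty() const { return heap_.empty(); }
    // Removes and returns the payload of the next message to be served.
    std::string pop() {
        assert(!heap_.empty());
        std::string payload = heap_.top().payload;
        heap_.pop();
        return payload;
    }
};
```

Pushing messages with priorities 0, -1, 0 in that order yields the -1 message first, followed by the two priority-0 messages in the order they were pushed.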

Advanced Message Features
• Expedited messages
  • Messages do not go through the Charm++ scheduler (faster)
  • Top priority messages
• Immediate messages
  • Entries are executed in an interrupt or in the communication thread
  • Very fast, but tough to get right

Object Migration

How to Migrate a Virtual Processor?
• Move all application state to new processor
• Stack data (threads)
  • Subroutine variables and calls
  • Managed by compiler
• Heap data
  • Allocated with malloc/free
  • Managed by user
• Global variables
• Open files, environment variables, etc. (not handled yet!)

Migration Solutions
• Stack data (threads)
  • Automatic: isomalloc stacks
• Heap data
  • Use “-memory isomalloc” -or-
  • Write pup routines
• Global variables
  • Use “-swapglobals”
    • Works on ELF platforms (Linux and Sun)
    • Just a pointer swap, no data copying
  • -or- remove globals entirely

Migrate Heap Data: PUP
• Packing/unpacking user-allocated data
• Basic contract: here is my data
  • Sizing: counts up data size
  • Packing: copies data into message
  • Unpacking: copies data back out
• Same call works for network, memory, disk I/O ...

Migrate Heap Data: PUP C++ Example
    #include "pup.h"
    #include "pup_stl.h"
    class myMesh {
      std::vector<float> nodes;
      std::vector<int> elts;
    public:
      ...
      void pup(PUP::er &p) {
        p|nodes;
        p|elts;
      }
    };
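The “one routine, three uses” contract can be illustrated without the Charm++ headers. The sketch below is a toy under stated assumptions: ToyPupper and MyMesh are hypothetical stand-ins, not PUP::er or the real pup_stl.h support, and only vectors of trivially copyable element types are handled. The same pup() routine is driven in sizing, packing, and unpacking mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Toy stand-in for a pupper; one object works in exactly one mode.
class ToyPupper {
public:
    enum class Mode { Sizing, Packing, Unpacking };
    explicit ToyPupper(Mode m, std::vector<char>* buf = nullptr) : mode_(m), buf_(buf) {
        assert(m == Mode::Sizing || buf != nullptr);
    }
    bool isUnpacking() const { return mode_ == Mode::Unpacking; }
    std::size_t size() const { return pos_; }  // bytes counted or processed so far

    // Count, copy in, or copy out n raw bytes depending on the mode.
    void bytes(void* data, std::size_t n) {
        if (mode_ == Mode::Packing) {
            const char* src = static_cast<const char*>(data);
            buf_->insert(buf_->end(), src, src + n);
        } else if (mode_ == Mode::Unpacking) {
            assert(pos_ + n <= buf_->size());
            std::memcpy(data, buf_->data() + pos_, n);
        }
        pos_ += n;
    }

    // A vector is stored as its length followed by its elements.
    template <class T>
    void vec(std::vector<T>& v) {
        static_assert(std::is_trivially_copyable<T>::value, "toy pupper copies raw bytes");
        std::size_t n = v.size();
        bytes(&n, sizeof n);
        if (isUnpacking()) v.resize(n);  // allocate before copying the data back out
        if (n > 0) bytes(v.data(), n * sizeof(T));
    }

private:
    Mode mode_;
    std::vector<char>* buf_;
    std::size_t pos_ = 0;
};

// Mimics the p|x notation used on the slide.
template <class T>
ToyPupper& operator|(ToyPupper& p, std::vector<T>& v) {
    p.vec(v);
    return p;
}

struct MyMesh {
    std::vector<float> nodes;
    std::vector<int> elts;
    void pup(ToyPupper& p) {  // same body serves sizing, packing, and unpacking
        p | nodes;
        p | elts;
    }
};
```

A typical round trip first runs pup() with a sizing pupper, then with a packing pupper writing into a byte buffer, and finally with an unpacking pupper on a fresh MyMesh, which ends up with the same contents as the original.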

Migrate Heap Data: PUP F90 Example
    TYPE myMesh
      INTEGER :: nn,ne
      REAL*4, ALLOCATABLE :: nodes(:)
      INTEGER, ALLOCATABLE :: elts(:)
    END TYPE
    SUBROUTINE pupMesh(p,mesh)
      USE MODULE ...
      INTEGER :: p
      TYPE(myMesh) :: mesh
      CALL fpup_int(p,mesh%nn)
      CALL fpup_int(p,mesh%ne)
      ! When unpacking, allocate the arrays before their contents are read back
      IF (fpup_isUnpacking(p)) THEN
        ALLOCATE(mesh%nodes(mesh%nn))
        ALLOCATE(mesh%elts(mesh%ne))
      END IF
      CALL fpup_floats(p,mesh%nodes,mesh%nn)
      CALL fpup_ints(p,mesh%elts,mesh%ne)
      IF (fpup_isDeleting(p)) CALL deleteMesh(mesh)
    END SUBROUTINE

Automatic Load Balancing

Motivation
• Irregular or dynamic applications
  • Initial static load balancing
  • Application behaviors change dynamically
  • Difficult to implement with good parallel efficiency
• Versatile, automatic load balancers
  • Application independent
  • Little or no user effort is needed for load balancing
  • Work for both Charm++ and Adaptive MPI

Using Dynamic Mapping to Processors
• Migrate objects between processors
  • Use that for dynamic (and static, initial) load balancing
• Two major approaches
  • No predictability of load patterns
    • Fully dynamic
    • Early work on State Space Search, Branch&Bound, ..
  • With certain predictability
    • Measurement-based load balancing strategy
    • CSE, molecular dynamics simulation

Applications Lacking Predictability
• Flow of tasks: the application generates a continuous flow of tasks
• The goal of the load balancing strategies is to balance these tasks across the system for fast response time and better throughput
• Tasks are assigned at creation time, with no migration afterwards

Seed Load Balancing
• Neighborhood averaging with work-stealing when idle, using immediate messages
  • Load balancing among neighboring processors
  • Load is represented by length of queue
  • Work-stealing at idle time with interrupt-based messages
  • Fast response to the request
[Figure: benchmark with 80000 objects, 10% heavy objects]
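The neighborhood-averaging part of this slide can be sketched in a few lines of standalone C++. This is an illustration only, not the Charm++ seed balancer: the ring topology, the function name neighborAverageStep, and the rule of giving only to neighbors below the local average are assumptions made for the example, and the idle-time work-stealing part is left out. Load is the queue length, as on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One round of neighborhood averaging on a ring of processors.
// load[i] is the queue length on processor i (assumed non-negative).
// All decisions use the loads seen at the start of the round.
std::vector<int> neighborAverageStep(const std::vector<int>& load) {
    const std::size_t n = load.size();
    assert(n >= 3);  // each processor needs two distinct ring neighbors
    std::vector<int> next = load;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t nbrs[2] = {(i + n - 1) % n, (i + 1) % n};
        // Local average over the processor and its two neighbors.
        const int avg = (load[nbrs[0]] + load[i] + load[nbrs[1]]) / 3;
        int excess = load[i] - avg;
        for (std::size_t nb : nbrs) {
            if (excess <= 0) break;  // nothing (left) to give away
            const int deficit = std::max(0, avg - load[nb]);
            const int give = std::min(excess, deficit);
            next[i] -= give;   // seeds leave the overloaded processor
            next[nb] += give;  // and join the neighbor's queue
            excess -= give;
        }
    }
    return next;
}
```

Starting from queue lengths {12, 0, 0, 0}, one round moves four seeds to each neighbor of processor 0 and gives {4, 4, 0, 4}; further rounds spread the work around the ring. The total number of seeds is conserved in every round.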

Link with a seed load balancer
• Use -balance <random|neighbor>
  • charmc -o pgm pgm.o -balance neighbor
• Specify topology
  • +LBTopo <ring|torus2d|…>

Principle of Persistence
• Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
  • In spite of dynamic behavior
    • Abrupt and large, but infrequent changes (e.g., AMR)
    • Slow and small changes (e.g., particle migration)
• Parallel analog of the principle of locality
  • A heuristic that holds for most CSE applications
• Run-time instrumentation is possible

Measurement Based Load Balancing
• Runtime instrumentation
  • Measures CPU load per object
  • Measures communication volume between objects
• Measurement-based load balancers
  • Use the instrumented database periodically to make new decisions
  • A load balancing strategy takes the database as input and generates a new object-to-processor mapping
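As a rough picture of what such a database holds and how a candidate mapping can be judged against it, consider the standalone C++ sketch below. ToyLoadDatabase, ToyCommEdge, processorLoads, and remoteBytes are hypothetical names invented for this example; they are not the Charm++ load balancing database API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Measured communication between two objects (hypothetical layout).
struct ToyCommEdge {
    std::size_t from;
    std::size_t to;
    long bytes;
};

// Hypothetical instrumented database: per-object CPU load plus communication volume.
struct ToyLoadDatabase {
    std::vector<double> objLoad;    // measured CPU time of each object
    std::vector<ToyCommEdge> comm;  // measured traffic between object pairs
};

// Total CPU load each processor would carry under the given object-to-processor mapping.
std::vector<double> processorLoads(const ToyLoadDatabase& db,
                                   const std::vector<int>& mapping, int numPes) {
    assert(numPes > 0);
    assert(mapping.size() == db.objLoad.size());
    std::vector<double> load(static_cast<std::size_t>(numPes), 0.0);
    for (std::size_t obj = 0; obj < mapping.size(); ++obj) {
        assert(mapping[obj] >= 0 && mapping[obj] < numPes);
        load[static_cast<std::size_t>(mapping[obj])] += db.objLoad[obj];
    }
    return load;
}

// Bytes that would cross processor boundaries under the given mapping.
long remoteBytes(const ToyLoadDatabase& db, const std::vector<int>& mapping) {
    long total = 0;
    for (const ToyCommEdge& e : db.comm) {
        assert(e.from < mapping.size() && e.to < mapping.size());
        if (mapping[e.from] != mapping[e.to]) total += e.bytes;
    }
    return total;
}
```

A strategy in this picture is any function that reads the database and returns a new mapping vector; the two helpers show the quantities such a strategy would typically try to even out (per-processor load) and keep small (remote communication).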

Load Balancing – graph partitioning
[Figure: the weighted object graph as seen by the load balancer (LB view), and the resulting mapping of objects onto Charm++ PEs]

Charm++ Load Balancer in Action
[Figure: Automatic Load Balancing in Crack Propagation]

Load Balancer Categories
Centralized
• Object load data are sent to processor 0
• Integrate to a complete object graph
• Migration decision is broadcast from processor 0
• Global barrier
Distributed
• Load balancing among neighboring processors
• Build partial object graph
• Migration decision is sent to its neighbors
• No global barrier

Main Centralized Load Balancing Strategies
• GreedyCommLB
  • A “greedy” load balancing strategy which uses the process load and communications graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
• RefineLB
  • Incremental adjustment by moving objects off overloaded processors to under-utilized processors to reach average load
• MetisLB
  • Uses the METIS graph partitioning library to partition the object-communication graph with node (object) weights and communication loads on edges
• OrbLB
  • Treats objects with spatial coordinates. It applies an orthogonal recursive bisection algorithm which attempts to provide a more balanced division of space
• Others
  • The manual discusses several other load balancers which are not used as often, but may be useful in some cases; also, more are being developed
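The greedy idea (heaviest objects first, each onto the currently least-loaded processor) fits in a short standalone C++ function. This is a simplified sketch, not the GreedyCommLB implementation: it ignores the communication graph entirely, and the name greedyMap and the tie-breaking rules (lower object index first, lower processor index first) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Returns mapping[obj] = processor chosen for object obj.
std::vector<int> greedyMap(const std::vector<double>& objLoad, int numPes) {
    assert(numPes > 0);

    // Visit objects from heaviest to lightest; stable sort keeps index order on ties.
    std::vector<std::size_t> order(objLoad.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&objLoad](std::size_t a, std::size_t b) { return objLoad[a] > objLoad[b]; });

    // Min-heap of (accumulated load, processor id): top() is the least-loaded processor.
    using PeLoad = std::pair<double, int>;
    std::priority_queue<PeLoad, std::vector<PeLoad>, std::greater<PeLoad>> pes;
    for (int pe = 0; pe < numPes; ++pe) pes.push(PeLoad(0.0, pe));

    std::vector<int> mapping(objLoad.size(), -1);
    for (std::size_t obj : order) {
        PeLoad least = pes.top();
        pes.pop();
        mapping[obj] = least.second;    // place the object on the least-loaded processor
        least.first += objLoad[obj];    // account for its measured load
        pes.push(least);
    }
    return mapping;
}
```

For object loads {5, 4, 3, 3, 2, 1} and two processors, the function returns the mapping {0, 1, 1, 0, 1, 0}, which gives each processor a total load of 9. A refinement strategy in the spirit of RefineLB would instead start from the existing mapping and move only a few objects.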

Load Balancing Strategies

Neighborhood Load Balancing Strategies
• NeighborLB
  • Each processor tries to average out its load only among its neighbors
• WSLB
  • A load balancer for timeshared workstation clusters, which can detect load changes on desktops and adjust load without interfering with others’ use of the desktop

Compiler Interface
• Link-time options
  • -module: link load balancers as modules
    • -module EveryLB
  • Link multiple modules into binary
    • -balancer GreedyCommLB -balancer RefineLB
    • -balancer ComboCentLB:GreedyLB,RefineLB

Runtime Options
• Run-time options do the same thing, but override the compile-time options
• +balancer: invoke a load balancer
• Can have multiple load balancers
  • +balancer GreedyCommLB +balancer RefineLB

When to Re-balance Load?
• Default: load balancer is periodic
  • Provide period as a runtime parameter (+LBPeriod)
• Programmer control: ReadyLoadBalance()
  • Enable load balancing at a specific point
    • Object ready to migrate
    • Re-balance if needed
  • ReadyLoadBalance() is called when your chare is ready to be load balanced; load balancing may not start right away
  • ResumeFromSync() is called when load balancing for this chare has finished
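The handshake described on this slide (every object announces that it is ready, balancing starts only later, and each object is then told to resume) can be mimicked in standalone C++. The sketch is a toy under stated assumptions: ToySyncBarrier and its members are hypothetical names, everything runs in one process, and the rule that balancing starts once all expected objects have reported is an assumption for the example, not a statement about the Charm++ runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy coordinator: collects "ready" calls, runs the balancer once, then resumes everyone.
class ToySyncBarrier {
    std::size_t expected_;
    std::size_t arrived_ = 0;
    std::function<void()> balance_;
    std::vector<std::function<void()>> resumes_;
public:
    ToySyncBarrier(std::size_t expected, std::function<void()> balance)
        : expected_(expected), balance_(std::move(balance)) {
        assert(expected_ > 0);
    }

    // Called by an object when it is ready to be balanced; resume plays the
    // role of that object's resume-from-sync callback.
    void ready(std::function<void()> resume) {
        resumes_.push_back(std::move(resume));
        if (++arrived_ < expected_) return;  // balancing does not start right away

        balance_();                          // all objects reported: re-balance if needed
        std::vector<std::function<void()>> pending;
        pending.swap(resumes_);              // reset state before resuming, so objects may sync again
        arrived_ = 0;
        for (std::function<void()>& r : pending) r();
    }
};
```

With three expected objects, the first two calls to ready() return without any effect; the third call triggers the balancer once and then invokes all three resume callbacks.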

Thank You!
Free source, binaries, manuals, and more information at: http://charm.cs.uiuc.edu/
Parallel Programming Lab at University of Illinois
