This course outline provides an introduction to parallel computing and message passing techniques, including an overview of parallel machines, programming methods and languages, and applications. It covers topics such as grid computing, paradigms for parallel programming, interprocess communication, and issues in message passing.
Course Outline
• Introduction to algorithms and applications
• Parallel machines and architectures
  • Overview of parallel machines
  • Cluster computers (Myrinet), BlueGene
• Programming methods, languages, and environments
  • Message passing (SR, MPI, Java)
  • Higher-level languages: Linda, HPF
• Applications
  • N-body problems, search algorithms
• Grid computing
Approaches to Parallel Programming
• Sequential language + library
  • MPI, PVM
• Extend a sequential language
  • C/Linda, Concurrent C++
• New languages designed for parallel or distributed programming
  • SR, occam, Ada, Orca
Paradigms for Parallel Programming
• Processes + shared variables
• Processes + message passing -> SR and MPI
• Concurrent object-oriented languages -> Java
• Concurrent functional languages
• Concurrent logic languages
• Data-parallelism (SPMD model) -> HPF
• Advanced communication models -> Linda
Interprocess Communication and Synchronization Based on Message Passing
Overview
• Message passing
• General issues
• Examples: rendezvous, Remote Procedure Calls, Broadcast
• Nondeterminism
• Select statement
• Example language: SR (Synchronizing Resources)
• Traveling Salesman Problem in SR
• Example library: MPI (Message Passing Interface)
Point-to-point Message Passing
• Basic primitives: send & receive
• As library routines:
  • send(destination, &MsgBuffer)
  • receive(source, &MsgBuffer)
• As language constructs:
  • send MsgName(arguments) to destination
  • receive MsgName(arguments) from source
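For concreteness, the library-routine form is what MPI provides. Below is a minimal sketch in C (not part of the original slides; the tag and message value are arbitrary) in which rank 0 sends one integer to rank 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                           /* message contents */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* destination = rank 1, tag = 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,    /* source = rank 0, tag = 0 */
                 MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Here destination and source are process ranks, and the (count, datatype) pair describes the message buffer.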
Issues in Message Passing
• Naming the sender and receiver
• Explicit or implicit receipt of messages
• Synchronous versus asynchronous messages
Direct naming
• Sender and receiver directly name each other
  • S: send M to R
  • R: receive M from S
• Asymmetric direct naming (more flexible):
  • S: send M to R
  • R: receive M
• Direct naming is easy to implement
  • Destination of the message is known in advance
  • Implementation just maps logical names to machine addresses
Indirect naming
• Indirect naming uses an extra indirection level
  • S: send M to P   -- P is a port name
  • R: receive M from P
• Sender and receiver need not know each other
• Port names can be moved around (e.g., in a message)
  • send ReplyPort(P) to U   -- P is the name of a reply port
• Most languages allow only a single process at a time to receive from any given port
• Some languages allow multiple receivers that service messages on demand -> called a mailbox
Explicit Message Receipt
• Explicit receive by an existing process
• Receiving process only handles the message when it is willing to do so

process main() {
    // regular computation here
    receive M(....);    // explicit message receipt
    // code to handle the message
    // more regular computations ....
}
Implicit message receipt
• Receipt by a new thread of control, created for handling the incoming message

int X;

process main() {
    // just regular computations, this code can access X
}

message-handler M() {   // created whenever a message M arrives
    // code to handle the message, can also access X
}
Threads
• Threads run in (pseudo-)parallel on the same node
• Each thread has its own program counter and local variables
• Threads share global variables
[Figure: timeline of the main thread and two M handler threads, all sharing the global variable X]
Differences (1)
• Implicit receipt is used if it's unknown when a message will arrive; example: request for remote data

Explicit receipt (polling):
process main() {
    int X;
    while (true) {
        if (there is a message readX) {
            receive readX(S);
            send valueX(X) to S;
        }
        // regular computations
    }
}

Implicit receipt:
int X;

process main() {
    // regular computations
}

message-handler readX(S) {
    send valueX(X) to S;
}
Differences (2)
• Explicit receive gives more control over when to accept which messages; e.g., SR allows:
  • receive ReadFile(file, offset, NrBytes) by NrBytes
  • // sorts messages by (increasing) 3rd parameter, i.e. small reads go first
• MPI has explicit receive (+ polling for implicit receive)
• Java has implicit receive: Remote Method Invocation (RMI)
• SR has both
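The MPI combination mentioned above (explicit receive, plus polling to approximate implicit receipt) can be sketched in C as follows. This is not from the slides; the tags are invented, and the request/reply pattern mirrors the readX/valueX example of the previous slide:

#include <mpi.h>

#define TAG_READX  7    /* hypothetical tag for readX requests */
#define TAG_VALUEX 8    /* hypothetical tag for the valueX reply */

/* Poll for pending readX requests and answer them, then return to the
 * regular computation (implicit receipt approximated by polling). */
void poll_readX(int X) {
    int flag, dummy;
    MPI_Status st;
    MPI_Iprobe(MPI_ANY_SOURCE, TAG_READX, MPI_COMM_WORLD, &flag, &st);
    while (flag) {
        MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_READX,
                 MPI_COMM_WORLD, &st);                      /* explicit receive */
        MPI_Send(&X, 1, MPI_INT, st.MPI_SOURCE, TAG_VALUEX,
                 MPI_COMM_WORLD);                           /* send valueX(X) to S */
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_READX, MPI_COMM_WORLD, &flag, &st);
    }
}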
Synchronous vs. Asynchronous Message Passing
• Synchronous message passing:
  • Sender is blocked until the receiver has accepted the message
  • Too restrictive for many parallel applications
• Asynchronous message passing:
  • Sender continues immediately
  • More efficient
  • Ordering problems
  • Buffering problems
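In MPI terms the two flavors correspond roughly to MPI_Ssend (synchronous: returns only after the receiver has started receiving) and MPI_Isend (nonblocking: the sender continues immediately). A small sketch of the sender side, with arbitrary tags and a single-int payload (not from the slides):

#include <mpi.h>

/* Send one int to rank 1, first synchronously, then asynchronously. */
void send_both_ways(int value) {
    MPI_Request req;

    /* Synchronous: blocks until the matching receive has been posted. */
    MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

    /* Asynchronous (nonblocking): returns immediately; 'value' must not
     * be modified until MPI_Wait reports that the send has completed. */
    MPI_Isend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
    /* ... overlap useful computation here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}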
Message ordering
• Ordering with asynchronous message passing:

  SENDER:             RECEIVER:
  send message(1)     receive message(N); print N
  send message(2)     receive message(M); print M

• Messages may be received in any order, depending on the protocol
[Figure: message(1) and message(2) in transit, possibly overtaking each other]
Example: the AT&T crash
[Figure: message exchange between P1 and P2 ("P1 crashes", "I'm back", regular message, "Are you still alive?", "P1 is dead", "P2 is dead", "Something's wrong, I'd better crash!") showing how reordered and delayed messages lead each process to conclude that the other has failed and to crash itself]
Message buffering
• Keep messages in a buffer until the receive() is done
• What if the buffer overflows?
  • Continue, but delete some messages (e.g., the oldest one), or
  • Use flow control: block the sender temporarily
• Flow control changes the semantics, since it introduces synchronization
  • S: send zillion messages to R; receive messages
  • R: send zillion messages to S; receive messages
  • -> deadlock!
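The same deadlock shows up in MPI when both sides use blocking MPI_Send for messages too large for the implementation's internal buffers, so flow control kicks in. A sketch (message size and the two-rank setup are assumptions for illustration):

#include <mpi.h>

#define N (1 << 20)     /* assumed large enough to exceed internal buffering */

/* Both ranks first send, then receive: once MPI runs out of buffer space
 * it blocks the sender (flow control), and the two sends wait forever. */
void head_to_head(int rank, double *out, double *in) {
    int peer = 1 - rank;    /* assumes exactly two ranks, 0 and 1 */
    MPI_Send(out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);   /* may block forever */
    MPI_Recv(in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* Fix: MPI_Sendrecv, or nonblocking MPI_Isend/MPI_Irecv + MPI_Waitall. */
}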
Example communication primitives
• Rendezvous (Ada)
• Remote Procedure Call (RPC)
• Broadcast
Rendezvous (Ada)
• Two-way interaction
• Synchronous (blocking) send
• Explicit receive
• Output parameters sent back to the caller
• Entry = procedure implemented by a task that can be called remotely
Example

task SERVER is
  entry INCREMENT(X: integer; Y: out integer);
end;

Entry call:
  S.INCREMENT(2, A)   -- invoke entry of task S
Accept statement

task body SERVER is
begin
  accept INCREMENT(X: integer; Y: out integer) do
    Y := X + 1;   -- handle entry call
  end;
  ...
end;

• Entry call is fully synchronous
  • Invoker waits until the server is ready to accept
  • Accept statement waits for an entry call
  • Caller proceeds after the accept statement has been executed
Remote Procedure Call (RPC)
• Similar to a traditional procedure call
  • Caller and receiver are different processes
  • Possibly on different machines
• Fully synchronous
  • Sender waits for the RPC to complete
• Implicit message receipt
  • New thread of control within the receiver
Broadcast
• Many networks (e.g., Ethernet) support:
  • broadcast: send a message to all machines
  • multicast: send a message to a set of machines
• Hardware multicast is very efficient
  • Ethernet: same delay as for a unicast
• Multicast can be made reliable using software protocols
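At the library level, MPI exposes broadcast as a collective operation in which every process in the communicator participates. A minimal sketch (root rank and payload chosen arbitrarily):

#include <stdio.h>
#include <mpi.h>

/* Rank 0 broadcasts one int to every process in MPI_COMM_WORLD. */
int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 123;                      /* only the root has the data initially */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* root = rank 0 */
    printf("rank %d now has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}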
Nondeterminism
• Interactions may depend on run-time conditions
  • e.g.: wait for a message from either A or B, whichever comes first
• Need to express and control nondeterminism
  • specify when to accept which message
• Example (bounded buffer):
  • do simultaneously:
    • when buffer not full: accept request to store message
    • when buffer not empty: accept request to fetch message
Select statement
• Several alternatives of the form:
  WHEN condition => ACCEPT message DO statement
• Each alternative may
  • succeed, if condition = true & a message is available
  • fail, if condition = false
  • suspend, if condition = true & no message is available yet
• The entire select statement may
  • succeed, if any alternative succeeds -> pick one nondeterministically
  • fail, if all alternatives fail
  • suspend, if some alternatives suspend and none succeeds yet
Example: bounded buffer in Ada

select
  when not FULL(BUFFER) =>
    accept STORE_ITEM(X: INTEGER) do
      ‘store X in buffer’
    end;
or
  when not EMPTY(BUFFER) =>
    accept FETCH_ITEM(X: out INTEGER) do
      X := ‘first item from buffer’
    end;
end select;
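MPI offers only a limited form of this nondeterminism: a receive may use the wildcard MPI_ANY_SOURCE to accept a message from whichever sender is first, but there is no direct analogue of the guarding conditions. A sketch in C (tag name invented):

#include <stdio.h>
#include <mpi.h>

#define TAG_REQ 1    /* hypothetical tag for requests */

/* Accept a request from any sender, whichever arrives first. */
void serve_one_request(void) {
    int request;
    MPI_Status status;
    MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ,
             MPI_COMM_WORLD, &status);
    printf("got request %d from rank %d\n", request, status.MPI_SOURCE);
}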
Synchronizing Resources (SR)
• Developed at the University of Arizona
• Goals of SR:
  • Expressiveness
    • Many message passing primitives
  • Ease of use
    • Minimize the number of underlying concepts
    • Clean integration of language constructs
  • Efficiency
    • Each primitive must be efficient
Overview of SR
• Multiple forms of message passing
  • Asynchronous message passing
  • Rendezvous (explicit receipt)
  • Remote Procedure Call (implicit receipt)
  • Multicast
• Powerful receive statement
  • Conditional & ordered receive, based on contents of the message
  • Select statement
• Resource = module run on one node (uni/multiprocessor)
  • Contains multiple threads that share variables
Orthogonality in SR
• The send and receive primitives can be combined in all 4 possible ways
Example

body S              # sender
  send R.m1         # asynchronous message passing
  send R.m2         # fork
  call R.m1         # rendezvous
  call R.m2         # RPC
end S

body R              # receiver
  proc m2()         # implicit receipt
    # code to handle m2
  end

  initial           # main process of R
    do true ->      # infinite loop
      in m1()       # explicit receive
        # code to handle m1
      ni
    od
  end
end R
Traveling Salesman Problem (TSP) in SR
• Find the shortest route for the salesman among a given set of cities
• Each city must be visited once, no return to the initial city
[Figure: example graph with the cities New York, Chicago, Saint Louis, and Miami and the distances between them]
Sequential branch-and-bound
• Structure the entire search space as a tree, sorted using the nearest-city-first heuristic
[Figure: search tree with root New York (n), branching over Chicago (c), Saint Louis (s), and Miami (m), edges labeled with distances]
Pruning the search tree
• Keep track of the best solution found so far (the "bound")
• Cut off partial routes >= bound
[Figure: the same search tree; a complete route of length 6 has been found, so subtrees whose partial routes are already >= 6 can be pruned]
Parallelizing TSP
• Distribute the search tree over the CPUs
• CPUs analyze different routes
• Results in reasonably large-grain jobs
Distribution of TSP search tree
• Subtasks:
  • New York -> Chicago
  • New York -> Saint Louis
  • New York -> Miami
[Figure: the search tree partitioned at the first level; each subtask (subtree) is handled by one of CPU 1, CPU 2, CPU 3]
Distribution of the tree (2)
• Static distribution: each CPU gets a fixed part of the tree
• Load balancing problem: subtrees take different amounts of time
[Figure: the search tree again, with statically assigned subtrees of unequal size]
Dynamic distribution: Replicated Workers Model
• Master process generates a large number of jobs (subtrees) and repeatedly hands them out
• Worker processes (subcontractors) repeatedly take work and execute it
• 1 worker per processor
• General, frequently-used model for parallel processing
Implementing TSP in SR
• Need communication to distribute work
• Need communication to implement the global bound
Distributing work
• Master generates jobs to be executed by workers
• Not known in advance which worker will execute which job
  • A "mailbox" (port with >1 receivers) would have helped
  • Use an intermediate buffer process instead
[Figure: Master sends jobs to a buffer process, from which the workers fetch them]
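This is not the SR implementation used in the course, but a rough MPI-style counterpart of the pattern in C: here the master itself plays the role of the buffer and hands out jobs on demand, with workers pulling work via request messages. Tags, the job representation, and the termination protocol are all invented for the sketch:

#include <mpi.h>

#define TAG_REQUEST 1    /* hypothetical tags */
#define TAG_JOB     2
#define TAG_STOP    3

/* Master (rank 0): hand out njobs jobs on demand, then stop every worker. */
void master(int njobs, int nworkers) {
    int job, dummy, stopped = 0;
    MPI_Status st;
    for (job = 0; job < njobs; job++) {
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);                    /* a worker asks for work */
        MPI_Send(&job, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB, MPI_COMM_WORLD);
    }
    while (stopped < nworkers) {                          /* no work left */
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
        stopped++;
    }
}

/* Worker: repeatedly request a job from the master and execute it. */
void worker(void) {
    int job, dummy = 0;
    MPI_Status st;
    for (;;) {
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        /* ... expand the search subtree identified by 'job' here ... */
    }
}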
Implementing the global bound
• Problem: the bound is a global variable, but it must be implemented with message passing
• The bound is accessed millions of times, but updated only when a better route is found
• The only efficient solution is to manually replicate it
Managing a replicated variable in SR
• Use a BoundManager process to serialize updates
• M = copy of the global minimum (the bound), held by each worker
• Assign: asynchronous send + explicit ordered receive
• Update: synchronous call + implicit receipt + multicast
[Figure: Worker 2 executes M := 3 and sends Assign(M,3) to the BoundManager, which multicasts Update(M,3) to the copies of M held by all workers]
SR code fragments for TSP

body worker
  var M: int := Infinite      # copy of bound
  sem sema                    # semaphore

  proc update(value: int)     # implicit receipt
    P(sema)                   # lock copy
    M := value
    V(sema)                   # unlock
  end update

  initial                     # main code for worker
    # - can read M (using sema)
    # - can use send BoundManager.Assign(value)
  end
end worker

body BoundManager
  var M: int := Infinite
  do true ->                            # handle requests 1 by 1
    in Assign(value) by value ->        # ordered receive
      if value < M -> M := value
        co (i := 1 to ncpus)            # multicast
          call worker[i].update(value)
        oc
      fi
    ni
  od
end BoundManager
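For comparison, and not from the course material, a rough MPI-style sketch of the same replication idea in C: each worker keeps a local copy of the bound, reports better values to a bound-manager rank, and polls for forwarded updates. The rank number, tags, and function names are all assumptions, and the manager's own loop (ordered receipt of Assign messages and forwarding of updates) is omitted:

#include <mpi.h>

#define BOUND_MGR  0     /* hypothetical: rank acting as the BoundManager */
#define TAG_ASSIGN 10    /* worker -> manager: proposed new bound */
#define TAG_UPDATE 11    /* manager -> workers: accepted new bound */

/* Drain any pending bound updates into the local copy (cheap check,
 * done before expanding each node). */
void refresh_bound(int *local_min) {
    int flag, new_bound;
    MPI_Status st;
    MPI_Iprobe(BOUND_MGR, TAG_UPDATE, MPI_COMM_WORLD, &flag, &st);
    while (flag) {
        MPI_Recv(&new_bound, 1, MPI_INT, BOUND_MGR, TAG_UPDATE,
                 MPI_COMM_WORLD, &st);
        if (new_bound < *local_min)
            *local_min = new_bound;
        MPI_Iprobe(BOUND_MGR, TAG_UPDATE, MPI_COMM_WORLD, &flag, &st);
    }
}

/* Report a better route length to the bound manager (plays the role of
 * SR's asynchronous "send BoundManager.Assign(value)"). */
void propose_bound(int value) {
    MPI_Send(&value, 1, MPI_INT, BOUND_MGR, TAG_ASSIGN, MPI_COMM_WORLD);
}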
Search overhead
• Problem: the path with length 6 is not yet computed by CPU 1 when CPU 3 starts on n -> m -> s
• That subtree is therefore not pruned :-(
• The parallel algorithm does more work than the sequential algorithm: search overhead
[Figure: the search tree distributed over CPU 1, CPU 2, and CPU 3; the n -> m -> s subtree on CPU 3 cannot be pruned]
Performance of TSP in SR
• Communication overhead
  • Distribution of jobs + updating the global bound (small overhead)
• Load imbalances
  • Replicated workers model has automatic load balancing
• Synchronization overhead
  • Mutual exclusion (locking) needed for accessing the copy of the bound
• Search overhead
  • Main performance problem
• In practice: high speedups possible