War of the Worlds -- Shared-memory vs. Distributed-memory
• In the distributed world, we have heavyweight processes (nodes) rather than threads
• Nodes communicate by exchanging messages
• We do not have shared memory
• Communication is much more expensive
  • Sending a message takes much more time than sending data through a channel
  • Communication costs are possibly non-uniform
• We only have 1-to-1 communication (no many-to-many channels)
Initial Distributed-memory Settings
• We consider settings where there is no multithreading within a single MPI node
• We consider systems where the communication latency between different nodes is
  • Low
  • Uniform
Good Shared Memory Orbit Version
[Diagram: hash-server threads, each owning part of the hash table ({z1,z4,z5,…}, {z2,z3,z8,…}, {z6,z7,z9,…}), a shared task pool of unprocessed points [x1,x2,…,xm], and worker threads applying the generators f1..f5.]
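As a point of reference for the MPI versions that follow, here is a minimal sketch of one worker thread in the shared-memory design pictured above. It is illustrative only, assuming HPC-GAP style channels (SendChannel / ReceiveChannel) for the shared task pool and for the per-hash-server inboxes; the names taskPool, serverChannels and the use of fail as a shutdown signal are hypothetical.

SharedWorker := function(taskPool, serverChannels, gens, op, f)
  # Sketch of one shared-memory worker thread (assumptions as noted above)
  local chunk, x, g, y;
  while true do
    chunk := ReceiveChannel(taskPool);        # block until a chunk of points arrives
    if IsIdenticalObj(chunk, fail) then       # hypothetical shutdown signal
      return;
    fi;
    for x in chunk do
      for g in gens do
        y := op(x, g);                        # apply one generator to the point
        SendChannel(serverChannels[f(y)], y); # hash function f picks the owning server
      od;
    od;
  od;
end;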
Why is this version hard to port to MPI?
• A single task pool!
• It requires a shared structure to which all of the hash servers write data and from which all of the workers read data
• This is not easy to implement using MPI, where we only have 1-to-1 communication
• We could have a dedicated node that holds the task queue (a sketch of such a node follows below)
  • Workers send messages to it to request work
  • Hash servers send messages to it to push work
• This would make that node a potential bottleneck, and would involve a lot of communication
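For illustration, a minimal sketch of this rejected single-task-queue design is given below. It reuses the OrbSendMessage / OrbGetMessage helpers shown later; the message tags, chunkSize and the empty-reply convention are assumptions. Every request for work and every batch of new work has to pass through this one node, which is exactly the bottleneck described above.

TaskQueueNode := function()
  # Sketch only: the single dedicated node holding the whole task queue
  local queue, msg, k, chunk;
  queue := [];
  while true do
    msg := OrbGetMessage(true);              # every worker and hash server talks to us
    if msg[1] = "getwork" then               # a worker asks for work
      k := Minimum(chunkSize, Length(queue));
      chunk := queue{[1..k]};
      queue := queue{[k+1..Length(queue)]};
      OrbSendMessage(chunk, msg[2]);         # an empty reply means "ask again later"
    elif msg[1] = "pushwork" then            # a hash server pushes new points
      Append(queue, msg[2]);
    fi;
  od;
end;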
MPI Version 1
• Maybe merge workers and hash servers?
• Each MPI node acts both as a hash server and as a worker
• Each node has its own task pool
• If a node's task pool is empty, the node tries to steal work from some other node (a sketch follows after the diagram below)
MPI Version 1
[Diagram: MPI nodes, each holding its own part of the hash table, its own task pool [xi1,xi2,…], and its own copies of the generators f1..f5.]
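A hedged sketch of the main loop of one such merged node is given below. It reuses the OrbSendMessage / OrbGetMessage helpers shown later; the message tags ("points", "tasks", "steal"), the globals processId and nrNodes, the set known (this node's part of the hash table), and the choice of a random victim are assumptions, and a real implementation would throttle the steal requests.

Version1Node := function(gens, op, f, known, taskPool)
  # Sketch only: each node is both a hash server (for the points it owns)
  # and a worker (applying generators to points from its own task pool)
  local msg, x, g, y, half, victim;
  while true do
    # 1. handle all pending messages
    msg := OrbGetMessage(false);                 # non-blocking
    while msg <> fail do
      if msg[1] = "points" then                  # images routed to us by the hash f
        for x in msg[2] do
          if not x in known then                 # hash-table lookup
            AddSet(known, x);                    # genuinely new point ...
            Add(taskPool, x);                    # ... becomes a local task
          fi;
        od;
      elif msg[1] = "tasks" then                 # work stolen from another node
        Append(taskPool, msg[2]);
      elif msg[1] = "steal" and Length(taskPool) > 1 then
        half := QuoInt(Length(taskPool), 2);     # give away half of our tasks
        OrbSendMessage(["tasks", taskPool{[1..half]}], msg[2]);
        taskPool := taskPool{[half+1..Length(taskPool)]};
      fi;
      msg := OrbGetMessage(false);
    od;
    # 2. process one local task: apply every generator to it
    if Length(taskPool) > 0 then
      x := Remove(taskPool);
      for g in gens do
        y := op(x, g);
        OrbSendMessage(["points", [y]], f(y));   # f(y) = node owning point y
      od;
    else
      # 3. no local work: try to steal from a random other node
      victim := Random(Difference([1..nrNodes], [processId]));
      OrbSendMessage(["steal", processId], victim);
    fi;
  od;
end;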
MPI Version 1 is Bad!
• Bad performance, especially for smaller numbers of nodes
• The same process does hash-table lookups and applies generator functions to points
• It cannot do both at the same time => something has to wait
• This creates contention
MPI Version 2
• Separate hash servers and workers, after all
• Hash-server nodes
  • Keep parts of the hash table
  • Also keep parts of the task pool
• Worker nodes just apply generators to points
• Workers obtain work from hash-server nodes using work-stealing
MPI Version 2
[Diagram: hash-server nodes, each holding part of the hash table and part of the task pool (T1, T2, T3); worker nodes applying the generators f1..f5 and exchanging points with the hash servers.]
MPI Version 2
• Much better performance than MPI Version 1 (on low-latency systems)
• The key is separating the hash lookups and the application of generators to points into different nodes
Big Issue with MPI Versions 1 and 2 -- Detecting Termination!
• We need to detect the situation where all of the hash-server nodes have empty task pools, and where no new work will be produced by the hash servers!
• Even detecting that all task pools are empty and all hash servers and all workers are idle is not enough, as there may be messages in flight that will create more work!
• Woe unto me! What are we to do?
• Good ol' Dijkstra comes to the rescue - we use a variant of the Dijkstra-Scholten termination detection algorithm
Termination Detection Algorithm
• Each hash server keeps two counters
  • Number of points sent (my_nr_points_sent)
  • Number of points received (my_nr_points_rcvd)
• We enumerate the hash servers H0 ... Hn
• Hash server H0, when idle, sends a token to hash server H1
  • It attaches a token count (my_nr_points_sent, my_nr_points_rcvd) to the token
• When a hash server Hi receives the token
  • If it is active (has tasks in its task pool), it sends the token back to H0
  • If it is idle, it increases each component of the count attached to the token and sends the token to Hi+1
    • If the received token count was (pts_sent, pts_rcvd), the new token count is (my_nr_points_sent + pts_sent, my_nr_points_rcvd + pts_rcvd)
• If H0 receives the token with a token count (pts_sent, pts_rcvd) such that pts_rcvd = num_gens * pts_sent, then termination is detected (a sketch follows below)
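A hedged sketch of the token handling on a hash server is shown below. The counters my_nr_points_sent / my_nr_points_rcvd and num_gens are the ones from the slide, and OrbSendMessage is the helper shown in the code further below; serverId0 (the rank of H0), NextHashServer, HasTasks, BroadcastFinish and the "token" / "token-cancelled" message tags are hypothetical names introduced only for this sketch.

StartTokenRound := function()
  # Called by H0 whenever it becomes idle: attach H0's own counters to the token
  OrbSendMessage(["token", [my_nr_points_sent, my_nr_points_rcvd]],
                 NextHashServer(processId));
end;

HandleToken := function(count)
  # Called when a hash server receives ["token", count] from its predecessor;
  # count = [pts_sent, pts_rcvd] accumulated by the idle servers so far
  if processId = serverId0 then
    # the token has travelled around the whole ring and returned to H0
    if count[2] = num_gens * count[1] then
      BroadcastFinish();                  # every sent point produced all its images
    fi;                                   # otherwise H0 starts a new round later
  elif HasTasks() then
    # this server is still active: return the token to H0, cancelling this round
    OrbSendMessage(["token-cancelled"], serverId0);
  else
    # idle: add our own counters and pass the token on to the next hash server
    OrbSendMessage(["token",
        [count[1] + my_nr_points_sent, count[2] + my_nr_points_rcvd]],
      NextHashServer(processId));
  fi;
end;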
MPIGAP Code for MPI Version 2
• Not trivial (~400 lines of GAP code)
• Explicit message passing using low-level MPI bindings
• This version is hard to implement using the task abstraction
MPIGAP Code for MPI Version 2

Worker := function(gens,op,f)
  # gens: generators, op: action on points, f: hash function mapping a
  # point to the hash server (1..nrHashes) that owns it
  local g,j,n,m,res,t,x;
  n := nrHashes;
  while true do
    t := GetWork();                          # fetch a chunk of points to process
    if IsIdenticalObj(t, fail) then          # no more work: the orbit is complete
      return;
    fi;
    m := QuoInt(Length(t)*Length(gens)*2,n);
    res := List([1..n],x->EmptyPlist(m));    # one outgoing buffer per hash server
    for j in [1..Length(t)] do
      for g in gens do
        x := op(t[j],g);                     # apply one generator to one point
        Add(res[f(x)],x);                    # route the image to its hash server
      od;
    od;
    for j in [1..n] do
      if Length(res[j]) > 0 then
        OrbSendMessage(res[j],minHashId+j-1);  # send the batch to hash server j
      fi;
    od;
  od;
end;
MPIGAP Code for MPI Version 2

GetWork := function()
  # request a chunk of work from a hash server; returns a list of points,
  # or fail once the hash servers have signalled termination
  local msg, tid;
  tid := minHashId;                            # ask the first hash server
  OrbSendMessage(["getwork",processId],tid);
  msg := OrbGetMessage(true);                  # blocking wait for the reply
  if msg[1]<>"finish" then
    return msg;                                # a chunk of points
  else
    return fail;                               # termination detected
  fi;
end;
MPIGAP Code for MPI Version 2

OrbGetMessage := function(blocking)
  # receive one message: blocks if blocking = true, otherwise returns fail
  # immediately when no message is pending
  local test, msg, tmp;
  if blocking then
    test := MPI_Probe();
  else
    test := MPI_Iprobe();
  fi;
  if test then
    msg := UNIX_MakeString(MPI_Get_count());   # buffer of the right length
    MPI_Recv(msg);
    tmp := DeserializeNativeString(msg);       # back into a GAP object
    return tmp;
  else
    return fail;
  fi;
end;

OrbSendMessage := function(raw,dest)
  # serialize a GAP object and send it to MPI node dest
  local msg;
  msg := SerializeToNativeString(raw);
  MPI_Binsend(msg,dest,Length(msg));
end;
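The hash-server side is not shown above; here is a hedged sketch of what its main loop could look like, matching the Worker / GetWork protocol. The globals known (this server's part of the hash table, kept as a GAP set), taskPool, chunkSize, the counters my_nr_points_sent / my_nr_points_rcvd from the termination slide, and the TerminationDetected test are assumptions.

HashServer := function()
  # Sketch only: main loop of one hash-server node in MPI Version 2
  local pending, sendChunk, msg, x;
  pending := [];                              # workers waiting for work
  sendChunk := function(worker)
    # send up to chunkSize points from the task pool to the given worker
    local k, chunk;
    k := Minimum(chunkSize, Length(taskPool));
    chunk := taskPool{[1..k]};
    taskPool := taskPool{[k+1..Length(taskPool)]};
    my_nr_points_sent := my_nr_points_sent + k;
    OrbSendMessage(chunk, worker);
  end;
  while true do
    msg := OrbGetMessage(true);               # block until the next message
    if IsString(msg[1]) and msg[1] = "getwork" then
      # a work request from worker msg[2]
      if Length(taskPool) > 0 then
        sendChunk(msg[2]);
      elif TerminationDetected() then
        OrbSendMessage(["finish"], msg[2]);   # orbit complete: stop this worker
      else
        Add(pending, msg[2]);                 # no work right now: reply later
      fi;
    else
      # a batch of freshly computed points from some worker
      my_nr_points_rcvd := my_nr_points_rcvd + Length(msg);
      for x in msg do
        if not x in known then                # hash-table lookup
          AddSet(known, x);                   # genuinely new orbit point
          Add(taskPool, x);                   # schedule it as future work
        fi;
      od;
      while Length(pending) > 0 and Length(taskPool) > 0 do
        sendChunk(Remove(pending));           # serve workers that were waiting
      od;
    fi;
  od;
end;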
Work in Progress - Extending MPI Version 2 To Systems With Non-Uniform Latency
• Communication latencies between nodes might be different
• Where to place the hash-server nodes? And how many?
• How to do work distribution?
  • Is work stealing still a good idea in a setting where the communication distance between a worker and the different hash servers is not uniform? (see the sketch below)
• We can look at the shared memory + MPI world as a special case of this
  • Multithreading within MPI nodes
  • Threads from the same node can communicate quickly
  • Nodes communicate much more slowly
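As one possible direction, work requests could prefer the nearest hash server and only fall back to farther ones when it has nothing to offer. Below is a hedged sketch of such a latency-aware variant of GetWork; it assumes hashServers is the list of hash-server ranks sorted by measured latency from this worker (nearest first), and that a hash server answers a request with an empty list when it currently has no work. A real implementation would also back off instead of polling in a tight loop.

GetWorkNonUniform := function(hashServers)
  # Sketch only: latency-aware work requests (assumptions as noted above)
  local tid, msg;
  while true do
    for tid in hashServers do                 # nearest hash server first
      OrbSendMessage(["getwork", processId], tid);
      msg := OrbGetMessage(true);             # blocking wait for the reply
      if Length(msg) > 0 then
        if msg[1] = "finish" then
          return fail;                        # orbit complete: stop this worker
        fi;
        return msg;                           # a chunk of points to process
      fi;
      # empty reply: that server has no work, try the next-nearest one
    od;
  od;
end;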