Diffracting Trees and Layout. Multiprocessor Synchronization, Nir Shavit, Spring 2003.
Recall. Theorem: there exists a counting network of Θ(log w) depth. Unfortunately, the proof is non-constructive and the constants are in the 1000s. Can we get log depth in practice? M. Herlihy & N. Shavit (c) 2003
Yes: Diffracting Trees. [Figure: tokens 1, 2, 3 routed through a tree of balancers b.]
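Counting Trees. A balancer sends its input tokens alternately to its top and bottom outputs, so $y_0 = \lceil (x_0 + x_1)/2 \rceil$ and $y_1 = \lfloor (x_0 + x_1)/2 \rfloor$. [Figure: five tokens enter a balancer b; tokens 1, 3, 5 leave on the top wire and tokens 2, 4 on the bottom wire.] A counting tree is a binary tree of balancers that has the step property in any quiescent state: $0 \le y_i - y_j \le 1$ for any $i < j$.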
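The balancer definition above can be checked with a small sketch. This is a minimal illustration, assuming a single toggle bit that starts by pointing at the top wire; the class name `ToggleBalancer` and the helper `route` are hypothetical and do not come from the slides.

```java
// A minimal toggle-bit balancer; names are illustrative, not from the slides.
class ToggleBalancer {
    private boolean toggle = false;   // false: the next token leaves on wire 0 (top)

    // Fetch-and-complement: use the old bit as the output wire, then flip the bit.
    synchronized int traverse() {
        boolean old = toggle;
        toggle = !toggle;
        return old ? 1 : 0;
    }

    // Push `tokens` tokens through a fresh balancer and count exits per wire.
    static int[] route(int tokens) {
        ToggleBalancer b = new ToggleBalancer();
        int[] y = new int[2];
        for (int t = 0; t < tokens; t++) {
            y[b.traverse()]++;
        }
        return y;
    }

    public static void main(String[] args) {
        int[] y = route(5);
        System.out.println("y0=" + y[0] + " y1=" + y[1]);
    }
}
```

With five tokens the top wire receives three and the bottom wire two, matching the ceiling and floor in the definition.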
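A Counting Tree. inc = follow fetch&complement of the toggle bits: at each balancer a token reads the toggle bit, complements it, and follows the old value. [Figure: tree of balancers b whose leaf outputs have the step property.]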
General Tree Construction. Inductive construction: T[2k] is a root balancer b whose two outputs feed two copies of T[k]; one copy supplies the k even outputs of T[2k] and the other supplies the k odd outputs.
Why it Counts. Lemma 1: T[2k] has the step property in a quiescent state. [Figure: T[2k] = root balancer b whose output y0 feeds T0[k] (the k even outputs) and whose output y1 feeds T1[k] (the k odd outputs).]
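The inductive construction and Lemma 1 can be exercised with a sequential sketch. It assumes the width is a power of two and that the root's north output feeds the subtree with the even outputs; the class `CountingTreeSketch` and its helpers are hypothetical names, and the code is single-threaded on purpose.

```java
// Sequential sketch of T[2k] = root balancer + two copies of T[k].
// Names are illustrative; this is not the slides' implementation.
class CountingTreeSketch {
    private boolean toggle = false;             // this node's toggle bit
    private final CountingTreeSketch[] child;   // child[0] = T0[k], child[1] = T1[k]; null at a leaf

    CountingTreeSketch(int width) {             // width is assumed to be a power of two
        if (width == 1) {
            child = null;
        } else {
            child = new CountingTreeSketch[] {
                new CountingTreeSketch(width / 2),
                new CountingTreeSketch(width / 2)
            };
        }
    }

    // Returns the index of the output wire reached by one token.
    int traverse() {
        if (child == null) return 0;
        int bit = toggle ? 1 : 0;               // fetch ...
        toggle = !toggle;                       // ... and complement
        // T0[k] supplies the even outputs of T[2k], T1[k] the odd ones.
        return 2 * child[bit].traverse() + bit;
    }

    // Sends `tokens` tokens through a fresh tree and counts exits per output wire.
    static int[] countExits(int width, int tokens) {
        CountingTreeSketch tree = new CountingTreeSketch(width);
        int[] y = new int[width];
        for (int t = 0; t < tokens; t++) {
            y[tree.traverse()]++;
        }
        return y;
    }

    // Step property: 0 <= y[i] - y[j] <= 1 for all i < j.
    static boolean hasStepProperty(int[] y) {
        for (int i = 0; i < y.length; i++) {
            for (int j = i + 1; j < y.length; j++) {
                int d = y[i] - y[j];
                if (d < 0 || d > 1) return false;
            }
        }
        return true;
    }
}
```

For example, `countExits(4, 6)` yields the counts 2, 2, 1, 1, which satisfy the step property.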
Back to Square One. Problem: the toggle bit in each balancer. Every token must access the root's toggle bit, so it becomes a hot spot. [Figure: tree of balancers b, each with a 0/1 toggle bit.]
Diffraction. If an even number of processes pass a balancer, the toggle bit remains unchanged! [Figure: a prism array placed in front of the 0/1 toggle bit.]
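The observation behind diffraction can be illustrated sequentially. The sketch below compares sending an even number of tokens through the toggle bit with pairing them off so that one token of each pair goes north and the other south without touching the bit. The class `DiffractionIdea` and both method names are hypothetical.

```java
// Illustrates why pairs of tokens may bypass the toggle bit.
// Names are illustrative; each method returns {north count, south count, final toggle bit}.
class DiffractionIdea {
    // Route every token through the toggle bit.
    static int[] viaToggle(int tokens, boolean initialToggle) {
        boolean toggle = initialToggle;
        int[] out = new int[3];
        for (int t = 0; t < tokens; t++) {
            out[toggle ? 1 : 0]++;   // follow the current bit
            toggle = !toggle;        // then complement it
        }
        out[2] = toggle ? 1 : 0;
        return out;
    }

    // Diffract tokens in pairs: one goes north, one goes south, the toggle is untouched.
    static int[] viaDiffraction(int pairs, boolean initialToggle) {
        return new int[] { pairs, pairs, initialToggle ? 1 : 0 };
    }
}
```

For any number of pairs p, `viaToggle(2 * p, t)` and `viaDiffraction(p, t)` agree, so a diffracted pair is indistinguishable from two tokens that toggled one after the other.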
A Diffracting Tree. Lemma: a diffracting balancer is a balancer. [Figure: tree of diffracting balancers B1, B2, B3; each Diff-Bal is a prism followed by a 0/1 toggle bit. The root prism has k entries and each child prism has k/2.]
The Benefits. High load: lots of diffraction + few toggles. Low load: low diffraction + few toggles. Result: high throughput with low contention.
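Shared-Memory: Diffraction Part 1. Select an index m uniformly at random, then swap(prism[m], i). [Figure: processes Pi and Pj meet at prism entry m; one proceeds to the north balancer and the other to the south balancer, bypassing the 0/1 toggle. A location array (one per tree, entries 1..n) records the balancer b at which each process currently is.]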
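The Diffracting Balancer
public class balancer {
    public int size;                     // number of prism entries
    public RMWRegister[] prism;          // prism array of `size` read-modify-write registers
    public int spin;                     // how long to wait in the prism for a partner
    public MCSLock lock;                 // protects the toggle bit
    public boolean toggle;
    public balancer[] next;              // next[0] = north child, next[1] = south child
    public boolean isLeaf;
    public synchronized boolean flip() { // fetch&complement of the toggle bit
        boolean result = toggle;
        toggle = !toggle;
        return result;
    }
}
public RMWRegister[] location;           // NUMPROCS entries, one per process, each initially EMPTY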
Diffracting Balancer Part 1
public balancer traverse() {
    int mypid = Thread.myIndex();                  // get thread and balancer ids
    balancer b = this;
    if (b.isLeaf) return this;                     // if leaf, done
    location[mypid] = b;                           // initialize location entry
    int place = random(1, b.size);                 // random prism entry
    int him = b.prism[place].SWAP(mypid);          // swap own id in, get the other's id out
    if (not_empty(him)) {
        // try to unset both entries: your own first, then the other's if he is at the same balancer
        if (location[mypid].CAS(b, EMPTY)) {
            if (location[him].CAS(b, EMPTY)) {
                return b.next[0].traverse();       // success: go north
            } else {
                location[mypid] = b;               // own entry succeeded but the other's failed: reset, go to part 2
            }
        } else {
            return b.next[1].traverse();           // failed even on own entry, so the other succeeded: go south
        }
    }
    // traverse() continues in Part 2
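Diffraction Part 2. Pi selects a prism index uniformly at random and waits s steps; if no other process diffracts it, Pi follows the toggle bit. [Figure: Pi in the prism and in the location array at balancer b, with the 0/1 toggle leading to the north and south balancers.]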
Diffracting Balancer Part 2
    while (true) {
        for (int j = 0; j < b.spin; j++) {         // spin a fixed amount, waiting to be diffracted by another process
            if (location[mypid] != b) {
                return b.next[1].traverse();       // if diffracted, go south
            }
        }
        if (b.lock.acquire()) {                    // otherwise grab the lock
            if (location[mypid].CAS(b, EMPTY)) {   // still not diffracted: atomically reset location entry and toggle
                int i = b.flip() ? 1 : 0;          // flip() returns the old bit; false is taken to mean north
                b.lock.release();                  // release lock and follow the bit
                return b.next[i].traverse();
            } else {                               // otherwise you were diffracted: release lock and go south
                b.lock.release();
                return b.next[1].traverse();
            }
        }
    }
}
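The two-part protocol above relies on primitives (RMWRegister, MCSLock, a location array) that the slides leave abstract. As a runnable approximation, the sketch below swaps in `java.util.concurrent.Exchanger` for the prism: two threads that meet in a slot are diffracted, and a thread that times out falls back to the toggle bit. This is a substituted technique, not the slides' SWAP/CAS code; the class `ExchangerDiffractingBalancer`, its parameters, and the rule that the smaller id goes north are all assumptions made for illustration.

```java
import java.util.concurrent.Exchanger;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of one diffracting balancer built on Exchanger (an assumption, not the slides' code).
class ExchangerDiffractingBalancer {
    private final Exchanger<Integer>[] prism;   // one exchanger per prism slot
    private final long spinMicros;              // how long to wait for a partner
    private boolean toggle = false;

    @SuppressWarnings("unchecked")
    ExchangerDiffractingBalancer(int prismSize, long spinMicros) {
        prism = (Exchanger<Integer>[]) new Exchanger[prismSize];
        for (int i = 0; i < prismSize; i++) prism[i] = new Exchanger<>();
        this.spinMicros = spinMicros;
    }

    private synchronized int flip() {           // fetch&complement of the toggle bit
        boolean old = toggle;
        toggle = !toggle;
        return old ? 1 : 0;
    }

    // Returns 0 (north) or 1 (south); myId must be unique per thread.
    int traverse(int myId) throws InterruptedException {
        int slot = ThreadLocalRandom.current().nextInt(prism.length);
        try {
            int other = prism[slot].exchange(myId, spinMicros, TimeUnit.MICROSECONDS);
            return myId < other ? 0 : 1;        // diffracted pair: exactly one goes each way
        } catch (TimeoutException e) {
            return flip();                      // nobody showed up: use the toggle bit
        }
    }

    // Runs `threads` threads, each traversing `perThread` times; returns {north, south}.
    static int[] run(int threads, int perThread) {
        ExchangerDiffractingBalancer b = new ExchangerDiffractingBalancer(2, 50);
        AtomicInteger[] out = { new AtomicInteger(), new AtomicInteger() };
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            ts[t] = new Thread(() -> {
                try {
                    for (int i = 0; i < perThread; i++) out[b.traverse(id)].incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[t].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new int[] { out[0].get(), out[1].get() };
    }
}
```

Once all threads have finished, the north count exceeds the south count by at most one, whatever mix of diffractions and toggles occurred. That is the balancer property the lemma asserts.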
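Index Distribution Benchmark
void indexBench(int iters, int work) throws InterruptedException {
    int i = 0;
    while (i < iters) {
        i = fetchAndInc();                           // fetch&inc on the shared counter under test
        if (work > 0) Thread.sleep(random() % work); // simulated local work between requests
    }
}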
Diffracting Tree Throughput
Diffracting Tree Latency
Diffraction Rate. Let $p$ = # concurrent processors and $r = \#\text{diff} / \#\text{togg}$. The fraction toggling is $\frac{1}{r+1}$, so the total # toggling (in a short period) is $p \cdot \frac{1}{r+1} = \frac{p}{r+1} < \frac{p}{cp} = \frac{1}{c} \approx 10$, with $c \approx 1/10$ for the top balancer. The number of toggling processes remains constant as concurrency increases.
Diffracting Tree by Size
Latency (Work = 1000)
Message Passing Implementation
Locality vs. Bandwidth. [Figure: network topologies connecting processors P1..P16 (single-wire torus mesh, mesh with 5x5 X-bar switches, NxN X-bar, butterfly), arranged along two axes: low to high locality and low to high bandwidth.]
Effects of Placement on Throughput. Trees placed on a mesh with single-wire switches. [Figure: throughput curves, including an optimally placed combining tree.]
Message Passing Pool Benchmark. Throughput on a single-wire mesh. [Figure: two panels, Work=100 and Work=1000.]
Latency on Mesh With Single Wire Switches. [Figure: two panels, Work=100 and Work=500.]
Butterfly (Work=0). No locality to help the combining tree.
Full X-Bar (Work=0). Again no locality to help the combining tree.
Mesh X-Bar Switches (Work=0). More locality to make them closer in performance.
DTree & CNet: Locality vs. Bandwidth. [Figure: the topology chart (single-wire torus mesh, mesh with 5x5 X-bar switches, NxN X-bar, butterfly; axes from low to high locality and low to high bandwidth) annotated with tree and network widths. Labels shown: DTree W=16, W=32, W=16, W=8; CNet W=32, W=32, W=32, W=16.]
Re-think Pools. Queue = FIFO pool. Stack = LIFO pool. [Figure: processes P1..Pn concurrently calling Enq(x), Enq(y), Enq(z) and Deq() on a shared pool.]
Pool Using Counter (Stack-like). Enq(value) :: i = F&Inc(head), then push value on Stack[i+1]. Deq() :: i = F&Dec(head), then pop value from Stack[i]. [Figure: processes P1..Pn, a shared counter C holding head, and stacks Stack1..Stackm.]
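The Enq/Deq rules above translate almost directly into code. The sketch below is a minimal version that uses `AtomicInteger` for the shared counter and maps stack numbers onto m stacks by taking them modulo m. The class `CounterPoolSketch` and the wrap-around rule are assumptions, since the slide does not say how indices beyond m are handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicInteger;

// Stack-like pool driven by a fetch&inc / fetch&dec counter (illustrative sketch).
class CounterPoolSketch<T> {
    private final AtomicInteger head = new AtomicInteger(0);
    private final Deque<T>[] stacks;

    @SuppressWarnings("unchecked")
    CounterPoolSketch(int m) {
        stacks = (Deque<T>[]) new Deque[m];
        for (int i = 0; i < m; i++) stacks[i] = new ArrayDeque<>();
    }

    // The slide numbers stacks 1..m; wrap any counter value into that range (assumption).
    private Deque<T> stack(int number) {
        return stacks[Math.floorMod(number - 1, stacks.length)];
    }

    void enq(T value) {
        int i = head.getAndIncrement();          // i = F&Inc(head)
        Deque<T> s = stack(i + 1);
        synchronized (s) { s.push(value); }      // push value on Stack[i+1]
    }

    T deq() {
        int i = head.getAndDecrement();          // i = F&Dec(head)
        Deque<T> s = stack(i);
        synchronized (s) { return s.poll(); }    // pop from Stack[i]; null if it is empty
    }
}
```

Run sequentially, `enq("x")`, `enq("y")`, `deq()`, `deq()` returns "y" and then "x". Under concurrency a `deq()` may reach its stack before the matching `enq()` has pushed; a fuller implementation would wait there instead of returning null.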
An Inc/Dec Counter. No duplication: returned values are unique. No omission: largest value <= number of requests. [Figure: P1 (inc), P2 (dec), ..., Pn (dec) access a shared counter that returns 0, 1, 2, 1, 2, 3, 2, ...]
Inc/Dec Counting Tree. inc = follow fetch&complement; dec = follow complement&fetch. [Figure: tree of toggle bits, each holding 0 or 1.]
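The two traversal rules can be tried out sequentially. In the sketch below a token (inc) follows the old bit and then flips it, while an anti-token (dec) flips the bit first and follows the new value, so it retraces the path of the most recent token. The class `IncDecTreeSketch`, the '+'/'-' operation strings, and the even/odd output numbering are assumptions made for the example.

```java
// Sequential sketch of an inc/dec counting tree (illustrative names).
class IncDecTreeSketch {
    private boolean toggle = false;
    private final IncDecTreeSketch[] child;    // null at a leaf

    IncDecTreeSketch(int width) {              // width is assumed to be a power of two
        if (width == 1) {
            child = null;
        } else {
            child = new IncDecTreeSketch[] {
                new IncDecTreeSketch(width / 2),
                new IncDecTreeSketch(width / 2)
            };
        }
    }

    // Token: fetch&complement (follow the old bit, then flip it).
    int inc() {
        if (child == null) return 0;
        int bit = toggle ? 1 : 0;
        toggle = !toggle;
        return 2 * child[bit].inc() + bit;
    }

    // Anti-token: complement&fetch (flip the bit, then follow the new value).
    int dec() {
        if (child == null) return 0;
        toggle = !toggle;
        int bit = toggle ? 1 : 0;
        return 2 * child[bit].dec() + bit;
    }

    // Applies '+' (inc) and '-' (dec) operations; returns tokens minus anti-tokens per leaf.
    static int[] net(int width, String ops) {
        IncDecTreeSketch tree = new IncDecTreeSketch(width);
        int[] net = new int[width];
        for (char op : ops.toCharArray()) {
            if (op == '+') net[tree.inc()]++; else net[tree.dec()]--;
        }
        return net;
    }

    // Gap step property: 0 <= net[i] - net[j] <= 1 for all i < j.
    static boolean hasGapStepProperty(int[] net) {
        for (int i = 0; i < net.length; i++) {
            for (int j = i + 1; j < net.length; j++) {
                int d = net[i] - net[j];
                if (d < 0 || d > 1) return false;
            }
        }
        return true;
    }
}
```

For instance, `net(4, "+++-")` gives 1, 1, 0, 0: the anti-token cancels the third token on the leaf that token reached.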
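Inc/Dec Diffracting Tree. Gap balancer: $0 \le (y_0^+ - y_0^-) - (y_1^+ - y_1^-) \le 1$, writing $y_i^+$ and $y_i^-$ for the numbers of tokens and anti-tokens that have left on output $i$. [Figure: a single balancer b.]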
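Inc/Dec Counting Tree. By Lemma 1, the tree has the gap step property: $0 \le (y_i^+ - y_i^-) - (y_j^+ - y_j^-) \le 1$ for any $i < j$, writing $y_i^+$ and $y_i^-$ for the numbers of tokens and anti-tokens that have left on output $i$. [Figure: tree of balancers b.]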
Inc/Dec Balancing. Lemma: a diffracting balancer is a gap balancer. What about a colliding token and anti-token? [Figure: prism with entries 1..k in front of a 0/1 toggle accessed by F&C and C&F.]
An Elimination Tree. An elimination balancer: a colliding token and anti-token exchange values. Lemma: a counting tree of elimination balancers has the gap step property. [Figure: Enq(x) and Deq() collide in the prism of an E-balancer in front of a stack; the Deq() receives x and the Enq(x) receives ok.]
Why Elimination Works. Lemma: a counting tree of elimination balancers has the gap step property. Proof: ... an execution in which they never diffract and toggle the bits one after the other, and finally "eliminate" at a leaf, is equivalent to ... [Figure: two trees of 0/1 toggle bits shown as equivalent.]
Shared Memory Implementation. Elimination: exchange values by reading/writing the location array. [Figure: an increment enters the tree of diffracting balancers B1, B2, B3 (root prism with k entries, child prisms with k/2); the location array (one per tree, entries 1..n) records the balancer B1, B2 or B3 at which each process currently is.]
Performance. [Figure: two panels, Throughput and Latency, each plotted against P = concurrency (0 to 300) for Etree, Dtree, Ctree and MCS.]
Why is Elimination Fast? Diffracting tree latency = (P/w log w) f(w,P), where log w < f(w,P) < sqrt w. For n = 256 processors: depth(E/Dtree[32]) = 5, whereas depth(Ctree[n]) = 8, x 2 traversals = 16.
Why is Elimination Fast? (2)
At high load, the percent eliminated per level:
Level      16 procs   256 procs
Level 0    44.7%      49.8%
Level 1    24%        49.1%
Level 2    5.8%       45.2%
Level 3    1.9%       32.9%
Level 4    0%         6.8%
Expected # of balancers traversed (including stacks): 3.14 for n=16 and 2.1 for n=256!
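The expected traversal counts can be roughly reproduced from the per-level percentages. The sketch below assumes that each percentage is the fraction of tokens arriving at that level that are eliminated there, that survivors of level 4 also visit a stack at the leaf, and that the column with the higher elimination rates belongs to n=256. The class `EliminationDepth` and its method are hypothetical.

```java
// Expected number of nodes visited, given per-level elimination fractions (illustrative sketch).
class EliminationDepth {
    // eliminated[d] = fraction of tokens arriving at level d that are eliminated there (assumed reading).
    static double expectedNodesTraversed(double[] eliminated) {
        double reach = 1.0;      // probability of reaching the current level
        double expected = 0.0;
        for (double e : eliminated) {
            expected += reach;   // every arriving token visits this level's balancer
            reach *= (1.0 - e);  // only non-eliminated tokens continue downward
        }
        return expected + reach; // survivors also visit the stack at the leaf
    }

    public static void main(String[] args) {
        double[] n16  = { 0.447, 0.240, 0.058, 0.019, 0.000 };
        double[] n256 = { 0.498, 0.491, 0.452, 0.329, 0.068 };
        System.out.printf("n=16:  %.2f%n", expectedNodesTraversed(n16));
        System.out.printf("n=256: %.2f%n", expectedNodesTraversed(n256));
    }
}
```

Under these assumptions the computation gives about 3.15 for n=16 and about 2.08 for n=256, close to the 3.14 and 2.1 quoted on the slide.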