Diffracting Trees and Layout

Diffracting Trees and Layout. Multiprocessor Synchronization, Nir Shavit, Spring 2003. Recall the theorem: there exists a counting network of Θ(log w) depth. Unfortunately, it is non-constructive, with constants in the 1000s. Can we get log depth in practice?


Presentation Transcript


  1. Diffracting Trees and Layout. Multiprocessor Synchronization, Nir Shavit, Spring 2003.

  2. Recall. Theorem: there exists a counting network of Θ(log w) depth. Unfortunately, it is non-constructive, with constants in the 1000s. Can we get log depth in practice? M. Herlihy & N. Shavit (c) 2003

  3. Yes: Diffracting Trees. [Figure: numbered tokens 1, 2, 3 routed through a tree of balancers, each marked b.]

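  4. Counting Trees. A balancer with inputs $x_0, x_1$ and outputs $y_0, y_1$ satisfies $y_0 = \lceil (x_0 + x_1)/2 \rceil$ and $y_1 = \lfloor (x_0 + x_1)/2 \rfloor$. [Figure: tokens 1 to 5 enter a balancer b; tokens 1, 3, 5 leave on the top output and tokens 2, 4 on the bottom output.] A counting tree is a binary tree of balancers b. It has the step property in any quiescent state: $0 \le y_i - y_j \le 1$ for any $i < j$.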

  5. A Counting Tree. inc = follow fetch&complement of the toggle bits from the root to a leaf. [Figure: tokens pass through three balancers b; the outputs have the step property.]

  6. General Tree Construction. Inductive construction: T[2k] is a root balancer b whose two outputs feed two copies of T[k]; one copy supplies the k even-numbered outputs of T[2k] and the other supplies the k odd-numbered outputs.

  7. Why it Counts. Lemma 1: T[2k] has the step property in a quiescent state. Here T[2k] is a balancer b whose output y0 feeds T0[k] (the k even outputs) and whose output y1 feeds T1[k] (the k odd outputs).
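
The construction on the two slides above can be made concrete with a small sketch. It assumes the tree width w is a power of two, stores the toggle bits in heap order (node n has children 2n and 2n+1), and lets leaf i hand out the values i, i+w, i+2w, and so on. The class and method names (`CountingTreeSketch`, `fetchAndInc`, `exitedOn`) are illustrative and do not come from the slides.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a counting tree T[w] built from toggle-bit balancers (names are illustrative). */
final class CountingTreeSketch {
    private final int width;                   // w, assumed to be a power of two
    private final AtomicBoolean[] toggles;     // heap layout: root is node 1, children of n are 2n and 2n+1
    private final AtomicInteger[] leafCounts;  // number of tokens that exited on each output

    CountingTreeSketch(int width) {
        if (width < 1 || Integer.bitCount(width) != 1) {
            throw new IllegalArgumentException("width must be a power of two");
        }
        this.width = width;
        toggles = new AtomicBoolean[width];    // only nodes 1 .. width-1 are used
        leafCounts = new AtomicInteger[width];
        for (int i = 0; i < width; i++) {
            toggles[i] = new AtomicBoolean(false);
            leafCounts[i] = new AtomicInteger(0);
        }
    }

    /** Atomically flips the bit and returns its previous value. */
    static boolean fetchAndComplement(AtomicBoolean bit) {
        boolean old;
        do {
            old = bit.get();
        } while (!bit.compareAndSet(old, !old));
        return old;
    }

    /** inc = follow fetch&complement of the toggle bits, then take a value at the leaf. */
    int fetchAndInc() {
        int node = 1;
        int output = 0;
        for (int bitValue = 1; bitValue < width; bitValue <<= 1) {
            int bit = fetchAndComplement(toggles[node]) ? 1 : 0;
            output += bit * bitValue;  // the root decides parity: T0 feeds even outputs, T1 odd ones
            node = 2 * node + bit;
        }
        return output + width * leafCounts[output].getAndIncrement();
    }

    /** Number of tokens that have exited on the given output so far. */
    int exitedOn(int output) {
        return leafCounts[output].get();
    }

    public static void main(String[] args) {
        CountingTreeSketch tree = new CountingTreeSketch(4);
        for (int k = 0; k < 10; k++) {
            System.out.println(tree.fetchAndInc());
        }
    }
}
```

Run by a single thread, successive calls return consecutive integers, and the per-output counts read through `exitedOn` satisfy the step property after every call.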

  8. Back to Square One. Problem: the toggle bit in each balancer. [Figure: tokens 1, 2, 3 routed through a tree in which every balancer b holds a 0/1 toggle bit.]

  9. Diffraction. If an even number of processes pass a balancer, the toggle bit remains unchanged! [Figure: a prism array placed in front of the 0/1 toggle bit.]

  10. A Diffracting Tree. [Figure: a tree of diffracting balancers (Diff-Bal) B1, B2 and B3; each has a prism in front of its 0/1 toggle bit, with k prism slots at B1 and k/2 slots at B2 and B3.] Lemma: A diffracting balancer is a balancer.

  11. The Benefits. High load: lots of diffraction + few toggles. Low load: low diffraction + few toggles. Result: high throughput with low contention.

  12. Shared-Memory: Diffraction Part 1. Select a prism index m uniformly at random and swap(prism[m], i). [Figure: processes Pi and Pj meet at prism slot m of a balancer with a 0/1 toggle bit and north and south outputs; the location array (one per tree), indexed 1 to n, holds b in entries i and j.]

  13. The Diffracting Balancer

      public class balancer {
        public int size;                 // number of prism slots
        public RMWRegister[] prism;      // prism[size]
        public int spin;
        public MCSLock lock;
        public boolean toggle;
        public balancer[] next;          // next[0] = north, next[1] = south
        public boolean isLeaf;
        public synchronized boolean flip() {  // returns the old toggle value
          boolean result = toggle;
          toggle = !toggle;
          return result;
        }
      }
      // one entry per process, initially EMPTY
      public RMWRegister[] location = new RMWRegister[NUMPROCS];

  14. Diffracting Balancer Part 1

      public balancer traverse() {
        int mypid = Thread.myIndex();             // get thread and balancer ids
        balancer b = this;
        if (b.isLeaf) return this;                // if leaf, done
        location[mypid] = b;                      // initialize location entry
        int place = random(1, b.size);            // random prism entry
        int him = b.prism[place].SWAP(mypid);     // swap own id in, get the other's
        if (not_empty(him)) {
          if (location[mypid].CAS(b, EMPTY)) {    // try to unset both entries: your own...
            if (location[him].CAS(b, EMPTY)) {    // ...and his, if he is at the same balancer
              return b.next[0].traverse();        // on success go north
            } else {
              location[mypid] = b;                // own succeeded but other failed: reset, go to part 2
            }
          } else {
            return b.next[1].traverse();          // failed even your own: the other succeeded, go south
          }
        }
        // not diffracted: the method continues in Part 2

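  15. Diffraction Part 2. After selecting an index uniformly at random, Pi waits s steps; if it has not been diffracted by then, it follows the toggle. [Figure: Pi at its prism slot in a balancer with a 0/1 toggle bit and north and south outputs; Pi's location entry holds b.]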

  16. Diffracting Balancer Part 2

        while (true) {
          for (int j = 0; j < b.spin; j++) {      // spin fixed amount waiting to be diffracted by other
            if (location[mypid] != b) {
              return b.next[1].traverse();        // if diffracted go south
            }
          }
          if (b.lock.acquire()) {                 // otherwise grab lock
            if (location[mypid].CAS(b, EMPTY)) {  // if still not diffracted, atomically reset location entry and toggle
              int i = b.flip() ? 1 : 0;
              b.lock.release();                   // release lock and follow bit
              return b.next[i].traverse();
            } else {                              // otherwise you were diffracted
              b.lock.release();                   // release lock and go south
              return b.next[1].traverse();
            }
          }
        }
      }
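
For readers who want to run the two parts together, here is a sketch of a single diffracting balancer using `java.util.concurrent`. It is not the authors' implementation: `AtomicIntegerArray` and `AtomicReferenceArray` stand in for the RMWRegister arrays, a `ReentrantLock` with `tryLock` stands in for the MCS lock, `null` plays the role of EMPTY in the location array, and the method returns 0 (north) or 1 (south) instead of recursing into the next balancer. The helper `northMinusSouth` and all names are made up for the example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of one diffracting balancer; substitutes standard library classes for the slide's types. */
final class DiffractingBalancerSketch {
    static final int EMPTY = -1;                       // marks an unused prism slot
    private final AtomicIntegerArray prism;            // last process id swapped into each slot
    private final AtomicReferenceArray<DiffractingBalancerSketch> location; // shared, one entry per process
    private final ReentrantLock lock = new ReentrantLock(); // stand-in for the MCS lock
    private final int spin;
    private boolean toggle;                            // guarded by lock

    DiffractingBalancerSketch(int prismSize, int spin,
            AtomicReferenceArray<DiffractingBalancerSketch> location) {
        this.prism = new AtomicIntegerArray(prismSize);
        for (int i = 0; i < prismSize; i++) prism.set(i, EMPTY);
        this.spin = spin;
        this.location = location;
    }

    /** Returns 0 (north) or 1 (south) for the token of process pid. */
    int traverse(int pid) {
        location.set(pid, this);                                  // announce that pid is at this balancer
        int place = ThreadLocalRandom.current().nextInt(prism.length());
        int him = prism.getAndSet(place, pid);                    // swap own id in, get the other's
        if (him != EMPTY) {
            if (location.compareAndSet(pid, this, null)) {        // unset own entry first
                if (location.compareAndSet(him, this, null)) {
                    return 0;                                     // diffracted him: go north
                }
                location.set(pid, this);                          // he was not here: re-announce, fall through
            } else {
                return 1;                                         // somebody diffracted us: go south
            }
        }
        while (true) {
            for (int j = 0; j < spin; j++) {                      // wait to be diffracted
                if (location.get(pid) != this) return 1;
            }
            if (lock.tryLock()) {
                try {
                    if (location.compareAndSet(pid, this, null)) {
                        boolean old = toggle;                     // still alone: use the toggle bit
                        toggle = !toggle;
                        return old ? 1 : 0;
                    }
                    return 1;                                     // diffracted while acquiring the lock
                } finally {
                    lock.unlock();
                }
            }
        }
    }

    /** Runs the given number of threads through one balancer and returns #north minus #south. */
    static int northMinusSouth(int threads, int opsPerThread) {
        AtomicReferenceArray<DiffractingBalancerSketch> location = new AtomicReferenceArray<>(threads);
        DiffractingBalancerSketch b = new DiffractingBalancerSketch(4, 32, location);
        AtomicInteger diff = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int pid = t;
            workers[t] = new Thread(() -> {
                for (int k = 0; k < opsPerThread; k++) {
                    if (b.traverse(pid) == 0) diff.incrementAndGet(); else diff.decrementAndGet();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return diff.get();
    }
}
```

Every diffracted pair sends one token north and one south, and the toggle alternates under the lock, so once all threads have finished the north count should exceed the south count by 0 or 1. That is the balancer property claimed by the lemma on slide 10.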

  17. Index Distribution Benchmark

      void indexBench(int iters, int work) {
        int i = 0;
        while (i < iters) {
          i = fetch&inc();                  // take the next index from the shared counter
          Thread.sleep(random() % work);    // simulate local work
        }
      }

  18. Diffracting Tree Throughput

  19. Diffracting Tree Latency

  20. Diffraction Rate. Let $p$ = # concurrent processors and $r = \#\text{diff} / \#\text{togg} = c\,p$, with $c \approx 1/10$ for the top balancer. The fraction toggling is $\#\text{Toggling} / \text{total} = 1/(r+1)$, so # Toggling (in a short period) $= p \cdot \frac{1}{r+1} = \frac{p}{cp+1} < \frac{1}{c} \approx 10$. The number of toggling processes stays bounded by a constant as concurrency increases.
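
A few lines of arithmetic make the bound tangible. The sketch assumes the slide's relation reads as follows: the diffraction-to-toggle ratio is r = c·p, so about p / (c·p + 1) of the p concurrent processes toggle, with c taken as 1/10 for the top balancer. The class and method names are illustrative.

```java
/** Evaluates the assumed estimate p / (c*p + 1) for the number of toggling processes. */
final class TogglingEstimate {
    /** Number of toggling processes out of p when #diff / #togg = c * p. */
    static double toggling(int p, double c) {
        return p / (c * p + 1.0);
    }

    public static void main(String[] args) {
        double c = 0.1; // value quoted for the top balancer
        for (int p : new int[] {16, 64, 256, 1024}) {
            System.out.printf("p=%d toggling=%.2f bound=%.1f%n", p, toggling(p, c), 1.0 / c);
        }
    }
}
```

Under this reading the estimate grows with p but never reaches 1/c, which is the sense in which the number of togglers stays roughly constant.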

  21. Diffracting Tree by Size

  22. Latency (Work = 1000)

  23. Message Passing Implementation

  24. Locality vs. Bandwidth. [Figure: processors P1 to P16 connected by four interconnects (single-wire torus mesh, mesh with 5x5 X-bar switches, NxN X-bar, butterfly), arranged along axes running from low to high locality and from low to high bandwidth.]

  25. Effects of Placement on Throughput. Trees placed on a mesh with single-wire switches. [Chart; one curve is labeled "Optimally Placed Combining Tree".]

  26. Message Passing Pool Benchmark. Throughput on a single-wire mesh. [Charts: Work=100 and Work=1000.]

  27. Latency on Mesh With Single Wire Switches. [Charts: Work=100 and Work=500.]

  28. Butterfly (Work=0). No locality to help the combining tree.

  29. Full X-Bar (Work=0). Again, no locality to help the combining tree.

  30. Mesh X-Bar Switches (Work=0). More locality, which brings them closer in performance.

  31. DTree & CNet: Locality vs. Bandwidth. [Figure: the four interconnects (single-wire torus mesh, mesh with 5x5 X-bar switches, NxN X-bar, butterfly) on the axes from low to high locality and low to high bandwidth, annotated with tree and network widths. DTree labels: W=16, W=32, W=16, W=8. CNet labels: W=32, W=32, W=32, W=16.]

  32. Re-think Pools. Queue = FIFO pool; Stack = LIFO pool. [Figure: processes P1 to Pn concurrently issue Enq(x), Enq(y), Enq(z) and Deq() operations on a shared pool.]

  33. Pool Using Counter (Stack-like). Enq(value) :: i = F&Inc(head); push value on Stack_{i+1}. Deq() :: i = F&Dec(head); pop value from Stack_i. [Figure: processes P1 to Pn call Enq(x), Enq(y) and Deq(y) through a counter C (head) that selects among Stack_1 to Stack_m.]
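
The Enq/Deq rule above can be tried out with a short sketch. It uses an `AtomicInteger` as the shared counter in place of a counting tree, uses 0-based stack indices (so Enq pushes on stack i and Deq pops from stack i-1), and wraps indices modulo the number of stacks. A dequeuer that arrives before its matching enqueuer blocks in `takeFirst`. The wrap-around, the blocking behaviour and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of a stack-like pool driven by a shared inc/dec counter (names are illustrative). */
final class CounterPoolSketch<T> {
    private final AtomicInteger head = new AtomicInteger(0);  // stand-in for the counting tree
    private final List<LinkedBlockingDeque<T>> stacks = new ArrayList<>();

    CounterPoolSketch(int numberOfStacks) {
        for (int s = 0; s < numberOfStacks; s++) {
            stacks.add(new LinkedBlockingDeque<>());
        }
    }

    /** Enq(value): i = F&Inc(head), then push on the stack selected by i. */
    void enq(T value) {
        int i = head.getAndIncrement();
        stacks.get(Math.floorMod(i, stacks.size())).addFirst(value);
    }

    /** Deq(): i = F&Dec(head), then pop from the stack the matching Enq used. */
    T deq() {
        int i = head.getAndDecrement();
        try {
            // Blocks until the matching enqueue has pushed its value.
            return stacks.get(Math.floorMod(i - 1, stacks.size())).takeFirst();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an element", e);
        }
    }

    public static void main(String[] args) {
        CounterPoolSketch<String> pool = new CounterPoolSketch<>(4);
        pool.enq("x");
        pool.enq("y");
        System.out.println(pool.deq());
        System.out.println(pool.deq());
    }
}
```

With a single thread the pool behaves like a stack: the most recently enqueued value comes out first, even after the counter has wrapped around all the stacks.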

  34. An Inc/Dec Counter. [Figure: processes P1 to Pn issue inc and dec operations on a shared counter that takes the values 0, 1, 2, 1, 2, 3, 2, ...] No duplication: returned values are unique. No omission: largest value <= number of requests.

  35. Inc/Dec Counting Tree. inc = follow fetch&complement; dec = follow complement&fetch. [Figure: a tree of balancers showing their 0/1 toggle bits.]
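
A sketch of the two traversal rules shows why an anti-token retraces the path of the last token. It assumes a power-of-two width and heap-ordered toggle bits. The return-value convention is chosen for the example only: `inc` returns the value it takes at the leaf and `dec` returns the value it gives back. All names are illustrative.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of an inc/dec counting tree: tokens fetch&complement, anti-tokens complement&fetch. */
final class IncDecTreeSketch {
    private final int width;                   // assumed to be a power of two
    private final AtomicBoolean[] toggles;     // heap layout: root is node 1
    private final AtomicInteger[] leafCounts;  // tokens minus anti-tokens per output

    IncDecTreeSketch(int width) {
        if (width < 1 || Integer.bitCount(width) != 1) {
            throw new IllegalArgumentException("width must be a power of two");
        }
        this.width = width;
        toggles = new AtomicBoolean[width];
        leafCounts = new AtomicInteger[width];
        for (int i = 0; i < width; i++) {
            toggles[i] = new AtomicBoolean(false);
            leafCounts[i] = new AtomicInteger(0);
        }
    }

    /** Flips the bit atomically; returns the new value for anti-tokens, the old one for tokens. */
    private static boolean flip(AtomicBoolean bit, boolean returnNew) {
        boolean old;
        do {
            old = bit.get();
        } while (!bit.compareAndSet(old, !old));
        return returnNew ? !old : old;
    }

    /** Walks from the root to a leaf and returns the output index reached. */
    private int route(boolean antiToken) {
        int node = 1;
        int output = 0;
        for (int bitValue = 1; bitValue < width; bitValue <<= 1) {
            int bit = flip(toggles[node], antiToken) ? 1 : 0;
            output += bit * bitValue;
            node = 2 * node + bit;
        }
        return output;
    }

    int inc() {
        int out = route(false);                                   // fetch&complement at every balancer
        return out + width * leafCounts[out].getAndIncrement();
    }

    int dec() {
        int out = route(true);                                    // complement&fetch at every balancer
        return out + width * leafCounts[out].decrementAndGet();
    }

    /** Tokens minus anti-tokens that have exited on the given output. */
    int net(int output) {
        return leafCounts[output].get();
    }
}
```

In a single-threaded run, a `dec` issued right after an `inc` lands on the same leaf and hands back the value that `inc` took, which is the behaviour the stack-like pool relies on.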

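  36. Inc/Dec Diffracting Tree. Gap balancer b: $0 \le (y_0^{+} - y_0^{-}) - (y_1^{+} - y_1^{-}) \le 1$, writing $y_i^{+}$ and $y_i^{-}$ for the numbers of tokens and anti-tokens that have left on output $i$.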

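  37. Inc/Dec Counting Tree. By Lemma 1, the tree has the gap step property: $0 \le (y_i^{+} - y_i^{-}) - (y_j^{+} - y_j^{-}) \le 1$ for any $i < j$, writing $y_i^{+}$ and $y_i^{-}$ for the numbers of tokens and anti-tokens that have left on output $i$. [Figure: a tree of seven balancers b.]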

  38. Inc/Dec Balancing. [Figure: a diffracting balancer with prism slots 1 to k in front of a 0/1 toggle bit accessed with F&C and C&F.] Lemma: A diffracting balancer is a gap balancer. What about a colliding token and anti-token?

  39. An Elimination Tree. An elimination balancer: a colliding token and anti-token exchange values. [Figure: Enq(x) and Deq() collide at an E-balancer (prism slots 1 to k, F&C/C&F on a 0/1 toggle bit) placed in front of a stack; Deq() receives x and Enq(x) receives ok.] Lemma: A counting tree of elimination balancers has the gap step property.

  40. Why Elimination Works. Lemma: A counting tree of elimination balancers has the gap step property. Proof: ... an execution in which they never diffract and toggle the bits one after the other, and finally "eliminate" at a leaf, is equivalent to ... [Figure: two trees of 0/1 toggle bits shown as equivalent.]

  41. Shared Memory Imp. Increment: [Figure: processes i and j enter a tree of diffracting balancers B1, B2 and B3 (prisms of k, k/2 and k/2 slots, each in front of a 0/1 toggle bit); the location array (one per tree), indexed 1 to n, holds entries such as B2, B1, B1, B3.] Elimination: exchange values by reading/writing the location array.

  42. Performance. [Charts: throughput and latency versus P = concurrency (0 to 300) for Etree, Dtree, Ctree and MCS.]

  43. Why is Elimination Fast? latency = diffracting = (P/w log w) f(w,P), where log w < f(w,P) < sqrt w. For n = 256 processors: depth(E/Dtree[32]) = 5; depth(Ctree[n]) = 8 x 2 traversals = 16.

  44. Why is Elimination Fast? (2) At high load, the percent eliminated per level:

      Procs.     256      16
      Level 0    44.7%    49.8%
      Level 1    24%      49.1%
      Level 2    5.8%     45.2%
      Level 3    1.9%     32.9%
      Level 4    0%       6.8%

      Expected # of balancers traversed (including stacks): 3.14 for n=16 and 2.1 for n=256!
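
The expected traversal counts can be approximately reproduced from the table under two assumptions that the slide does not state: each percentage is the share eliminated among the tokens that reach that level, and a token that survives all five levels visits one more node, the stack at the leaf. The sketch below names the two columns by their first entry rather than by processor count. Under these assumptions the column starting at 44.7% gives about 3.15 and the column starting at 49.8% gives about 2.08. That matches the slide's figures for n=16 and n=256 respectively, so the column order in this transcript may have been swapped during extraction.

```java
/** Expected number of tree nodes visited, given per-level elimination percentages (assumed model). */
final class EliminationTraversalEstimate {
    // Per-level percentages as they appear in the slide's table, named by their first entry.
    static final double[] COLUMN_44_7 = {44.7, 24.0, 5.8, 1.9, 0.0};
    static final double[] COLUMN_49_8 = {49.8, 49.1, 45.2, 32.9, 6.8};

    static double expectedNodesVisited(double[] percentEliminated) {
        double reach = 1.0;      // probability that a token reaches the current level
        double expected = 0.0;
        for (double pct : percentEliminated) {
            expected += reach;               // the balancer at this level is visited
            reach *= 1.0 - pct / 100.0;      // fraction that is not eliminated here
        }
        return expected + reach;             // survivors also visit the stack at the leaf
    }

    public static void main(String[] args) {
        System.out.printf("column starting 44.7%%: %.3f%n", expectedNodesVisited(COLUMN_44_7));
        System.out.printf("column starting 49.8%%: %.3f%n", expectedNodesVisited(COLUMN_49_8));
    }
}
```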
