This talk presents scoped fences (S-Fence), which let programmers specify the scope of memory operations that a fence must order, reducing the number of ordering constraints imposed in multiprocessors. It also covers the compiler and hardware support needed to implement scoped fences.
Fence Scoping Changhui Lin†, Vijay Nagarajan*, Rajiv Gupta† † University of California, Riverside * University of Edinburgh
Reordering in Uniprocessors • Memory operations are reordered to improve performance • Hardware (e.g., store buffer, reorder buffer) • Compiler (e.g., code motion, caching a value in a register) • No harm as long as dependences are respected
Program order:            a1: St x    a2: Ld y
Possible execution order: a2: Ld y    a1: St x
Reordering in Multiprocessors • Counter-intuitive program behavior
Initially x = y = 0
P1                P2
a1: x = 1;        b1: Ry = y;
a2: y = 1;        b2: Rx = x;
Intuitively, y = 1 implies x = 1, so Ry = 1 implies Rx = 1.
Outcomes shown on the slide: (Rx=0, Ry=0), (Rx=0, Ry=1), (Rx=1, Ry=0), (Rx=1, Ry=1). No interleaving that keeps each processor's program order produces (Rx=0, Ry=1); that outcome can appear only when memory operations are reordered.
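The intuition on this slide can be checked mechanically. The following is a minimal C++ sketch, not part of the original talk, that enumerates every interleaving of a1, a2, b1, b2 that keeps each processor's program order and collects the resulting (Rx, Ry) pairs. The names `Outcome`, `sc_outcomes`, and `check_intuition` are illustrative assumptions.

```cpp
#include <cassert>
#include <set>
#include <utility>

// One outcome of the example: the values read into (Rx, Ry).
using Outcome = std::pair<int, int>;

// Enumerate all interleavings of
//   P1: a1: x = 1;   a2: y = 1;
//   P2: b1: Ry = y;  b2: Rx = x;
// in which each processor's operations stay in program order.
std::set<Outcome> sc_outcomes() {
    std::set<Outcome> results;
    // A 4-bit mask assigns each of the 4 time slots to P1 (bit set) or P2.
    for (unsigned mask = 0; mask < 16; ++mask) {
        int p1_slots = 0;
        for (int i = 0; i < 4; ++i) p1_slots += (mask >> i) & 1u;
        if (p1_slots != 2) continue;  // each processor runs exactly 2 ops

        int x = 0, y = 0, rx = -1, ry = -1;
        int p1_step = 0, p2_step = 0;
        for (int slot = 0; slot < 4; ++slot) {
            if ((mask >> slot) & 1u) {
                if (p1_step++ == 0) x = 1; else y = 1;    // a1 then a2
            } else {
                if (p2_step++ == 0) ry = y; else rx = x;  // b1 then b2
            }
        }
        results.insert(Outcome{rx, ry});
    }
    return results;
}

// Usage: the counter-intuitive outcome never appears without reordering.
void check_intuition() {
    assert(sc_outcomes().count(Outcome{0, 1}) == 0);
}
```

Only three outcomes survive the enumeration, so observing (Rx=0, Ry=1) on real hardware indicates that the stores or the loads were reordered.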
Reordering in Multiprocessors • Counter-intuitive program behavior
Initially p = NULL, flag = false
P1                  P2
p = new A(…)        if (flag)
flag = true;            a = p->var;
flag is supposed to be set after p is allocated
Fence Instructions • Memory Consistency Models • Specify what reordering is allowed • e.g., SC, TSO (x86, SPARC), RMO (ARM, PowerPC) • Fence Instructions (Fences/Memory barriers) • Selectively override default relaxed memory order • Order memory operations before and after the fence
P1
p = new A(…)
FENCE
flag = true;
Fence Instructions • Inevitable: needed to build concurrent implementations (e.g., mutual exclusion, queues) [Attiya et al., POPL’11] • Expensive: Cilk-5’s THE protocol spends 50% of its time executing a memory fence [Frigo et al., PLDI’98]
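The P1 pattern on the Fence Instructions slide (allocate, FENCE, set flag) can be sketched in portable C++ as follows. This is an illustration under assumptions, not the authors' code: it maps FENCE onto `std::atomic_thread_fence`, and it adds an acquire fence on the reader side because the C++ memory model requires one; the slide only shows the writer. The names `producer` and `consumer` are hypothetical.

```cpp
#include <atomic>
#include <cassert>

struct A { int var; };

std::atomic<A*> p{nullptr};
std::atomic<bool> flag{false};

// P1: allocate the object, then raise the flag.
// The release fence keeps the store to p (and the initialization of *p)
// ordered before the store to flag.
void producer(int value) {
    p.store(new A{value}, std::memory_order_relaxed);  // leaked for brevity
    std::atomic_thread_fence(std::memory_order_release);  // FENCE
    flag.store(true, std::memory_order_relaxed);
}

// P2: read p->var only after observing the flag.
// Returns false if the flag has not been set yet.
bool consumer(int& out) {
    if (!flag.load(std::memory_order_relaxed)) return false;
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with P1's fence
    A* obj = p.load(std::memory_order_relaxed);
    assert(obj != nullptr);  // guaranteed once the flag is seen
    out = obj->var;
    return true;
}
```

Without the fences, the relaxed stores in `producer` could become visible in either order, which is the failure described on the preceding slide.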
Motivation • Not all memory orderings enforced by fences are necessary • A fence is usually meant to order only some specific memory operations • Programmers know best how a fence is used, and this knowledge can be conveyed to the hardware
[Figure labels: Control Data Access, Concurrent algorithm, Process Data]
Scoped Fence (S-Fence) • An S-Fence only orders memory operations within its scope • Scope definition (Class scope, Set scope) • Bridges the gap between programmers’ intention and hardware execution • Programmers specify the scope • Scope information is conveyed to hardware, imposing fewer ordering constraints • Lightweight hardware and compiler support
Scoped Fence (S-Fence) • Programming support
S-FENCE                            global scope
S-FENCE[class]                     class scope
S-FENCE[set, {var1, var2, …}]      set scope
Work-Stealing Queue Algorithm
Chase-Lev lock-free concurrent work-stealing queue

TASK take() {
    tail = TAIL - 1;
    TAIL = tail;
    FENCE // store-load
    head = HEAD;
    if (tail < head) {
        TAIL = head;
        return EMPTY;
    }
    … …
    return task;
}

void put(TASK task) {
    tail = TAIL;
    wsq[tail] = task;
    FENCE // store-store
    TAIL = tail + 1;
}

TASK steal() {
    head = HEAD;
    tail = TAIL;
    … …
    return task;
}
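For readers who want something compilable, here is a hedged C++ sketch of the queue above. The slide elides parts of `take` and `steal`; the parts filled in here (the compare-and-swap on `HEAD`, the fixed capacity `CAP`, the `EMPTY` sentinel value, and the specific memory orders) are assumptions based on the commonly published form of the Chase-Lev deque, not a reconstruction of the authors' code. The two fences from the slide are marked in comments.

```cpp
#include <atomic>
#include <cassert>

using Task = int;
constexpr Task EMPTY = -1;  // assumed sentinel; also returned on a lost race
constexpr long CAP = 64;    // assumed fixed capacity, no resizing

struct WorkStealingQueue {
    std::atomic<long> HEAD{0};
    std::atomic<long> TAIL{0};
    std::atomic<Task> wsq[CAP];

    // Owner only: push a task at the tail.
    void put(Task task) {
        long tail = TAIL.load(std::memory_order_relaxed);
        assert(tail - HEAD.load(std::memory_order_relaxed) < CAP);
        wsq[tail % CAP].store(task, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // FENCE: store-store
        TAIL.store(tail + 1, std::memory_order_relaxed);
    }

    // Owner only: pop a task from the tail.
    Task take() {
        long tail = TAIL.load(std::memory_order_relaxed) - 1;
        TAIL.store(tail, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE: store-load
        long head = HEAD.load(std::memory_order_relaxed);
        if (tail < head) {  // queue was empty: undo the decrement
            TAIL.store(head, std::memory_order_relaxed);
            return EMPTY;
        }
        Task task = wsq[tail % CAP].load(std::memory_order_relaxed);
        if (tail > head) return task;  // more than one task left, no race
        // Last task: compete with thieves by advancing HEAD.
        if (!HEAD.compare_exchange_strong(head, head + 1,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed))
            task = EMPTY;  // a thief took it first
        TAIL.store(tail + 1, std::memory_order_relaxed);
        return task;
    }

    // Any thread: take a task from the head.
    Task steal() {
        long head = HEAD.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        long tail = TAIL.load(std::memory_order_acquire);
        if (head >= tail) return EMPTY;
        Task task = wsq[head % CAP].load(std::memory_order_relaxed);
        if (!HEAD.compare_exchange_strong(head, head + 1,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed))
            return EMPTY;  // lost the race
        return task;
    }
};
```

Both fences in `put` and `take` only need to order accesses to `HEAD`, `TAIL`, and `wsq`, which is the observation that motivates scoping them.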
Parallel Spanning Tree
[Figure, panels (a) and (b): the worker loop and the memory operations it executes; the figure marks points ①, ②, ③ and labels the fences]
Worker loop:
    task = wsq.take();
    for (each neighbor task’ of task)
        if (task’ is not processed) {
            process(task’);
            wsq.put(task’);
        }
Memory operations:
    tail = TAIL - 1;
    TAIL = tail;
    FENCE
    head = HEAD;
    ……
    color[task’] = label;
    parent[task’] = task;
    tail = TAIL;
    wsq[tail] = task’;
    FENCE
    TAIL = tail + 1;
Class Scope • S-FENCE[class] class scope • Uses classes in object-oriented languages to illustrate the concept • Constrains a fence to the object class in which it is used (encapsulation) • Intuition: member functions operate on the data members of the class
Class Scope • S-FENCE[class] class scope

class B {
    int n1, n2;
    void funcB() {
        n1 = val3;
        S-FENCE2[class]
        n2 = val4;
    }
}

class A {
    B b;
    int m1, m2;
    void funcA() {
        m1 = val1;
        b.funcB();
        S-FENCE1[class]
        m2 = val2;
    }
}

S-FENCE1: m1, m2, n1, n2
S-FENCE2: n1, n2
Class Scope Semantics More details in paper
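Parallel Spanning Tree
[Figure, panels (a) and (b): the worker loop and the memory operations it executes; the figure marks points ①, ②, ③ and labels the fences at ① and ③ as SFENCE[class]]
Worker loop:
    task = wsq.take();
    for (each neighbor task’ of task)
        if (task’ is not processed) {
            process(task’);
            wsq.put(task’);
        }
Memory operations:
    tail = TAIL - 1;
    TAIL = tail;
    FENCE        (labeled SFENCE[class] in the figure)
    head = HEAD;
    ……
    color[task’] = label;
    parent[task’] = task;
    tail = TAIL;
    wsq[tail] = task’;
    FENCE        (labeled SFENCE[class] in the figure)
    TAIL = tail + 1;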
Compiler Support • ISA Extension • class-fence • fs_start – start of a fence scope • fs_end – end of a fence scope • Use fs_start and fs_end to enclose functions that contain fences • This informs the hardware so that it can mark memory operations properly
Hardware Support • Fence Scope Bits (FSB) • Each entry of the ROB and store buffer is associated with FSB • FSB flag whether a memory operation is in the scope of some fence • Decoding: memory operations in the scope are marked via FSB • Fence issue: check the FSB entry for the current scope
[Figure: reorder buffer and store buffer, each entry extended with Fence Scope Bits (FSB)]
Hardware Support • Setting Fence Bits • FSS: stack to record scope
[Figure: instruction stream I0 to I7 in which an outer scope (fs_start a … fs_end a) encloses an inner scope (fs_start b … fs_end b), shown next to FSB columns 0 to 3]
Hardware Support • Issue Fence • A fence is issued by checking the FSB for the current scope
[Figure: the same instruction stream I0 to I7 with nested scopes a (outer) and b (inner) and FSB columns 0 to 3]
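The marking and checking steps can be illustrated with a small software model. This is a simplified sketch under stated assumptions, not the hardware design from the talk: it assumes one FSB bit per nesting level (at most four, matching the four FSB columns in the figure), lets the position in the FSS pick the bit, and never retires operations. The type and function names (`Core`, `MemOp`, `decode`, `fence_waits_on`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// An in-flight memory operation with its fence scope bits (FSB).
struct MemOp {
    std::string name;
    std::uint8_t fsb;
};

struct Core {
    std::vector<char> fss;        // fence scope stack (FSS): open scope ids
    std::vector<MemOp> inflight;  // stands in for ROB / store buffer entries

    void fs_start(char scope_id) {
        assert(fss.size() < 4);   // assumed: four FSB bits
        fss.push_back(scope_id);
    }
    void fs_end() {
        assert(!fss.empty());
        fss.pop_back();
    }

    // Decoding: mark the operation with one bit per currently open scope,
    // so an operation in an inner scope also belongs to the outer scopes.
    void decode(const std::string& name) {
        auto fsb = static_cast<std::uint8_t>((1u << fss.size()) - 1u);
        inflight.push_back(MemOp{name, fsb});
    }

    // Fence issue: a scoped fence waits only for in-flight operations whose
    // FSB has the bit of the current (innermost) scope set. Outside any
    // scope the fence is treated as global and waits for everything.
    std::vector<std::string> fence_waits_on() const {
        std::vector<std::string> pending;
        for (const MemOp& op : inflight) {
            bool global = fss.empty();
            bool in_scope = global || ((op.fsb >> (fss.size() - 1)) & 1u);
            if (in_scope) pending.push_back(op.name);
        }
        return pending;
    }
};
```

Replaying the class-scope example with this model (m1 decoded in scope a, n1 and n2 in nested scope b) makes the inner fence wait only on n1, while the outer fence waits on m1, n1, and n2; an operation decoded before `fs_start` is ignored by both.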
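Why Does S-Fence Perform Better?
[Figure: timeline (steps 0 to 4) for the sequence St A, St X, FENCE, Ld Y, St B, where St A is a cache miss, showing the contents of the store buffer (SB) and ROB. Traditional fence: execution stalls until the store buffer is drained and the fence is issued. Scoped fence: Ld Y and St B are shown in the ROB and store buffer without the repeated stalls.]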
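The difference in stall time can be expressed with a toy timing model. This sketch is an illustration only: the cycle counts are made up, and it assumes that the cache-missing St A lies outside the fence's scope while St X lies inside it, which the slide does not state explicitly. `PendingStore` and `fence_issue_cycle` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A store waiting in the store buffer.
struct PendingStore {
    int completes_at;  // cycle at which the store leaves the store buffer
    bool in_scope;     // whether the fence's scope covers this store
};

// Cycle at which a fence reaching the head of the pipeline at `now` can issue.
// A traditional fence waits for every pending store; a scoped fence waits
// only for the stores flagged as in scope.
int fence_issue_cycle(const std::vector<PendingStore>& store_buffer,
                      int now, bool scoped) {
    int ready = now;
    for (const PendingStore& s : store_buffer) {
        if (!scoped || s.in_scope) ready = std::max(ready, s.completes_at);
    }
    return ready;
}

// Usage with made-up latencies: St A misses in the cache (done at cycle 100,
// assumed out of scope); St X hits (done at cycle 3, assumed in scope).
void example() {
    std::vector<PendingStore> sb = {{100, false}, {3, true}};
    assert(fence_issue_cycle(sb, 2, /*scoped=*/false) == 100);
    assert(fence_issue_cycle(sb, 2, /*scoped=*/true) == 3);
}
```

Under these assumptions the later Ld Y and St B can proceed once St X completes instead of waiting for the cache miss to resolve.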
Set Scope • Dekker algorithm
Initially flag1 = flag2 = 0
P1                        P2
m1 = …                    m2 = …
flag1 = 1;                flag2 = 1;
FENCE                     FENCE
if (flag2 == 0)           if (flag1 == 0)
    critical section          critical section
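The handshake above can be sketched in C++ with a full fence between the store to one flag and the load of the other. This is a minimal illustration, assuming `std::atomic_thread_fence(std::memory_order_seq_cst)` as the stand-in for FENCE; the function names are hypothetical, and the sketch shows only the entry attempt, not a complete mutual-exclusion protocol with retry.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> flag1{0};
std::atomic<int> flag2{0};

// P1: announce intent, fence, then check the other side.
// Returns true if P1 may enter the critical section.
bool p1_try_enter() {
    flag1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE (store-load)
    return flag2.load(std::memory_order_relaxed) == 0;
}

// P2: the symmetric attempt.
bool p2_try_enter() {
    flag2.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // FENCE (store-load)
    return flag1.load(std::memory_order_relaxed) == 0;
}

// Withdraw the announcement (after leaving, or after a failed attempt).
void p1_leave() { flag1.store(0, std::memory_order_release); }
void p2_leave() { flag2.store(0, std::memory_order_release); }
```

The full fence also orders any earlier unrelated accesses, such as the writes to m1 and m2 on the slide, even though only the flag accesses matter for the handshake.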
Set Scope • Dekker algorithm
Initially flag1 = flag2 = 0
P1                        P2
m1 = …                    m2 = …
flag1 = 1;                flag2 = 1;
S-FENCE …                 S-FENCE[set, {flag1, flag2}]
if (flag2 == 0)           if (flag1 == 0)
    critical section          critical section
Set Scope • S-FENCE[set, {var1, var2, …}] set scope • Only orders memory accesses to {var1, var2, …} • Compiler and Hardware Support • Flag memory accesses to the specified variables • Set fence scope bits in hardware for flagged memory accesses • For simplicity, we do not differentiate memory accesses to different sets
Experimental Evaluation • Cycle-accurate simulation (SESC) • Integrate scoped fence logic • RMO memory model • Benchmarks • pst - parallel spanning tree (work-stealing queue, class scope) • ptc – parallel transitive closure (work-stealing queue, class scope) • barnes – from SPLASH2 (fences inserted for SC, set scope) • radiosity – from SPLASH2 (fences inserted for SC, set scope)
Experimental Evaluation
[Figure: Traditional fence (T) vs. Scoped fence (S), grouped into set scope and class scope benchmarks. Annotations on the chart: ~13%, Fence Stall Reduced, ~50%, ~40-50%]
Conclusion • Introduce the concept of fence scope • Propose class scope and set scope • OpenCL 2.0 (sub-group, work-group, device, system) • Lightweight compiler and hardware support • No change in inter-processor communication • Fence scope should be implemented in some form!