Are We Trading Consistency Too Easily? A Case for Sequential Consistency
Madan Musuvathi (Microsoft Research), Dan Marino and Todd Millstein (UCLA), Abhay Singh and Satish Narayanasamy (University of Michigan)
Memory Consistency Model
• Abstracts the program runtime (compiler + hardware)
• Hides compiler transformations
• Hides hardware optimizations, the cache hierarchy, …
• Sequential consistency (SC) [Lamport ’79]: “The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”
Sequential Consistency Explained

int X = F = 0; // F = 1 implies X is initialized

Thread 1: X = 1; F = 1;
Thread 2: t = F; u = X;

The six SC interleavings and their outcomes:

  X = 1; F = 1; t = F; u = X;   →  t=1, u=1
  X = 1; t = F; F = 1; u = X;   →  t=0, u=1
  X = 1; t = F; u = X; F = 1;   →  t=0, u=1
  t = F; X = 1; F = 1; u = X;   →  t=0, u=1
  t = F; X = 1; u = X; F = 1;   →  t=0, u=1
  t = F; u = X; X = 1; F = 1;   →  t=0, u=0

In every SC execution, t=1 implies u=1.
Conventional Wisdom • SC is slow • Disables important compiler optimizations • Disables important hardware optimizations • Relaxed memory models are faster
Conventional Wisdom, Revisited
• SC is slow? Not necessarily:
  • Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. ’91, …, Blundell et al. ’09]
  • Compiler optimizations that break SC provide negligible performance improvement [PLDI ’11]
• Relaxed memory models are faster? Questionable:
  • Fences are needed for correctness
  • Programmers conservatively add more fences than necessary
  • Libraries use the strongest fence necessary for all clients
  • Fence implementations are slow
  • Efficient fence implementations require speculation support
Implementing Sequential Consistency Efficiently

  src: t = X;   →   asm: mov eax, [X]

• SC-preserving compiler (this talk): every SC behavior of the binary is an SC behavior of the source
• SC hardware: every observed runtime behavior is an SC behavior of the binary
Challenge: Important Compiler Optimizations Are Not SC-Preserving
• Example: common subexpression elimination (CSE)
• t, u, v are local variables; X, Y are possibly shared

  L1: t = X*5;        L1: t = X*5;
  L2: u = Y;     →    L2: u = Y;
  L3: v = X*5;        L3: v = t;
Common Subexpression Elimination Is Not SC-Preserving

Init: X = Y = 0;

Original:
  Thread L: L1: t = X*5;  L2: u = Y;  L3: v = X*5;
  Thread M: M1: X = 1;    M2: Y = 1;
  Guarantee: u == 1 implies v == 5

After CSE (L3 becomes v = t):
  Thread L: L1: t = X*5;  L2: u = Y;  L3: v = t;
  Thread M: M1: X = 1;    M2: Y = 1;
  Now possible: u == 1 && v == 0
Implementing CSE in an SC-Preserving Compiler

  L1: t = X*5;        L1: t = X*5;
  L2: u = Y;     →    L2: u = Y;
  L3: v = X*5;        L3: v = t;

• Enable this transformation when X is a local variable, or Y is a local variable
• In these cases, the transformation is SC-preserving
• Identifying local variables: compiler-generated temporaries, and stack-allocated variables whose address is not taken
An SC-Preserving LLVM Compiler for C Programs
• Modify each of the ~70 phases in LLVM to be SC-preserving, without any additional analysis
• Enable trace-preserving optimizations, which do not change the order of memory operations (e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …)
• Enable transformations on local variables
• Enable transformations involving a single shared variable (e.g. t = X; u = X; v = X;  →  t = X; u = t; v = t;)
Average Performance Overhead Is ~2%
[Per-benchmark bar chart omitted]
• Baseline: LLVM -O3
• Experiments on Intel Xeon, 8 cores, 2 threads/core, 6 GB RAM
How Far Can an SC-Preserving Compiler Go?

Source:
  float s, *x, *y; int i;
  s = 0;
  for (i = 0; i < n; i++) {
      s += (x[i] - y[i]) * (x[i] - y[i]);
  }

No optimization (every access recomputes its address and reloads memory):
  s = 0;
  for (i = 0; i < n; i++) {
      s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float)))
         * (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
  }

SC-preserving (strength reduction on local pointers, but the shared loads stay repeated):
  float *px, *py, *e;
  s = 0; py = y; e = &x[n];
  for (px = x; px < e; px++, py++) {
      s += (*px - *py) * (*px - *py);
  }

Full optimization (additionally reuses each load through the local t):
  float *px, *py, *e, t;
  s = 0; py = y; e = &x[n];
  for (px = x; px < e; px++, py++) {
      t = (*px - *py);
      s += t * t;
  }
We Can Reduce the FaceSim Overhead (If We Cheat a Bit)
• 30% of the overhead comes from the inability to perform CSE across the argument list below
• But argument evaluation order in C is unspecified: the specification explicitly allows overlapped evaluation of function arguments

  return MATRIX_3X3<T>(
      x[0]*A.x[0]+x[3]*A.x[1]+x[6]*A.x[2],
      x[1]*A.x[0]+x[4]*A.x[1]+x[7]*A.x[2],
      x[2]*A.x[0]+x[5]*A.x[1]+x[8]*A.x[2],
      x[0]*A.x[3]+x[3]*A.x[4]+x[6]*A.x[5],
      x[1]*A.x[3]+x[4]*A.x[4]+x[7]*A.x[5],
      x[2]*A.x[3]+x[5]*A.x[4]+x[8]*A.x[5],
      x[0]*A.x[6]+x[3]*A.x[7]+x[6]*A.x[8],
      x[1]*A.x[6]+x[4]*A.x[7]+x[7]*A.x[8],
      x[2]*A.x[6]+x[5]*A.x[7]+x[8]*A.x[8]);
Improving Performance of the SC-Preserving Compiler
• Ask programmers to reduce shared accesses in hot loops
• Use sophisticated static analysis to infer more thread-local variables and data-race-free shared variables
• Use program annotations: requires changing the programming language, but minimal annotations suffice to optimize the hot loops
• Perform load optimizations speculatively: the hardware exposes a speculative-load optimization to the software; load optimizations alone reduce the max overhead to 6%
Conclusion • Hardware should support strong memory models • TSO is efficiently implementable [Mark Hill] • Speculation support for SC over TSO is not currently justifiable • Can we quantify the programmability cost for TSO? • Compiler optimizations should preserve the hardware memory model • High-level programming models can abstract TSO/SC • Further enable compiler/hardware optimizations • Improve programmer productivity, testability, and debuggability
Eager-Load Optimizations
• Eagerly perform loads, or reuse values from previous loads or stores

Common subexpression elimination:
  L1: t = X*5;  L2: u = Y;  L3: v = X*5;   →   L1: t = X*5;  L2: u = Y;  L3: v = t;
Constant/copy propagation:
  L1: X = 2;    L2: u = Y;  L3: v = X*5;   →   L1: X = 2;    L2: u = Y;  L3: v = 10;
Loop-invariant code motion:
  L2: for(…)  L3: t = X*5;                 →   L1: u = X*5;  L2: for(…)  L3: t = u;
Performance Overhead with Eager-Load Optimizations
[Per-benchmark bar chart omitted]
Allowing eager-load optimizations alone reduces the max overhead to 6%.
Correctness Criteria for Eager-Load Optimizations
• Eager-load optimizations rely on a variable remaining unmodified in a region of code

  L1: t = X*5;   // establishes the invariant t == 5*X
  L2: *p = q;    // must maintain the invariant t == 5*X
  L3: v = X*5;   // uses the invariant to transform L3 into v = t;

• Sequential validity: no modifications to X by the current thread in L1–L3
• SC-preservation: no modifications to X by any other thread in L1–L3
Speculatively Performing Eager-Load Optimizations

  L1: t = X*5;        L1: t = monitor.load(X, tag) * 5;
  L2: u = Y;     →    L2: u = Y;
  L3: v = X*5;        L3: v = t;
                      C4: if (interference.check(tag))
                      C5:     v = X*5;

• On monitor.load, the hardware starts tracking coherence messages on X’s cache line
• The interference check fails if X’s cache line has been downgraded since the monitor.load
• In our implementation, a single instruction checks interference on up to 32 tags